r/googlecloud

Pub/Sub message ordering with asynchronous processing

Hey everyone,

I am looking for the best approach to maintain message ordering in Cloud Pub/Sub when dealing with mixed processing times.

Currently, I use Pub/Sub with message ordering enabled, but I face a challenge when a message requiring heavy background processing (via Cloud Tasks and Cloud Functions) is sent immediately before a message that requires none.

In my current setup, I only publish to Pub/Sub after the background processing completes, which causes the second "fast" message to be consumed before the first "slow" one, breaking the intended sequence. To solve this, I’m considering publishing all messages instantly, using a "placeholder" for the slow messages and having my push subscription endpoint check a database flag to see if the background task is finished. If not, the endpoint would NACK the message to trigger a retry.

While this "NACK-until-ready" approach preserves the order (since subsequent messages in that ordering key will wait), it introduces latency and overhead from retries, so I’m wondering if there is a more efficient way to handle this dependency without relying on frequent NACKs.
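For anyone curious what I mean concretely: here's a minimal sketch of the push endpoint logic. The field names and the in-memory flag store are placeholders for my real message schema and database check — a push subscriber NACKs simply by returning a non-2xx status, which makes Pub/Sub redeliver and (with an ordering key) hold back later messages on that key.

```python
import base64
import json

# Placeholder for the database flag the background task sets when done;
# in the real setup this would be a Firestore/Cloud SQL lookup.
PROCESSING_DONE = {}

def handle_push(envelope: dict) -> int:
    """Return the HTTP status for a Pub/Sub push delivery.

    Any 2xx acks the message; any other status NACKs it, so Pub/Sub
    retries it and, for an ordering key, waits before delivering the
    messages that follow it on that key.
    """
    msg = envelope["message"]
    data = json.loads(base64.b64decode(msg["data"]))
    if data.get("placeholder"):
        # Slow message: published instantly, but only processable once
        # the Cloud Tasks / Cloud Functions work has finished.
        if not PROCESSING_DONE.get(data["task_id"]):
            return 429  # NACK: background task not finished yet
    # ... normal handling of the (now-ready) message ...
    return 204  # ack
```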

Would love to hear what you think!

reddit.com
u/omry8880 — 15 hours ago

Google Cloud Run asia-south1 stuck with "Project failed to initialize in this region due to quota exceeded" for 4 days — quota resets not helping

Been stuck on this for 4 days and nothing is working.

**The situation:**

Migrated Firebase Cloud Functions from us-central1 to asia-south1.

During deployment, hit the write quota limit (30/min in asia-south1 — yes, it's tiny). Now ALL 41 Cloud Run services in asia-south1 show:

"Routing traffic: Failed. Project failed to initialize in this region due to quota exceeded."

**What makes this weird:**

- Code uploaded successfully every time
- The services EXIST in Cloud Run
- Daily quota reset happens — doesn't fix it
- Even `gcloud run services update-traffic myservice --to-latest --region asia-south1` fails with the same quota error
- `firebase deploy --only functions` says "Skipping unchanged functions" because the code hash didn't change

**What I've tried:**

- Waited for the daily quota reset (midnight Pacific) — same error
- Tried `gcloud run services update-traffic` directly — same error
- Tried forcing a redeploy with a code change — quota error again
- Deleted an unrelated service to free a region slot — same error
- Filed a support case — waiting

**My understanding:**

The services are stuck pointing at failed revisions. Fixing them requires Cloud Run write operations. But those writes are being throttled. So it's a deadlock — can't fix the quota state without quota.

**Questions:**

1. Has anyone recovered from this without Google support intervening?
2. Is there a way to force Cloud Run to serve traffic from an existing revision without using the write quota?
3. How long does the project-level throttle typically last after repeated quota exhaustion?

Project: Firebase Functions v2 (Cloud Run), asia-south1, Node.js 22

Any help appreciated — this is blocking a production app launch.

u/PlanBot_ — 16 hours ago

Architecture Review: API Gateway to Private VM (No VPN) for heavy LLM video workload. Is Cloud Run proxy the best practice?

Hi everyone,

I'm designing a secure architecture for a desktop application and I would love a sanity check from this community, especially regarding networking and cost traps.

Context & Workload:

Client: A desktop executable (Delphi) running on our customers' local machines over the public internet.

Backend: A custom, heavy LLM hosted on our own GCP Compute Engine VM (requires GPUs).

Volume: Processing ~30,000 requests/month containing mixed media (mostly video, plus images/text). Estimated Egress: ~1.8 TB/month.

Hard Constraints (My hands are tied here!):

No Managed Services (Vertex AI, etc.): The team configuring the LLM requires it to run on a dedicated VM, so managed services like Vertex AI are off the table for this project.

No VPN: End-users cannot be forced to use a VPN. It must be a standard HTTPS request from the desktop app.

No Public IP on VM: The security team demands that the LLM VM remains strictly private (no external IP) to protect the expensive GPU compute.

API Key Auth: We need a robust way to validate x-api-key before the traffic hits the internal network, to block unauthorized requests and avoid DDoS on our expensive GPU instances.

Proposed Architecture:

Client sends a POST request (HTTPS/TLS 1.3) with x-api-key in the header.

Google Cloud API Gateway receives the request, validates the API key (blocking invalid ones immediately).

Cloud Run (Reverse Proxy): Since API Gateway cannot route directly to a VPC internal IP, it forwards the valid request to a simple Cloud Run service (just a tiny proxy container).

VPC / VM: The Cloud Run service uses Direct VPC Egress to forward the request to the internal IP of the LLM VM.

Response: The VM processes the video/text and sends the payload back through the same path.
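For reference, the Gateway side of this flow would look roughly like the OpenAPI 2.0 spec below — the title, path, and Cloud Run URL are placeholders, not our real names:

```yaml
swagger: "2.0"
info:
  title: llm-video-api        # placeholder name
  version: "1.0.0"
schemes:
  - https
securityDefinitions:
  api_key:
    type: apiKey
    name: x-api-key           # validated by API Gateway before anything else
    in: header
paths:
  /process:                   # placeholder path
    post:
      operationId: processMedia
      security:
        - api_key: []         # invalid/missing keys rejected at the Gateway
      x-google-backend:
        # Placeholder URL of the tiny Cloud Run proxy service
        address: https://llm-proxy-abc123-uc.a.run.app
        deadline: 300.0       # long deadline for heavy video processing
      responses:
        "200":
          description: Processed result
```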

My specific questions for the experts:

The API Gateway + Cloud Run Bridge: I know using a tiny Cloud Run container as a reverse proxy to reach the VPC is a common workaround for API Gateway's lack of native VPC support. Is this still the recommended best practice, or is there a cleaner/cheaper way that doesn't involve managed LLM APIs?

Load Balancers vs. API Gateway: I considered using an External HTTPS Load Balancer with NEGs instead of the Gateway, but I would lose the out-of-the-box API Key management. Am I missing a way to easily validate API keys at the Load Balancer level without building custom auth logic on the VM itself?

Cost Blindspots: I've estimated the Network Egress (1.8 TB) to be around $216/month (South America), plus the massive cost of the GPU VM running. Are there any hidden networking costs (e.g., inter-zone traffic, Cloud Run egress to VPC) for this volume of video data that I should be aware of?

Any feedback or red flags regarding this specific setup would be highly appreciated! Thanks!

u/Relative-Security-75 — 16 hours ago

Moving from monolith to event-driven microservices on GCP – what 1M+ transactions taught me

I've been building a real-time banking system on GCP that processes 1M+ transactions.

Early on, I started with a monolithic approach. It was simple. It worked.

But as scale increased, problems emerged:

- **Hard to change** – one small fix = full redeploy
- **Slower deployments** – build times kept growing
- **Single point of failure** – one bug crashed everything

So I migrated to event-driven microservices on GCP.

**The new architecture:**

| Component | GCP Service |
|-----------|-------------|
| API Gateway | Cloud Endpoints / Load Balancer |
| Async communication | Cloud Pub/Sub |
| Compute | Cloud Run (auto-scales to zero) |
| Analytics | BigQuery |
| Security | Cloud IAM + Firewall |

**What changed:**

✅ Independent services – each scales separately  
✅ Faster deployments – deploy only what changed  
✅ Resilient – one failure doesn't cascade  
✅ Cost-efficient – no traffic = near-zero cost

**The banking system specifically:**

- FastAPI on Cloud Run (millisecond response)
- Pub/Sub for async transaction processing
- Cloud SQL for ACID compliance
- BigQuery for real-time validation
- 2M+ double-entry ledger records (debit = credit)
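On the double-entry point: the invariant is simply that every transaction's ledger lines balance. A toy illustration — the field names are made up for the example, not my actual schema:

```python
from decimal import Decimal

def is_balanced(lines: list[dict]) -> bool:
    """True if the entry's debits equal its credits exactly."""
    debits = sum(Decimal(l["amount"]) for l in lines if l["side"] == "debit")
    credits = sum(Decimal(l["amount"]) for l in lines if l["side"] == "credit")
    return debits == credits

# A $100 transfer recorded as one debit and one credit
transfer = [
    {"account": "cash",     "side": "debit",  "amount": "100.00"},
    {"account": "deposits", "side": "credit", "amount": "100.00"},
]
```

Amounts are kept as strings and summed with `Decimal` to avoid float rounding — with money, `0.1 + 0.2 != 0.3` is not a bug you want in a ledger.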

**The hardest part?**

Not the tech – the mindset shift. Moving from "the system" to "events and messages" took time.

**Question for this community:**

For those who made similar migrations – what was your biggest unexpected challenge?

And for those still on monoliths – what's holding you back from moving to event-driven?

---

*Note: Not selling anything. Just sharing my experience building on GCP. Happy to answer questions about the architecture.*
u/BeginningOk3270 — 18 hours ago

Associate Data Practitioner certification

Hello everyone

I intend to take the Associate Data Practitioner certification in the next 2-3 weeks. I bought 2 different exam courses from Udemy and it's kinda confusing. One course (60 questions per exam) has in-depth, practical questions on Dataflow and Pub/Sub, and the other one (50 questions per exam) doesn't.

So I'm not sure what exactly to expect. I know the exam is divided into 4 domains. People who have taken it — can you please help me out by specifying what to expect from each domain? It would be of immense help. Thank you!

u/OkRock1009 — 12 hours ago

I signed up for the $300 free trial on Google Cloud for the first time. Please give me suggestions on how to avoid getting charged in the future

I just wanna play with cloud things, so I have no plans to pay. I just wanna learn every concept, that’s all. But after reading many charge stories, I’m kinda scared. I didn’t even create or touch anything yet, so please give suggestions or advice to avoid any horror stories in my life

PS: I didn't upgrade my account to paid, so am I safe for now?

u/sandboxus3r77 — 17 hours ago