u/Deep_Clock_6845

I posted this as a fun note; even Claude agrees it is probably unsustainable. I did the maths myself below.

Hey guys,

At the risk of being downvoted: I do use LLMs a fair bit at work.
I won't go into too much detail, to keep my profile anonymous.

We are a team of 2.5 devs, one entirely focused on AI-agentic pipelines: handling customer tickets, code reviews, etc.

I am not an LLM heavy hitter, but I consumed 1.2B tokens last month.

I suspect that as a team, once our agentic pipelines are running, we might consume anywhere between 8B and 20B tokens a month.

Let's assume that after the dust settles, LLMs charge at least 5/25 ($5 per million input tokens, $25 per million output tokens).

That's the current price of Opus 4.7, and they are burning cash. So 5/25 is very conservative.

Let's also assume that 85% of those tokens are input: code, instructions, comments...

The average price per million tokens would be: 0.85 * 5 + 0.15 * 25 = $8/M.

My current consumption would cost:
1,200M tokens * $8/M = $9,600/month, or ~$115k/year.

As a team, even with optimized pipelines, that's anywhere between $768k/year and $1.92M/year.

Now, let's be nice and assume that 50% of those tokens hit the cache, and for simplicity wipe out their cost entirely.

We are now looking at $384k/y to $960k/y.
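If you want to poke at the numbers yourself, the whole model fits in a few lines. The 5/25 pricing, the 85% input share, and the "cached tokens are free" simplification are all my assumptions from above:

```python
# Back-of-the-envelope cost model using the post's assumptions:
# 5/25 pricing, 85% input tokens, optional free cache hits.
INPUT_PRICE = 5.0     # $ per million input tokens
OUTPUT_PRICE = 25.0   # $ per million output tokens
INPUT_SHARE = 0.85    # assumed share of tokens that are input

# Blended price per million tokens: 0.85*5 + 0.15*25 = $8/M
BLENDED = INPUT_SHARE * INPUT_PRICE + (1 - INPUT_SHARE) * OUTPUT_PRICE

def yearly_cost(tokens_per_month_millions: float, cache_hit_rate: float = 0.0) -> float:
    """Yearly bill in dollars, treating cached tokens as entirely free."""
    billable = tokens_per_month_millions * (1 - cache_hit_rate)
    return billable * BLENDED * 12

print(f"blended $/M: {BLENDED:.2f}")
print(f"me, 1.2B tokens/month:  ${yearly_cost(1_200):,.0f}/y")
print(f"team low, 8B/month:     ${yearly_cost(8_000, cache_hit_rate=0.5):,.0f}/y")
print(f"team high, 20B/month:   ${yearly_cost(20_000, cache_hit_rate=0.5):,.0f}/y")
```

Swap in your own prices and cache-hit rate; the conclusion doesn't move much unless pricing drops an order of magnitude.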

I assume that an average, competent senior-to-lead/principal developer here costs $145k/y, all included.

I am being generous: this is 35% above my current package, and I have decent pay here, well above the median.

A package of $145k/y would attract top tier developers.

I have been positively surprised by the results of our current LLM pipeline.
Even as a skeptic, I have to be fair.

Results are sometimes great, sometimes garbage. But overall, mostly acceptable.
The guy in charge of this AI-transition, my colleague, is someone I look up to.
He's very talented, and I am sure he is squeezing as much juice as he can out of it.
I have worked with him for a few years, in different workplaces.

[EDIT:
Many called me out for the "mostly acceptable".
I'll get called out for it, but I'm being honest here: mostly acceptable is better than what we currently have.

"Mostly acceptable" IS:
- a band-aid fix for a current bug, instead of re-architecting a leaky abstraction.
- hard-to-test code (in an already untested and hard-to-test codebase).
- inelegant and/or inefficient code (to a certain degree; we're talking about a web CRUD app).
- often less than ~50 lines.

"Mostly acceptable" IS NOT:
- flaky code, such as monkey patching a function to "fix" a bug instead of fixing the root cause
- 55 "if" statements in a loop, a la Claude Code,
- code making it to production without being reviewed and manually QA'd.
- adding MORE CSS to our global, 12k lines stylesheet

"Mostly acceptable" is not a dumpster fire, it's just "meh", as in, it doesn't really excite me.
"Mostly acceptable" is the state of all the codebases I've worked in the last 9 years.
They do the job; it's not great everywhere, it's not terrible everywhere, it's average. Maybe it's a web thing, but I've yet to see a 'great' codebase.
(and if you tell me ALL the code you ship at work excites you, let me know when you are hiring...)
]

But those numbers... I am absolutely on my ass.

So even being absolutely forgiving in my calculations (50% cache for free, 5/25 costs)...
This would pay for 3 to 6 top-tier, senior/staff-level engineers, with above-market pay.

This is insane. While it's satisfying to close tickets a lot faster... is it really worth it? My opinion is that it's absolutely not.

Even 3 of those engineers would produce more value overall. Maybe less code-generation velocity, but more value.

And the reality might be 5x worse...

Man, those large LLM providers... they are truly akin to drug dealers, same selling methods!

I can't help but agree with Ed, this is tantamount to fraud.

----

UPDATE 2:
Someone called me out on "not a heavy hitter" and "1.2B tokens".

Here is my consumption.
It's 1.2B tokens, if I understand it properly, which also surprised me. It's an f-ton.

What's wild is that I did not think I use it *that* much compared to the current narrative online?

I rarely have more than one agent in a terminal; if I have two, it's because of our sub-repos madness and it's easier to have two.
When I'm reviewing the code, and testing it, they are not running.

I wonder if the current state of the codebase (sub-repos bonanza) jacks up the token use?
Or is zai inflating the token count?
Or am I not handling it properly?

Who knows?

I won't go into details, but I'm "software engineering" (i.e. not meetings or overhead) 35 hours/week on average. Up to 50 hrs on release week, but that's an absolute max, and that's once a year.

But it raises the question: how are those guys supposedly running 12 agents in concurrent loops, etc.?

https://preview.redd.it/s5mipqj6l9yg1.png?width=4860&format=png&auto=webp&s=ebf722904f0c4e077a28b68586e37f6c3fec0e24

How is it even possible? Not humanly (I don't believe for a single second someone can ingest that amount of code), but financially?

----

UPDATE 3:
Apparently 1.2B does put me in the heavy-user category.
I'm now starting to question my entire life at this stage. Was I an AI bro all along?

Not sure how reliable the numbers are, I don't measure them myself.
I found them on their dashboard today.

I'm not on Twitter, I quit a few months ago because of the AI-fatigue, but I thought everyone was running 77 agents at all times, planning their sleep around rate limits and coding on their phone while driving?

I'm exaggerating, obviously, but not that much. I'm a bit confused to be honest.

------

Update 4: Is it productive?

The 50x is, until proven otherwise, absolutely bullshit.

Anyone claiming wild productivity gains is, in my opinion, either:

  1. Pushing their own agenda
  2. Inexperienced
  3. Lying to themselves

I genuinely can't tell if I'm being more productive or not.

If we define productivity by "time saved": yes, on some tasks, some of the time.

That's because I am selective about where I use it, and because I have a high degree of competency on those tasks.

Ideally, those tasks would be eliminated. In an ideal codebase, I probably wouldn't want to use an LLM, because every task would be "deep".

Overall, not sure. That "time saved" would have to be invested in something other than posting this on Reddit :)

The variance is huge: you may get 5x (at best!) on some narrow "shallow" tasks, and -2x (yes, negative gains) on "deeper" tasks.

Also, to clarify, I'm talking about time spent on the task, i.e. 5x = 1 hour instead of 5h.

I find that a lot of claims are 'measured' as "time spent vs. doing it character by character".
We already had many tools to do what some devs use LLMs for.

Often, those "shallow" tasks could have been accomplished by:

  1. Having a good text editor (emacs, vim...)
  2. Writing a script
  3. Using the right tool (grep, ast-grep, codemods...)
  4. Improving the overall DX/DevOps (good test hygiene, good CI/CD...)
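
To make point 2 concrete, here's the kind of "shallow" task I mean: a mechanical, codebase-wide rename that a small script handles deterministically, zero tokens burned. The function names are made up for illustration:

```python
import re
from pathlib import Path

# Hypothetical mechanical refactor: rename a helper everywhere.
# OLD/NEW are invented names, purely for illustration.
OLD, NEW = "getUserById", "fetchUserById"
PATTERN = re.compile(rf"\b{OLD}\b")  # word boundaries avoid partial matches

def rename_in_tree(root: Path) -> int:
    """Rewrite every .js file under `root`; return how many files changed."""
    changed = 0
    for path in root.rglob("*.js"):
        src = path.read_text()
        out = PATTERN.sub(NEW, src)
        if out != src:
            path.write_text(out)
            changed += 1
    return changed
```

It's instant, it's diffable, and it never hallucinates a second bug into the file.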

You may save 3 hours, but you also prevented yourself from gaining skills in the areas above.
Is it productive?

Those fit in the "inexperienced" category.

They have discovered the terminal through Claude Code, and have never used "jq", "awk", "sed", "grep", "find" and friends...

In my case, I use it to:

  1. Apply large, "mechanical" refactor to the codebase.

We have a fairly messy codebase that requires a lot of TLC.

One such example: converting our codebase to TS (from JS).

Having the agent loop over an area of the code to do the conversion has been quite helpful.

It still requires some TLC around organising the types, but in that sense, the productivity gains were real.

I consider myself an advanced TS user, so I don't get much benefit from doing it manually; 90% of the types made by the LLM are acceptable (the other 10% is the "mostly" in "mostly acceptable").

That task would have been tedious to do by hand.

That time is better spent on the overall new architecture and tooling. Can we share types between the backend and frontend, to create a contract-driven interaction? Can we integrate more checks into our CI/CD to prevent syntax errors? What conventions do we want to adopt regarding our types? Are our types organised in nice aggregates? Can we generate documentation from our types? Can we create a good DX environment for newcomers?

That's a better use of my time than doing the actual conversion. And I can review it: I have done enough of it to review it easily and accurately.

  2. Customer tickets

Because our codebase is what it is, for years we didn't have any linter, tests, and so on.

We do have a fair number of bugs that are simple syntax errors.

I am probably being lazy here, but I find that spinning up the agent with the ticket succeeds in 90% of cases.

I only use it on pre-vetted tickets.
The agentic pipeline we are putting in place runs on all tickets, and so far it's been a-ok (80% success on a small batch of ~5 tickets).

  3. Test generation

I do find the tests generated are mostly acceptable. I often write the test cases and use the agent to implement them. I also use it to generate more 'unhappy' paths, as it sometimes comes up with edge cases that I didn't think of.

Where it's been not so useful, and I don't use it anymore:

  1. Any architectural or product decisions; they're not equipped for it.
  2. Any new feature development (which is product decisions + architecture decisions): I burnt myself with it.
  3. Taking ownership of new code I didn't write; maybe I'm not smart enough, but I find it extremely difficult. It's like watching someone paint and thinking you can paint. It doesn't work for me.
  4. As a general rule, anything that I'm not deeply confident in. As in, anything where I haven't done so many reps that I can tell at a glance if something is off.

Anything that requires me to sit down and really engage my brain is often better done by hand - because it goes at the right pace.

It's also a terrible strategy long term: how does one upskill if they refuse to learn?

I don't know how many lines of code I generate, I don't really care, it's a useless metric - but probably not that many.

I often try to remove more lines than I add - unless absolutely necessary.

Because I use it for "shallow" tasks, I don't try to optimise my usage. I use it like a normie: enter a prompt, give an example, let it cook.

I only let it work on an amount of code that I can review in one sitting (remember that I don't use it for new features, so the reviews are much easier).

Maybe my usage is really high because of the way I use it?

Customer tickets => lots of tokens read.

Refactoring => lots of "loops" of simple tasks.
