u/ElectronicProgram

Hi all, I'm currently running on a 4070 and writing more automations that I'd prefer to run locally rather than rack up API costs, including some stuff for a "personal agent" that reads my email, health data, text messages, etc., all of which I'd prefer to keep local and private.

I've built an agentic system that previously ran via some cloud-based CLI agents and refreshed itself once a day, but I'd like to make something that runs continuously (i.e., an email comes in and gets processed right away via a webhook or a read off a local email client or something).
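Roughly what I mean by the "read off a local email client" option, as a minimal sketch (just `imaplib` polling an inbox; the host, credentials, and handler are all placeholders for whatever your setup actually uses):

```python
import email
import imaplib
import time
from email.header import decode_header

def decode_subject(raw_subject):
    """Decode a possibly MIME-encoded Subject header into plain text."""
    out = []
    for text, charset in decode_header(raw_subject):
        if isinstance(text, bytes):
            out.append(text.decode(charset or "utf-8", errors="replace"))
        else:
            out.append(text)
    return "".join(out)

def poll_inbox(host, user, password, handler, interval=60):
    """Poll an IMAP inbox; hand each unseen message to `handler`."""
    while True:
        with imaplib.IMAP4_SSL(host) as imap:
            imap.login(user, password)
            imap.select("INBOX")
            _, data = imap.search(None, "UNSEEN")
            for num in data[0].split():
                _, msg_data = imap.fetch(num, "(RFC822)")
                msg = email.message_from_bytes(msg_data[0][1])
                # handler would kick off the local agent run
                handler(decode_subject(msg["Subject"] or ""), msg)
        time.sleep(interval)
```

A proper webhook (or IMAP IDLE) would be lower-latency, but a polling loop like this is enough to trigger an agent within a minute of an email landing.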

At some point the CLI-based agents aren't really built for this, and APIs might cost a fortune for more advanced tasks. (I do tool calls for things like synchronizing my Spotify data so an agent can recommend and dynamically build playlists for me, and pulling interesting information from web searches on new albums/bands I might be interested in.)

I know hardware is expensive right now, but it doesn't seem like it's going to go down.

My question is: I see a lot of variety in token speeds for various models on a 5090. I've read posts citing over 100 or 200 t/s while coding with Qwen 3.6 MoE (https://www.reddit.com/r/LocalLLM/comments/1st2aib/qwen_36_35b_a3b_on_rtx_5090_is_absurdly_fast_for/). Is this the kind of token output I could expect for general tasks like the Spotify one above, or for chats with agents that might remind me to do certain tasks or search the web for additional context, etc.? Or is that kind of t/s typically specific to generating code?

I don't plan on using local models for coding; I'm quite okay using the big models for that off my own hardware. This is really about running personal automations and agentic systems frequently enough that I'd rather not sweat the API costs. Much of the time I expect I'll be executing an agent, expecting some tool calls, and getting back JSON output that I can parse programmatically or use as input context for the next agent in the chain.
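The JSON-in/JSON-out chaining I have in mind looks roughly like this (a sketch, not tied to any particular local server; `call_model` stands in for whatever client you point at llama.cpp, Ollama, or similar):

```python
import json
import re

def extract_json(raw):
    """Pull the first JSON object out of a model reply, tolerating
    markdown code fences the model may wrap around it."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    start = cleaned.find("{")
    end = cleaned.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in reply")
    return json.loads(cleaned[start:end + 1])

def run_agent(prompt, call_model):
    """One link in the chain: call the model, parse its JSON reply.
    `call_model` is a placeholder for your actual inference client."""
    reply = call_model(prompt)
    return extract_json(reply)

# Chaining: one agent's parsed output becomes the next agent's context.
# albums = run_agent("List new albums I'd like, as JSON", call_model)
# playlist = run_agent(f"Build a playlist from: {json.dumps(albums)}", call_model)
```

Small local models are flaky about emitting bare JSON, which is why the parser strips fences and hunts for the outermost braces rather than calling `json.loads` directly on the reply.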

I also game occasionally, which is why I'm leaning toward the 5090 as an all-in-one solution (not to mention that something like the RTX 6000 is around $7,000 on its own).

Thanks!
