u/Low-Alarm272

[FOR HIRE] I'm Dex. Nice to meet you all.

I'm exceptional at these things and these things only:
- LLM integrations into existing systems
- AI-related cost cutting with your old laptop and spare RAM sticks
- Technical writing (see my Medium page)
- Building in public as a full-stack developer on X

My AI GitHub project has 30+ stars: https://github.com/workspace-dex/open-agent
I worked at a startup for 2 years as their tech lead.
I also write fiction as a side hobby and recently did freelance work for an indie game studio.

If you have work related to these things, and ONLY these things, I'd love to hear about it!

u/Low-Alarm272 — 2 days ago
▲ 146 r/huggingface+4 crossposts

Llama.cpp is getting better with every update

Last night I updated llama.cpp after like 2 or 3 weeks. The results were really exciting for someone running a 35B model on a 6GB RTX 3050.

Today I got stable token speeds that didn't fall to 9 t/s while generating 1000+ lines of code.

Now I can push my context window into the 64k range and still get 19 t/s minimum. Before, it would drop drastically to 4 t/s.

Now it gives a solid 26 t/s, and in high-context workflows it only falls by 5-7 t/s. This means I can do $1000 worth of coding work on my laptop for free.
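If you want to sanity-check numbers like these on your own machine, llama-server reports its own timings. A minimal sketch, assuming a server on the default port 8080 (exact field names may differ slightly across builds, hence the .get() fallbacks):

import requests

# Ask the local server for a completion and read back the speed it reports.
# Uses llama-server's native /completion endpoint and its "timings" block.
resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Write a Python function that reverses a string.",
          "n_predict": 256},
    timeout=300,
)
timings = resp.json().get("timings", {})
print(f"prompt eval: {timings.get('prompt_per_second', 0.0):.1f} t/s")
print(f"generation:  {timings.get('predicted_per_second', 0.0):.1f} t/s")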

Yes, the AI bubble will pop for sure once people realize they can get nearly the same quality locally as from their cloud subscriptions.

u/Low-Alarm272 — 3 days ago

Here's how:

I've been running Qwen3.6-35B locally on a 6GB RTX 3050, really fast, inside a self-made CLI tool that keeps a stable context-compression system, so context isn't the issue.

Getting decently good results on a Q3 quant.

Thank god llama.cpp exists.

And what's more fun is that I can test out ik_llama to get a few more tokens per second. This is more than enough for me.

My llama.cpp flags:

-c 45000 \
--n-gpu-layers 81 \
--n-cpu-moe 25 \
--override-tensor "blk\.(2[0-9]|3[0-9]|4[0-6])\.ffn_(gate_up|down)_exps\.weight=CPU" \
-b 1024 -ub 512 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--flash-attn on \
--cont-batching \
--threads 6 --threads-batch 6 \
--jinja \
--reasoning auto \
--ctx-checkpoints 10 \
--top-k 64 --top-p 0.75 \
--temp 0.7 \
--repeat-penalty 1.0 \
--cache-prompt
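
For context, the CLI tool side is nothing exotic: it just hits the server's OpenAI-compatible endpoint. A minimal sketch of the idea (not my actual tool), assuming llama-server is up on the default port 8080:

import requests

# One chat turn against the OpenAI-compatible endpoint llama-server exposes.
# The "model" value is a placeholder; with a single loaded model the server
# typically ignores it.
history = [{"role": "system", "content": "You are a coding assistant."}]

def ask(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    resp = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={"model": "local", "messages": history, "temperature": 0.7},
        timeout=600,
    )
    reply = resp.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("Refactor this function to use pathlib instead of os.path."))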

Ask away if you have any questions.

u/Low-Alarm272 — 7 days ago
▲ 45 r/Qwen_AI

Thank god llama.cpp exists.

And what's more fun is that I can test out ik_llama to get a few more tokens per second. This is more than enough for me.

I've been running this really fast inside a Linux CLI tool (I created it) that keeps a stable context-compression system, so context isn't the issue.

Getting decently good results on a Q3 quant.

My llama.cpp flags:

-c 18000 \
--n-gpu-layers 81 \
--n-cpu-moe 25 \
--override-tensor "blk\.(2[0-9]|3[0-9]|4[0-6])\.ffn_(gate_up|down)_exps\.weight=CPU" \
-b 512 -ub 128 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--flash-attn on \
--cont-batching \
--threads 6 --threads-batch 6 \
--jinja \
--reasoning auto \
--ctx-checkpoints 10 \
--top-k 64 --top-p 0.75 \
--temp 0.7 \
--repeat-penalty 1.0 \
--cache-prompt
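
For anyone copy-pasting: these flags go to the server binary. A rough sketch of the full invocation from Python, where the binary path and GGUF filename are placeholders you'd swap for your own build and model:

import subprocess

# Launch llama-server with the flags above. "./llama-server" and
# "./model-q3.gguf" are placeholders, not real paths.
cmd = [
    "./llama-server", "-m", "./model-q3.gguf",
    "-c", "18000", "--n-gpu-layers", "81", "--n-cpu-moe", "25",
    "--override-tensor",
    r"blk\.(2[0-9]|3[0-9]|4[0-6])\.ffn_(gate_up|down)_exps\.weight=CPU",
    "-b", "512", "-ub", "128",
    "--cache-type-k", "q4_0", "--cache-type-v", "q4_0",
    "--flash-attn", "on", "--cont-batching",
    "--threads", "6", "--threads-batch", "6",
    "--jinja", "--reasoning", "auto", "--ctx-checkpoints", "10",
    "--top-k", "64", "--top-p", "0.75",
    "--temp", "0.7", "--repeat-penalty", "1.0", "--cache-prompt",
]
subprocess.run(cmd, check=True)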

Ask away if you have any questions.

u/Low-Alarm272 — 8 days ago

The reason it breaks locally isn't the model. It's the context window.

Here's what actually happens: you run a local 7B model on 6GB VRAM, it starts an agent loop, works for a few steps, then either crashes or starts giving garbage output. Most people think the model is bad. What's actually happening is the context window filled up — tool call history, task state, prior reasoning — and now the model is predicting tokens with no coherent picture of where it is in the task.

The loop either recurses forever (Qwen is infamous for this on multi-tool calls) or hallucinates a completion that never happened.


What I bring to the table

It's a terminal CLI agent harness (think opencode/openclaw style) that manages context deliberately — trimming, summarizing task state, and routing tool calls so a 4B model on constrained hardware stays coherent across a full autonomous task run. The whole thing runs on optimized forks of llama.cpp and doesn't require double-digit VRAM.

The design philosophy is ruthless efficiency: Hermes-agent takes 10k+ context just to reply to a single "hi." My loop stays below 1k. Because you don't need a massive context window — you need a well-managed small one.
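
To make the "well-managed small one" idea concrete, here's a toy sketch of the trimming logic. The helper names are hypothetical, not my actual harness, and the token estimate is deliberately crude:

def estimate_tokens(text: str) -> int:
    # Crude 4-characters-per-token heuristic; a real harness would use
    # the model's tokenizer.
    return max(1, len(text) // 4)

def summarize(turns: list[dict]) -> str:
    # Placeholder: in practice this would be a cheap model call that
    # compresses the dropped turns into task state.
    return "Earlier context: " + "; ".join(t["content"][:40] for t in turns)

def trim_context(messages: list[dict], budget: int = 1000) -> list[dict]:
    # Keep the system prompt plus the newest turns that fit the budget;
    # fold everything older into a one-line summary message.
    system, turns = messages[0], messages[1:]
    kept: list[dict] = []
    used = estimate_tokens(system["content"])
    for turn in reversed(turns):
        cost = estimate_tokens(turn["content"])
        if used + cost > budget:
            break
        kept.insert(0, turn)
        used += cost
    dropped = turns[: len(turns) - len(kept)]
    if dropped:
        kept.insert(0, {"role": "system", "content": summarize(dropped)})
    return [system] + kept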

It also handles the stuff that matters in daily use: persistent memory, parallel task routing, and private data that never leaves your machine. The architecture is built around what the person actually does day-to-day — so the system that gets built isn't generic, it's tuned to your specific workflow.


Who this is for

I've already built customized versions for:

  • People/Startups paying $500-800/month in OpenAI/Anthropic API bills — I'll build you a private local stack with a task harness tuned to your actual workflows. Same capability, zero ongoing cost after setup.

  • Solo developers hitting tool-loop failures — I'll diagnose exactly where your context management breaks and fix the harness architecture, not the prompt.

  • Anyone with constrained hardware (6GB VRAM, consumer GPU) — I can help you max out your rig for real agentic workloads.

This isn't an Ollama install. Anyone can do that. This is the layer on top that makes local agents actually work.

No $800/month API bill. No cloud. Your data doesn't leave your machine.

DM if anyone is interested.

u/Low-Alarm272 — 11 days ago
▲ 1 r/replit

I'll keep this short because I know self-promo posts are annoying.

I'm a full-stack developer with 2 years of production experience. My niche right now is LLM integrations — specifically the kind that don't require massive cloud spend. Self-hosted models, agent frameworks, AI workflows built into real products.

I also write. Technical articles, documentation, product explainers. If you need someone who can build the thing and also explain it clearly, that's a rare combo, and I do both.

If you're building an AI product and need a developer who can own it end-to-end — DM me. Open to discussing scope and pricing.

github.com/workspace-dex | medium.com/@strangelyevil

u/Low-Alarm272 — 13 days ago

If anyone here needs a dev who actually understands the stack from model to frontend, I'm available for freelance work. DM.

u/Low-Alarm272 — 13 days ago

Anyone with a laptop can get this performance. You can quite possibly vibe-code prototypes and draft presentations on your own machine.

I was able to do all this with my own efficient agentic loop (similar to openclaw or hermes-agent), built to run on my limited hardware.

With these optimizations I get really good results. It's just a matter of time before I can run peak-performing local LLMs.

Right now I can vibe-code locally with Qwen 9B on this setup. Ideal for prototyping.

My prediction is we'll be able to get 80-90% of claude-code like results locally in 6-12 months.

GLORY TO OPEN-SOURCE!

u/Low-Alarm272 — 14 days ago