Hi,
Reading here about the (high-end) hardware configurations people run things on, I was very hesitant to even ask for help with tweaking (squeezing a bit more out of) my configuration, as I have a pretty low hardware spec in comparison, but I was encouraged by recent success posts, especially this recent one, so I've decided to ask anyway.
My hardware consists of a GTX 1080 with 8GB VRAM, 32GB DDR4 (2133 MT/s) and an older-gen Intel i5-7600 with 4 cores.
Even though I'm pretty new to running local models, I've tried many models that I could load, from Qwen2.5-coder-[7,14..]-instruct and Qwen3-coder-30b-instruct-480b-distill-v2-i1 to Mistral and gpt-oss, but decided to settle on Qwen3.6-35b-a3b.
My main use as a software engineer is primarily C++ and secondarily (learning) Python coding and debugging.
At first I was consulting Google (AI mode) and then switched to ChatGPT for advice about adequate models for my hardware spec (until I decided on one), and then spent hours, even days, chatting with it about tweaking settings in LM Studio (0.4.12 (Build 1)). Part of that was restarting the OS, because when a model fails to load, subsequent attempts fail immediately (I guess because of memory fragmentation) and nothing helped except a full restart, and then trying something else. I also tried out many agents, mainly ones used from within VS Code: Cline, Roo Code, Continue... plus Aider and Open Code outside of it. (ChatGPT insisted I stay away from "heavier" agents like Qwen Code, Codex, etc., which are too much for my spec and context length, to which I'll come in a bit.)
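As a side note on those reload failures: this is a minimal sketch (assuming the NVIDIA driver is present and `pynvml` is installed via pip) of how I could check whether the previous failed load actually released its VRAM before retrying, instead of rebooting blindly:

```python
# Minimal sketch: check how much VRAM is actually free before retrying a load.
# Assumes the NVIDIA driver is installed and `pip install pynvml` was done.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the GTX 1080 is the only GPU here
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

gib = 1024 ** 3
print(f"total: {mem.total / gib:.2f} GiB")
print(f"used:  {mem.used / gib:.2f} GiB")
print(f"free:  {mem.free / gib:.2f} GiB")

pynvml.nvmlShutdown()
```

If "used" stays high after a failed load, the memory probably wasn't released by the server process, and restarting just LM Studio (or its llama.cpp runtime) might be enough instead of a full OS restart.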
I've decided to settle for now on Cline (prone to loops, but more natural to interact with than, say, Roo Code) and Continue (not as autonomous, but more compact and faster). Also, I'm not using autocomplete, as it's not crucial for me and things are already slow as they are.
I'm also using all of this on Linux with KDE (maybe it doesn't matter much, but I thought I'd mention it since it's a somewhat heavier DE).
Also, I don't mind waiting a little longer (slightly less speed) if I get to keep the intelligence/reasoning.
Following ChatGPT's suggestions, I've come up with the following settings in LM Studio for Qwen3.6-35b-a3b Q4_K_M GGUF:
LM Studio Settings -> Model Defaults:
- Model Loading guardrail: Strict
LM Studio Settings -> Runtime:
- GGUF: CUDA llama.cpp (Linux) v2.13.0
Model Settings:
Load pane:
- Context Length: 12288 (if I go higher the model fails to load; if I go lower I can't use Continue and/or Cline)
- GPU Offload: 9 (I remember I could go up to 10, but then I'd need to lower the context length; any layer higher and it fails to load. See the rough VRAM math sketched right after this Load pane list.)
- CPU Thread Pool Size: 2 (that's the max; LM Studio won't let me go higher no matter what, even though I have 4 cores)
- Evaluation Batch Size: 256
- Max Concurrent Predictions: 2
- Unified KV Cache: ON
- RoPE Frequency Base: Unchecked (auto)
- RoPE Frequency Scale: Unchecked (auto)
- Offload KV Cache to GPU Memory: ON
- Keep Model in Memory: ON
- Try mmap(): ON
- Seed: Unchecked (Random Seed)
- Number of Experts: 8
- Number of layers to force onto CPU: 0
- Flash Attention: ON
- K Cache Quantization Type: Q4_0
- V Cache Quantization Type: Q4_0
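For reference, here is the rough back-of-the-envelope math I mean above for how context length, GPU offload and K/V cache quantization all compete for the 8GB of VRAM. Every model-specific number below is a placeholder I'd have to read off the model card or the llama.cpp load log; I haven't verified them for this exact GGUF:

```python
# Rough VRAM budget sketch. All model-specific numbers below are
# assumptions/placeholders to be replaced with values from the model
# card or the llama.cpp load log -- not verified for this exact GGUF.

model_file_gib   = 18.0   # size of the Q4_K_M GGUF on disk (placeholder)
n_layers         = 48     # total transformer layers (placeholder)
gpu_layers       = 9      # "GPU Offload" setting in LM Studio
ctx_len          = 12288  # "Context Length" setting
n_kv_heads       = 4      # KV heads (placeholder, depends on the GQA config)
head_dim         = 128    # per-head dimension (placeholder)
kv_bytes_per_val = 0.5    # ~4 bits per value with Q4_0 K/V cache quantization

# Weights: assume VRAM use scales roughly with the fraction of layers offloaded.
weights_on_gpu_gib = model_file_gib * gpu_layers / n_layers

# KV cache: 2 (K and V) * layers * context * KV heads * head dim * bytes per value.
kv_cache_gib = (2 * n_layers * ctx_len * n_kv_heads * head_dim
                * kv_bytes_per_val) / 1024**3

print(f"weights on GPU: ~{weights_on_gpu_gib:.1f} GiB")
print(f"KV cache (Q4_0, all layers on GPU): ~{kv_cache_gib:.1f} GiB")
print(f"total before compute buffers/overhead: "
      f"~{weights_on_gpu_gib + kv_cache_gib:.1f} GiB of 8 GiB VRAM")
```

The point is just that every extra offloaded layer and every extra bit of context comes out of the same 8 GiB, which (as far as I understand it) is why raising one forces me to lower the other.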
Inference pane:
- Temperature: 0.3
- Limit Response Length: Unchecked
- Context Overflow: Truncate Middle
- Stop Strings: empty
- CPU Threads: 2 (the max, for the same reason as the CPU Thread Pool Size)
- Start String: <think>
- End String: </think>
- Top K Sampling: 40
- Repeat Penalty: 1.1
- Presence Penalty: Unchecked
- Top P Sampling: 0.9
- Min P Sampling: 0.08
- In the Prompt Template section (Template: "Jinja"), I've set as the first line:
{%- set preserve_thinking = True %}
- System prompt:
"You are an expert software engineer (C++17/20, Python 3.12).
Goal:
Produce correct, concise, and practical solutions with minimal iteration.
----------------------------------------
General Behavior
----------------------------------------
- Be decisive and avoid unnecessary back-and-forth.
- Prefer simple, correct solutions over complex ones.
- Do not over-engineer.
----------------------------------------
Task Handling
----------------------------------------
- Identify task type implicitly:
  - Design → define structure first
  - Implementation → write complete, correct code
  - Debugging → find root cause and apply minimal fix
- Do not mix modes unnecessarily.
- Complete the current task before switching context.
----------------------------------------
Scope Control
----------------------------------------
- Focus only on relevant code or logic.
- Avoid scanning or rewriting unrelated parts.
- Do not expand scope unless required.
----------------------------------------
Reasoning
----------------------------------------
- Keep reasoning brief (3–5 bullets max).
- Focus on decisions, not exploration.
----------------------------------------
Anti-Loop / Anti-Drift
----------------------------------------
- Do not repeat the same failed approach.
- If uncertain, make the most likely assumption and proceed.
- Avoid re-analyzing the same information.
----------------------------------------
Code Quality
----------------------------------------
- Do not invent variables or APIs.
- Ensure consistency across the solution.
- Avoid partial or broken implementations.
----------------------------------------
Output
----------------------------------------
- Be concise and direct.
- Show only relevant code or results.
- Do not include unnecessary explanation unless asked."
With these settings, in LM Studio's chat, after generation finishes it shows around 3.50 tok/sec (sometimes 3.48, sometimes 3.70). Very, very slow, I know... and it's also quite bad at finding and fixing bugs, but still better than the models I tried before.
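In case the exact numbers matter for any advice: this is a small sketch of how I could measure tok/sec outside the chat UI (so agent overhead doesn't muddy the numbers), against LM Studio's OpenAI-compatible local server. It assumes the server is enabled on the default port 1234, the `openai` Python package is installed, and the model identifier is a placeholder to be replaced with whatever LM Studio actually reports:

```python
# Minimal throughput check against LM Studio's OpenAI-compatible local server.
# Assumes the server is enabled in LM Studio on the default port (1234) and
# that the `openai` Python package is installed; the model name is a placeholder.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.time()
resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b",  # placeholder: use the identifier LM Studio shows
    messages=[{"role": "user",
               "content": "Write a C++ function that reverses a string in place."}],
    temperature=0.3,
    max_tokens=256,
)
elapsed = time.time() - start

completion_tokens = resp.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.2f} tok/s (includes prompt processing)")
```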
Now, I know it's a lot to ask, but considering my use case (C++ and Python) and my hardware spec, I would like to hear some advice from you about:
- which model/quant I should use (Q4_K_M, Q5_K_S... i1-Q4_K_S...)?
- what settings I should use for it?
Thanks!