u/Capital_Savings_9942

Cooked up a new Qwen3-8B coding model that actually "thinks" before it types (HyperThinkCode-v1.5)

Hey everyone!

I just dropped a new 4-bit QLoRA fine-tune of Qwen3-8B under my org, Cyprus. If you're into models that map out their logic before blindly spitting out scripts, you might want to give this a spin. It's called HyperThinkCode-Qwen3-8B-v1.

Model link: https://huggingface.co/Andy-ML-And-AI/HyperThinkCode-Qwen3-8B-v1

The Vibe: "Think first, code second"

The main goal here was to force the model to explicitly reason before writing the final code. I used a 30k subset of the Sashvat/HyperThink-X-Nvidia-Opencode-Reasoning-200K dataset and tweaked the chat template so the assistant responds inside a thinking field first. Basically, it talks to itself to figure out the problem, then it gives you the code.
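For a rough idea of what that formatting looks like, here's a minimal sketch. The helper and tag names below are my illustration of the idea (they mirror Qwen3's native <think> tags), not the literal training code:

Python

def format_example(problem, reasoning, code):
    # Illustrative only: wrap the assistant's reasoning in a thinking
    # block before the final answer, so the model learns to reason first.
    return [
        {"role": "user", "content": problem},
        {"role": "assistant",
         "content": f"<think>\n{reasoning}\n</think>\n\n{code}"},
    ]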

How I cooked it up:

  • Base: Qwen3-8B
  • Hardware: Trained on dual Tesla T4s (16GB VRAM each)
  • The Method: 4-bit QLoRA via Unsloth. Targeted all linear layers (Attention: q, k, v, o | MLP: gate, up, down) with Rank 16 / Alpha 16 (see the config sketch after this list).
  • Time: Super quick run, just 50 steps (global batch size 8), which took about 1 hour and 17 minutes.
  • Context: Capped at 4096 tokens to balance code complexity without letting VRAM explode.
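If you want to reproduce roughly this setup, a minimal Unsloth sketch looks something like the following. The hyperparameters mirror the bullets above; the dataset prep and trainer wiring are omitted, and anything not listed above is an assumption:

Python

from unsloth import FastLanguageModel

# Load the base model in 4-bit with the same 4096-token context cap.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen3-8B",
    max_seq_length = 4096,
    load_in_4bit = True,
)

# Attach LoRA adapters to all linear layers, Rank 16 / Alpha 16.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)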

Even with just 50 steps, the training loss dropped nicely (0.8177 down to 0.6785). I'm currently running lm-eval benchmarks on HumanEval and GSM8K to see exactly how it stacks up against the base Qwen3-8B.
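If you want to run the same comparison yourself, here's a hedged sketch using the lm-evaluation-harness Python API. The task selection and model args are my assumptions about a typical setup, not my exact invocation:

Python

import lm_eval

# Score the fine-tune on GSM8K; swap in "Qwen/Qwen3-8B" for the baseline.
# Note: HumanEval executes generated code, so recent harness versions
# gate it behind an explicit opt-in flag.
results = lm_eval.simple_evaluate(
    model = "hf",
    model_args = "pretrained=Andy-ML-And-AI/HyperThinkCode-Qwen3-8B-v1",
    tasks = ["gsm8k"],
)
print(results["results"])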

Running it

Since it's an 8B loaded in 4-bit, it's lightweight and easy to daily-drive. If you want to fire it up in Python using Unsloth, here is the quick snippet:

Python

from unsloth import FastLanguageModel

# Load the 4-bit weights and tokenizer with the same 4096-token
# context cap the model was trained with.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Andy-ML-And-AI/HyperThinkCode-Qwen3-8B-v1",
    max_seq_length = 4096,
    load_in_4bit = True,
)
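That only loads the weights. To actually generate, something like the following should work; the prompt is just an example, and the chat-template call is the standard transformers pattern:

Python

# Switch Unsloth into its faster inference mode.
FastLanguageModel.for_inference(model)

messages = [{"role": "user",
             "content": "Write a function that checks if a string is a palindrome."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt = True, return_tensors = "pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens = 1024)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))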

I'd love for you guys to test it out against whatever local coding models you're currently using and let me know if the extra "hyperthinking" layer actually helps with your workflows!

u/Capital_Savings_9942 — 5 days ago

I have written a technical report that looks at ways to optimize memory and compute for training large language models when resources are limited.

The report groups over 20 techniques into categories such as:

  • Model state partitioning, including ZeRO and FSDP
  • Quantization-based methods, like QLoRA and NF4 (sketched in code after the next list)
  • Strategies for managing activation memory, including gradient checkpointing (also covered in the sketch below)
  • I/O-aware kernel optimizations, like FlashAttention and kernel fusion

It also covers:

  • How well different hardware generations work with these techniques, covering Turing, Ampere, and Hopper
  • Tables comparing how much VRAM each technique saves versus its compute overhead
  • Example configurations for both single-GPU setups and multi-GPU clusters
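To make two of those categories concrete, here is a minimal sketch of an NF4 (QLoRA-style) load plus gradient checkpointing using transformers and bitsandbytes. The model name and dtype choices are illustrative, not prescriptions from the report:

Python

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantization category: NF4 4-bit weights with double quantization, as in
# QLoRA. bfloat16 compute needs Ampere or newer; on Turing, use float16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",  # placeholder; any causal LM works here
    quantization_config = bnb_config,
    device_map = "auto",
)

# Activation-memory category: recompute activations during the backward
# pass instead of caching them, trading extra compute for VRAM savings.
model.gradient_checkpointing_enable()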

My goal with this report was to bring together ideas from theory and systems into one place that people can reference.

I would really like to hear any thoughts or corrections people might have.

I am also getting ready to submit this work to arXiv, and I need someone to endorse it for cs.AI and cs.LG.

I have an arXiv endorsement code (EKKH4F).
I can forward the official arXiv email with the endorsement link if you’re willing to help.

If someone who knows this area is willing to look it over and endorse it, that would be great.

drive.google.com
u/Capital_Savings_9942 — 9 days ago