I am a complete newbie to fine-tuning and model development. I am trying to get started with Unsloth and Gemma 4 fine-tuning, but I have gotten nowhere so far.
I am following this Kaggle notebook to train an E2B model:
https://www.kaggle.com/code/danielhanchen/gemma4-31b-unsloth
The example dataset in the notebook contains general data that Gemma 4 can already handle without fine-tuning, so I couldn't see much of a difference after fine-tuning.
So I created a specialised synthetic dataset with questions that an untuned Gemma 4 cannot answer:
https://docs.google.com/document/d/1-xssD2wcr7m0KaV2hNDwa1Vk_MnmZeiCXSOHG62ZbFw/edit?usp=drivesdk
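Each record in that file is a flat JSON object with three fields. The values below are just placeholders to show the shape (the real entries are in the doc above); the keys match what my prep code reads:

{
  "system": "You are a K-pop trivia expert.",
  "user": "<a question that only the synthetic facts can answer>",
  "assistant": "<the matching answer>"
}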
I prepped the synthetic dataset using the following AI-generated code:
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset

# tokenizer comes from the model-loading cell earlier in the notebook
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-4-thinking",
)

dataset = load_dataset("json", data_files = "Kpop.json", split = "train")

# Wrap each flat record into the conversations format the chat template expects
def format_to_conversations(example):
    return {
        "conversations": [
            {"role": "system", "content": example["system"]},
            {"role": "user", "content": example["user"]},
            {"role": "assistant", "content": example["assistant"]},
        ]
    }

dataset = dataset.map(format_to_conversations)

# Render each conversation into a single training string
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    # Apply the template and strip the leading <bos> token,
    # since the tokenizer adds <bos> again when the text is tokenized for training
    texts = [
        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix("<bos>")
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched = True)
print(dataset[0]["text"])
I tried training for 60 steps and then for 100 steps. The model couldn't answer the questions after either run.
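For reference, the training call looks roughly like this. It is a sketch based on the notebook's usual defaults, so the batch size, learning rate, and optimizer are my assumptions, and model/tokenizer come from the notebook's earlier loading cell:

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,                        # loaded earlier in the notebook
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,  # assumed notebook default
        gradient_accumulation_steps = 4,  # effective batch size of 8
        warmup_steps = 5,
        max_steps = 100,                  # I also tried 60
        learning_rate = 2e-4,
        optim = "adamw_8bit",
        logging_steps = 1,
    ),
)
trainer.train()

If those assumed sizes are right, ~25 examples at an effective batch size of 8 is about 3 steps per epoch, so 100 steps is already roughly 30 passes over the data.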
Before this, I had followed a YouTube tutorial and tried Unsloth Studio with the same dataset (converted to Alpaca format), but it didn't work there either.
Is my dataset too small (Google developers said that 20-30 examples are enough), is 100 steps not enough, or is it something else entirely?
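In case it matters, this is roughly how I test the model after training; the two message contents are placeholders for actual entries from my dataset:

# Build a prompt in the same chat format used during training and generate from it
messages = [
    {"role": "system", "content": "<the same system prompt as in training>"},
    {"role": "user", "content": "<a question from the dataset>"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt = True, return_tensors = "pt"
).to(model.device)
outputs = model.generate(input_ids = input_ids, max_new_tokens = 256)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))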