I kinda need help with the hermes3:8b model.
I'm kinda new to this world of running LLM's locally and ollama and stuff so maybe my terminologies might not be spot on.
And for my first project (a project that I'll keep working on for a long time to make it better and better) I'm making a voice assistant (with tools) but I'm kinda stuck at choosing an LLM. i can't use a model with more than 8b parameters cuz i have a 4050 (cuz a voice assistant needs to be fast). So far I've tried these models and had these problems with them:
Gemma4:e4b -> it loses context and starts behaving completely randomly sometimes, especially after exchanging a few dialogues. i guess it might be because of the context capabilities of the model.
qwen2.5:7b -> qwen models have very strict guardrails which hinder them from fully roleplaying a character (like billy butcher from the boys because of the language).
mistral:7b -> instead of calling a tool, it just leaks the json inside the response, and idk how to solve that. i thought of manually extracting the tool calls from the response but for that too I'll have to teach the model this in a system prompt to call tools in a defined way. Is there any other way of doing this or should i just do this manual extraction? also yeah, sometimes it was calling tools (in the response only) even when there was no need.
Hermes3:8b -> okay, this one's case is special... it completely ignores the system prompt, calls tools randomly, and sometimes calls them even when they are not required. I've heard that the model is pretty good in itself but it just isn't working.
I'm using Ollama's python library to communicate with the models. and for the chat history, I've set a limit on the messages array that deletes the oldest message when the array grows more than 10 entries (having assistant, tool, and user as separate entries). system prompt always remains at index 0.
please can you help me by telling me what all i need to learn or if I'm missing on basic concepts and how I can tackle these problems I'm facing.