Last Christmas I got the base-model M4 Mac mini, hoping to understand local AI better. In a short amount of time I figured out Ollama and got Qwen 3.5 9B working. Recently I saw some posts suggesting llama.cpp might offer better results, so I installed it, and while exploring what I could do with GGUF files I came across a dockerized GGUF and got it working. Then I asked my AI for a suggestion for a chat window, since the CLI looks a bit dated, and described what I had done. The AI seemed to indicate that since I was running the dockerized GGUF, I did not need to install llama.cpp at all, as I think a llama.cpp build is baked into the Docker image along with the GGUF. Do you think I am wasting RAM by using the dockerized GGUF, when I should simply get my hands dirty, learn the settings in llama.cpp, and not use a dockerized model?
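In case the details matter, here is roughly what the two setups look like as I understand them, using the official llama.cpp server image as an example (the model filename is just a placeholder, not my exact command):

```
# Direct install: llama.cpp's server pointed at a GGUF file, with full Metal
# GPU offload (-ngl 99) on the M4.
llama-server -m ~/models/qwen-9b-q4_k_m.gguf -c 4096 -ngl 99 --port 8080

# Dockerized GGUF: the image ships its own llama.cpp build, so no separate
# install is needed. Docker on macOS runs inside a Linux VM, though, so the
# container cannot use the Mac's GPU.
docker run -v ~/models:/models -p 8080:8080 ghcr.io/ggml-org/llama.cpp:server \
  -m /models/qwen-9b-q4_k_m.gguf --host 0.0.0.0 --port 8080
```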
So, perhaps the real reason for my post today: I came across this Reddit post about running Qwen 3.6 35B on 6 GB of VRAM, which I understand my M4 Mac mini could handle:
https://www.reddit.com/r/unsloth/comments/1t5n672/qwen3635b_giving_2034_ts_on_6_gb_vram/
There appear to be a lot of llama.cpp settings which I have not explored at all. I downloaded the dockerized model from Hugging Face, and I understand that AI enthusiasts create these customized quantizations and share what they have made. So if there is something that allows larger but quantized models to run, will I have more options by using llama.cpp directly instead of a dockerized version? Are dockerized models on Hugging Face more limited, or are pretty much all the tweaked models there also available for Docker? I do not feel like I need to tweak anything and have no problem living with what someone else thought was a good setup.
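From what I can tell, the settings in posts like that one mostly come down to partial GPU offload. Here is a sketch of the kind of invocation I mean (the filename, context size, and layer count are placeholder guesses, not something I have tested):

```
# Placeholder filename and numbers. -ngl sets how many layers go to the GPU;
# -ot / --override-tensor keeps the MoE expert weights in CPU RAM, which is
# the trick posts like that one use to fit a big model next to a small GPU.
llama-server -m ~/models/Qwen-35B-Q4_K_M.gguf -c 8192 -ngl 20 -ot ".ffn_.*_exps.=CPU"
```

My understanding is that the Mac's unified memory makes the CPU/GPU split less dramatic than on a discrete 6 GB card, but the same flags apply.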