Val's Homepage


Baking Meat & Deleting LLMs: A Casual Quantization Test


Important Note: This is not a serious research paper. I wrote these benchmarking notes purely for myself because I wanted to clean up my SSD and purge the weaker models to free up dozens of gigabytes. If you came here looking for charts, code snippets and perfect academic accuracy, do not waste your time. I might do something like that someday, but definitely not today - today was just a casual evaluation while waiting for my meat to bake in the oven😺

So, I just spent a couple of hours on a simple test on my workstation (Ryzen 9 5950X, 128GB RAM, RTX 4080 16GB) to see how local models handle raw logic, strict syntax and context constraints under heavy quantization.

If you are choosing local models based solely on parameter size or default community quants (like Q4_K_M), you might be sabotaging both your speed and output quality. Here is the layout of my personal experiment.

The Testing Framework

I subjected six models to a three-part prompt designed to hit the weak points of heavy quantization:

  1. The Cat in the Rooms Riddle: A classic state-tracking logic puzzle requiring a guaranteed catch sequence across 5 rooms in 6 days. I used (2, 3, 4, 4, 3, 2) as the reference solution - sourced from Gemini, which also provided the mathematical justification (I also tested this in Lean 4). Models were evaluated on whether they arrived at a valid catching sequence and could explain why it works.
  2. Concurrent Go Script: Writing a production-ready concurrent worker pool utilizing context for immediate cancellation upon the first error, ensuring zero goroutine leaks, proper channel closure and strictly English comments.
  3. The Coffee Analogy: A rigid formatting test requiring a 4-paragraph explanation of synchronous vs. asynchronous code with specific keywords (inversion and blocking) forced into paragraphs 3 and 4 respectively.
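
The third task is mechanical enough to check with a few lines of code. The following is only a sketch of such a check, not the procedure used for these notes: the function name checkAnalogy is hypothetical, and it assumes that paragraphs are separated by empty lines and that keyword matching is case-insensitive.

```go
package main

import (
	"fmt"
	"strings"
)

// checkAnalogy reports whether text has exactly four paragraphs, with
// "inversion" somewhere in paragraph 3 and "blocking" somewhere in paragraph 4.
// Assumptions: paragraphs are separated by empty lines, matching ignores case.
func checkAnalogy(text string) bool {
	text = strings.ReplaceAll(text, "\r\n", "\n")
	var paras []string
	for _, p := range strings.Split(text, "\n\n") {
		if strings.TrimSpace(p) != "" {
			paras = append(paras, strings.ToLower(p))
		}
	}
	if len(paras) != 4 {
		return false
	}
	return strings.Contains(paras[2], "inversion") &&
		strings.Contains(paras[3], "blocking")
}

func main() {
	answer := "Sync code waits at the counter.\n\n" +
		"Async code takes a buzzer and sits down.\n\n" +
		"This is an inversion of who waits for whom.\n\n" +
		"Nothing is blocking the queue anymore."
	fmt.Println(checkAnalogy(answer))
}
```

A substring match is deliberately loose: "non-blocking" would also count as "blocking", which may or may not be what a stricter grader wants.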

The Pitfall: System-Level Hangs in “Thinking”

Before getting to the results, a massive technical caveat. Three major models completely locked up my workstation during the first logic puzzle:

  • qwen/qwen3.5-9b Q8_0
  • google/gemma-4-26b-a4b Q4_K_M
  • zai-org/glm-4.7-flash Q4_K_M

They entered an infinite, recursive loop during their Thinking process on the parity puzzle, spinning their gears for over 7 minutes until I had to force-stop them and completely disable the Thinking layer.

I also tested mistralai/ministral-3-14b-reasoning Q6_K and it bricked itself in the exact same manner, but because its reasoning architecture cannot be toggled off in LM Studio, it was completely unusable and had to be dropped from the benchmark entirely.

Interestingly, Qwen 3.6 27B Q4_K_M completely avoided this trap. While its MoE (Mixture-of-Experts) sibling (Qwen 3.6 35B A3B Q4_K_M) struggled with logic stability, the 27B dense model never entered an infinite reasoning loop. In this test, at least, the dense architecture handled state-tracking logic more predictably under a standard quant.

If a model’s internal reasoning loop gets stuck on abstract parity tracking, it is a massive red flag for production stability.
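
For context, the puzzle's state space is tiny, so the reference sequence can be checked exhaustively. The sketch below assumes the classic rules, which these notes do not spell out: the rooms are in a row, one room is checked per day, and each night the cat moves to an adjacent room. The function name catches is hypothetical.

```go
package main

import "fmt"

// catches reports whether the sequence of room checks (1-based, each in
// 1..rooms) is guaranteed to find the cat, wherever it starts and however
// it moves. Assumed rules: one check per day, and every night the cat
// moves to a room adjacent to its current one.
func catches(rooms int, checks []int) bool {
	// possible[r] is true while the cat could still be hiding in room r.
	possible := make([]bool, rooms+1)
	for r := 1; r <= rooms; r++ {
		possible[r] = true
	}
	for _, c := range checks {
		possible[c] = false // if the cat was here, it is caught now

		anyLeft := false
		for r := 1; r <= rooms; r++ {
			anyLeft = anyLeft || possible[r]
		}
		if !anyLeft {
			return true // no hiding place is consistent with the checks so far
		}

		// Overnight move: every surviving position spreads to its neighbours.
		next := make([]bool, rooms+1)
		for r := 1; r <= rooms; r++ {
			if !possible[r] {
				continue
			}
			if r > 1 {
				next[r-1] = true
			}
			if r < rooms {
				next[r+1] = true
			}
		}
		possible = next
	}
	return false
}

func main() {
	fmt.Println(catches(5, []int{2, 3, 4, 4, 3, 2})) // prints "true"
	fmt.Println(catches(5, []int{1, 2, 3, 4, 5}))    // prints "false"
}
```

The set of possible rooms is exactly the state a model has to carry in its head: after the third check only odd rooms remain, and the second half of the sequence sweeps up the parity that is left.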

The Standings

  1. The Absolute King: unsloth/qwen3.6-35b-a3b Q3_K_XL (18.63 GB)

    This model performed flawlessly across all three tasks. It nailed the 6-day riddle with pristine mathematical justification, correctly identifying the parity-inversion structure and arriving at (2, 3, 4, 4, 3, 2) unprompted. More importantly, its Go code was beautiful - utilizing a non-blocking select for error propagation, handling wg.Wait() in a detached goroutine to close channels safely and keeping comments in English. It abided by every single textual constraint.

  2. The Solid Professional: qwen/qwen3.6-27b-instruct Q4_K_M (17.5 GB)

    This model was the biggest surprise of the extended test. Unlike the MoE-based 35B models, this is a traditional dense model, and it absolutely shines.

    • It nailed the 6-day cat riddle unprompted with the exact (2, 3, 4, 4, 3, 2) sequence and flawless parity justification.
    • Its Go code was pretty solid: no deadlocks, a properly buffered error channel, explicit context cancellation and zero goroutine leaks.
    • It followed the strict 4-paragraph coffee analogy constraints perfectly, putting inversion in paragraph 3 and blocking in paragraph 4.

    And the most important part? It didn’t hang or lock up during the reasoning phase. It just executed raw logic efficiently.

  3. The Surprising Runner-Up: unsloth/qwen3.6-35b-a3b Q2_K_XL (14.08 GB)

    Never write off an ultra-low quant if it is one of Unsloth's Dynamic (UD) XL quants. It held the exact same logical line as its bigger brother on the riddle and respected the strict formatting rules. The Go code had a minor context leak (forgot a defer cancel()) and was slightly bloated, but it remained fully functional.

  4. The Biggest Disappointment: qwen/qwen3.6-35b-a3b Q4_K_M (22.1 GB)

    Proof that “heavier is not always smarter”. Despite being a Q4 quant, its code was completely broken. It introduced a terminal deadlock in the main select block if all tasks succeeded, and it returned generic context strings instead of the actual error payload. To me this looks like standard quantization damage on high-importance weights.

  5. The Also-Rans (Qwen 3.5 9B, Gemma-4 26B, GLM-4.7-Flash)

    With Thinking turned off, they fell apart. Qwen 3.5 failed the logic entirely. Gemma-4 hallucinated a non-existent “trap room” in the riddle and wrote bloated, double-nested goroutine logic. GLM-4.7-Flash failed every single benchmark, writing non-compilable Go code that tried to invoke a non-existent .Cancel() method directly on the context interface.

MoE vs. Dense: Why Quantization Hits Them Differently

  • Standard Q4_K_M completely broke the Qwen 3.6 35B-A3B (MoE model), producing broken code and deadlocks.
  • Yet the same Q4_K_M worked perfectly on the Qwen 3.6 27B (dense model).

MoE models are highly sensitive to damage in their routing weights (the parts that decide which expert to use). Regular community quants often corrupt these gatekeepers, causing logical failures or loops. Dense models don’t have this weak point, making them vastly more resilient to standard quantization.

Key Takeaways for My Setup

  • VRAM vs. System RAM Spill: Q3_K_XL only spills about 2.6 GB into my system memory. Q4_K_M spills over 6 GB. Because Q3_K_XL puts less pressure on the PCIe bus, it runs significantly faster while delivering far better code logic than the broken Q4 community quant.
  • The Unsloth XL Layout Wins: The intelligent weight distribution in XL formats preserves the routing logic of MoE models even at low bitrates.
  • Cleanup: Purged Qwen 3.5, Qwen 3.6 35B A3B Q2_K_XL, Ministral 3 and GLM 4.7 Flash from the drive. Keeping Qwen 3.6 35B A3B Q3_K_XL as my main model for local engineering tasks. Sometime later I’ll compare it separately with Qwen 3.6 27B Q4_K_M.
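
The spill figures in the first takeaway are just file size minus VRAM. A rough sketch of that arithmetic, using the file sizes listed in the standings and the 16 GB card, is below. The helper spillGB is hypothetical, and it assumes the whole card is available for weights; KV cache, compute buffers and whatever the desktop already occupies are ignored, so real spill will be somewhat higher.

```go
package main

import "fmt"

// spillGB estimates how much of a model file ends up in system RAM.
// Assumption: the full VRAM budget is available for weights.
func spillGB(modelGB, vramGB float64) float64 {
	if modelGB <= vramGB {
		return 0
	}
	return modelGB - vramGB
}

func main() {
	const vramGB = 16.0 // RTX 4080

	quants := []struct {
		name   string
		sizeGB float64
	}{
		{"35B-A3B Q2_K_XL", 14.08},
		{"35B-A3B Q3_K_XL", 18.63},
		{"35B-A3B Q4_K_M", 22.1},
	}
	for _, q := range quants {
		fmt.Printf("%-16s %5.2f GB spill\n", q.name, spillGB(q.sizeGB, vramGB))
	}
}
```

Under these assumptions the three quants come out at 0.00, 2.63 and 6.10 GB, which matches the "about 2.6 GB" and "over 6 GB" quoted above.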