Great question! Both are quantization formats for GGUF models, but they make different tradeoffs between size, speed, and quality.
Q4_K_M (4-bit K-quant, medium):
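- Stores weights at roughly 4.8 bits per weight on average (the base Q4_K type is 4.5 bits; the higher-precision tensors in the mix raise the average)
- Uses a mixed-precision scheme: most tensors are Q4_K, while some of the more sensitive attention and feed-forward tensors are stored as Q6_K
- Roughly 30% the size of the original F16 model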
- Quality loss is small for most tasks: typically on the order of a 1% perplexity increase over F16 for 7B models
- Best for: running 7B models on consumer GPUs with 6-8GB VRAM, or 13B models with around 12GB (or with some layers offloaded to CPU)
Q8_0 (8-bit linear):
- Stores weights at 8.5 bits per weight: one 8-bit integer per weight plus one 16-bit scale per block of 32 weights
- Near-lossless quantization — perplexity almost identical to F16
- About 53% the size of F16
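- Token generation is usually slower than Q4_K_M on both CPU and GPU, because generation is memory-bandwidth bound and Q8_0 reads nearly twice as much data per token; prompt processing speed is often comparable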
- Best for: when quality is paramount and you have the RAM/VRAM budget
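
If it helps to see why Q8_0 is close to lossless, here is a minimal sketch of block-wise 8-bit quantization in the style of Q8_0: 32 weights share one scale, and each weight becomes a signed 8-bit integer. The function names are made up for illustration, and this is not llama.cpp's actual code (the real format stores the scale as fp16 and packs everything into bytes; this sketch keeps a plain Python float).

```python
BLOCK_SIZE = 32  # weights per block in the Q8_0 layout


def quantize_q8_0_block(weights):
    """Quantize one block of 32 floats to (scale, list of int8-range values).

    Illustrative only: the scale stays a Python float here, whereas the
    on-disk format stores it as a 16-bit float.
    """
    if len(weights) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} weights, got {len(weights)}")
    amax = max(abs(w) for w in weights)
    scale = amax / 127.0  # map the largest magnitude onto +/-127
    if scale == 0.0:
        return 0.0, [0] * BLOCK_SIZE
    # Round each weight to the nearest integer step and clamp to [-127, 127].
    quants = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_q8_0_block(scale, quants):
    """Recover approximate floats: each value is scale * integer."""
    return [scale * q for q in quants]


def bits_per_weight():
    """Storage cost: 32 one-byte integers plus a 2-byte scale per block."""
    block_bytes = BLOCK_SIZE + 2
    return block_bytes * 8 / BLOCK_SIZE
```

With 255 levels per block, the rounding error on any weight is at most half a step (`scale / 2`), which is why perplexity barely moves. The extra 2-byte scale per block is also where the 8.5 bits per weight comes from.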
My recommendation: Start with Q4_K_M. The quality difference is negligible for conversational tasks, and you get significantly faster inference and lower memory usage. Only upgrade to Q8_0 if you're doing precision-sensitive tasks like code generation benchmarks or fine-tuning evaluation.
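
To check what fits your RAM/VRAM budget, a back-of-the-envelope estimate is usually enough. The sketch below assumes about 4.85 bits per weight for Q4_K_M and 8.5 for Q8_0; the Q4_K_M figure is an approximate average that varies a little by model, and the helper names are hypothetical. It counts weights only, so leave headroom for the KV cache and runtime buffers.

```python
# Approximate storage cost per weight, in bits.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,     # 32 int8 values + one 16-bit scale per block
    "Q4_K_M": 4.85,  # assumed average; depends on the tensor mix
}


def estimate_weights_gib(n_params, quant):
    """Estimated size of the weights alone, in GiB."""
    total_bits = n_params * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1024**3


def size_ratio_vs_f16(quant):
    """Size of a quantized model as a fraction of its F16 size."""
    return BITS_PER_WEIGHT[quant] / BITS_PER_WEIGHT["F16"]


if __name__ == "__main__":
    n_params = 7e9  # e.g. a 7B model
    for quant in ("F16", "Q8_0", "Q4_K_M"):
        gib = estimate_weights_gib(n_params, quant)
        ratio = size_ratio_vs_f16(quant)
        print(f"{quant:7s} {gib:5.1f} GiB  ({ratio:.0%} of F16)")
```

If the Q8_0 number plus a couple of GiB of headroom fits comfortably on your card, Q8_0 is an option; otherwise Q4_K_M is the safer pick.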