r/LocalLLaMA Daily Update (24h, 2026-03-28 JST)
Top concrete r/LocalLLaMA updates from the last 24 hours: GLM-5.1 release momentum, TurboQuant/llama.cpp performance work, and practical deployment/benchmark resources.
Window: last 24 hours (reported on 2026-03-28 JST)
Models
- GLM 5.1 is out — highest-signal model-release thread in the window.
- GLM-5.1 is live – coding ability on par with Claude Opus 4.5 — second major thread confirming rollout and early capability impressions.
- chromadb/context-1: 20B parameter agentic search model — new release aimed at retrieval and agentic-search workloads.
- I benchmarked 31 STT models on medical audio — VibeVoice 9B is the new open-source leader — concrete model comparison with domain-specific word error rate (WER) results (a minimal WER sketch follows this list).
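For context on the STT comparison above: word error rate (WER) is the standard metric behind such rankings. The thread's exact evaluation pipeline isn't reproduced in this digest, so the snippet below is only a minimal from-scratch illustration of the metric itself (word-level edit distance divided by reference length), with made-up example strings.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Made-up example: one substitution out of four reference words -> 0.25
print(wer("patient denies chest pain", "patient denies chess pain"))
```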
Tools/Frameworks
- Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant) — practical runtime optimization that avoids most KV-cache dequantization during decode, with the reported gain measured at 32K context.
- New Unsloth Studio Release! — fresh framework update for users running local fine-tuning workflows.
- TurboQuant for weights: near‑optimal 4‑bit LLM quantization with lossless 8‑bit residual — quantization-method release with strong community discussion (a generic base-plus-residual sketch follows this list).
- DeepSeekOCR & codefuse-ai/F2LLM-v2 are ready on llama.cpp — concrete compatibility update for llama.cpp users.
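On the TurboQuant weight-quantization item above: the actual algorithm lives in the linked thread, not here. The sketch below only illustrates the general base-plus-residual shape the title points at, using plain symmetric uniform quantization as a stand-in: a coarse 4-bit pass over the weights, then an 8-bit pass over whatever the first pass missed.

```python
import numpy as np

def quantize(x: np.ndarray, bits: int):
    """Symmetric uniform quantization of a tensor to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(x).max()) / qmax or 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)   # stand-in for one weight row

# Stage 1: coarse 4-bit base quantization of the weights.
q4, s4 = quantize(w, bits=4)
base = dequantize(q4, s4)

# Stage 2: quantize what stage 1 missed (the residual) at 8 bits and add it back.
q8, s8 = quantize(w - base, bits=8)
recon = base + dequantize(q8, s8)

print("4-bit only RMSE:      ", float(np.sqrt(np.mean((w - base) ** 2))))
print("4-bit + residual RMSE:", float(np.sqrt(np.mean((w - recon) ** 2))))
```

The reason a second pass helps is that the residual has a much smaller dynamic range than the weights themselves, so 8 bits spent on it recover most of the error left by the coarse 4-bit base.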
Resources
- [Qwen Meetup] Function Calling Harness with Qwen, turning 6.75% to 100% — implementation write-up framed around a reproducible improvement from 6.75% to 100% (a generic validate-and-retry harness sketch follows this list).
- Cohere Transcribe WebGPU: state-of-the-art multilingual speech recognition in your browser — browser-local speech-recognition resource for practical experimentation.
- FlashAttention from first principles — educational technical deep dive shared in-window (a minimal online-softmax tiling sketch also follows this list).
- Inference Engines — Part I: How It Works (visual deep dive) — reference-style explainer on inference stack internals.
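On the Qwen function-calling write-up: the harness itself isn't reproduced in this digest, so the sketch below only shows the generic pattern such harnesses tend to use, strict validation of the tool-call JSON plus error-guided retries. The tool name, schema fields, and the call_model stub are hypothetical placeholders rather than anything taken from the post.

```python
import json

def call_model(prompt: str) -> str:
    """Hypothetical stub: wire this to your local Qwen endpoint or client."""
    raise NotImplementedError

# Hypothetical tool spec for illustration only.
TOOL_SPEC = {"name": "lookup_order", "required": ["order_id"]}

def parse_tool_call(raw: str) -> dict:
    """Validate the model's tool call; raise ValueError with a specific reason."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}")
    if call.get("name") != TOOL_SPEC["name"]:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    missing = [k for k in TOOL_SPEC["required"] if k not in call.get("arguments", {})]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return call

def run_harness(user_msg: str, max_retries: int = 3) -> dict:
    """Ask for a tool call; on malformed output, feed the validator's error back in."""
    prompt = user_msg
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            return parse_tool_call(raw)
        except ValueError as err:
            prompt = (f"{user_msg}\n"
                      f"Your previous tool call was rejected: {err}. "
                      f"Reply with corrected JSON only.")
    raise RuntimeError("no valid tool call after retries")
```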
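On the FlashAttention explainer: its core building block, computing softmax attention tile by tile with running statistics so the full score vector never has to exist at once, fits in a few lines of NumPy. This is a single-query, no-masking toy for intuition, not the fused GPU kernel the write-up covers.

```python
import numpy as np

def tiled_attention(q, K, V, block=64):
    """Single-query attention over K/V tiles with running (online) softmax
    statistics, so the full score vector is never materialized at once."""
    d = q.shape[-1]
    m = -np.inf           # running max of scores seen so far
    l = 0.0               # running sum of exp(score - m)
    acc = np.zeros(d)     # running exp-weighted sum of V rows
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = (k_blk @ q) / np.sqrt(d)        # scores for this tile only
        m_new = max(m, float(s.max()))
        corr = np.exp(m - m_new)            # rescale old stats to the new max
        p = np.exp(s - m_new)
        l = l * corr + p.sum()
        acc = acc * corr + p @ v_blk
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
K, V = rng.standard_normal((256, 64)), rng.standard_normal((256, 64))
q = rng.standard_normal(64)

# Reference: plain softmax attention over the whole sequence at once.
s = (K @ q) / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(tiled_attention(q, K, V), ref)
```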