r/LocalLLaMA Daily Update (24h, 2026-03-28 JST)
Top concrete r/LocalLLaMA updates from the last 24 hours: GLM-5.1 release momentum, TurboQuant/llama.cpp performance work, and practical deployment/benchmark resources.
Window: last 24 hours (reported on 2026-03-28 JST)
Models
- GLM 5.1 is out — highest-signal model-release thread in the window.
- GLM-5.1 is live – coding ability on par with Claude Opus 4.5 — second major thread confirming rollout and early capability impressions.
- chromadb/context-1: 20B parameter agentic search model — new release aimed at retrieval and agentic-search workloads.
- I benchmarked 31 STT models on medical audio — VibeVoice 9B is the new open-source leader — concrete model comparison with domain-specific word error rate (WER) results (a minimal WER sketch follows this list).
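For context on the STT comparison above: word error rate (WER) is the standard metric behind such rankings. The thread's exact evaluation pipeline isn't reproduced in this digest, so the snippet below is only a minimal from-scratch illustration of the metric itself (word-level edit distance divided by reference length), with made-up example strings.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Made-up example: one substitution out of four reference words -> 0.25
print(wer("patient denies chest pain", "patient denies chess pain"))
```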
Tools/Frameworks
- Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant) — practical runtime optimization that avoids most KV-cache dequantization during decode, with the reported gain measured at 32K context.
- New Unsloth Studio Release! — fresh framework update for users running local fine-tuning workflows.
- TurboQuant for weights: near‑optimal 4‑bit LLM quantization with lossless 8‑bit residual — quantization-method release with strong community discussion (a generic base-plus-residual sketch follows this list).
- DeepSeekOCR & codefuse-ai/F2LLM-v2 are ready on llama.cpp — concrete compatibility update for llama.cpp users.
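On the TurboQuant weight-quantization item above: the actual algorithm lives in the linked thread, not here. The sketch below only illustrates the general base-plus-residual shape the title points at, using plain symmetric uniform quantization as a stand-in: a coarse 4-bit pass over the weights, then an 8-bit pass over whatever the first pass missed.

```python
import numpy as np

def quantize(x: np.ndarray, bits: int):
    """Symmetric uniform quantization of a tensor to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(x).max()) / qmax or 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)   # stand-in for one weight row

# Stage 1: coarse 4-bit base quantization of the weights.
q4, s4 = quantize(w, bits=4)
base = dequantize(q4, s4)

# Stage 2: quantize what stage 1 missed (the residual) at 8 bits and add it back.
q8, s8 = quantize(w - base, bits=8)
recon = base + dequantize(q8, s8)

print("4-bit only RMSE:      ", float(np.sqrt(np.mean((w - base) ** 2))))
print("4-bit + residual RMSE:", float(np.sqrt(np.mean((w - recon) ** 2))))
```

The reason a second pass helps is that the residual has a much smaller dynamic range than the weights themselves, so 8 bits spent on it recover most of the error left by the coarse 4-bit base.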
Resources
- [Qwen Meetup] Function Calling Harness with Qwen, turning 6.75% to 100% — implementation write-up framed around a reproducible improvement from 6.75% to 100% (a generic validate-and-retry harness sketch follows this list).
- Cohere Transcribe WebGPU: state-of-the-art multilingual speech recognition in your browser — browser-local speech-recognition resource for practical experimentation.
- FlashAttention from first principles — educational technical deep dive shared in-window (a minimal online-softmax tiling sketch also follows this list).
- Inference Engines — Part I: How It Works (visual deep dive) — reference-style explainer on inference stack internals.
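On the Qwen function-calling write-up: the harness itself isn't reproduced in this digest, so the sketch below only shows the generic pattern such harnesses tend to use, strict validation of the tool-call JSON plus error-guided retries. The tool name, schema fields, and the call_model stub are hypothetical placeholders rather than anything taken from the post.

```python
import json

def call_model(prompt: str) -> str:
    """Hypothetical stub: wire this to your local Qwen endpoint or client."""
    raise NotImplementedError

# Hypothetical tool spec for illustration only.
TOOL_SPEC = {"name": "lookup_order", "required": ["order_id"]}

def parse_tool_call(raw: str) -> dict:
    """Validate the model's tool call; raise ValueError with a specific reason."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}")
    if call.get("name") != TOOL_SPEC["name"]:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    missing = [k for k in TOOL_SPEC["required"] if k not in call.get("arguments", {})]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return call

def run_harness(user_msg: str, max_retries: int = 3) -> dict:
    """Ask for a tool call; on malformed output, feed the validator's error back in."""
    prompt = user_msg
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            return parse_tool_call(raw)
        except ValueError as err:
            prompt = (f"{user_msg}\n"
                      f"Your previous tool call was rejected: {err}. "
                      f"Reply with corrected JSON only.")
    raise RuntimeError("no valid tool call after retries")
```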
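On the FlashAttention explainer: its core building block, computing softmax attention tile by tile with running statistics so the full score vector never has to exist at once, fits in a few lines of NumPy. This is a single-query, no-masking toy for intuition, not the fused GPU kernel the write-up covers.

```python
import numpy as np

def tiled_attention(q, K, V, block=64):
    """Single-query attention over K/V tiles with running (online) softmax
    statistics, so the full score vector is never materialized at once."""
    d = q.shape[-1]
    m = -np.inf           # running max of scores seen so far
    l = 0.0               # running sum of exp(score - m)
    acc = np.zeros(d)     # running exp-weighted sum of V rows
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = (k_blk @ q) / np.sqrt(d)        # scores for this tile only
        m_new = max(m, float(s.max()))
        corr = np.exp(m - m_new)            # rescale old stats to the new max
        p = np.exp(s - m_new)
        l = l * corr + p.sum()
        acc = acc * corr + p @ v_blk
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
K, V = rng.standard_normal((256, 64)), rng.standard_normal((256, 64))
q = rng.standard_normal(64)

# Reference: plain softmax attention over the whole sequence at once.
s = (K @ q) / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(tiled_attention(q, K, V), ref)
```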