r/LocalLLaMA Daily Update (24h, 2026-03-29 JST)
Top concrete r/LocalLLaMA updates from the last 24 hours: IBM Granite 4.0 3B Vision surfaced, TurboQuant implementations spread across the ecosystem (MLX, llama.cpp), and new benchmarking/resource posts covering Apple Silicon, V100s, and agent harnesses stood out.
Window: last 24 hours (reported on 2026-03-29 JST)
Models
- ibm-granite/granite-4.0-3b-vision · Hugging Face — new compact vision-capable Granite 4.0 checkpoint surfaced to the community.
- EverMind-AI/EverMemOS: 4B parameter model with 100M token memory. — niche but concrete model release thread focused on long-memory behavior.
Tools/Frameworks
- TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed) — concrete implementation update bringing TurboQuant to MLX with practical Apple hardware gains.
- llama.cpp: Prefetching weights when offloading to CPU — performance-focused llama.cpp runtime improvement for hybrid CPU/GPU setups.
- llama.cpp with TurboQuant, Heavy-Hitter Oracle (H2O), and StreamingLLM. Even more performance! — early integration report stacking multiple context-compression and performance techniques in one pipeline.
- Built a simple PyTorch flash-attention alternative for AMD GPUs that don’t have it — practical tooling contribution for AMD users lacking native flash-attention support.
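Several of the items above (the TurboQuant MLX port and the llama.cpp integration) revolve around compressing the KV cache. A minimal sketch of the general idea, block-wise low-bit quantization with a per-block scale, is below. This is an illustrative toy in plain Python, not TurboQuant's actual algorithm, which is not described in these posts.

```python
# Toy block-wise 4-bit quantization of a KV-cache block: each block of
# floats is mapped to 4-bit integers plus one shared (scale, offset) pair.
# Illustrates the storage/accuracy trade-off behind KV-compression schemes.

def quantize_block(values, bits=4):
    """Quantize a block of floats to unsigned ints with a shared scale/offset."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1            # 15 representable steps for 4-bit
    scale = (hi - lo) / levels or 1.0   # guard against constant blocks
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize_block(q, scale, lo):
    """Reconstruct approximate floats from quantized values."""
    return [x * scale + lo for x in q]

kv_block = [0.12, -0.40, 0.33, 0.05, -0.18, 0.27, -0.02, 0.41]
q, scale, lo = quantize_block(kv_block)
restored = dequantize_block(q, scale, lo)
# Rounding to the nearest level bounds the per-value error by scale / 2.
max_err = max(abs(a - b) for a, b in zip(kv_block, restored))
```

At 4 bits, eight FP32 values (32 bytes) pack into 4 bytes plus the per-block scale/offset, which is where headline compression ratios like "4.6x" come from once metadata overhead is counted.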
Resources
- M5 Max vs M3 Max Inference Benchmarks (Qwen3.5, oMLX, 128GB, 40 GPU cores) — high-value comparative benchmark data for Apple Silicon inference planning.
- V100 32 GB: 6h of benchmarks across 20 models with CPU offloading & power limits — long-run benchmark dataset relevant to operators of older datacenter GPUs.
- Web use agent harness w/ 30x token reduction, 12x TTFT reduction w/ Qwen 3.5 9B on potato device — actionable engineering write-up on lowering token/runtime costs for local agents.
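One plausible way a web-use agent harness achieves large token reductions is by stripping raw HTML down to visible text before it reaches the model. The sketch below uses only Python's stdlib `html.parser`; it is a hedged illustration of the general technique, not the pipeline from the linked post, whose 30x figure presumably involves more aggressive pruning.

```python
# Illustrative HTML-to-text reduction for a web agent: drop markup and
# the contents of non-visible tags, keep only readable text.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style", "noscript"}  # tags whose content is invisible

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside a SKIP tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

def compact(html: str) -> str:
    """Return only the visible text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = ("<html><head><style>body{margin:0}</style></head>"
        "<body><h1>Docs</h1><script>var x=1;</script>"
        "<p>Install with pip.</p></body></html>")
print(compact(page))  # → Docs Install with pip.
```

Since model cost scales with input tokens, shrinking each observed page this way cuts both token usage and time-to-first-token, which is the lever the post's numbers point at.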