Show HN: KVSplit – Run 2-3× longer contexts on Apple Silicon

I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality. I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon.

The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The configurations use the same number of bits, but K8V4 is 7× better for quality

This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with -mlong-calls flag to avoid vectorization issues)

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://ift.tt/sgzBeR7

May 17, 2025 at 01:34AM
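To make the 59% figure concrete, here is a back-of-the-envelope sizing sketch, not code from the repo. It assumes ggml's Q8_0 and Q4_0 block formats (roughly 8.5 and 4.5 bits per element once the fp16 block scales are counted) and illustrative TinyLlama-1.1B shapes (22 layers, 4 KV heads, head dim 64); those model dimensions are my assumptions for illustration, not numbers from the post.

    # Rough KV-cache sizing for mixed key/value quantization.
    # Bit costs assume ggml block formats; model shapes are illustrative.
    BITS = {"fp16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

    def kv_cache_bytes(seq_len, n_layers=22, n_kv_heads=4, head_dim=64,
                       k_fmt="fp16", v_fmt="fp16"):
        """Approximate bytes used by the KV cache at a given context length."""
        elems_per_token = n_layers * n_kv_heads * head_dim  # per K and per V
        k_bytes = seq_len * elems_per_token * BITS[k_fmt] / 8
        v_bytes = seq_len * elems_per_token * BITS[v_fmt] / 8
        return k_bytes + v_bytes

    for name, (kf, vf) in {"FP16": ("fp16", "fp16"),
                           "K8V4": ("q8_0", "q4_0"),
                           "K4V8": ("q4_0", "q8_0")}.items():
        b = kv_cache_bytes(8192, k_fmt=kf, v_fmt=vf)
        saved = 1 - b / kv_cache_bytes(8192)
        print(f"{name}: {b / 2**20:6.1f} MiB at 8K context ({saved:.0%} below FP16)")

Mixing an 8.5-bit half with a 4.5-bit half averages 6.5 bits per element versus FP16's 16, which is where the ~59% reduction comes from, and why K8V4 and K4V8 occupy identical memory despite their very different perplexity impact.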