Show HN: Real-time voice chat with AI, no transcription

Hi HN, voice chat with AI is very popular these days, especially with YC startups ( https://twitter.com/k7agar/status/1769078697661804795 ). Current approaches are all cascaded: audio -> transcription -> language model -> text-to-speech synthesis. This is easy to get started with, but it adds a lot of complexity and has a few glaring limitations. Most notably, transcription is slow and lossy, its errors propagate to the rest of the system, it cannot capture emotional affect, and it is often not robust to code-switching or accents.

Instead, what if we fed audio directly to the LLM? LLMs are really smart, so can they figure it out? This approach is faster (we skip transcription decoding) and less lossy and more robust, because the big language model should be smarter than a smaller transcription decoder.

I've trained a model in just that fashion. For more architectural information and some training details, see this first post: https://tincans.ai/slm . For details about this model and some ideas for how to prompt it, see this post: https://tincans.ai/slm3 .

We trained this on a very limited budget, but the model is able to do some things that even GPT-4, Gemini, and Claude cannot, e.g., reasoning about long-context audio directly, without transcription. We also believe that this is the first model in the world to conduct adversarial attacks and apply preference modeling in the speech domain.

The demo is unoptimized (unquantized bf16 weights, default Hugging Face inference, serverless speed bumps) but achieves 120ms time to first token with audio. You can basically think of it as Mistral 7B, so it'll be very fast and can run basically anywhere. I am especially optimistic about embedded usage: not needing the transcription step means that the resulting model is smaller and cheaper to use on the edge.

Would love to hear your thoughts and how you would use it!
Weights are Apache 2.0 licensed and on Hugging Face: https://ift.tt/A1wbIFJ... https://ift.tt/QyDmgMK

March 20, 2024 at 12:37AM
Show HN: Real-time voice chat with AI, no transcription https://ift.tt/MdiEB7l