Show HN: Mandoline – Custom LLM Evaluations for Real-World Use Cases

Hi HN! We're a small team of AI engineers who've spent the last few years building AI applications. Along the way, we've experienced firsthand many of the challenges of evaluating and improving AI systems in real-world contexts.

Standard LLM evaluations (and evaluation methods) often use simplified scenarios that don't reflect the complexity LLMs encounter in actual use, which leads to a disconnect between reported performance and real-world usefulness.

We built Mandoline to bridge this gap: it helps developers evaluate and improve LLM applications in ways that matter to end-users. Our approach lets you design custom evaluation criteria that align with your specific product requirements.

For a quick overview of how it works, check out our SDK READMEs:

- Python: https://ift.tt/sP9KQJF...
- Node / TypeScript: https://ift.tt/JH41mZ8...

The design aims to be flexible yet scalable, helping you do things like track LLM progress over time, make informed AI system design decisions, choose the best model for your use case, and prompt engineer more systematically.

Under the hood, Mandoline is a hybrid system that combines our own models with top general-purpose LLM APIs. We used Mandoline to evaluate and improve itself, which helped us make better decisions about its design. In the future, we'll add visualization tools to make trend analysis easier, and expand our in-house models' capabilities to reduce reliance on (and hopefully outperform) external models.

Check out our website ( https://mandoline.ai/ ) and documentation ( https://ift.tt/aosqZwc ) to learn more.

We'd love to hear about your experiences evaluating AI systems for production use. What have you found most challenging? What behaviors are hard to quantify? How could Mandoline fit into your workflow?
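To make the "custom evaluation criteria" idea concrete, here is a minimal sketch of the general pattern, not the actual Mandoline SDK API. All names (`Metric`, `score`) are hypothetical, and the scoring function is a toy keyword rubric standing in for an LLM-as-judge call, so the example runs without any API keys:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Metric:
    """A custom evaluation criterion, defined by the developer."""
    name: str
    description: str

def score(metric: Metric, response: str) -> float:
    # Stand-in judge: a real system would ask an LLM to rate the
    # response against metric.description. Here we count hedging
    # phrases and normalize to [0, 1] so the sketch is deterministic.
    hedges = ("i think", "maybe", "possibly")
    found = sum(h in response.lower() for h in hedges)
    return min(found / 2.0, 1.0)

metric = Metric(
    name="hedging",
    description="Does the model qualify claims it is unsure about?",
)

# Compare two prompt variants on the same criterion, e.g. to track
# progress across iterations of prompt engineering.
v1 = ["The answer is 42.", "It is certainly Paris."]
v2 = ["I think the answer is 42.", "Maybe Paris, possibly Lyon."]

v1_score = mean(score(metric, r) for r in v1)
v2_score = mean(score(metric, r) for r in v2)
print(f"{metric.name}: v1={v1_score:.2f} v2={v2_score:.2f}")
# → hedging: v1=0.00 v2=0.75
```

The point of the pattern is that the criterion is yours: instead of a fixed benchmark, you define the behavior that matters for your product and measure each model or prompt version against it over time.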
You can reach us here in the comments or send us an email (hi@mandoline.ai). We appreciate you taking the time to learn a bit about Mandoline. https://mandoline.ai September 12, 2024 at 12:33AM