Show HN: Open-source study to measure end-user satisfaction levels with LLMs

The LLM Challenge, an online study, aims to answer a simple question: what is the quality corridor that matters to end users when interacting with LLMs? At what point do users stop seeing a quality difference, and at what point do they get frustrated by poor LLM quality?

The project is Apache 2.0-licensed open source, available on GitHub: https://ift.tt/GDCmJE6 . The challenge is hosted on AWS as a single-page web app: users see greeting text, followed by a randomly selected prompt and an LLM response, which they rate on a 1-5 Likert scale (or a yes/no rating) matched to the task represented in the prompt.

The study uses pre-generated prompts across popular real-world use cases: information extraction and summarization; creative tasks like writing a blog post or story; and problem-solving tasks like identifying the central ideas of a passage, writing business emails, or brainstorming ideas to solve a problem at work or school. To generate responses of varying quality, the study uses the following OSS LLMs: Qwen2-0.5B-Instruct, Qwen2-1.5B-Instruct, gemma-2-2B-it, Qwen2-7B-Instruct, Phi-3-small-128k-instruct, Qwen2-72B, and Meta-Llama-3.1-70B. For proprietary LLMs, we limited our choices to Claude 3 Haiku, Claude 3.5 Sonnet, OpenAI GPT-3.5-Turbo, and OpenAI GPT-4o.

Today, LLM vendors are racing to one-up each other on benchmarks like MMLU, MT-Bench, and HellaSwag, which are designed and rated primarily by human experts. But as LLMs get deployed in the real world for end users and productivity workers, there hasn't been a study (as far as we know) that helps researchers and developers understand the impact of model selection as perceived by end users.
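The rating flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, placeholder prompt text, and result schema are all assumptions; only the model list comes from the study description.

```python
import random

# Model names as listed in the study; prompt/response text below is a placeholder.
MODELS = [
    "Qwen2-0.5B-Instruct", "Qwen2-1.5B-Instruct", "gemma-2-2B-it",
    "Qwen2-7B-Instruct", "Phi-3-small-128k-instruct", "Qwen2-72B",
    "Meta-Llama-3.1-70B", "Claude 3 Haiku", "Claude 3.5 Sonnet",
    "GPT-3.5-Turbo", "GPT-4o",
]

def sample_task(pairs):
    """Pick one pre-generated (model, prompt, response) record at random."""
    return random.choice(pairs)

def record_rating(task, rating, store):
    """Validate a 1-5 Likert rating and append it to the results store."""
    if rating not in range(1, 6):
        raise ValueError("Likert rating must be between 1 and 5")
    store.append({"model": task["model"], "prompt": task["prompt"], "rating": rating})

# One placeholder prompt/response pair per model, standing in for the
# study's pre-generated set.
pairs = [
    {"model": m, "prompt": "Summarize this passage...", "response": "..."}
    for m in MODELS
]

results = []
task = sample_task(pairs)      # what the web app shows a visitor
record_rating(task, 4, results)  # the visitor's 1-5 rating
```

The point of pre-generating responses is that every participant rates fixed (prompt, response) pairs, so ratings are comparable across users rather than depending on live model sampling.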
This study aims to yield insights for incorporating human-centric benchmarks into building generative AI applications and LLMs. If you want to contribute to the AI community in an open-source way, we'd love it if you took the challenge. We'll publish the study results on GitHub in 30 days. https://ift.tt/HR86ucx

August 27, 2024 at 11:18PM