Show HN: Open-source study to measure end user satisfaction levels with LLMs

The LLM Challenge, an online study, aims to answer a simple question: what is the quality corridor that matters to end users when interacting with LLMs? At what point do users stop perceiving a quality difference, and at what point do they get frustrated by poor LLM quality?

The project is Apache 2.0-licensed open source, available on GitHub: https://ift.tt/GDCmJE6 . The challenge itself is hosted on AWS as a single-page web app: users see greeting text, followed by a randomly selected prompt and an LLM response, which they rate on a 1-5 Likert scale (or a yes/no rating) matched to the task in the prompt.

The study uses pre-generated prompts across popular real-world use cases: information extraction and summarization; creative tasks like writing a blog post or story; and problem-solving tasks like extracting the central ideas from a passage, writing business emails, or brainstorming ideas to solve a problem at work or school. To generate responses of varying quality, the study uses the following OSS LLMs: Qwen2-0.5B-Instruct, Qwen2-1.5B-Instruct, gemma-2-2b-it, Qwen2-7B-Instruct, Phi-3-small-128k-instruct, Qwen2-72B, and Meta-Llama-3.1-70B. For proprietary LLMs, we limited our choices to Claude 3 Haiku, Claude 3.5 Sonnet, OpenAI GPT-3.5-Turbo, and OpenAI GPT-4o.

Today, LLM vendors are in a race to one-up each other on benchmarks like MMLU, MT-Bench, and HellaSwag, which are designed and rated primarily by human experts. But as LLMs get deployed in the real world for end users and productivity workers, there hasn't been a study (as far as we know) that helps researchers and developers understand the impact of model selection as perceived by end users.
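The rating flow described above can be sketched as follows. This is a minimal illustration, not code from the repo: all class and function names here are hypothetical, and the real app's data model will differ.

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of the challenge's rating flow:
# show a randomly selected pre-generated prompt/response pair,
# then record the end user's Likert score for it.

@dataclass
class Sample:
    prompt: str
    model: str     # e.g. "Qwen2-0.5B-Instruct" or "Claude 3.5 Sonnet"
    response: str  # pre-generated LLM output for the prompt

@dataclass
class Rating:
    sample: Sample
    score: int     # Likert 1-5 (yes/no tasks could map to 1/5)

def pick_sample(pool: list[Sample]) -> Sample:
    """Randomly select one pre-generated prompt/response pair to show."""
    return random.choice(pool)

def record_rating(sample: Sample, score: int) -> Rating:
    """Validate and store an end user's Likert rating."""
    if not 1 <= score <= 5:
        raise ValueError("Likert score must be between 1 and 5")
    return Rating(sample=sample, score=score)

pool = [
    Sample("Summarize this passage: ...", "Qwen2-0.5B-Instruct", "..."),
    Sample("Write a short blog post about ...", "Claude 3.5 Sonnet", "..."),
]
rating = record_rating(pick_sample(pool), 4)
print(rating.sample.model, rating.score)
```

Because ratings stay tied to the (hidden) model that produced each response, aggregating them per model is what would reveal the quality corridor the study is after.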
This study aims to produce insights that help incorporate human-centric benchmarks into building generative AI applications and LLMs. If you want to contribute to the AI community in an open-source way, we'd love for you to take the challenge. We'll publish the study results on GitHub in 30 days. https://ift.tt/HR86ucx

August 27, 2024 at 11:18PM