Show HN: Pi Labs – AI scoring and optimization tools for software engineers

Hey HN, after years building some of the core AI and NLU systems in Google Search, we decided to leave and build outside. Our goal was to put the advanced ML and data science techniques we’ve been using into the hands of all software engineers, so that everyone can build AI and search apps at the same level of performance and sophistication as the big labs.

This was a hard technical challenge, but we were inspired by the MVC architecture for web development. The intuition there is that when a data model changes, its view gets auto-updated. We built a similar architecture for AI. On one side is a scoring system, which encapsulates in a set of metrics what’s good about the AI application. On the other side is a set of optimizers that “compile” against this scorer: prompt optimization, data filtering, synthetic data generation, supervised learning, RL, etc. The scoring system can be calibrated using developer, user, or rater feedback, and once it’s updated, all the optimizers get recompiled against it.

The result is a setup that makes it easy to incrementally improve the quality of your AI in a tight feedback loop: you update your scorers, they auto-update your optimizers, your app gets better, you see that improvement in interpretable scores, and then you repeat, progressing from simpler to more advanced optimizers and from off-the-shelf to calibrated scorers.

We would love your feedback on this approach. https://build.withpi.ai has a set of playgrounds to help you quickly build a scorer and multiple optimizers, no sign-in required. https://code.withpi.ai has the API reference and notebook links. Finally, we have a Loom demo [1].

More technical details

Scorers: Our scoring system has three key differences from the common LLM-as-a-judge pattern.

First, rather than a single label or metric from an LLM judge, our scoring system is represented as a tunable tree of metrics, with 20+ dimensions that get combined into a final (non-linear) weighted score. The tree structure makes scores easily interpretable (just look at the breakdown by dimension), extensible (just add or remove a dimension), and adjustable (just re-tune the weights). Training the scoring system with labeled or preference data adjusts the weights, and you can automate this with user feedback signals, resulting in a tight feedback loop.

Second, our scoring system handles natural-language dimensions (great for free-form, qualitative questions requiring NLU) alongside quantitative dimensions (like computations over dates or document length, which can be provided in Python) in the same tree. When calibrating with your labeled or preference data, the scorer learns how to balance these.

Third, for natural-language scoring, we use specialized smaller encoder models rather than autoregressive models. Encoders are a natural fit for scoring: they are faster and cheaper to run, easier to fine-tune, and architecturally more suitable (bi-directional attention with a regression or classification head) than similarly sized decoder models. For example, we can score 20+ dimensions in under 100 ms, making it possible to use scoring everywhere from evaluation to agent orchestration to reward modeling.

Optimizers: We took the most salient ML techniques and reformulated them as optimizers against our scoring system: for DSPy, the scoring system acts as its validator; for GRPO, it acts as its reward model.
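To make the scorer side concrete, here is a rough Python sketch of the tree-of-metrics idea. This is purely illustrative and not the Pi API: the Dimension/total_score names, the example dimensions and weights, and the stubbed encoder call are all invented for the sketch.

    # Illustrative sketch only -- not the Pi API. Dimension names, weights, and the
    # stubbed encoder call are invented; Pi would use a fine-tuned encoder model here.
    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass
    class Dimension:
        name: str
        weight: float
        score_fn: Callable[[str, str], float]  # (input, output) -> score in [0, 1]

    def nl_dimension(question: str) -> Callable[[str, str], float]:
        """Stand-in for an encoder-model judgment of a free-form, qualitative question."""
        def score(inp: str, out: str) -> float:
            # e.g. a ModernBERT-style encoder with a regression head, scoring
            # `question` against (inp, out); stubbed with a constant here.
            return 0.5
        return score

    def length_dimension(inp: str, out: str) -> float:
        """Quantitative dimension expressed as plain Python."""
        return min(len(out.split()) / 200.0, 1.0)  # saturates at ~200 words

    TREE = [
        Dimension("answers_the_question", 0.5, nl_dimension("Does the response answer the user's question?")),
        Dimension("cites_sources",        0.3, nl_dimension("Does the response cite its sources?")),
        Dimension("appropriate_length",   0.2, length_dimension),
    ]

    def score_breakdown(inp: str, out: str) -> Dict[str, float]:
        """Per-dimension scores: this is what makes the result interpretable."""
        return {d.name: d.score_fn(inp, out) for d in TREE}

    def total_score(inp: str, out: str) -> float:
        """Non-linear weighted combination; calibration would re-tune the weights."""
        per_dim = score_breakdown(inp, out)
        weighted = sum(d.weight * per_dim[d.name] for d in TREE)
        return weighted * min(per_dim.values())  # one bad dimension drags the total down

Calibration with labeled or preference data would then adjust the weights (and the combination) rather than the tree structure itself, which stays interpretable and editable.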
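On the optimizer side, the same scorer can be dropped in wherever a technique expects a quality signal. As one example, assuming TRL's GRPOTrainer interface (reward functions receive prompts and completions and return one float per completion) and reusing the toy total_score above, the hookup might look roughly like this; the model and dataset names are placeholders, and this is a sketch rather than our integration code:

    # Sketch: using the scoring tree as a GRPO reward model via TRL.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    def scorer_reward(prompts, completions, **kwargs):
        # One reward per completion, straight from the scoring tree.
        return [total_score(p, c) for p, c in zip(prompts, completions)]

    trainer = GRPOTrainer(
        model="meta-llama/Llama-3.2-1B-Instruct",   # placeholder base model
        reward_funcs=scorer_reward,
        args=GRPOConfig(output_dir="grpo-with-scorer"),
        train_dataset=load_dataset("trl-lib/tldr", split="train"),  # placeholder dataset
    )
    trainer.train()

The DSPy hookup is analogous: the same score becomes the metric its optimizers maximize. Recalibrating the scorer and re-running these loops is what "recompiling the optimizers" means in practice.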
We’re keen to hear the community’s feedback on which techniques to add next.

Overall stack: the playgrounds run on Next.js and Vercel. On the AI side: Runpod and GCP for training GPUs, TRL for training algorithms, ModernBERT and Llama as base models, and GCP and Azure for 4o and Anthropic calls.

We’d love your feedback and perspectives: our team will be around to answer questions and discuss. If there’s a lot of interest, we’re happy to host a live session!

- Achint, co-founder of Pi Labs

[1] https://ift.tt/o1A3MzJ