EvalForge: Automating LLM Judge

This talk explores automating custom LLM evaluation criteria using EvalForge and Weave, enabling users to create and run bespoke, human-aligned assessments.

Overview

LLM evals are all the rage now, with big labs reporting HumanEval, MMLU, and other benchmarks on every release. But for your company-specific use case, those generic evals don’t mean much. Anyone who has built evals knows that custom, bespoke criteria and eval datasets are needed.
Building those evals is tricky and requires a lot of data annotation and labeling. But what if we could automate this by using LLMs to build human-aligned criteria?
With EvalForge, we tried to do just that: take live data from Weave (the Weights & Biases LLM observability framework), let users annotate it, use LLMs to help come up with criteria, and then run evaluations.
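
As a rough illustration of that annotate, derive criteria, evaluate loop, here is a minimal Python sketch. It is not EvalForge's or Weave's actual API: the Annotation class, propose_criteria, alignment, and length_judge are hypothetical names, and the LLM step is replaced by a trivial stand-in so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Annotation:
    """One human-labeled example (hypothetical shape, not a Weave type)."""
    output: str     # model output captured from a trace
    accepted: bool  # human verdict: good or bad
    note: str       # free-text reason given by the annotator


# A judge maps a model output to a pass/fail verdict.
Judge = Callable[[str], bool]


def propose_criteria(annotations: List[Annotation]) -> List[str]:
    """Stand-in for the LLM step: collect distinct notes on rejected outputs.

    A real system would ask an LLM to generalize these notes into criteria.
    """
    criteria: List[str] = []
    for a in annotations:
        if not a.accepted and a.note and a.note not in criteria:
            criteria.append(a.note)
    return criteria


def alignment(judge: Judge, annotations: List[Annotation]) -> float:
    """Fraction of examples where the judge agrees with the human label."""
    if not annotations:
        return 0.0
    hits = sum(judge(a.output) == a.accepted for a in annotations)
    return hits / len(annotations)


def length_judge(output: str) -> bool:
    """Toy judge for a 'too long' criterion: pass outputs of at most 60 characters."""
    return len(output) <= 60


annotations = [
    Annotation("Paris is the capital of France.", True, ""),
    Annotation("x" * 80, False, "too long"),
    Annotation("Short answer.", True, ""),
    Annotation("I cannot help.", False, "refuses to answer"),
]

if __name__ == "__main__":
    print(propose_criteria(annotations))
    print(alignment(length_judge, annotations))
```

Measuring agreement between a judge and the human labels is one simple way to check that an automated criterion is actually human-aligned before relying on it; here the toy judge misses the refusal case, which suggests a second criterion is needed.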

Links

https://github.com/wandb/evalForge
Programmatically generates and executes traceable LLM evaluation judges via wandb Weave.
https://wandb.me/weave-seattle
W&B Weave: LLM application framework for tracking, evaluating, and improving via SDKs.

Tech stack