Impromptu: Agent Thread Scoring
Learn how to capture, score, and analyze LLM agent interactions at the thread level, identifying failures and aggregating error patterns for faster debugging.
I’ve noticed a lot of “log and display” pipelines for LLM observability. A few of these even offer LLM-as-judge performance evals, but that doesn’t really help with agent evaluation unless there’s a parser with tests at both the individual-event level and the conversation-thread level. With that, you could quickly find and aggregate what actually went wrong with an agent, and why. So I built one! It looks pretty good: in this demo, I’ll show how it works and demonstrate finding particular human-relevant problems in agents, like wandering off track or failing to resolve the user’s questions.
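To make the event-versus-thread distinction concrete, here is a minimal Python sketch. It is not the tool from the demo: the `Event` shape, the check names (`empty_reply`, `repeated_question`), and the heuristics are all hypothetical, chosen only to show one test that needs a single message and one that needs the whole conversation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Event:
    thread_id: str
    role: str  # "user" or "assistant"
    text: str


def score_event(event: Event) -> list[str]:
    """Event-level check: looks at one message in isolation."""
    failures = []
    if event.role == "assistant" and not event.text.strip():
        failures.append("empty_reply")
    return failures


def score_thread(events: list[Event]) -> list[str]:
    """Thread-level check: needs the whole conversation."""
    failures = []
    # Hypothetical proxy for "failed to resolve the user's question":
    # the user asks the same question more than once in a thread.
    questions = Counter(
        e.text.strip().lower()
        for e in events
        if e.role == "user" and e.text.strip().endswith("?")
    )
    if any(count > 1 for count in questions.values()):
        failures.append("repeated_question")
    return failures


def aggregate(events: list[Event]) -> Counter:
    """Group events by thread, run both kinds of checks, count failures."""
    threads: dict[str, list[Event]] = {}
    for e in events:
        threads.setdefault(e.thread_id, []).append(e)

    counts: Counter = Counter()
    for thread_events in threads.values():
        for e in thread_events:
            counts.update(score_event(e))
        counts.update(score_thread(thread_events))
    return counts
```

The event-level check can flag a bad reply the moment it is logged, while the thread-level check only makes sense once the messages are grouped by thread, which is why the two are complementary rather than redundant.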
More specifically, I’m planning to show code for a chatbot and a simulated user that I’ve instrumented to send all the LLM I/O to a server and DB, plus a mechanism to declare the start of a chatbot “thread”. Then I’ll show some server code that does scoring, and how scoring at the thread level is different from and complementary to scoring at the event level. I’m hoping to have some plots that show the results and how agent behaviors can be aggregated and identified quickly this way.
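The capture side described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the demo's code: `ThreadLogger`, `start_thread`, `wrap`, and `events_for` are hypothetical names, and a local SQLite database stands in for the server and DB.

```python
import sqlite3
import uuid


class ThreadLogger:
    """Hypothetical logger: records every LLM call under a thread id."""

    def __init__(self, db_path: str = ":memory:"):
        # Assumption: a local SQLite database stands in for the server and DB.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "thread_id TEXT NOT NULL, "
            "prompt TEXT NOT NULL, "
            "response TEXT NOT NULL)"
        )

    def start_thread(self) -> str:
        """Declare the start of a new thread and return its id."""
        return uuid.uuid4().hex

    def wrap(self, llm_call, thread_id: str):
        """Return a version of llm_call that logs its input and output."""

        def logged(prompt: str) -> str:
            response = llm_call(prompt)
            with self.conn:  # commits the insert on success
                self.conn.execute(
                    "INSERT INTO events (thread_id, prompt, response) "
                    "VALUES (?, ?, ?)",
                    (thread_id, prompt, response),
                )
            return response

        return logged

    def events_for(self, thread_id: str) -> list[tuple[str, str]]:
        """Fetch one thread's (prompt, response) pairs in call order."""
        rows = self.conn.execute(
            "SELECT prompt, response FROM events "
            "WHERE thread_id = ? ORDER BY id",
            (thread_id,),
        )
        return rows.fetchall()
```

Any callable that takes a prompt string and returns a response string can be wrapped, for example `chat = logger.wrap(my_llm, logger.start_thread())`, where `my_llm` is whatever client function the chatbot already uses. Tagging each row with the thread id at write time is what lets a scorer later pull a whole conversation back out in order.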
Impromptu: Python toolkit for LLM prompt observability, evaluation, and optimization.