Agent Eval and Scores at the Thread Level

Learn how to capture, score, and analyze LLM agent interactions at the thread level, identifying failures and aggregating error patterns for faster debugging.

GitHub Gist, Impromptu, Database, Server, A/B testing

Overview

I’ve noticed a lot of “log and display” pipelines for LLM observability. A few of these even offer LLM-as-judge performance eval, but that doesn’t really help with agent eval unless there’s some sort of parser with tests at both the individual-event level and the conversation-thread level. If you had that, you could quickly find and aggregate what actually went wrong with an agent, and why it happened. So I built one! And it looks pretty good. In this demo, I’ll show how it works and demonstrate finding particular human-relevant problems in agents, like wandering off track or failing to resolve the user’s questions.

More specifically, I’m planning to show code for a chatbot and a simulated user that I’ve instrumented to send all the LLM I/O to a server and DB, plus a mechanism to declare the start of a chatbot “thread”. Then I’ll show some server code that does scoring, and explain how scoring at the thread level differs from and complements scoring at the event level. I’m hoping to have some plots that show the results and how agent behaviors can be aggregated and identified quickly this way.
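To make the event-level versus thread-level distinction concrete, here is a minimal sketch in plain Python. It is not the code from the talk or the Impromptu API: the `Event` record, the failure labels, and the heuristics (an empty reply, a user repeating the same question) are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    """One logged message. Field names are hypothetical."""
    thread_id: str
    role: str  # "user" or "assistant"
    text: str


def score_event(event: Event) -> list[str]:
    """Event-level checks: judge a single message in isolation."""
    failures = []
    if event.role == "assistant" and not event.text.strip():
        failures.append("empty_reply")
    return failures


def score_thread(events: list[Event]) -> list[str]:
    """Thread-level checks: these need the whole conversation."""
    failures = []
    user_texts = [e.text.strip().lower() for e in events if e.role == "user"]
    # Assumed heuristic: a repeated user message hints the question went unresolved.
    if len(user_texts) != len(set(user_texts)):
        failures.append("user_repeated_question")
    # A thread whose last message is from the user never got an answer.
    if events and events[-1].role == "user":
        failures.append("ended_without_reply")
    return failures


def aggregate(events: list[Event]) -> Counter:
    """Group events by thread, run both scorers, and count failure labels."""
    threads = defaultdict(list)
    for e in events:
        threads[e.thread_id].append(e)
    counts = Counter()
    for thread_events in threads.values():
        for e in thread_events:
            counts.update(score_event(e))
        counts.update(score_thread(thread_events))
    return counts


log = [
    Event("t1", "user", "How do I reset my password?"),
    Event("t1", "assistant", "Our company was founded in 1999."),
    Event("t1", "user", "How do I reset my password?"),
    Event("t1", "assistant", ""),
    Event("t2", "user", "What are your hours?"),
    Event("t2", "assistant", "We are open 9 to 5 on weekdays."),
]
failure_counts = aggregate(log)
print(failure_counts)
```

Note that the off-topic reply in thread `t1` passes the event-level check on its own; only the thread-level check, which sees the user asking again, flags that something went wrong.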

Links

https://gist.github.com/seanmvl/d9a62dab4cd5cef41d8f6aadf9bcf343
Impromptu: Python toolkit for LLM prompt observability, evaluation, and optimization.

Tech stack

GitHub Gist

GitHub Gist is a lightweight, pastebin-style service for instantly sharing code snippets, notes, and configuration files.

GitHub Gist functions as a mini-repository for single files or small collections of code, offering a simple alternative to a full GitHub project. It supports multiple file types, from bash scripts to Markdown notes. Each gist is backed by a Git repository, so it keeps a revision history and can be cloned, and GitHub adds forking and commenting on top. Developers use it to quickly share code examples, embed snippets directly into blogs or tutorials, and store personal dotfiles. You can choose between a public gist (searchable and visible to all) and a secret gist (hidden from search but viewable by anyone with the URL, so not truly private).

https://gist.github.com/

Impromptu

Impromptu is a Python toolkit for LLM prompt observability, evaluation, and optimization.

In this talk, Impromptu refers to that Python toolkit (see Links above). It should not be confused with Cognos Impromptu, IBM's legacy client-based business intelligence tool, which shares the name but is unrelated.

Database

Databases are the structured, electronic backbone for most applications, managing data storage, retrieval, and updates through SQL or, for NoSQL stores, through their own query interfaces.

A database is an organized, electronic collection of structured or unstructured data, managed by a Database Management System (DBMS) to ensure integrity and efficient access. These systems are foundational: they handle all CRUD operations (Create, Read, Update, Delete) for applications ranging from financial ledgers to social media feeds. Key types include Relational Databases (RDBMS), which use structured tables and SQL (e.g., PostgreSQL, Oracle), and NoSQL databases, which offer flexible schemas for massive scale and speed (e.g., MongoDB, Cassandra). Choosing the right architecture (relational for transactional integrity or NoSQL for high availability) is the critical first step in any data-driven project.
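As a small illustration of the relational option applied to this talk's setting, the sketch below stores scored LLM events in an in-memory SQLite database and counts how many threads hit each failure. The table layout, column names, and failure labels are assumptions made for the example, not the schema used in the demo.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        id        INTEGER PRIMARY KEY,
        thread_id TEXT NOT NULL,
        role      TEXT NOT NULL,
        text      TEXT NOT NULL,
        failure   TEXT            -- NULL when the event passed its checks
    )
    """
)

# Hypothetical scored events; each row carries the thread it belongs to.
rows = [
    ("t1", "assistant", "Our company was founded in 1999.", "off_topic"),
    ("t1", "assistant", "", "empty_reply"),
    ("t2", "assistant", "We are open 9 to 5.", None),
    ("t3", "assistant", "", "empty_reply"),
]
conn.executemany(
    "INSERT INTO events (thread_id, role, text, failure) VALUES (?, ?, ?, ?)",
    rows,
)

# Aggregate: for each failure label, how many distinct threads showed it?
failure_summary = conn.execute(
    """
    SELECT failure, COUNT(DISTINCT thread_id) AS threads
    FROM events
    WHERE failure IS NOT NULL
    GROUP BY failure
    ORDER BY threads DESC, failure
    """
).fetchall()
print(failure_summary)
```

Keeping `thread_id` on every event row is what lets a single `GROUP BY` roll event-level results up to the thread level.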

https://www.tadabase.io/what-is-a-database-complete-overview-examples

Server

A server is a dedicated, high-availability machine (e.g., Dell PowerEdge R760) that provides centralized resources such as compute, storage (e.g., a 50TB RAID array), and network services (e.g., 25GbE) to client systems on demand.

A server is the workhorse of most IT infrastructure, purpose-built for continuous operation (24/7 uptime). Unlike a desktop PC, it typically features redundant components (PSUs, fans) and a high-density form factor (1U or 2U rack-mount). Such machines usually pair multi-core CPUs (e.g., Intel Xeon Scalable processors) with large amounts of ECC memory (often 512GB or more) to handle many concurrent requests. Common roles include hosting web servers (like Apache or Nginx), running virtual machines (VMs), and managing enterprise databases (PostgreSQL, MS SQL Server). Because it is often the single source of truth for data and services, it requires robust security and careful management.

https://www.serverwatch.com/guides/what-is-a-server/

A/B testing

A/B testing is a randomized controlled experiment: it compares two variants (A: Control, B: Variation) to determine which one produces a statistically significant lift in a key metric.

This methodology (also called split testing) is a standard way to compare two versions of a digital asset: a webpage, an email subject line, or a mobile app feature. The process randomly splits user traffic (e.g., 50/50) between the Control and the Variation and measures the impact on a specific business goal. By testing elements like a new call-to-action button color or a revised headline, teams move optimization from 'we think' to 'we have evidence.' This data-backed approach helps ensure that changes ship only when the results show an improvement in conversion rate, click-through rate, or revenue per visitor.
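The "statistically significant" part can be sketched with a two-proportion z-test, which is one common choice for comparing conversion rates and not something this page specifies. The function name and the example counts below are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a is the Control rate, conv_b / n_b is the Variation rate.
    Uses the pooled standard error under the null hypothesis of equal rates.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0% or all 100%")
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 100 of 1000 users convert on A, 130 of 1000 on B.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value (conventionally below 0.05) suggests the observed lift is unlikely to be chance alone, though sample size and the number of metrics checked should be fixed before the test starts.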

https://www.optimizely.com/optimization-glossary/ab-testing/
