Agent testing with transactions

Practical methods for testing transaction‑focused AI agents, comparing Pytest, Selenium, CI pipelines, and end‑to‑end versus LangGraph node evaluations, including stability metrics and iterative baseline development.

Overview

As agent logic and graphs become more complex, a stable baseline becomes critical for iteratively adding capabilities and improving agent quality, especially for teams like ours at Otto that build AI agents to carry out transactions.

We take an approach similar to the one proposed in the τ-bench paper (https://arxiv.org/pdf/2406.12045): probabilistic rather than deterministic. However, we focus on testing our own scenarios rather than benchmarking general capability.
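To make "probabilistic rather than deterministic" concrete, here is a minimal sketch of how a single scenario could be scored across repeated runs. It assumes the pass^k idea as we understand it from the τ-bench paper (the chance that k independent trials of the same task all succeed, estimated from c successes in n trials as C(c, k) / C(n, k)). The function names and the sample outcomes are hypothetical and are not taken from Otto's test suite.

```python
from math import comb
from typing import Sequence


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k independent trials of one scenario all pass.

    Uses the combinatorial estimate C(successes, k) / C(trials, k).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    return comb(successes, k) / comb(trials, k)


def stability_report(outcomes: Sequence[bool], ks: Sequence[int]) -> dict[int, float]:
    """Summarize one scenario's repeated runs as pass^k for each requested k."""
    trials = len(outcomes)
    successes = sum(outcomes)
    return {k: pass_hat_k(successes, trials, k) for k in ks}


# Hypothetical outcomes of running the same booking scenario 10 times.
sample_outcomes = [True, True, False, True, True, True, False, True, True, True]
report = stability_report(sample_outcomes, ks=[1, 2, 4])
```

With k = 1 this reduces to the plain pass rate; larger k penalizes flaky scenarios more heavily, which is one way to put a number on stability when tracking a baseline over time.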

We would like to share some early learnings and thoughts on stability, setup, and the point of testing, after trying a handful of options (Pytest vs. Puppeteer/Selenium; GitHub Actions vs. CircleCI vs. Azure DevOps; end-to-end testing vs. LangGraph node testing).

Links

https://www.ottotheagent.com/
Otto: AI business travel assistant, proactively plans, books, and learns preferences.
https://arxiv.org/pdf/2406.12045

Tech stack