Testing Transactional AI Agents
Practical methods for testing transaction‑focused AI agents, comparing Pytest, Selenium, CI pipelines, and end‑to‑end versus LangGraph node evaluations, including stability metrics and iterative baseline development.
As agent logic and graphs grow more complicated, establishing a stable baseline from which to iteratively add capabilities and improve agent quality becomes critical for teams like ours at Otto that build AI agents to carry out transactions.
We take an approach similar to the one proposed in the τ-bench paper (https://arxiv.org/pdf/2406.12045), which is probabilistic rather than deterministic, but we focus on testing our own scenarios rather than benchmarking general capability.
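To make the probabilistic framing concrete, here is a minimal sketch of a reliability score in the spirit of the pass^k metric described in the τ-bench paper: run the same scenario several times and estimate the chance that k independent runs all succeed. The function names, the seeded stub scenario, and the 0.9 success rate are illustrative assumptions, not Otto's actual harness.

```python
import random
from math import comb
from typing import Callable


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k independent runs of a scenario all succeed.

    Uses the combinatorial estimator C(successes, k) / C(trials, k).
    """
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return comb(successes, k) / comb(trials, k)


def count_successes(scenario: Callable[[random.Random], bool],
                    trials: int, seed: int = 0) -> int:
    """Run a scenario `trials` times and count how many runs succeed."""
    rng = random.Random(seed)
    return sum(1 for _ in range(trials) if scenario(rng))


def flaky_booking_scenario(rng: random.Random) -> bool:
    # Hypothetical stand-in for a real agent run: a real scenario would
    # drive the agent and check the final transaction state instead.
    return rng.random() < 0.9


if __name__ == "__main__":
    n = 20
    c = count_successes(flaky_booking_scenario, n)
    for k in (1, 2, 4):
        print(f"pass^{k} estimate: {pass_hat_k(c, n, k):.3f}")
```

With k = 1 this reduces to the plain success rate; larger k penalizes scenarios that only succeed some of the time, which is what makes it usable as a stability baseline.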
We would like to share some early learnings and thoughts on stability, setup, and where to test, after trying a handful of options (Pytest vs. Puppeteer/Selenium, GitHub Actions vs. CircleCI vs. Azure DevOps, end-to-end testing vs. LangGraph node testing).
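The difference between node testing and end-to-end testing can be sketched with Pytest-style tests. This example deliberately does not use the LangGraph library: it only assumes that a node is a function that takes the current state and returns a partial update. The node names, the state fields, and the tiny sequential runner are all hypothetical.

```python
from typing import TypedDict


class TripState(TypedDict, total=False):
    request: str
    destination: str
    booked: bool


def extract_destination(state: TripState) -> TripState:
    # Hypothetical node: a real one would call an LLM. Here the last
    # word of the request is treated as the destination.
    words = state["request"].split()
    return {"destination": words[-1].strip(".")}


def book_trip(state: TripState) -> TripState:
    # Hypothetical node: "books" only when a destination is known.
    return {"booked": bool(state.get("destination"))}


def run_graph(state: TripState) -> TripState:
    # Stand-in for a compiled graph: apply each node in order and
    # merge its partial update into the state.
    for node in (extract_destination, book_trip):
        state = {**state, **node(state)}
    return state


def test_extract_destination_node():
    # Node-level test: one node, one input state, one expected update.
    update = extract_destination({"request": "Book a flight to Denver."})
    assert update == {"destination": "Denver"}


def test_booking_end_to_end():
    # End-to-end test: only the final state of the whole run is checked.
    final = run_graph({"request": "Book a flight to Denver."})
    assert final["destination"] == "Denver"
    assert final["booked"] is True
```

A node-level failure points at one step, while an end-to-end failure only says the transaction did not complete, so the two kinds of tests answer different questions about where instability comes from.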
Otto: an AI business travel assistant that proactively plans, books, and learns preferences.