In this episode, Dr. Sebastian (Seb) Benthall joins us to discuss research on validity and agentic AI that actually works in the real world.
The paper, “Validity Is What You Need,” co‑authored with Dr. Andrew Clark, connects systems engineering, mechanism design, and requirements engineering to multi‑step AI that delivers enterprise impact and measurable results.
Agentic AI promises to plan, browse, and act across tools, yet many deployments feel like flying cars: dazzling in demos, brittle in the wild. The core problem is not horsepower but purpose. When a system does many things poorly, the missing link is often validity: was the system designed to reliably accomplish the user’s intended goal under the constraints and feedback of its environment?
That question pulls AI out of abstraction and into socio‑technical reality. Stakeholders multiply: users, developers, tool providers, and compliance teams. So do misalignments. Instead of assuming a single foundation model can absorb this complexity, we need a practice that begins with explicit goals and engineered guarantees.
Consider how the industry reframed “agents” once large language models hit limits on complex tasks. To compensate, teams layered multi‑step prompting, external tools, and memory. That shift revealed what researchers have long known: intelligence is not only prediction; it is control in context. An LLM optimizing next‑token likelihood cannot substitute for utility design, error budgets, and observability across steps. Without guardrails, compounding error turns a simple objective, such as running a vending machine at break‑even, into a costly fiasco. The fix is not more parameters; it is decomposing goals, specifying constraints, and validating each module against measurable requirements before orchestration.
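To see why compounding error dominates, a back‑of‑the‑envelope sketch helps. The numbers and the independence assumption below are ours, not the paper’s: if each step in a chain succeeds with probability p, an n‑step run succeeds with probability roughly p^n, so orchestration punishes even small per‑step weaknesses.

```python
# Back-of-the-envelope sketch (illustrative assumptions, not from the paper):
# independent steps, each succeeding with probability p_step, where any
# single failure sinks the whole run.

def end_to_end_success(p_step: float, n_steps: int) -> float:
    """Probability that an n-step agent run succeeds end to end."""
    return p_step ** n_steps

# A step that looks reliable in isolation...
print(end_to_end_success(0.95, 1))    # 0.95
# ...degrades fast once orchestration chains many such steps together.
print(end_to_end_success(0.95, 20))   # ~0.36
print(end_to_end_success(0.95, 50))   # ~0.08

# Validating each module pays off exponentially: pushing per-step
# reliability to 0.995 restores most of the end-to-end success rate.
print(end_to_end_success(0.995, 50))  # ~0.78
```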
This is where systems engineering meets AI. Aerospace and safety‑critical fields already offer patterns: define interfaces, isolate failure modes, stress test components, and verify end‑to‑end behavior against mission goals.
Translate that to agentic AI: write formal task specs, bound tool permissions, instrument every action with pre‑ and post‑conditions, and maintain regression suites that survive model swaps. Treat LLMs as components, not oracles. Use deterministic tools—linear programming, retrieval, rule checks—whenever they outperform free‑form generation. Keep natural language at the interface while letting specialized solvers carry the load. The result is a hybrid neurosymbolic stack that privileges correctness over vibes.
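To make “treat LLMs as components, not oracles” concrete, here is a minimal sketch of a tool wrapper with bounded permissions and pre‑/post‑condition checks. The ToolSpec structure and the restock example are hypothetical illustrations, not an API from the paper; the point is only that every action is instrumented and that a deterministic rule check, not free‑form generation, decides whether the call goes through.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToolSpec:
    """A tool the agent may call, with explicit permissions and contracts.
    Hypothetical structure, for illustration only."""
    name: str
    allowed: bool                                # bounded tool permission
    precondition: Callable[[dict], bool]         # must hold before the call
    postcondition: Callable[[dict, Any], bool]   # must hold after the call
    run: Callable[[dict], Any]

def call_tool(tool: ToolSpec, request: dict) -> Any:
    """Instrument every action: refuse unauthorized or invalid calls and
    fail loudly if the result violates the tool's contract."""
    if not tool.allowed:
        raise PermissionError(f"{tool.name} is not permitted for this agent")
    if not tool.precondition(request):
        raise ValueError(f"{tool.name}: precondition failed for {request!r}")
    result = tool.run(request)
    if not tool.postcondition(request, result):
        raise RuntimeError(f"{tool.name}: postcondition failed")
    return result

# A deterministic rule check carries the load; an LLM would only draft the
# request at the natural-language interface.
restock = ToolSpec(
    name="restock_order",
    allowed=True,
    precondition=lambda r: r["quantity"] > 0
        and r["budget"] >= r["quantity"] * r["unit_cost"],
    postcondition=lambda r, out: out["ordered"] == r["quantity"],
    run=lambda r: {"ordered": r["quantity"]},
)

print(call_tool(restock, {"quantity": 10, "unit_cost": 2.0, "budget": 50.0}))
```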
Governance then becomes practical alignment, not paperwork. Think principal–agent: who sets objectives, who pays the costs, who bears the risk? In an enterprise, preferences collide. Mechanism design offers a lens to aggregate incentives and reveal trade‑offs, but it demands explicit modeling: service‑level objectives, escalation paths, audit trails, and kill switches.
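One lightweight way to make those governance objects explicit is a machine‑checkable policy the orchestrator consults on every run. The field names and thresholds below are hypothetical, offered only as a sketch of what service‑level objectives, budgets, escalation paths, and a kill switch might look like in code.

```python
from dataclasses import dataclass

@dataclass
class GovernancePolicy:
    """Hypothetical principal-agent contract; names and numbers are illustrative."""
    # Service-level objectives the agent is judged against
    slo_success_rate: float = 0.99        # fraction of tasks completed correctly
    slo_p95_latency_s: float = 5.0        # p95 latency per task, in seconds
    # Who pays the costs and bears the risk
    budget_usd_per_day: float = 100.0
    escalation_contact: str = "ops-oncall@example.com"
    # Controls
    kill_switch_enabled: bool = True
    audit_log_path: str = "/var/log/agent/audit.jsonl"

def should_halt(policy: GovernancePolicy,
                observed_success: float, spend_usd: float) -> bool:
    """Trip the kill switch when observed behavior violates the contract."""
    over_budget = spend_usd > policy.budget_usd_per_day
    below_slo = observed_success < policy.slo_success_rate
    return policy.kill_switch_enabled and (over_budget or below_slo)

policy = GovernancePolicy()
print(should_halt(policy, observed_success=0.97, spend_usd=80.0))   # True: SLO breached
print(should_halt(policy, observed_success=0.995, spend_usd=80.0))  # False
```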
Requirements engineering sits at the heart of that governance. Capture what success means now and later; encode constraints like privacy, budget, latency, and safety; and plan for drift as models and data shift. Every change to a model or tool should trigger targeted tests and documented impacts so stakeholders regain trust over time.
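A requirements list like that can live next to the code as a regression gate. The sketch below assumes invented metric names and thresholds; the idea is simply that every model or tool swap re‑runs the same checks and blocks rollout when a constraint is violated.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Requirement:
    """One testable requirement; metric names and thresholds are invented."""
    name: str
    metric: str        # what gets measured in evaluation
    max_value: float   # the measured value must stay at or below this

REQUIREMENTS = [
    Requirement("privacy", "pii_leak_rate", 0.0),
    Requirement("budget", "cost_per_task_usd", 0.10),
    Requirement("latency", "p95_latency_s", 5.0),
    Requirement("safety", "unsafe_action_rate", 0.0),
]

def evaluate(measured: dict) -> dict:
    """Run after every model or tool change; any failure blocks rollout."""
    return {req.name: measured[req.metric] <= req.max_value for req in REQUIREMENTS}

# Example: a model swap that quietly raised latency fails the gate.
print(evaluate({
    "pii_leak_rate": 0.0,
    "cost_per_task_usd": 0.08,
    "p95_latency_s": 7.2,
    "unsafe_action_rate": 0.0,
}))
# -> {'privacy': True, 'budget': True, 'latency': False, 'safety': True}
```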
The research frontier is clear: better methods to elicit, formalize, and evolve requirements in multi‑stakeholder settings; compositional validation that scales with task complexity; and orchestration frameworks that minimize compounding errors. Progress will not come from a universal co‑pilot but from fit‑for‑purpose, multi‑step systems that combine learning with control. Smaller, specialized agents can deliver durable value when they are scoped well, instrumented deeply, and judged by outcomes users actually need. Validity is not a buzzword; it is the contract between ambition and reality. Start there, and agentic AI stops being a spectacle and becomes dependable work.