Building an AI Agent That Learns From Its Mistakes
A production SQL agent on Databricks Mosaic AI Framework with composable middleware, human-in-the-loop clarification, and a self-improving feedback loop
Insurance claims data is messy. Column names have dots in them (incident.severity, claim_amount.total). Dates arrive in two different formats depending on the table. In this particular dataset, one table holds 1.2 million GPS rows that can melt your warehouse if you forget a filter. And the people who need answers from this data (claims adjusters, underwriters, operations managers) shouldn't need to know any of that.
I built a Multi-Agent Framework to bridge that gap: a natural-language SQL agent deployed on Databricks that lets business users ask questions like they would a data analyst, and get back formatted, cited results. But the interesting part isn't the text-to-SQL. It's the architecture underneath: context engineering rooted in progressive disclosure that loads relevant data only when necessary, a human-in-the-loop flow that pauses the computation graph, and a feedback loop that continuously improves the agent's responses over time.
This post walks through the design decisions behind each layer, what I learned shipping it to production, and why I think the architecture of skills is the most underrated aspect of building robust agent systems.
Architecture at a Glance
The system is built on four modular layers designed for scalability and precision:
Chat Interface → A React/TypeScript frontend featuring streaming responses and interactive clarification cards. It includes a three-state feedback widget that serves as the primary data source for the agent's continuous learning.
Model Serving Container → A LangGraph agent wrapped in MLflow's Agent protocol and hosted on Databricks Model Serving. Rather than a static prompt, it utilizes context engineering to implement progressive disclosure, ensuring the model only receives the specific metadata and schema context required for the current task.
Memory Systems → Four distinct stores provide deep context: a semantic cache (RAG), episodic memory (PostgresStore) for historical SQL patterns, user preferences, and procedural memory via optimized prompts refined by user feedback.
Data Layer → High-performance Delta tables in Unity Catalog (covering policies, claims, telematics, and accident metadata). These are accessed via a SQL Warehouse with automated health checks and token refreshes to ensure production-grade reliability.
The Architecture of Skills: Engineering Context, Not Just Prompts
Most agent tutorials rely on a monolithic “God Prompt.” I wanted a system I could reason about in isolation, where I could add, remove, or reorder capabilities without the entire house of cards collapsing.
Instead of one giant instruction set, the agent uses Progressive Disclosure. It learns only what it needs to know, exactly when it needs to know it, through four strategic context layers:
The Strategic Interrupt (Human-in-the-Loop)
A call to propose_filter_to_user pauses the entire computation graph and creates a checkpoint. This ensures that "warehouse-melting" queries never execute without a human sign-off, while routine tasks like schema inspection or memory retrieval are auto-approved for speed.
Skill Activation & Progressive Disclosure
Domain Knowledge: The specific schema and business rules for the task at hand.
Episodic Memory: Similar past SQL patterns retrieved via semantic search.
User DNA: Stored preferences and custom aliases.
The Feedback Loop: Lessons learned from past feedback on similar queries.
The result: Zero changes to the core logic, but a significantly more “intelligent” agent for that specific turn.
Context Hygiene (Tool Pruning)
Tools that aren't relevant to the current question are pruned from the context before the LLM call, so the model isn't reasoning over capabilities it will never use.
State Compression (Summarization)
Long conversation histories are summarized so the context stays within budget as a session grows.
The sequence is the secret sauce. Interrupts run first to save costs and compute. Skill Activation runs second to provide the “brain.” Hygiene and Compression run last to keep the engine lean.
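To make that ordering concrete, here's a minimal sketch of the composition pattern. The class names, the AgentContext fields, and the build_turn_context hook are illustrative stand-ins, not the project's actual middleware API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    """Mutable per-turn context passed through each layer (illustrative fields)."""
    question: str
    messages: list = field(default_factory=list)
    tools: list = field(default_factory=list)
    extra_context: list = field(default_factory=list)

class Middleware:
    """Base layer: receives the context, returns a (possibly modified) context."""
    def process(self, ctx: AgentContext) -> AgentContext:
        return ctx

class InterruptGate(Middleware):
    """1. Pause the graph for human sign-off on risky tool calls; auto-approve the rest."""

class SkillActivation(Middleware):
    """2. Inject domain knowledge, episodic memory, user DNA, and feedback lessons."""

class ToolPruning(Middleware):
    """3. Drop tools that are irrelevant to this turn."""

class StateCompression(Middleware):
    """4. Summarize long histories to stay inside the token budget."""

# Order matters: gate first (save compute), skills second (add the brain),
# hygiene and compression last (keep the engine lean).
PIPELINE = [InterruptGate(), SkillActivation(), ToolPruning(), StateCompression()]

def build_turn_context(question: str) -> AgentContext:
    ctx = AgentContext(question=question)
    for layer in PIPELINE:
        ctx = layer.process(ctx)
    return ctx
```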
What I like about this pattern: By decoupling these “skills” from the underlying infrastructure, I was able to add feedback-aware retrieval by simply updating the Skill Layer's retrieval flow, leaving the rest of the system untouched. It's the difference between building a hardcoded script and building a composable reasoning engine.
Human-in-the-Loop: When the Agent Should Ask, Not Guess
Here's a scenario: a user asks “What were the policies issued?” That's ambiguous → the database has policies from 2016 to 2023. The naive approach is to query all years and dump 227,000 rows. The slightly better approach is to pick the most recent year and hope. The right approach is to ask.
The HITL flow works in two phases:
Discovery → The agent first runs a lightweight SQL query to find available options: SELECT DISTINCT CAST(YEAR(...) AS STRING) FROM policies. This executes automatically (auto-approved by the middleware) and returns the available years.
Proposal → The agent then calls propose_filter_to_user with its recommendation (the most recent year) and all available options. The Human In The Loop function catches this call, interrupts the graph, and checkpoints the entire execution state to Postgres.
On the frontend, the user sees an interactive card with clickable pills for each year, a highlighted recommendation, and approve/reject buttons. They tap “2021” and the graph resumes from the checkpoint with the user's selection → no re-execution of the discovery query, no lost context.
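A rough sketch of what the proposal tool can look like with LangGraph's interrupt / Command(resume=...) primitives. The payload fields, the tool signature, and resume_with_selection are illustrative, not the exact production code:

```python
from langchain_core.tools import tool
from langgraph.types import Command, interrupt

@tool
def propose_filter_to_user(column: str, recommended: str, options: list[str]) -> str:
    """Propose a filter value and pause the graph until the user picks one."""
    # interrupt() checkpoints the graph and surfaces this payload to the client,
    # which renders it as a card with clickable pills and a highlighted recommendation.
    selection = interrupt({
        "type": "filter_proposal",
        "column": column,
        "recommended": recommended,
        "options": options,
    })
    # Execution resumes here with the user's choice; the discovery query is never re-run.
    return f"User selected {column} = {selection}"

def resume_with_selection(agent, thread_id: str, selection: str):
    """Resume the checkpointed graph with the user's choice instead of a new message."""
    config = {"configurable": {"thread_id": thread_id}}
    return agent.invoke(Command(resume=selection), config=config)
```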
The predict_stream method checks the checkpoint state on every request. If it finds a pending interrupt, it builds a Command(resume=...) instead of a new HumanMessage. This means the HITL flow works identically in the custom React frontend and in Databricks' Review App → no special backend flags or custom input schemas needed.
Self-Healing SQL Execution
Text-to-SQL in production fails more often than demos suggest. Column names are wrong. Date formats don't match. The warehouse went to sleep. The OAuth token expired at 3am.
The validate_and_execute_sql tool handles all of these with a layered retry strategy:
SQL errors (wrong column, syntax issues) → The tool sends the failed query and error message to the LLM with a specialized fix prompt. The LLM returns a corrected query. Up to 3 attempts. This catches the majority of first-try failures, especially around Databricks-specific syntax like backtick-quoted dot-notation columns.
Authentication errors (expired tokens, 401s) → The tool detects auth failures and triggers a connection refresh: re-authenticate via WorkspaceClient, rebuild the SQLAlchemy connection pool, and retry. The agent also proactively checks token age before each request → if the connection is older than 45 minutes, it refreshes preemptively.
Warehouse errors (stopped warehouse, connection timeouts) → The tool triggers _ensure_warehouse_running(), which checks the warehouse state via the Databricks SDK, sends a start command if needed, and polls with exponential backoff (10s, 20s, 30s). Critically, this is fail-safe: it never raises an exception, never blocks indefinitely.
The result: in production, the agent recovers from most transient failures without the user ever seeing an error message.
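Compressed into a sketch, the retry strategy looks roughly like this. The exception types and helper functions below are placeholders that mirror the behavior described above, not the real implementation:

```python
# Hypothetical exception types and helpers standing in for the real ones.
class AuthError(Exception): ...
class WarehouseError(Exception): ...
class SqlError(Exception): ...

def _run(query: str) -> list[dict]: ...              # execute against the SQL Warehouse
def _refresh_connection() -> None: ...                # re-auth, rebuild the SQLAlchemy pool
def _ensure_warehouse_running() -> None: ...          # start the warehouse, poll with backoff
def _llm_fix_sql(query: str, error: str) -> str: ...  # ask the LLM for a corrected query

MAX_SQL_FIX_ATTEMPTS = 3

def validate_and_execute_sql(query: str) -> list[dict]:
    """Execute SQL with layered recovery: LLM fixes, auth refresh, warehouse wake-up."""
    _ensure_warehouse_running()              # fail-safe: never raises, never blocks forever
    for _ in range(MAX_SQL_FIX_ATTEMPTS):
        try:
            return _run(query)
        except AuthError:
            _refresh_connection()            # expired token / 401: re-authenticate, retry
        except WarehouseError:
            _ensure_warehouse_running()      # stopped warehouse: wake it up, retry
        except SqlError as err:
            query = _llm_fix_sql(query, error=str(err))  # wrong column / syntax: let the LLM repair it
    raise RuntimeError("SQL execution failed after retries")
```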
The Feedback Loop That Rewrites Its Own Prompts
This is the part that makes the agent genuinely learn from its mistakes.
The system prompt is decomposed into three named sections: SQL_RULES, PRESENTATION_RULES, and CLARIFICATION_RULES. Each has a default value hardcoded in the deployment, but at inference time, build_system_prompt() checks PostgresStore for optimized overrides and uses those instead.
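A simplified sketch of that lookup, assuming a LangGraph PostgresStore; the namespace layout, key structure, and default text are illustrative:

```python
from langgraph.store.postgres import PostgresStore

DEFAULTS = {
    "SQL_RULES": "...hardcoded defaults shipped with the deployment...",
    "PRESENTATION_RULES": "...",
    "CLARIFICATION_RULES": "...",
}

def build_system_prompt(store: PostgresStore) -> str:
    """Assemble the system prompt, preferring feedback-optimized overrides when present."""
    sections = []
    for name, default in DEFAULTS.items():
        # Illustrative namespace: ("prompts", "optimized"), keyed by section name.
        item = store.get(("prompts", "optimized"), name)
        text = item.value["text"] if item else default
        sections.append(f"## {name}\n{text}")
    return "\n\n".join(sections)
```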
The optimization pipeline works like this:
Collect → Feedback from the three-state widget is gathered for each answered query.
Optimize → An optimizer reviews the negative feedback, identifies recurring failure patterns, and drafts improved versions of the affected prompt sections.
Review → A human approves or adjusts the drafts.
Apply → Approved sections are stored in PostgresStore as overrides and picked up by build_system_prompt() on the next request.
This is deliberately not fully automated. The human review step exists because prompt changes have system-wide impact → a well-intentioned optimization on SQL rules could break clarification behavior. But the heavy lifting (identifying patterns in failures, drafting improved instructions) is done by the optimizer, and the human just approves or adjusts.
Skills: Swappable Domain Knowledge
The agent's domain expertise lives in a skills/ directory, not in the code. The active skill (smart-claims) is a structured package:
```
skills/smart-claims/
├── SKILL.md                    # Rules, join logic, CTE patterns
└── references/
    ├── schema.md               # Column definitions and types
    ├── data-dictionary.md      # Business rules, dimension values
    ├── hierarchy-navigation.md # Drill-down paths
    └── query-templates.md      # Reusable SQL CTEs
```

The Skill function loads and injects the active skill's content into every LLM call. Skills use placeholder tokens ({DATABASE}, {SCHEMA}, {CLAIMS_TABLE}) that resolve at load time from a configuration dictionary.
The design choice that matters: skills are context, not code. They don't define tools or modify the agent's behavior → they tell the LLM what it's working with. This means I can drop in a second skill (say, finance-operations for a different domain) and the same tools, middleware, and feedback loop work without modification. The Skill function just injects different context. In a multi-skill setup, a routing layer would select the right skill based on the user's question → but the tools and middleware remain unchanged.
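Loading a skill is little more than reading markdown and resolving tokens. Here's a minimal sketch; the configuration values below are made up, and the file layout matches the tree above:

```python
from pathlib import Path

# Illustrative config; the real values come from the deployment configuration.
SKILL_CONFIG = {"DATABASE": "main", "SCHEMA": "smart_claims", "CLAIMS_TABLE": "claims"}

def load_skill(skill_dir: str = "skills/smart-claims") -> str:
    """Concatenate SKILL.md and its references, resolving {PLACEHOLDER} tokens."""
    root = Path(skill_dir)
    parts = [(root / "SKILL.md").read_text()]
    for ref in sorted((root / "references").glob("*.md")):
        parts.append(ref.read_text())
    content = "\n\n".join(parts)
    for token, value in SKILL_CONFIG.items():
        content = content.replace("{" + token + "}", value)
    return content  # injected into every LLM call as domain context
```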
Memory Systems: Four Layers of Context
The agent uses four distinct memory systems, each serving a different purpose:
Semantic Cache
Before the agent even starts, it checks a Vector Search index for a semantically similar past question (cosine similarity ≥ 0.85). Cache hits return instantly. The cache entries expire after 7 days. Ambiguous queries that triggered HITL are excluded.
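A sketch of the cache check, assuming a Databricks Vector Search index over past question/answer pairs. The endpoint and index names are made up, and the score handling depends on how the index is configured:

```python
from databricks.vector_search.client import VectorSearchClient

SIMILARITY_THRESHOLD = 0.85  # cosine similarity cutoff described above

def check_semantic_cache(question: str) -> str | None:
    """Return a cached answer if a semantically similar past question exists."""
    index = VectorSearchClient().get_index(
        endpoint_name="cache_endpoint",            # illustrative names
        index_name="main.smart_claims.qa_cache",
    )
    hits = index.similarity_search(
        query_text=question,
        columns=["question", "answer"],
        num_results=1,
    )
    rows = hits.get("result", {}).get("data_array", [])
    # Vector Search appends a relevance score as the last element of each row.
    if rows and rows[0][-1] >= SIMILARITY_THRESHOLD:
        return rows[0][1]  # cached answer
    return None
```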
Episodic Memory
One relevant SQL example retrieved per query from a curated set of past successful queries. These are few-shot examples stored in PostgresStore, retrieved by semantic similarity, and formatted as markdown in the system prompt. They give the LLM concrete patterns to adapt rather than generating SQL from scratch.
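A sketch of that retrieval, assuming the PostgresStore was created with an embedding index so that query= performs semantic search; the namespace and value schema are illustrative:

```python
from langgraph.store.postgres import PostgresStore

def retrieve_sql_example(store: PostgresStore, question: str) -> str:
    """Fetch the most similar past SQL example and format it as few-shot context."""
    hits = store.search(("episodic", "sql_examples"), query=question, limit=1)
    if not hits:
        return ""
    example = hits[0].value  # e.g. {"question": ..., "sql": ...}
    return (
        "### Similar past query\n"
        f"Question: {example['question']}\n"
        f"SQL: {example['sql']}"
    )
```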
User Memory
Per-user preferences, aliases, and context. If a user says "when I say 'comp policies' I mean POLICY_TYPE = 'COMP'", the agent stores that and retrieves it on future queries.
Optimized Prompts
The output of the feedback optimization loop, stored in PostgresStore and loaded at the start of every request to override default prompt sections.
Interested in the technical details, a live demo, or the source code?
Book a Consultation
Available for projects in New York, London, Sydney & Berlin