Building an AI Agent That Learns From Its Mistakes

A production SQL agent on Databricks Mosaic AI Framework with composable middleware, human-in-the-loop clarification, and a self-improving feedback loop

Rathes Waran | March 2026 | 12 min read
Multi-Agent Framework · Databricks · Production AI

Insurance claims data is messy. Column names have dots in them (incident.severity, claim_amount.total). Dates arrive in two different formats depending on the table. One table in this dataset holds 1.2 million GPS rows that will melt your warehouse if you forget a filter. And the people who need answers from this data (claims adjusters, underwriters, operations managers) shouldn't need to know any of that.

I built Multi-Agent Framework to bridge that gap: a natural-language SQL agent deployed on Databricks that lets business users ask questions like they would a data analyst and get back formatted, cited results. But the interesting part isn't the text-to-SQL. It's the architecture underneath: context engineering rooted in progressive disclosure that loads relevant context only when necessary, a human-in-the-loop flow that pauses the computation graph, and a feedback loop that continuously improves the agent's responses over time.

This post walks through the design decisions behind each layer, what I learned shipping it to production, and why I think the architecture of skills is the most underrated aspect of building robust agent systems.

Architecture at a Glance

The system is built on four modular layers designed for scalability and precision:

Chat Interface → A React/TypeScript frontend featuring streaming responses and interactive clarification cards. It includes a three-state feedback widget that serves as the primary data source for the agent's continuous learning.

Model Serving Container → A LangGraph agent wrapped in MLflow's Agent protocol and hosted on Databricks Model Serving. Rather than a static prompt, it utilizes context engineering to implement progressive disclosure, ensuring the model only receives the specific metadata and schema context required for the current task.

Memory Systems → Four distinct stores provide deep context: a semantic cache (RAG), episodic memory (PostgresStore) for historical SQL patterns, user preferences, and procedural memory via optimized prompts refined by user feedback.

Data Layer → High-performance Delta tables in Unity Catalog (covering policies, claims, telematics, and accident metadata). These are accessed via a SQL Warehouse with automated health checks and token refreshes to ensure production-grade reliability.

The true power of this system isn't in the individual tools but in the architecture of skills. By decoupling domain knowledge from the underlying execution logic, we can swap or upgrade “skills”, like specialized insurance logic or complex telematics filtering, without touching the core infrastructure. The context engineering layer ensures we never “melt the warehouse” by overloading the LLM, and the feedback loop runs asynchronously, so the agent improves its responses over time without adding any latency to the inference call itself (which currently averages about 14 seconds).
Full system architecture → middleware stack, memory systems, tools, and the self-improving feedback loop

The Architecture of Skills: Engineering Context, Not Just Prompts

Most agent tutorials rely on a monolithic “God Prompt.” I wanted a system I could reason about in isolation, where I could add, remove, or reorder capabilities without the entire house of cards collapsing.

Instead of one giant instruction set, the agent utilizes progressive disclosure: it learns only what it needs to know, exactly when it needs to know it, through four strategic context layers:

1. The Strategic Interrupt (Human-in-the-Loop)

Before any data is touched, this layer inspects the intent. If the agent needs to narrow down 1.2 million rows of GPS data, it triggers propose_filter_to_user. This pauses the entire computation graph, creating a checkpoint. This ensures that “warehouse-melting” queries never execute without a human sign-off, while routine tasks like schema inspection or memory retrieval are auto-approved for speed.
2. Skill Activation & Progressive Disclosure

This is the heart of our Context Engineering. Rather than forcing the LLM to remember every table in the insurance database, we “mount” a specific skill. This layer dynamically injects four high-signal context sources:

Domain Knowledge: The specific schema and business rules for the task at hand.
Episodic Memory: Similar past SQL patterns retrieved via semantic search.
User DNA: Stored preferences and custom aliases.
The Feedback Loop: Lessons learned from past feedback on similar queries.

The result: Zero changes to the core logic, but a significantly more “intelligent” agent for that specific turn.
3. Context Hygiene (Tool Pruning)

Error messages are “context poison.” When a SQL query fails and the LLM retries, the conversation history usually gets cluttered with stack traces. This layer prunes the noise, keeping only the most recent failure. This maintains a high signal-to-noise ratio, preventing the model from getting “confused” by its own previous mistakes.
4. State Compression (Summarization)

Long analytical “drill-downs” can quickly saturate a 200k token window. Once we hit a specific threshold, this layer compresses the historical conversation into a concise summary while keeping the most recent messages in high fidelity. This ensures the agent never loses the “thread” of the investigation during complex, multi-turn insurance adjustments.

The sequence is the secret sauce. Interrupts run first to save costs and compute. Skill Activation runs second to provide the “brain.” Hygiene and Compression run last to keep the engine lean.
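
The layering above can be sketched as a composable pipeline. This is an illustrative sketch only: the state shape, layer names, and thresholds are assumptions, not the production middleware API.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    messages: list = field(default_factory=list)
    pending_interrupt: bool = False

def interrupt_layer(state: AgentState) -> AgentState:
    # 1. Pause the graph for warehouse-melting intents; auto-approve the rest.
    state.pending_interrupt = any("propose_filter_to_user" in m for m in state.messages)
    return state

def skill_layer(state: AgentState) -> AgentState:
    # 2. Progressive disclosure: mount only the context this turn needs.
    state.messages.insert(0, "SYSTEM: <active skill + episodic memory + user prefs>")
    return state

def hygiene_layer(state: AgentState) -> AgentState:
    # 3. Context hygiene: keep only the most recent failure message.
    errors = [m for m in state.messages if m.startswith("ERROR")]
    for stale in errors[:-1]:
        state.messages.remove(stale)
    return state

def compression_layer(state: AgentState, keep_recent: int = 10, threshold: int = 50) -> AgentState:
    # 4. Compress old history into a summary once it crosses a token/length threshold.
    if len(state.messages) > threshold:
        older = state.messages[:-keep_recent]
        state.messages = [f"SUMMARY of {len(older)} earlier messages"] + state.messages[-keep_recent:]
    return state

# Interrupts first, skills second, hygiene and compression last.
PIPELINE = [interrupt_layer, skill_layer, hygiene_layer, compression_layer]

def run_turn(state: AgentState) -> AgentState:
    for layer in PIPELINE:
        state = layer(state)
        if state.pending_interrupt:
            break  # checkpoint here and wait for human sign-off
    return state
```

The point of the sketch is the ordering: a risky intent short-circuits the pipeline before any expensive context is assembled.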

What I like about this pattern: By decoupling these “skills” from the underlying infrastructure, I was able to add feedback-aware retrieval by simply updating the Skill Layer's retrieval flow, leaving the rest of the system untouched. It's the difference between building a hardcoded script and building a composable reasoning engine.

Human-in-the-Loop: When the Agent Should Ask, Not Guess

Here's a scenario: a user asks “What were the policies issued?” That's ambiguous → the database has policies from 2016 to 2023. The naive approach is to query all years and dump 227,000 rows. The slightly better approach is to pick the most recent year and hope. The right approach is to ask.

The HITL flow works in two phases:

Discovery → The agent first runs a lightweight SQL query to find available options: SELECT DISTINCT CAST(YEAR(...) AS STRING) FROM policies. This executes automatically (auto-approved by the middleware) and returns the available years.

Proposal → The agent then calls propose_filter_to_user with its recommendation (the most recent year) and all available options. The HITL middleware catches this call, interrupts the graph, and checkpoints the entire execution state to Postgres.

On the frontend, the user sees an interactive card with clickable pills for each year, a highlighted recommendation, and approve/reject buttons. They tap “2021” and the graph resumes from the checkpoint with the user's selection → no re-execution of the discovery query, no lost context.

The design decision I'm most proud of here is making the resume stateless from the backend's perspective. The deployment's predict_stream method checks the checkpoint state on every request. If it finds a pending interrupt, it builds a Command(resume=...) instead of a new HumanMessage. This means the HITL flow works identically in the custom React frontend and in Databricks' Review App → no special backend flags or custom input schemas needed.
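
A minimal sketch of that resume check, with stubs standing in for the graph and checkpointer. Command and HumanMessage mirror LangGraph/LangChain types; FakeGraph, has_pending_interrupt, and build_graph_input are hypothetical names for illustration.

```python
class Command:
    """Stand-in for langgraph.types.Command."""
    def __init__(self, resume=None):
        self.resume = resume

class HumanMessage:
    """Stand-in for langchain_core.messages.HumanMessage."""
    def __init__(self, content):
        self.content = content

class FakeGraph:
    """Stub; the real code would inspect the checkpointed state for this thread."""
    def __init__(self, pending: bool):
        self._pending = pending

    def has_pending_interrupt(self, thread_id: str) -> bool:
        return self._pending

def build_graph_input(graph, thread_id: str, user_input: str):
    # On every request, check checkpoint state. If an interrupt is pending,
    # the incoming text is the user's answer to the clarification card, so
    # resume the paused graph instead of starting a fresh turn.
    if graph.has_pending_interrupt(thread_id):
        return Command(resume=user_input)
    return HumanMessage(content=user_input)
```

Because the decision is derived entirely from checkpoint state, any client that can send plain text, the React frontend or the Review App, gets the same resume behavior for free.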

Self-Healing SQL Execution

Text-to-SQL in production fails more often than demos suggest. Column names are wrong. Date formats don't match. The warehouse went to sleep. The OAuth token expired at 3am.

The validate_and_execute_sql tool handles all of these with a layered retry strategy:

SQL errors (wrong column, syntax issues) → The tool sends the failed query and error message to the LLM with a specialized fix prompt. The LLM returns a corrected query. Up to 3 attempts. This catches the majority of first-try failures, especially around Databricks-specific syntax like backtick-quoted dot-notation columns.

Authentication errors (expired tokens, 401s) → The tool detects auth failures and triggers a connection refresh: re-authenticate via WorkspaceClient, rebuild the SQLAlchemy connection pool, and retry. The agent also proactively checks token age before each request → if the connection is older than 45 minutes, it refreshes preemptively.

Warehouse errors (stopped warehouse, connection timeouts) → The tool triggers _ensure_warehouse_running(), which checks the warehouse state via the Databricks SDK, sends a start command if needed, and polls with exponential backoff (10s, 20s, 30s). Critically, this is fail-safe: it never raises an exception, never blocks indefinitely.
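
The layered strategy can be sketched roughly as follows. The classifier heuristics, handler names, and retry limits are illustrative assumptions; the production tool wires in the real LLM fix prompt, WorkspaceClient refresh, and warehouse polling.

```python
import re

MAX_SQL_FIXES = 3  # matches the "up to 3 attempts" policy above

def classify(error: Exception) -> str:
    # Crude illustrative triage of a failure into one of the three layers.
    msg = str(error)
    if "401" in msg or "token" in msg.lower():
        return "auth"
    if re.search(r"warehouse|timeout", msg, re.IGNORECASE):
        return "warehouse"
    return "sql"

def execute_with_recovery(sql, run_query, fix_with_llm,
                          refresh_connection, ensure_warehouse_running):
    fixes, refreshed, restarted = 0, False, False
    while True:
        try:
            return run_query(sql)
        except Exception as err:
            kind = classify(err)
            if kind == "sql" and fixes < MAX_SQL_FIXES:
                # Send failed query + error to the LLM fix prompt, retry.
                sql = fix_with_llm(sql, str(err))
                fixes += 1
            elif kind == "auth" and not refreshed:
                # Re-authenticate and rebuild the connection pool once.
                refresh_connection()
                refreshed = True
            elif kind == "warehouse" and not restarted:
                # Fail-safe start-and-poll; the real helper never raises.
                ensure_warehouse_running()
                restarted = True
            else:
                raise  # out of recovery options; surface the error
```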

The result: in production, the agent recovers from most transient failures without the user ever seeing an error message.

The Feedback Loop That Rewrites Its Own Prompts

This is the part that makes the agent genuinely learn from its mistakes.

The system prompt is decomposed into three named sections: SQL_RULES, PRESENTATION_RULES, and CLARIFICATION_RULES. Each has a default value hardcoded in the deployment, but at inference time, build_system_prompt() checks PostgresStore for optimized overrides and uses those instead.
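
A minimal sketch of the override lookup, with a plain dict standing in for PostgresStore. The section names come from the post; the store key layout and default texts are assumptions.

```python
# Hardcoded defaults shipped with the deployment (illustrative content).
DEFAULTS = {
    "SQL_RULES": "Always backtick-quote dot-notation columns.",
    "PRESENTATION_RULES": "Return results as a markdown table with citations.",
    "CLARIFICATION_RULES": "Ask before scanning more than one year of data.",
}

def build_system_prompt(store: dict) -> str:
    sections = []
    for name, default in DEFAULTS.items():
        # An optimized override from the feedback loop wins;
        # otherwise fall back to the hardcoded default.
        text = store.get(("prompts", "optimized", name), default)
        sections.append(f"## {name}\n{text}")
    return "\n\n".join(sections)
```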

The optimization pipeline works like this:

Step 1 → Collect feedback. Below every assistant response, the user sees a feedback card. They can rate the response (thumbs up, thumbs down, or “I don't know”), select from context-sensitive tags (e.g., “Inaccurate SQL”, “Missing data”, “Good explanation”), and leave a free-text comment, similar to the Review App on Databricks.
Step 2 → Store and sync. Each feedback event is stored alongside the question, the generated SQL, and the selected tags. At inference time, feedback on semantically similar past queries is retrieved and injected into the context, so the agent can avoid repeating a known mistake even before the next optimization cycle runs.
Step 3 → Optimize. An optimization pass collects negative feedback together with the conversations that produced it, identifies recurring failure patterns, and drafts improved versions of the affected prompt sections as a set of proposed overrides.
Step 4 → Human review. The proposed overrides are shown side by side with the current sections so a reviewer can see exactly what would change. Approved overrides are committed to PostgresStore.
Step 5 → Inference picks them up. The next time the agent handles a request, build_system_prompt() loads the optimized overrides. The loop is closed.

This is deliberately not fully automated. The human review step exists because prompt changes have system-wide impact → a well-intentioned optimization on SQL rules could break clarification behavior. But the heavy lifting (identifying patterns in failures, drafting improved instructions) is done by the optimizer, and the human just approves or adjusts.

Skills: Swappable Domain Knowledge

The agent's domain expertise lives in a skills/ directory, not in the code. The active skill (smart-claims) is a structured package:

skills/smart-claims/
├── SKILL.md              # Rules, join logic, CTE patterns
└── references/
    ├── schema.md         # Column definitions and types
    ├── data-dictionary.md    # Business rules, dimension values
    ├── hierarchy-navigation.md  # Drill-down paths
    └── query-templates.md       # Reusable SQL CTEs

The Skill function loads and injects the active skill's content into every LLM call. Skills use placeholder tokens ({DATABASE}, {SCHEMA}, {CLAIMS_TABLE}) that resolve at load time from a configuration dictionary.
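
Token resolution can be as simple as a string-substitution pass at load time. A sketch, with illustrative config values (load_skill and the catalog/schema names are assumptions):

```python
# Illustrative configuration; the real values come from the deployment config.
SKILL_CONFIG = {
    "DATABASE": "insurance_prod",
    "SCHEMA": "smart_claims",
    "CLAIMS_TABLE": "claims",
}

def load_skill(skill_md: str, config: dict = SKILL_CONFIG) -> str:
    # Replace {TOKEN} placeholders with configured values so the same
    # skill package can target different catalogs without edits.
    for token, value in config.items():
        skill_md = skill_md.replace("{" + token + "}", value)
    return skill_md
```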

The design choice that matters: skills are context, not code. They don't define tools or modify the agent's behavior → they tell the LLM what it's working with. This means I can drop in a second skill (say, finance-operations for a different domain) and the same tools, middleware, and feedback loop work without modification. The Skill function just injects different context. In a multi-skill setup, a routing layer would select the right skill based on the user's question → but the tools and middleware remain unchanged.

Memory Systems: Four Layers of Context

The agent uses four distinct memory systems, each serving a different purpose:

Semantic Cache

Before the agent even starts, it checks a Vector Search index for a semantically similar past question (cosine similarity ≥ 0.85). Cache hits return instantly. The cache entries expire after 7 days. Ambiguous queries that triggered HITL are excluded.
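
A toy sketch of the cache gate. The 0.85 cosine threshold comes from the design above; the brute-force scan stands in for the actual Vector Search index lookup.

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def check_cache(query_vec, cache, threshold=0.85):
    # cache: (embedding, cached_answer) pairs, e.g. rows from the index.
    best_score, best_answer = 0.0, None
    for vec, answer in cache:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    # Only return the cached answer on a sufficiently close match.
    return best_answer if best_score >= threshold else None
```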

Episodic Memory

One relevant SQL example retrieved per query from a curated set of past successful queries. These are few-shot examples stored in PostgresStore, retrieved by semantic similarity, and formatted as markdown in the system prompt. They give the LLM concrete patterns to adapt rather than generating SQL from scratch.

User Memory

Per-user preferences, aliases, and context. If a user says "when I say 'comp policies' I mean POLICY_TYPE = 'COMP'", the agent stores that and retrieves it on future queries.

Optimized Prompts

The output of the feedback optimization loop, stored in PostgresStore and loaded at the start of every request to override default prompt sections.

Interested in the technical details, a live demo, or the source code?

Book a Consultation

Available for projects in New York, London, Sydney & Berlin