
25 February 2026

From RAG to Agents: How We Made Our Safety AI Actually Think

We promised a follow-up on agentic workflows. Here it is. After two years of RAG for ISO 26262 documentation, we rebuilt the system around multi-agent orchestration. Completeness before human review went from 60-70% to 85-90%. More importantly, where engineers spend their time changed completely. This is what we built, what improved, and what still does not work.



In our last post, we described the RAG system we built for ISO 26262 documentation: LlamaIndex, vector stores, routing logic, three phases of increasing ambition.

We ended with a tease: “Next post will dive into the agentic workflow architecture.”

That was July 2024. Here it is.

Fair warning: this post is more technical than the last one. And more honest about what we got wrong.


What RAG Gets Right — and Where It Breaks

Let’s start with what we kept.

RAG is genuinely good at knowledge retrieval. If you have a well-structured knowledge base — your past work products, the ISO standard, templates — a RAG system can pull relevant context reliably. That hasn’t changed.

What RAG cannot do is think across steps.

When an engineer drafts a Functional Safety Requirement, they don’t just retrieve a template and fill it in. They reason:

  1. What is this system supposed to do?
  2. What can go wrong?
  3. What’s the ASIL rating and why?
  4. Does this requirement trace back to the hazard analysis?
  5. Is this requirement testable?
  6. What will an assessor challenge?

That’s a chain of reasoning — each step depends on the previous one, and each step may require different information from the knowledge base. A single retrieve-and-generate cycle doesn’t cover this. It never did.

We kept hitting a ceiling. Good at surface-level tasks. Frustratingly inconsistent on anything that required multi-step judgment.

That ceiling is why we moved to agents.


What “Agentic” Actually Means (In Plain Terms)

I’ll skip the hype definition.

Agentic AI, as we use it: an AI system that can decide what to do next rather than just responding to a prompt. It has access to tools — search, calculation, document parsing, other models — and it uses them iteratively until the task is done.

The difference from RAG isn’t the model. It’s the control flow.

RAG: user asks → system retrieves → model generates → done.

Agent: user asks → agent plans → agent uses tools → evaluates output → iterates → done (or asks for help if stuck).

For safety documentation, this matters enormously.
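The control-flow difference can be sketched in a few lines. This is a minimal illustration, not our production orchestrator: the tool set, the evaluator, and the trivial `plan` method are all stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    """Minimal agent loop: plan, act with a tool, evaluate, iterate."""
    tools: dict[str, Callable[[str], str]]
    evaluate: Callable[[str], bool]  # True when the output is good enough
    max_iterations: int = 5

    def run(self, task: str) -> str:
        output = ""
        for _ in range(self.max_iterations):
            tool_name = self.plan(task, output)   # decide what to do next
            output = self.tools[tool_name](task)  # use the chosen tool
            if self.evaluate(output):             # check the result
                return output
        return output + " [needs human review]"   # escalate when stuck

    def plan(self, task: str, prior: str) -> str:
        # In a real system this is an LLM call choosing the next tool;
        # here we just take the first one, to keep the sketch runnable.
        return next(iter(self.tools))
```

A RAG pipeline is one pass through a fixed retrieve-then-generate path; the loop above is what lets the system retry, switch tools, and hand off to a human when the evaluator never passes.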


The Architecture We’re Running Now

We kept the vector store. We kept LlamaIndex for document indexing. We added an orchestration layer on top.

Here’s how a typical safety requirements task runs now:

Step 1: System Analysis Agent

Takes the system description as input. Retrieves relevant ISO 26262 clauses and past HARA documents from the knowledge base. Identifies the operational scenarios that matter for this system. Outputs a structured hazard context — not requirements yet, just a map of what we’re working with.

This alone caught two hazards that were present in the system description but didn’t make it into the original manual HARA.
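The "structured hazard context" is best thought of as a typed record rather than free text. A sketch of the shape, with illustrative field names (not our actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Hazard:
    description: str
    operational_scenario: str  # e.g. "highway driving, high speed"
    iso_clauses: list[str]     # retrieved ISO 26262 clause references

@dataclass
class HazardContext:
    system_description: str
    hazards: list[Hazard] = field(default_factory=list)

    def scenarios(self) -> set[str]:
        """All operational scenarios that matter for this system."""
        return {h.operational_scenario for h in self.hazards}
```

Keeping the output structured is what lets the later agents consume it mechanically instead of re-parsing prose.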

Step 2: Requirements Drafter Agent

Takes the hazard context from Step 1. Retrieves requirement templates and examples of accepted FSRs from past assessments. Drafts Functional Safety Requirements and Technical Safety Requirements.

Critically: it has access to a small calculation tool that checks ASIL arithmetic. It runs the check itself during drafting, not afterwards.
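A tool like that can be as small as a lookup of the decomposition pairs ISO 26262-9 permits. The table below encodes the standard decomposition schemes as we read them; it is a sketch, and you should verify the pairs against your copy of the standard before relying on them.

```python
# Permitted ASIL decomposition pairs per ISO 26262-9 (e.g. ASIL D may be
# decomposed into C(D)+A(D), B(D)+B(D), or D(D)+QM(D)).
VALID_DECOMPOSITIONS = {
    "D": {("C", "A"), ("B", "B"), ("D", "QM")},
    "C": {("B", "A"), ("C", "QM")},
    "B": {("A", "A"), ("B", "QM")},
    "A": {("A", "QM")},
}

def check_decomposition(original: str, part1: str, part2: str) -> bool:
    """True if (part1, part2) is a permitted decomposition of `original`."""
    pairs = VALID_DECOMPOSITIONS.get(original, set())
    return (part1, part2) in pairs or (part2, part1) in pairs
```

Running a check like this during drafting, rather than in review, is the whole point: the drafter never emits a decomposition the table rejects.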

Step 3: Consistency Reviewer Agent

This is the one we’re most proud of.

It takes all the drafted requirements and checks them against each other and against the hazard context. It's specifically looking for contradictions between requirements, requirements that don't trace back to any hazard, and ASIL mismatches between related requirements.

It returns a structured critique. Not a summary — a list of specific issues with references to the exact requirements that have problems.

We ran it against our existing documentation from past projects. It found things we’d missed. Things that, in retrospect, I’m genuinely embarrassed to have signed off on.
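"Structured critique" here means a list of machine-readable findings, each pointing at a specific requirement. Roughly this shape (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    requirement_id: str  # the exact requirement that has the problem
    issue_type: str      # e.g. "missing_trace", "contradiction", "untestable"
    detail: str

def summarize(findings: list[Finding]) -> dict[str, int]:
    """Count findings per issue type, so engineers can triage by category."""
    counts: dict[str, int] = {}
    for f in findings:
        counts[f.issue_type] = counts.get(f.issue_type, 0) + 1
    return counts
```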

Step 4: Assessor Perspective Agent

We gave this one a simple brief: “You are an experienced ISO 26262 assessor. Read these requirements. What will you challenge?”

It’s adversarial by design. We feed its critique back into Step 2 and iterate.

Two rounds is usually enough to get to something defensible. Sometimes three.
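The draft-and-challenge cycle is an iteration with a round cap. A sketch of the loop, where `draft` and `challenge` stand in for the drafter and assessor agents:

```python
from typing import Callable

def draft_until_defensible(
    draft: Callable[[list[str]], str],      # drafter: critiques -> new draft
    challenge: Callable[[str], list[str]],  # assessor: draft -> critiques
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Iterate draft -> challenge until the assessor has nothing to raise."""
    critiques: list[str] = []
    current = ""
    for round_no in range(1, max_rounds + 1):
        current = draft(critiques)
        critiques = challenge(current)
        if not critiques:              # defensible: no open challenges
            return current, round_no
    return current, max_rounds         # cap reached; hand off to a human
```

The round cap matters: without it, two agents can ping-pong indefinitely on a genuinely ambiguous requirement that needs a human call anyway.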


What This Changes in Practice

The single-shot RAG system produced content that was 60-70% complete. We said that in the last post.

The agentic system produces content that’s closer to 85-90% complete before human review. And the remaining 10-15% is more clearly flagged — the agent knows what it’s uncertain about and says so.

The bigger change is where engineer time goes.

Before: engineers spent most of their time fixing things the AI missed — gaps, inconsistencies, missing traces. Review felt like cleanup.

Now: the agents surface most of the obvious issues before the engineer sees the output. Review feels more like quality judgment — deciding on the hard calls that require domain expertise.

That’s the shift we were aiming for.


What Still Doesn’t Work

Let me be specific.

ASIL decomposition on novel architectures. When the system doesn’t look like anything in the knowledge base, the agents struggle. They retrieve similar architectures and pattern-match. Sometimes that’s fine. Sometimes it’s dangerously wrong. Human judgment is non-negotiable here.

Supplier interface requirements. The nuances of what a Tier 1 supplier can actually commit to, based on their processes, their legacy toolchain, their organizational reality — this is relationship knowledge. Agents have none of it.

Ambiguity resolution under time pressure. When there’s genuine ambiguity in the standard and the project timeline is forcing a decision, agents hedge. They present options. Sometimes that’s valuable. Sometimes you need someone who has been in that assessment room to just make the call. Agents can’t do that yet.

We’re not trying to close these gaps with more engineering. They’re not engineering problems — they’re expertise problems.


The Honest Comparison

                             RAG System (2024)   Agentic System (2026)
Completeness before review   60-70%              85-90%
Consistency checking         Manual              Automated
Traceability validation      Partial             Comprehensive
Assessor perspective         None                Adversarial agent
Novel systems                Reasonable          Still struggles
Setup complexity             Low                 Medium
Engineer time saved          30-40%              50-60%

The numbers are from our own projects. Your mileage will vary based on how well-structured your knowledge base is and how standard your architecture is.


What We’d Do Differently

If we were starting this over today:

Build the reviewer agent first. Not the drafter. The value of an automated consistency check is immediately obvious to any safety team. It’s a much easier internal sell than “the AI writes requirements.”

Instrument everything. We added logging as an afterthought. You need to know which agent made which decision and why. Not for debugging — for assessors. They will ask.
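Instrumentation can be as simple as an append-only decision log written at every agent step. A sketch (the record structure is illustrative; the point is that each entry names the agent, the decision, and the evidence it relied on):

```python
import json
import time

def log_decision(log_path: str, agent: str, decision: str,
                 evidence: list[str]) -> None:
    """Append one JSON record per agent decision, for later assessor review."""
    record = {
        "timestamp": time.time(),
        "agent": agent,        # which agent acted
        "decision": decision,  # what it decided
        "evidence": evidence,  # retrieved chunks / tool outputs it used
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

One JSON line per decision is easy to grep, easy to diff between runs, and easy to hand to an assessor who asks why a particular requirement says what it says.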

Smaller, task-specific models. We defaulted to large general models for everything early on. Smaller, focused models are faster, cheaper, and often more consistent for structured tasks like requirement templates. Save the big models for the reasoning-heavy steps.


Where This Fits in the Bigger Picture

We’re not building this to replace safety engineers.

We’re building it because functional safety teams are understaffed and overloaded, and the documentation burden is a large part of why. Engineers with 20 years of system safety knowledge are spending their days formatting documents and chasing traceability matrices.

That is a waste.

If agents can handle 80% of the documentation scaffolding — and flag the 20% that needs real judgment — safety engineers can focus on the decisions that actually matter.

That’s the goal. We’re not there yet. But we’re closer than we were in 2024.

Questions or want to discuss further? coen@quenos.technology