25 February 2026

Which AI to Use for ISO 26262 Documentation — And What It Actually Costs

Choosing AI infrastructure for automotive safety documentation is different from normal AI procurement. This article covers deployment options (standard API, enterprise tier, private cloud, self-hosted), model comparison for safety work, real cost estimates, and what we actually run.

The question we get most often is not “can AI write safety requirements?” (yes, it can). It’s “which AI, deployed how, without my OEM and supplier data leaking?”

This article answers that. Written after building and running our own system.

We’ve built AI agents that help automotive safety engineers draft requirements, review consistency, and prepare documentation for ISO 26262. In the process, we’ve had to make the same infrastructure decisions every team will face: which model, where to host it, and how much it costs when you’re not prototyping anymore.

This isn’t a vendor pitch. It’s practitioner notes from someone who had to figure this out the hard way.

Why This Decision Is Different From Normal AI Procurement

Most AI procurement decisions are about cost, performance, and user experience. For safety-critical documentation work, you’re optimizing for a different set of constraints:

Hallucination is a liability problem, not just an accuracy problem. If an AI drafts a requirement that contradicts the system architecture and an assessor catches it, you’ve wasted review time. If they don’t catch it and it makes it into a Type Approval dossier, you’ve created a compliance risk.

OEM and supplier system architectures are confidential. Most automotive work involves sharing detailed system diagrams, failure mode analyses, and architectural decisions with your AI tooling. A standard API call to OpenAI or Anthropic means that data leaves your infrastructure. For most OEM and supplier contracts, that’s simply not acceptable — regardless of what the privacy policy says.

The EU AI Act has opinions about this. If your AI system generates content that contributes to safety-critical outputs (Type Approval dossiers, SOTIF analyses), it may qualify as high-risk under the Act. That triggers conformity assessment requirements by August 2026. The easier you can audit your system, the easier that assessment becomes.

Assessors will ask “what generated this?” When you submit AI-assisted documentation, you need to be able to answer: which model, which version, trained on what data, with what safeguards. “ChatGPT” is not an acceptable answer. “Claude Sonnet 4.6 via AWS Bedrock EU-Central-1, pinned version, with retrieval-augmented generation and human review” is.

Model updates cannot silently change previously accepted requirements. If you use a standard API and the provider updates the model, your outputs change. For documentation that’s already been reviewed and accepted, that’s a problem. You need version pinning and the ability to reproduce previous outputs exactly.
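In practice this means recording, for every accepted document, exactly which pinned model produced it, and refusing regeneration under any other pin. A minimal sketch of such a registry (the dataclass and field names are our own convention, not any provider's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationRecord:
    """Everything needed to reproduce a previously accepted output."""
    document_id: str
    model_id: str       # fully qualified, versioned identifier, never a "latest" alias
    temperature: float
    prompt_sha256: str  # hash of the exact prompt plus retrieved context

# Registry of outputs that have already passed human review.
ACCEPTED: dict[str, GenerationRecord] = {}

def register_accepted(rec: GenerationRecord) -> None:
    ACCEPTED[rec.document_id] = rec

def check_reproducible(document_id: str, model_id: str) -> bool:
    """Refuse regeneration of an accepted document under a different model pin."""
    rec = ACCEPTED.get(document_id)
    return rec is not None and rec.model_id == model_id
```

In a real deployment the registry lives in version control or a database next to the documents themselves; the point is that a floating alias never silently replaces the pinned version an assessor already saw.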

The Three Deployment Options

Option A: Frontier API (Standard & Enterprise)

This is what most people try first: ChatGPT, Claude, Gemini directly from providers.

The standard consumer tier is fine for internal prototyping, learning what’s possible, and testing prompt strategies. It is not acceptable for production work with real OEM and supplier data, for anything going into a Type Approval dossier, or for work under NDA with automotive clients. Costs run roughly €0.02-0.15 per 1,000 tokens depending on the model; treat that as a reference point, not a budget.

Enterprise tiers — like ChatGPT Enterprise, Claude Enterprise, and Gemini Enterprise — add data processing agreements, compliance certifications, and data residency controls that make them viable for many automotive use cases.

For ChatGPT Enterprise, pricing is custom (typically ~$60/user/month) with a contractual guarantee that your data is not used for training, SOC 2 Type 2 and GDPR compliance, custom retention policies, and admin controls. Claude Enterprise is also custom-priced and HIPAA-ready with a BAA available, adding audit logs, a compliance API, and context caching to cut costs on repeated analysis. Gemini Enterprise, bundled with Google Workspace, offers custom pricing with built-in DLP, EU data regions, and AI classification for sensitive data.

Enterprise tier covers most internal use cases like tooling, prototyping, process development, training, onboarding, ISO 26262 standards Q&A, internal requirements reviews before customer submission, and documentation templates.

The line is drawn at confidential project data, such as OEM and supplier system architectures (block diagrams, network topology, ECU specifications), HARA with real vehicle data, FMEDA with production part numbers and supplier information, anything going directly into a Type Approval dossier, or work under NDA that explicitly prohibits third-party processing — that requires private cloud or self-hosted.

The honest message: Don’t over-engineer your deployment for internal tooling. If you’re writing a generic FMEDA template or answering “what does ASIL decomposition mean?”, enterprise tier is the right call. There’s no reason to self-host a 70B parameter model to answer standards questions.

The “no standard API” restriction applies specifically to confidential project data. The right deployment depends on the data classification of the specific project.
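That routing decision can be made explicit in tooling rather than left to each engineer’s judgment. A sketch, assuming a four-level classification scheme (the class names and deployment labels are illustrative, not an industry standard):

```python
from enum import Enum

class DataClass(Enum):
    PUBLIC = 1          # standards Q&A, generic templates
    INTERNAL = 2        # internal process docs, training material
    CONFIDENTIAL = 3    # OEM/supplier architectures, real HARA/FMEDA
    RESTRICTED = 4      # air-gapped programs, NDAs prohibiting third-party processing

def choose_deployment(data_class: DataClass) -> str:
    """Map a project's data classification to the minimum acceptable deployment."""
    if data_class in (DataClass.PUBLIC, DataClass.INTERNAL):
        return "enterprise-api"   # ChatGPT/Claude/Gemini Enterprise tier
    if data_class is DataClass.CONFIDENTIAL:
        return "private-cloud"    # Azure OpenAI / Bedrock / Vertex, EU region
    return "self-hosted"          # data never leaves your infrastructure
```

Encoding the policy this way means the tooling can refuse to send a CONFIDENTIAL document to an enterprise-tier endpoint instead of relying on everyone remembering the rule.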

Option B: Private Cloud (Azure OpenAI / AWS Bedrock / GCP Vertex AI)

This is where most professional deployments end up. You deploy the same frontier models (GPT-5.2, Claude Sonnet 4.6, Gemini 3) through enterprise cloud platforms, keeping data in your chosen region like EU-Central-1 for automotive work. It’s covered by Data Processing Agreements, VNet isolation, audit logs, and GDPR compliance — acceptable for most OEM and supplier contracts.

Available models include GPT-5.2 and o1 on Azure OpenAI (plus legacy pinnable GPT-4o versions), Claude Sonnet 4.6, Opus 4.6, and Mistral Large 2 on AWS Bedrock, and Gemini 3.1 Pro and 3 Flash on GCP Vertex AI.

It’s good for teams that need frontier model performance, organizations already on one of these cloud platforms, and projects with moderate to high volume. Limitations include still requiring trust in a third-party provider, model updates controlled by the provider (though you can pin versions), and highest-classification data (military, some government programs) may still require self-hosting.

Option C: Self-Hosted Open Source

Data never leaves your infrastructure. Fully auditable. Required for air-gapped environments. You deploy open-source models on your own hardware or private cloud, controlling the entire stack: model weights, inference engine, version management, with no external API calls or third-party data processing.

Viable models as of early 2026 include Mistral Large 2 (123B parameters, best for structured reasoning), Qwen2.5-72B (strong technical writing and multilingual support), Llama 3.3 70B (solid general performance), and DeepSeek R1 (excellent reasoning but higher compute needs).

It’s good for highest-classification OEM and supplier data, air-gapped environments, government or defense programs, and high-volume deployments where cost matters. Limitations include requiring ML infrastructure expertise, higher upfront cost, performance typically 6-12 months behind frontier models, and you’re responsible for model evaluation and safety testing.

Internal confidentiality note: Even self-hosted deployments have internal data leakage risks. Engineers must be instructed on appropriate use — HR appraisals, board restructuring decisions, merger data, and other sensitive internal information should not be processed through AI tooling without explicit authorization, even when self-hosted.

Model Comparison for Safety Documentation Work

Not all models are equally good at writing safety requirements. Hallucination rates matter here more than in almost any other domain.

Independent benchmarks from February 2026 put frontier-model hallucination rates in the low single digits; the per-model figures we rely on appear in the comparison table below.

Note: Hallucination benchmarks evolve rapidly and methodology varies across studies. Always validate with your own domain-specific test sets for safety-critical applications — published rates are a starting point, not a guarantee.

Critical insight: RAG (Retrieval-Augmented Generation) cuts hallucinations by ~71%. If you feed the AI the actual system architecture before asking it to draft requirements, you reduce hallucination dramatically. This isn’t optional for safety work — it’s mandatory.
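The mechanics are simple even if production retrieval is not. A toy sketch of the idea, with naive keyword overlap standing in for a real embedding-based retriever and vector store:

```python
def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank system-architecture chunks by keyword overlap with the query.
    A production pipeline would use embeddings and a vector store instead."""
    terms = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: -len(terms & set(c.lower().split())))
    return scored[:k]

def build_prompt(task: str, chunks: list[str]) -> str:
    """Ground the drafting task in retrieved system context before generation."""
    context = "\n".join(f"[CONTEXT] {c}" for c in retrieve(task, chunks))
    return f"{context}\n[TASK] {task}\nDraft the requirement using ONLY the context above."
```

The instruction to use only the supplied context is what does the work: the model drafts against the actual architecture instead of a plausible-sounding invented one.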

Our Four Agent Roles

Our system uses four specialized agent roles (detailed in our agentic workflow article):

  1. System Analysis Agent — reads FMEDA, block diagrams, and hazard analyses
  2. Requirements Drafter — generates safety requirements based on system analysis
  3. Consistency Reviewer — checks for contradictions and missing coverage
  4. Assessor Perspective Agent — reviews from the perspective of a TÜV assessor
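Structurally, the four roles form a sequential pipeline in which each agent consumes the previous agent’s output. A sketch with stubbed agents (the function names mirror the roles above; real agents would call the models compared in the table, not return formatted strings):

```python
from typing import Callable

Agent = Callable[[str], str]

def system_analysis(doc: str) -> str:
    # Stub: a real agent sends the FMEDA and diagrams to a large-context model.
    return f"analysis({doc})"

def requirements_drafter(analysis: str) -> str:
    return f"requirements({analysis})"

def consistency_reviewer(reqs: str) -> str:
    return f"reviewed({reqs})"

def assessor_perspective(reviewed: str) -> str:
    return f"assessed({reviewed})"

PIPELINE: list[Agent] = [
    system_analysis, requirements_drafter, consistency_reviewer, assessor_perspective,
]

def run_pipeline(doc: str) -> str:
    """Chain the four roles: each agent's output is the next agent's input."""
    out = doc
    for agent in PIPELINE:
        out = agent(out)
    return out
```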

Here’s what we’ve learned about which models work best for each:

| Model | Hallucination rate | Context window | Private cloud | Self-host viable | Best for |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | ~6% | 1M tokens | Yes (Vertex) | No | System Analysis (large context diagrams) |
| GPT-5.2 | ~6% | 128K tokens | Yes (Azure) | No | Requirements Drafting (structured output) |
| Claude Sonnet 4.6 | ~3% | 200K tokens | Yes (Bedrock) | No | Consistency Review (lowest hallucination) |
| Mistral Large 2 | ~3-5% (estimated) | 128K tokens | Yes (Bedrock) | Yes | General safety work (self-hosted) |
| Qwen2.5-72B | ~4-6% (estimated) | 128K tokens | No | Yes | Technical writing (multilingual) |

Recommendation: For production work, use Gemini 3 Flash for document ingestion (1M token context window handles full FMEDA and architecture dumps without chunking), Claude Sonnet 4.6 for reasoning-heavy tasks (lowest hallucination rate in the frontier tier), and validate everything with RAG. If you need to self-host, Mistral Large 2 remains the best current option.

Cost Estimates

Let’s talk real numbers.

Assumptions: one full project consumes roughly 450K tokens end to end (document ingestion, drafting, and review passes), and prices reflect early-2026 list pricing. Scale the figures below to your own volumes.

Scenario 1: Prototyping / Internal Tooling (1 project/month)

Setup: Standard API, no infrastructure
Volume: ~450K tokens/month (1 full project)
Cost: €50-150/month depending on model
Data privacy: Not acceptable for OEM/supplier work

Scenario 2: Small Practice (5 projects/month)

Setup: Private cloud (AWS Bedrock or Azure OpenAI), basic RAG pipeline
Volume: ~2.25M tokens/month (5 full projects)
Setup cost: €2,000-5,000 (RAG infrastructure, integration)
Running cost: €400-800/month (API costs + infrastructure)
Data privacy: Acceptable for most OEM and supplier contracts

Scenario 3: Tier 1 Supplier Scale (20+ projects/month)

Setup: Self-hosted open source (Mistral Large 2 on dedicated hardware)
Volume: 9M+ tokens/month (20+ projects)
Setup cost: €15,000-30,000 (hardware, ML infrastructure, integration)
Running cost: €800-1,500/month (compute, storage, maintenance)
Break-even: 3-6 months vs. private cloud
Data privacy: Acceptable for any classification level

| Option | Setup cost | Monthly running cost | Monthly at 5 projects | Monthly at 20 projects | Data privacy level |
|---|---|---|---|---|---|
| Standard API | €0 | €50-150 (at 1 project) | Not recommended | Not recommended | Prototype only |
| Private Cloud | €2,000-5,000 | €80-160 base | €400-800 | €1,600-3,200 | OEM-acceptable |
| Self-Hosted | €15,000-30,000 | €800-1,500 | €800-1,500 | €800-1,500 | Any classification |

Key insight: Self-hosting makes economic sense at scale, but the break-even point depends heavily on project volume. Under 10 projects/month, private cloud is usually cheaper. Above 20 projects/month, self-hosting pays for itself quickly.
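That break-even claim is easy to check against your own numbers. A sketch of the arithmetic (plug in your own figures; the parameters are our simplification, ignoring financing and staff time):

```python
from typing import Optional

def break_even_months(
    selfhost_setup: float,      # one-time hardware and integration cost
    selfhost_monthly: float,    # compute, storage, maintenance
    cloud_per_project: float,   # private-cloud API spend per project
    projects_per_month: int,
) -> Optional[float]:
    """Months until self-hosting is cheaper than private cloud, or None if never."""
    monthly_saving = cloud_per_project * projects_per_month - selfhost_monthly
    if monthly_saving <= 0:
        return None  # at this volume, private cloud stays cheaper every month
    return selfhost_setup / monthly_saving
```

With the low end of the ranges above at 20 projects/month this lands around six months; at 5 projects/month the saving evaporates and the function returns None, matching the rule of thumb that private cloud wins at low volume.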

The EU AI Act Factor

If your AI system contributes to safety-critical outputs — Type Approval dossiers, SOTIF analyses, hazard assessments — it likely qualifies as high-risk under the EU AI Act.

What that means: By August 2026, high-risk AI systems require conformity assessment. Under the Act’s requirements for high-risk systems, you need to demonstrate a risk management system, data governance over training and reference data, technical documentation, automatic logging of system activity, human oversight, and appropriate accuracy, robustness, and cybersecurity.

Self-hosted systems are substantially easier to audit. When an assessor asks “what data was this trained on and can you prove it?”, it’s much easier to answer if you control the entire stack. With a third-party API, you’re dependent on their documentation — and their willingness to share it.
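In our tooling this translates to a provenance record attached to every generated artifact, so the “what generated this?” question has a mechanical answer. A sketch (the field names are our own convention, not an Act-mandated schema):

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(
    model_id: str,
    model_version: str,
    prompt: str,
    retrieved_context: list[str],
    reviewer: str,
) -> dict:
    """Audit metadata stored alongside every generated document."""
    return {
        "model_id": model_id,
        "model_version": model_version,  # the pinned version, never an alias
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "context_sha256": hashlib.sha256(
            json.dumps(retrieved_context, sort_keys=True).encode()
        ).hexdigest(),
        "human_reviewer": reviewer,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
```

Hashing the prompt and retrieved context rather than storing them verbatim keeps the audit trail useful without copying confidential architecture data into yet another system.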

This is not legal advice. Talk to your legal team. But from a practical engineering perspective: the more control you have over your AI infrastructure, the easier the conformity assessment will be.

Conclusion

There is no single right answer. The right answer depends on your client’s data classification requirements, project volume, and regulatory exposure.

What we can say: make the decision deliberately, not by default. Using the standard API because it’s easy is not a risk management strategy.

A Note on Prototyping and Data Handling

When prototyping, do NOT use real client or confidential data. Engineers must understand what is acceptable — using a customer’s actual FMEDA in a free-tier LLM to “test the concept” is a data breach, not a prototype. Use synthetic data, anonymized templates, or public examples from standards documentation. Train your team on appropriate data handling before they touch AI tooling.

If you’re just starting: prototype with synthetic data on the standard API, prove the value, then move to private cloud before you touch real OEM or supplier data.

If you’re a small practice: private cloud is the sweet spot. Set it up once, use it confidently.

If you’re a Tier 1 supplier or high-volume practice: run the numbers on self-hosting. The break-even point comes faster than you think.

And if an assessor ever asks “what generated this?”, make sure you have a better answer than “the internet.”

If you’re building something similar or evaluating options for your team, reach out — coen@quenos.technology.
