· June 08, 2026 · 6 min read · ← Blog

Frontier AI just got graded on EU law. Nobody passed.

The Aithos Foundation published the LARA leaderboard on 27 May. Thirteen frontier AI models, ten scenarios, more than three thousand simulation runs, evaluated against six provisions of the EU AI Act and four core GDPR principles. The top-performing model reached 63% legal compliance. The lowest 7%. On EU AI Act Art. 5 prohibited practices — including emotion inference in workplace settings and social scoring — models violated provisions in 80% of runs.

The Aithos team's conclusion: no frontier model has acceptable levels of compliance with EU law when deployed as an agent.

Aithos LARA leaderboard, 8 June 2026: 13 frontier AI models ranked by EU AI Act and GDPR compliance. Top: Claude Opus 4.8 with 63%. Bottom: Kimi K2.6 with 7%.
Aithos LARA leaderboard, 8 June 2026. Source: lara.aithos.org

The leaderboard nobody asked for, the one the field needed

LARA — Legal Assessment for Real-world Agents — is the first public, reproducible benchmark for legal compliance of AI agents deployed under EU law. Each model is placed in adaptive workplace simulations where it can read emails, query customer databases, use tools, and pursue assigned business goals. The simulations are designed so that completing the task may require crossing a regulatory line: inferring emotions from a candidate's voice, scoring tenants by lifestyle data, monitoring employees discreetly, exploiting vulnerable customers. The scenario names alone read like a regulatory exam paper. The models are graded by a unanimous-vote judge panel with human review overrides applied.

The findings are not subtle. The best frontier model in the world today, deployed under default conditions as an agent, breaks EU law in 37% of relevant runs. The median is worse. The jurisdiction of origin does not save you — French Mistral scores 12%, American GPT models cluster in the teens, Chinese models occupy the bottom. The compliance gap is structural, not regional, not vendor-specific.

Compliance is not a training problem

The instinct, when faced with these numbers, is to ask which model will solve this in the next release. The honest answer is none of them, because the framing is wrong.

EU AI Act Article 12 requires automatic logging of high-risk AI system operations, with tamper-evident records retained for at least six months. Annex III obligations apply to specific deployment contexts — employment screening, credit scoring, law enforcement, education — and produce documentation burdens that exist independent of how the underlying model behaves. Article 5 prohibited practices are bright-line bans on certain categories of inference and manipulation, enforced at deployment, not at training.

None of these are properties that fine-tuning produces. A model cannot fine-tune its way to tamper-evident logging, because logging is not a generation problem. A model cannot post-train its way out of Annex III, because Annex III obligations attach to the deployment context the model is placed into, not to the model's weights. The next frontier model release will be more capable, more aligned, possibly more cautious — and it will still fail LARA, because LARA tests something models structurally cannot deliver.

What sits between the model and the action

The structural fix is not a better model. It is a layer between the model's output and the action the agent takes — a deployment-layer mechanism that records, authorizes, and produces evidence per agent action, independent of how the model behaves in any given run.

What such a layer needs to do is well understood, even if not yet standard practice. It needs to bind each agent action to a verifiable identity (the agent, the deployer, the legal entity behind both). It needs to record what the agent was authorized to do, under what constraints, for how long. It needs to produce tamper-evident evidence of what happened, retrievable by an auditor without trusting the deployer. It needs to do all of this at the rate agents actually operate — thousands of actions per minute, not human-pace paperwork.

The technical building blocks already exist. W3C Decentralized Identifiers and Verifiable Credentials provide the identity layer. Hashes of non-personal operational data anchored on a public ledger provide the tamper-evidence layer. Cryptographic signatures bind the two. None of this needs to be invented. It needs to be assembled, deployed, and standardized.

The work at MolTrust

This is the work at MolTrust. The Agent Authorization Envelope — AAE — is a signed credential issued per agent action that records what the agent is authorized to do, under what constraints, with what validity period. AAEs are anchored to Base L2 by hash only — hashes of non-personal operational data — so the ledger never carries identifiable information about individuals. Verification is independent of MolTrust: anyone with the agent's DID can validate the credential against the on-chain anchor.

The AAE specification is built on W3C DID and Verifiable Credential standards. It is regime-agnostic — the same envelope supports EU AI Act Article 12 logging, ISO/IEC 42001 management system evidence, NIST AI RMF documentation, and Singapore's IMDA Model AI Governance Framework, because the underlying primitive is generic. The protocol layer does not encode compliance. It produces the evidence that compliance frameworks require, in a form auditors can verify without taking anyone's word for it.

For the operational surface, see moltrust.ch/compliance. For an example of what an audit evidence bundle actually looks like, see the sample bundle. The technical specification is on arXiv.

What this means for you

If you are deploying an AI agent under EU jurisdiction in 2026, your compliance posture today is whatever LARA says about your model, plus whatever your deployment layer adds — and for most deployers, what the deployment layer adds is mostly zero. With the top frontier model you are at 63%. With the most widely deployed enterprise model — GPT-5.5 — you are at 38%. With one of the cheaper Chinese models you are below 20%. None of these is a number you want to sit on the wrong side of in a regulatory enforcement action, and none of them comes with independent evidence of what your agents actually did.

The 80% violation rate on Art. 5 prohibited practices is not theoretical. The first regulatory enforcement actions will not be subtle. Anyone deploying agents in employment, credit, education, or law enforcement contexts — Annex III high-risk — is exposed today, not in some future enforcement scenario. The compliance gap is now measured. That is the precondition for closing it, and the precondition for being asked, in some near-future conformity assessment, what evidence you have that your agents did what they were supposed to do.

The structural-layer approach is not the only possible answer. But the leaderboard makes one thing clear: better models alone will not close this gap. Something has to sit between the model and the action, and produce evidence of what happened. The question for deployers is whether that something exists in their stack today, or not.

 

Thanks to the Aithos Foundation team for LARA. The full leaderboard is at lara.aithos.org. The 27 May launch report is also discussed on LessWrong.