The world has agreed which AI capabilities are too dangerous to permit. It has not yet built the machinery to define them precisely, detect a crossing independently, or make a crossing cost anything. What follows is an argument about the order in which that machinery has to be built. A thirty-five-year-old financial watchdog shows the institutional goods are achievable at global scale, and that the order in which you build them is what decides whether the regime endures or collapses.
Over two years, three separate forums have named the same red lines. At the International Dialogues on AI Safety in Beijing, Bengio, Hinton and Yao named capabilities no system should be permitted to cross; the Seoul commitments turned that into corporate pledges; and the Global Call for AI Red Lines, launched at the UN General Assembly in 2025 by more than three hundred signatories, demanded that governments agree on enforceable limits by the end of 2026. The agreement is almost entirely about the what: autonomous replication, weapons-of-mass-destruction uplift, large-scale cyberattack, loss of meaningful human control.
What no one has built is the how. A red line is only a red line if someone can specify it precisely enough to test, observe a crossing without taking the developer's word for it, and attach a consequence when one occurs. Strip those away and you have a press release. The hard problem was never naming the lines; it is the institutional plumbing underneath them, and that plumbing, the paper argues, has to be assembled in a specific order, because the pieces are not independent.
Start with detection, because everything downstream depends on whether a measurement means anything at all.
The Detection Problem
Before anyone can enforce a red line, someone has to measure whether a model crossed it. Here are two frontier models on identical items, two method choices you would think were neutral, and the red line itself. Flip a method and the verdict moves while the models do not; switch the red line and the target moves out from under you.
Two frontier models that respond to scaffolding in opposite directions on identical items.
Both methods at their conventional defaults. Nothing about the model has changed.
At conventional defaults (multiple-choice items, single-turn, the most mature red line) the two frontier models read the same. The instrument is dull on purpose: this is what an evaluation looks like when nothing about the method is doing the talking.
Present the identical items as open-ended rather than multiple-choice and each band widens: the dial now reports how much format choice alone smears the score, not a verdict. The two models still overlap, so the gap between them stays at zero. Format costs you precision, not a ranking.
Step 03 · the harness alone
Same two models, same items. Swap single-turn for a map-reduce harness and they split: Model A loses ground, Model B gains it, and the two readings pull apart, in opposite directions, with no change to either model.
Biology is a mature red line; autonomous replication is a moving one. On RepliBench, component-task success climbed from under 5% in early 2023 to over 60% by summer 2025 (Black et al., 2025). When the measured capability moves this fast, the method knobs are second-order: there is no stable target to fix a threshold against.
You changed nothing about either model, only how you looked at them, and the two readings pulled 35.6 points apart. Roll that into a single composite number and the disagreement you just produced averages quietly away.
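To make the averaging concrete, here is a minimal Python sketch. The scores are invented for illustration and are not taken from Gringras (2026); they are chosen only so that the span matches the 35.6 points quoted above. The names `SCORES` and `scaffold_shift` are hypothetical.

```python
from statistics import mean

# Hypothetical red-line scores (0-100) for two models on identical items.
# Invented values: only the resulting 35.6-point span echoes the text.
SCORES = {
    "model_a": {"single_turn": 60.0, "map_reduce": 42.2},
    "model_b": {"single_turn": 60.0, "map_reduce": 77.8},
}


def scaffold_shift(per_harness):
    """Change in score when single-turn is swapped for map-reduce."""
    return per_harness["map_reduce"] - per_harness["single_turn"]


# Per-model shifts point in opposite directions.
shifts = {name: scaffold_shift(h) for name, h in SCORES.items()}

# A composite that averages the shift across models sees almost nothing...
mean_shift = mean(list(shifts.values()))

# ...while the distance between the two shifts is the whole story.
divergence = max(shifts.values()) - min(shifts.values())

print(f"per-model shifts: {shifts}")
print(f"mean shift across models: {mean_shift:.1f}")
print(f"divergence between models: {divergence:.1f}")
```

Reporting the per-model shifts alongside any composite keeps the opposite-direction effect visible instead of letting it cancel.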
▪ What each switch actually does, measured
Swing in a measured red-line score from question format alone: multiple-choice versus open-ended presentation of identical items.
35.6 points: the span of model × scaffold interactions, in opposing directions, wide enough that a composite score cannot reliably separate safer models from less safe ones. Two frontier models on identical items move this far apart under nothing but a change in agent scaffold (Gringras, 2026); Wei et al. (2026) replicate the configuration sensitivity in live agentic deployments, reporting an 8.6-point shift in attack-success rate from scaffold choice alone across three frameworks on a single model.
Under 5% to over 60%: RepliBench autonomous-replication component-task success, from early 2023 to summer 2025 (Black et al., 2025). The benchmark is reproducible; the capability it measures is moving faster than the methodology can fix a threshold to it.
Evaluation maturity is uneven across the red lines. Biological uplift is the most mature, tested bilaterally by the UK and prior US institutes against curated CBRN scenarios. Cyber and autonomous replication are only partial: Cybench and NIST CAISI’s universal-attack work have produced credible, reproducible benchmarks, but the trajectory is fast enough that current methodology cannot yet underwrite a threshold-based verdict.
The takeaway
A composite safety score this sensitive to method can't reliably tell a safe model from an unsafe one. That is why the paper makes measurement-science standardisation the Network's first-order deliverable: until the noise floor is characterised, a threshold-based verdict is false precision, not science.
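One way to read "false precision" is as a decision rule: a verdict is issued only when the whole uncertainty band sits on one side of the threshold. The sketch below assumes a symmetric band, uses hypothetical names (`Reading`, `verdict`) and invented numbers, and is an illustration of the idea, not the paper's procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    score: float       # point estimate, 0-100 scale
    half_width: float  # method-induced uncertainty (assumed symmetric)


def verdict(reading: Reading, threshold: float) -> str:
    """Issue a verdict only if the whole band clears or misses the threshold."""
    low = reading.score - reading.half_width
    high = reading.score + reading.half_width
    if low >= threshold:
        return "crossed"
    if high < threshold:
        return "not crossed"
    return "indeterminate"  # threshold falls inside the noise band


# Same point estimate, two different noise floors (values invented).
tight = Reading(score=50.0, half_width=2.0)
noisy = Reading(score=50.0, half_width=17.8)
print(verdict(tight, threshold=60.0))
print(verdict(noisy, threshold=60.0))
```

With the narrow band the reading sits clearly below the threshold; with the wide band the same point estimate can no longer support a verdict either way.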
An illustrative readout: the band is assembled from each paper’s reported effect sizes, not a live evaluation.
Effect sizes from Safety Under Scaffolding (Gringras, 2026): format and scaffold effects across 62,808 scored observations, six frontier models, four deployment configurations. Capability-trajectory figure from RepliBench (Black et al., 2025); agentic corroboration from ClawSafety (Wei et al., 2026).
Imagine the measurement problem fixed tomorrow. A verdict still needs a verifier: a body with the mandate to demand access, the technical depth to run the tests, and standing the rest of the world will accept. The paper's nominee is the International Network for Advanced AI Measurement, Evaluation and Science (the “Network”): the only arrangement today that combines state-level mandates, pre-deployment access to frontier models, and direct relationships with the labs. In eighteen months it has produced a universal jailbreak result on GPT-5, charted the autonomous-replication trajectory from below 5% to above 60% on RepliBench, and run joint testing across nine jurisdictions.
A verifier, though, is only as credible as its reach. And the Network's, for now, is uneven.
The Candidate
The paper's nominee to verify AI red lines is the International Network for Advanced AI Measurement, Evaluation and Science (renamed in late 2025 from the International Network of AI Safety Institutes, and referred to here as the Network). On the authors' reading, no other international arrangement holds the same combination its member institutes do: state-level mandates, plus pre-deployment access to frontier models, backed by direct technical relationships with the major developers. The reach, so far, is uneven; closing that gap is the work ahead.
The Network the paper nominates to verify AI red lines: eleven state-mandated institutes, grouped by capacity (highest capacity; operational mid-capacity; establishing · coordination · signatory) and coloured by mandate.
Each institute's profile lists its budget, its staff, the labs it has formal access to, and the tools it has shipped. The UK's institute is the field's largest, and the Network's Coordinator.
Step 03 · the concentration point
On a common budget axis the UK dwarfs every other member: concentrated capacity that the paper reads as evidence of achievability, not deficit.
The capacity is real, its reach uneven: one binding enforcer (the EU, the only member that can compel), two Global South footholds (Kenya, with India next), and no Chinese member.
A standard-setter needs both capacity and standing. The Network has capacity, concentrated in a handful of OECD states where its legitimacy also rests; that is the gap the FATF spent three decades and nine regional bodies closing. Building that reach is a precondition for enforcement, not a consequence of it.
Figures as of the paper's writing (2026), from its institutional mapping and Annex II; budgets approximate, some budgets and staffing undisclosed.
To see where a young and uneven network is headed, look at the one body that has already governed a problem of the same shape: global, dual-use, concentrated in a handful of jurisdictions, and policed without a treaty. The Financial Action Task Force shepherded the near-universal adoption of anti-money-laundering standards through four institutional goods: principle-based standard-setting, consent-based information-sharing (the Egmont model), regional bodies that diffuse the rules and confer legitimacy, and a grey-and-black-list mechanism that creates market consequences without any legal power to compel them.
Its thirty-five-year trajectory is the clearest evidence that the goods a red-lines regime needs are achievable without binding law. It is also a warning about timing.
Two Clocks
The paper reads the Network against the Financial Action Task Force, a soft-law body that bound the world to anti‑money‑laundering rules without a treaty. Line up the two clocks and the Network’s position is plain.
The FATF’s thirty-five-year arc, in six stages
The International Network, formalised 2024 and roughly eighteen months building, has reached Stage 3: the same developmental phase the FATF stood at in June 2000, before its blacklist. Stages 4 through 6 remain ahead of it.
The Financial Action Task Force opens as a G7 initiative: sixteen founding states, the Forty Recommendations inside its first year, nothing yet to enforce.
First mutual evaluations, a Secretariat at the OECD, working groups: all of it before any enforcement. The phase that took the FATF years has taken the Network months.
Step 03 · the Network reaches here, fast
On the FATF’s clock this is June 2000, eleven years in: the Non-Cooperative Countries and Territories (NCCT) blacklist, fifteen jurisdictions named to trigger consequences it had no authority to impose. The Network has reached the same phase in roughly three.
Step 04 · the blacklist collapses
Within six years the list was gone, discontinued in 2006. Naming jurisdictions to trigger consequences the regime had no standing to impose cost it legitimacy, not leverage. This is the error the paper’s sequencing argument exists to prevent.
The 2007 International Co-operation Review Group (ICRG) process rebuilt enforcement on quantified thresholds applied regardless of membership, foundations first and consequence second, and it held. That is the order the paper urges.
Today the FATF’s standards reach almost every jurisdiction, with sustained political support, the end state the right order made possible. The Network sits roughly where the FATF stood at its blacklist: its path ahead is the FATF’s second attempt, not its first.
The lesson
The FATF’s arc runs thirty-five years, and it proved the path works by reaching the end of it: three decades of building the goods and the legitimacy first, and only then the standing to impose consequences. After the 2000 blacklist collapsed, the 2007 ICRG rebuilt enforcement on objective, member-blind thresholds, and it held. The Network has had eighteen months, on a clock that may not allow three decades; the lesson isn’t ‘wait,’ it’s ‘build in the order that worked the second time, and skip the collapse in between.’
Before drawing the timing lesson, the FATF's record forces a harder one: what counts as success at all. By every institutional measure the regime is a triumph. By the measure of its stated purpose, it is hard to find any effect.
Score the regime on its own terms.
The Rational Myth
Over three decades the FATF achieved something close to total institutional success. Whether it actually reduced money laundering is a different question.
Same scale, two questions (0–100%, one axis)
Institutional success: did the regime take hold?
Near-total compliance. The regime took hold almost everywhere.
Over 200 jurisdictions have adopted the standards: near-universal reach.
40 members, plus nine regional review bodies (FSRBs).
25,000+ information exchanges a year run through the Egmont secure platform.
89% of relevant US investigations resulting in financial convictions drew on Bank Secrecy Act data.
Outcome effectiveness: did it work?
No measurable effect. No detectable decline in the crime it targets.
97% of assessed countries receive only low-to-moderate effectiveness ratings.
Measured decline in money laundering: none detected. No evidence that laundering has become harder or less prevalent (Nazzari & Reuter, 2025).
The evidence on prevalence, across three decades: flat. Compliance climbed; the evidence shows no decline.
Why compliance held anyway: a “rational myth”, a commitment states maintain for legitimacy even though listing showed no measurable financial bite (Case-Ruchala & Nance, 2024).
By every institutional measure the FATF won: technical compliance climbed from 36% in 2012 to 76% under the fourth round, with standards in over 200 jurisdictions, 40 members plus nine regional review bodies, 25,000+ Egmont exchanges a year, and Bank Secrecy Act data behind 89% of relevant US financial-conviction investigations.
Ask the other question and the record inverts: effectiveness scores average just 28% across about 120 assessed countries, 97% rated only low-to-moderate, and after three decades no evidence that laundering became harder or less prevalent (Nazzari & Reuter, 2025).
Step 03 · the two scales, one axis
Put both numbers on a single 0–100 axis and the myth becomes geometry: a 76% compliance bar towers over a 28% effectiveness bar, and the bracket spans the 48-point gap the regime built everything to close and never did.
Near-universal compliance, no measurable effect on the underlying crime: a commitment states keep for legitimacy even when no bite can be measured, Case-Ruchala & Nance’s ‘rational myth.’ The lesson the paper carries forward is to judge a regime by the institutional goods it produces, not outcomes the evidence cannot yet support.
Both scales, side by side: the institutional record the FATF built, and the outcome it could never show.
The reframe
Read the other way, the FATF is an existence proof: a common standard adopted almost everywhere, peer-reviewed mutual evaluation, a secure information channel in daily international use. The operational backbone of a global regime can be built, and was. What it could not show is that the backbone reduced the crime; its success was institutional, not outcome-based: a “rational myth” in Case-Ruchala & Nance’s sense, a commitment states maintain for legitimacy even though listing showed no measurable financial bite. The paper’s move follows directly: judge the AI Network by whether it produces the institutional goods any future enforcement would need (shared standards, comparable evaluation, credible information flows, legitimacy), not by outcomes the evidence cannot yet support.
Figures from the paper: technical compliance 36% (2012) to 76% under the fourth round (FATF, 2022); effectiveness scores average 28% with 97% low-to-moderate (Basel Institute on Governance, 2024); Bank Secrecy Act contribution (IRS, 2026); Nazzari & Reuter (2025); Case-Ruchala & Nance (2024).
Which returns us to the order. The FATF's institutional goods were not modular, and the regime nearly destroyed itself by reaching for the last one first, publishing a blacklist in 2000 before it had the legitimacy to make one stick. Build the AI regime in that same wrong order and it fails the same way; build it in the right one and the consequences finally have something to stand on.
The synthesis
Each layer unlocks the one above it. Enforcement, at the top, can be attempted at any time, whether or not the regime is ready. The order is the argument.
The Consequence Layer
Graduated escalation
Procurement conditionality → conditional pre-deployment access → compute-governance triggers. Each rung credible only because a graver one sits above it.
Reading the stack. The four lower layers are the institutional goods the FATF built over thirty-five years; the paper argues they transfer to the AI Network with surprising fidelity. Without shared definitions, scores aren’t comparable; without comparable scores, shared information is noise; without credible information, an enforcement signal dissolves.
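The dependency chain can be sketched as a toy model in which enforcement is always available but only holds once every lower layer is in place, in order. The layer names follow the four institutional goods named in the text; the class and method names (`Regime`, `activate`, `enforce`) are hypothetical, and the model illustrates the argument without claiming to reproduce the paper's analysis.

```python
# The four institutional goods, foundation first (order taken from the text).
LAYERS = (
    "shared standards",
    "comparable evaluation",
    "credible information flows",
    "legitimacy",
)


class Regime:
    """Toy model: each layer can only be built on the one beneath it."""

    def __init__(self):
        self.built = []

    def activate(self, layer):
        if len(self.built) == len(LAYERS):
            raise ValueError("all layers are already built")
        expected = LAYERS[len(self.built)]
        if layer != expected:
            raise ValueError(f"cannot build {layer!r} before {expected!r}")
        self.built.append(layer)

    def enforce(self):
        """Enforcement can always be attempted; it holds only on a full stack."""
        missing = LAYERS[len(self.built):]
        if missing:
            return "collapses: missing " + ", ".join(missing)
        return "holds"


premature = Regime()
print(premature.enforce())

sequenced = Regime()
for layer in LAYERS:
    sequenced.activate(layer)
print(sequenced.enforce())
```

The first regime enforces with nothing beneath it and collapses; the second builds foundation-first and holds, which is the whole of the sequencing claim.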
Premature enforcement
In June 2000, eleven years after its founding, the FATF published its first “Non-Cooperative Countries and Territories” blacklist: fifteen jurisdictions, named to trigger market consequences the FATF itself had no authority to impose. It exempted its own members; major financial centres such as Switzerland and Luxembourg went unexamined. Read as politically selective, the list collapsed under its legitimacy deficit within six years and was discontinued in 2006.
The FATF’s error was not that it enforced too slowly. It enforced too soon: before the foundations that make enforcement credible existed, and then built them retrospectively, under pressure.
A regime that holds
With definitions agreed, evaluation made comparable, information flowing under control, and legitimacy banked, a consequence finally has a foundation. The FATF reached this only after its 2000 collapse: the 2007 ICRG process tied listing to quantified thresholds, a structured observation period, and the same criteria for members and non-members alike. The gradient became credible because the regime had earned the standing to impose it.
This is the order the paper’s recommendations follow, and why it places the International Network at the FATF’s pre-enforcement moment, not its enforcement one.
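The member-blind rule described above can be sketched as a listing function that is handed a jurisdiction's membership status and deliberately never reads it. The threshold, the observation period, and every field name here are hypothetical placeholders, not the ICRG's actual criteria.

```python
from dataclasses import dataclass

# Hypothetical placeholders, not the ICRG's actual criteria.
MAX_DEFICIENCIES = 10     # quantified threshold
OBSERVATION_MONTHS = 12   # structured observation period


@dataclass(frozen=True)
class Jurisdiction:
    name: str
    is_member: bool
    deficiencies: int     # count of failed criteria at last review
    months_observed: int  # time since the deficiencies were flagged


def should_list(j: Jurisdiction) -> bool:
    """Apply the same quantified test to everyone; `is_member` is never read."""
    over_threshold = j.deficiencies > MAX_DEFICIENCIES
    observation_elapsed = j.months_observed >= OBSERVATION_MONTHS
    return over_threshold and observation_elapsed


member = Jurisdiction("A", is_member=True, deficiencies=14, months_observed=18)
outsider = Jurisdiction("B", is_member=False, deficiencies=14, months_observed=18)
print(should_list(member), should_list(outsider))
```

Two jurisdictions with identical records get identical outcomes regardless of membership, which is the property the 2000 blacklist lacked.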
The recommendation
The paper's recommendations follow the same dependency chain, sequenced across three phases. Each phase is justified by what it delivers in its own years, not as a down-payment on an enforcement era that may never arrive.
Year 0–1: Build the foundation
Year 1–3: Make findings commensurable
Year 3–5+: Authoritative findings
The honest caveat
The politically achievable may fall short of the analytically necessary. That is exactly why the foundation and standardisation work, the unglamorous first years, is designed to be worth doing on its own terms, whether or not the enforcement era ever arrives.