Adversarial Circuit Evaluation

MATS Fellow:

Niels uit de Bos

Authors:

Niels uit de Bos, Adrià Garriga-Alonso

Citations

1 citation

Abstract:

Circuits are supposed to accurately describe how a neural network performs a specific task, but do they really? We evaluate three circuits found in the literature (IOI, greater-than, and docstring) in an adversarial manner, considering inputs where the circuit's behavior maximally diverges from the full model. Concretely, we measure the KL divergence between the full model's output and the circuit's output, calculated through resample ablation, and we analyze the worst-performing inputs. Our results show that the circuits for the IOI and docstring tasks fail to behave similarly to the full model even on completely benign inputs from the original task, indicating that more robust circuits are needed for safety-critical applications.
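The evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes the next-token distributions of the full model and of the circuit have already been computed for each input. The resample ablation step that produces the circuit's output is not implemented here. The names `kl_divergence`, `rank_by_divergence`, and the toy distributions are hypothetical and exist only for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary.

    q is clamped from below by eps to avoid division by zero (an assumption
    of this sketch, not something stated in the abstract).
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def rank_by_divergence(model_outputs, circuit_outputs):
    """Return (input_index, KL) pairs sorted so the worst-performing input comes first."""
    scores = [
        (i, kl_divergence(p, q))
        for i, (p, q) in enumerate(zip(model_outputs, circuit_outputs))
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Hypothetical toy data: one output distribution per input, over a 3-token vocabulary.
model_outputs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
circuit_outputs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.6, 0.3, 0.1]]

ranking = rank_by_divergence(model_outputs, circuit_outputs)
worst_index, worst_kl = ranking[0]
print(f"worst input: {worst_index}, KL = {worst_kl:.4f}")
```

Sorting by divergence rather than averaging it is what makes the evaluation adversarial: a circuit can match the full model on average while still diverging sharply on a few inputs, and those inputs appear at the top of the ranking.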

