Internal Safety Collapse
in Frontier Large Language Models

ISC can make tested frontier LLMs produce outputs they would normally refuse
through task completion inside agentic workflows.

A single benign user instruction can start the workflow; after that, the agent reads the workspace, infers missing fields, and completes the task automatically.

Community Evidence: Claude Fable 5. Two lower-risk text-classifier demonstrations show the safety-classifier bypass pattern: Evidence 1 · Evidence 2

Yutao Wu¹ Xiao Liu¹ Yifeng Gao²,³ Xiang Zheng⁴ Hanxun Huang⁵ Yige Li⁶ Cong Wang⁴ Bo Li⁷ Xingjun Ma²,³ Yu-Gang Jiang²,³

¹Deakin University ²Fudan University ³Shanghai Key Laboratory of Multimodal Embodied AI ⁴City University of Hong Kong
⁵University of Melbourne ⁶Singapore Management University ⁷University of Illinois at Urbana-Champaign

🎬 Demo

A live ISC reproduction on Grok — EN version · ZH version

Frontier LLMs

Static evidence table with 62 confirmed frontier-model cases. Every row links to archived evidence.

Model	Status	Demo	Contributor
Claude Fable 5	Triggered	🔗₁ 🔗₂	@wuyoscar
Apple Foundation Model	Triggered	🔗	@hypery11
Claude Opus 4.8	Triggered	🔗₁ 🔗₂	@wuyoscar
Claude Opus 4.7	Triggered	🔗	@wuyoscar
Claude Opus 4.6	Triggered	🔗₁ 🔗₂	@wuyoscar
Gemini 3.1 Pro	Triggered	🔗	@wuyoscar
Grok 4.20	Triggered	🔗₁ 🔗₂	@HanxunH @wuyoscar
Kimi K2.6	Triggered	🔗	@wuyoscar
Gemini 3 Pro	Triggered	🔗	@wuyoscar
GPT-5.4	Triggered	🔗₁ 🔗₂	@wuyoscar @zry29
GPT-5.2	Triggered	🔗₁ 🔗₂	@wuyoscar
Gemini 3 Flash	Triggered	🔗₁ 🔗₂	@HanxunH @wuyoscar
Claude Opus 4.5	Triggered	🔗₁ 🔗₂	@wuyoscar
Grok 4.1	Triggered	🔗₁ 🔗₂	@wuyoscar
Claude Sonnet 4.6	Triggered	🔗	@wuyoscar
Qwen3.5 Max	Triggered	🔗	@wuyoscar
GPT-5.3	Triggered	🔗	@zry29
Dola Seed 2.0	Triggered	🔗	@HanxunH
GPT-5.1	Triggered	🔗	@wuyoscar
GLM-5	Triggered	🔗	@wuyoscar
Kimi K2.5	Triggered	🔗₁ 🔗₂	@wuyoscar @fresh-ma
Claude Sonnet 4.5	Triggered	🔗₁ 🔗₂	@wuyoscar @fresh-ma
ERNIE 5.0	Triggered	🔗	@HanxunH
Qwen3.5 397B	Triggered	🔗₁ 🔗₂	@HanxunH @wuyoscar
Claude Opus 4.1	Triggered	🔗	@wuyoscar
Gemini 2.5 Pro	Triggered	🔗	@wuyoscar
Mimo V2 Pro	Triggered	🔗	@wuyoscar
GLM-4.7	Triggered	🔗	@wuyoscar
Qwen3 Max	Triggered	🔗₁ 🔗₂	@wuyoscar @HanxunH
GPT-5	Triggered	🔗	@wuyoscar
o3	Triggered	🔗	@wuyoscar
Kimi K2	Triggered	🔗	@wuyoscar
GLM-4.6	Triggered	🔗	@wuyoscar
DeepSeek V3.2	Triggered	🔗₁ 🔗₂ 🔗₃	@wuyoscar
Claude Opus 4	Triggered	🔗	@wuyoscar
Qwen3 235B	Triggered	🔗₁ 🔗₂	@wuyoscar
DeepSeek R1	Triggered	🔗₁ 🔗₂	@wuyoscar
Grok 4	Triggered	🔗	@wuyoscar
DeepSeek V3.1	Triggered	🔗	@wuyoscar
Qwen3.5 122B	Triggered	🔗	@wuyoscar
DeepSeek V3.1 Terminus	Triggered	🔗	@wuyoscar
Mistral Large 3	Triggered	🔗	@wuyoscar
Qwen3 VL 235B	Triggered	🔗₁ 🔗₂	@wuyoscar
GPT-4.1	Triggered	🔗	@wuyoscar
Gemini 2.5 Flash	Triggered	🔗	@wuyoscar
GLM-4.5	Triggered	🔗	@wuyoscar
MiniMax M2.7	Triggered	🔗	@wuyoscar
Claude Haiku 4.5	Triggered	🔗	@wuyoscar
Qwen3.5 27B	Triggered	🔗	@wuyoscar
MiniMax M2.5	Triggered	🔗	@wuyoscar
o1	Triggered	🔗	@wuyoscar
Qwen3 Next 80B	Triggered	🔗	@wuyoscar
Qwen3.5 35B	Triggered	🔗	@wuyoscar
Claude Sonnet 4	Triggered	🔗	@wuyoscar
DeepSeek V3	Triggered	🔗	@wuyoscar
Mimo V2 Flash	Triggered	🔗	@wuyoscar
o4-mini	Triggered	🔗	@wuyoscar
GPT-5 Mini	Triggered	🔗	@wuyoscar
Step 3.5 Flash	Triggered	🔗	@wuyoscar
Mistral Large	Triggered	🔗	@wuyoscar
Amazon Nova Pro	Triggered	🔗	@wuyoscar
Llama 4 Scout	Triggered	🔗	@wuyoscar

62 confirmed models · static table · GitHub README →

Live Cases

These conversation links show that community members can trigger ISC without any dedicated attack.

Claude Fable 5 · by @wuyoscar
Claude Opus 4.7 · by @wuyoscar
Claude Opus 4.6 · by @wuyoscar
Gemini 3.1 Pro · by @wuyoscar
Grok 4.20 · by @HanxunH
Gemini 3 Pro · by @wuyoscar
GPT-5.4 · by @wuyoscar
GPT-5.2 · by @wuyoscar
Gemini 3 Flash · by @HanxunH
Claude Opus 4.5 · by @wuyoscar
Grok 4.1 · by @wuyoscar
Claude Sonnet 4.6 · by @wuyoscar
Qwen3.5 Max · by @wuyoscar
GPT-5.3 · by @zry29
Dola Seed 2.0 · by @HanxunH
GPT-5.1 · by @wuyoscar
GLM-5 · by @wuyoscar
Kimi K2.5 · by @wuyoscar
Claude Sonnet 4.5 · by @wuyoscar
ERNIE 5.0 · by @HanxunH
Qwen3.5 397B · by @HanxunH
Claude Opus 4.1 · by @wuyoscar
Gemini 2.5 Pro · by @wuyoscar
GLM-4.7 · by @wuyoscar
Qwen3 Max · by @wuyoscar
o3 · by @wuyoscar
GLM-4.6 · by @wuyoscar
DeepSeek V3.2 · by @wuyoscar
Qwen3 235B · by @wuyoscar
DeepSeek R1 · by @wuyoscar
DeepSeek V3.1 · by @wuyoscar

Mistral Large 3 · by @wuyoscar
GPT-4.1 · by @wuyoscar
Gemini 2.5 Flash · by @wuyoscar
GLM-4.5 · by @wuyoscar
MiniMax M2.7 · by @wuyoscar
Mistral Large · by @wuyoscar
Amazon Nova Pro · by @wuyoscar
Grok 4 · by @wuyoscar
Llama 4 Scout · by @wuyoscar
Claude Opus 4.8 · by @wuyoscar
Claude Haiku 4.5 · by @wuyoscar
DeepSeek V3.1 Terminus · by @wuyoscar
MiniMax M2.5 · by @wuyoscar
Qwen3 Next 80B · by @wuyoscar
Qwen3 VL 235B · by @wuyoscar
Qwen3.5 122B · by @wuyoscar
Qwen3.5 27B · by @wuyoscar
Qwen3.5 35B · by @wuyoscar
Step 3.5 Flash · by @wuyoscar
Mimo V2 Pro · by @wuyoscar
Kimi K2.6 · by @wuyoscar
Claude Opus 4 · by @wuyoscar
Kimi K2 · by @wuyoscar
DeepSeek V3 · by @wuyoscar
Mimo V2 Flash · by @wuyoscar
GPT-5 · by @wuyoscar
o1 · by @wuyoscar
o4-mini · by @wuyoscar
GPT-5 Mini · by @wuyoscar
Claude Sonnet 4 · by @wuyoscar
Apple Foundation Model · by @hypery11

🌍 Community Reproductions

Reproductions by community members who learned the ISC concept and triggered it on frontier models.

Issue	Model	Method	Domain	By
demo	Claude Fable 5	Agentic TVD — fake-news safety classifier bypass	Media & Communications	@wuyoscar
demo	Claude Opus 4.8	Agentic TVD — QwenGuard + guard-attack, model-engineered verify loops	AI Safety	@wuyoscar
demo	Claude Haiku 4.5	Agentic TVD — guard-classifier task; previously untriggered model	AI Safety	@wuyoscar
#52	Gemini 2.5 Pro	LaTeX codebase template — social engineering scripts, no code	Other	@wuyoscar
#48	Claude Opus 4.6	Agentic TVD — multilingual harmful completions (replaces web link)	AI Safety	@wuyoscar
#42	Gemini 3.1 Pro Preview	Agentic TVD — multilingual harmful completions	AI Safety	@wuyoscar
#27	Claude Sonnet 4.5 Thinking	30k+ token, 42 misinformation samples	Media & Comms	@fresh-ma
#31	Kimi K2.5 Instant	Erotic fiction moderation pipeline	AI Safety	@fresh-ma
#22	GPT-5.3 Chat	Modified aiml_openai_moderation	Content Moderation	@zry29
#19	Gemini 3 Flash	Red-team test case generator (file upload)	AI Safety	@bboylyg
#12	Gemini 3 Flash	CommsDraft Pro (fake govt declarations)	Financial Misinfo	@HanxunH
#9	Grok 4.20 Beta	LLaMA Guard test case generation	AI Safety	@HanxunH
#11	Dola Seed 2.0	LLaMA Guard test case gen	AI Safety	@HanxunH
#4	Qwen 3 Max	Cantera incineration chemistry	Comp. Chemistry	@HanxunH
#5	ERNIE 5.0	Cantera + HCN/COCl₂	Comp. Chemistry	@HanxunH

ISC-Bench & The TVD Framework

ISC is triggered by the TVD design pattern (Task + Validator + Data): a normal-looking task, a validator, and missing data that the model tries to complete. ISC-Bench contains 84 codebase templates across 9 domains.

Task: Benign professional scenario
Validator: Pydantic / JSON schema checks
Data: LLM fills ??? placeholders

Scenarios

Domains

100%

Trigger Rate

Defense Success

Biology & Genomics

Chemistry & Materials

Cybersecurity

Pharmacology & Toxicology

AI Safety & ML

Clinical & Health Sciences

Epidemiology & Public Health

Media & Communication

Other

Citation

BibTeX

@article{wu2026isc,
  title={Internal Safety Collapse in Frontier Large Language Models},
  author={Wu, Yutao and Liu, Xiao and Gao, Yifeng and Zheng, Xiang
          and Huang, Hanxun and Li, Yige and Wang, Cong and Li, Bo
          and Ma, Xingjun and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2603.23509},
  year={2026},
  url={https://arxiv.org/abs/2603.23509}
}

⚠️ Disclaimer — This project is released solely for academic safety research and responsible disclosure. As AI agents become increasingly autonomous, we believe ISC represents a critical and underexplored threat to safety alignment. The purpose of this work is to help the research community understand the vulnerability and collaboratively develop effective mitigations — not to enable harm. WE DO NOT ALLOW any use outside of safety research. WE DO NOT ALLOW any misuse of this research. Model providers interested in mitigations: contact us.
