Published: 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

CATEGORIES

RISK-BASED PROCESS SAFETY ELEMENTS

Research Summary

This paper examines "Chain of Thought" (CoT) reasoning as a safety mechanism, making it highly relevant to the assurance and governance aspects of PSM (Category 11). CoT prompting requires an AI to lay out its reasoning steps explicitly before giving a final answer. In a PSM context (e.g., investigating an incident or validating a Management of Change request), this transparency is vital: it allows human engineers to audit the AI's logic for flaws, biases, or skipped safety checks. The paper argues that CoT offers a form of "monitorability" that can be engineered into the prompt structure to catch unsafe recommendations before they are acted upon, while cautioning, as the title signals, that this monitorability is fragile and must be deliberately preserved as models evolve. This connects the technical technique of prompt engineering (3b) directly to the safety assurance requirements of Element 11b (Explainability) and 11g (Operational Controls).
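The auditing idea described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not code from the paper): `build_cot_prompt` asks the model to show its safety reasoning before the recommendation, and `audit_reasoning` flags any required safety check the returned trace never mentions, so a human can intervene before the recommendation is acted on. The check names and function names are illustrative assumptions.

```python
# Hypothetical sketch of CoT monitorability in a PSM workflow.
# Names (build_cot_prompt, audit_reasoning, REQUIRED_CHECKS) are
# illustrative, not drawn from the paper.

REQUIRED_CHECKS = ["hazard identification", "relief sizing", "MOC approval"]

def build_cot_prompt(question: str) -> str:
    """Ask the model to expose its reasoning steps before answering."""
    return (
        "Think step by step. List every safety check you perform, "
        "then give your final recommendation on the last line.\n\n"
        f"Question: {question}"
    )

def audit_reasoning(trace: str, required=REQUIRED_CHECKS) -> list:
    """Return the required safety checks the reasoning trace never mentions."""
    lowered = trace.lower()
    return [check for check in required if check.lower() not in lowered]

# Example: a fabricated model response that skips the MOC approval step.
trace = (
    "Step 1: hazard identification shows an overpressure risk.\n"
    "Step 2: relief sizing is adequate.\n"
    "Final recommendation: proceed with the change."
)
missing = audit_reasoning(trace)
print(missing)  # -> ['MOC approval']
```

In practice the monitor would be more than a keyword scan, but the structure is the point: the prompt engineers the reasoning into view, and a separate check gates the recommendation on that reasoning.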

AUTHORS

R. Greenblatt and B. Shlegeris

CITATIONS

R. Greenblatt and B. Shlegeris, "Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety," arXiv preprint arXiv:2507.11473, 2025.
