This presentation examines whether frontier generative AI (GenAI) models have become reliable enough for real process safety management (PSM) work. It begins by defining hallucinations as plausible but incorrect outputs and emphasizes that such errors are especially dangerous in PSM, where a confidently stated mistake can drive an unsafe decision. The talk then explains three advances that improve reliability: chain-of-thought-style reasoning, the use of external tools such as Python for deterministic calculations, and reinforcement learning from human feedback (RLHF), which steers models toward more useful answers.
The core of the presentation is a benchmark study that runs four frontier models on two PSM tasks. The first task is deciding whether a proposed equipment substitution is a Replacement in Kind (RIK) or a change that requires Management of Change (MOC) review. The models agree on clear-cut cases, while nuanced cases reveal differences between them and show that allowing the model to ask clarifying questions improves the confidence and usefulness of its answers. The second task is feature extraction from piping and instrumentation diagrams (P&IDs), in which the models extract equipment, instruments, and lines from the drawings. Results show high consistency for equipment extraction, mixed performance for instruments, and weaker performance for line extraction. Overall, the conclusion is that GenAI has moved from novelty to a useful assistant for PSM, but it is not yet an autonomous decision-maker.
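One way to put a number on the "consistency" reported for the extraction task is pairwise set overlap between the tag lists that different models return for the same drawing. The sketch below is only an illustration under stated assumptions: the abstract does not say which metric the study used, Jaccard similarity is swapped in here as a common choice, and the model names and equipment tags are hypothetical placeholders rather than benchmark data.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two tag collections (1.0 means identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty extractions agree trivially
    return len(a & b) / len(a | b)


def pairwise_agreement(extractions):
    """Map each pair of model names to the Jaccard similarity of their tags."""
    scores = {}
    for (m1, t1), (m2, t2) in combinations(sorted(extractions.items()), 2):
        scores[(m1, m2)] = jaccard(t1, t2)
    return scores


def mean_agreement(extractions):
    """Average pairwise agreement across all models."""
    scores = pairwise_agreement(extractions)
    return sum(scores.values()) / len(scores) if scores else 1.0


# Hypothetical equipment tags, NOT results from the study.
example = {
    "model_a": ["P-101", "V-201", "E-301"],
    "model_b": ["P-101", "V-201", "E-301"],
    "model_c": ["P-101", "V-201"],
}

if __name__ == "__main__":
    for pair, score in pairwise_agreement(example).items():
        print(pair, round(score, 3))
    print("mean agreement:", round(mean_agreement(example), 3))
```

Running the same comparison separately for equipment, instruments, and lines would give one agreement score per feature type, which is one plausible way to express a pattern such as "high for equipment, weaker for lines". Agreement between models is not the same as correctness, so a comparison against a human-verified tag list would still be needed.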