This paper maps the current landscape of LLM benchmarks through a scoping review, evaluating their applicability to safety-critical hazard analysis. A pilot study across nine safety scenarios revealed significant run-to-run inconsistency in LLM analytical quality: hazard identification scores varied between evaluation runs, while causal reasoning performance was consistently poor. This is essential reading for Management of Change (MOC) practitioners considering AI tools for change-related hazard reviews. The findings highlight that while LLMs show promise for supporting MOC hazard screenings, their inconsistent analytical quality poses significant challenges for safety assurance. Organizations must implement robust validation frameworks before deploying LLMs in safety-critical MOC decision workflows.
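To make the reported run-to-run inconsistency concrete, the sketch below shows one simple way an evaluator might quantify it: score the same scenario across repeated runs and flag results whose coefficient of variation exceeds a threshold. This is a minimal illustration under assumed conventions, not the paper's actual pipeline; the `run_consistency_check` helper, the 0-1 score scale, the example scores, and the 0.15 threshold are all hypothetical.

```python
import statistics

def run_consistency_check(scores: list[float], cv_threshold: float = 0.15) -> dict:
    """Summarize repeated-run scores for one scenario and flag
    excessive run-to-run variation via the coefficient of variation."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation
    cv = stdev / mean if mean else float("inf")
    return {
        "mean": round(mean, 3),
        "stdev": round(stdev, 3),
        "cv": round(cv, 3),
        "consistent": cv <= cv_threshold,
    }

# Hypothetical hazard-identification scores (0-1 scale) from five
# repeated runs of the same model on one MOC change scenario.
scores = [0.82, 0.55, 0.71, 0.48, 0.77]
print(run_consistency_check(scores))
```

A high coefficient of variation across runs would signal the kind of instability the pilot study observed, which is one argument for building repeated-run checks into any validation framework before LLM outputs feed into MOC decisions.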