Evaluating large language models on quantum mechanics: a comparative study across diverse models and tasks

Sreekantham, Rithvik Kumar

doi:10.20944/preprints202511.0889.v1

Source

Preprints.org

Date Issued

2025-11-01

Author(s)

Sreekantham, Rithvik Kumar

DOI

10.20944/preprints202511.0889.v1

Abstract

We present a systematic evaluation of large language models on quantum mechanics problem-solving. Our study evaluates 15 models from five providers (OpenAI, Anthropic, Google, Alibaba, DeepSeek) spanning three capability tiers on 20 tasks covering derivations, creative problems, non-standard concepts, and numerical computation, comprising 900 baseline and 75 tool-augmented assessments. Results reveal clear tier stratification: flagship models achieve 81% average accuracy, outperforming mid-tier (77%) and fast models (67%) by 4pp and 14pp respectively. Task difficulty patterns emerge distinctly: derivations show the highest performance (92% average, 100% for flagship models), while numerical computation remains the most challenging (42%). Tool augmentation on numerical tasks yields task-dependent effects: a modest overall improvement (+4.4pp) at 3x token cost masks dramatic heterogeneity, ranging from +29pp gains to -16pp degradation. Reproducibility analysis across three runs quantifies 6.3pp average variance, with flagship models demonstrating exceptional stability (GPT-5 achieves zero variance) while specialized models require multi-run evaluation. This work contributes: (i) a benchmark for quantum mechanics with automatic verification, (ii) a systematic evaluation quantifying tier-based performance hierarchies, (iii) an empirical analysis of tool augmentation trade-offs, and (iv) a reproducibility characterization. All tasks, verifiers, and results are publicly released.
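The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, and the decomposition of the 900 baseline assessments as models x tasks x runs is an assumption that happens to match the quoted numbers, not a protocol confirmed by the abstract.

```python
# Average accuracy per capability tier, in percent, as quoted in the abstract.
tier_accuracy = {"flagship": 81, "mid-tier": 77, "fast": 67}


def gap_pp(reference: str, other: str) -> int:
    """Return the accuracy gap between two tiers in percentage points (pp)."""
    return tier_accuracy[reference] - tier_accuracy[other]


# Assumption (not stated explicitly in the abstract): every model attempts
# every task once in each of the three runs.
models, tasks, runs = 15, 20, 3
baseline_assessments = models * tasks * runs

print(gap_pp("flagship", "mid-tier"))  # 4
print(gap_pp("flagship", "fast"))      # 14
print(baseline_assessments)            # 900
```

The gaps are differences of percentages, which is why they are reported in percentage points rather than as relative improvements.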

Publication link

https://doi.org/10.20944/preprints202511.0889.v1

URI

https://repository.iitgn.ac.in/handle/IITG2025/33512

Subjects

Large language models

Quantum mechanics

Benchmark

Tool augmentation

Reproducibility

Model evaluation

Scientific problem-solving

Computational physics