Zhou Fan's Team at Shanghai University of Finance and Economics Releases StatEval, a Benchmark for Evaluating Large Language Models in Statistics

Publisher: 严继臧 | Release time: 2026-04-29

Professor Zhou Fan's team from the School of Statistics and Data Science at Shanghai University of Finance and Economics has officially released StatEval, the world's first comprehensive benchmark for evaluating the statistical reasoning capabilities of large language models (LLMs). It is designed to systematically assess model performance in statistical theory and reasoning. The benchmark contains 16,191 questions in total and covers two major levels, "fundamental knowledge" and "research frontiers," forming a complete assessment system that runs from undergraduate material through doctoral and research-level tasks.

Dataset Overview

StatEval organizes its questions along two axes, a "difficulty axis" and a "subject axis." It covers cross-disciplinary areas such as statistics, probability, econometrics, and machine learning, and spans undergraduate coursework, doctoral-level material, and cutting-edge research.

The foundational knowledge dataset contains 13,817 questions covering core statistical content from the undergraduate to the graduate level. It spans three major areas (probability theory, statistics, and machine learning) and multiple subfields. The dataset consists of 1,517 multiple-choice questions and 12,300 open-ended questions; the open-ended questions include calculation, short-answer, and proof questions. The questions are drawn primarily from: (1) 45 classic textbooks covering a complete curriculum; (2) more than a thousand manually verified graduate entrance exam questions and selected exercises; (3) publicly available courses from top international universities and high-quality open online resources.

The research-level dataset contains 2,374 questions, selected from 2,719 research papers published between 2020 and 2025 in 18 top academic journals in statistics and related fields (as shown in the figure below). It spans multiple cutting-edge fields, including statistics, probability theory, and machine learning. The questions focus primarily on theoretical derivation and proof tasks and are drawn from theorems, lemmas, and propositions in the papers. They concentrate on research problems with clear, quantifiable targets, such as determining constants, analyzing convergence rates, deriving the form of a distribution, and computing error bounds, which preserves the complexity and rigor of authentic scientific reasoning. In its subject structure, the dataset keeps the classification system of the foundational knowledge dataset and extends it to 8 research directions, including causal inference and experimental design, high-dimensional data modeling, deep learning, and reinforcement learning. A secondary classification system based on the type of theoretical property is also introduced, covering 8 categories of theoretical results, including convergence, distribution properties, generalization, and error bounds.

Data Construction Framework

The data processing framework of StatEval aims to achieve large-scale, automated, and highly reliable statistical data construction and quality control. It employs a multi-agent collaborative architecture, combined with large model inference, to realize an efficient, precise, and iteratively optimized data generation process:

1. Document Conversion Agent: Converts multi-source documents (PDF, scanned copies, LaTeX, etc.) into a unified structured text format, using multimodal models to fully preserve mathematical symbols and formula structures;

2. Context Segmentation Agent: Uses a dynamic regex matching framework driven by a large model to automatically identify theorems and lemmas together with their contextual definitions and assumptions, generating semantically coherent theoretical fragments;

3. Question Generation Agent: Uses reasoning-optimized models to extract content and restructure it into question-answer pairs that meet strict standards, ensuring that each question has appropriate difficulty, complete information, a unique answer, and a quantifiable means of verification;

4. Quality Control Agent: Independently reviews the logical consistency and theoretical rigor of each question-answer pair, filtering out potential errors.

Finally, human experts perform a last review, and their feedback forms a loop through which the system continuously absorbs high-quality examples to improve each agent's performance, combining automation with professional human supervision.
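The segmentation step (agent 2) can be illustrated with a minimal Python sketch. It is not the team's implementation: the article describes regex patterns generated dynamically by a large model, whereas this sketch uses one fixed, hand-written pattern. The environment names in ENV_NAMES and the function name extract_statements are assumptions made for illustration; real papers often define their own environments via \newtheorem.

```python
import re

# Assumed LaTeX environment names (hypothetical; papers may define others).
ENV_NAMES = ("theorem", "lemma", "proposition")

# Match \begin{env} ... \end{env} where both ends name the same environment.
STATEMENT_PATTERN = re.compile(
    r"\\begin\{(?P<env>" + "|".join(ENV_NAMES) + r")\}"
    r"(?P<body>.*?)"
    r"\\end\{(?P=env)\}",
    re.DOTALL,  # let the body span multiple lines
)


def extract_statements(latex: str) -> list[tuple[str, str]]:
    """Return (environment, body) pairs in document order."""
    return [
        (m.group("env"), m.group("body").strip())
        for m in STATEMENT_PATTERN.finditer(latex)
    ]


sample = r"""
Assume the observations are i.i.d.
\begin{lemma}
Let $X_n \to X$ in probability.
\end{lemma}
\begin{theorem}
The estimator converges at rate $n^{-1/2}$.
\end{theorem}
"""

statements = extract_statements(sample)
```

A fixed pattern like this ignores the surrounding definitions and assumptions (the first line of the sample is dropped), which is presumably why the described framework pairs pattern matching with a large model.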

Evaluation Framework

StatEval adopts a four-stage, process-based evaluation pipeline that assesses responses at a fine-grained level, from the reasoning process to the final conclusion. First, it identifies the key reasoning steps and logical chains in the model's response. Second, it extracts the intermediate results or symbolic expressions of each step. Third, an independent large model evaluator (LLM Judge) compares these with the reference solution to verify logical correctness, reasoning sufficiency, and consistency. Finally, scores are assigned on three dimensions, "reasoning accuracy," "step completeness," and "final answer correctness," and aggregated into a total score according to their weights. To enhance robustness, the system runs the evaluation three times with different random seeds and takes the lowest score as the final result.
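The final aggregation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the article names the three scoring dimensions but does not give their weights or score scale, so the values in WEIGHTS, the 0-to-1 scale, and the names JudgeScores, weighted_total, and final_score are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical weights; the article does not state the real ones.
WEIGHTS = {
    "reasoning_accuracy": 0.4,
    "step_completeness": 0.3,
    "final_answer_correctness": 0.3,
}


@dataclass(frozen=True)
class JudgeScores:
    """Scores from one judge run, each assumed to lie in [0, 1]."""
    reasoning_accuracy: float
    step_completeness: float
    final_answer_correctness: float


def weighted_total(scores: JudgeScores, weights: dict[str, float] = WEIGHTS) -> float:
    """Aggregate the three dimensions into one weighted score."""
    return sum(weights[name] * getattr(scores, name) for name in weights)


def final_score(runs: list[JudgeScores]) -> float:
    """Take the lowest weighted total across the repeated judge runs."""
    if not runs:
        raise ValueError("at least one judge run is required")
    return min(weighted_total(run) for run in runs)


# Three runs, standing in for three evaluations with different random seeds.
runs = [
    JudgeScores(1.0, 1.0, 1.0),
    JudgeScores(0.5, 1.0, 1.0),
    JudgeScores(1.0, 0.5, 0.0),
]
result = final_score(runs)  # the third run has the lowest total
```

Taking the minimum rather than the mean is a conservative choice: a response earns a high score only if every judge run agrees that it is sound.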

Experimental Results

The team evaluated well-known domestic and international open-source and closed-source models, including the GPT series, Gemini series, DeepSeek series, and Qwen series.

On the foundational knowledge dataset, the large language models differ significantly at both the undergraduate and graduate levels. Overall, closed-source models significantly outperform open-source models across all subject areas. GPT-5 ranks first with an average score of 82.85, demonstrating the strongest overall statistical reasoning ability. Among open-source models, Qwen3-235B achieves an overall average score of 76.96, gradually narrowing the gap with closed-source models, while models such as LLaMA-3.1-8B and DeepSeek-V3.1 perform relatively weakly. This indicates that model size, training optimization, and fit with statistics education content remain key factors in foundational reasoning performance.

On the research-level dataset, overall model performance drops significantly compared with the foundational knowledge dataset, and the gap between models on complex reasoning tasks widens further. Closed-source models, especially the GPT-5 series, maintain a leading position across all subfields and theoretical tasks. Within that series, the overall scores of GPT-5-mini and its optimized versions approach 60 points, demonstrating an initial, passing level of higher-order reasoning and theoretical validation capability. In contrast, open-source models such as Qwen perform weaker overall but show some potential on tasks related to probability and distribution properties.

Broken down by field, the models perform best on probability and statistics problems, while machine learning reasoning tasks remain challenging. Broken down by theoretical property, GPT-5 performs outstandingly on "identifiability and consistency" and "validity testing," while Gemini has certain advantages on tasks related to "distribution properties" and "structural guarantees." The evaluation results show that all current mainstream large models still fall short of the ideal level in statistical reasoning tasks and of the proof capabilities required for scientific research.

In summary, the release of StatEval marks a significant breakthrough in the evaluation of LLMs in the field of statistics. Even top-tier closed-source models still face challenges in research-level tasks, particularly in advanced machine learning theory, highlighting the necessity and potential for enhancing LLMs' statistical reasoning capabilities, while also providing a reference and standard for the future development of statistical AI tools.

StatEval is now officially open. We welcome teachers, students, and academic and industry partners interested in large models to contact Professor Zhou Fan. The research group will continue to release more scientific achievements in the field of large models in the future.

Website homepage: https://stateval.github.io

Paper link: https://gitee.com/StatEval/StatEval/raw/main/StatEval_V1.pdf

The StatEval dataset is now officially available on the Hugging Face platform.

If this project is helpful to your research, feel free to give it a like 👍. Your recognition will help promote and improve the project further!

Foundational knowledge dataset:

https://huggingface.co/datasets/0v01111/StatEval-Foundational-knowledge

Research-level dataset:

https://huggingface.co/datasets/0v01111/StatEval-Statistical-Research
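For readers who want to inspect the data programmatically, the following is a minimal sketch using the Hugging Face datasets library (installed with pip install datasets). The repository IDs are taken from the two links above. The helper name load_stateval and the short keys "foundational" and "research" are assumptions made for illustration, and the sketch makes no assumption about split names or column names, which should be checked after loading.

```python
# Repository IDs copied from the dataset links above.
STATEVAL_REPOS = {
    "foundational": "0v01111/StatEval-Foundational-knowledge",
    "research": "0v01111/StatEval-Statistical-Research",
}


def load_stateval(part: str):
    """Download one StatEval dataset from the Hugging Face Hub (needs network access)."""
    if part not in STATEVAL_REPOS:
        raise KeyError(f"unknown part {part!r}; expected one of {sorted(STATEVAL_REPOS)}")
    # Imported lazily so the mapping above is usable without the library installed.
    from datasets import load_dataset
    return load_dataset(STATEVAL_REPOS[part])
```

Calling load_stateval("research") and printing the returned object should list the available splits and columns. If the repository defines several configurations, load_dataset may additionally require a configuration name.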

Contact email: zhoufan@mail.shufe.edu.cn