Recently, Li Ning, Flextronics Chair Professor and Chair of the Department of Leadership and Organization Management at Tsinghua University’s School of Economics and Management (Tsinghua SEM), led his team in publishing a paper titled “A Large-Scale Replication of Scenario-Based Experiments in Psychology and Management Using Large Language Models” in Nature Computational Science (Volume 5, Issue 8), a journal in the Nature Portfolio.

The cover of Volume 5, Issue 8 of Nature Computational Science
This cross-disciplinary study demonstrates that large language models (LLMs) can replicate the results of scenario-based experiments in psychology and management with a high degree of consistency with human responses. The research provides a systematic empirical foundation for applying AI technologies in the social sciences.
From human experiments to silicon-based experiments
The story began in September 2023. Cui Ziyan, a fourth-year doctoral student, and Zhou Huaikang, a postdoctoral researcher, were brainstorming with Li Ning. The group considered whether artificial intelligence (AI) could be used for survey questionnaires. They quickly dismissed the idea and shifted direction.
“That’s when it occurred to me: scenario-based experiments are built on hypothetical situations. Respondents don’t necessarily need personal experience of the scenarios. Could AI try it?” Li Ning recalled.
Over the next year, the team selected 156 scenario-based experiments published in the past decade in five top journals, including Organizational Behavior and Human Decision Processes and the Academy of Management Journal. They had three large language models (ChatGPT-4, Claude 3.5 Sonnet, and DeepSeek-V3) complete these experiments and compared the outcomes with human participants’ results.
The AI models underwent nearly 700 tests for main effects and over 160 for interaction effects, covering topics ranging from workplace behavior and personal decision-making to social psychology and team collaboration.

A group photo of Li Ning’s research team: Zhou Huaikang (left, second row), Li Ning (fourth from left, second row), and Cui Ziyan (second from right, second row)
There was no roadmap; every step was pieced together through data.
Cui explained that they "fed" the experimental materials into ChatGPT and checked whether the conclusions aligned with the original studies. Sometimes the AI struggled with concepts familiar to people; other times, it generated identical answers across runs. So the team tailored the materials for the AI, added explanations, and set constraints to simulate diverse populations.
"It’s like working with people—sometimes they skim and miss key information, so we highlight it," Li Ning noted.
The workload grew. Large-scale replication required calling the models through application programming interfaces (APIs), so the team recruited research assistants. After collecting the AI-generated responses, they applied analysis methods similar to those of the original studies. Since each experiment differed, they sought to reproduce the original analytical steps as closely as possible, using the same software.
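In outline, each replication run amounts to sending a scenario and a question to a model under a persona constraint and collecting many responses per condition. The sketch below illustrates that flow using the OpenAI Python SDK; the persona, scenario, question, model name, and sampling settings are illustrative assumptions, not the team's actual materials or protocol.

```python
# A minimal sketch of one replication run, assuming the OpenAI Python SDK (v1+).
# The persona, scenario, question, and model name are illustrative placeholders,
# not the team's actual materials or protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PERSONA = "You are a 34-year-old project manager at a mid-sized firm."  # hypothetical
SCENARIO = "Imagine your supervisor publicly credits a colleague for work you did."  # hypothetical
QUESTION = "On a scale of 1 (not at all) to 7 (very much), how unfair does this feel?"

def run_condition(persona: str, scenario: str, question: str, n: int = 50) -> list[str]:
    """Collect n simulated participant responses for one experimental condition."""
    responses = []
    for _ in range(n):
        reply = client.chat.completions.create(
            model="gpt-4",
            temperature=1.0,  # nonzero temperature so repeated calls are not identical
            messages=[
                {"role": "system", "content": persona},
                {"role": "user", "content": f"{scenario}\n\n{question}\nAnswer with a single number."},
            ],
        )
        responses.append(reply.choices[0].message.content.strip())
    return responses
```

A real pipeline would vary the persona across runs to simulate diverse populations, batch such calls, and log every response for later analysis, which is where the workload described next comes from.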
"The data cleaning, organizing, and analyzing with different software were enormous. We brought in more research assistants," Cui said. Altogether, the project used at least hundreds of millions of tokens.
Did they ever doubt the project’s direction? "We had no fixed expectations. Even if AI failed to replicate human experiments, that would still be a finding. We just kept going, fully absorbed in the work," Cui said.
AI performance exceeds expectations
The three models replicated the main effects of the original experiments with success rates between 73% and 81%. Even when full replication was not achieved, their "thinking direction" remained about 80% consistent with human results, like two people who differ on details but agree overall.
“This suggests that in the future, we might conduct trial experiments with AI before involving humans,” Li Ning said.
Entering the 2020s, AI for Science has emerged as the “fourth paradigm of research,” after theory, experiment, and simulation, with big data and AI promising faster scientific progress and discovery. For social scientists, the work of Li Ning's team offers a “fast lab” to quickly test hypotheses, saving time and cost before human studies. For businesses, it opens new possibilities in management practice.
"The current model is that scholars do research, publish, teach MBA students, who then apply it in companies. In the future, firms might bypass this chain, using AI to build digital twins and test how employees might react before making decisions," Li Ning said.
The research also uncovered a noteworthy phenomenon: LLMs systematically amplified effect sizes. All three models produced larger effects than the original experiments. More notably, when human experiments showed no significant effect, the AI models produced significant results in 68%–83% of cases.
"This may be because human experiments are inherently noisy. Minds are distracted by many thoughts. In contrast, large models, though called multiple times, remain essentially the same, yielding bigger between-group differences and smaller within-group differences," Li Ning explained.
When experiments involved sensitive social topics like race and gender, replication rates fell sharply. This reflects current AI's limitations in handling complex social issues. Even when told to ignore social norms, AI tended to choose ethically “safe” answers—reflecting constraints imposed by their developers.
"This is fascinating. Models are more ethical, by default, due to company safeguards," Li Ning said. This creates new challenges for follow-up research: calibrating effect sizes, improving simulation accuracy, designing methods for sensitive topics, and probing differences in AI-human cognition. Such work will refine computational social sciences, making it a complement—not a replacement—to human studies.
Cross-disciplinary impact is happening in unseen places
In August 2024, the team posted a preprint on arXiv’s Computer Science section. At that time, few studies of this scale existed. The paper drew global attention immediately, and researchers from computer science, psychology, and management reached out.
Soon after, Nature Computational Science invited the team to submit. “With our background in management, the invitation surprised us. We even checked if the journal was real,” Li Ning joked.
The peer-review process lasted half a year, with four or five revision rounds. Each deadline was tight, which the team saw as encouraging. "The reviewers’ comments gave us confidence; they were very clear," Cui said.

Li Ning’s team published their research paper in Nature Computational Science.
In this emerging field of AI and social sciences, Chinese scholars are moving from followers to contributors.
A tenured associate professor at Tsinghua SEM found the paper cited in MIT’s “AI for Science” course materials. A faculty member from Tsinghua's School of Journalism and Communication also heard the study referenced at an annual journalism conference hosted by Renmin University of China.
The study systematically validates AI’s role in the social sciences. Under certain conditions, computational methods can complement human experiments, particularly in hypothesis generation, pre-testing, and methodological validation. The evaluation metrics it introduces (replication success rate, directional consistency, and effect size comparison) provide quantitative standards for future research.
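As a rough illustration of how such metrics might be tallied across paired human and LLM results, consider the sketch below; the data structure and the success criterion are simplified assumptions for exposition, not the paper's exact operational definitions.

```python
# A simplified sketch of tallying the three metrics across studies. The
# success criterion below is an assumption for illustration, not the
# paper's exact operational definition.
from dataclasses import dataclass

@dataclass
class EffectResult:
    direction: int        # +1 or -1, sign of the estimated effect
    significant: bool     # e.g., p < .05 in that sample
    effect_size: float    # e.g., Cohen's d

def replication_metrics(human: list[EffectResult], llm: list[EffectResult]) -> dict:
    pairs = list(zip(human, llm))
    # Replication success: same direction and both significant (illustrative rule).
    replicated = sum(h.direction == a.direction and h.significant and a.significant
                     for h, a in pairs)
    # Directional consistency: signs agree, regardless of significance.
    same_direction = sum(h.direction == a.direction for h, a in pairs)
    # Effect-size comparison: mean ratio of |LLM| to |human| effect sizes.
    ratios = [abs(a.effect_size) / abs(h.effect_size)
              for h, a in pairs if h.effect_size != 0]
    return {
        "replication_rate": replicated / len(pairs),
        "directional_consistency": same_direction / len(pairs),
        "mean_effect_size_ratio": sum(ratios) / len(ratios),
    }
```

Feeding in one pair of results per tested effect would yield the kinds of aggregate rates reported above.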
"Its influence is cross-disciplinary; it may be shaping areas we don’t yet see," Li Ning said.