Hermán, Judit, Kovács, Kíra Diána, Wang, Yajie and Vásárhelyi, Orsolya (ORCID: https://orcid.org/0000-0001-6326-0617) (2025) Evaluating the Gender Bias of Large Language Models in Academic Knowledge Production. In: European workshop on algorithmic fairness. Proceedings of Machine Learning Research (294). JMLR-Journal Machine Learning Research, San Diego, pp. 417-422.
Official URL: https://proceedings.mlr.press/v294/herman25a.html
Abstract
Gender inequality in science is a complex issue that affects every stage of a scientific career, from education to professional advancement [1]. Despite progress in recent decades, women remain underrepresented in most scientific fields, particularly in leadership roles and prestigious research positions [2]. On November 30, 2022, OpenAI released ChatGPT, an AI language model that has rapidly transformed how we communicate, learn, and conduct research [3]. Since then, numerous companies have launched their own models, integrating them into various products. ChatGPT's use has also expanded into knowledge production and scientific publication [4]. It holds promise for addressing long-standing inequalities in academia, such as assisting non-native English speakers in articulating their scientific discoveries more clearly and efficiently [5,6]. However, large language models (LLMs) have been shown to exhibit biases, failing to represent men and women equally in image generation [7,8,9,10]. They also generate factually incorrect responses and fabricate non-existent references [11,12]. In this paper, we examine references generated by the ChatGPT-4o model across 26 research areas within four main scientific domains: Physical Sciences, Health Sciences, Social Sciences, and Life Sciences [13]. Specifically, we designed a prompt that instructed ChatGPT to generate literature reviews in various subfields and provide references including authors' full names, article titles, journals, publication years, and DOIs. We then compared these references across research areas to OpenAlex, an open-source database containing over 250 million scientific publications [13]. Our analysis focuses on the publication years of cited articles, the gender composition of co-author teams, and the occurrence of hallucinated references provided by ChatGPT. Additionally, we assessed the consistency of literature reviews generated on the same topic. 
While previous studies have identified bias through small-sample analyses within specific subfields, our approach systematically compares hundreds of literature reviews across 26 subtopics, providing a broader evaluation and laying the groundwork for a standardized framework.
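
The reference check described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the outcome labels, the 0.8 title-similarity threshold, and the injected `lookup` callable are all assumptions made for illustration. The sketch assumes that a generated reference counts as verified only if its DOI resolves to a record whose title closely matches the generated title. The lookup is passed in as a function so the logic can be tried offline; in practice it could wrap a request to the OpenAlex works endpoint built by `openalex_url`.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional
from urllib.parse import quote

OPENALEX_WORKS = "https://api.openalex.org/works/"


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip a resolver prefix or 'doi:' label."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def openalex_url(doi: str) -> str:
    """Build the OpenAlex works URL for a DOI (no request is sent here)."""
    return OPENALEX_WORKS + "https://doi.org/" + quote(normalize_doi(doi), safe="/")


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two titles, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_reference(
    ref: dict,
    lookup: Callable[[str], Optional[dict]],
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> str:
    """Label a generated reference by comparing it with a database record.

    `lookup` takes a normalized DOI and returns a dict with a 'title' key,
    or None when the DOI is unknown to the database.
    """
    record = lookup(normalize_doi(ref["doi"]))
    if record is None:
        return "doi_not_found"
    if title_similarity(ref["title"], record["title"]) < threshold:
        return "title_mismatch"
    return "verified"


# Hypothetical stand-in for the database: DOI -> record.
fake_index = {
    "10.1000/xyz123": {"title": "Gender gaps in physics citations"},
}

generated = [
    {"doi": "https://doi.org/10.1000/XYZ123", "title": "Gender Gaps in Physics Citations"},
    {"doi": "10.1000/xyz123", "title": "Soil microbiome dynamics in wetlands"},
    {"doi": "10.9999/does-not-exist", "title": "An invented article"},
]

labels = [classify_reference(r, fake_index.get) for r in generated]
```

Here `fake_index` and the three `generated` entries are invented test data, not results from the study. Separating a missing DOI from a title mismatch is a design choice of this sketch: it keeps a fully fabricated reference distinct from one that attaches a real DOI to the wrong article.
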
| Item Type: | Book Section |
|---|---|
| Series Name: | Proceedings of Machine Learning Research |
| Uncontrolled Keywords: | LLM, Gender inequality, algorithmic bias, evaluation |
| Divisions: | Institute of Data Analytics and Information Systems; Corvinus Institute for Advanced Studies (CIAS) |
| Subjects: | Automation, mechanization; Sociology; Computer science |
| Funders: | European Research Executive Agency project LearnData |
| Projects: | 101086712E |
| ID Code: | 12702 |
| Deposited By: | MTMT SWORD |
| Deposited On: | 08 Apr 2026 15:15 |
| Last Modified: | 08 Apr 2026 15:15 |