Paper information in the topic of large language models
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/13118977
Description:
This dataset supports the findings in the preprint 'Academic collaboration on large language model studies increases overall but varies across disciplines.' The study aims to explore the application of large language models (LLMs) in scientific disciplines and their implications for interdisciplinary collaboration.
To build the LLM paper group, we start with a broad search using general terms related to LLMs and the names of popular models from the MMLU benchmark, covering October 2018 to September 2024. We restrict this search to titles and abstracts to avoid excessive noise in the dataset, then apply a series of filtering steps to enhance relevance and remove duplicates. The resulting dataset contains 59,293 papers.
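The search-and-filter steps described above can be sketched roughly as follows. This is only an illustration: the term list, the record fields (`title`, `abstract`), and the title-based duplicate check are all assumptions, since the actual search terms and filtering rules are not given in this description.

```python
import re

# Hypothetical search terms; the real term list is not given in this description.
SEARCH_TERMS = ["large language model", "llm", "gpt-4"]

def matches_terms(title, abstract, terms=SEARCH_TERMS):
    """Return True if any term starts at a word boundary in the title or abstract.

    Only a leading boundary is required, so plurals such as "LLMs" still match.
    """
    text = f"{title} {abstract}".lower()
    return any(re.search(r"\b" + re.escape(t.lower()), text) for t in terms)

def normalize_title(title):
    """Lowercase and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def filter_and_dedupe(papers, terms=SEARCH_TERMS):
    """Keep matching papers; drop later records whose normalized title was already seen."""
    seen = set()
    kept = []
    for paper in papers:
        title = paper.get("title") or ""
        abstract = paper.get("abstract") or ""
        if not matches_terms(title, abstract, terms):
            continue
        key = normalize_title(title)
        if key in seen:
            continue
        seen.add(key)
        kept.append(paper)
    return kept
```

Deduplicating on a normalized title is one simple choice among several; matching on a DOI or an OpenAlex identifier would be stricter where such fields are available.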
In addition to the LLM paper group, we establish two control groups. The first control group consists of machine learning (ML) papers. We select ML as a control because it is a well-established field from which LLM research emerged as a subfield. To construct this group, we collect a random sample of 70,945 papers containing the phrase "machine learning" in either their title or abstract. To provide a broader perspective beyond AI-related fields, we create a second control group consisting of a random sample of 73,110 papers from all other research categories, that is, papers that belong to neither the ML nor the LLM category.
The three files below contain the cleaned samples collected from OpenAlex, which are derived from the original files.
LLM: llm-cleaned-samples.csv
ML: ml-cleaned-samples.csv
Non-LLM/ML: non-llm-cleaned-samples.csv
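As a quick sanity check after downloading, the three cleaned sample files can be inspected with the Python standard library. The sketch below only counts data rows per group; it assumes each CSV starts with a header row and that the files sit together in one local directory (`data_dir` is a hypothetical path), neither of which is stated in this description.

```python
import csv
from pathlib import Path

# File names as listed above; the directory that holds them is an assumption.
CLEANED_FILES = {
    "LLM": "llm-cleaned-samples.csv",
    "ML": "ml-cleaned-samples.csv",
    "Non-LLM/ML": "non-llm-cleaned-samples.csv",
}

def count_rows(path):
    """Count data rows in a CSV file, assuming the first row is a header."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        return sum(1 for _ in reader)

def summarize(data_dir, files=CLEANED_FILES):
    """Return {group: row count} for each listed file found in data_dir."""
    counts = {}
    for group, name in files.items():
        path = Path(data_dir) / name
        if path.exists():
            counts[group] = count_rows(path)
    return counts
```

Calling `summarize("path/to/download")` returns a dictionary keyed by group name; files that are missing from the directory are skipped rather than raising an error.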
The three zip files below contain author affiliation information (including departmental discipline) extracted by GPT-4o-mini to support the departmental analysis in the paper:
LLM: llm-author-affiliations.zip
ML: ml-author-affiliations.zip
Non-LLM/ML: non-llm-author-affiliations.zip
The three files below contain the paper information used to support all the analyses in our paper:
LLM: llm-information-entropy.csv
ML: ml-information-entropy.csv
Non-LLM/ML: non-llm-information-entropy.csv
If you have any additional questions, please feel free to contact lingyaol@umich.edu or lydinh@usf.edu.
Created:
2024-12-30



