
HiST-LLM

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/14671247
Description:
Large Language Models' Expert-level Global History Knowledge Benchmark (HiST-LLM)

Large Language Models (LLMs) have the potential to transform humanities and social science research, yet their history knowledge and comprehension at a graduate level remain untested. Benchmarking LLMs in history is particularly challenging, given that human knowledge of history is inherently unbalanced, with more information available on Western history and recent periods.

We introduce the History Seshat Test for LLMs (HiST-LLM), based on a subset of the Seshat Global History Databank, which provides a structured representation of human historical knowledge, containing 36,000 data points across 600 historical societies and over 2,700 scholarly references. This dataset covers every major world region from the Neolithic period to the Industrial Revolution and includes information reviewed and assembled by history experts and graduate research assistants.

Using this dataset, we benchmark a total of seven models from the Gemini, OpenAI, and Llama families. We find that, in a four-choice format, LLMs have a balanced accuracy ranging from 33.6% (Llama-3.1-8B) to 46% (GPT-4-Turbo), outperforming random guessing (25%) but falling short of expert comprehension. LLMs perform better on earlier historical periods. Regionally, performance is more even but still better for the Americas and lowest in Oceania and Sub-Saharan Africa for the more advanced models. Our benchmark shows that while LLMs possess some expert-level historical knowledge, there is considerable room for improvement.

Dataset links
Dataset Repository (GitHub)
Croissant Metadata (GitHub)

Usage
This dataset can be used to benchmark LLMs on their expert-level history knowledge. Loading the dataset using Python and Pandas:

    import pandas as pd
    main = pd.read_parquet("Neurips_HiST-LLM.parquet")
    ref = pd.read_parquet("references.parquet")

Dataset metadata
Dataset metadata is documented in the croissant.json file.
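The description reports balanced accuracy against a 25% random-guessing baseline for the four-choice format. The sketch below shows one common definition of balanced accuracy (the mean of per-class recall); whether the paper computes it exactly this way is an assumption, and the answer labels and predictions are invented for illustration.

```python
from collections import defaultdict


def balanced_accuracy(expected, predicted):
    """Mean per-class recall over the classes that appear in `expected`.

    Assumption: this is the standard definition; the paper's exact
    procedure is not given in this description.
    """
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("no answers to score")
    hits = defaultdict(int)    # correct predictions per expected class
    totals = defaultdict(int)  # number of items per expected class
    for exp, pred in zip(expected, predicted):
        totals[exp] += 1
        if exp == pred:
            hits[exp] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Invented four-choice answers (hypothetical, not taken from the dataset).
expected = ["A", "A", "A", "A", "B", "B", "C", "D"]
predicted = ["A", "A", "A", "B", "B", "C", "C", "A"]

score = balanced_accuracy(expected, predicted)

# A model that always answers "A" gets full recall on one class and zero on
# the other three, so it lands on the four-choice chance level.
constant_guess = balanced_accuracy(expected, ["A"] * len(expected))

print(f"balanced accuracy: {score:.4f}")
print(f"always-'A' baseline: {constant_guess:.4f}")
```

Under this definition, a model cannot raise its score by favouring the most frequent answer option, which is why the metric suits a benchmark whose answer classes may be unevenly distributed.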
Model Fingerprints
When model fingerprints are available, we created an extra column for each model fingerprint. These columns are named using the following pattern: _.

Column Descriptions
additional_review (Boolean): Whether the data point underwent additional expert review. See Section 3.2 of the paper.
Q: The multiple-choice question.
A: The expected completion of the prompt.
polity old id: ID of the polity according to Seshat IDs.
start year str: String for when the polity started existing (in BCE/CE format).
end year str: String for when the polity stopped existing (in BCE/CE format).
start year int: Int for when the polity started existing (in BCE/CE format).
end year int: Int for when the polity stopped existing (in BCE/CE format).
name: Polity name.
nga: Natural Geographic Area of the polity.
world_region: The world region of an NGA (based on the UN regions with some modifications).
category: Immediate parent category of the fact in the Seshat codebook.
root cat: Major category of the fact.
value: Value of the data point.
variable: Variable of the data point.
id: Request ID for OpenAI batch requests.
description: Description provided by RAs for the fact.
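The columns described above can be used to slice results by region or by review status. The following is a minimal sketch on an in-memory frame rather than the real parquet file: the rows, the region values, and the `model_answer` column are all hypothetical stand-ins (the actual per-model column names follow the fingerprint pattern, which is not spelled out here).

```python
import pandas as pd

# Hypothetical rows standing in for Neurips_HiST-LLM.parquet; all values are invented.
main = pd.DataFrame(
    {
        "world_region": ["Europe", "Europe", "Oceania", "Oceania", "Oceania"],
        "additional_review": [True, False, True, True, False],
        "A": ["present", "absent", "present", "absent", "present"],
        # Hypothetical column holding one model's answers (not a real column name).
        "model_answer": ["present", "present", "present", "absent", "absent"],
    }
)

# Mark each row as correct when the model's answer matches the expected completion.
main["correct"] = main["A"] == main["model_answer"]

# Plain accuracy per world region.
per_region = main.groupby("world_region")["correct"].mean()

# Restrict to data points that underwent additional expert review.
reviewed = main[main["additional_review"]]

print(per_region)
print(f"reviewed rows: {len(reviewed)}")
```

With the real file, `main` would come from `pd.read_parquet("Neurips_HiST-LLM.parquet")` as shown in the usage note, and the comparison column would be whichever model column is being evaluated.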
Created:
2025-01-16