Seshat-NLP Dataset Pre-Release

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/10829961

Description:

This is a pre-release of Seshat-NLP, a dataset of labelled text segments derived from the Seshat Databank. These text segments were originally used in the Seshat Databank to justify the coding of historical "facts". A data point in the Seshat Databank describes a property of a past society at a certain time (or time range). We use these data points and their textual justifications to extract an NLP dataset of text segments accompanied by topic labels.

General Overview

The dataset is organised around unique text segments (i.e. each row holds one unique segment). These segments are connected with labels that designate the historical information contained in the text. Each segment has at least one 4-tuple of labels associated with it, but can have more. The labels are ("variable_name", "variable_id", "value", "polity_id"). Below is a simplified example row in our dataset (exemplary data!):

Description: Thebes was the capital …
Labels ("variable", "var_id", "value", "polity"): [("Capital", "…", "Thebes", "Egypt Middle Kingdom"), …]
Reference: {"Title": "The Oxford Encyclopedia of …", "Author": "…", "DOI": "…", …}

Note on Source Literature Text Segments

Our dataset partially consists of segments taken from scientific literature on history; we also pair these segments with labels that denote their content. We are currently looking into the legal considerations of releasing such data. In the meantime, we have added information to our dataset that allows the identification of the pertaining documents for each description.
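
The row structure described in the General Overview can be modelled roughly as follows. This is only a sketch in Python: the class names Label and Segment and the helper values_for are hypothetical and are not part of the release. The sample values are the exemplary ones quoted above, with the elided parts left elided.

```python
from dataclasses import dataclass, field
from typing import Dict, List, NamedTuple, Tuple


class Label(NamedTuple):
    """One 4-tuple of labels, in the order given in the description."""
    variable_name: str
    variable_id: str
    value: str
    polity_id: str


@dataclass
class Segment:
    """Hypothetical in-memory view of one row (one unique text segment)."""
    description: str
    labels: List[Label]  # at least one 4-tuple per segment
    reference: Dict[str, str] = field(default_factory=dict)

    def values_for(self, variable_name: str) -> List[Tuple[str, str]]:
        """Return (polity_id, value) pairs coded for the given variable."""
        return [
            (label.polity_id, label.value)
            for label in self.labels
            if label.variable_name == variable_name
        ]


# Exemplary data from the description; "…" marks parts elided there.
row = Segment(
    description="Thebes was the capital …",
    labels=[Label("Capital", "…", "Thebes", "Egypt Middle Kingdom")],
    reference={"Title": "The Oxford Encyclopedia of …", "Author": "…", "DOI": "…"},
)

print(row.values_for("Capital"))  # [('Egypt Middle Kingdom', 'Thebes')]
```

Keeping the labels as a list of tuples, rather than four separate fields, mirrors the point that one segment may justify several "facts".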

In Depth Explanation of the Dataset

List of files in the release:

Seshat_NLP.sql

This file is a PostgreSQL dump that can be used to instantiate the PostgreSQL table with all the data. The table zenodoexport has the following columns:

Column Name: Column Description
id: row identifier
description: textual justification of coded value
labels: labels for description
reference_information: information required to retrieve documents
description_hash: utility column
zodero_id: utility column

Hierarchy_graph.gexf

The hierarchy_graph.gexf file is an XML-based export of the hierarchy graph that can be used to tie variables to their hierarchical position in the Seshat codebook.

Explanation of Labels Column

The labels column contains a list of 4-tuples which, in order, denote "variable_name", "variable_id", "value", and "polity_id". We use this structure to allow a single segment/description to have multiple 4-tuples of labels. This is useful when the same description has been used to justify multiple "facts" in the original Seshat Databank. The variable_ids can be used to tie variable labels to nodes in the hierarchy of the Seshat codebook.
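
As an illustration of tying a variable_id to its position in the hierarchy, the sketch below reads a GEXF document with Python's standard xml.etree.ElementTree module and walks from a node up to the root. It rests on assumptions that should be checked against the actual file: that node ids in the graph equal the variable_ids used in the labels column, and that each edge points from a parent (source) to a child (target). The sample graph, its ids and labels, and the function names are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for hierarchy_graph.gexf. The real node ids, labels
# and edge direction are assumptions and must be checked against the file.
SAMPLE_GEXF = """
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="root" label="Codebook"/>
      <node id="sec_1" label="General variables"/>
      <node id="var_1" label="Capital"/>
    </nodes>
    <edges>
      <edge id="0" source="root" target="sec_1"/>
      <edge id="1" source="sec_1" target="var_1"/>
    </edges>
  </graph>
</gexf>
"""


def load_hierarchy(gexf_text):
    """Return (names, parent) dicts built from the GEXF nodes and edges."""
    root = ET.fromstring(gexf_text.strip())
    # "{*}" matches any XML namespace (Python 3.8+), so the GEXF version
    # used by the export does not need to be known in advance.
    names = {n.get("id"): n.get("label") for n in root.findall(".//{*}node")}
    # Assumption: source is the parent, target is the child.
    parent = {e.get("target"): e.get("source") for e in root.findall(".//{*}edge")}
    return names, parent


def hierarchy_path(variable_id, names, parent):
    """Walk from a node up to the root and return the labels, root first."""
    path, seen, node = [], set(), variable_id
    while node is not None and node not in seen:  # 'seen' guards against cycles
        seen.add(node)
        path.append(names.get(node, node))
        node = parent.get(node)
    return list(reversed(path))


names, parent = load_hierarchy(SAMPLE_GEXF)
print(hierarchy_path("var_1", names, parent))
# ['Codebook', 'General variables', 'Capital']
```

For the real file, the text passed to load_hierarchy would be read from hierarchy_graph.gexf. If the edges turn out to point from child to parent, the source and target roles in load_hierarchy would need to be swapped.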

Created:

2024-03-18
