Seshat-NLP Dataset Pre-Release
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10829961
Description:
This is a pre-release of Seshat-NLP, a dataset of labelled text segments derived from the Seshat Databank. These text segments were originally used in the Seshat Databank to justify the coding of historical "facts". A data point in the Seshat Databank describes a property of a past society at a certain time (or time range). We use these data points, together with their textual justifications, to extract an NLP dataset of text segments accompanied by topic labels.
General Overview
The dataset is organised around unique text segments: each row holds one unique segment. Each segment is connected to labels that designate the historical information contained in the text. Every segment has at least one 4-tuple of labels associated with it, and may have more. The labels are ("variable_name", "variable_id", "value", "polity_id").
Below is a simplified example row from our dataset (illustrative data only):
| Description | Labels ("variable", "var_id", "value", "polity") | Reference |
| --- | --- | --- |
| Thebes was the capital … | [("Capital", "…", "Thebes", "Egypt Middle Kingdom"), …] | {"Title": "The Oxford Encyclopedia of …", "Author": "…", "DOI": "…", …} |
Note on Source Literature Text Segments
Our dataset partially consists of segments taken from scientific literature on history, and we pair these segments with labels that denote their content. We are currently looking into the legal considerations of releasing such data. In the meantime, we have added information to our dataset that allows the source documents for each description to be identified.
In Depth Explanation of the Dataset
List of files in the release:
Seshat_NLP.sql
This file is a PostgreSQL dump that can be used to instantiate the PostgreSQL table with all the data. The table zenodoexport has the following columns:
| Column Name | Column Description |
| --- | --- |
| id | row identifier |
| description | textual justification of coded value |
| labels | labels for description |
| reference_information | information required to retrieve documents |
| description_hash | utility column |
| zodero_id | utility column |
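As a minimal sketch of how the dump might be restored, the snippet below assembles the command lines one could run. It assumes that Seshat_NLP.sql is a plain-SQL dump (as the .sql extension suggests) and that the standard PostgreSQL client tools createdb and psql are installed. The database name seshat_nlp and the helper restore_commands are hypothetical and not part of the release.

```python
import shlex


def restore_commands(dump_path="Seshat_NLP.sql", db_name="seshat_nlp"):
    """Return argv lists that create a database, load the dump, and peek at the table.

    db_name is a hypothetical name chosen for illustration.
    """
    preview_query = "SELECT id, description, labels FROM zenodoexport LIMIT 5;"
    return [
        ["createdb", db_name],                     # create an empty database
        ["psql", "-d", db_name, "-f", dump_path],  # replay the plain-SQL dump
        ["psql", "-d", db_name, "-c", preview_query],  # show a few rows
    ]


# Print the commands instead of running them, so nothing is executed by accident.
for argv in restore_commands():
    print(shlex.join(argv))
```

Each list can be passed to subprocess.run(argv, check=True) once the printed commands look right for the local PostgreSQL setup.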
Hierarchy_graph.gexf
The hierarchy_graph.gexf file is an XML-based export of the hierarchy graph. It can be used to tie variables to their hierarchical position in the Seshat codebook.
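A GEXF file can be read with the Python standard library alone. The sketch below collects node labels and parent links, then walks from a node up to the root. It rests on several assumptions: the miniature GEXF document, its node ids and labels are hypothetical, and edges are assumed to point from parent to child, which should be checked against the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature GEXF document; the real file's ids, labels and
# namespace version may differ.
SAMPLE_GEXF = """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="root" label="Codebook"/>
      <node id="sec1" label="Section"/>
      <node id="var1" label="Capital"/>
    </nodes>
    <edges>
      <edge id="0" source="root" target="sec1"/>
      <edge id="1" source="sec1" target="var1"/>
    </edges>
  </graph>
</gexf>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across GEXF versions."""
    return tag.rsplit("}", 1)[-1]


def read_hierarchy(root):
    """Return (labels, parents): node id -> label, and child id -> parent id."""
    labels, parents = {}, {}
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "node":
            labels[elem.get("id")] = elem.get("label")
        elif name == "edge":
            # Assumption: edges run from parent (source) to child (target).
            parents[elem.get("target")] = elem.get("source")
    return labels, parents


def path_to_root(node_id, parents):
    """Follow parent links upwards, stopping at the root or on a cycle."""
    path, seen = [node_id], {node_id}
    while path[-1] in parents and parents[path[-1]] not in seen:
        path.append(parents[path[-1]])
        seen.add(path[-1])
    return path


# For the real file: root = ET.parse("hierarchy_graph.gexf").getroot()
root = ET.fromstring(SAMPLE_GEXF)
labels, parents = read_hierarchy(root)
print([labels[n] for n in path_to_root("var1", parents)])
```

Matching on the local tag name avoids hard-coding a GEXF namespace, since the version used by the export is not stated in the release notes.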
Explanation of Labels Column
The labels column contains a list of 4-tuples whose elements denote, in order, "variable_name", "variable_id", "value", and "polity_id". This structure allows a single segment (description) to carry multiple 4-tuples of labels, which is useful when the same description has been used to justify multiple "facts" in the original Seshat Databank. The variable_ids can be used to tie variable labels to nodes in the hierarchy of the Seshat codebook.
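The 4-tuple structure can be unpacked as sketched below. This assumes the labels cell is available as text that looks like a Python list of tuples, as in the simplified example row; the actual serialisation in the dump may differ (for instance JSON or a PostgreSQL array) and should be checked first. The Label type, the parse_labels helper and all placeholder values are hypothetical.

```python
import ast
from typing import List, NamedTuple


class Label(NamedTuple):
    """One 4-tuple from the labels column, in the documented order."""
    variable_name: str
    variable_id: str
    value: str
    polity_id: str


def parse_labels(raw: str) -> List[Label]:
    """Parse a labels cell, assumed to be a Python-literal list of 4-tuples."""
    # literal_eval only evaluates literals, so it does not execute arbitrary code.
    return [Label(*item) for item in ast.literal_eval(raw)]


# Placeholder cell with two label tuples attached to the same description.
raw_cell = (
    '[("VARIABLE_A", "VAR_ID_A", "VALUE_A", "POLITY_A"), '
    '("VARIABLE_B", "VAR_ID_B", "VALUE_B", "POLITY_A")]'
)

for label in parse_labels(raw_cell):
    # variable_id is the key for looking up the node in the hierarchy graph.
    print(label.variable_id, label.variable_name, label.value, label.polity_id)
```

A named tuple keeps the positional order of the release while letting downstream code refer to fields by name.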
Created: 2024-03-18



