Auxiliary Datasets for Speaker Disambiguation in Quotebank
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/8033672
Description:
This data repository contains the auxiliary datasets necessary for enriching and preprocessing Quotebank, a large dataset of unique, speaker-attributed quotations. The scripts utilizing the data can be found in the quotebank-toolkit GitHub repository. The datasets are stored in data.zip and are described below:
quotebank_disambiguation_mapping_quote.parquet
Provides the `quoteID` -> `speakerQID` mapping for each quotation, disambiguating ambiguous speaker names and linking them to their respective Wikidata items. The schema of the dataset is as follows:
|-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
|-- speaker: Wikidata ID corresponding to the speaker of the quotation
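The `quoteID` format above can be checked and split into its parts with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `parse_quote_id` and `format_quote_id` are hypothetical, and it assumes the counter has at least six digits, as the `06d` padding suggests.

```python
import re
from datetime import date
from typing import NamedTuple

# "YYYY-MM-DD-{increasing int:06d}": a date followed by a zero-padded counter.
# Assumption: the counter has at least six digits (06d pads, it does not truncate).
_QUOTE_ID_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})-(\d{6,})")


class QuoteID(NamedTuple):
    day: date   # date part of the identifier
    index: int  # increasing integer part of the identifier


def parse_quote_id(quote_id: str) -> QuoteID:
    """Split a quoteID into its date and counter; raise ValueError if malformed."""
    match = _QUOTE_ID_PATTERN.fullmatch(quote_id)
    if match is None:
        raise ValueError(f"unexpected quoteID format: {quote_id!r}")
    year, month, day, index = (int(group) for group in match.groups())
    return QuoteID(date(year, month, day), index)


def format_quote_id(day: date, index: int) -> str:
    """Inverse of parse_quote_id: rebuild the identifier string."""
    return f"{day.isoformat()}-{index:06d}"
```

Parsing the identifier this way makes it possible to group or sort quotations by date without loading any other column.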
The mapping is created using heuristics described in the following paper:
Marko Čuljak, Andreas Spitz, Robert West, and Akhil Arora
"Strong Heuristics for Named Entity Linking"
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop
10.18653/v1/2022.naacl-srw.30
self_quotations_filtered.parquet
Contains the identifiers of the quotations that were determined not to be self-attributed. The schema of the dataset is as follows:
|-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
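One plausible use of this file is to restrict a quotation table to the listed identifiers. The sketch below uses pandas on toy data; the function name `keep_listed_quotes` and the `quotation` column are assumptions for illustration, and whether the listed IDs are the ones to keep should be checked against the quotebank-toolkit scripts.

```python
import pandas as pd


def keep_listed_quotes(quotes: pd.DataFrame, listed_ids: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows of `quotes` whose quoteID appears in `listed_ids`."""
    keep = quotes["quoteID"].isin(listed_ids["quoteID"])
    return quotes.loc[keep].reset_index(drop=True)


# Toy frames standing in for Quotebank and self_quotations_filtered.parquet.
# With the real file one would load it with, e.g.,
# pd.read_parquet("self_quotations_filtered.parquet") (needs pyarrow or fastparquet).
quotes = pd.DataFrame(
    {
        "quoteID": ["2019-03-07-000001", "2019-03-07-000002", "2019-03-08-000001"],
        "quotation": ["first quote", "second quote", "third quote"],  # hypothetical column
    }
)
listed = pd.DataFrame({"quoteID": ["2019-03-07-000001", "2019-03-08-000001"]})

filtered = keep_listed_quotes(quotes, listed)
```

Using `isin` rather than a merge keeps the row order of the quotation table and avoids duplicating rows if an identifier were listed twice.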
speaker_attributes.parquet
Contains attributes, extracted from Wikidata, of all the speakers appearing in Quotebank. The schema of the dataset is as follows:
|-- id: Wikidata item QID of the speaker, primary key
|-- aliases: list of speaker's aliases
|-- date_of_birth: list of the speaker's possible dates of birth
|-- nationality: list of speaker's nationalities
|-- gender: list of speaker's previous or current genders
|-- lastrevid: ID of the last revision of the speaker's item
|-- ethnic_group: list of ethnic groups the speaker belongs to
|-- US_congress_bio_ID: identifier for the speaker in the Biographical Directory of the United States Congress
|-- occupation: list of speaker's occupations
|-- party: list of parties the speaker is/was affiliated with
|-- academic_degree: list of academic degrees obtained by the speaker
|-- label: Wikidata label of the speaker
|-- candidacy: list of the speaker's candidacies in political elections
|-- type: type of the Wikidata entry (value is `item` for all the speakers)
|-- religion: previous/current religious affiliations of the speaker
Using the `id` field, which holds the Wikidata QID of a speaker, this dataset can be easily joined with the disambiguated Quotebank obtained by running the `cleanup_disambiguate.py` script available in the aforementioned quotebank-toolkit repository.
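The join described above might look as follows in pandas. This is a sketch on toy data, not the toolkit's own code: the function name `attach_speaker_attributes` is hypothetical, the QIDs and attribute values are placeholders, and it assumes the disambiguated quotations carry the speaker QID in a column named `speaker`, as in the mapping schema.

```python
import pandas as pd


def attach_speaker_attributes(
    quotes: pd.DataFrame, attributes: pd.DataFrame, columns: list[str]
) -> pd.DataFrame:
    """Left-join selected speaker attributes onto quotations via the speaker QID."""
    subset = attributes[["id", *columns]]
    merged = quotes.merge(
        subset,
        how="left",               # keep quotations whose speaker has no attribute row
        left_on="speaker",        # assumed column holding the speaker QID
        right_on="id",
        validate="many_to_one",   # `id` is a primary key, so it must be unique
    )
    return merged.drop(columns="id")


# Toy frames standing in for disambiguated Quotebank and speaker_attributes.parquet.
quotes = pd.DataFrame(
    {
        "quoteID": ["2019-03-07-000001", "2019-03-07-000002", "2019-03-08-000001"],
        "speaker": ["Q1", "Q2", "Q999"],  # placeholder QIDs
    }
)
attributes = pd.DataFrame(
    {
        "id": ["Q1", "Q2"],
        "label": ["Speaker A", "Speaker B"],
        "occupation": [["Q100", "Q200"], ["Q100"]],  # list-valued, as in the schema
    }
)

merged = attach_speaker_attributes(quotes, attributes, ["label", "occupation"])

# List-valued attributes can be unpacked to one row per value when needed.
per_occupation = merged.explode("occupation")
```

A left join is used so that quotations without a matching attribute row are kept with missing values instead of being dropped silently.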
Created:
2023-07-31



