Auxiliary Datasets for Speaker Disambiguation in Quotebank

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/8033672
Description:
This data repository contains the auxiliary datasets necessary for enriching and preprocessing Quotebank, a large dataset of unique, speaker-attributed quotations. The scripts utilizing the data can be found in the quotebank-toolkit GitHub repository. The datasets are stored in data.zip and are described below.

quotebank_disambiguation_mapping_quote.parquet
Provides the `quoteID` -> `speakerQID` mapping for each quotation, disambiguating ambiguous speaker names and linking them to their respective Wikidata items. The schema of the dataset is as follows:
 |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
 |-- speaker: Wikidata ID corresponding to the speaker of the quotation

The mapping is created using heuristics described in the following paper:
Marko Čuljak, Andreas Spitz, Robert West, and Akhil Arora. "Strong Heuristics for Named Entity Linking." Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop. 10.18653/v1/2022.naacl-srw.30

self_quotations_filtered.parquet
Contains the identifiers of the quotations identified as not being self-attributed. The schema of the dataset is as follows:
 |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")

speaker_attributes.parquet
Contains attributes of all the speakers appearing in Quotebank extracted from Wikidata. The schema of the dataset is as follows:
 |-- id: Wikidata item QID of the speaker, primary key
 |-- aliases: list of speaker's aliases
 |-- date_of_birth: list of possible speaker's dates of birth
 |-- nationality: list of speaker's nationalities
 |-- gender: list of speaker's previous or current genders
 |-- lastrevid: ID of the last revision of the speaker's item
 |-- ethnic_group: list of ethnic groups the speaker belongs to
 |-- US_congress_bio_ID: identifier for the speaker in the Biographical Directory of the United States Congress
 |-- occupation: list of speaker's occupations
 |-- party: list of parties the speaker is/was affiliated to
 |-- academic_degree: list of academic degrees obtained by the speaker
 |-- label: Wikidata label of the speaker
 |-- candidacy: list of the speaker's candidacies in political elections
 |-- type: type of the Wikidata entry (value is `item` for all the speakers)
 |-- religion: previous/current religious affiliations of the speaker

Using the `id` field corresponding to the Wikidata QID of a speaker, this dataset can be easily joined with disambiguated Quotebank obtained by running the `cleanup_disambiguate.py` script available in the aforementioned quotebank-toolkit repository.
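
The `quoteID` format and the join on the speaker QID described above can be illustrated with a minimal, standard-library-only sketch. It is not part of the quotebank-toolkit: the helper names `parse_quote_id` and `join_speaker_attributes` are hypothetical, the sample rows and QIDs are made-up placeholders, and plain dictionaries stand in for rows that would normally be loaded from the parquet files. The counter is assumed to have at least six digits, as the `06d` format implies.

```python
import re
from datetime import date

# Documented quoteID format: "YYYY-MM-DD-{increasing int:06d}".
# Assumption: the counter is zero-padded to six digits but may be longer.
QUOTE_ID_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(\d{6,})$")


def parse_quote_id(quote_id):
    """Split a quoteID into its date and its integer counter."""
    match = QUOTE_ID_RE.match(quote_id)
    if match is None:
        raise ValueError(f"unexpected quoteID format: {quote_id!r}")
    year, month, day, counter = (int(part) for part in match.groups())
    return date(year, month, day), counter


def join_speaker_attributes(mapping_rows, attribute_rows):
    """Left-join mapping rows (quoteID, speaker) with attribute rows on id."""
    attributes_by_qid = {row["id"]: row for row in attribute_rows}
    joined = []
    for row in mapping_rows:
        # Speakers without an attribute row keep a label of None.
        attrs = attributes_by_qid.get(row["speaker"], {})
        joined.append({**row, "label": attrs.get("label")})
    return joined


# Made-up placeholder rows, for illustration only (not taken from the datasets).
mapping_rows = [
    {"quoteID": "2019-03-14-000123", "speaker": "Q-PLACEHOLDER-A"},
    {"quoteID": "2019-03-14-000124", "speaker": "Q-PLACEHOLDER-B"},
]
attribute_rows = [
    {"id": "Q-PLACEHOLDER-A", "label": "Example Speaker", "type": "item"},
]

joined_rows = join_speaker_attributes(mapping_rows, attribute_rows)
```

The join key is the `speaker` column of the mapping on one side and the `id` column of the speaker attributes on the other. With a dataframe library, the same operation would be a left merge on those two columns.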
Created:
2023-07-31