
Orange/WikiFactDiff

Source: Hugging Face · Updated 2024-08-27 · Indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/Orange/WikiFactDiff
Resource description:
---
license: cc-by-sa-4.0
language:
- en
tags:
- Factual knowledge update
- General knowledge
- Wikidata
task_categories:
- other
size_categories:
- 100K<n<1M
configs:
- config_name: 20210104-20230227_legacy
  default: true
  data_files:
  - split: train
    path: "20210104-20230227_legacy/*.parquet"
- config_name: 20210104-20230227
  data_files:
  - split: train
    path: "20210104-20230227/*.parquet"
- config_name: triple_verbs
  data_files:
  - split: train
    path: "triple_verbs/*.parquet"
- config_name: triple_verbs_V2
  data_files:
  - split: train
    path: "triple_verbs_V2/*.parquet"
---

# WikiFactDiff: A Realistic Dataset for Atomic Factual Knowledge Update

WikiFactDiff is a dataset designed as a resource to perform realistic factual updates within language models and to evaluate them post-update.

**Available datasets**:
- **20210104-20230227_legacy**: The recommended WikiFactDiff dataset (its creation process is described in the [paper](https://aclanthology.org/2024.lrec-main.1532/))
- **20210104-20230227**: An improved version of WikiFactDiff in terms of verbalization quality (work still in progress; DO NOT USE IT)
- **triple_verbs**: Verbalization of 26058 triples from Wikidata by GPT3.5
- **triple_verbs_V2**: Verbalization of 91252 triples from Wikidata + 346702 triples from WikiFactDiff by GPT3.5

## Dataset Details

### Dataset Description

WikiFactDiff is a dataset that describes the factual changes between two dates as a collection of simple facts divided into three categories: **new**, **obsolete**, and **static**. The facts are represented by subject-relation-object triples. WikiFactDiff is constructed by comparing the state of the Wikidata knowledge base at two dates $T_{old}$ and $T_{new}$. These facts are accompanied by verbalization templates and cloze tests that make it possible to run update algorithms and evaluate them.
Contrary to other datasets, such as zsRE and CounterFact, WikiFactDiff constitutes a realistic update setting that involves various update scenarios, including replacements, archival, and new entity insertions.

WikiFactDiff sample (triples only) | Templates used for verbalization
:-------------------------:|:-------------------------:
[<img src="readme_images/sample.png" width="500"/>](./images/sample.png) | [<img src="readme_images/verb.png" width="500"/>](./images/verb.png)

We are releasing here the WikiFactDiff dataset for $T_{old}$ = January 4, 2021 and $T_{new}$ = February 27, 2023, which is ideal for updating language models trained using the Pile dataset released on December 31, 2020.

**Note:** Future releases, to fit other models for instance, will be stored here as different configurations of WikiFactDiff.

### Dataset Features

- **Language(s) (NLP):** English
- **License:** This work is licensed via CC BY-SA 4.0

### External resources

- **Repository:** [GitHub](https://github.com/Orange-OpenSource/WikiFactDiff) (To possibly rebuild the dataset with different $T_{old}$ and $T_{new}$)
- **Paper:** [Link](https://aclanthology.org/2024.lrec-main.1532/)

## Uses

- Align language models with current factual knowledge
- Evaluate knowledge update algorithms on realistic updates:
  - *Replacement-only* algorithms, e.g., ROME, MEMIT, MEND, etc.
  - General algorithms that can handle any update that can arise from the semantic triple representation of facts *(s,r,o)*.

## Dataset Structure

WikiFactDiff contains a list of updates.
Here are the fields of each element of the list:

- **"subject"** (dict)
  - **"id"** : Subject Wikidata ID (string)
  - **"label"** : Subject Wikidata label (string)
  - **"description"** : Subject Wikidata description (string)
- **"subject_is_ph_new"** : The subject is a new entity, i.e. an entity that did not exist at $T_{old}$ but exists at $T_{new}$. (bool)
- **"subject_popularity"** : A measure of the subject's popularity. (float)
- **"relation"** (dict)
  - **"id"** : Relation Wikidata ID (string)
  - **"label"** : Relation Wikidata label (string)
  - **"description"** : Relation Wikidata description (string)
- **"relation_is_temp_func"** : The relation is temporal functional. (bool)
- **"is_replace"** : The update represents a replacement. For instance, replacing the prime minister of UK. (bool)
- **"objects"** (list): each *dict* in the list contains the fields:
  - **"id"** : Object Wikidata ID or None if it's a literal (string)
  - **"label"** : Object Wikidata label (string)
  - **"description"** : Object Wikidata description (string)
  - **"decision"** : It can take three values (*new, obsolete, static*) depending on the veracity of the object. For example, in (Donald Trump, head of state, USA), USA receives the label *obsolete* (suppose $T_{old}=2022$ and $T_{new}=2024$ for instance). (string)
- **"update_prompt"** (string): The cloze test that is fed, together with the model, to the update algorithm to perform the update.
- **"generalization_prompts"** (list): The cloze tests used to evaluate the generalization of the update to paraphrases.
- **"neighborhood"** (list): The list of neighbor groups (facts) used to assess potential bleedover. The neighborhood's relation is the same as the one in the update.
  Each *dict* in the list contains the fields:
  - **"subject"** (dict):
    - **"id"** : Neighbor subject Wikidata ID (string)
    - **"label"** : Neighbor subject Wikidata label (string)
    - **"description"** : Neighbor subject Wikidata description (string)
  - **"dist"** : Distance between the two entities: *neighborhood.subject* and the current *subject*. (float)
  - **"objects"** (list): each *dict* in the list contains the fields:
    - **"id"** : Object Wikidata ID or None if it's a literal (string)
    - **"label"** : Object Wikidata label (string)
    - **"description"** : Object Wikidata description (string)
  - **"prompt"** : The cloze test used to validate the knowledge of this neighbor triple by the LM. For instance, "The head of state of France is ____". (string)

A more detailed description of the concepts above is included in our paper, including the measure of an entity's popularity, the method to construct the neighborhood of a fact, and the meaning of temporal functional relations.

## Dataset Creation

#### Source Data

- The facts in triple format were collected from Wikidata.
- The templates to verbalize these triples in English were created using post-processed ChatGPT verbalizations.

#### Data Collection and Processing

1. Two instances of Wikidata are collected at $T_{old}$ and $T_{new}$ respectively.
2. These instances are preprocessed to filter irrelevant data and compared to get the difference between them.
3. Each relevant triple in this difference is labeled with *new, static* or *obsolete*.
4. These triples are verbalized, and a set of neighbor facts is collected for each triple.

<center><br><b>Build process</b><br><img src="readme_images/build_process.png" width="350"/></center>

## Citation

**BibTeX:**

```
@inproceedings{ammar-khodja-etal-2024-wikifactdiff-large,
    title = "{W}iki{F}act{D}iff: A Large, Realistic, and Temporally Adaptable Dataset for Atomic Factual Knowledge Update in Causal Language Models",
    author = "Ammar Khodja, Hichem and Bechet, Frederic and Brabant, Quentin and Nasr, Alexis and Lecorv{\'e}, Gw{\'e}nol{\'e}",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1532",
    pages = "17614--17624",
}
```
Provided by:
Orange
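Summary of original information

Dataset overview

Name: WikiFactDiff

Description: WikiFactDiff is a dataset for performing realistic factual updates in language models and for evaluating them after the update. It is built by comparing the state of the Wikidata knowledge base at two dates ($T_{old}$ and $T_{new}$) and describes the factual changes between them as facts in three categories: new, obsolete, and static. Facts are represented as subject-relation-object triples.

Dataset features

  • Language: English
  • License: CC BY-SA 4.0

Dataset structure

  • Data elements: each update contains the following fields:
    • "subject" (dict): Wikidata information about the subject
    • "relation" (dict): Wikidata information about the relation
    • "objects" (list): list of objects, each with its Wikidata information and a decision label (new, obsolete, or static)
    • "update_prompt" (string): the cloze test fed to the update algorithm
    • "generalization_prompts" (list): cloze tests used to evaluate how the update generalizes
    • "neighborhood" (list): list of neighboring facts used to assess potential bleedover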
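
To make the field layout above concrete, here is a minimal Python sketch. The record is hand-written for illustration only: the IDs are placeholders, the labels and prompts are invented rather than copied from the dataset, `generalization_prompts` is assumed to hold plain strings, and `objects_by_decision` is a hypothetical helper that is not part of any WikiFactDiff tooling.

```python
from collections import defaultdict

# Hand-written illustrative record that follows the field list above.
# IDs, labels and prompts are placeholders, NOT real dataset rows.
example_update = {
    "subject": {"id": "Q_SUBJ", "label": "Exampleland", "description": "fictional country"},
    "relation": {"id": "P_REL", "label": "head of state", "description": "official head of a state"},
    "objects": [
        {"id": "Q_OLD", "label": "Alice Old", "description": "fictional politician", "decision": "obsolete"},
        {"id": "Q_NEW", "label": "Bob New", "description": "fictional politician", "decision": "new"},
    ],
    "update_prompt": "The head of state of Exampleland is ____",
    # Assumption: each generalization prompt is a plain string.
    "generalization_prompts": ["Exampleland's head of state is ____"],
    "neighborhood": [],
}


def objects_by_decision(update):
    """Group the object labels of one update by their decision label."""
    groups = defaultdict(list)
    for obj in update["objects"]:
        groups[obj["decision"]].append(obj["label"])
    return dict(groups)


groups = objects_by_decision(example_update)
print(groups)
```

Grouping by `decision` is a convenient first step when an update mixes an outgoing object with its replacement, as in this made-up example.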

Dataset uses

  • Uses:
    • Align language models with current factual knowledge
    • Evaluate how knowledge update algorithms perform on realistic updates
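
The evaluation use case above could be sketched as follows. This is only an illustration under stated assumptions, not the metric used in the paper: `predict` stands for any hypothetical function that completes a cloze prompt with a string, an update counts as successful when the completion matches an object tagged *new* or *static*, and the toy records are invented.

```python
def expected_labels(update):
    """Labels that hold at T_new: objects tagged 'new' or 'static'."""
    return {o["label"] for o in update["objects"] if o["decision"] in ("new", "static")}


def update_accuracy(updates, predict):
    """Fraction of updates whose update_prompt is completed with a currently valid object."""
    if not updates:
        return 0.0
    hits = sum(predict(u["update_prompt"]) in expected_labels(u) for u in updates)
    return hits / len(updates)


# Toy stand-in for an updated language model (hypothetical): a lookup table.
toy_answers = {"The head of state of Exampleland is ____": "Bob New"}


def toy_predict(prompt):
    """Return the stored completion, or an empty string for unknown prompts."""
    return toy_answers.get(prompt, "")


# Invented records that carry only the fields this sketch needs.
toy_updates = [
    {
        "update_prompt": "The head of state of Exampleland is ____",
        "objects": [
            {"label": "Alice Old", "decision": "obsolete"},
            {"label": "Bob New", "decision": "new"},
        ],
    },
    {
        "update_prompt": "The capital of Exampleland is ____",
        "objects": [{"label": "Sample City", "decision": "new"}],
    },
]

accuracy = update_accuracy(toy_updates, toy_predict)
print(accuracy)
```

The same pattern could be applied to `generalization_prompts` and to the `prompt` of each neighborhood entry to probe paraphrase generalization and bleedover, respectively.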

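Dataset creation

  • Source data: facts in triple form collected from Wikidata
  • Data processing:
    1. Collect Wikidata instances at two points in time
    2. Preprocess and compare the two instances to obtain their difference
    3. Label each relevant triple in the difference
    4. Verbalize the triples and collect a set of neighboring facts for each triple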
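
The comparison and labeling in steps 2 and 3 above can be pictured with a small Python sketch. It is a simplification under stated assumptions: each snapshot is modeled as a plain set of (subject, relation, object) tuples, the triples are invented, and `label_triples` is a hypothetical helper. The actual build process also filters irrelevant data and may apply further rules described in the paper.

```python
def label_triples(old_snapshot, new_snapshot):
    """Label each (subject, relation, object) triple by comparing two snapshots."""
    labels = {}
    for triple in old_snapshot | new_snapshot:
        if triple in old_snapshot and triple in new_snapshot:
            labels[triple] = "static"    # present at both dates
        elif triple in new_snapshot:
            labels[triple] = "new"       # only present at T_new
        else:
            labels[triple] = "obsolete"  # only present at T_old
    return labels


# Invented snapshots standing in for Wikidata at T_old and T_new.
old = {
    ("Exampleland", "head of state", "Alice Old"),
    ("Exampleland", "capital", "Sample City"),
}
new = {
    ("Exampleland", "head of state", "Bob New"),
    ("Exampleland", "capital", "Sample City"),
}

labels = label_triples(old, new)
for triple, label in sorted(labels.items()):
    print(label, triple)
```

In this toy case the head-of-state change yields one *obsolete* and one *new* triple for the same subject and relation, which corresponds to the replacement scenario, while the unchanged capital is *static*.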

Citation

  • BibTeX:

    @inproceedings{ammar-khodja-etal-2024-wikifactdiff-large,
        title = "{W}iki{F}act{D}iff: A Large, Realistic, and Temporally Adaptable Dataset for Atomic Factual Knowledge Update in Causal Language Models",
        author = "Ammar Khodja, Hichem and Bechet, Frederic and Brabant, Quentin and Nasr, Alexis and Lecorv{\'e}, Gw{\'e}nol{\'e}",
        booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
        publisher = "ELRA and ICCL",
        url = "https://aclanthology.org/2024.lrec-main.1532",
        pages = "17614--17624",
    }
