
strombergnlp/danfever

Source: Hugging Face. Updated 2022-10-25; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/strombergnlp/danfever
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- da
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- fact-checking
- natural-language-inference
paperswithcode_id: danfever
pretty_name: DanFEVER
tags:
- knowledge-verification
---

# Dataset Card for DanFEVER

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** [https://github.com/StrombergNLP/danfever](https://github.com/StrombergNLP/danfever)
- **Repository:** [https://stromberg.ai/publication/danfever/](https://stromberg.ai/publication/danfever/)
- **Paper:** [https://aclanthology.org/2021.nodalida-main.47/](https://aclanthology.org/2021.nodalida-main.47/)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Leon Derczynski](mailto:leod@itu.dk)
- **Size of downloaded dataset files:** 2.82 MiB
- **Size of the generated dataset:** 2.80 MiB
- **Total amount of disk used:** 5.62 MiB

### Dataset Summary

We present a dataset, DanFEVER, intended for multilingual misinformation research. The dataset is in Danish and has the same format as the well-known English FEVER dataset. It can be used for testing methods in multilingual settings, as well as for creating models in production for the Danish language.

### Supported Tasks and Leaderboards

This dataset supports the FEVER task, but in Danish.

* PwC leaderboard: [Fact Verification on DanFEVER](https://paperswithcode.com/sota/fact-verification-on-danfever)

### Languages

This dataset is in Danish; the bcp47 is `da_DK`.

## Dataset Structure

### Data Instances

```
{
  'id': '0',
  'claim': 'Den 31. oktober 1920 opdagede Walter Baade kometen (944) Hidalgo i det ydre solsystem.',
  'label': 0,
  'evidence_extract': '(944) Hidalgo (oprindeligt midlertidigt navn: 1920 HZ) er en mørk småplanet med en diameter på ca. 50 km, der befinder sig i det ydre solsystem. Objektet blev opdaget den 31. oktober 1920 af Walter Baade. En asteroide (småplanet, planetoide) er et fast himmellegeme, hvis bane går rundt om Solen (eller en anden stjerne). Pr. 5. maj 2017 kendes mere end 729.626 asteroider og de fleste befinder sig i asteroidebæltet mellem Mars og Jupiter.',
  'verifiable': 1,
  'evidence': 'wiki_26366, wiki_12289',
  'original_id': '1'
}
```

### Data Fields

[Needs More Information]

### Data Splits

[Needs More Information]

## Dataset Creation

### Curation Rationale

A dump of the Danish Wikipedia of 13 February 2020 was stored, as well as the relevant articles from Den Store Danske (excerpts only, to comply with copyright laws). Two teams of two people independently sampled evidence, and created and annotated claims from these two sites.

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

The source language is from Wikipedia contributors and editors, and from dictionary contributors and editors.

### Annotations

#### Annotation process

Detailed in [this paper](http://www.derczynski.com/papers/danfever.pdf).

#### Who are the annotators?

The annotators are native Danish speakers and master's students of IT; two female, two male, ages 25-35.

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

The purpose of this dataset is to enable construction of fact-checking systems in Danish. A system that succeeds at this may be able to identify questionable conclusions or inferences.

### Discussion of Biases

The data is drawn from relatively formal topics, so models built on it may perform poorly outside these areas.

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

The data here is licensed CC-BY 4.0. If you use this data, you MUST state its origin.

### Citation Information

Refer to this work as:

> Nørregaard and Derczynski (2021). "DanFEVER: claim verification dataset for Danish", Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa).

Bibliographic reference:

```
@inproceedings{norregaard-derczynski-2021-danfever,
    title = "{D}an{FEVER}: claim verification dataset for {D}anish",
    author = "N{\o}rregaard, Jeppe and Derczynski, Leon",
    booktitle = "Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)",
    year = "2021",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
    url = "https://aclanthology.org/2021.nodalida-main.47",
    pages = "422--428"
}
```
Provider:
strombergnlp
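Summary of the original information

Dataset overview

  • Name: DanFEVER
  • Language: Danish (da)
  • License: CC-BY-4.0
  • Multilinguality: monolingual
  • Size: 1K<n<10K
  • Source: original data
  • Task category: text classification
  • Task IDs: fact-checking, natural-language-inference
  • Tags: knowledge-verification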

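Dataset description

  • Summary: DanFEVER is a Danish dataset intended for multilingual misinformation research, in the same format as the well-known English FEVER dataset. It can be used for testing methods in multilingual settings and for creating production models for Danish.
  • Supported tasks: the FEVER task, but in Danish.
  • Language: the dataset is in Danish; the bcp47 is da_DK.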

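Dataset structure

  • Data instances: each instance contains an ID, a claim, a label, an evidence extract, a verifiable flag, evidence sources, and an original ID.
  • Data fields: needs more information
  • Data splits: needs more information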
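
The instance layout described above can be handled roughly as in the following sketch. It uses only the field names and values shown in the card's example instance. The `DanFeverInstance` type and the `evidence_ids` helper are hypothetical names introduced here for illustration, and the comma-separated format of `evidence` is an assumption based on that single example. The `evidence_extract` value is shortened to its first sentence for brevity.

```python
from typing import List, TypedDict


class DanFeverInstance(TypedDict):
    """Hypothetical type; field names are copied from the card's example instance."""
    id: str
    claim: str
    label: int
    evidence_extract: str
    verifiable: int
    evidence: str
    original_id: str


def evidence_ids(instance: DanFeverInstance) -> List[str]:
    """Split the `evidence` string into individual source IDs.

    Assumption: IDs are comma-separated, as in the card's single example.
    """
    return [part.strip() for part in instance["evidence"].split(",") if part.strip()]


# Example instance from the dataset card (evidence_extract shortened here).
example: DanFeverInstance = {
    "id": "0",
    "claim": "Den 31. oktober 1920 opdagede Walter Baade kometen (944) Hidalgo i det ydre solsystem.",
    "label": 0,
    "evidence_extract": "(944) Hidalgo (oprindeligt midlertidigt navn: 1920 HZ) er en mørk småplanet med en diameter på ca. 50 km, der befinder sig i det ydre solsystem.",
    "verifiable": 1,
    "evidence": "wiki_26366, wiki_12289",
    "original_id": "1",
}

ids = evidence_ids(example)
print(ids)
```

The card does not document what the integer values of `label` and `verifiable` mean, so the sketch leaves them uninterpreted.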

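Dataset creation

  • Curation rationale: a dump of the Danish Wikipedia of 13 February 2020 was used, along with the relevant articles from Den Store Danske (excerpts only, to comply with copyright laws). Two teams of two people independently sampled evidence, and created and annotated claims from these two sites.
  • Source data: the source language comes from Wikipedia editors and dictionary editors.
  • Annotations: the annotators are native Danish speakers and master's students of IT; two female, two male, ages 25-35.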

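Considerations for using the data

  • Social impact: the dataset is intended to enable construction of fact-checking systems in Danish. A system that succeeds at this may be able to identify questionable conclusions or inferences.
  • Discussion of biases: the data is drawn from relatively formal topics, so performance may be poor outside these areas.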

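Additional information

  • Licensing information: the dataset is licensed CC-BY 4.0; if you use this data, you must state its origin.
  • Citation information: Nørregaard and Derczynski (2021), "DanFEVER: claim verification dataset for Danish", Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa).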