
yhavinga/squad_v2_dutch

Hugging Face | Updated 2023-01-21 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/yhavinga/squad_v2_dutch
Resource description:
---
pretty_name: SQuAD2.0 Dutch
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- nl
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- open-domain-qa
- extractive-qa
paperswithcode_id: squad_v2_dutch
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: title_en
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: text_en
      dtype: string
    - name: answer_start_en
      dtype: int32
---

# Dataset Card for "squad_v2_dutch"

## Dataset Description

- **Homepage:** [https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/)

## Dataset Summary

The squad_v2_dutch dataset is a machine-translated version of the SQuAD v2 dataset from English to Dutch.
The SQuAD v2 dataset combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

## Challenges and Solutions

One of the main challenges in translating the SQuAD v2 dataset to Dutch was accurately translating the answers, which are often short phrases or single words.
Translating the answers individually would result in obvious mistakes. Examples are:

* Destiny's Child -> Het kind van Destiny
* Dangerously in Love -> Gevaarlijk in de liefde
* Imagine -> Stel je voor
* Men in Black -> Mannen in zwart
* Hottest Female Singer of All Time -> De heetste vrouwelijke zanger aller tijden

The correct translation of these phrases often depends on the context in which they are used.

To address this, the title, question, answers, and context were concatenated into a single sequence, separated by the newline character.
When the translated version had the correct number of newlines and did not contain any apparent mixups of the answers with the question and title, it was used.
Otherwise, the one-by-one context-less translation was used as a fallback.

Most examples were translated with the context-rich translation: ~95%.

* train split: context: 123898, no context: 6406
* validation split: context: 10196, no context: 1644

### Data Fields

The data fields are the same among all splits.

#### squad_v2

- `id`: a `string` feature.
- `title`: a `string` feature.
- `title_en`: a `string` feature.
- `context`: a `string` feature.
- `question`: a `string` feature.
- `answers`: a dictionary feature containing:
  - `text`: a list of `string` features.
  - `text_en`: a list of `string` features.
  - `answer_start_en`: an `int32` feature.

### Citation Information

```
@article{2016arXiv160605250R,
       author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},
                 Konstantin and {Liang}, Percy},
        title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
      journal = {arXiv e-prints},
         year = 2016,
          eid = {arXiv:1606.05250},
        pages = {arXiv:1606.05250},
archivePrefix = {arXiv},
       eprint = {1606.05250},
}
```

### Contributions

Thanks to [@lewtun](https://github.com/lewtun), [@albertvillanova](https://github.com/albertvillanova), [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf) for adding the https://huggingface.co/datasets/squad_v2 dataset.
This project would not have been possible without compute generously provided by Google through the [TPU Research Cloud](https://sites.research.google/trc/).

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
Provider:
yhavinga
Summary of original information

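Dataset Card for "squad_v2_dutch"

Dataset Description

Dataset Summary

The squad_v2_dutch dataset is a machine-translated version of the SQuAD v2 dataset from English to Dutch. The SQuAD v2 dataset combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

Challenges and Solutions

The main challenge in translating the SQuAD v2 dataset to Dutch was accurately translating the answers, which are often short phrases or single words. Translating the answers individually results in obvious mistakes. To address this, the title, question, answers, and context were concatenated into a single sequence, separated by the newline character. If the translated version had the correct number of newlines and did not contain any apparent mixups of the answers with the question and title, it was used. Otherwise, the one-by-one context-less translation was used as a fallback.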
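
The concatenate, validate, and fall back procedure described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the function name `translate_with_context` and the `translate` callable are hypothetical, the fields are assumed to contain no newlines of their own, and the check for mixups between answers, question, and title is left out because the card does not say how it was done.

```python
from typing import Callable, List, Tuple


def translate_with_context(
    title: str,
    question: str,
    answers: List[str],
    context: str,
    translate: Callable[[str], str],  # hypothetical stand-in for the MT system
) -> Tuple[List[str], bool]:
    """Return (translated fields, used_context).

    Field order in the result: title, question, each answer, context.
    """
    fields = [title, question, *answers, context]
    # Context-rich attempt: translate one newline-separated sequence.
    translated = translate("\n".join(fields))
    parts = translated.split("\n")
    # Accept only if the translation kept the expected number of newlines.
    if len(parts) == len(fields):
        return parts, True
    # Fallback: translate every field on its own, without context.
    return [translate(field) for field in fields], False


if __name__ == "__main__":
    # Toy "translator" that upper-cases its input, for demonstration only.
    parts, used_context = translate_with_context(
        "Beyonce",
        "Which group did she sing in?",
        ["Destiny's Child"],
        "She rose to fame in Destiny's Child.",
        str.upper,
    )
    print(used_context, parts)
```

The boolean flag corresponds to the "context" versus "no context" counts reported for each split.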

Most examples used the context-rich translation: about 95%.

  • Train split: context: 123898, no context: 6406
  • Validation split: context: 10196, no context: 1644
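
As a quick check of the "about 95%" figure, the share of context-rich translations can be recomputed from the counts above. The snippet below is only arithmetic on the published numbers; the variable and function names are illustrative.

```python
# Counts copied from the list above.
counts = {
    "train": {"context": 123898, "no_context": 6406},
    "validation": {"context": 10196, "no_context": 1644},
}


def context_share(context: int, no_context: int) -> float:
    """Fraction of examples translated with the context-rich method."""
    return context / (context + no_context)


shares = {split: context_share(**c) for split, c in counts.items()}

# Pool both splits for an overall figure.
total_context = sum(c["context"] for c in counts.values())
total_no_context = sum(c["no_context"] for c in counts.values())
overall = context_share(total_context, total_no_context)

for split, share in shares.items():
    print(f"{split}: {share:.1%}")
print(f"overall: {overall:.1%}")
```

By this calculation the train split comes out at roughly 95.1%, the validation split at roughly 86.1%, and both splits together at roughly 94.3%, so the quoted figure describes the train split most closely.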

Data Fields

The data fields are the same among all splits.

squad_v2

  • id: a string feature.
  • title: a string feature.
  • title_en: a string feature.
  • context: a string feature.
  • question: a string feature.
  • answers: a dictionary feature containing:
    • text: a list of string features.
    • text_en: a list of string features.
    • answer_start_en: an int32 feature.
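
To make the schema concrete, here is a small sketch of one record and two helpers that work on it. The sample record is made up for illustration and does not come from the dataset. The sketch assumes that the `answers` sequence is exposed as a dictionary of parallel lists, which is how the Hugging Face `datasets` library usually presents such features. Real data would presumably be loaded with `datasets.load_dataset("yhavinga/squad_v2_dutch")`, which is not called here so that the example runs offline. The helper names are hypothetical.

```python
from typing import List, TypedDict


class Answers(TypedDict):
    text: List[str]             # Dutch answer texts
    text_en: List[str]          # original English answer texts
    answer_start_en: List[int]  # character offsets into the ENGLISH context


class Example(TypedDict):
    id: str
    title: str
    title_en: str
    context: str
    question: str
    answers: Answers


def is_answerable(example: Example) -> bool:
    """SQuAD v2 style: an empty answer list marks an unanswerable question."""
    return len(example["answers"]["text"]) > 0


def dutch_answers_in_context(example: Example) -> List[bool]:
    """For each Dutch answer, report whether it occurs verbatim in the context."""
    return [a in example["context"] for a in example["answers"]["text"]]


# Hypothetical record that mirrors the field list above.
sample: Example = {
    "id": "example-0",
    "title": "Beyoncé",
    "title_en": "Beyoncé",
    "context": "Beyoncé werd bekend als zangeres van Destiny's Child.",
    "question": "In welke groep zong Beyoncé?",
    "answers": {
        "text": ["Destiny's Child"],
        "text_en": ["Destiny's Child"],
        "answer_start_en": [40],  # made-up offset
    },
}

print(is_answerable(sample), dutch_answers_in_context(sample))
```

The field list only includes an English start offset, so locating an answer in the Dutch context likely needs a string search like the one above, and a translated answer is not guaranteed to appear verbatim.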

Citation Information

@article{2016arXiv160605250R,
  author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev}, Konstantin and {Liang}, Percy},
  title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
  journal = {arXiv e-prints},
  year = 2016,
  eid = {arXiv:1606.05250},
  pages = {arXiv:1606.05250},
  archivePrefix = {arXiv},
  eprint = {1606.05250},
}
