
dragosnicolae555/RoITD

Source: Hugging Face. Updated: 2022-10-25. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/dragosnicolae555/RoITD
Resource overview:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- ro-RO
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: 'RoITD: Romanian IT Question Answering Dataset'
size_categories:
- unknown
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- extractive-qa
---

## Dataset Summary

We introduce a Romanian IT Dataset (RoITD) resembling SQuAD 1.1. RoITD consists of 9575 Romanian QA pairs formulated by crowd workers. The QA pairs are based on 5043 Romanian Wikipedia articles describing IT and household products. Of the total number of questions, 5103 are possible (i.e. the correct answer can be found within the paragraph) and 4472 are not possible (i.e. the given answer is a "plausible answer" and not correct).

## Dataset Structure

The data structure follows the format of SQuAD, which contains several attributes such as **question**, **id**, **text**, **answer_start**, **is_impossible** and **context**.

The paragraph shown to the crowd workers is stored in the field **context**; these paragraphs were manually selected from Wikipedia. The field **id** holds a randomly assigned unique identification number for the question-answer pair. Only the values "0" and "1" are allowed in the **is_impossible** field: "0" corresponds to category "A", indicating that the answer is correct, and "1" corresponds to category "U", indicating a plausible answer. The question posed by the crowd worker is stored in the field **question**. The field **answer_start** records the character index at which the answer begins.
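The field description above can be made concrete with a small sketch. The record below is entirely hypothetical (made-up id, context, and question), and it assumes a flat layout, a 0-based **answer_start** measured against **context**, and an integer **is_impossible**; the published files may nest these fields as SQuAD does or store the flag as a string.

```python
# Hypothetical flat record, invented for illustration only.
# Assumptions: flat layout, 0-based answer_start, integer is_impossible.
record = {
    "id": "0001",
    "context": "Laptopul are 16 GB de memorie RAM.",
    "question": "Câtă memorie RAM are laptopul?",
    "text": "16 GB",
    "answer_start": 13,
    "is_impossible": 0,
}


def answer_span(rec):
    """Return (start, end) character offsets of the answer in the context."""
    start = rec["answer_start"]
    return start, start + len(rec["text"])


def span_matches(rec):
    """Check that slicing the context at answer_start reproduces the answer text."""
    start, end = answer_span(rec)
    return rec["context"][start:end] == rec["text"]


def is_answerable(rec):
    """Per the card: 0 is category "A" (correct answer), 1 is category "U" (plausible only)."""
    return int(rec["is_impossible"]) == 0


print(answer_span(record), span_matches(record), is_answerable(record))
```

A check like `span_matches` is a cheap sanity test to run over every record before training an extractive QA model, since a misaligned offset silently corrupts the training labels.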
Provided by:
dragosnicolae555
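Summary of original information

Dataset overview

Basic information

  • Name: RoITD: Romanian IT Question Answering Dataset
  • Language: Romanian (ro-RO)
  • License: CC-BY-4.0
  • Multilinguality: monolingual
  • Source: original dataset
  • Task category: question answering
  • Task ID: extractive question answering (extractive-qa)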

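Dataset structure

  • Data format: follows the SQuAD format
  • Main fields:
    • question: the question
    • id: unique identification number
    • text: the text
    • answer_start: start position of the answer
    • is_impossible: whether the answer is impossible (0: possible, 1: not possible)
    • context: the context, taken from manually selected Wikipedia paragraphs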

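Dataset contents

  • Number of QA pairs: 9575
  • Article source: 5043 Romanian Wikipedia articles describing IT and household products
  • Question breakdown:
    • Possible questions: 5103
    • Not possible questions: 4472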
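As a quick consistency check on the figures above, the two question categories should add up to the total number of QA pairs. The sketch below uses only the counts stated in this listing; the variable names are illustrative.

```python
# Counts as stated in the dataset description.
total_pairs = 9575
possible = 5103
not_possible = 4472
articles = 5043

# The two categories should partition the full set of QA pairs.
assert possible + not_possible == total_pairs

possible_share = possible / total_pairs
not_possible_share = not_possible / total_pairs
pairs_per_article = total_pairs / articles

print(f"possible: {possible_share:.1%}")          # possible: 53.3%
print(f"not possible: {not_possible_share:.1%}")  # not possible: 46.7%
print(f"QA pairs per article: {pairs_per_article:.2f}")  # QA pairs per article: 1.90
```

The split is fairly balanced, so a model that always predicts "possible" would reach only about 53% accuracy on the answerability decision.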