projecte-aina/catalanqa

Name: projecte-aina/catalanqa
Creator: projecte-aina
Published: 2024-09-20 09:24:37
License: cc-by-sa-4.0

Source: Hugging Face. Updated: 2024-09-20. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/projecte-aina/catalanqa

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- ca
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- extractive-qa
pretty_name: catalanqa
dataset_info:
  features:
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: id
    dtype: string
  - name: answers
    list:
    - name: answer_start
      dtype: int64
    - name: text
      dtype: string
  splits:
  - name: train
    num_bytes: 17697706
    num_examples: 17135
  - name: validation
    num_bytes: 2229045
    num_examples: 2157
  - name: test
    num_bytes: 2183846
    num_examples: 2135
  download_size: 13759215
  dataset_size: 22110597
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

# Dataset Card for CatalanQA

## Dataset Description

- **Homepage:** https://github.com/projecte-aina
- **Point of Contact:** langtech@bsc.es

### Dataset Summary

This dataset can be used to build extractive-QA and language models. It is an aggregation and balancing of 2 previous datasets: [VilaQuAD](https://huggingface.co/datasets/projecte-aina/vilaquad) and [ViquiQuAD](https://huggingface.co/datasets/projecte-aina/viquiquad).

Splits have been balanced by kind of question. Unlike other datasets such as [SQuAD](http://arxiv.org/abs/1606.05250), each record contains only one question and one answer for its context, although the same context can appear in multiple records.

This dataset was developed by [BSC TeMU](https://temu.bsc.es/) as part of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina/), to enrich the [Catalan Language Understanding Benchmark (CLUB)](https://club.aina.bsc.es/).

This work is licensed under a <a rel="license" href="https://creativecommons.org/licenses/by-sa/4.0/">Attribution-ShareAlike 4.0 International License</a>.

### Supported Tasks and Leaderboards

Extractive-QA, Language Model.

### Languages

The dataset is in Catalan (`ca-ES`).

## Dataset Structure

### Data Instances

```
{
  "title": "Els 521 policies espanyols amb més mala nota a les oposicions seran enviats a Catalunya",
  "paragraphs": [
    {
      "context": "El Ministeri d'Interior espanyol enviarà a Catalunya els 521 policies espanyols que han obtingut més mala nota a les oposicions. Segons que explica El País, hi havia mig miler de places vacants que s'havien de cobrir, però els agents amb més bones puntuacions han elegit destinacions diferents. En total van aprovar les oposicions 2.600 aspirants. D'aquests, en seran destinats al Principat 521 dels 560 amb més mala nota. Per l'altra banda, entre els 500 agents amb més bona nota, només 8 han triat Catalunya. Fonts de la policia espanyola que esmenta el diari ho atribueixen al procés d'independència, al Primer d'Octubre i a la 'situació social' que se'n deriva.",
      "qas": [
        {
          "question": "Quants policies enviaran a Catalunya?",
          "id": "0.5961700408283691",
          "answers": [
            {
              "text": "521",
              "answer_start": 57
            }
          ]
        }
      ]
    }
  ]
}
```

### Data Fields

Follows [(Rajpurkar, Pranav et al., 2016)](http://arxiv.org/abs/1606.05250) for SQuAD v1 datasets:

- `id` (str): Unique ID assigned to the question.
- `title` (str): Title of the article.
- `context` (str): Article text.
- `question` (str): Question.
- `answers` (list): Answer to the question, containing:
  - `text` (str): Span of text that answers the question.
  - `answer_start` (int): Starting offset of the answer span within the context.

### Data Splits

- train.json: 17135 question/answer pairs
- dev.json: 2157 question/answer pairs
- test.json: 2135 question/answer pairs

## Dataset Creation

### Curation Rationale

We created this corpus to contribute to the development of language models in Catalan, a low-resource language.

### Source Data

- [VilaWeb](https://www.vilaweb.cat/) and [Catalan Wikipedia](https://ca.wikipedia.org).

#### Initial Data Collection and Normalization

This dataset is a balanced aggregation of the [ViquiQuAD](https://huggingface.co/datasets/projecte-aina/viquiquad) and [VilaQuAD](https://huggingface.co/datasets/projecte-aina/vilaquad) datasets.

#### Who are the source language producers?

Volunteers from [Catalan Wikipedia](https://ca.wikipedia.org) and professional journalists from [VilaWeb](https://www.vilaweb.cat/).

### Annotations

#### Annotation process

We aggregated and balanced the [ViquiQuAD](https://huggingface.co/datasets/projecte-aina/viquiquad) and [VilaQuAD](https://huggingface.co/datasets/projecte-aina/vilaquad) datasets.

To annotate those datasets, we commissioned the creation of 1 to 5 questions for each context, following an adaptation of the guidelines from SQuAD 1.0 [(Rajpurkar, Pranav et al., 2016)](http://arxiv.org/abs/1606.05250).

For compatibility with similar datasets in other languages, we followed existing curation guidelines as closely as possible.

#### Who are the annotators?

Annotation was commissioned from a specialized company that hired a team of native language speakers.

### Personal and Sensitive Information

No personal or sensitive information is included.

## Considerations for Using the Data

### Social Impact of Dataset

We hope this corpus contributes to the development of language models in Catalan, a low-resource language.

### Discussion of Biases

[N/A]

### Other Known Limitations

[N/A]

## Additional Information

### Dataset Curators

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es).

This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

### Licensing Information

This work is licensed under a <a rel="license" href="https://creativecommons.org/licenses/by-sa/4.0/">Attribution-ShareAlike 4.0 International License</a>.

### Citation Information

```
@inproceedings{gonzalez-agirre-etal-2024-building-data,
    title = "Building a Data Infrastructure for a Mid-Resource Language: The Case of {C}atalan",
    author = "Gonzalez-Agirre, Aitor and
      Marimon, Montserrat and
      Rodriguez-Penagos, Carlos and
      Aula-Blasco, Javier and
      Baucells, Irene and
      Armentano-Oller, Carme and
      Palomar-Giner, Jorge and
      Kulebi, Baybars and
      Villegas, Marta",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.231",
    pages = "2556--2566",
}
```

### Contributions

[N/A]
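
The card shows a nested SQuAD-style example under Data Instances, while its metadata lists flat features (`title`, `context`, `question`, `id`, `answers`). The following is a minimal sketch of how one might convert between the two and check that `answer_start` points at the answer text. The helper names `flatten` and `misaligned_answers` are hypothetical, the context is shortened from the card's example, and the sketch assumes offsets are Python string (code point) indices, which is consistent with the sample but not stated by the card.

```python
from typing import Any, Dict, Iterator, List

# Abbreviated copy of the card's example record (context shortened for brevity).
NESTED_RECORD: Dict[str, Any] = {
    "title": "Els 521 policies espanyols amb més mala nota a les oposicions seran enviats a Catalunya",
    "paragraphs": [
        {
            "context": (
                "El Ministeri d'Interior espanyol enviarà a Catalunya els 521 "
                "policies espanyols que han obtingut més mala nota a les oposicions."
            ),
            "qas": [
                {
                    "question": "Quants policies enviaran a Catalunya?",
                    "id": "0.5961700408283691",
                    "answers": [{"text": "521", "answer_start": 57}],
                }
            ],
        }
    ],
}


def flatten(article: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per question: title, context, question, id, answers."""
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            yield {
                "title": article["title"],
                "context": paragraph["context"],
                "question": qa["question"],
                "id": qa["id"],
                "answers": qa["answers"],
            }


def misaligned_answers(row: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the answers whose text is NOT found at answer_start in the context."""
    context = row["context"]
    bad = []
    for answer in row["answers"]:
        start = answer["answer_start"]
        # Assumption: answer_start is a code point index into the context string.
        if context[start:start + len(answer["text"])] != answer["text"]:
            bad.append(answer)
    return bad


rows = list(flatten(NESTED_RECORD))
for row in rows:
    print(row["id"], "misaligned answers:", len(misaligned_answers(row)))
```

Running a check like this over every row before training is a cheap way to catch offset drift introduced by re-encoding or whitespace normalization.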

Provided by:

projecte-aina

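Summary of original information

Dataset overview

Dataset name

Name: CatalanQA
Alias: catalanqa

Dataset attributes

Language: Catalan (ca-ES)
License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Multilinguality: monolingual
Size: 1K<n<10K
Source: original dataset
Task type: question answering (extractive QA)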

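Dataset description

Overview: This dataset is intended for building extractive question-answering and language models. It is an aggregation and balancing of two datasets, VilaQuAD and ViquiQuAD. Each record contains one question and one answer, although the same context may appear in multiple records.
Purpose: Developed by BSC TeMU as part of Projecte AINA to enrich the Catalan Language Understanding Benchmark (CLUB).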
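
Because each record carries a single question while contexts repeat, it can help to regroup records by context when a one-context-many-questions view is needed. The sketch below illustrates the idea on made-up rows; `questions_by_context` and the toy data are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def questions_by_context(rows: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group the questions of flat records under the context they share."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        grouped[row["context"]].append(row["question"])
    return dict(grouped)


# Hypothetical toy rows: two records share "Context A", one uses "Context B".
toy_rows = [
    {"context": "Context A", "question": "Question 1?"},
    {"context": "Context A", "question": "Question 2?"},
    {"context": "Context B", "question": "Question 3?"},
]

grouped = questions_by_context(toy_rows)
for context, questions in grouped.items():
    print(f"{context!r}: {len(questions)} question(s)")
```

Keying on the full context string is the simplest choice here; a hash of the context would serve equally well if memory is a concern.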

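Dataset structure

Data instances: Each instance contains an article title, paragraphs (with context), and question-answer pairs (question, answer, and answer start offset).
Data fields: id, title, context, question, and answers (containing text and answer_start).
Data splits: training set (17135 pairs), development set (2157 pairs), and test set (2135 pairs).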
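
As a quick sanity check on the split figures, the short sketch below totals the example counts and byte sizes reported in the card metadata and prints each split's share. All numbers are copied from the card; the variable names are illustrative.

```python
# Split statistics as reported in the dataset card metadata.
SPLITS = {
    "train": {"num_examples": 17135, "num_bytes": 17697706},
    "validation": {"num_examples": 2157, "num_bytes": 2229045},
    "test": {"num_examples": 2135, "num_bytes": 2183846},
}
DECLARED_DATASET_SIZE = 22110597  # dataset_size field from the card metadata

total_examples = sum(split["num_examples"] for split in SPLITS.values())
total_bytes = sum(split["num_bytes"] for split in SPLITS.values())

for name, split in SPLITS.items():
    share = split["num_examples"] / total_examples
    print(f"{name:<10} {split['num_examples']:>6} examples  {share:.1%}")

print("total examples:", total_examples)
print("bytes match declared dataset_size:", total_bytes == DECLARED_DATASET_SIZE)
```

The three splits sum to 21,427 question/answer pairs, roughly an 80/10/10 division, and the byte counts add up to the declared `dataset_size`. The total is above the upper bound of the `1K<n<10K` size category listed in the metadata, so that label should be read with caution.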

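Dataset creation

Curation rationale: To promote the development of language models for Catalan, a low-resource language.
Source data: VilaWeb and Catalan Wikipedia.
Annotation process: Commissioned from a specialized company and carried out by native speakers, following an adaptation of the SQuAD 1.0 guidelines.
Annotators: Native speakers hired by the specialized company.

Considerations for using the data

Social impact: The dataset is expected to contribute to the development of Catalan language models.
Discussion of biases: No specific information provided.
Other known limitations: No specific information provided.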
