TurkuNLP/xlsum-fi
收藏Hugging Face2022-10-25 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/TurkuNLP/xlsum-fi
下载链接
链接失效反馈官方服务:
资源简介:
---
annotations_creators:
- found
language_creators:
- machine translated
language:
- fi
license:
- cc-by-nc-sa-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- xlsum
task_categories:
- summarization
- text2text-generation
task_ids: []
pretty_name: XL-Sum-FI
tags:
- conditional-text-generation
---
# Dataset Card for "XL-Sum-FI"
## Table of Contents
- [Dataset Card Creation Guide](#dataset-card-creation-guide)
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
- [Who are the source language producers?](#who-are-the-source-language-producers)
- [Annotations](#annotations)
- [Annotation process](#annotation-process)
- [Who are the annotators?](#who-are-the-annotators)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Repository:** https://github.com/TurkuNLP/xlsum-fi
- **Point of Contact:** [Filip Ginter](mailto:figint@utu.fi)
### Dataset Summary
This dataset is a DeepL -based machine translation of a part of the English section of the XLSum dataset:[https://github.com/csebuetnlp/xl-sum](https://github.com/csebuetnlp/xl-sum) In the present version, only examples where the full version is at most 10x the summary in length are included. We might translate more later.
### Supported Tasks and Leaderboards
### Languages
- `finnish`
## Dataset Structure
### Data Instances
One example from the `Finnish` dataset is given below in JSON format.
```
{
"id": "technology-17657859",
"url": "https://www.bbc.com/news/technology-17657859",
"title": "Walesin myrskytuulien vuoksi annettu säävaroitus",
"summary": "Tuulet voivat yltyä Walesissa myrskytuuliin, ja myrskysää on luvassa koko maahan tällä viikolla.",
"text": "Met Office on antanut Walesin ja Englannin kattavan keltaisen tuulivaroituksen keskiviikkoillasta kello 21.00 GMT alkaen. Matkustaminen ja sähkönjakelu todennäköisesti häiriintyvät, ja varoitus on voimassa torstaihin kello 15:00 asti. Puuskat ovat todennäköisesti nopeudeltaan 88 kilometriä tunnissa, ja rannikoilla ja kukkuloilla puuskat voivat nousta jopa 70 kilometriin tunnissa, ja lisäksi voi esiintyä rankkasateita ja myrskyisiä sadekuuroja."
}
```
### Data Fields
- 'id': A string representing the article ID, matched to the XLSum dataset original
- 'url': A string representing the article URL as in the original XLSum dataset
- 'title': A string containing the article title, machine-translated to Finnish
- 'summary': A string containing the article summary, machine-translated to Finnish
- 'text' : A string containing the article text, machine-translated to Finnish
### Data Splits
Follows the XLSum dataset.
## Dataset Creation
### Curation Rationale
### Source Data
[BBC News](https://www.bbc.co.uk/ws/languages)
#### Initial Data Collection and Normalization
[Detailed in the paper](https://aclanthology.org/2021.findings-acl.413/) For this present dataset, only English was used as the source and only examples where the full text is at maximum 10x in length compared to the summary are preserved. This 10x cutoff is naturally measured on English.
#### Who are the source language producers?
[Detailed in the paper](https://aclanthology.org/2021.findings-acl.413/)
### Annotations
[Detailed in the paper](https://aclanthology.org/2021.findings-acl.413/) DeepL was used to machine-translate from English to Finnish
#### Annotation process
[Detailed in the paper](https://aclanthology.org/2021.findings-acl.413/)
#### Who are the annotators?
[Detailed in the paper](https://aclanthology.org/2021.findings-acl.413/)
### Personal and Sensitive Information
[More information needed](https://github.com/csebuetnlp/xl-sum)
## Considerations for Using the Data
### Social Impact of Dataset
### Discussion of Biases
### Other Known Limitations
Due to DeepL terms and conditions, this dataset **must not be used for any machine translation work**, namely machine translation system development and evaluation of any kind. In general, we wish you do not pair the original English data with the translations except when working on research unrelated to machine translation, so as not to infringe on the terms and conditions.
## Additional Information
### Dataset Curators
### Licensing Information
Contents of this repository are restricted to only non-commercial research purposes under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright of the dataset contents belongs to the original copyright holders.
### Citation Information
If you use any of the datasets, models or code modules, please cite the original XL-Sum paper below as well as acknowledge Filip Ginter and the TurkuNLP group for the Finnish machine translated version.
```
@inproceedings{hasan-etal-2021-xl,
title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
author = "Hasan, Tahmid and
Bhattacharjee, Abhik and
Islam, Md. Saiful and
Mubasshir, Kazi and
Li, Yuan-Fang and
Kang, Yong-Bin and
Rahman, M. Sohel and
Shahriyar, Rifat",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.413",
pages = "4693--4703",
}
```
### Contributions
Thanks to the creators of the XLSum dataset!
提供机构:
TurkuNLP
原始信息汇总
数据集概述
数据集名称
- 名称: XL-Sum-FI
语言
- 语言: Finnish (fi)
许可证
- 许可证: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)
多语言性
- 多语言性: 单语种
大小分类
- 大小分类: 10K<n<100K
源数据集
- 源数据集: XLSum
任务类别
- 任务类别:
- 摘要生成
- 文本到文本生成
数据集结构
- 数据实例: 每个实例包含以下字段:
- id: 文章ID,与XLSum数据集原始数据匹配
- url: 文章URL,与XLSum数据集原始数据相同
- title: 文章标题,机器翻译为芬兰语
- summary: 文章摘要,机器翻译为芬兰语
- text: 文章全文,机器翻译为芬兰语
数据分割
- 数据分割: 遵循XLSum数据集的分割方式
数据集创建
- 翻译工具: DeepL
- 翻译过程: 从英语机器翻译到芬兰语
使用限制
- 使用限制: 由于DeepL的使用条款,该数据集不得用于任何机器翻译工作,包括机器翻译系统的开发和评估。
许可证信息
- 许可证信息: 本数据集内容仅限于非商业研究目的使用,遵循Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)。
引用信息
- 引用信息: 若使用此数据集,请引用XL-Sum论文,并感谢Filip Ginter和TurkuNLP团队对芬兰语机器翻译版本的贡献。



