mohamedemam/Arabic-samsum-dialogsum
Hugging Face. Updated 2023-09-11; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/mohamedemam/Arabic-samsum-dialogsum
Resource description:
---
dataset_info:
  features:
  - name: index
    dtype: int64
  - name: id
    dtype: string
  - name: dialogue
    dtype: string
  - name: summary
    dtype: string
  - name: topic
    dtype: string
  splits:
  - name: train
    num_bytes: 27913254
    num_examples: 24813
  download_size: 13968520
  dataset_size: 27913254
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-nc-2.0
task_categories:
- summarization
- conversational
language:
- ar
pretty_name: ar messum
size_categories:
- 10K<n<100K
---
# Dataset Card for "Arabic-samsum-dialogsum"
This dataset is a combination of the SAMSum and DialogSum datasets, translated into Arabic.
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://arxiv.org/abs/1911.12237v2
- **Repository:** [Needs More Information]
- **Paper:** https://arxiv.org/abs/1911.12237v2
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
### Dataset Summary
The SAMSum dataset contains about 16k messenger-like conversations with summaries. Conversations were created and written down by linguists fluent in English. Linguists were asked to create conversations similar to those they write on a daily basis, reflecting the proportion of topics of their real-life messenger conversations. The style and register are diversified: conversations could be informal, semi-formal or formal, and they may contain slang words, emoticons and typos. Then, the conversations were annotated with summaries. It was assumed that summaries should be a concise brief, written in the third person, of what people talked about in the conversation.
The SAMSum dataset was prepared by Samsung R&D Institute Poland and is distributed for research purposes (non-commercial licence: CC BY-NC-ND 4.0).
### Supported Tasks and Leaderboards
[Needs More Information]
### Languages
Arabic
## Dataset Structure
### Data Instances
The created dataset is made of 16369 conversations distributed uniformly into 4 groups based on the number of utterances in conversations: 3-6, 7-12, 13-18 and 19-30. Each utterance contains the name of the speaker. Most conversations consist of dialogues between two interlocutors (about 75% of all conversations), the rest is between three or more people.
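The four utterance-count groups described above can be sketched as a small bucketing helper. This is only an illustration: the function names are hypothetical, and it assumes one utterance per line of the `dialogue` string, which may not hold for every row of the translated data.

```python
from collections import Counter

# Utterance-count groups quoted in the SAMSum description above.
BUCKETS = [(3, 6), (7, 12), (13, 18), (19, 30)]


def count_utterances(dialogue: str) -> int:
    """Count non-empty lines, assuming one utterance per line."""
    return sum(1 for line in dialogue.splitlines() if line.strip())


def bucket_for(n_utterances: int) -> str:
    """Return the label of the group containing n_utterances, or 'other'."""
    for low, high in BUCKETS:
        if low <= n_utterances <= high:
            return f"{low}-{high}"
    return "other"


def bucket_histogram(dialogues: list[str]) -> Counter:
    """Tally how many dialogues fall into each group."""
    return Counter(bucket_for(count_utterances(d)) for d in dialogues)
```

Rows that fall outside 3-30 utterances (for example, DialogSum rows, whose length distribution is not described in this card) are counted under `"other"` rather than rejected.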
The first instance in the training set:
{'id': '13818513', 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.', 'dialogue': "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"}
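Since each utterance carries the speaker's name, a dialogue string like the one above can be split into turns. The sketch below assumes every utterance has the form `Name: text` on its own line; `parse_dialogue` and `speakers` are hypothetical helper names, not part of the dataset or any library.

```python
def parse_dialogue(dialogue: str) -> list[tuple[str, str]]:
    """Split a dialogue into (speaker, utterance) pairs.

    Assumes each line looks like 'Name: text'. A line without any colon
    is treated as a continuation of the previous utterance. A continuation
    line that happens to contain a colon would be misread as a new turn.
    """
    turns: list[tuple[str, str]] = []
    for line in dialogue.splitlines():
        line = line.strip()
        if not line:
            continue
        speaker, sep, text = line.partition(":")
        if sep:
            turns.append((speaker.strip(), text.strip()))
        elif turns:
            prev_speaker, prev_text = turns[-1]
            turns[-1] = (prev_speaker, f"{prev_text} {line}")
    return turns


def speakers(dialogue: str) -> list[str]:
    """Distinct speaker names in order of first appearance."""
    seen: dict[str, None] = {}
    for name, _ in parse_dialogue(dialogue):
        seen.setdefault(name, None)
    return list(seen)
```

Splitting on the first colon only keeps emoticons such as `:-)` inside the utterance text.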
### Data Fields
- dialogue: text of dialogue.
- summary: human written summary of the dialogue.
- id: unique id of an example.
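The `dataset_info` metadata at the top of this card lists two further columns, `index` (int64) and `topic` (string), alongside the three fields above. A minimal record check against that schema might look like the sketch below; `validate_record` is a hypothetical helper, and it assumes every column is populated, which this card does not confirm (for instance, `topic` could be empty or null for some rows).

```python
# Column names and Python types corresponding to the dataset_info metadata.
EXPECTED_FIELDS = {
    "index": int,
    "id": str,
    "dialogue": str,
    "summary": str,
    "topic": str,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the record fits."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems
```

Applied to the SAMSum-style instance shown under Data Instances, which has only `id`, `summary` and `dialogue`, the check reports `index` and `topic` as missing.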
### Data Splits
- train: 24732 (the `dataset_info` metadata at the top of this card reports 24813 examples)
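The single `train` split can be loaded with the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the Hub is reachable; `load_train_split` is a hypothetical wrapper name.

```python
DATASET_ID = "mohamedemam/Arabic-samsum-dialogsum"


def load_train_split():
    """Download (or reuse the cached copy of) the train split."""
    # Imported lazily so the module can be imported without the library.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="train")


# Example usage (requires network access):
#   train = load_train_split()
#   print(train.num_rows)       # compare with the counts reported in this card
#   print(train.column_names)
#   print(train[0]["summary"])
```

Checking `train.num_rows` after loading is a quick way to see which of the example counts given in this card matches the hosted files.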
## Dataset Creation
### Curation Rationale
In paper:
> In the first approach, we reviewed datasets from the following categories: chatbot dialogues, SMS corpora, IRC/chat data, movie dialogues, tweets, comments data (conversations formed by replies to comments), transcription of meetings, written discussions, phone dialogues and daily communication data. Unfortunately, they all differed in some respect from the conversations that are typically written in messenger apps, e.g. they were too technical (IRC data), too long (comments data, transcription of meetings), lacked context (movie dialogues) or they were more of a spoken type, such as a dialogue between a petrol station assistant and a client buying petrol.
> As a consequence, we decided to create a chat dialogue dataset by constructing such conversations that would epitomize the style of a messenger app.
### Source Data
#### Initial Data Collection and Normalization
In paper:
> We asked linguists to create conversations similar to those they write on a daily basis, reflecting the proportion of topics of their real-life messenger conversations. It includes chit-chats, gossiping about friends, arranging meetings, discussing politics, consulting university assignments with colleagues, etc. Therefore, this dataset does not contain any sensitive data or fragments of other corpora.
#### Who are the source language producers?
linguists
### Annotations
#### Annotation process
In paper:
> Each dialogue was created by one person. After collecting all of the conversations, we asked language experts to annotate them with summaries, assuming that they should (1) be rather short, (2) extract important pieces of information, (3) include names of interlocutors, (4) be written in the third person. Each dialogue contains only one reference summary.
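Two of the guidelines quoted above, (1) brevity and (3) naming the interlocutors, lend themselves to a rough automated check. The heuristic below is an illustrative assumption, not the annotators' procedure: `check_summary` is a hypothetical helper, it compares whitespace-separated word counts, and it looks for speaker names by exact substring match, which may fail when names are spelled differently in the dialogue and the summary.

```python
def check_summary(dialogue: str, summary: str) -> dict[str, bool]:
    """Heuristic checks for guidelines (1) and (3) of the annotation process."""
    # Collect speaker names, assuming each utterance starts with 'Name:'.
    names: list[str] = []
    for line in dialogue.splitlines():
        speaker, sep, _ = line.partition(":")
        speaker = speaker.strip()
        if sep and speaker and speaker not in names:
            names.append(speaker)
    return {
        # (1) "rather short": fewer words than the dialogue itself.
        "shorter_than_dialogue": len(summary.split()) < len(dialogue.split()),
        # (3) "include names of interlocutors": at least one speaker is named.
        "mentions_a_speaker": any(name in summary for name in names),
    }
```

Guidelines (2) and (4) concern content and grammatical person, so they are left to human review here.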
#### Who are the annotators?
language experts
### Personal and Sensitive Information
None, see above: Initial Data Collection and Normalization
## Considerations for Using the Data
### Social Impact of Dataset
[Needs More Information]
### Discussion of Biases
[Needs More Information]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
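SAMSum is distributed under a non-commercial licence: CC BY-NC-ND 4.0. The metadata at the top of this card declares `cc-by-nc-2.0` for this dataset.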
### Citation Information
```
@inproceedings{gliwa-etal-2019-samsum,
    title = "{SAMS}um Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization",
    author = "Gliwa, Bogdan and
      Mochol, Iwona and
      Biesek, Maciej and
      Wawer, Aleksander",
    booktitle = "Proceedings of the 2nd Workshop on New Frontiers in Summarization",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-5409",
    doi = "10.18653/v1/D19-5409",
    pages = "70--79"
}
```
### Contributions
Thanks to [@cccntu](https://github.com/cccntu) for adding this dataset.
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
The dataset combines the SAMSum and DialogSum datasets, translated into Arabic. The SAMSum portion contains about 16k messenger-like conversations with summaries, created by linguists fluent in English to reflect the proportion of topics in their real-life messenger conversations. The dialogues vary in style and register (informal, semi-formal, or formal) and may contain slang words, emoticons, and typos. Language experts annotated the conversations with summaries that were meant to be concise, extract important information, include the names of interlocutors, and be written in the third person. SAMSum is distributed for research purposes under a non-commercial license (CC BY-NC-ND 4.0).
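Provider:
mohamedemam
Summary of original information
Dataset overview
Dataset information
- Feature fields:
  - index: int64
  - id: string
  - dialogue: string
  - summary: string
  - topic: string
- Data splits:
  - train: 24813 examples, 27913254 bytes in total
- Download size: 13968520 bytes
- Dataset size: 27913254 bytes
- Configs:
  - default: train data files at data/train-*
- License: CC-BY-NC-2.0
- Task categories:
  - Summarization
  - Conversational
- Language: Arabic
- Dataset name: ar messum
- Size category: 10K<n<100K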
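Dataset description
- Dataset summary:
  - The dataset contains about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English and reflect the proportion of topics of their everyday messenger conversations. Style and register are diverse: conversations may be informal, semi-formal or formal, and may contain slang, emoticons and typos. The conversations were then annotated with summaries, which should concisely describe the conversation in the third person.
- Supported tasks and leaderboards: to be added
- Language: Arabic
Dataset structure
- Data instances:
  - The dataset contains 16369 conversations, distributed uniformly into 4 groups based on the number of utterances per conversation: 3-6, 7-12, 13-18 and 19-30. Most conversations are between two interlocutors (about 75% of all conversations); the rest are between three or more people.
  - The first instance in the training set: { "id": "13818513", "summary": "Amanda baked cookies and will bring Jerry some tomorrow.", "dialogue": "Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you tomorrow :-)" }
- Data fields:
  - dialogue: text of the dialogue
  - summary: human-written summary of the dialogue
  - id: unique identifier of an example
- Data splits:
  - train: 24732 examples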
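Dataset creation
- Curation rationale:
  - According to the paper, in order to create a dialogue dataset that represents the style of messenger apps, linguists were asked to create conversations reflecting the proportion of topics of their everyday messenger conversations.
- Source data:
  - Initial data collection and normalization:
    - Linguists were asked to create conversations similar to those they write daily, including chit-chat, gossiping about friends, arranging meetings, discussing politics, and consulting colleagues about university assignments. Therefore, the dataset does not contain any sensitive data or fragments of other corpora.
  - Source language producers: linguists
- Annotations:
  - Annotation process:
    - Each dialogue was created by one person. After all conversations were collected, language experts were asked to annotate them with summaries, assuming that the summaries should (1) be rather short, (2) extract important information, (3) include the names of interlocutors, and (4) be written in the third person. Each dialogue has only one reference summary.
  - Annotators: language experts
- Personal and sensitive information: none, see Initial data collection and normalization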
Considerations for using the data
- Social impact of the dataset: to be added
- Discussion of biases: to be added
- Other known limitations: to be added
Additional information
- Dataset curators: to be added
- Licensing information: non-commercial licence, CC BY-NC-ND 4.0
- Citation information: @inproceedings{gliwa-etal-2019-samsum, title = "{SAMS}um Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization", author = "Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej and Wawer, Aleksander", booktitle = "Proceedings of the 2nd Workshop on New Frontiers in Summarization", month = nov, year = "2019", address = "Hong Kong, China", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D19-5409", doi = "10.18653/v1/D19-5409", pages = "70--79" }
- Contributions: thanks to @cccntu for adding this dataset.



