mohamedemam/Arabic-samsum-dialogsum

Name: mohamedemam/Arabic-samsum-dialogsum
Creator: mohamedemam
Published: 2023-09-11 14:35:29
License: no description provided

Hugging Face, updated 2023-09-11, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/mohamedemam/Arabic-samsum-dialogsum

Resource description:

---
dataset_info:
  features:
  - name: index
    dtype: int64
  - name: id
    dtype: string
  - name: dialogue
    dtype: string
  - name: summary
    dtype: string
  - name: topic
    dtype: string
  splits:
  - name: train
    num_bytes: 27913254
    num_examples: 24813
  download_size: 13968520
  dataset_size: 27913254
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-nc-2.0
task_categories:
- summarization
- conversational
language:
- ar
pretty_name: ar messum
size_categories:
- 10K<n<100K
---

# Dataset Card for "Arabic-samsum-dialogsum"

This dataset is a combination of the SAMSum and DialogSum datasets, translated into Arabic.

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://arxiv.org/abs/1911.12237v2
- **Repository:** [Needs More Information]
- **Paper:** https://arxiv.org/abs/1911.12237v2
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]

### Dataset Summary

The SAMSum dataset contains about 16k messenger-like conversations with summaries. Conversations were created and written down by linguists fluent in English. Linguists were asked to create conversations similar to those they write on a daily basis, reflecting the proportion of topics of their real-life messenger conversations. The style and register are diversified: conversations could be informal, semi-formal or formal, and they may contain slang words, emoticons and typos. Then, the conversations were annotated with summaries. It was assumed that summaries should be a concise brief of what people talked about in the conversation, written in the third person. The SAMSum dataset was prepared by Samsung R&D Institute Poland and is distributed for research purposes (non-commercial licence: CC BY-NC-ND 4.0).

### Supported Tasks and Leaderboards

[Needs More Information]

### Languages

Arabic

## Dataset Structure

### Data Instances

The created dataset is made of 16369 conversations distributed uniformly into 4 groups based on the number of utterances in conversations: 3-6, 7-12, 13-18 and 19-30. Each utterance contains the name of the speaker. Most conversations consist of dialogues between two interlocutors (about 75% of all conversations); the rest are between three or more people.

The first instance in the training set:

    {'id': '13818513', 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.', 'dialogue': "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"}

### Data Fields

- dialogue: text of dialogue.
- summary: human written summary of the dialogue.
- id: unique id of an example.

### Data Splits

- train: 24732

## Dataset Creation

### Curation Rationale

In paper:

> In the first approach, we reviewed datasets from the following categories: chatbot dialogues, SMS corpora, IRC/chat data, movie dialogues, tweets, comments data (conversations formed by replies to comments), transcription of meetings, written discussions, phone dialogues and daily communication data. Unfortunately, they all differed in some respect from the conversations that are typically written in messenger apps, e.g. they were too technical (IRC data), too long (comments data, transcription of meetings), lacked context (movie dialogues) or they were more of a spoken type, such as a dialogue between a petrol station assistant and a client buying petrol. As a consequence, we decided to create a chat dialogue dataset by constructing such conversations that would epitomize the style of a messenger app.

### Source Data

#### Initial Data Collection and Normalization

In paper:

> We asked linguists to create conversations similar to those they write on a daily basis, reflecting the proportion of topics of their real-life messenger conversations. It includes chit-chats, gossiping about friends, arranging meetings, discussing politics, consulting university assignments with colleagues, etc. Therefore, this dataset does not contain any sensitive data or fragments of other corpora.

#### Who are the source language producers?

linguists

### Annotations

#### Annotation process

In paper:

> Each dialogue was created by one person. After collecting all of the conversations, we asked language experts to annotate them with summaries, assuming that they should (1) be rather short, (2) extract important pieces of information, (3) include names of interlocutors, (4) be written in the third person. Each dialogue contains only one reference summary.

#### Who are the annotators?

language experts

### Personal and Sensitive Information

None, see above: Initial Data Collection and Normalization

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

non-commercial licence: CC BY-NC-ND 4.0

### Citation Information

```
@inproceedings{gliwa-etal-2019-samsum,
    title = "{SAMS}um Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization",
    author = "Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej and Wawer, Aleksander",
    booktitle = "Proceedings of the 2nd Workshop on New Frontiers in Summarization",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-5409",
    doi = "10.18653/v1/D19-5409",
    pages = "70--79"
}
```

### Contributions

Thanks to [@cccntu](https://github.com/cccntu) for adding this dataset.

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
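As a rough illustration of how the card's single `train` split might be consumed, the sketch below wraps the `load_dataset` function from the Hugging Face `datasets` library and pairs each dialogue with its summary. The wrapper `load_train_split` and the helper `summarization_pairs` are hypothetical names introduced here, not part of the dataset or the library, and the sketch assumes rows behave like dictionaries with `dialogue` and `summary` keys.

```python
from typing import Any, Iterable, Iterator, Mapping, Tuple

# Repository name as it appears on this page.
REPO_ID = "mohamedemam/Arabic-samsum-dialogsum"


def load_train_split():
    """Download the `train` split (assumes `pip install datasets` and network access)."""
    # Imported lazily so the helper below can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train")


def summarization_pairs(
    records: Iterable[Mapping[str, Any]],
) -> Iterator[Tuple[str, str]]:
    """Yield (dialogue, summary) pairs, skipping rows where either field is empty."""
    for row in records:
        dialogue = (row.get("dialogue") or "").strip()
        summary = (row.get("summary") or "").strip()
        if dialogue and summary:
            yield dialogue, summary
```

With network access, `summarization_pairs(load_train_split())` would iterate over the whole split; the helper also accepts any in-memory list of dictionaries, which is convenient for quick offline checks.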

The dataset is a combination of the SAMSum and DialogSum datasets translated into Arabic, containing about 16k messenger-like conversations with summaries. The conversations were created by linguists fluent in English and reflect the proportion of topics of their real-life messenger conversations. The dataset includes dialogues in various styles and registers, such as informal, semi-formal, or formal, and may contain slang words, emoticons, and typos. The conversations were annotated with summaries by language experts, assuming that the summaries should be concise, extract important information, include names of interlocutors, and be written in the third person. The dataset is distributed for research purposes under a non-commercial license (CC BY-NC-ND 4.0).

Provided by:

mohamedemam

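Summary of original information

Dataset overview

Dataset information

Features:
- index: int64
- id: string
- dialogue: string
- summary: string
- topic: string
Data splits:
- train: 24813 examples, 27913254 bytes in total
Download size: 13968520 bytes
Dataset size: 27913254 bytes
Configurations:
- the default configuration reads the train data files from data/train-*
License: CC-BY-NC-2.0
Task categories:
- Summarization
- Conversational
Language: Arabic
Dataset name: ar messum
Size category: 10K<n<100K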
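The declared features and sizes above can be turned into a small sanity check. The following is a minimal sketch, not an official validator: `schema_problems` is a hypothetical helper, the mapping from `int64`/`string` to Python `int`/`str` is an assumption, and treating missing or null values as problems may be stricter than the actual data warrants.

```python
from typing import Any, Dict, List

# Feature names from the metadata, mapped to assumed Python types.
EXPECTED_FEATURES: Dict[str, type] = {
    "index": int,
    "id": str,
    "dialogue": str,
    "summary": str,
    "topic": str,
}

# Figures copied from the metadata listed above.
NUM_EXAMPLES = 24813
DATASET_SIZE_BYTES = 27913254
DOWNLOAD_SIZE_BYTES = 13968520


def schema_problems(row: Dict[str, Any]) -> List[str]:
    """Return human-readable problems for one record (empty list if none)."""
    problems: List[str] = []
    for name, expected in EXPECTED_FEATURES.items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# Derived figures: average record size and download-to-dataset size ratio.
avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES
```

From the listed sizes, an example averages a little over 1.1 KB, and the download is about half the size of the prepared dataset.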

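Dataset description

Dataset summary:
- The dataset contains about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English and reflect the proportion of topics in their everyday messenger conversations. Style and register vary: conversations may be informal, semi-formal, or formal, and may contain slang, emoticons, and typos. The conversations were then annotated with summaries, each meant to be a concise account of the conversation written in the third person.
Supported tasks and leaderboards: to be added
Language: Arabic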

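Dataset structure

Data instances:
- The dataset contains 16369 conversations distributed uniformly into 4 groups by the number of utterances per conversation: 3-6, 7-12, 13-18, and 19-30. Most conversations (about 75%) are dialogues between two interlocutors; the rest involve three or more people.
- The first instance in the training set: { "id": "13818513", "summary": "Amanda baked cookies and will bring Jerry some tomorrow.", "dialogue": "Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you tomorrow :-)" }
Data fields:
- dialogue: text of the dialogue
- summary: human-written summary of the dialogue
- id: unique identifier of an example
Data splits:
- train: 24732 examples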
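The utterance-count groups and the speaker-prefixed format described above could be explored with a sketch like the following. It assumes each utterance sits on its own line and that the speaker name precedes the first colon, as in the quoted English instance; the Arabic translations may not always follow this convention. `split_turns` and `group_for` are hypothetical helper names.

```python
from typing import List, Optional, Tuple

# Utterance-count groups quoted in the dataset description.
GROUPS: List[Tuple[int, int]] = [(3, 6), (7, 12), (13, 18), (19, 30)]


def split_turns(dialogue: str) -> List[Tuple[str, str]]:
    """Split a dialogue into (speaker, utterance) pairs, one per line."""
    turns: List[Tuple[str, str]] = []
    for line in dialogue.splitlines():  # handles both "\n" and "\r\n"
        speaker, sep, text = line.partition(":")
        if sep and speaker.strip():
            turns.append((speaker.strip(), text.strip()))
    return turns


def group_for(n_utterances: int) -> Optional[Tuple[int, int]]:
    """Return the (low, high) group containing n_utterances, or None."""
    for low, high in GROUPS:
        if low <= n_utterances <= high:
            return (low, high)
    return None


example = (
    "Amanda: I baked cookies. Do you want some?\r\n"
    "Jerry: Sure!\r\n"
    "Amanda: I'll bring you tomorrow :-)"
)
turns = split_turns(example)
speakers = {speaker for speaker, _ in turns}
```

Counting `len(speakers)` per dialogue is one way to check the stated share of two-person conversations against the actual data; lines without a colon are skipped here, which is a simplification.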

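Dataset creation

Curation rationale:
- According to the paper, the goal was a dialogue dataset representative of messenger-app style, so linguists were asked to write conversations reflecting the proportion of topics in their everyday messenger conversations.
Source data:
- Initial data collection and normalization:
  - Linguists were asked to create conversations similar to those they write on a daily basis, including chit-chat, gossip about friends, arranging meetings, discussing politics, and consulting colleagues about university assignments. The dataset therefore contains no sensitive data or fragments of other corpora.
- Source language producers: linguists
Annotations:
- Annotation process:
  - Each dialogue was created by one person. After all conversations were collected, language experts were asked to annotate them with summaries that should (1) be rather short, (2) extract important pieces of information, (3) include the names of the interlocutors, and (4) be written in the third person. Each dialogue has only one reference summary.
- Annotators: language experts
Personal and sensitive information: none; see the initial data collection and normalization section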
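Two of the annotation guidelines listed above, brevity and naming the interlocutors, lend themselves to a crude automatic check. The sketch below is only a heuristic under stated assumptions: `summary_checks` is a hypothetical helper, "rather short" is approximated as "shorter than the dialogue", and speaker names are assumed to precede the first colon on each line. The third-person guideline is language-dependent and is deliberately not checked.

```python
from typing import Dict


def summary_checks(dialogue: str, summary: str) -> Dict[str, bool]:
    """Heuristically test a summary against two of the annotation guidelines."""
    # Collect speaker names: the text before the first colon on each line.
    speakers = {
        line.partition(":")[0].strip()
        for line in dialogue.splitlines()
        if ":" in line
    }
    speakers.discard("")
    return {
        # Guideline (1): approximated as "shorter than the dialogue".
        "shorter_than_dialogue": len(summary) < len(dialogue),
        # Guideline (3): at least one speaker name appears verbatim.
        "mentions_speaker": any(name in summary for name in speakers),
    }


checks = summary_checks(
    "Amanda: I baked cookies. Do you want some?\r\n"
    "Jerry: Sure!\r\n"
    "Amanda: I'll bring you tomorrow :-)",
    "Amanda baked cookies and will bring Jerry some tomorrow.",
)
```

For translated data, a name may be spelled differently in the dialogue and in the summary, so a failed `mentions_speaker` check is a prompt for manual review rather than evidence of a bad summary.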

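Considerations for using the data

Social impact of the dataset: to be added
Discussion of biases: to be added
Other known limitations: to be added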

Additional information

Dataset curators: to be added
Licensing information: non-commercial licence: CC BY-NC-ND 4.0
Citation information (BibTeX):
    @inproceedings{gliwa-etal-2019-samsum,
        title = "{SAMS}um Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization",
        author = "Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej and Wawer, Aleksander",
        booktitle = "Proceedings of the 2nd Workshop on New Frontiers in Summarization",
        month = nov,
        year = "2019",
        address = "Hong Kong, China",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/D19-5409",
        doi = "10.18653/v1/D19-5409",
        pages = "70--79"
    }
Contributions: thanks to @cccntu for adding this dataset.
