xtinge/turkish-extractive-summarization-dataset

Name: xtinge/turkish-extractive-summarization-dataset
Creator: xtinge
Published: 2024-05-06 07:40:09
License: gpl-3.0 (as stated in the dataset card)

Source: Hugging Face. Updated 2024-05-06; indexed 2024-06-22.

Download link:

https://hf-mirror.com/datasets/xtinge/turkish-extractive-summarization-dataset


Description:

---
dataset_info:
- config_name: mlsum_tr_ext
- config_name: xtinge-sum_tr_ext
- config_name: tes
configs:
- config_name: mlsum_tr_ext
  data_files:
  - split: train
    path: MLSUM_TR_EXT/train*
  - split: test
    path: MLSUM_TR_EXT/test*
  - split: val
    path: MLSUM_TR_EXT/val*
- config_name: xtinge-sum_tr_ext
  data_files:
  - split: test
    path: XTINGE-SUM_TR_EXT/XTINGE-SUM_TR_EXT*
- config_name: tes
  data_files:
  - split: test
    path: TES/tes*
task_categories:
- summarization
license: gpl-3.0
---

# XTINGE Turkish Extractive Summarization Datasets

This repository hosts three datasets created for advancing Turkish extractive text summarization research: MLSUM_TR_EXT, TES, and XTINGE-SUM_TR_EXT. These datasets are designed to support the development of models capable of generating concise and relevant extractive summaries of Turkish texts.

Below is a Python example showcasing how to download and use these datasets:

```python
from datasets import load_dataset

# Load the MLSUM_TR_EXT dataset
mlsum_tr_ext = load_dataset("xtinge/turkish-extractive-summarization-dataset", "mlsum_tr_ext")

# Load the TES dataset
tes = load_dataset("xtinge/turkish-extractive-summarization-dataset", "tes")

# Load the xtinge-sum_tr_ext dataset
xtinge_sum_tr_ext = load_dataset("xtinge/turkish-extractive-summarization-dataset", "xtinge-sum_tr_ext")
```

## Dataset Details

### Dataset Description

The datasets, having a focus on Turkish text summarization, aim to advance research in this area by providing structured, annotated resources for extractive summarization tasks. These datasets are:

1. **MLSUM_TR_EXT**:
   - Originates as an extension of the Turkish subset from the [MLSUM dataset](https://huggingface.co/datasets/mlsum), focusing on extractive summarization.
   - Comprises articles from internethaber.com, with summaries derived from existing headlines for creating contextually rich extractive summaries.
   - Sentences within these articles were selected based on their SBERT Similarity and ROUGE Scores compared to the original summaries, ensuring relevance and conciseness.

2. **TES**:
   - Represents a unique collection found on [Hugging Face](https://huggingface.co/erturkerdagi/turkishExtractiveSummarization/tree/main) tailored for Turkish extractive summarization.
   - Contains a variety of news articles annotated by three distinct annotators, each providing different perspectives and lengths, thus contributing to a rich set of summarization examples.

3. **XTINGE-SUM_TR_EXT**:
   - Specifically developed to supplement existing resources by providing detailed sentence importance rankings within lengthy Wikipedia documents.
   - Features annotations by three different annotators who meticulously ranked all sentences by importance, contributing to a comprehensive resource for studying extractive summarization.
   - The annotation process considered Inter Annotator Agreement, specifically employing Krippendorff's alpha to ensure consistency and reliability in sentence importance assessments.

- **Language(s) (NLP):** Turkish
- **License:** [gpl-3.0]

## Dataset Structure

### Generic Structure Across Datasets

All three datasets share a generic structure tailored for extractive summarization tasks, comprising the following elements:

- **Title**: The title of the document or article, serving as a concise representation of the content.
- **Sentences**: The body of the text, split into sentences. This segmentation facilitates the identification of individual sentences that contribute to the summary.
- **Annotations**: This section includes annotations for selecting summary sentences. It is subdivided into:
  - **Indexes**: Indices of sentences that have been selected for the summary. This field varies across datasets based on the number of annotators.
  - **Ranking**: Rankings assigned to sentences based on their perceived importance for the summary. This feature is more prominent in datasets focusing on sentence importance ranking.

```python
{
    'Title': '<title_of_document>',
    'Sentences': ['<sentence_1>', '<sentence_2>', ..., '<sentence_n>'],
    'Annotations': {
        'Indexes': {
            'Annotator1': [<index_of_selected_sentence_1>, ..., <index_of_selected_sentence_m>],
            # If there are more than one annotator
            'Annotator2': [...],
            # etc.
        },
        'Ranking': {
            'Annotator1': [<ranking_of_first_sentence>,<ranking_of_second_sentence>,..., <ranking_of_mth_sentence>],
            # If there are more than one annotator
            'Annotator2': [...],
            # etc.
        }
    }
}
```

## Cite XTINGE Turkish Extractive Summarization Dataset

```
@inproceedings{xtinge_turkish_extractive,
  title = {Extractive Summarization Data Sets Generated with Measurable Analyses},
  author = {Demir, İrem and Küpçü, Emel and Küpçü, Alptekin},
  booktitle = {Proceedings of the 32nd IEEE Conference on Signal Processing and Communications Applications},
  year = {2024}
}
```

Provider:

xtinge

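Summary of Original Information

XTINGE Turkish Extractive Summarization Datasets

Dataset Overview

This dataset contains three sub-datasets for research on Turkish extractive text summarization: MLSUM_TR_EXT, TES, and XTINGE-SUM_TR_EXT. They are intended to support the development of models that generate concise and relevant summaries.

Dataset Details

Dataset Description

MLSUM_TR_EXT:
- Originates as an extension of the Turkish subset of the MLSUM dataset, focusing on extractive summarization.
- Comprises articles from internethaber.com, with summaries derived from existing headlines to create contextually rich extractive summaries.
- Sentences within the articles were selected based on their SBERT similarity and ROUGE scores against the original summaries, ensuring relevance and conciseness.
TES:
- A unique collection available on Hugging Face, tailored for Turkish extractive summarization.
- Contains a variety of news articles annotated by three distinct annotators, each providing different perspectives and lengths, which yields a rich set of summarization examples.
XTINGE-SUM_TR_EXT:
- Developed specifically to supplement existing resources by providing detailed sentence importance rankings for lengthy Wikipedia documents.
- Three different annotators carefully ranked all sentences by importance, providing a comprehensive resource for studying extractive summarization.
- The annotation process considered inter-annotator agreement, specifically using Krippendorff's alpha to ensure consistency and reliability in the sentence importance assessments.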
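
As an illustration of the agreement measure mentioned above, the following is a minimal sketch of Krippendorff's alpha for complete data (every annotator rates every sentence). It assumes the interval (squared difference) distance; the description does not state which distance metric the authors used, and an ordinal metric is also common for rankings. The function name `krippendorff_alpha_interval` and the sample rankings are hypothetical.

```python
from itertools import permutations


def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha with the interval metric, no missing values.

    ratings: one list per annotator, each holding one numeric value
    (for example a rank) per sentence.
    """
    m = len(ratings)      # number of annotators
    n = len(ratings[0])   # number of rated sentences (units)
    if m < 2 or any(len(r) != n for r in ratings):
        raise ValueError("need >= 2 annotators with equally long rating lists")

    # Observed disagreement: squared differences between annotators
    # within each unit, averaged over all ordered pairs.
    d_o = sum(
        (a - b) ** 2
        for unit in zip(*ratings)
        for a, b in permutations(unit, 2)
    ) / (n * m * (m - 1))

    # Expected disagreement: squared differences between all ordered
    # pairs of values, regardless of which unit they belong to.
    values = [v for r in ratings for v in r]
    total = len(values)
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (total * (total - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all values are identical")

    return 1 - d_o / d_e


# Hypothetical rankings of four sentences by three annotators.
rankings = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]
print(krippendorff_alpha_interval(rankings))
```

Values close to 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.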

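Dataset Structure

All three datasets share a generic structure suited to extractive summarization tasks, comprising the following elements:

Title: The title of the document or article, serving as a concise representation of the content.
Sentences: The body of the text, split into sentences. This segmentation makes it possible to identify the individual sentences that contribute to the summary.
Annotations: Annotations for selecting summary sentences, subdivided into:
- Indexes: Indices of the sentences selected for the summary. This field varies with the number of annotators.
- Ranking: Rankings assigned to sentences according to their importance for the summary. This feature is more prominent in datasets that focus on sentence importance ranking.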

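
The structure described above can be consumed roughly as follows. This is a minimal sketch on an invented record, not the authors' tooling: it assumes that `Indexes` holds 0-based positions into `Sentences` (the description does not say whether indexing starts at 0 or 1), and the helper names `summary_for` and `majority_summary` are hypothetical.

```python
from collections import Counter

# Hypothetical record following the documented structure; values are invented.
record = {
    "Title": "Example document",
    "Sentences": ["Sentence A.", "Sentence B.", "Sentence C.", "Sentence D."],
    "Annotations": {
        "Indexes": {
            "Annotator1": [0, 2],
            "Annotator2": [0, 3],
            "Annotator3": [0, 2],
        },
    },
}


def summary_for(record, annotator):
    """Return the sentences one annotator selected, in document order."""
    sentences = record["Sentences"]
    # Assumption: indexes are 0-based positions into 'Sentences'.
    picked = sorted(record["Annotations"]["Indexes"][annotator])
    return [sentences[i] for i in picked]


def majority_summary(record, min_votes=2):
    """Return the sentences selected by at least `min_votes` annotators."""
    votes = Counter()
    for picked in record["Annotations"]["Indexes"].values():
        votes.update(set(picked))  # count each annotator at most once per sentence
    keep = sorted(i for i, count in votes.items() if count >= min_votes)
    return [record["Sentences"][i] for i in keep]


print(summary_for(record, "Annotator2"))
print(majority_summary(record))
```

The majority vote is only one way to merge several annotators into a single reference summary; evaluating against each annotator separately is an equally reasonable choice.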

