Readability-Aware Summarization Dataset for Turkish

Name: Readability-Aware Summarization Dataset for Turkish
Creator: IEEE DataPort
Published: 2025-03-19 15:29:43
License: No description available

Source: DataCite Commons. Updated 2025-03-19; indexed 2025-04-16.

Download link:

https://ieee-dataport.org/documents/readability-aware-summarization-dataset-turkish

Description:

This dataset was constructed in a study that addresses the gap between text summarization and content readability for diverse Turkish-speaking audiences. It contains paired original texts and corresponding summaries optimized for different readability levels using the YOD (Yeni Okunabilirlik Düzeyi) formula.

YOD Readability Metric: The Bezirci-Yılmaz readability formula defines the YOD readability metric, which is designed specifically for Turkish texts. It calculates the readability score from the average number of polysyllabic words (three or more syllables) per sentence. The metric assigns weights to these polysyllabic words and combines them with the average sentence length to assess text complexity.

Dataset Creation Logic: To create the dataset, the VBART-Large-Paraphrasing model was used to augment the existing datasets by generating paraphrased variations at both the sentence and full-text levels. This approach yielded content with a wider range of YOD values, both higher and lower, from the same source material. To maintain semantic integrity, each paraphrase was compared to the original summary using BERTScore to verify that the synthetic data achieved the intended readability adjustments while remaining faithful to the source. In addition, ChatGPT's API was used for synthetic data generation, enriching the dataset with diverse, high-quality rewritten summaries.

Dataset Creation: The dataset is compiled from multiple sources: XLSUM (970 entries), TRNews (5,000 entries), MLSUM (1,033 entries), LR-SUM (1,107 entries), and Wikipedia-trsummarization (3,024 entries). Sampling is done hierarchically, from the longest text that fits into the tokenizer (vngrs-ai/VBART-Large-Paraphrasing) without truncation down to the shortest content. After synthetic data generation, the dataset expands to 76,759 summaries. To guarantee a thorough evaluation, 200 samples for each YOD level are allocated to both the test and validation sets, resulting in a total of 3,200 examples for test and validation.
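
The YOD metric described above can be illustrated with a minimal Python sketch. It is not the dataset authors' implementation. The weights (0.84, 1.5, 3.5, 26.25 for words of 3, 4, 5, and 6 or more syllables) and the square-root form follow the Bezirci-Yılmaz formula as it is commonly cited; they are an assumption here, since this description does not state them. The function names, the vowel-counting syllable heuristic, and the punctuation-based sentence splitter are likewise hypothetical simplifications.

```python
import math
import re

# Each Turkish syllable contains exactly one vowel, so counting vowels
# gives a reasonable syllable count for native words.
TURKISH_VOWELS = set("aeıioöuüâîû")

# Assumed weights per syllable class (3, 4, 5, 6+), as commonly cited
# for the Bezirci-Yılmaz formula; not stated in the dataset description.
WEIGHTS = {3: 0.84, 4: 1.5, 5: 3.5, 6: 26.25}

# A word is a run of letters, optionally followed by apostrophe suffixes
# (e.g. "Türkiye'nin" is kept as a single word).
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def turkish_lower(text):
    """Lowercase with Turkish casing rules: I -> ı and İ -> i."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def count_syllables(word):
    """Approximate the syllable count of a Turkish word by counting vowels."""
    return sum(1 for ch in turkish_lower(word) if ch in TURKISH_VOWELS)


def yod_score(text):
    """Return an approximate YOD score for the given text.

    Sentences are split on ., !, ? and the ellipsis character, which is a
    simplification: abbreviations and decimal numbers will be mis-split.
    """
    sentences = [s for s in re.split(r"[.!?…]+", text) if s.strip()]
    if not sentences:
        return 0.0

    total_words = 0
    poly_counts = {3: 0, 4: 0, 5: 0, 6: 0}
    for sentence in sentences:
        words = WORD_RE.findall(sentence)
        total_words += len(words)
        for word in words:
            syllables = count_syllables(word)
            if syllables >= 3:
                # Words with six or more syllables share one class.
                poly_counts[min(syllables, 6)] += 1

    n = len(sentences)
    avg_words_per_sentence = total_words / n
    weighted_polysyllables = sum(
        WEIGHTS[k] * poly_counts[k] / n for k in WEIGHTS
    )
    return math.sqrt(avg_words_per_sentence * weighted_polysyllables)


if __name__ == "__main__":
    print(yod_score("Öğretmen okula gitti."))
```

Under these assumptions, a text with no words of three or more syllables scores 0, and the score rises with both sentence length and the share of long words. Scores from this sketch may differ from the values stored in the dataset, because tokenization and sentence splitting choices affect the averages.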

Provider:

IEEE DataPort

Created:

2025-03-19
