
Readability-Aware Summarization Dataset for Turkish

DataCite Commons | Updated 2025-03-19 | Indexed 2025-04-16
Download link:
https://ieee-dataport.org/documents/readability-aware-summarization-dataset-turkish
Description:
This dataset was constructed in a study addressing the gap between text summarization and content readability for diverse Turkish-speaking audiences. It contains paired original texts and corresponding summaries optimized for different readability levels using the YOD (Yeni Okunabilirlik Düzeyi) formula.

YOD Readability Metric: The Bezirci-Yılmaz readability formula defines the YOD readability metric, which is designed specifically for Turkish texts. It calculates the readability score from the average number of polysyllabic words (three or more syllables) per sentence. The metric assigns weights to these polysyllabic words and combines them with the average sentence length to assess text complexity.

Dataset Creation Logic: To create the dataset, the VBART-Large-Paraphrasing model was used to augment the existing datasets by generating paraphrased variations at both the sentence and full-text levels. This yielded content with a wider range of YOD values, both higher and lower, from the same source material. To maintain semantic integrity, each paraphrase was compared to the original summary using BERTScore to verify that the synthetic data achieved the intended readability adjustments while remaining faithful to the source. ChatGPT's API was also used for synthetic data generation, enriching the dataset with diverse, high-quality rewritten summaries.

Dataset Creation: The dataset is compiled from multiple sources: XLSUM (970 entries), TRNews (5,000 entries), MLSUM (1,033 entries), LR-SUM (1,107 entries), and Wikipedia-trsummarization (3,024 entries). Sampling proceeds hierarchically, from the longest text that fits into the tokenizer (vngrs-ai/VBART-Large-Paraphrasing) without truncation down to the shortest content. After synthetic data generation, the dataset expands to 76,759 summaries.
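The YOD metric described above can be illustrated with a minimal sketch. The description does not give the weights, so the values below (0.84, 1.5, 3.5, 26.25 for words of 3, 4, 5, and 6+ syllables) are the ones commonly cited for the Bezirci-Yılmaz formula and should be treated as an assumption. The sentence splitter, the tokenizer, and the function names `count_syllables` and `yod_score` are hypothetical helpers, not part of the dataset's tooling. Syllables are counted as vowels, which holds for Turkish orthography.

```python
import math
import re

# Turkish vowels (lowercase, including circumflexed forms); one vowel = one syllable.
VOWELS = set("aeıioöuüâîû")

# Assumed weights for 3-, 4-, 5-, and 6+-syllable words, as commonly cited
# for the Bezirci-Yılmaz formula (not stated in the dataset description).
WEIGHTS = {3: 0.84, 4: 1.5, 5: 3.5, 6: 26.25}

# Letter runs, optionally joined by an apostrophe (e.g. "Türkiye'nin").
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def count_syllables(word: str) -> int:
    """Count syllables in a Turkish word by counting its vowels."""
    return sum(1 for ch in word.lower() if ch in VOWELS)


def yod_score(text: str) -> float:
    """Sketch of YOD: sqrt(avg words per sentence * weighted polysyllable averages)."""
    # Naive sentence split on terminal punctuation; real text may need more care.
    sentences = [s for s in re.split(r"[.!?…]+", text) if s.strip()]
    if not sentences:
        return 0.0

    total_words = 0
    poly_counts = {3: 0, 4: 0, 5: 0, 6: 0}
    for sentence in sentences:
        for word in WORD_RE.findall(sentence):
            total_words += 1
            n = count_syllables(word)
            if n >= 3:
                # Words with 6 or more syllables share one bucket.
                poly_counts[min(n, 6)] += 1

    n_sent = len(sentences)
    avg_words = total_words / n_sent
    weighted = sum(WEIGHTS[k] * poly_counts[k] / n_sent for k in poly_counts)
    return math.sqrt(avg_words * weighted)


if __name__ == "__main__":
    sample = "Ali okula gitti. Öğretmen kitapları dağıttı."
    print(round(yod_score(sample), 3))
```

Longer sentences and longer words both raise the score, so a paraphrase that splits sentences or swaps in shorter words would move a summary toward a lower YOD value.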
To guarantee a thorough evaluation, 200 samples for each YOD level are allocated to both the test and validation sets, resulting in a total of 3,200 examples for both test and validation.
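The per-level allocation described above can be sketched as a stratified split. This is only an illustration: the record layout, the `yod_level` key, and the function name `split_by_level` are hypothetical, and the description does not state how many YOD levels exist. The usage example assumes 16 levels because 16 × 200 = 3,200, which is one reading of the stated total.

```python
import random
from collections import defaultdict


def split_by_level(records, per_level=200, seed=0):
    """Hold out `per_level` records per YOD level for test and for validation.

    `records` is a list of dicts with a hypothetical integer key "yod_level".
    Returns (train, validation, test) lists.
    """
    rng = random.Random(seed)

    # Group records by their readability level.
    by_level = defaultdict(list)
    for record in records:
        by_level[record["yod_level"]].append(record)

    train, validation, test = [], [], []
    for level in sorted(by_level):
        items = list(by_level[level])
        if len(items) < 2 * per_level:
            raise ValueError(f"level {level} has too few records for both splits")
        rng.shuffle(items)
        test.extend(items[:per_level])
        validation.extend(items[per_level:2 * per_level])
        train.extend(items[2 * per_level:])
    return train, validation, test


if __name__ == "__main__":
    # Synthetic placeholder data: 16 assumed levels with 500 records each.
    data = [{"id": (lvl, i), "yod_level": lvl}
            for lvl in range(1, 17) for i in range(500)]
    train, validation, test = split_by_level(data)
    print(len(train), len(validation), len(test))
```

Shuffling within each level before slicing keeps the held-out sets balanced across readability levels while leaving the remaining records for training.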
Provider:
IEEE DataPort
Created:
2025-03-19