musabg/wikipedia-tr-summarization

Name: musabg/wikipedia-tr-summarization
Creator: musabg
Published: 2023-06-13 04:29:02
License: No description available

Hugging Face · Updated 2023-06-13 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/musabg/wikipedia-tr-summarization

Description:

---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  splits:
  - name: train
    num_bytes: 324460408.0479985
    num_examples: 119110
  - name: validation
    num_bytes: 17077006.95200153
    num_examples: 6269
  download_size: 216029002
  dataset_size: 341537415
task_categories:
- summarization
language:
- tr
pretty_name: Wikipedia Turkish Summarization
size_categories:
- 100K<n<1M
---

# Wikipedia Turkish Summarization Dataset

## Dataset Description

This is a Turkish summarization dataset 🇹🇷 prepared from the 2023 Wikipedia dump. The dataset was cleaned, tokenized, and summarized using the Hugging Face Wikipedia dataset cleaner script, custom cleaning scripts, and OpenAI's gpt-3.5-turbo API.

### Data Source

- Wikipedia's latest Turkish dump (2023 version) 🌐

### Features

- text: string (the original text extracted from Wikipedia articles 📖)
- summary: string (the generated summary of the original text 📝)

### Data Splits

| Split      | Num Bytes       | Num Examples |
|------------|-----------------|--------------|
| train      | 324,460,408.048 | 119,110      |
| validation | 17,077,006.952  | 6,269        |

### Download Size

- 216,029,002 bytes

### Dataset Size

- 341,537,415 bytes

## Data Preparation

### Data Collection

1. The latest Turkish Wikipedia dump was downloaded 📥.
2. The Hugging Face Wikipedia dataset cleaner script was used to clean the text 🧹.
3. A custom script was used to further clean the text, removing sections such as "Kaynakça" (References) and other irrelevant information 🛠️.

### Tokenization

The dataset was tokenized using Google's MT5 tokenizer. The following criteria were applied:

- Articles with a token count between 300 and 900 were selected ✔️.
- Articles with fewer than 300 tokens were ignored ❌.
- For articles with more than 900 tokens, only the leading portion of at most 900 tokens that ends on a paragraph boundary was selected 🔍.

### Summarization

The resulting raw texts were summarized using OpenAI's gpt-3.5-turbo API 🤖.
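The length criteria described under Tokenization can be sketched roughly as follows. This is a minimal illustration, not the author's script: the function name `select_article_text`, the assumption that paragraphs are separated by blank lines, and the choice to drop an article whose first paragraph alone exceeds the limit are all hypothetical. The token counter is passed in as a parameter so that an MT5 tokenizer can be swapped in; a whitespace counter stands in here to keep the example self-contained.

```python
from typing import Callable, Optional


def select_article_text(
    text: str,
    count_tokens: Callable[[str], int],
    min_tokens: int = 300,
    max_tokens: int = 900,
) -> Optional[str]:
    """Apply the 300/900 token criteria to one article (hypothetical helper).

    Returns the text to keep, or None if the article should be skipped.
    """
    total = count_tokens(text)
    if total < min_tokens:
        return None  # too short: ignored
    if total <= max_tokens:
        return text  # within range: kept as-is

    # Too long: keep whole leading paragraphs while the running text
    # still fits in max_tokens (assumes blank-line paragraph breaks).
    kept = []
    for paragraph in text.split("\n\n"):
        candidate = "\n\n".join(kept + [paragraph])
        if count_tokens(candidate) > max_tokens:
            break
        kept.append(paragraph)
    # Assumption: if not even the first paragraph fits, skip the article.
    return "\n\n".join(kept) if kept else None


def whitespace_count(text: str) -> int:
    """Stand-in token counter; the dataset itself used an MT5 tokenizer."""
    return len(text.split())


# To count MT5 tokens instead (needs `transformers` and a model download;
# the exact checkpoint is not stated in the card, so this one is a guess):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("google/mt5-base")
#   count_tokens = lambda s: len(tok.tokenize(s))
```

Calling `select_article_text(article, whitespace_count)` returns either the (possibly shortened) article or `None`, so it can be used directly as a filter-and-map step over cleaned articles.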

## Dataset Usage

This dataset can be used for various natural language processing tasks 👩‍💻, such as text summarization, machine translation, and language modeling in the Turkish language.

Example usage:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("musabg/wikipedia-tr-summarization")

# Access the data
train_data = dataset["train"]
validation_data = dataset["validation"]

# Iterate through the data
for example in train_data:
    text = example["text"]
    summary = example["summary"]
    # Process the data as needed
```

Please make sure to cite the dataset as follows 📝:

```bibtex
@misc{musabg2023wikipediatrsummarization,
  author = {Musab Gultekin},
  title = {Wikipedia Turkish Summarization Dataset},
  year = {2023},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/datasets/musabg/wikipedia-tr-summarization}},
}
```

---

## Wikipedia Turkish Summarization Dataset (Wikipedia Türkçe Özetleme Veri Seti)

This is a Turkish summarization dataset prepared from the 2023 Wikipedia dump. The dataset was cleaned, tokenized, and summarized using the Hugging Face Wikipedia dataset cleaner script, custom cleaning scripts, and OpenAI's gpt-3.5-turbo API.

### Data Source

- Wikipedia's latest Turkish dump (2023 version)

### Features

- text: string (the original text extracted from Wikipedia articles)
- summary: string (the generated summary of the original text)

### Data Splits

| Split      | Num Bytes       | Num Examples |
|------------|-----------------|--------------|
| train      | 324,460,408.048 | 119,110      |
| validation | 17,077,006.952  | 6,269        |

### Download Size

- 216,029,002 bytes

### Dataset Size

- 341,537,415 bytes

## Data Preparation

### Data Collection

1. The latest Turkish Wikipedia dump was downloaded.
2. The Hugging Face Wikipedia dataset cleaner script was used to clean the text.
3. A custom script was used to remove sections such as "Kaynakça" (References) and other irrelevant information.
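
The custom cleaning step that removes sections such as "Kaynakça" is not published in this card, so the following is only a guess at its general shape. It assumes the cleaned article is plain text in which a section heading sits alone on its own line, and it cuts everything from the first such heading onward. The function name `drop_trailing_sections` and the default heading tuple are hypothetical.

```python
from typing import Iterable


def drop_trailing_sections(text: str, headings: Iterable[str] = ("Kaynakça",)) -> str:
    """Cut the article at the first line that is exactly one of `headings`.

    Assumption: headings appear alone on a line in the cleaned plain text.
    """
    heading_set = set(headings)
    kept = []
    for line in text.splitlines():
        if line.strip() in heading_set:
            break  # everything from this heading onward is discarded
        kept.append(line)
    return "\n".join(kept).rstrip()
```

Other end-of-article headings could be passed through `headings`; which ones the author actually removed is not stated in the source.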

### Tokenization

The dataset was tokenized using Google's MT5 tokenizer. The following criteria were applied:

- Articles with between 300 and 900 tokens were selected.
- Articles with fewer than 300 tokens were ignored.
- For articles with more than 900 tokens, only the leading portion of at most 900 tokens that ends on a paragraph boundary was kept.

### Summarization

The resulting raw texts were summarized using OpenAI's gpt-3.5-turbo API.

## Dataset Usage

This dataset can be used for various natural language processing tasks in Turkish, such as text summarization, machine translation, and language modeling.

Example usage:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("musabg/wikipedia-tr-summarization")

# Access the data
train_data = dataset["train"]
validation_data = dataset["validation"]

# Iterate through the data
for example in train_data:
    text = example["text"]
    summary = example["summary"]
    # Process the data as needed
```
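
As a small follow-up to the usage example, the sketch below measures how much shorter the summaries are than the source texts. It relies only on the two documented fields, `text` and `summary`, and works on any iterable of such records, including `dataset["train"]`. The helper name `summary_length_ratio` is hypothetical, and character counts are used as a simple stand-in for token counts.

```python
from statistics import mean
from typing import Iterable, Mapping


def summary_length_ratio(records: Iterable[Mapping[str, str]]) -> float:
    """Mean of len(summary) / len(text), in characters, over the records.

    Records with an empty `text` are skipped to avoid division by zero.
    """
    ratios = [
        len(record["summary"]) / len(record["text"])
        for record in records
        if record["text"]
    ]
    if not ratios:
        raise ValueError("no records with non-empty text")
    return mean(ratios)


# In-memory stand-in for rows of the dataset (made-up lengths, not real data).
sample = [
    {"text": "a" * 100, "summary": "a" * 25},
    {"text": "b" * 200, "summary": "b" * 50},
]
print(summary_length_ratio(sample))
```

With the real data, `summary_length_ratio(dataset["validation"])` would give a quick sanity check on summary lengths before training.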

Provider:

musabg

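Summary of Original Information

Dataset Overview

Name: Wikipedia Turkish Summarization

Language: Turkish (tr)

Task category: Summarization

Size category: 100K<n<1M

Dataset Features

text: string (the original text extracted from Wikipedia articles)
summary: string (the generated summary of the original text)

Data Splits

Split	Bytes	Examples
train	324,460,408.048	119,110
validation	17,077,006.952	6,269

Dataset Size

Download size: 216,029,002 bytes
Dataset size: 341,537,415 bytes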
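
The figures above are mutually consistent, which a few lines of arithmetic can confirm. The sketch uses only the numbers listed in this summary; the variable names are illustrative.

```python
# Figures copied from the split and size listings above.
train_bytes = 324_460_408.048
validation_bytes = 17_077_006.952
train_examples = 119_110
validation_examples = 6_269
download_size = 216_029_002
dataset_size = 341_537_415

# The per-split byte counts should add up to the dataset size.
split_bytes_total = train_bytes + validation_bytes

# Share of all examples that fall in the validation split.
total_examples = train_examples + validation_examples
validation_fraction = validation_examples / total_examples

# Ratio of the download size to the in-memory dataset size.
download_ratio = download_size / dataset_size

print(f"split bytes total:   {split_bytes_total:,.0f}")
print(f"total examples:      {total_examples:,}")
print(f"validation fraction: {validation_fraction:.3f}")
print(f"download / dataset:  {download_ratio:.2f}")
```

The split byte counts sum to the stated dataset size, the validation split works out to about 5% of the 125,379 examples, and the download is roughly 63% of the dataset size.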
