
turkish-nlp-suite/OzenliDerlem

Source: Hugging Face. Updated 2026-03-07. Indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/turkish-nlp-suite/OzenliDerlem
Description:
---
annotations_creators:
- Duygu Altinok
language:
- tr
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
pretty_name: OzenliDerlem
config_names:
- GeziNotlari
- Havadis
- KulturHaritasi
- MasalMasal
- Perdearkasi-Yorumlar
- PopulerBilim
- Serzenisler
- SusluTrendler
- TeknoYazilar
- ViralMedya
- YazarinKaleminden
dataset_info:
- config_name: GeziNotlari
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 176580
    num_examples: 33174
- config_name: Havadis
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2807540
    num_examples: 744868
- config_name: KulturHaritasi
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 495924
    num_examples: 133118
- config_name: MasalMasal
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8876
    num_examples: 2621
- config_name: Perdearkasi-Yorumlar
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 131332
    num_examples: 36888
- config_name: PopulerBilim
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 98396
    num_examples: 23424
- config_name: Serzenisler
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  - name: title
    dtype: string
  splits:
  - name: train
    num_bytes: 23064
    num_examples: 23923
- config_name: SusluTrendler
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 32600
    num_examples: 8993
- config_name: TeknoYazilar
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 381932
    num_examples: 160680
- config_name: ViralMedya
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 521668
    num_examples: 146315
- config_name: YazarinKaleminden
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 71160
    num_examples: 74529
configs:
- config_name: GeziNotlari
  data_files:
  - split: train
    path: GeziNotlari/*jsonl
- config_name: Havadis
  data_files:
  - split: train
    path: Havadis/*jsonl
- config_name: KulturHaritasi
  data_files:
  - split: train
    path: KulturHaritasi/*jsonl
- config_name: MasalMasal
  data_files:
  - split: train
    path: MasalMasal/*jsonl
- config_name: Perdearkasi-Yorumlar
  data_files:
  - split: train
    path: Perdearkasi-Yorumlar/*jsonl
- config_name: PopulerBilim
  data_files:
  - split: train
    path: PopulerBilim/*jsonl
- config_name: Serzenisler
  data_files:
  - split: train
    path: Serzenisler/*jsonl
- config_name: SusluTrendler
  data_files:
  - split: train
    path: SusluTrendler/*jsonl
- config_name: TeknoYazilar
  data_files:
  - split: train
    path: TeknoYazilar/*jsonl
- config_name: ViralMedya
  data_files:
  - split: train
    path: ViralMedya/*jsonl
- config_name: YazarinKaleminden
  data_files:
  - split: train
    path: YazarinKaleminden/*jsonl
---

<img src="https://raw.githubusercontent.com/turkish-nlp-suite/.github/main/profile/ozenliderlemlogo.png" width="30%" height="30%">

# Dataset Card for OzenliDerlem

OzenliDerlem (a.k.a. CraftedCrawl) is a curated collection of high-quality web crawl data from handpicked websites, featuring articles, journals, and magazines. It focuses on rich, detailed text content, especially longer articles. The collection covers a wide range of topics: travel, news, culture, fairy tales and folklore, movie reviews, popular science, product and service complaints, fashion and self-care, trendy tech, pop culture, and literature. Each sub-corpus in OzenliDerlem, except for the News sub-corpus, is gathered from at least 50 high-quality websites. This variety of content helps models capture the finer details of modern Turkish culture, which makes these sub-corpora especially valuable. The breakdown of topics in the CraftedCrawl collection is shown below.

The News sub-corpus, originally called "Havadis," stands out as the first large-scale Turkish news crawl ever conducted. It includes data collected from 11 major newspaper websites: CNN Türk, Habertürk, Hürriyet, Milliyet, NTV Haber, Posta, Sabah, Sözcü, Star, T24, and Takvim.

This corpus is part of the large-scale Turkish corpus [Bella Turca](https://huggingface.co/datasets/turkish-nlp-suite/BellaTurca). For more details about Bella Turca, please refer to [the publication](https://link.springer.com/chapter/10.1007/978-3-031-70563-2_16).

| Dataset | Instances | Size | Words |
|---|---|---|---|
| GeziNotlari | 33,174 | 173MB | 21M |
| Havadis | 744,868 | 2.7GB | 322M |
| KulturHaritasi | 133,118 | 485MB | 59M |
| MasalMasal | 2,621 | 8.7MB | 1M |
| Perdearkasi-Yorumlar | 36,888 | 129MB | 16M |
| PopulerBilim | 23,424 | 97MB | 11M |
| Serzenisler | 23,923 | 23MB | 2M |
| SusluTrendler | 8,993 | 32MB | 3M |
| TeknoYazilar | 160,680 | 373MB | 46M |
| ViralMedya | 146,315 | 510MB | 63M |
| YazarinKaleminden | 74,529 | 70MB | 8M |
| **Total** | 1,391,239 | 4.6GB | 557M |

### Instances

A typical instance from the dataset looks like:

```
{
  "url": "https://www.ayagimintozuyla.net/balide-edinebileceginiz-15-essiz-deneyim.html",
  "text": "Önceki Bali yazılarımda da anlatığım gibi Bali her türlü gezgine kucak açan bir destinasyon. Keyif düşkünlerine, maceracı gezginlere, kendini arayanlara, veya sadece farklı bir şeyler görmek isteyenlere kadar herkes kendine ait bir şeyler bulacak Bali'de. Bu yazımda Bali'de edinebileceğiniz tecrübeleri listeleyeceğim. Kim bilir, belki yazının sonunda Bali için bilet bakmaya başlarsınız, belli mi olur 🙂"
}
```

## Citation

```
@InProceedings{10.1007/978-3-031-70563-2_16,
  author="Altinok, Duygu",
  editor="N{\"o}th, Elmar and Hor{\'a}k, Ale{\v{s}} and Sojka, Petr",
  title="Bella Turca: A Large-Scale Dataset of Diverse Text Sources for Turkish Language Modeling",
  booktitle="Text, Speech, and Dialogue",
  year="2024",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="196--213",
  abstract="In recent studies, it has been demonstrated that incorporating diverse training datasets enhances the overall knowledge and generalization capabilities of large-scale language models, especially in cross-domain scenarios. In line with this, we introduce Bella Turca: a comprehensive Turkish text corpus, totaling 265GB, specifically curated for training language models. Bella Turca encompasses 25 distinct subsets of 4 genre, carefully chosen to ensure diversity and high quality. While Turkish is spoken widely across three continents, it suffers from a dearth of robust data resources for language modelling. Existing transformers and language models have primarily relied on repetitive corpora such as OSCAR and/or Wiki, which lack the desired diversity. Our work aims to break free from this monotony by introducing a fresh perspective to Turkish corpora resources. To the best of our knowledge, this release marks the first instance of such a vast and diverse dataset tailored for the Turkish language. Additionally, we contribute to the community by providing the code used in the dataset's construction and cleaning, fostering collaboration and knowledge sharing.",
  isbn="978-3-031-70563-2"
}
```

## Acknowledgments

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC).
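
The card lists eleven configurations stored as JSON Lines files whose records carry `url` and `text` fields. The following is a minimal sketch of how one might validate a record and load a single sub-corpus, not an official loader. The helper names `parse_record`, `word_count`, and `load_subcorpus` are hypothetical, the sample record is invented for illustration, and `load_subcorpus` assumes the Hugging Face `datasets` library is installed and the hub is reachable.

```python
import json

# Repository ID and configuration names are taken from the dataset card.
REPO_ID = "turkish-nlp-suite/OzenliDerlem"
CONFIG_NAMES = [
    "GeziNotlari", "Havadis", "KulturHaritasi", "MasalMasal",
    "Perdearkasi-Yorumlar", "PopulerBilim", "Serzenisler", "SusluTrendler",
    "TeknoYazilar", "ViralMedya", "YazarinKaleminden",
]


def parse_record(line: str) -> dict:
    """Parse one JSON Lines record and check the fields the card declares."""
    record = json.loads(line)
    missing = {"url", "text"} - record.keys()
    if missing:
        raise ValueError(f"record is missing fields: {sorted(missing)}")
    return record


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (a rough proxy for words)."""
    return len(text.split())


def load_subcorpus(config_name: str, streaming: bool = True):
    """Load the train split of one sub-corpus (hypothetical helper).

    Assumes the `datasets` package is installed; it is imported lazily so
    the rest of this sketch runs without it.
    """
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown config: {config_name!r}")
    from datasets import load_dataset
    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)


if __name__ == "__main__":
    # Invented sample record, shaped like the instance shown in the card.
    sample_line = json.dumps(
        {"url": "https://example.com/a", "text": "Bu kısa bir örnek metindir."},
        ensure_ascii=False,
    )
    record = parse_record(sample_line)
    print(record["url"], word_count(record["text"]))
```

Streaming is used as the default here because the Havadis configuration alone is listed at 2.7GB; passing `streaming=False` would download the full split instead.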
Provider:
turkish-nlp-suite