turkish-nlp-suite/OzenliDerlem
Source: Hugging Face. Updated 2026-03-07; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/turkish-nlp-suite/OzenliDerlem
Resource description:
---
annotations_creators:
- Duygu Altinok
language:
- tr
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
pretty_name: OzenliDerlem
config_names:
- GeziNotlari
- Havadis
- KulturHaritasi
- MasalMasal
- Perdearkasi-Yorumlar
- PopulerBilim
- Serzenisler
- SusluTrendler
- TeknoYazilar
- ViralMedya
- YazarinKaleminden
dataset_info:
- config_name: GeziNotlari
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 176580
    num_examples: 33174
- config_name: Havadis
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2807540
    num_examples: 744868
- config_name: KulturHaritasi
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 495924
    num_examples: 133118
- config_name: MasalMasal
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8876
    num_examples: 2621
- config_name: Perdearkasi-Yorumlar
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 131332
    num_examples: 36888
- config_name: PopulerBilim
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 98396
    num_examples: 23424
- config_name: Serzenisler
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  - name: title
    dtype: string
  splits:
  - name: train
    num_bytes: 23064
    num_examples: 23923
- config_name: SusluTrendler
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 32600
    num_examples: 8993
- config_name: TeknoYazilar
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 381932
    num_examples: 160680
- config_name: ViralMedya
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 521668
    num_examples: 146315
- config_name: YazarinKaleminden
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 71160
    num_examples: 74529
configs:
- config_name: GeziNotlari
  data_files:
  - split: train
    path: GeziNotlari/*jsonl
- config_name: Havadis
  data_files:
  - split: train
    path: Havadis/*jsonl
- config_name: KulturHaritasi
  data_files:
  - split: train
    path: KulturHaritasi/*jsonl
- config_name: MasalMasal
  data_files:
  - split: train
    path: MasalMasal/*jsonl
- config_name: Perdearkasi-Yorumlar
  data_files:
  - split: train
    path: Perdearkasi-Yorumlar/*jsonl
- config_name: PopulerBilim
  data_files:
  - split: train
    path: PopulerBilim/*jsonl
- config_name: Serzenisler
  data_files:
  - split: train
    path: Serzenisler/*jsonl
- config_name: SusluTrendler
  data_files:
  - split: train
    path: SusluTrendler/*jsonl
- config_name: TeknoYazilar
  data_files:
  - split: train
    path: TeknoYazilar/*jsonl
- config_name: ViralMedya
  data_files:
  - split: train
    path: ViralMedya/*jsonl
- config_name: YazarinKaleminden
  data_files:
  - split: train
    path: YazarinKaleminden/*jsonl
---
<img src="https://raw.githubusercontent.com/turkish-nlp-suite/.github/main/profile/ozenliderlemlogo.png" width="30%" height="30%">
# Dataset Card for OzenliDerlem
OzenliDerlem (a.k.a. CraftedCrawl) is a curated collection of high-quality web crawl data from handpicked websites, featuring articles, journals, and magazines. It focuses on rich, detailed text, especially longer articles. The collection covers a wide range of topics: travel, news, culture, fairy tales and folklore, movie reviews, popular science, product and service complaints, fashion and self-care, trendy tech, pop culture, and literature.
Each sub-corpus in OzenliDerlem, except for the News sub-corpus, is gathered from at least 50 high-quality websites. This variety of sources helps language models capture the finer details of modern Turkish culture. The breakdown of topics in the CraftedCrawl collection is shown below.
The News sub-corpus, named "Havadis", stands out as the first large-scale Turkish news crawl ever conducted. It includes data collected from 11 major newspaper websites: CNN Türk, Habertürk, Hürriyet, Milliyet, NTV Haber, Posta, Sabah, Sözcü, Star, T24, and Takvim.
This corpus is part of the large-scale Turkish corpus [Bella Turca](https://huggingface.co/datasets/turkish-nlp-suite/BellaTurca). For more details about Bella Turca, please refer to [the publication](https://link.springer.com/chapter/10.1007/978-3-031-70563-2_16).
| Sub-corpus | Instances | Size | Words |
|---|---|---|---|
| GeziNotlari | 33,174 | 173 MB | 21M |
| Havadis | 744,868 | 2.7 GB | 322M |
| KulturHaritasi | 133,118 | 485 MB | 59M |
| MasalMasal | 2,621 | 8.7 MB | 1M |
| Perdearkasi-Yorumlar | 36,888 | 129 MB | 16M |
| PopulerBilim | 23,424 | 97 MB | 11M |
| Serzenisler | 23,923 | 23 MB | 2M |
| SusluTrendler | 8,993 | 32 MB | 3M |
| TeknoYazilar | 160,680 | 373 MB | 46M |
| ViralMedya | 146,315 | 510 MB | 63M |
| YazarinKaleminden | 74,529 | 70 MB | 8M |
| **Total** | 1,391,239 | 4.6 GB | 557M |
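Since the card emphasizes longer articles, a rough average document length per sub-corpus can be derived from the table above. The sketch below copies the instance and word counts from the table; because the word counts are rounded to the nearest million, the results are approximations only. The names `STATS`, `avg_words`, and `ranked` are illustrative and not part of the dataset.

```python
from __future__ import annotations

# (instances, words in millions), copied from the table above.
# Config names follow the card's metadata (e.g. "PopulerBilim").
STATS: dict[str, tuple[int, int]] = {
    "GeziNotlari": (33_174, 21),
    "Havadis": (744_868, 322),
    "KulturHaritasi": (133_118, 59),
    "MasalMasal": (2_621, 1),
    "Perdearkasi-Yorumlar": (36_888, 16),
    "PopulerBilim": (23_424, 11),
    "Serzenisler": (23_923, 2),
    "SusluTrendler": (8_993, 3),
    "TeknoYazilar": (160_680, 46),
    "ViralMedya": (146_315, 63),
    "YazarinKaleminden": (74_529, 8),
}


def avg_words(name: str) -> float:
    """Approximate words per instance for one sub-corpus."""
    instances, words_millions = STATS[name]
    return words_millions * 1_000_000 / instances


def ranked() -> list[tuple[str, float]]:
    """Sub-corpora sorted from longest to shortest average document."""
    pairs = ((name, avg_words(name)) for name in STATS)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for name, words in ranked():
    print(f"{name:22s} ~{words:,.0f} words per instance")
```

Under these figures, the travel sub-corpus has the longest documents on average and the complaints sub-corpus the shortest, which is consistent with the genres involved.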
### Instances
A typical instance from the dataset looks like:
```
{
"url": "https://www.ayagimintozuyla.net/balide-edinebileceginiz-15-essiz-deneyim.html",
"text": "Önceki Bali yazılarımda da anlatığım gibi Bali her türlü gezgine kucak açan bir destinasyon. Keyif düşkünlerine, maceracı gezginlere, kendini arayanlara, veya sadece farklı bir şeyler görmek isteyenlere kadar herkes kendine ait bir şeyler bulacak Bali'de. Bu yazımda Bali'de edinebileceğiniz tecrübeleri listeleyeceğim. Kim bilir, belki yazının sonunda Bali için bilet bakmaya başlarsınız, belli mi olur 🙂"
}
```
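A minimal sketch of reading such instances in Python follows. It assumes the Hugging Face `datasets` library is installed and that the repository ID and config names match the card's metadata; `parse_line` and `stream_config` are hypothetical helper names introduced here for illustration. Per the metadata, only the `Serzenisler` config lists a `title` field, so the parser treats it as optional.

```python
import json
from typing import Any, Dict, Iterator, Optional


def parse_line(line: str) -> Dict[str, Optional[str]]:
    """Parse one JSONL line into the fields listed in the card.

    `title` is None when absent; the metadata lists it only for Serzenisler.
    """
    obj = json.loads(line)
    return {"url": obj["url"], "text": obj["text"], "title": obj.get("title")}


def stream_config(config_name: str = "GeziNotlari") -> Iterator[Dict[str, Any]]:
    """Stream one sub-corpus from the Hub without downloading it in full."""
    # Third-party dependency (pip install datasets); imported lazily so that
    # parse_line stays usable without it.
    from datasets import load_dataset

    dataset = load_dataset(
        "turkish-nlp-suite/OzenliDerlem",
        config_name,
        split="train",
        streaming=True,
    )
    return iter(dataset)
```

For example, `next(stream_config("Havadis"))` should return a dict with `url` and `text` keys, while `parse_line` can be applied to lines of a locally downloaded `.jsonl` file.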
## Citation
```
@InProceedings{10.1007/978-3-031-70563-2_16,
author="Altinok, Duygu",
editor="N{\"o}th, Elmar
and Hor{\'a}k, Ale{\v{s}}
and Sojka, Petr",
title="Bella Turca: A Large-Scale Dataset of Diverse Text Sources for Turkish Language Modeling",
booktitle="Text, Speech, and Dialogue",
year="2024",
publisher="Springer Nature Switzerland",
address="Cham",
pages="196--213",
abstract="In recent studies, it has been demonstrated that incorporating diverse training datasets enhances the overall knowledge and generalization capabilities of large-scale language models, especially in cross-domain scenarios. In line with this, we introduce Bella Turca: a comprehensive Turkish text corpus, totaling 265GB, specifically curated for training language models. Bella Turca encompasses 25 distinct subsets of 4 genre, carefully chosen to ensure diversity and high quality. While Turkish is spoken widely across three continents, it suffers from a dearth of robust data resources for language modelling. Existing transformers and language models have primarily relied on repetitive corpora such as OSCAR and/or Wiki, which lack the desired diversity. Our work aims to break free from this monotony by introducing a fresh perspective to Turkish corpora resources. To the best of our knowledge, this release marks the first instance of such a vast and diverse dataset tailored for the Turkish language. Additionally, we contribute to the community by providing the code used in the dataset's construction and cleaning, fostering collaboration and knowledge sharing.",
isbn="978-3-031-70563-2"
}
```
## Acknowledgments
This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC).
Provided by:
turkish-nlp-suite



