turkish-nlp-suite/ForumSohbetleri
Source: Hugging Face. Updated: 2025-11-10. Indexed: 2026-01-03.
Download link:
https://hf-mirror.com/datasets/turkish-nlp-suite/ForumSohbetleri
Resource description:
---
annotations_creators:
- Duygu Altinok
language:
- tr
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- fill-mask
- text-generation
pretty_name: ForumSohbetleri
config_names:
- donanimarsivi
- donanimhaber
- forumum
- iyinet
- kadinlarklubu
- memurlar
- tahribat
- technopatsosyal
- turkiyeforum
- wardom
- wmaraci
tags:
- forum
dataset_info:
- config_name: donanimarsivi
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 37940623
    num_examples: 17510
- config_name: donanimhaber
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 493901019
    num_examples: 162525
- config_name: forumum
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 146553215
    num_examples: 57219
- config_name: iyinet
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 154383316
    num_examples: 93531
- config_name: kadinlarklubu
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 5887469877
    num_examples: 743613
- config_name: memurlar
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 4257849366
    num_examples: 708198
- config_name: tahribat
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 955505292
    num_examples: 173680
- config_name: technopatsosyal
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 1435293286
    num_examples: 688237
- config_name: turkiyeforum
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 57930003
    num_examples: 17716
- config_name: wardom
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 754867605
    num_examples: 243150
- config_name: wmaraci
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 33224454
    num_examples: 20596
configs:
- config_name: donanimarsivi
  data_files:
  - split: train
    path: donanimarsivi/train-*
- config_name: donanimhaber
  data_files:
  - split: train
    path: donanimhaber/train-*
- config_name: forumum
  data_files:
  - split: train
    path: forumum/train-*
- config_name: iyinet
  data_files:
  - split: train
    path: iyinet/train-*
- config_name: kadinlarklubu
  data_files:
  - split: train
    path: kadinlarklubu/train-*
- config_name: memurlar
  data_files:
  - split: train
    path: memurlar/train-*
- config_name: tahribat
  data_files:
  - split: train
    path: tahribat/train-*
- config_name: technopatsosyal
  data_files:
  - split: train
    path: technopatsosyal/train-*
- config_name: turkiyeforum
  data_files:
  - split: train
    path: turkiyeforum/train-*
- config_name: wardom
  data_files:
  - split: train
    path: wardom/train-*
- config_name: wmaraci
  data_files:
  - split: train
    path: wmaraci/train-*
---
<img src="https://raw.githubusercontent.com/turkish-nlp-suite/.github/main/profile/forumsohbetleri.png" width="30%" height="30%">
# Dataset Card for ForumSohbetleri
ForumSohbetleri is a web forum text corpus for Turkish and, indeed, the first large-scale Turkish forum text corpus.
This corpus is part of the large-scale Turkish corpus [Bella Turca](https://huggingface.co/datasets/turkish-nlp-suite/BellaTurca). For more details about Bella Turca, please refer to [the publication](https://link.springer.com/chapter/10.1007/978-3-031-70563-2_16).
The collection consists of 11 subsets, each gathered from the corresponding forum website. The forums cover diverse topics: women-only communities, technology, economics, daily life, relationships, and much more.
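Since each subset is published as its own configuration, a single forum can be loaded without downloading the rest. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed; `load_subset` is a hypothetical helper written for illustration, not part of the dataset, and the configuration names are copied from the card metadata.

```python
from typing import Any

REPO_ID = "turkish-nlp-suite/ForumSohbetleri"

# Configuration names as listed under `config_names` in the card metadata.
CONFIG_NAMES = (
    "donanimarsivi", "donanimhaber", "forumum", "iyinet", "kadinlarklubu",
    "memurlar", "tahribat", "technopatsosyal", "turkiyeforum", "wardom",
    "wmaraci",
)


def load_subset(name: str, streaming: bool = True) -> Any:
    """Load the train split of one forum subset (hypothetical helper)."""
    if name not in CONFIG_NAMES:
        raise ValueError(f"unknown subset {name!r}; expected one of {CONFIG_NAMES}")
    # Imported lazily so the name check works without the library installed.
    from datasets import load_dataset

    # streaming=True iterates over the data without downloading it in full,
    # which helps with the multi-gigabyte subsets.
    return load_dataset(REPO_ID, name, split="train", streaming=streaming)


# Example (needs network access and the `datasets` package):
#   threads = load_subset("donanimarsivi")
#   first = next(iter(threads))
#   print(first["url"], len(first["texts"]))
```

Streaming is only a default chosen for this sketch; passing `streaming=False` downloads and caches the subset instead.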
| Dataset | Threads | Size | Words |
|---|---|---|---|
| donanimarsivi | 17,510 | 37MB | 5.2M |
| donanimhaber | 162,525 | 472MB | 61.5M |
| forumum | 57,219 | 140MB | 17.8M |
| iyinet | 93,531 | 148MB | 18.5M |
| kadinlarklubu | 743,613 | 5.5GB | 773M |
| memurlar | 708,198 | 4GB | 511M |
| tahribat | 173,680 | 912MB | 120M |
| technopatsosyal | 688,237 | 1.4GB | 177M |
| turkiyeforum | 17,716 | 56MB | 7.1M |
| wardom | 243,150 | 720MB | 91M |
| wmaraci | 20,596 | 32MB | 3.8M |
| **Total** | 2,925,975 | 13.41GB | 1.7B |
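The thread total can be cross-checked against the `num_examples` values in the card metadata. The short sketch below does that arithmetic; the dictionary names are chosen for illustration, and the values are copied from the metadata rather than read from the dataset itself.

```python
# Thread counts per subset, copied from `num_examples` in the card metadata.
NUM_THREADS = {
    "donanimarsivi": 17_510,
    "donanimhaber": 162_525,
    "forumum": 57_219,
    "iyinet": 93_531,
    "kadinlarklubu": 743_613,
    "memurlar": 708_198,
    "tahribat": 173_680,
    "technopatsosyal": 688_237,
    "turkiyeforum": 17_716,
    "wardom": 243_150,
    "wmaraci": 20_596,
}

# Uncompressed sizes per subset, copied from `num_bytes` in the card metadata.
NUM_BYTES = {
    "donanimarsivi": 37_940_623,
    "donanimhaber": 493_901_019,
    "forumum": 146_553_215,
    "iyinet": 154_383_316,
    "kadinlarklubu": 5_887_469_877,
    "memurlar": 4_257_849_366,
    "tahribat": 955_505_292,
    "technopatsosyal": 1_435_293_286,
    "turkiyeforum": 57_930_003,
    "wardom": 754_867_605,
    "wmaraci": 33_224_454,
}

total_threads = sum(NUM_THREADS.values())
total_bytes = sum(NUM_BYTES.values())

print(f"{total_threads:,} threads")  # 2,925,975 threads
print(f"{total_bytes / 2**30:.2f} GiB")

# Share of all threads contributed by each subset, largest first.
for name, count in sorted(NUM_THREADS.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {count / total_threads:6.1%}")
```

The byte counts in the metadata are exact, while the sizes in the table are rounded, so the two totals need not agree to the last digit.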
During the crawl, we processed each thread as its own instance. We performed extensive text cleaning to cope with the highly variable orthography of forum text.
### Instances
Each instance represents one thread and therefore contains a list of strings: the posts in that thread.
A typical instance from the dataset looks like:
```
{
  "url": "https://forum.donanimarsivi.com/konu/modeme-baglananlari-nasil-cikarabilirm.790705/",
  "texts": [
    "Nasıl değiştirilir bilmiyorum",
    "Komşularımın bazılarında internet sifremiz var ve sürekli baglaniyolar oyunlarda felan MS cıkıyo sürekli nasıl engelliyebilirim Mesaj otomatik birleştirildi: 10 Ağustos 2023 TTNet Tplink Messinin",
    "Sistemim: İntel Core İ5 11400f - Asus PRIME H510M-D - CORSAIR 16GB Vengeance RAM 2X8 - Kioxia 500 GB Exceria M.2 - Asus TUF-GTX1660TI-O6G-EVO-GAMING 192 Bit GDDR6 6 GB - Corsair 650 W Carbide Spec-05 Led Panel ATX Oyuncu Kasası - Asus TUF Gaming VG249Q1R 23.8 165HZ 1MS",
    "arcai netcut kullanabilirsin baya iyi E",
    "Şifreni değiştirsene aga İNTEL İ3 12100F / SAPPHIRE PULSE RX6700 / GIGABYTE H610M / GEIL 2X8 GB RAM 3200MHZ / MLD M300 500GB M.2 SSD / ASUS TUF VG247Q1A / ASUS X571GT GTX 1050 İ5 9300H ilkaycam. m 80+"
  ]
}
```
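Because an instance is a dictionary with a `url` string and a `texts` list, a thread can be flattened into one training document or measured with a few lines of Python. The helpers below (`iter_posts`, `thread_to_document`, `count_words`) are hypothetical names introduced for illustration, and whitespace splitting is only an assumed approximation: the card does not state how its word counts were computed.

```python
from typing import Any, Dict, Iterator


def iter_posts(thread: Dict[str, Any]) -> Iterator[str]:
    """Yield the non-empty posts of a thread, stripped of outer whitespace."""
    for post in thread["texts"]:
        post = post.strip()
        if post:
            yield post


def thread_to_document(thread: Dict[str, Any], sep: str = "\n") -> str:
    """Join all posts of a thread into a single text, one post per line."""
    return sep.join(iter_posts(thread))


def count_words(thread: Dict[str, Any]) -> int:
    """Count whitespace-separated tokens across all posts (an approximation)."""
    return sum(len(post.split()) for post in iter_posts(thread))


# A shortened instance with the same shape as the example above.
sample = {
    "url": "https://forum.donanimarsivi.com/konu/modeme-baglananlari-nasil-cikarabilirm.790705/",
    "texts": [
        "Nasıl değiştirilir bilmiyorum",
        "arcai netcut kullanabilirsin baya iyi E",
    ],
}

print(count_words(sample))  # 9
print(thread_to_document(sample))
```

Joining posts with a newline keeps the post boundaries recoverable; a different separator may suit a particular tokenizer better.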
## Citation
```
@InProceedings{10.1007/978-3-031-70563-2_16,
author="Altinok, Duygu",
editor="N{\"o}th, Elmar
and Hor{\'a}k, Ale{\v{s}}
and Sojka, Petr",
title="Bella Turca: A Large-Scale Dataset of Diverse Text Sources for Turkish Language Modeling",
booktitle="Text, Speech, and Dialogue",
year="2024",
publisher="Springer Nature Switzerland",
address="Cham",
pages="196--213",
abstract="In recent studies, it has been demonstrated that incorporating diverse training datasets enhances the overall knowledge and generalization capabilities of large-scale language models, especially in cross-domain scenarios. In line with this, we introduce Bella Turca: a comprehensive Turkish text corpus, totaling 265GB, specifically curated for training language models. Bella Turca encompasses 25 distinct subsets of 4 genre, carefully chosen to ensure diversity and high quality. While Turkish is spoken widely across three continents, it suffers from a dearth of robust data resources for language modelling. Existing transformers and language models have primarily relied on repetitive corpora such as OSCAR and/or Wiki, which lack the desired diversity. Our work aims to break free from this monotony by introducing a fresh perspective to Turkish corpora resources. To the best of our knowledge, this release marks the first instance of such a vast and diverse dataset tailored for the Turkish language. Additionally, we contribute to the community by providing the code used in the dataset's construction and cleaning, fostering collaboration and knowledge sharing.",
isbn="978-3-031-70563-2"
}
```
## Acknowledgments
This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC).
Provider: turkish-nlp-suite



