turkish-nlp-suite/ForumSohbetleri

Source: Hugging Face. Updated 2025-11-10, indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/turkish-nlp-suite/ForumSohbetleri
Resource overview:
---
annotations_creators:
- Duygu Altinok
language:
- tr
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- fill-mask
- text-generation
pretty_name: ForumSohbetleri
config_names:
- donanimarsivi
- donanimhaber
- forumum
- iyinet
- kadinlarklubu
- memurlar
- tahribat
- technopatsosyal
- turkiyeforum
- wardom
- wmaraci
tags:
- forum
dataset_info:
- config_name: donanimarsivi
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 37940623
    num_examples: 17510
- config_name: donanimhaber
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 493901019
    num_examples: 162525
- config_name: forumum
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 146553215
    num_examples: 57219
- config_name: iyinet
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 154383316
    num_examples: 93531
- config_name: kadinlarklubu
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 5887469877
    num_examples: 743613
- config_name: memurlar
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 4257849366
    num_examples: 708198
- config_name: tahribat
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 955505292
    num_examples: 173680
- config_name: technopatsosyal
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 1435293286
    num_examples: 688237
- config_name: turkiyeforum
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 57930003
    num_examples: 17716
- config_name: wardom
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 754867605
    num_examples: 243150
- config_name: wmaraci
  features:
  - name: url
    dtype: string
  - name: texts
    list:
      dtype: string
  splits:
  - name: train
    num_bytes: 33224454
    num_examples: 20596
configs:
- config_name: donanimarsivi
  data_files:
  - split: train
    path: donanimarsivi/train-*
- config_name: donanimhaber
  data_files:
  - split: train
    path: donanimhaber/train-*
- config_name: forumum
  data_files:
  - split: train
    path: forumum/train-*
- config_name: iyinet
  data_files:
  - split: train
    path: iyinet/train-*
- config_name: kadinlarklubu
  data_files:
  - split: train
    path: kadinlarklubu/train-*
- config_name: memurlar
  data_files:
  - split: train
    path: memurlar/train-*
- config_name: tahribat
  data_files:
  - split: train
    path: tahribat/train-*
- config_name: technopatsosyal
  data_files:
  - split: train
    path: technopatsosyal/train-*
- config_name: turkiyeforum
  data_files:
  - split: train
    path: turkiyeforum/train-*
- config_name: wardom
  data_files:
  - split: train
    path: wardom/train-*
- config_name: wmaraci
  data_files:
  - split: train
    path: wmaraci/train-*
---

<img src="https://raw.githubusercontent.com/turkish-nlp-suite/.github/main/profile/forumsohbetleri.png" width="30%" height="30%">

# Dataset Card for ForumSohbetleri

ForumSohbetleri is a web forum text corpus for Turkish, and indeed the first large-scale Turkish forum text corpus. This corpus is part of the large-scale Turkish corpus [Bella Turca](https://huggingface.co/datasets/turkish-nlp-suite/BellaTurca). For more details about Bella Turca, please refer to [the publication](https://link.springer.com/chapter/10.1007/978-3-031-70563-2_16).

This collection is made up of several subsets; each subset is gathered from the corresponding forum website. The forum websites cover diverse topics: ladies-only forums, tech, economics, life, relationships and much more.
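Since each subset is exposed as its own config, one subset can be read without touching the others. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the config names match those in the card metadata; the helper names `check_config` and `stream_subset` are hypothetical and not part of the dataset or the library.

```python
REPO_ID = "turkish-nlp-suite/ForumSohbetleri"

# Config names copied from the card metadata (one per forum website).
CONFIG_NAMES = (
    "donanimarsivi", "donanimhaber", "forumum", "iyinet", "kadinlarklubu",
    "memurlar", "tahribat", "technopatsosyal", "turkiyeforum", "wardom",
    "wmaraci",
)


def check_config(name: str) -> str:
    """Return `name` unchanged if it is a known subset, else raise ValueError."""
    if name not in CONFIG_NAMES:
        raise ValueError(f"unknown subset {name!r}; expected one of {CONFIG_NAMES}")
    return name


def stream_subset(name: str):
    """Stream the train split of one subset instead of downloading it in full."""
    # Imported lazily so check_config works even without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, check_config(name), split="train", streaming=True)
```

Calling `stream_subset("donanimarsivi")` needs network access; iterating over the result should yield one dictionary per thread with the `url` and `texts` fields. Streaming is used here because the larger subsets, such as `kadinlarklubu`, run to several gigabytes.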
| Dataset | Num. threads | Size | Num. of words |
|---|---|---|---|
| donanimarsivi | 17,510 | 37MB | 5.2M |
| donanimhaber | 162,525 | 472MB | 61.5M |
| forumum | 57,219 | 140MB | 17.8M |
| iyinet | 93,531 | 148MB | 18.5M |
| kadinlarklubu | 743,613 | 5.5GB | 773M |
| memurlar.net | 708,198 | 4GB | 511M |
| tahribat | 173,680 | 912MB | 120M |
| technopatsosyal | 688,237 | 1.4GB | 177M |
| turkiyeforum | 17,716 | 56MB | 7.1M |
| wardom | 243,150 | 720MB | 91M |
| wmaraci | 20,596 | 32MB | 3.8M |
| **Total** | 2,925,975 | 13.41GB | 1.7B |

During the crawl, we processed each thread as its own unit. We performed extensive text cleaning to cope with the highly variable orthography of forum text.

### Instances

Each instance represents a thread and therefore contains a list of strings, one for each post in the thread. A typical instance from the dataset looks like:

```
{
  "url": "https://forum.donanimarsivi.com/konu/modeme-baglananlari-nasil-cikarabilirm.790705/",
  "texts": [
    "Nasıl değiştirilir bilmiyorum",
    "Komşularımın bazılarında internet sifremiz var ve sürekli baglaniyolar oyunlarda felan MS cıkıyo sürekli nasıl engelliyebilirim Mesaj otomatik birleştirildi: 10 Ağustos 2023 TTNet Tplink Messinin",
    "Sistemim: İntel Core İ5 11400f - Asus PRIME H510M-D - CORSAIR 16GB Vengeance RAM 2X8 - Kioxia 500 GB Exceria M.2 - Asus TUF-GTX1660TI-O6G-EVO-GAMING 192 Bit GDDR6 6 GB - Corsair 650 W Carbide Spec-05 Led Panel ATX Oyuncu Kasası - Asus TUF Gaming VG249Q1R 23.8 165HZ 1MS",
    "arcai netcut kullanabilirsin baya iyi E",
    "Şifreni değiştirsene aga İNTEL İ3 12100F / SAPPHIRE PULSE RX6700 / GIGABYTE H610M / GEIL 2X8 GB RAM 3200MHZ / MLD M300 500GB M.2 SSD / ASUS TUF VG247Q1A / ASUS X571GT GTX 1050 İ5 9300H ilkaycam. m 80+"
  ]
}
```

## Citation

```
@InProceedings{10.1007/978-3-031-70563-2_16,
  author="Altinok, Duygu",
  editor="N{\"o}th, Elmar and Hor{\'a}k, Ale{\v{s}} and Sojka, Petr",
  title="Bella Turca: A Large-Scale Dataset of Diverse Text Sources for Turkish Language Modeling",
  booktitle="Text, Speech, and Dialogue",
  year="2024",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="196--213",
  abstract="In recent studies, it has been demonstrated that incorporating diverse training datasets enhances the overall knowledge and generalization capabilities of large-scale language models, especially in cross-domain scenarios. In line with this, we introduce Bella Turca: a comprehensive Turkish text corpus, totaling 265GB, specifically curated for training language models. Bella Turca encompasses 25 distinct subsets of 4 genre, carefully chosen to ensure diversity and high quality. While Turkish is spoken widely across three continents, it suffers from a dearth of robust data resources for language modelling. Existing transformers and language models have primarily relied on repetitive corpora such as OSCAR and/or Wiki, which lack the desired diversity. Our work aims to break free from this monotony by introducing a fresh perspective to Turkish corpora resources. To the best of our knowledge, this release marks the first instance of such a vast and diverse dataset tailored for the Turkish language. Additionally, we contribute to the community by providing the code used in the dataset's construction and cleaning, fostering collaboration and knowledge sharing.",
  isbn="978-3-031-70563-2"
}
```

## Acknowledgments

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC).
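As a usage note on the instance layout described under Instances (a `url` string plus a `texts` list with one string per post), the sketch below shows one way to flatten a thread into a single training string and to get a rough word count. The helper names are hypothetical, and whitespace splitting is an assumption: the card does not say how the word counts in the table were computed.

```python
from typing import List, TypedDict


class Thread(TypedDict):
    """Shape of one instance: a thread URL and its posts in order."""
    url: str
    texts: List[str]


def thread_to_text(thread: Thread, sep: str = "\n\n") -> str:
    """Join the non-empty posts of a thread into one string."""
    return sep.join(post.strip() for post in thread["texts"] if post.strip())


def count_words(thread: Thread) -> int:
    """Count whitespace-separated tokens across all posts (an assumed metric)."""
    return sum(len(post.split()) for post in thread["texts"])


# Shortened version of the example instance shown in the card.
sample: Thread = {
    "url": "https://forum.donanimarsivi.com/konu/modeme-baglananlari-nasil-cikarabilirm.790705/",
    "texts": [
        "Nasıl değiştirilir bilmiyorum",
        "arcai netcut kullanabilirsin baya iyi E",
    ],
}

print(count_words(sample))  # 9
print(thread_to_text(sample, sep=" | "))
```

Keeping posts separated by a visible delimiter preserves the turn structure of the thread, which may matter for the text-generation task listed in the metadata.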
Provider: turkish-nlp-suite