Ba2han/forumsoh_tokenized
Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Ba2han/forumsoh_tokenized
Resource description:
---
license: mit
language:
- tr
---
# ForumSohbetleri Tokenized (All Subsets)
This dataset is a pre-tokenized, shuffled, and interleaved version of all subsets from `turkish-nlp-suite/ForumSohbetleri`.
## Processing Details
- **Tokenizer:** [Ba2han/qwen-test-3](https://huggingface.co/Ba2han/qwen-test-3)
- **Filtering:** Min tokens = 50, Max tokens = 2550
- **Format:** `input_ids` only, stored as `uint32`.
- **Subsets Included:** donanimarsivi, donanimhaber, forumum, iyinet, kadinlarklubu, memurlar, tahribat, technopatsosyal, turkiyeforum, wardom, wmaraci
- **Shuffling:** Subsets are stream-interleaved with a rolling buffer of 25,000.
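
The filtering and shuffling steps listed above can be sketched in plain Python. This is only an illustration, not the script used to build the dataset: the function names (`keep`, `interleave`, `buffer_shuffle`, `process`) are hypothetical, the round-robin interleaving order is an assumption, and the card does not say whether the token bounds are inclusive.

```python
import random
from typing import Iterable, Iterator, List, Sequence

MIN_TOKENS = 50        # from the card
MAX_TOKENS = 2550      # from the card
BUFFER_SIZE = 25_000   # from the card


def keep(input_ids: Sequence[int]) -> bool:
    # Assumption: both bounds are inclusive; the card does not specify this.
    return MIN_TOKENS <= len(input_ids) <= MAX_TOKENS


def interleave(streams: List[Iterable]) -> Iterator:
    # Assumption: simple round-robin over the subsets, dropping each
    # stream once it is exhausted.
    iterators = [iter(s) for s in streams]
    while iterators:
        alive = []
        for it in iterators:
            try:
                item = next(it)
            except StopIteration:
                continue
            alive.append(it)
            yield item
        iterators = alive


def buffer_shuffle(items: Iterable, buffer_size: int, rng: random.Random) -> Iterator:
    # Rolling-buffer shuffle: once the buffer is full, every new item
    # replaces a randomly chosen buffered item, which is emitted.
    buffer: list = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Flush whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def process(streams: List[Iterable], buffer_size: int = BUFFER_SIZE, seed: int = 0) -> Iterator:
    rng = random.Random(seed)
    filtered = (ids for ids in interleave(streams) if keep(ids))
    return buffer_shuffle(filtered, buffer_size, rng)
```

A rolling buffer only mixes rows that arrive within roughly one buffer length of each other, so the result is a local shuffle rather than a uniform permutation of the whole corpus.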
## Shard Statistics
| Shard | Rows | Total Tokens |
| :--- | :--- | :--- |
| train-0000 | 100,000 | 49,234,877 |
| train-0001 | 100,000 | 52,860,163 |
| train-0002 | 100,000 | 53,352,906 |
| train-0003 | 100,000 | 53,745,785 |
| train-0004 | 100,000 | 51,924,714 |
| train-0005 | 100,000 | 51,589,048 |
| train-0006 | 100,000 | 49,739,042 |
| train-0007 | 100,000 | 49,989,134 |
| train-0008 | 100,000 | 52,451,698 |
| train-0009 | 100,000 | 56,898,749 |
| train-0010 | 100,000 | 55,045,329 |
| train-0011 | 100,000 | 53,753,225 |
| train-0012 | 100,000 | 55,241,320 |
| train-0013 | 100,000 | 56,308,784 |
| train-0014 | 100,000 | 57,167,177 |
| train-0015 | 100,000 | 56,972,394 |
| train-0016 | 100,000 | 57,032,441 |
| train-0017 | 100,000 | 51,519,178 |
| train-0018 | 100,000 | 49,883,858 |
| train-0019 | 100,000 | 52,180,135 |
| train-0020 | 100,000 | 51,842,122 |
| train-0021 | 100,000 | 56,900,567 |
| train-0022 | 100,000 | 57,900,837 |
| train-0023 | 100,000 | 52,487,791 |
| train-0024 | 100,000 | 48,573,305 |
| train-0025 | 100,000 | 54,132,431 |
| train-0026 | 100,000 | 55,861,653 |
| train-0027 | 100,000 | 57,820,030 |
| train-0028 | 100,000 | 60,321,048 |
| train-0029 | 100,000 | 57,847,321 |
| train-0030 | 100,000 | 60,644,891 |
| train-0031 | 100,000 | 61,618,690 |
| train-0032 | 100,000 | 61,753,040 |
| train-0033 | 100,000 | 61,198,661 |
| train-0034 | 100,000 | 56,744,354 |
| train-0035 | 100,000 | 52,657,863 |
| train-0036 | 100,000 | 54,058,794 |
| train-0037 | 100,000 | 55,030,351 |
| train-0038 | 100,000 | 56,585,888 |
| train-0039 | 100,000 | 58,049,327 |
| train-0040 | 100,000 | 54,225,445 |
| train-0041 | 100,000 | 53,243,083 |
| train-0042 | 100,000 | 59,511,857 |
| train-0043 | 100,000 | 58,480,026 |
| train-0044 | 100,000 | 59,870,546 |
| train-0045 | 100,000 | 63,920,568 |
| train-0046 | 100,000 | 64,561,779 |
| train-0047 | 100,000 | 64,320,534 |
| train-0048 | 100,000 | 59,967,215 |
| train-0049 | 100,000 | 60,404,954 |
| train-0050 | 100,000 | 59,179,571 |
| train-0051 | 100,000 | 53,778,457 |
| train-0052 | 100,000 | 49,863,179 |
| train-0053 | 100,000 | 50,546,771 |
| train-0054 | 100,000 | 51,219,878 |
| train-0055 | 100,000 | 54,157,184 |
| train-0056 | 100,000 | 52,826,197 |
| train-0057 | 100,000 | 52,509,828 |
| train-0058 | 100,000 | 52,736,252 |
| train-0059 | 100,000 | 56,743,788 |
| train-0060 | 100,000 | 56,550,124 |
| train-0061 | 100,000 | 55,096,782 |
| train-0062 | 100,000 | 57,349,515 |
| train-0063 | 100,000 | 55,283,788 |
| train-0064 | 100,000 | 56,032,808 |
| train-0065 | 100,000 | 57,140,243 |
| train-0066 | 100,000 | 55,480,601 |
| train-0067 | 100,000 | 53,978,621 |
| train-0068 | 100,000 | 55,134,792 |
| train-0069 | 100,000 | 55,222,786 |
| train-0070 | 100,000 | 52,093,549 |
| train-0071 | 100,000 | 53,672,315 |
| train-0072 | 100,000 | 56,840,248 |
| train-0073 | 100,000 | 51,728,899 |
| train-0074 | 100,000 | 52,414,507 |
| train-0075 | 100,000 | 55,077,763 |
| train-0076 | 39,568 | 22,410,029 |
| **Total** | **7,639,568** | **4,232,493,403** |
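
As a quick sanity check on the totals row, the short script below recomputes the row count from the shard layout and derives two figures that are not stated in the card: the mean number of tokens per row and the raw size of the token IDs at 4 bytes each. The constants are copied from the table; the variable names are illustrative, and the byte figure ignores any file-format overhead or compression.

```python
# Figures copied from the shard table above.
TOTAL_ROWS = 7_639_568
TOTAL_TOKENS = 4_232_493_403
FULL_SHARDS = 76              # train-0000 .. train-0075
ROWS_PER_FULL_SHARD = 100_000
LAST_SHARD_ROWS = 39_568      # train-0076

# The row total should equal the sum over all shards.
rows_from_shards = FULL_SHARDS * ROWS_PER_FULL_SHARD + LAST_SHARD_ROWS

# Derived figures (not stated in the card).
mean_tokens_per_row = TOTAL_TOKENS / TOTAL_ROWS
uint32_bytes = TOTAL_TOKENS * 4   # raw payload only, uncompressed

print("rows match:", rows_from_shards == TOTAL_ROWS)
print(f"mean tokens per row: {mean_tokens_per_row:.1f}")
print(f"raw uint32 payload: {uint32_bytes / 1e9:.1f} GB")
```

The mean works out to roughly 554 tokens per row, which sits well inside the 50 to 2550 filtering window.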
Provider:
Ba2han



