
Ba2han/forumsoh_tokenized

Hugging Face | Updated: 2026-03-23 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/Ba2han/forumsoh_tokenized
Description:
---
license: mit
language:
- tr
---

# ForumSohbetleri Tokenized (All Subsets)

This dataset is a pre-tokenized, shuffled, and interleaved version of all subsets from `turkish-nlp-suite/ForumSohbetleri`.

## Processing Details

- **Tokenizer:** [Ba2han/qwen-test-3](https://huggingface.co/Ba2han/qwen-test-3)
- **Filtering:** Min tokens = 50, Max tokens = 2550
- **Format:** `uint32` `input_ids` only.
- **Subsets Included:** donanimarsivi, donanimhaber, forumum, iyinet, kadinlarklubu, memurlar, tahribat, technopatsosyal, turkiyeforum, wardom, wmaraci
- **Shuffling:** Stream interleaved with a rolling buffer of 25000.

## Shard Statistics

| Shard | Rows | Total Tokens |
| :--- | :--- | :--- |
| train-0000 | 100,000 | 49,234,877 |
| train-0001 | 100,000 | 52,860,163 |
| train-0002 | 100,000 | 53,352,906 |
| train-0003 | 100,000 | 53,745,785 |
| train-0004 | 100,000 | 51,924,714 |
| train-0005 | 100,000 | 51,589,048 |
| train-0006 | 100,000 | 49,739,042 |
| train-0007 | 100,000 | 49,989,134 |
| train-0008 | 100,000 | 52,451,698 |
| train-0009 | 100,000 | 56,898,749 |
| train-0010 | 100,000 | 55,045,329 |
| train-0011 | 100,000 | 53,753,225 |
| train-0012 | 100,000 | 55,241,320 |
| train-0013 | 100,000 | 56,308,784 |
| train-0014 | 100,000 | 57,167,177 |
| train-0015 | 100,000 | 56,972,394 |
| train-0016 | 100,000 | 57,032,441 |
| train-0017 | 100,000 | 51,519,178 |
| train-0018 | 100,000 | 49,883,858 |
| train-0019 | 100,000 | 52,180,135 |
| train-0020 | 100,000 | 51,842,122 |
| train-0021 | 100,000 | 56,900,567 |
| train-0022 | 100,000 | 57,900,837 |
| train-0023 | 100,000 | 52,487,791 |
| train-0024 | 100,000 | 48,573,305 |
| train-0025 | 100,000 | 54,132,431 |
| train-0026 | 100,000 | 55,861,653 |
| train-0027 | 100,000 | 57,820,030 |
| train-0028 | 100,000 | 60,321,048 |
| train-0029 | 100,000 | 57,847,321 |
| train-0030 | 100,000 | 60,644,891 |
| train-0031 | 100,000 | 61,618,690 |
| train-0032 | 100,000 | 61,753,040 |
| train-0033 | 100,000 | 61,198,661 |
| train-0034 | 100,000 | 56,744,354 |
| train-0035 | 100,000 | 52,657,863 |
| train-0036 | 100,000 | 54,058,794 |
| train-0037 | 100,000 | 55,030,351 |
| train-0038 | 100,000 | 56,585,888 |
| train-0039 | 100,000 | 58,049,327 |
| train-0040 | 100,000 | 54,225,445 |
| train-0041 | 100,000 | 53,243,083 |
| train-0042 | 100,000 | 59,511,857 |
| train-0043 | 100,000 | 58,480,026 |
| train-0044 | 100,000 | 59,870,546 |
| train-0045 | 100,000 | 63,920,568 |
| train-0046 | 100,000 | 64,561,779 |
| train-0047 | 100,000 | 64,320,534 |
| train-0048 | 100,000 | 59,967,215 |
| train-0049 | 100,000 | 60,404,954 |
| train-0050 | 100,000 | 59,179,571 |
| train-0051 | 100,000 | 53,778,457 |
| train-0052 | 100,000 | 49,863,179 |
| train-0053 | 100,000 | 50,546,771 |
| train-0054 | 100,000 | 51,219,878 |
| train-0055 | 100,000 | 54,157,184 |
| train-0056 | 100,000 | 52,826,197 |
| train-0057 | 100,000 | 52,509,828 |
| train-0058 | 100,000 | 52,736,252 |
| train-0059 | 100,000 | 56,743,788 |
| train-0060 | 100,000 | 56,550,124 |
| train-0061 | 100,000 | 55,096,782 |
| train-0062 | 100,000 | 57,349,515 |
| train-0063 | 100,000 | 55,283,788 |
| train-0064 | 100,000 | 56,032,808 |
| train-0065 | 100,000 | 57,140,243 |
| train-0066 | 100,000 | 55,480,601 |
| train-0067 | 100,000 | 53,978,621 |
| train-0068 | 100,000 | 55,134,792 |
| train-0069 | 100,000 | 55,222,786 |
| train-0070 | 100,000 | 52,093,549 |
| train-0071 | 100,000 | 53,672,315 |
| train-0072 | 100,000 | 56,840,248 |
| train-0073 | 100,000 | 51,728,899 |
| train-0074 | 100,000 | 52,414,507 |
| train-0075 | 100,000 | 55,077,763 |
| train-0076 | 39,568 | 22,410,029 |
| **Total** | **7,639,568** | **4,232,493,403** |
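
The filtering and shuffling steps listed under Processing Details might look roughly like the sketch below. It is not the author's pipeline, only an illustration under stated assumptions: the helper names `length_filter` and `rolling_shuffle` are hypothetical, the 50 and 2550 token bounds are assumed to be inclusive, and the "rolling buffer" is assumed to be a standard fixed-size shuffle buffer that emits a random buffered row each time a new row arrives.

```python
import random
from typing import Iterable, Iterator, List

# Values taken from the dataset card; inclusive bounds are an assumption.
MIN_TOKENS = 50
MAX_TOKENS = 2550
BUFFER_SIZE = 25_000


def length_filter(
    rows: Iterable[List[int]],
    min_tokens: int = MIN_TOKENS,
    max_tokens: int = MAX_TOKENS,
) -> Iterator[List[int]]:
    """Yield only rows whose token count lies within [min_tokens, max_tokens]."""
    for ids in rows:
        if min_tokens <= len(ids) <= max_tokens:
            yield ids


def rolling_shuffle(
    rows: Iterable[List[int]],
    buffer_size: int = BUFFER_SIZE,
    seed: int = 0,
) -> Iterator[List[int]]:
    """Approximately shuffle a stream using a fixed-size buffer (hypothetical helper)."""
    rng = random.Random(seed)
    buffer: List[List[int]] = []
    for ids in rows:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(ids)
            continue
        # Emit a random buffered row and put the incoming row in its slot.
        j = rng.randrange(buffer_size)
        yield buffer[j]
        buffer[j] = ids
    # Stream exhausted: emit the remaining rows in random order.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    # Toy stream of fake token-id rows with varying lengths.
    toy = [[0] * n for n in (10, 60, 120, 3000, 500)]
    for row in rolling_shuffle(length_filter(toy), buffer_size=2):
        print(len(row))
```

A buffer of this kind only mixes rows that arrive within roughly one buffer length of each other, which may be why the per-shard token totals in the table above still drift from shard to shard.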
Provider:
Ba2han