
Polygl0t/tokenizers

Hugging Face | Updated 2026-03-05 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/Polygl0t/tokenizers
Resource description:
---
dataset_info:
- config_name: portuguese
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 21307799396
    num_examples: 2000000
  download_size: 7981082403
  dataset_size: 21307799396
- config_name: bengali
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 13222913829
    num_examples: 2000000
  download_size: 3804453185
  dataset_size: 13222913829
- config_name: code
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6854288242
    num_examples: 975000
  download_size: 2194755063
  dataset_size: 6854288242
- config_name: english
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 15670768606
    num_examples: 2000000
  download_size: 5700592325
  dataset_size: 15670768606
- config_name: hindi
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 22579216927
    num_examples: 2000000
  download_size: 6194810350
  dataset_size: 22579216927
configs:
- config_name: portuguese
  data_files:
  - split: train
    path: portuguese/train-*
- config_name: bengali
  data_files:
  - split: train
    path: bengali/train-*
- config_name: code
  data_files:
  - split: train
    path: code/train-*
- config_name: english
  default: true
  data_files:
  - split: train
    path: english/train-*
- config_name: hindi
  data_files:
  - split: train
    path: hindi/train-*
language:
- hi
- bn
- en
- pt
license: other
task_categories:
- text-generation
tags:
- tokenizer
- tokenization
- english
- code
- bengali
- hindi
- portuguese
pretty_name: Polygl0t tokenizers
size_categories:
- 1M<n<10M
---

# Polygl0t Tokenizers

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Subsets and Splits](#subsets-and-splits)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
- [Additional Information](#additional-information)
  - [Dataset Maintainers](#dataset-maintainers)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Acknowledgments](#acknowledgments)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/Polygl0t/tokenizers
- **Repository:** https://huggingface.co/datasets/Polygl0t/tokenizers
- **Point of Contact:** [Polygl0t](mailto:kluge@uni-bonn.de)

### Dataset Summary

This dataset contains several subsets for training multilingual tokenizers. Each subset is a collection of curated text samples in a single language or, in the case of the code subset, a mixture of programming languages.

### Supported Tasks and Leaderboards

This dataset can be used for the task of text generation, specifically for training and evaluating tokenizers in multiple languages.

### Languages

Hindi, Bengali, English, Portuguese, and Code (a mixture of 36 programming languages).

<details>
<summary><b>All programming languages</b></summary>
<code>
fortran, jupyter, cpp, solidity, python, cmake, assembly, ruby, perl, lua, typescript, c, java, html, powershell, php, haskell, shell, scala, sql, visual_basic, ada, julia, markdown, batchfile, rust, cuda, json, kotlin, go, r, javascript, pascal, yaml, css, c_sharp
</code>
</details>

## Dataset Structure

### Data Instances

The dataset consists of the following features:

- **text:** a string of text in the respective language of the subset.

### Data Fields

```json
{
  "text": "Olá, como vai você?"
}
```

### Subsets and Splits

The dataset includes the following subsets, each with a single `train` split:

- **Portuguese:** This subset contains 2,000,000 text samples in Portuguese.
- **Hindi:** This subset contains 2,000,000 text samples in Hindi.
- **Bengali:** This subset contains 2,000,000 text samples in Bengali.
- **English:** This subset contains 2,000,000 text samples in English.
- **Code:** This subset contains 975,000 text samples in various programming languages.
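
Each subset listed above can be loaded on its own by passing its config name. The snippet below is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable; `load_kwargs` and `stream_texts` are hypothetical helper names introduced here for illustration, and streaming is chosen only to avoid downloading a full subset up front.

```python
from itertools import islice
from typing import Iterator

REPO_ID = "Polygl0t/tokenizers"
# Config names as listed in the YAML header of this dataset card.
SUBSETS = ("portuguese", "bengali", "code", "english", "hindi")


def load_kwargs(subset: str = "english") -> dict:
    """Build keyword arguments for `datasets.load_dataset` (hypothetical helper)."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")
    # Every subset has a single `train` split; streaming avoids a full download.
    return {"path": REPO_ID, "name": subset, "split": "train", "streaming": True}


def stream_texts(subset: str = "english", limit: int = 1000) -> Iterator[str]:
    """Yield the `text` field of up to `limit` samples from one subset."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(**load_kwargs(subset))
    for row in islice(dataset, limit):
        yield row["text"]
```

For example, `for text in stream_texts("hindi", limit=3): print(text[:80])` would print the first 80 characters of three Hindi samples. The resulting iterator can also be passed to a tokenizer trainer that accepts an iterator of strings.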

The `txt` files (e.g., [`hindi_test.txt`](hindi_test.txt)) are for testing/evaluation purposes.

## Dataset Creation

### Source Data

- **Bengali:** The Bengali text samples were sourced from [Polygl0t/gigakriya-v1](https://huggingface.co/datasets/Polygl0t/gigakriya-v1).
- **English:** The English text samples were sourced from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
- **Hindi:** The Hindi text samples were sourced from [Polygl0t/gigalekh-v1](https://huggingface.co/datasets/Polygl0t/gigalekh-v1).
- **Portuguese:** The Portuguese text samples were sourced from [Polygl0t/gigaverbo-v2](https://huggingface.co/datasets/Polygl0t/gigaverbo-v2).
- **Code:** The code samples were sourced from [bigcode/starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata).

## Additional Information

### Dataset Maintainers

- [Nicholas Kluge Corrêa](mailto:kluge@uni-bonn.de).
- [Shiza Fatimah](mailto:shizafatimah15@gmail.com).
- [Aniket Sen](mailto:sen@hiskp.uni-bonn.de).

### Licensing Information

Please refer to the individual licenses of the source datasets used to create this corpus, as listed in the "Source Data" section above. The combined dataset does not have a single unified license, and users should ensure compliance with the terms of each source dataset when utilizing this corpus.

### Citation Information

```latex
@misc{correa2026tucano2cool,
  title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}},
  author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
  year={2026},
  eprint={2603.03543},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.03543},
}

@misc{shiza2026lilmoo,
  title={{Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi}},
  author={Shiza Fatimah and Aniket Sen and Sophia Falk and Florian Mai and Lucie Flek and Nicholas Kluge Corr{\^e}a},
  year={2026},
  eprint={2603.03508},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.03508},
}

@misc{fatimah2026liltii,
  title={{LilTii: A 0.6B Bengali Language Model that Outperforms Qwen}},
  author={Shiza Fatimah and Aniket Sen and Sophia Falk and Florian Mai and Lucie Flek and Nicholas Kluge Corr{\^e}a},
  year={2026},
  howpublished={\url{https://hf.co/blog/Polygl0t/liltii}}
}
```

### Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge the granted access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin) hosted by [University of Bonn](https://www.uni-bonn.de/en) along with the support provided by its High Performance Computing & Analytics Lab.

### Contributions

If you want to contribute, contact us at [polyglot@uni-bonn.de](mailto:polyglot@uni-bonn.de)!
Provided by:
Polygl0t