fhai50032/pds-tk
Source: Hugging Face. Updated: 2025-01-26. Indexed: 2025-02-15.
Download link:
https://hf-mirror.com/datasets/fhai50032/pds-tk
Description:
This dataset is a corpus for training tokenizers. It covers several Indian languages, Hindi, and English, along with text from the math, code, and medical domains. It contains approximately 147,000,000 characters from Indian languages such as Bengali, Tamil, Kannada, Telugu, Malayalam, Gujarati, and Punjabi; 100,000,000 characters of Hindi; 82,000,000 characters of English; 30,000,000 characters of math; 24,000,000 characters of code; and 15,000,000 characters of medical text.
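The stated counts add up to roughly 398,000,000 characters. As a minimal sketch of how the mix breaks down, the snippet below computes each category's share of that total. The dictionary keys and variable names are illustrative, and the figures are the approximate counts quoted in the description, not values measured from the files.

```python
# Approximate character counts per category, copied from the description above.
# The category labels used as keys are illustrative, not column names in the dataset.
char_counts = {
    "Indian languages": 147_000_000,
    "Hindi": 100_000_000,
    "English": 82_000_000,
    "Math": 30_000_000,
    "Code": 24_000_000,
    "Medical": 15_000_000,
}

# Overall corpus size implied by the stated counts.
total_chars = sum(char_counts.values())

# Share of each category in the overall mix, as a percentage of total characters.
shares = {name: 100 * count / total_chars for name, count in char_counts.items()}

for name, pct in shares.items():
    print(f"{name:<17} {char_counts[name]:>12,} chars  {pct:5.1f}%")
print(f"{'Total':<17} {total_chars:>12,} chars")
```

Shares computed this way are by character count, so they only approximate the proportion of tokens each category would contribute once a tokenizer is trained.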
Provider:
fhai50032



