wannaphong/KhanomTanLLM-pretrained-dataset

Name: wannaphong/KhanomTanLLM-pretrained-dataset
Creator: wannaphong
Published: 2024-09-12 15:24:53
License: No description available

Source: Hugging Face. Updated: 2024-09-12. Indexed: 2024-12-14.

Download link:

https://hf-mirror.com/datasets/wannaphong/KhanomTanLLM-pretrained-dataset

Description:

The KhanomTanLLM pretrained dataset collects all the raw text used to pretrain the KhanomTanLLM large language model (LLM). It consists primarily of Thai and English text, along with code and parallel data. The stated total is 53,376,211,711 tokens: 31,629,984,243 English, 12,785,565,497 Thai, 8,913,084,300 code, and 190,310,686 parallel. Token counts are based on the Typhoon-7B tokenizer. The dataset is organized into multiple subsets drawn from different sources, including pythainlp, bigscience-data, and HuggingFaceTB. The parallel data is included to support building bilingual LLMs.
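
To make the token breakdown easier to read, here is a minimal sketch that converts the stated counts into percentage shares of the stated total. The names `STATED_TOTAL`, `TOKENS`, and `shares` are illustrative and not part of the dataset or any library. Note that the four category counts as listed add up to 53,518,944,726, about 142.7 million more than the stated total; the sketch reports that gap rather than trying to reconcile it.

```python
# Token counts exactly as stated in the dataset description.
STATED_TOTAL = 53_376_211_711
TOKENS = {
    "English": 31_629_984_243,
    "Thai": 12_785_565_497,
    "Code": 8_913_084_300,
    "Parallel": 190_310_686,
}


def shares(counts, total):
    """Return each category's percentage of `total`, rounded to 2 decimals."""
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Sum of the listed categories, and how far it is from the stated total.
category_sum = sum(TOKENS.values())
gap = category_sum - STATED_TOTAL

for name, pct in shares(TOKENS, STATED_TOTAL).items():
    print(f"{name:<9}{pct:6.2f}%")
print(f"sum of categories: {category_sum:,}")
print(f"difference from stated total: {gap:,}")
```

Because the shares are computed against the stated total, they add up to slightly more than 100%. Dividing by `category_sum` instead would give shares that sum to 100%, at the cost of departing from the published total.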

Provider:

wannaphong
