textcleanlm/textclean-200M
Source: Hugging Face. Updated: 2025-08-03. Indexed: 2025-08-09.
Download link:
https://hf-mirror.com/datasets/textcleanlm/textclean-200M
Description:
The dataset contains text data in two configurations: default and raw_data. Each record includes an id, a URL, the raw text, the token count of the raw text, the cleaned text, and the token count of the cleaned text. The default configuration's training set contains 402,203 examples totaling 1,994,315,378 bytes; the raw_data configuration's training set contains 178,757 examples totaling 903,005,905 bytes.
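As a rough illustration of the record layout and sizes described above, the following Python sketch models one record and derives the average size per example for each configuration. The field names in `Record` (`raw_text`, `raw_tokens`, `cleaned_text`, `cleaned_tokens`) and the `id` type are hypothetical, since the description lists the fields but not their exact column names. The `stream_config` helper assumes the Hugging Face `datasets` library is installed and that the repository id matches the page title; it is not called here.

```python
from dataclasses import dataclass

REPO_ID = "textcleanlm/textclean-200M"  # taken from the page title

# Example counts and byte sizes exactly as stated in the description.
CONFIG_STATS = {
    "default": {"num_examples": 402203, "num_bytes": 1994315378},
    "raw_data": {"num_examples": 178757, "num_bytes": 903005905},
}


@dataclass
class Record:
    # Hypothetical field names and types; the real column names may differ.
    id: str
    url: str
    raw_text: str
    raw_tokens: int
    cleaned_text: str
    cleaned_tokens: int


def avg_bytes_per_example(config: str) -> float:
    """Average size of one training example, from the stated totals."""
    stats = CONFIG_STATS[config]
    return stats["num_bytes"] / stats["num_examples"]


def token_reduction(rec: Record) -> float:
    """Fraction of raw tokens removed by cleaning (0.0 if the raw count is zero)."""
    if rec.raw_tokens == 0:
        return 0.0
    return 1.0 - rec.cleaned_tokens / rec.raw_tokens


def stream_config(config: str):
    """Stream the train split of one configuration.

    Assumption: the `datasets` library is installed and REPO_ID is reachable.
    """
    from datasets import load_dataset  # imported lazily; needs network access

    return load_dataset(REPO_ID, config, split="train", streaming=True)


if __name__ == "__main__":
    for name in CONFIG_STATS:
        print(name, round(avg_bytes_per_example(name)), "bytes per example")
```

By the stated totals, both configurations average roughly 5 KB per example. Streaming is used in the loader sketch so that a first look at the data does not require downloading a full configuration.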
Provider:
textcleanlm



