BEE-spoke-data/TxT360-1M-sample
Source: Hugging Face. Updated: 2024-10-10. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/BEE-spoke-data/TxT360-1M-sample
Overview:
This dataset is a one-million-row sample drawn from LLM360/TxT360, suitable for text generation and feature extraction tasks. Each sample's text is between 256 and 8192 GPT-4 tokens long. Every row has a string feature named text and a structured feature named meta, which holds sub-features such as cc-path, corpusid, dup_signals, language information, and open-access information. The dataset also carries quality signals such as the fraction of duplicated lines and the fraction of toxic words. It contains 1,000,000 samples and occupies 5,427,934,946 bytes, with a download size of 2,765,337,588 bytes.
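The figures in the overview can be put in more familiar units, and the rows can be inspected without downloading everything. The following is a minimal sketch, not an official loader. It assumes the Hugging Face `datasets` library is installed and that the data is exposed under a split named `train` (the listing does not state the split name). The helper names `size_summary` and `stream_rows` are hypothetical, introduced only for illustration.

```python
DATASET_ID = "BEE-spoke-data/TxT360-1M-sample"

# Figures quoted in the listing above.
NUM_ROWS = 1_000_000
DATASET_BYTES = 5_427_934_946
DOWNLOAD_BYTES = 2_765_337_588


def size_summary(num_rows=NUM_ROWS,
                 dataset_bytes=DATASET_BYTES,
                 download_bytes=DOWNLOAD_BYTES):
    """Convert the listed byte counts into GiB, a per-row average, and a ratio."""
    return {
        "dataset_gib": dataset_bytes / 2**30,
        "download_gib": download_bytes / 2**30,
        "avg_bytes_per_row": dataset_bytes / num_rows,
        "download_ratio": download_bytes / dataset_bytes,
    }


def stream_rows(limit=3, split="train"):
    """Yield the first `limit` rows without downloading the full dataset.

    Assumption: the split is called "train"; adjust if the dataset card differs.
    """
    # Imported lazily so size_summary() works even without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(DATASET_ID, split=split, streaming=True)
    for i, row in enumerate(ds):
        if i >= limit:
            break
        yield row


if __name__ == "__main__":
    for key, value in size_summary().items():
        print(f"{key}: {value:.4f}")
    for row in stream_rows(limit=1):
        # Each row is expected to carry a `text` string and a `meta` structure.
        print(sorted(row.keys()))
```

By these figures, the average row is roughly 5.4 KB and the download is about half the size of the dataset. Streaming is used here because fetching a few rows is usually enough to check the `text` and `meta` layout before committing to the full download.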
Provider:
BEE-spoke-data



