BEE-spoke-data/TxT360-1M-sample

Name: BEE-spoke-data/TxT360-1M-sample
Creator: BEE-spoke-data
Published: 2024-10-10 04:46:03
License: no description available

Source: Hugging Face. Updated 2024-10-10; indexed 2024-12-14.

Download link:

https://hf-mirror.com/datasets/BEE-spoke-data/TxT360-1M-sample

Description:

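This dataset is a one-million-row sample drawn from LLM360/TxT360. The text of each sample is between 256 and 8192 GPT-4 tokens long. The dataset contains the text content, metadata (such as path, corpus ID, duplication signals, language information, and open-access information), and quality signals (such as the fraction of duplicated lines and the fraction of toxic words). The dataset size is 5,427,934,946 bytes across 1,000,000 samples, and the download size is 2,765,337,588 bytes.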
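To put the size figures above in perspective, the following short Python sketch derives the average sample size and the ratio of download size to dataset size. The three constants are taken from the description; the variable names and the interpretation of the two sizes as "uncompressed" versus "compressed" are assumptions for illustration only.

```python
# Figures quoted in the dataset description.
DATASET_SIZE_BYTES = 5_427_934_946   # reported dataset size
DOWNLOAD_SIZE_BYTES = 2_765_337_588  # reported download size
NUM_SAMPLES = 1_000_000              # reported number of samples

GIB = 1024 ** 3  # bytes per gibibyte

# Average stored size of one sample, in bytes.
avg_bytes_per_sample = DATASET_SIZE_BYTES / NUM_SAMPLES

# Fraction of the dataset size that has to be downloaded.
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES

print(f"dataset size:   {DATASET_SIZE_BYTES / GIB:.2f} GiB")
print(f"download size:  {DOWNLOAD_SIZE_BYTES / GIB:.2f} GiB")
print(f"avg per sample: {avg_bytes_per_sample:.1f} bytes")
print(f"download ratio: {download_ratio:.3f}")
```

On these numbers the download is roughly half the size of the dataset, and a typical sample occupies a few kilobytes.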

This is a one million row sample from the LLM360/TxT360 dataset, suitable for text generation and feature extraction tasks. The dataset includes a string feature named text and a structured feature named meta which contains multiple sub-features such as cc-path, corpusid, dup_signals, etc. The text length ranges from 256 to 8192 GPT-4 tokens.
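As a minimal sketch of how the `text` and `meta` features described above might be inspected, the Python below streams a few rows with the Hugging Face `datasets` library instead of downloading the full set. The split name `"train"` and the helper names `iter_samples` and `describe` are assumptions, not something the listing confirms; check the dataset card for the actual splits and the full list of `meta` sub-features.

```python
from itertools import islice
from typing import Any, Dict, Iterator

REPO_ID = "BEE-spoke-data/TxT360-1M-sample"


def iter_samples(limit: int, split: str = "train") -> Iterator[Dict[str, Any]]:
    """Yield up to `limit` records, streamed from the Hub.

    Assumes the `datasets` package is installed and that a split
    named `split` exists ("train" is a guess, not confirmed here).
    """
    from datasets import load_dataset  # imported lazily; needs network access

    stream = load_dataset(REPO_ID, split=split, streaming=True)
    yield from islice(stream, limit)


def describe(record: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize one record: text length in characters and meta keys present."""
    meta = record.get("meta") or {}
    return {
        "n_chars": len(record.get("text", "")),
        "meta_keys": sorted(meta),
    }


# Example usage (requires network access):
#     for rec in iter_samples(3):
#         print(describe(rec))
```

Streaming avoids fetching the whole download up front, which is convenient for a quick look at the schema. Note that `describe` counts characters, not GPT-4 tokens, so its output is not directly comparable to the 256 to 8192 token range.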

Provided by:

BEE-spoke-data
