EliMC/TxT360-1M-sample
Source: Hugging Face. Updated 2025-12-05; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/EliMC/TxT360-1M-sample
Resource description:
---
language:
- en
license: odc-by
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- feature-extraction
dataset_info:
  features:
  - name: text
    dtype: string
  - name: meta
    struct:
    - name: cc-path
      dtype: string
    - name: corpusid
      dtype: int64
    - name: dup_signals
      struct:
      - name: dup_details
        struct:
        - name: 2013-20
          dtype: int64
        - name: 2013-48
          dtype: int64
        - name: 2014-10
          dtype: int64
        - name: 2014-15
          dtype: int64
        - name: 2014-23
          dtype: int64
        - name: 2014-35
          dtype: int64
        - name: 2014-41
          dtype: int64
        - name: 2014-42
          dtype: int64
        - name: 2014-49
          dtype: 'null'
        - name: 2014-52
          dtype: int64
        - name: 2015-06
          dtype: int64
        - name: 2015-11
          dtype: int64
        - name: 2015-14
          dtype: int64
        - name: 2015-18
          dtype: int64
        - name: 2015-22
          dtype: int64
        - name: 2015-27
          dtype: int64
        - name: 2015-32
          dtype: int64
        - name: 2015-35
          dtype: int64
        - name: 2015-40
          dtype: int64
        - name: 2016-07
          dtype: int64
        - name: 2016-18
          dtype: int64
        - name: 2016-22
          dtype: int64
        - name: 2016-26
          dtype: int64
        - name: 2016-30
          dtype: int64
        - name: 2016-40
          dtype: int64
        - name: 2016-44
          dtype: int64
        - name: 2016-50
          dtype: int64
        - name: 2017-04
          dtype: int64
        - name: 2017-09
          dtype: int64
        - name: 2017-13
          dtype: int64
        - name: 2017-17
          dtype: int64
        - name: 2017-22
          dtype: int64
        - name: 2017-26
          dtype: int64
        - name: 2017-30
          dtype: int64
        - name: 2017-34
          dtype: int64
        - name: 2017-39
          dtype: int64
        - name: 2017-43
          dtype: int64
        - name: 2017-47
          dtype: int64
        - name: 2017-51
          dtype: int64
        - name: 2018-05
          dtype: int64
        - name: 2018-09
          dtype: int64
        - name: 2018-13
          dtype: int64
        - name: 2018-17
          dtype: int64
        - name: 2018-22
          dtype: int64
        - name: 2018-26
          dtype: int64
        - name: 2018-30
          dtype: int64
        - name: 2018-34
          dtype: int64
        - name: 2018-39
          dtype: int64
        - name: 2018-43
          dtype: int64
        - name: 2018-47
          dtype: int64
        - name: 2018-51
          dtype: int64
        - name: 2019-04
          dtype: int64
        - name: 2019-09
          dtype: int64
        - name: 2019-13
          dtype: int64
        - name: 2019-18
          dtype: int64
        - name: 2019-22
          dtype: int64
        - name: 2019-26
          dtype: int64
        - name: 2019-30
          dtype: int64
        - name: 2019-35
          dtype: int64
        - name: 2019-39
          dtype: int64
        - name: 2019-43
          dtype: int64
        - name: 2019-47
          dtype: int64
        - name: 2019-51
          dtype: int64
        - name: 2020-05
          dtype: int64
        - name: 2020-10
          dtype: int64
        - name: 2020-16
          dtype: int64
        - name: 2020-24
          dtype: int64
        - name: 2020-29
          dtype: int64
        - name: 2020-34
          dtype: int64
        - name: 2020-40
          dtype: int64
        - name: 2020-45
          dtype: int64
        - name: 2020-50
          dtype: int64
        - name: 2021-04
          dtype: int64
        - name: 2021-10
          dtype: int64
        - name: 2021-17
          dtype: int64
        - name: 2021-21
          dtype: int64
        - name: 2021-25
          dtype: int64
        - name: 2021-31
          dtype: int64
        - name: 2021-39
          dtype: int64
        - name: 2021-43
          dtype: int64
        - name: 2021-49
          dtype: int64
        - name: 2022-05
          dtype: int64
        - name: 2022-21
          dtype: int64
        - name: 2022-27
          dtype: int64
        - name: 2022-33
          dtype: int64
        - name: 2022-40
          dtype: int64
        - name: 2022-49
          dtype: int64
        - name: 2023-06
          dtype: int64
        - name: 2023-14
          dtype: int64
        - name: 2023-23
          dtype: int64
        - name: 2023-40
          dtype: int64
        - name: 2023-50
          dtype: int64
        - name: 2024-10
          dtype: int64
        - name: 2024-18
          dtype: int64
        - name: 2024-22
          dtype: int64
        - name: 2024-26
          dtype: int64
        - name: 2024-30
          dtype: int64
        - name: curated_sources
          dtype: int64
        - name: unknown
          dtype: int64
      - name: dup_doc_count
        dtype: int64
      - name: dup_dump_count
        dtype: int64
    - name: id
      dtype: int64
    - name: lang
      dtype: string
    - name: lang_score
      dtype: float64
    - name: language
      dtype: string
    - name: openaccessinfo
      struct:
      - name: externalids
        struct:
        - name: ACL
          dtype: string
        - name: ArXiv
          dtype: string
        - name: DOI
          dtype: string
        - name: MAG
          dtype: string
        - name: PubMedCentral
          dtype: string
      - name: license
        dtype: string
      - name: status
        dtype: string
      - name: url
        dtype: string
    - name: pmid
      dtype: int64
    - name: quality_signals
      struct:
      - name: fraction_of_characters_in_duplicate_lines
        dtype: float64
      - name: fraction_of_characters_in_duplicate_ngrams
        sequence:
          sequence: float64
      - name: fraction_of_characters_in_duplicate_paragraphs
        dtype: float64
      - name: fraction_of_characters_in_most_common_ngram
        sequence:
          sequence: float64
      - name: fraction_of_duplicate_lines
        dtype: float64
      - name: fraction_of_duplicate_paragraphs
        dtype: float64
      - name: fraction_of_lines_ending_with_ellipsis
        dtype: float64
      - name: fraction_of_lines_starting_with_bullet_point
        dtype: float64
      - name: fraction_of_lines_with_toxic_words
        dtype: float64
      - name: fraction_of_words_corrected_in_lines
        dtype: float64
      - name: fraction_of_words_with_alpha_character
        dtype: float64
      - name: has_curly_bracket
        dtype: bool
      - name: has_lorem_ipsum
        dtype: bool
      - name: mean_word_length
        dtype: float64
      - name: num_of_lines_with_toxic_words
        dtype: int64
      - name: num_of_paragraphs
        dtype: int64
      - name: num_of_sentences
        dtype: int64
      - name: num_of_stop_words
        dtype: int64
      - name: num_of_toxic_words
        dtype: int64
      - name: orig_text_has_dup_lines
        dtype: bool
      - name: symbol_to_word_ratio
        dtype: float64
      - name: url_score
        dtype: float64
      - name: word_count
        dtype: int64
    - name: timestamp
      dtype: timestamp[us]
    - name: title
      dtype: string
    - name: url
      dtype: string
  - name: subset
    dtype: string
  splits:
  - name: train
    num_bytes: 5427934946.0
    num_examples: 1000000
  download_size: 2765337588
  dataset_size: 5427934946.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# BEE-spoke-data/TxT360-1M-sample
One million row sample from [LLM360/TxT360](https://huggingface.co/datasets/LLM360/TxT360):
- min length 256 GPT-4 tokens
- max length 8192 GPT-4 tokens
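
The length bounds above can be expressed as a simple filter. The following is a minimal sketch under stated assumptions, not the procedure used to build this sample: `within_token_bounds`, `filter_by_length`, and `whitespace_count` are hypothetical names, the whitespace counter is only a stand-in for a GPT-4 tokenizer, and treating both bounds as inclusive is an assumption the card does not confirm.

```python
from typing import Callable, Iterable, Iterator

MIN_TOKENS = 256   # lower bound stated in the card
MAX_TOKENS = 8192  # upper bound stated in the card


def whitespace_count(text: str) -> int:
    """Stand-in counter: whitespace-separated words, NOT GPT-4 tokens."""
    return len(text.split())


def within_token_bounds(
    text: str,
    count_tokens: Callable[[str], int] = whitespace_count,
    min_tokens: int = MIN_TOKENS,
    max_tokens: int = MAX_TOKENS,
) -> bool:
    """True if the count lies in [min_tokens, max_tokens] (inclusive by assumption)."""
    n = count_tokens(text)
    return min_tokens <= n <= max_tokens


def filter_by_length(
    texts: Iterable[str],
    count_tokens: Callable[[str], int] = whitespace_count,
) -> Iterator[str]:
    """Yield only the texts whose token count falls inside the bounds."""
    for text in texts:
        if within_token_bounds(text, count_tokens):
            yield text


# To approximate GPT-4 token counts, a counter based on tiktoken could be
# passed instead (requires `pip install tiktoken`; not needed for this sketch):
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   gpt4_count = lambda s: len(enc.encode(s, disallowed_special=()))
```

Keeping the counter pluggable means the same filter can be checked quickly with the stand-in and then rerun with a real tokenizer; the two will generally disagree on which documents pass.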
Language:
- English
License: odc-by
Size category:
- 100K < n < 1M
Task categories:
- Text generation
- Feature extraction
Dataset information:
Fields:
1. `text`: string
2. `meta`: struct with the following subfields:
   - `cc-path`: string
   - `corpusid`: int64
   - `dup_signals`: struct with the following subfields:
     - `dup_details`: struct with the following subfields:
       - 97 subfields named in `YYYY-NN` form, from `2013-20` to `2024-30`: all int64 except `2014-49`, whose dtype is `null`
       - `curated_sources`: int64
       - `unknown`: int64
     - `dup_doc_count`: int64
     - `dup_dump_count`: int64
   - `id`: int64
   - `lang`: string
   - `lang_score`: float64
   - `language`: string
   - `openaccessinfo`: struct with the following subfields:
     - `externalids`: struct with five string subfields: `ACL`, `ArXiv`, `DOI`, `MAG`, `PubMedCentral`
     - `license`: string
     - `status`: string
     - `url`: string
   - `pmid`: int64
   - `quality_signals`: struct with the following subfields:
     - `fraction_of_characters_in_duplicate_lines`: float64, fraction of characters in duplicate lines
     - `fraction_of_characters_in_duplicate_ngrams`: sequence of sequences of float64, fraction of characters in duplicate n-grams
     - `fraction_of_characters_in_duplicate_paragraphs`: float64, fraction of characters in duplicate paragraphs
     - `fraction_of_characters_in_most_common_ngram`: sequence of sequences of float64, fraction of characters in the most common n-gram
     - `fraction_of_duplicate_lines`: float64, fraction of duplicate lines
     - `fraction_of_duplicate_paragraphs`: float64, fraction of duplicate paragraphs
     - `fraction_of_lines_ending_with_ellipsis`: float64, fraction of lines ending with an ellipsis
     - `fraction_of_lines_starting_with_bullet_point`: float64, fraction of lines starting with a bullet point
     - `fraction_of_lines_with_toxic_words`: float64, fraction of lines containing toxic words
     - `fraction_of_words_corrected_in_lines`: float64, fraction of words corrected within lines
     - `fraction_of_words_with_alpha_character`: float64, fraction of words containing an alphabetic character
     - `has_curly_bracket`: bool, whether the text contains a curly bracket
     - `has_lorem_ipsum`: bool, whether the text contains Lorem Ipsum placeholder text
     - `mean_word_length`: float64, mean word length
     - `num_of_lines_with_toxic_words`: int64, number of lines containing toxic words
     - `num_of_paragraphs`: int64, number of paragraphs
     - `num_of_sentences`: int64, number of sentences
     - `num_of_stop_words`: int64, number of stop words
     - `num_of_toxic_words`: int64, number of toxic words
     - `orig_text_has_dup_lines`: bool, whether the original text contains duplicate lines
     - `symbol_to_word_ratio`: float64, ratio of symbols to words
     - `url_score`: float64, URL score
     - `word_count`: int64, total word count
   - `timestamp`: timestamp with microsecond precision (`timestamp[us]`)
   - `title`: string
   - `url`: string
3. `subset`: string
Splits:
- train: 5427934946.0 bytes, 1000000 examples
Download size: 2765337588 bytes
Total dataset size: 5427934946.0 bytes
Configs:
- Config name: default; data files: the train split maps to files matching `data/train-*`
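
To make the nested layout concrete, here is a hedged sketch of reading a few `meta` subfields from one record held as a plain Python dict. The helper names (`dumps_with_duplicates`, `passes_quality`, `iter_records`) and the thresholds are hypothetical, and `example` is a hand-made record with placeholder values, not real data. `iter_records` assumes the Hugging Face `datasets` library is installed and the Hub is reachable; it is defined but not called here.

```python
from typing import Any, Dict, Iterator, List


def dumps_with_duplicates(record: Dict[str, Any]) -> List[str]:
    """List dup_details keys with a positive count; null or missing entries are skipped."""
    meta = record.get("meta") or {}
    details = (meta.get("dup_signals") or {}).get("dup_details") or {}
    return sorted(k for k, v in details.items() if isinstance(v, int) and v > 0)


def passes_quality(
    record: Dict[str, Any],
    max_dup_line_fraction: float = 0.3,  # illustrative threshold, not from the card
    min_words: int = 50,                 # illustrative threshold, not from the card
) -> bool:
    """Check two quality signals; records lacking them are rejected."""
    qs = (record.get("meta") or {}).get("quality_signals") or {}
    frac = qs.get("fraction_of_duplicate_lines")
    words = qs.get("word_count")
    if frac is None or words is None:
        return False
    return frac <= max_dup_line_fraction and words >= min_words


def iter_records(repo_id: str = "EliMC/TxT360-1M-sample") -> Iterator[Dict[str, Any]]:
    """Stream the train split without downloading every file first."""
    from datasets import load_dataset  # imported lazily; needs `pip install datasets`
    return iter(load_dataset(repo_id, split="train", streaming=True))


# Hand-made record mirroring part of the schema (placeholder values only).
example = {
    "text": "placeholder text",
    "subset": "placeholder-subset",
    "meta": {
        "dup_signals": {
            "dup_details": {"2013-20": 0, "2014-10": 2, "2014-49": None, "curated_sources": 1},
            "dup_doc_count": 3,
            "dup_dump_count": 2,
        },
        "quality_signals": {"fraction_of_duplicate_lines": 0.05, "word_count": 420},
    },
}

print(dumps_with_duplicates(example))
print(passes_quality(example))
```

The `or {}` guards reflect an assumption that nested structs may be null for some records; if that never happens in practice, the helpers still behave the same.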
Provider:
EliMC



