
EliMC/TxT360-1M-sample

Source: Hugging Face. Updated 2025-12-05. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/EliMC/TxT360-1M-sample
Resource description:
---
language:
- en
license: odc-by
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- feature-extraction
dataset_info:
  features:
  - name: text
    dtype: string
  - name: meta
    struct:
    - name: cc-path
      dtype: string
    - name: corpusid
      dtype: int64
    - name: dup_signals
      struct:
      - name: dup_details
        struct:
        - name: 2013-20
          dtype: int64
        - name: 2013-48
          dtype: int64
        - name: 2014-10
          dtype: int64
        - name: 2014-15
          dtype: int64
        - name: 2014-23
          dtype: int64
        - name: 2014-35
          dtype: int64
        - name: 2014-41
          dtype: int64
        - name: 2014-42
          dtype: int64
        - name: 2014-49
          dtype: 'null'
        - name: 2014-52
          dtype: int64
        - name: 2015-06
          dtype: int64
        - name: 2015-11
          dtype: int64
        - name: 2015-14
          dtype: int64
        - name: 2015-18
          dtype: int64
        - name: 2015-22
          dtype: int64
        - name: 2015-27
          dtype: int64
        - name: 2015-32
          dtype: int64
        - name: 2015-35
          dtype: int64
        - name: 2015-40
          dtype: int64
        - name: 2016-07
          dtype: int64
        - name: 2016-18
          dtype: int64
        - name: 2016-22
          dtype: int64
        - name: 2016-26
          dtype: int64
        - name: 2016-30
          dtype: int64
        - name: 2016-40
          dtype: int64
        - name: 2016-44
          dtype: int64
        - name: 2016-50
          dtype: int64
        - name: 2017-04
          dtype: int64
        - name: 2017-09
          dtype: int64
        - name: 2017-13
          dtype: int64
        - name: 2017-17
          dtype: int64
        - name: 2017-22
          dtype: int64
        - name: 2017-26
          dtype: int64
        - name: 2017-30
          dtype: int64
        - name: 2017-34
          dtype: int64
        - name: 2017-39
          dtype: int64
        - name: 2017-43
          dtype: int64
        - name: 2017-47
          dtype: int64
        - name: 2017-51
          dtype: int64
        - name: 2018-05
          dtype: int64
        - name: 2018-09
          dtype: int64
        - name: 2018-13
          dtype: int64
        - name: 2018-17
          dtype: int64
        - name: 2018-22
          dtype: int64
        - name: 2018-26
          dtype: int64
        - name: 2018-30
          dtype: int64
        - name: 2018-34
          dtype: int64
        - name: 2018-39
          dtype: int64
        - name: 2018-43
          dtype: int64
        - name: 2018-47
          dtype: int64
        - name: 2018-51
          dtype: int64
        - name: 2019-04
          dtype: int64
        - name: 2019-09
          dtype: int64
        - name: 2019-13
          dtype: int64
        - name: 2019-18
          dtype: int64
        - name: 2019-22
          dtype: int64
        - name: 2019-26
          dtype: int64
        - name: 2019-30
          dtype: int64
        - name: 2019-35
          dtype: int64
        - name: 2019-39
          dtype: int64
        - name: 2019-43
          dtype: int64
        - name: 2019-47
          dtype: int64
        - name: 2019-51
          dtype: int64
        - name: 2020-05
          dtype: int64
        - name: 2020-10
          dtype: int64
        - name: 2020-16
          dtype: int64
        - name: 2020-24
          dtype: int64
        - name: 2020-29
          dtype: int64
        - name: 2020-34
          dtype: int64
        - name: 2020-40
          dtype: int64
        - name: 2020-45
          dtype: int64
        - name: 2020-50
          dtype: int64
        - name: 2021-04
          dtype: int64
        - name: 2021-10
          dtype: int64
        - name: 2021-17
          dtype: int64
        - name: 2021-21
          dtype: int64
        - name: 2021-25
          dtype: int64
        - name: 2021-31
          dtype: int64
        - name: 2021-39
          dtype: int64
        - name: 2021-43
          dtype: int64
        - name: 2021-49
          dtype: int64
        - name: 2022-05
          dtype: int64
        - name: 2022-21
          dtype: int64
        - name: 2022-27
          dtype: int64
        - name: 2022-33
          dtype: int64
        - name: 2022-40
          dtype: int64
        - name: 2022-49
          dtype: int64
        - name: 2023-06
          dtype: int64
        - name: 2023-14
          dtype: int64
        - name: 2023-23
          dtype: int64
        - name: 2023-40
          dtype: int64
        - name: 2023-50
          dtype: int64
        - name: 2024-10
          dtype: int64
        - name: 2024-18
          dtype: int64
        - name: 2024-22
          dtype: int64
        - name: 2024-26
          dtype: int64
        - name: 2024-30
          dtype: int64
        - name: curated_sources
          dtype: int64
        - name: unknown
          dtype: int64
      - name: dup_doc_count
        dtype: int64
      - name: dup_dump_count
        dtype: int64
    - name: id
      dtype: int64
    - name: lang
      dtype: string
    - name: lang_score
      dtype: float64
    - name: language
      dtype: string
    - name: openaccessinfo
      struct:
      - name: externalids
        struct:
        - name: ACL
          dtype: string
        - name: ArXiv
          dtype: string
        - name: DOI
          dtype: string
        - name: MAG
          dtype: string
        - name: PubMedCentral
          dtype: string
      - name: license
        dtype: string
      - name: status
        dtype: string
      - name: url
        dtype: string
    - name: pmid
      dtype: int64
    - name: quality_signals
      struct:
      - name: fraction_of_characters_in_duplicate_lines
        dtype: float64
      - name: fraction_of_characters_in_duplicate_ngrams
        sequence:
          sequence: float64
      - name: fraction_of_characters_in_duplicate_paragraphs
        dtype: float64
      - name: fraction_of_characters_in_most_common_ngram
        sequence:
          sequence: float64
      - name: fraction_of_duplicate_lines
        dtype: float64
      - name: fraction_of_duplicate_paragraphs
        dtype: float64
      - name: fraction_of_lines_ending_with_ellipsis
        dtype: float64
      - name: fraction_of_lines_starting_with_bullet_point
        dtype: float64
      - name: fraction_of_lines_with_toxic_words
        dtype: float64
      - name: fraction_of_words_corrected_in_lines
        dtype: float64
      - name: fraction_of_words_with_alpha_character
        dtype: float64
      - name: has_curly_bracket
        dtype: bool
      - name: has_lorem_ipsum
        dtype: bool
      - name: mean_word_length
        dtype: float64
      - name: num_of_lines_with_toxic_words
        dtype: int64
      - name: num_of_paragraphs
        dtype: int64
      - name: num_of_sentences
        dtype: int64
      - name: num_of_stop_words
        dtype: int64
      - name: num_of_toxic_words
        dtype: int64
      - name: orig_text_has_dup_lines
        dtype: bool
      - name: symbol_to_word_ratio
        dtype: float64
      - name: url_score
        dtype: float64
      - name: word_count
        dtype: int64
    - name: timestamp
      dtype: timestamp[us]
    - name: title
      dtype: string
    - name: url
      dtype: string
  - name: subset
    dtype: string
  splits:
  - name: train
    num_bytes: 5427934946.0
    num_examples: 1000000
  download_size: 2765337588
  dataset_size: 5427934946.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# BEE-spoke-data/TxT360-1M-sample

One million row sample from [LLM360/TxT360](https://huggingface.co/datasets/LLM360/TxT360):

- min length 256 GPT-4 tokens
- max length 8192 GPT-4 tokens
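
The length bounds above can be applied to other text with a filter like the one below. This is a minimal sketch, not the sampling script used for this dataset: `within_token_bounds`, `filter_by_length`, and `whitespace_token_count` are hypothetical names, the bounds are assumed to be inclusive, and the whitespace counter is only a stand-in. To approximate the card's "GPT-4 tokens", one would pass a counter built on a GPT-4 tokenizer instead (for example via `tiktoken`), which is an assumption here since the card does not name the tool.

```python
from typing import Callable, Iterable, Iterator

MIN_TOKENS = 256   # lower bound stated in the dataset card
MAX_TOKENS = 8192  # upper bound stated in the dataset card


def whitespace_token_count(text: str) -> int:
    """Stand-in counter: whitespace-separated words, NOT GPT-4 tokens."""
    return len(text.split())


def within_token_bounds(
    text: str,
    count_tokens: Callable[[str], int] = whitespace_token_count,
    min_tokens: int = MIN_TOKENS,
    max_tokens: int = MAX_TOKENS,
) -> bool:
    """Return True if the token count lies in [min_tokens, max_tokens].

    Inclusive bounds are an assumption; the card only says "min" and "max".
    """
    n = count_tokens(text)
    return min_tokens <= n <= max_tokens


def filter_by_length(
    texts: Iterable[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> Iterator[str]:
    """Yield only the texts whose token count falls within the bounds."""
    for text in texts:
        if within_token_bounds(text, count_tokens):
            yield text
```

Passing the counter as a parameter keeps the filter independent of any particular tokenizer, so the same function works with a real tokenizer's `len(encode(text))` or with a cheap approximation during prototyping.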

Language: English
License: odc-by
Size category: 100K < n < 1M
Task categories: text generation, feature extraction

Dataset fields:

1. text: string
2. meta: struct with the following subfields:
   - cc-path: string
   - corpusid: int64
   - dup_signals: struct
     - dup_details: struct of year-prefixed subfields (2013-20 through 2024-30), all int64 except `2014-49`, whose dtype is null, plus curated_sources (int64) and unknown (int64)
     - dup_doc_count: int64
     - dup_dump_count: int64
   - id: int64
   - lang: string
   - lang_score: float64
   - language: string
   - openaccessinfo: struct
     - externalids: struct with five string subfields: ACL, ArXiv, DOI, MAG, PubMedCentral
     - license: string
     - status: string
     - url: string
   - pmid: int64
   - quality_signals: struct
     - fraction_of_characters_in_duplicate_lines: float64, share of characters in duplicate lines
     - fraction_of_characters_in_duplicate_ngrams: nested sequence of float64, share of characters in duplicate n-grams
     - fraction_of_characters_in_duplicate_paragraphs: float64, share of characters in duplicate paragraphs
     - fraction_of_characters_in_most_common_ngram: nested sequence of float64, share of characters in the most common n-gram
     - fraction_of_duplicate_lines: float64, share of duplicate lines
     - fraction_of_duplicate_paragraphs: float64, share of duplicate paragraphs
     - fraction_of_lines_ending_with_ellipsis: float64, share of lines ending with an ellipsis
     - fraction_of_lines_starting_with_bullet_point: float64, share of lines starting with a bullet point
     - fraction_of_lines_with_toxic_words: float64, share of lines containing toxic words
     - fraction_of_words_corrected_in_lines: float64, share of words corrected within lines
     - fraction_of_words_with_alpha_character: float64, share of words containing an alphabetic character
     - has_curly_bracket: bool, whether the text contains a curly bracket
     - has_lorem_ipsum: bool, whether the text contains Lorem Ipsum placeholder text
     - mean_word_length: float64, mean word length
     - num_of_lines_with_toxic_words: int64, number of lines containing toxic words
     - num_of_paragraphs: int64, number of paragraphs
     - num_of_sentences: int64, number of sentences
     - num_of_stop_words: int64, number of stop words
     - num_of_toxic_words: int64, number of toxic words
     - orig_text_has_dup_lines: bool, whether the original text contains duplicate lines
     - symbol_to_word_ratio: float64, ratio of symbols to words
     - url_score: float64, URL score
     - word_count: int64, total word count
   - timestamp: timestamp with microsecond precision
   - title: string
   - url: string
3. subset: string

Splits: train, 5427934946 bytes, 1000000 examples
Download size: 2765337588 bytes
Dataset size: 5427934946 bytes
Configs: default, with the train split read from `data/train-*`

The dataset is a one-million-row sample of [LLM360/TxT360](https://huggingface.co/datasets/LLM360/TxT360), limited to documents between 256 and 8192 GPT-4 tokens in length.
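
Because `meta` is a nested struct and some subfields can be null (the schema lists `2014-49` with a null dtype), reading a record usually takes a few defensive lookups. The sketch below works on a plain Python dict shaped like the schema above. The record values, including the `subset` string, are made up for illustration, and `get_nested` and `total_dup_count` are hypothetical helper names rather than part of any library.

```python
from collections.abc import Mapping
from typing import Any


def get_nested(record: Mapping, *keys: str, default: Any = None) -> Any:
    """Walk nested mappings; return `default` if a key is missing or its value is None."""
    current: Any = record
    for key in keys:
        if not isinstance(current, Mapping) or current.get(key) is None:
            return default
        current = current[key]
    return current


def total_dup_count(record: Mapping) -> int:
    """Sum the non-null per-snapshot counts under meta.dup_signals.dup_details."""
    details = get_nested(record, "meta", "dup_signals", "dup_details", default={})
    return sum(v for v in details.values() if v is not None)


# Made-up record that follows the schema; only a few fields are filled in.
example = {
    "text": "An example document.",
    "meta": {
        "dup_signals": {
            "dup_details": {"2013-20": 2, "2014-49": None, "2024-30": 1},
            "dup_doc_count": 3,
            "dup_dump_count": 2,
        },
        "quality_signals": {"word_count": 3, "has_lorem_ipsum": False},
        "openaccessinfo": None,  # struct fields may be entirely null
    },
    "subset": "example-subset",  # hypothetical value
}

print(total_dup_count(example))                                      # 3
print(get_nested(example, "meta", "quality_signals", "word_count"))  # 3
print(get_nested(example, "meta", "openaccessinfo", "license", default="unknown"))  # unknown
```

Checking for `None` explicitly, rather than relying on truthiness, keeps legitimate values such as `0` or `False` from being mistaken for missing data.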
Provider:
EliMC