Name: EliMC/TxT360-1M-sample
Creator: EliMC
Published: 2025-12-05 16:49:17
License: no description provided

Download link:

https://hf-mirror.com/datasets/EliMC/TxT360-1M-sample

Description:

---
language:
- en
license: odc-by
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- feature-extraction
dataset_info:
  features:
  - name: text
    dtype: string
  - name: meta
    struct:
    - name: cc-path
      dtype: string
    - name: corpusid
      dtype: int64
    - name: dup_signals
      struct:
      - name: dup_details
        struct:
        - name: 2013-20
          dtype: int64
        - name: 2013-48
          dtype: int64
        - name: 2014-10
          dtype: int64
        - name: 2014-15
          dtype: int64
        - name: 2014-23
          dtype: int64
        - name: 2014-35
          dtype: int64
        - name: 2014-41
          dtype: int64
        - name: 2014-42
          dtype: int64
        - name: 2014-49
          dtype: 'null'
        - name: 2014-52
          dtype: int64
        - name: 2015-06
          dtype: int64
        - name: 2015-11
          dtype: int64
        - name: 2015-14
          dtype: int64
        - name: 2015-18
          dtype: int64
        - name: 2015-22
          dtype: int64
        - name: 2015-27
          dtype: int64
        - name: 2015-32
          dtype: int64
        - name: 2015-35
          dtype: int64
        - name: 2015-40
          dtype: int64
        - name: 2016-07
          dtype: int64
        - name: 2016-18
          dtype: int64
        - name: 2016-22
          dtype: int64
        - name: 2016-26
          dtype: int64
        - name: 2016-30
          dtype: int64
        - name: 2016-40
          dtype: int64
        - name: 2016-44
          dtype: int64
        - name: 2016-50
          dtype: int64
        - name: 2017-04
          dtype: int64
        - name: 2017-09
          dtype: int64
        - name: 2017-13
          dtype: int64
        - name: 2017-17
          dtype: int64
        - name: 2017-22
          dtype: int64
        - name: 2017-26
          dtype: int64
        - name: 2017-30
          dtype: int64
        - name: 2017-34
          dtype: int64
        - name: 2017-39
          dtype: int64
        - name: 2017-43
          dtype: int64
        - name: 2017-47
          dtype: int64
        - name: 2017-51
          dtype: int64
        - name: 2018-05
          dtype: int64
        - name: 2018-09
          dtype: int64
        - name: 2018-13
          dtype: int64
        - name: 2018-17
          dtype: int64
        - name: 2018-22
          dtype: int64
        - name: 2018-26
          dtype: int64
        - name: 2018-30
          dtype: int64
        - name: 2018-34
          dtype: int64
        - name: 2018-39
          dtype: int64
        - name: 2018-43
          dtype: int64
        - name: 2018-47
          dtype: int64
        - name: 2018-51
          dtype: int64
        - name: 2019-04
          dtype: int64
        - name: 2019-09
          dtype: int64
        - name: 2019-13
          dtype: int64
        - name: 2019-18
          dtype: int64
        - name: 2019-22
          dtype: int64
        - name: 2019-26
          dtype: int64
        - name: 2019-30
          dtype: int64
        - name: 2019-35
          dtype: int64
        - name: 2019-39
          dtype: int64
        - name: 2019-43
          dtype: int64
        - name: 2019-47
          dtype: int64
        - name: 2019-51
          dtype: int64
        - name: 2020-05
          dtype: int64
        - name: 2020-10
          dtype: int64
        - name: 2020-16
          dtype: int64
        - name: 2020-24
          dtype: int64
        - name: 2020-29
          dtype: int64
        - name: 2020-34
          dtype: int64
        - name: 2020-40
          dtype: int64
        - name: 2020-45
          dtype: int64
        - name: 2020-50
          dtype: int64
        - name: 2021-04
          dtype: int64
        - name: 2021-10
          dtype: int64
        - name: 2021-17
          dtype: int64
        - name: 2021-21
          dtype: int64
        - name: 2021-25
          dtype: int64
        - name: 2021-31
          dtype: int64
        - name: 2021-39
          dtype: int64
        - name: 2021-43
          dtype: int64
        - name: 2021-49
          dtype: int64
        - name: 2022-05
          dtype: int64
        - name: 2022-21
          dtype: int64
        - name: 2022-27
          dtype: int64
        - name: 2022-33
          dtype: int64
        - name: 2022-40
          dtype: int64
        - name: 2022-49
          dtype: int64
        - name: 2023-06
          dtype: int64
        - name: 2023-14
          dtype: int64
        - name: 2023-23
          dtype: int64
        - name: 2023-40
          dtype: int64
        - name: 2023-50
          dtype: int64
        - name: 2024-10
          dtype: int64
        - name: 2024-18
          dtype: int64
        - name: 2024-22
          dtype: int64
        - name: 2024-26
          dtype: int64
        - name: 2024-30
          dtype: int64
        - name: curated_sources
          dtype: int64
        - name: unknown
          dtype: int64
      - name: dup_doc_count
        dtype: int64
      - name: dup_dump_count
        dtype: int64
    - name: id
      dtype: int64
    - name: lang
      dtype: string
    - name: lang_score
      dtype: float64
    - name: language
      dtype: string
    - name: openaccessinfo
      struct:
      - name: externalids
        struct:
        - name: ACL
          dtype: string
        - name: ArXiv
          dtype: string
        - name: DOI
          dtype: string
        - name: MAG
          dtype: string
        - name: PubMedCentral
          dtype: string
      - name: license
        dtype: string
      - name: status
        dtype: string
      - name: url
        dtype: string
    - name: pmid
      dtype: int64
    - name: quality_signals
      struct:
      - name: fraction_of_characters_in_duplicate_lines
        dtype: float64
      - name: fraction_of_characters_in_duplicate_ngrams
        sequence:
          sequence: float64
      - name: fraction_of_characters_in_duplicate_paragraphs
        dtype: float64
      - name: fraction_of_characters_in_most_common_ngram
        sequence:
          sequence: float64
      - name: fraction_of_duplicate_lines
        dtype: float64
      - name: fraction_of_duplicate_paragraphs
        dtype: float64
      - name: fraction_of_lines_ending_with_ellipsis
        dtype: float64
      - name: fraction_of_lines_starting_with_bullet_point
        dtype: float64
      - name: fraction_of_lines_with_toxic_words
        dtype: float64
      - name: fraction_of_words_corrected_in_lines
        dtype: float64
      - name: fraction_of_words_with_alpha_character
        dtype: float64
      - name: has_curly_bracket
        dtype: bool
      - name: has_lorem_ipsum
        dtype: bool
      - name: mean_word_length
        dtype: float64
      - name: num_of_lines_with_toxic_words
        dtype: int64
      - name: num_of_paragraphs
        dtype: int64
      - name: num_of_sentences
        dtype: int64
      - name: num_of_stop_words
        dtype: int64
      - name: num_of_toxic_words
        dtype: int64
      - name: orig_text_has_dup_lines
        dtype: bool
      - name: symbol_to_word_ratio
        dtype: float64
      - name: url_score
        dtype: float64
      - name: word_count
        dtype: int64
    - name: timestamp
      dtype: timestamp[us]
    - name: title
      dtype: string
    - name: url
      dtype: string
  - name: subset
    dtype: string
  splits:
  - name: train
    num_bytes: 5427934946.0
    num_examples: 1000000
  download_size: 2765337588
  dataset_size: 5427934946.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# BEE-spoke-data/TxT360-1M-sample

One million row sample from [LLM360/TxT360](https://huggingface.co/datasets/LLM360/TxT360):

- min length 256 GPT-4 tokens
- max length 8192 GPT-4 tokens
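
The length bounds stated in the card can be sketched as a simple row filter. This is only an illustration of the idea, not the script used to build the sample: the card does not say how tokens were counted beyond "GPT-4 tokens", whether the bounds are inclusive (assumed here), or which tokenizer build was used. The names `within_length_bounds` and `whitespace_count` are hypothetical, and the whitespace counter is a stand-in so the example runs without extra dependencies.

```python
from typing import Callable, Iterable, Iterator

MIN_TOKENS = 256   # lower bound stated in the card
MAX_TOKENS = 8192  # upper bound stated in the card


def within_length_bounds(
    rows: Iterable[dict],
    count_tokens: Callable[[str], int],
    min_tokens: int = MIN_TOKENS,
    max_tokens: int = MAX_TOKENS,
) -> Iterator[dict]:
    """Yield rows whose `text` token count lies in [min_tokens, max_tokens].

    Inclusive bounds are an assumption; the card does not specify them.
    """
    for row in rows:
        n = count_tokens(row["text"])
        if min_tokens <= n <= max_tokens:
            yield row


def whitespace_count(text: str) -> int:
    """Stand-in counter (assumption): whitespace split, NOT the GPT-4 tokenizer."""
    return len(text.split())


if __name__ == "__main__":
    # Made-up rows for illustration: too short, in range, too long.
    rows = [{"text": "a " * 10}, {"text": "a " * 300}, {"text": "a " * 9000}]
    kept = list(within_length_bounds(rows, whitespace_count))
    print(len(kept))
```

To approximate GPT-4 token counts, one option is to pass a counter built on the `tiktoken` package, for example `lambda t: len(tiktoken.get_encoding("cl100k_base").encode(t))`, assuming that package is installed and that `cl100k_base` matches what the dataset author used.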

Summary of the card metadata:

- Language: English
- License: odc-by
- Size category: 100K < n < 1M
- Task categories: text generation, feature extraction
- Split: train, 5427934946.0 bytes, 1000000 examples
- Download size: 2765337588 bytes
- Dataset size: 5427934946.0 bytes
- Config: `default`, with the train split read from files matching `data/train-*`

Fields:

1. `text`: string
2. `meta`: struct with the following subfields:
   - `cc-path`: string
   - `corpusid`: 64-bit integer
   - `dup_signals`: struct with the following subfields:
     - `dup_details`: struct whose subfields are named in `YYYY-NN` form (`2013-20` through `2024-30`), plus `curated_sources` and `unknown`. All are 64-bit integers except `2014-49`, whose dtype is null.
     - `dup_doc_count`: 64-bit integer
     - `dup_dump_count`: 64-bit integer
   - `id`: 64-bit integer
   - `lang`: string
   - `lang_score`: float
   - `language`: string
   - `openaccessinfo`: struct with the following subfields:
     - `externalids`: struct with five string subfields: `ACL`, `ArXiv`, `DOI`, `MAG`, `PubMedCentral`
     - `license`: string
     - `status`: string
     - `url`: string
   - `pmid`: 64-bit integer
   - `quality_signals`: struct with the following subfields:
     - `fraction_of_characters_in_duplicate_lines`: float, fraction of characters in duplicate lines
     - `fraction_of_characters_in_duplicate_ngrams`: two-dimensional float array, fraction of characters in duplicate n-grams
     - `fraction_of_characters_in_duplicate_paragraphs`: float, fraction of characters in duplicate paragraphs
     - `fraction_of_characters_in_most_common_ngram`: two-dimensional float array, fraction of characters in the most common n-gram
     - `fraction_of_duplicate_lines`: float, fraction of duplicate lines
     - `fraction_of_duplicate_paragraphs`: float, fraction of duplicate paragraphs
     - `fraction_of_lines_ending_with_ellipsis`: float, fraction of lines ending with an ellipsis
     - `fraction_of_lines_starting_with_bullet_point`: float, fraction of lines starting with a bullet point
     - `fraction_of_lines_with_toxic_words`: float, fraction of lines containing toxic words
     - `fraction_of_words_corrected_in_lines`: float, fraction of words corrected within lines
     - `fraction_of_words_with_alpha_character`: float, fraction of words containing an alphabetic character
     - `has_curly_bracket`: boolean, text contains a curly bracket
     - `has_lorem_ipsum`: boolean, text contains Lorem Ipsum placeholder text
     - `mean_word_length`: float, mean word length
     - `num_of_lines_with_toxic_words`: 64-bit integer, number of lines containing toxic words
     - `num_of_paragraphs`: 64-bit integer, number of paragraphs
     - `num_of_sentences`: 64-bit integer, number of sentences
     - `num_of_stop_words`: 64-bit integer, number of stop words
     - `num_of_toxic_words`: 64-bit integer, number of toxic words
     - `orig_text_has_dup_lines`: boolean, original text contains duplicate lines
     - `symbol_to_word_ratio`: float, ratio of symbols to words
     - `url_score`: float, URL score
     - `word_count`: 64-bit integer, total word count
   - `timestamp`: timestamp with microsecond precision
   - `title`: string
   - `url`: string
3. `subset`: string

The dataset is a one-million-row sample of [LLM360/TxT360](https://huggingface.co/datasets/LLM360/TxT360), with each row between 256 and 8192 GPT-4 tokens long.
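
Because `meta` is deeply nested and some subfields can be null (the schema even types `2014-49` as null), code that reads it may want to guard each level. The following is a minimal sketch under that assumption. The helper names `dup_detail_total` and `external_id` are hypothetical, the example record is made up to match the schema shape, and nothing here claims that the summed counts equal `dup_doc_count`.

```python
from typing import Any, Mapping, Optional


def dup_detail_total(meta: Mapping[str, Any]) -> int:
    """Sum the integer counts under meta.dup_signals.dup_details, skipping nulls."""
    dup_signals = meta.get("dup_signals") or {}
    details = dup_signals.get("dup_details") or {}
    return sum(v for v in details.values() if v is not None)


def external_id(meta: Mapping[str, Any], key: str) -> Optional[str]:
    """Return meta.openaccessinfo.externalids[key], or None if any level is missing."""
    info = meta.get("openaccessinfo") or {}
    ids = info.get("externalids") or {}
    return ids.get(key)


# Hypothetical record shaped like the schema; the values are invented for illustration.
example_meta = {
    "dup_signals": {
        "dup_details": {"2013-20": 2, "2014-49": None, "unknown": 1},
    },
    "openaccessinfo": None,
}

if __name__ == "__main__":
    print(dup_detail_total(example_meta))
    print(external_id(example_meta, "DOI"))
```

Treating a missing or null struct as an empty mapping keeps the helpers usable across rows from different subsets, which may not populate the same parts of `meta`.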

Use cases: