
marketeam/raw_redpajamas

Hugging Face | Updated 2024-04-18 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/marketeam/raw_redpajamas
Resource description:
---
language:
- en
pretty_name: Raw Keyword Filtered RedPajamas Dataset
task_categories:
- text-generation
tags:
- marketing
size_categories:
- 1B<n<10B
---

### Getting Started

The dataset is built from the RedPajama dataset by filtering with a list of marketing keywords, which can be found [here](https://github.com/marktrix/redpajama-data-filter-script/blob/main/marketing_words.txt).

The full scripts to recreate the raw dataset before sharding can be found [here](https://github.com/marktrix/redpajama-data-filter-script).

The dataset includes:

- ~4.8B tokens from raw contents.

#### Downloading the dataset

To start exploring the dataset, you can run the following script:

```python
import datasets

ds = datasets.load_dataset("marketeam/raw_redpajamas", split="train")

for sample in ds:
    print(sample)
    break  # stop after printing the first sample
```

Alternatively, you can use streaming:

```python
import datasets

ds = datasets.load_dataset("marketeam/raw_redpajamas", split="train", streaming=True)

for sample in ds:
    print(sample)
    break  # stop after printing the first sample
```

#### Languages

English

#### Data Structure

```
├── data
    ├── data-0000.json
    ├── ...
    ├── data-0003.json
```

#### Document structure

```json
{
  "url": "...",
  "date_download": "2023-03-20T08:44:39Z",
  "digest": "sha1:EJNCO5XXIZLG2E3BULUGWCLLJUP2AV2Q",
  "length": 6851,
  "nlines": 49,
  "source_domain": "fenndesign.com",
  "title": "...",
  "raw_content": "...",
  "cc_segment": "...",
  "original_nlines": 101,
  "original_length": 8192,
  "line_ids": [6, 9, 10, 11],
  "language": "en",
  "language_score": 0.9,
  "perplexity": 303.6,
  "bucket": "head",
  "id": "2023-14/0000/en_head.json.gz/25",
  "id_int": 4918268498184253468,
  "metadata": {
    "cc_segment": "...",
    "cc_net_source": "2023-14/0000/en_head.json.gz",
    "url": "...",
    "source_domain": "fenndesign.com",
    "language": "en",
    "snapshot_id": "2023-14"
  },
  "quality_signals": {
    "ccnet_length": [[0, 6851, 6851.0]],
    "ccnet_original_length": [[0, 6851, 8192.0]],
    "ccnet_nlines": [[0, 6851, 49.0]],
    "ccnet_original_nlines": [[0, 6851, 101.0]],
    "ccnet_language_score": [[0, 6851, 0.9]]
  },
  "is_duplicate": false
}
```

Document quality annotations can be found [here](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2#quality-annotations).
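
To make the record layout more concrete, here is a minimal sketch of reading the `quality_signals` field and applying a simple filter. It assumes each signal entry is a `[start, end, value]` span, as the example record suggests. The helper names `document_level_signals` and `keep`, and the thresholds used, are hypothetical and do not come from the dataset card. The record is a trimmed copy of the example document, so the snippet runs without downloading anything.

```python
import json

# Trimmed copy of the example record; fields shown as "..." are omitted.
record = {
    "length": 6851,
    "nlines": 49,
    "source_domain": "fenndesign.com",
    "language": "en",
    "language_score": 0.9,
    "perplexity": 303.6,
    "bucket": "head",
    "quality_signals": {
        "ccnet_length": [[0, 6851, 6851.0]],
        "ccnet_nlines": [[0, 6851, 49.0]],
        "ccnet_language_score": [[0, 6851, 0.9]],
    },
    "is_duplicate": False,
}


def document_level_signals(rec):
    """Return {signal_name: value} for signals that carry exactly one span."""
    signals = rec["quality_signals"]
    # Assumption: some exports may store this field as a JSON string.
    if isinstance(signals, str):
        signals = json.loads(signals)
    flat = {}
    for name, spans in signals.items():
        if len(spans) == 1:
            _start, _end, value = spans[0]  # assumed [start, end, value] layout
            flat[name] = value
    return flat


def keep(rec, min_language_score=0.8, max_perplexity=500.0):
    """Hypothetical filter; the thresholds are illustrative only."""
    return (
        not rec["is_duplicate"]
        and rec["language_score"] >= min_language_score
        and rec["perplexity"] <= max_perplexity
    )


flat = document_level_signals(record)
print(flat)
print(keep(record))
```

A predicate like `keep` could also be passed to the `filter` method of the dataset loaded above, though suitable thresholds would need to be chosen for the task at hand.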
Provider:
marketeam