marketeam/raw_redpajamas
Source: Hugging Face. Updated 2024-04-18; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/marketeam/raw_redpajamas
Resource description:
---
language:
- en
pretty_name: Raw Keyword Filtered RedPajamas Dataset
task_categories:
- text-generation
tags:
- marketing
size_categories:
- 1B<n<10B
---
### Getting Started
The dataset is built from the RedPajama dataset by filtering documents against a list of marketing keywords, which can be found [here](https://github.com/marktrix/redpajama-data-filter-script/blob/main/marketing_words.txt).
The full scripts to recreate the raw dataset before sharding can be found [here](https://github.com/marktrix/redpajama-data-filter-script).
The dataset includes:
- ~4.8B tokens from raw contents.
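
The keyword filtering described above can be pictured roughly as follows. This is only a minimal sketch, not the code from the linked repository: the helper names (`load_keywords`, `build_matcher`, `keep_document`), the whole-word matching rule, and the sample keywords are all assumptions made for illustration.

```python
import re

def load_keywords(lines):
    """Normalize an iterable of keyword lines: strip, lowercase, drop blanks."""
    return [line.strip().lower() for line in lines if line.strip()]

def build_matcher(keywords):
    """Compile one case-insensitive regex matching any keyword as a whole word or phrase."""
    # Longest first so multi-word phrases are tried before shorter alternatives.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def keep_document(doc, matcher):
    """Return True if the document's raw_content mentions at least one keyword."""
    return matcher.search(doc.get("raw_content", "")) is not None

# Hypothetical keywords for illustration; the real list lives in marketing_words.txt.
keywords = load_keywords(["marketing\n", "brand awareness\n", "\n", "SEO\n"])
matcher = build_matcher(keywords)

docs = [
    {"raw_content": "Our SEO audit improved brand awareness."},
    {"raw_content": "A recipe for sourdough bread."},
]
kept = [d for d in docs if keep_document(d, matcher)]
print(len(kept))  # only the first document mentions a keyword
```

Whole-word matching is one possible design choice; a plain substring test would keep more documents at the cost of more false positives.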
#### Downloading the dataset
To start exploring the dataset, run the following script:
```python
import datasets

# Without streaming, the whole dataset is downloaded before iteration starts.
ds = datasets.load_dataset("marketeam/raw_redpajamas", split="train")
for sample in ds:
    print(sample)
    break  # to print the first sample
```
Alternatively, you can use streaming:
```python
import datasets
ds = datasets.load_dataset("marketeam/raw_redpajamas", split="train", streaming=True)
for sample in ds:
    print(sample)
    break  # to print the first sample
```
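
Since a streamed dataset behaves like an iterable of dictionaries, samples can also be screened on the fly by their metadata fields. The sketch below assumes each sample exposes top-level `perplexity` and `language_score` values; the function name `select_samples`, the thresholds, and the stand-in records are hypothetical and chosen only for illustration.

```python
from itertools import islice

def select_samples(samples, max_perplexity=500.0, min_language_score=0.8, limit=3):
    """Return up to `limit` samples passing both thresholds, reading the input lazily."""
    passing = (
        s for s in samples
        if s["perplexity"] <= max_perplexity
        and s["language_score"] >= min_language_score
    )
    # islice stops pulling from the stream once `limit` samples are collected.
    return list(islice(passing, limit))

# Stand-in records; with the real data, pass the streaming `ds` instead.
fake_stream = [
    {"source_domain": "a.example", "perplexity": 303.6, "language_score": 0.9},
    {"source_domain": "b.example", "perplexity": 1200.0, "language_score": 0.95},
    {"source_domain": "c.example", "perplexity": 250.0, "language_score": 0.5},
]
selected = select_samples(fake_stream)
print([s["source_domain"] for s in selected])
```

Because the generator is consumed through `islice`, iteration stops as soon as enough samples are found, which matters when the input is a remote stream.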
#### Languages
English
#### Data Structure
```
├── data
│   ├── data-0000.json
│   ├── ...
│   └── data-0003.json
```
#### Document structure
```json
{
  "url": "...",
  "date_download": "2023-03-20T08:44:39Z",
  "digest": "sha1:EJNCO5XXIZLG2E3BULUGWCLLJUP2AV2Q",
  "length": 6851,
  "nlines": 49,
  "source_domain": "fenndesign.com",
  "title": "...",
  "raw_content": "...",
  "cc_segment": "...",
  "original_nlines": 101,
  "original_length": 8192,
  "line_ids": [6, 9, 10, 11],
  "language": "en",
  "language_score": 0.9,
  "perplexity": 303.6,
  "bucket": "head",
  "id": "2023-14/0000/en_head.json.gz/25",
  "id_int": 4918268498184253468,
  "metadata": {
    "cc_segment": "...",
    "cc_net_source": "2023-14/0000/en_head.json.gz",
    "url": "...",
    "source_domain": "fenndesign.com",
    "language": "en",
    "snapshot_id": "2023-14"
  },
  "quality_signals": {
    "ccnet_length": [[0, 6851, 6851.0]],
    "ccnet_original_length": [[0, 6851, 8192.0]],
    "ccnet_nlines": [[0, 6851, 49.0]],
    "ccnet_original_nlines": [[0, 6851, 101.0]],
    "ccnet_language_score": [[0, 6851, 0.9]]
  },
  "is_duplicate": false
}
```
Document quality annotations are described [here](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2#quality-annotations).
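
To work with the `quality_signals` field in code, it can help to collapse each signal into a single number. The sketch below assumes every entry is a list of `[start, end, value]` spans, as in the sample document above, and keeps only signals that have exactly one span. The helper name `document_level_signals` is hypothetical, and the JSON-string branch is a precaution based on the assumption that some loaders may return this field as a serialized string.

```python
import json

def document_level_signals(doc):
    """Map each quality signal that has exactly one span to that span's value."""
    signals = doc["quality_signals"]
    if isinstance(signals, str):  # assumption: the field may arrive as a JSON string
        signals = json.loads(signals)
    flat = {}
    for name, spans in signals.items():
        # Each span is assumed to be [start, end, value].
        if len(spans) == 1:
            _start, _end, value = spans[0]
            flat[name] = value
    return flat

# Values copied from the sample document shown above.
doc = {
    "quality_signals": {
        "ccnet_length": [[0, 6851, 6851.0]],
        "ccnet_nlines": [[0, 6851, 49.0]],
        "ccnet_language_score": [[0, 6851, 0.9]],
    }
}
print(document_level_signals(doc))
```

Signals with several spans are skipped here rather than averaged, since how to aggregate them depends on the signal.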
Provider:
marketeam



