agentlans/dolma-1m
Dataset Overview
Dataset Name
Dolma-1M
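Dataset Description
Dolma-1M is an unofficial subset of Dolma v1.7 containing 1,000,000 training records and approximately 100,000 test records. The records were drawn from 10 randomly chosen files, and each text record is between 500 and 5,000 characters long.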
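As an illustration of the length filter described above, here is a minimal sketch. The function name `keep_record` is hypothetical, and treating both bounds as inclusive is an assumption; the dataset card does not say how the boundaries were handled.

```python
def keep_record(record: dict, min_chars: int = 500, max_chars: int = 5000) -> bool:
    """Return True if the record's text length lies within the given bounds.

    Assumption: both bounds are inclusive and length is counted in
    Python characters (Unicode code points), not bytes.
    """
    text = record.get("text", "")
    return min_chars <= len(text) <= max_chars
```

A filter like this would typically be applied to each parsed JSONL row before sampling.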
Advantages over the original Dolma dataset:
- Smaller and easier to manage
- Filtered by text length
- No remote Python script execution required
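Dataset Splits
The training and test sets were created by random sampling without replacement. The sampling was implemented in Python. The dataset is stored as gzipped JSONL, where each line corresponds to one row of the original Dolma dataset.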
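The actual sampling script is not part of this card. The sketch below shows one way such a split could be done with the Python standard library. The function names, the fixed seed, and the choice to draw both splits from a single sample (so that they never overlap) are all assumptions made for illustration.

```python
import gzip
import json
import random


def split_without_replacement(records, n_train, n_test, seed=0):
    """Draw disjoint train and test sets from a list of records.

    random.Random.sample picks without replacement, so drawing
    n_train + n_test items once and slicing guarantees no overlap.
    """
    rng = random.Random(seed)
    picked = rng.sample(records, n_train + n_test)
    return picked[:n_train], picked[n_train:]


def write_jsonl_gz(path, records):
    """Write records as gzipped JSONL, one JSON object per line."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Holding all candidate records in memory is acceptable for a sketch. For a corpus the size of Dolma, a streaming approach such as reservoir sampling may be more practical.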
Dataset Example
{
  "id": "https://nightforvets.com/rex-lawrence-poutre/",
  "text": "Rex was born on the family farm in Concordia, Kansas on February 6, 1920 to Arthur Donas Poutre and Ronalda Nadeau Poutre (Beland). Rex’s older brother, Leo, was born in 1917, and his younger brother, Bob, in 1925. After graduation from high school, Rex moved to Southern California, where his brother was stationed at March Field. [...] passion for riding motorcycles. He participated yearly, for 25 years, in the American Motorcycle Association’s two big tour-bike races, The Iron Butt and the Three Flags. He was a demon for speed. He will be missed.",
  "added": "2023-04-10T09:48:38.760096+00:00",
  "created": "2020-02-23T02:03:05Z",
  "source": "common-crawl"
}
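Records in this shape can be read with the Python standard library alone. The following is a minimal sketch. The helper names are hypothetical, and the field names (`id`, `text`, `added`, `created`, `source`) are taken from the example record above.

```python
import gzip
import json


def parse_jsonl_lines(lines):
    """Yield one dict per non-empty line of JSONL text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_jsonl_gz(path):
    """Stream records from a gzipped JSONL file without loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        yield from parse_jsonl_lines(f)
```

Iterating over `read_jsonl_gz(path)` yields one dictionary per row, so a field such as `record["source"]` can be inspected without decompressing the file to disk first.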
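Dataset Usage
The dataset can be used for research, experimentation, and development. Like the original Dolma dataset, it is released under the Open Data Commons Attribution License (ODC-By) v1.0.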



