
rulins/MassiveDS-1.4T

Source: Hugging Face. Updated: 2024-07-19. Indexed: 2024-06-29.
Download link:
https://hf-mirror.com/datasets/rulins/MassiveDS-1.4T
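
The dataset summary recommends Git LFS and mentions partial downloads. The following is a minimal sketch of one way to do that from Python, not the dataset's own download script: it clones the repository with `GIT_LFS_SKIP_SMUDGE=1` so that only LFS pointer files are fetched, then pulls the large files matching a pattern. The function names, the target directory, and the `passages/*` pattern are assumptions for illustration; the actual directory layout of the repository should be checked before use.

```python
import os
import subprocess

# Repository URL taken from the download link above.
REPO_URL = "https://hf-mirror.com/datasets/rulins/MassiveDS-1.4T"


def build_partial_download_commands(repo_url, include_pattern, target_dir):
    """Return the two git commands: a plain clone and a selective LFS pull."""
    clone = ["git", "clone", repo_url, target_dir]
    pull = ["git", "-C", target_dir, "lfs", "pull", "--include", include_pattern]
    return [clone, pull]


def run_partial_download(repo_url, include_pattern, target_dir):
    """Clone without LFS payloads, then download only the matching LFS files."""
    clone, pull = build_partial_download_commands(repo_url, include_pattern, target_dir)
    # GIT_LFS_SKIP_SMUDGE=1 makes the clone leave LFS pointer files in place
    # instead of downloading every large file.
    skip_env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
    subprocess.run(clone, check=True, env=skip_env)
    # Fetch the real contents only for paths matching include_pattern.
    subprocess.run(pull, check=True)


if __name__ == "__main__":
    # "passages/*" is a hypothetical pattern; adjust it to the real layout.
    for cmd in build_partial_download_commands(REPO_URL, "passages/*", "MassiveDS-1.4T"):
        print(" ".join(cmd))
```

Running the file as a script only prints the commands; calling `run_partial_download` executes them and requires `git` and `git-lfs` to be installed.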
Summary:
The MassiveDS dataset comprises raw passages, embeddings, and indexes, and comes in two versions: MassiveDS-1.4T, which contains 1.4 trillion tokens, and MassiveDS-140B, a subsampled version containing 140 billion tokens. The file structure has four parts: `raw_data` (the raw data in JSONL format), `passages` (the raw text chunked into passages of no more than 256 words each, with passage IDs), `embeddings` (passage embeddings encoded with Contriever-MSMARCO), and `index` (a flat index built from the embeddings). Git LFS is recommended for downloading the large files, partial downloads are supported, and an example download script is provided.
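
To make the `passages` and `index` stages concrete, here is a small illustrative sketch, not the dataset's actual pipeline. It splits text into passages of at most 256 whitespace-separated words and runs an exhaustive (flat) search over a list of vectors. The passage ID scheme, the function names, and the use of inner product as the similarity score are assumptions; the toy vectors stand in for real Contriever-MSMARCO embeddings.

```python
def chunk_words(doc_id, text, max_words=256):
    """Split text on whitespace into passages of at most max_words words."""
    words = text.split()
    passages = []
    for start in range(0, len(words), max_words):
        passages.append({
            # Hypothetical ID scheme: document ID plus chunk number.
            "id": f"{doc_id}_{start // max_words}",
            "text": " ".join(words[start:start + max_words]),
        })
    return passages


def flat_search(query, vectors, k=1):
    """Score every stored vector against the query and return the top-k
    (position, score) pairs. A flat index does no clustering or
    compression, so the search is exact but linear in the index size."""
    scores = [
        (i, sum(q * v for q, v in zip(query, vec)))  # inner product (assumed)
        for i, vec in enumerate(vectors)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


if __name__ == "__main__":
    passages = chunk_words("doc0", " ".join(["word"] * 600))
    print([len(p["text"].split()) for p in passages])

    toy_vectors = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
    print(flat_search([1.0, 0.0], toy_vectors, k=2))
```

A flat index trades speed for exactness: every query touches every embedding, which is simple and accurate but costly at the scale of the full dataset.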
Provider:
rulins