rulins/MassiveDS-1.4T-raw-data
Source: Hugging Face | Updated: 2024-08-29 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/rulins/MassiveDS-1.4T-raw-data
Description:
MassiveDS is a dataset release comprising raw passages, embeddings, and indexes. It is offered in three versions: MassiveDS-1.4T includes the embeddings and passages of the 1.4T-token datastore; MassiveDS-1.4T-raw-text contains the raw text of the 1.4T-token datastore; MassiveDS-140B includes the index, embeddings, passages, and raw text of a subsampled 140B-token datastore. The dataset's file structure has four parts: raw_data (the raw data, stored as JSONL files), passages (the raw text chunked into passages of at most 256 words each), embeddings (passage embeddings encoded with Contriever-MSMARCO), and index (a flat index built over the embeddings). Downloading with Git LFS is recommended, and example scripts are provided.
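The relationship between raw_data and passages described above can be illustrated with a minimal sketch: read JSONL records and split each record's text into passages of at most 256 words. This is not the dataset authors' actual chunking code. The field name "text" and the helper names chunk_words and passages_from_jsonl are assumptions made for illustration, and the real pipeline may tokenize or split differently.

```python
import io
import json

MAX_WORDS = 256  # passage length cap stated in the description


def chunk_words(text, max_words=MAX_WORDS):
    """Split text on whitespace into passages of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def passages_from_jsonl(lines, text_field="text"):
    """Yield (record_index, passage) pairs from an iterable of JSONL lines.

    text_field is a hypothetical key; check the actual raw_data schema.
    """
    for idx, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for passage in chunk_words(record.get(text_field, "")):
            yield idx, passage


if __name__ == "__main__":
    # A synthetic 600-word record stands in for one line of a raw_data file.
    sample = io.StringIO(json.dumps({"text": "word " * 600}) + "\n")
    for idx, passage in passages_from_jsonl(sample):
        print(idx, len(passage.split()))
```

To run this against a real file, pass an open file handle for one of the raw_data JSONL files in place of the in-memory sample, after confirming which key holds the text.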
Provider:
rulins



