Shitao/bge-m3-data
Source: Hugging Face. Updated 2024-04-26; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Shitao/bge-m3-data
# Dataset Summary
This repository contains all the fine-tuning data for the [bge-m3](https://huggingface.co/BAAI/bge-m3) model, including:
| Dataset | Language |
| --------------- | :----------: |
| MS MARCO | English |
| NQ | English |
| HotpotQA | English |
| TriviaQA | English |
| SQuAD | English |
| COLIEE | English |
| PubMedQA | English |
| NLI from SimCSE | English |
| DuReader | Chinese |
| mMARCO-zh | Chinese |
| T2Ranking | Chinese |
| Law-GPT | Chinese |
| cMedQAv2 | Chinese |
| NLI-zh | Chinese |
| LeCaRDv2 | Chinese |
| Mr.TyDi | 11 languages |
| MIRACL | 16 languages |
| MLDR | 13 languages |
Note: The MLDR dataset here is the processed `train` split of the [MLDR dataset](https://huggingface.co/datasets/Shitao/MLDR).
For more details, please refer to our [paper](https://arxiv.org/pdf/2402.03216.pdf).
# Dataset Structure
Each dataset is split into multiple files according to the tokenized length of the text, measured with the bge-m3 tokenizer (i.e. the tokenizer of [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large)). For example, the MS MARCO dataset is split into 8 files: `msmarco_len-0-500.jsonl`, `msmarco_len-500-1000.jsonl`, ..., `msmarco_len-6000-7000.jsonl`, `msmarco_len-7000-inf.jsonl`. All files are in `jsonl` format: each line is a JSON object with the following fields:
```python
{"query": str, "pos": List[str], "neg": List[str]}
```
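As a minimal sketch of how these files might be consumed, the snippet below parses the length bucket out of a file name and validates each line against the schema above. The helper names `parse_bucket` and `iter_records` are hypothetical. The file-name pattern is inferred only from the MS MARCO examples quoted above, so other datasets may need a different pattern. Whether the bucket bounds are inclusive or exclusive is not stated here, so the code only reports them.

```python
import json
import re
from typing import Iterable, Iterator, Optional, Tuple

# Pattern inferred from names such as "msmarco_len-0-500.jsonl" (an assumption).
_BUCKET_RE = re.compile(r"^(?P<name>.+)_len-(?P<lo>\d+)-(?P<hi>\d+|inf)\.jsonl$")


def parse_bucket(filename: str) -> Tuple[str, int, Optional[int]]:
    """Return (dataset prefix, lower bound, upper bound); upper is None for 'inf'."""
    m = _BUCKET_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    hi = None if m["hi"] == "inf" else int(m["hi"])
    return m["name"], int(m["lo"]), hi


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty JSONL line, checking the documented fields."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        if not isinstance(rec, dict) or not isinstance(rec.get("query"), str):
            raise ValueError(f"line {lineno}: 'query' must be a string")
        for key in ("pos", "neg"):
            value = rec.get(key)
            if not (isinstance(value, list) and all(isinstance(x, str) for x in value)):
                raise ValueError(f"line {lineno}: {key!r} must be a list of strings")
        yield rec


if __name__ == "__main__":
    print(parse_bucket("msmarco_len-0-500.jsonl"))
    # In-memory sample line with made-up text; a real file would be opened
    # with open(path, encoding="utf-8") and passed to iter_records directly.
    sample = ['{"query": "q", "pos": ["p1"], "neg": ["n1", "n2"]}']
    for record in iter_records(sample):
        print(record["query"], len(record["pos"]), len(record["neg"]))
```

Because `iter_records` accepts any iterable of strings, an open file handle can be streamed line by line without loading a whole bucket into memory.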
# Citation Information
```
@misc{bge-m3,
title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
year={2024},
eprint={2402.03216},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Provider: Shitao