Shitao/bge-m3-data

Name: Shitao/bge-m3-data
Creator: Shitao
Published: 2024-04-26 06:13:26
License: No description available

Source: Hugging Face. Updated 2024-04-26. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Shitao/bge-m3-data

Resource description:

# Dataset Summary

This repository contains all the fine-tuning data for the [bge-m3](https://huggingface.co/BAAI/bge-m3) model, including:

| Dataset         |   Language   |
| --------------- | :----------: |
| MS MARCO        |   English    |
| NQ              |   English    |
| HotpotQA        |   English    |
| TriviaQA        |   English    |
| SQuAD           |   English    |
| COLIEE          |   English    |
| PubMedQA        |   English    |
| NLI from SimCSE |   English    |
| DuReader        |   Chinese    |
| mMARCO-zh       |   Chinese    |
| T2Ranking       |   Chinese    |
| Law-GPT         |   Chinese    |
| cMedQAv2        |   Chinese    |
| NLI-zh          |   Chinese    |
| LeCaRDv2        |   Chinese    |
| Mr.TyDi         | 11 languages |
| MIRACL          | 16 languages |
| MLDR            | 13 languages |

Note: The MLDR dataset here is the processed `train` set of the [MLDR dataset](https://huggingface.co/datasets/Shitao/MLDR).

For more details, please refer to our [paper](https://arxiv.org/pdf/2402.03216.pdf).

# Dataset Structure

Each dataset has been split into multiple files according to the tokenized length of the text, measured with the bge-m3 tokenizer (i.e. the tokenizer of [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large)). For example, the MS MARCO dataset has been split into 8 files: `msmarco_len-0-500.jsonl`, `msmarco_len-500-1000.jsonl`, ..., `msmarco_len-6000-7000.jsonl`, `msmarco_len-7000-inf.jsonl`.

All the files are in the `jsonl` format. Each line of a file is a JSON object with the following structure:

```python
{"query": str, "pos": List[str], "neg": List[str]}
```

# Citation Information

```
@misc{bge-m3,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year={2024},
  eprint={2402.03216},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
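The file layout described under Dataset Structure can be consumed with a few lines of standard-library Python. The following is a minimal sketch, not part of the original repository: the helper names `parse_length_bucket` and `iter_examples` are hypothetical, and the sample record in the demo is made up for illustration. The source does not say whether the length bounds in the file names are inclusive or exclusive, so the sketch only extracts them and makes no assumption about that.

```python
import json
import re
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Matches the length suffix in names such as "msmarco_len-0-500.jsonl"
# or "msmarco_len-7000-inf.jsonl" (pattern inferred from the examples above).
_BUCKET_RE = re.compile(r"^(?P<name>.+)_len-(?P<lo>\d+)-(?P<hi>\d+|inf)\.jsonl$")


def parse_length_bucket(filename: str) -> Tuple[str, int, Optional[int]]:
    """Return (dataset prefix, lower bound, upper bound); None stands for "inf"."""
    match = _BUCKET_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    hi = match.group("hi")
    upper = None if hi == "inf" else int(hi)
    return match.group("name"), int(match.group("lo")), upper


def iter_examples(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one record per non-empty JSONL line, checking the documented keys."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = {"query", "pos", "neg"} - record.keys()
        if missing:
            raise KeyError(f"record is missing keys: {sorted(missing)}")
        yield record


if __name__ == "__main__":
    # Made-up sample line in the documented {"query", "pos", "neg"} shape.
    sample_lines = ['{"query": "q", "pos": ["p"], "neg": ["n1", "n2"]}']
    for example in iter_examples(sample_lines):
        print(example["query"], len(example["pos"]), len(example["neg"]))
    print(parse_length_bucket("msmarco_len-7000-inf.jsonl"))
```

To read a downloaded file, pass an open file handle instead of the in-memory list, for example `iter_examples(open("msmarco_len-0-500.jsonl", encoding="utf-8"))`. Iterating line by line avoids loading a whole file into memory.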

Provider:

Shitao
