hanhainebula/bge-m3_miracl_2cr
Source: Hugging Face. Updated 2024-04-08; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/hanhainebula/bge-m3_miracl_2cr
## Introduction
This repository shows how to reproduce the `Dense`, `Sparse`, and `Dense+Sparse` evaluation results of the paper [BGE-M3](https://arxiv.org/pdf/2402.03216.pdf) on the [MIRACL](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00595/117438/MIRACL-A-Multilingual-Retrieval-Dataset-Covering) dev split.
## Requirements
```bash
# Install Java (Linux)
apt update
apt install openjdk-21-jdk
# Install Pyserini
pip install pyserini
# Install Faiss
## CPU version
conda install -c conda-forge faiss-cpu
## GPU version
conda install -c conda-forge faiss-gpu
```
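Before moving on, it can help to confirm that the command-line tools used in this README are on `PATH`. The following is a minimal sketch; the `missing_tools` helper and the tool list are assumptions for illustration, not part of Pyserini or this repository.

```bash
# Report which command-line tools are not available on PATH.
# The tool list below is an assumption based on the commands used in this README.
missing_tools() {
  for tool in "$@"; do
    # command -v exits non-zero when the tool cannot be found
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools java python git git-lfs conda)
if [ -z "$missing" ]; then
  echo "all tools found"
else
  echo "missing tools:"
  echo "$missing"
fi
```

This only checks that each executable exists; it does not verify versions or that the `pyserini` Python package imports cleanly.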
**Note:** Pyserini must be modified so that `pyserini/fusion` supports multiple alpha settings. I have submitted a pull request to the official repository that adds this feature; refer to this [PR](https://github.com/castorini/pyserini/pull/1858) for the required changes.
## 2CR
### Download and Unzip
```bash
# Download
## MIRACL topics and qrels
git clone https://huggingface.co/datasets/miracl/miracl
mkdir -p topics-and-qrels
mv miracl/*/*/* topics-and-qrels/
## Dense and Sparse Index
git lfs install
git clone https://huggingface.co/datasets/hanhainebula/bge-m3_miracl_2cr
cat bge-m3_miracl_2cr/dense/en.tar.gz.part_* > bge-m3_miracl_2cr/dense/en.tar.gz
cat bge-m3_miracl_2cr/dense/de.tar.gz.part_* > bge-m3_miracl_2cr/dense/de.tar.gz
# Unzip
languages=(ar bn en es fa fi fr hi id ja ko ru sw te th zh de yo)
## Dense
for lang in "${languages[@]}"; do
tar -zxvf bge-m3_miracl_2cr/dense/${lang}.tar.gz -C bge-m3_miracl_2cr/dense/
done
## Sparse
for lang in "${languages[@]}"; do
tar -zxvf bge-m3_miracl_2cr/sparse/${lang}.tar.gz -C bge-m3_miracl_2cr/sparse/
done
```
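The `cat ...part_* > ...` commands rely on the shell expanding the glob in sorted order, so concatenating the parts restores the original byte stream. The sketch below demonstrates this on a throwaway file, assuming the parts were produced by a byte-wise split such as `split -b`; the `reassemble` helper and all file names are hypothetical and nothing here touches the real dataset.

```bash
# Toy demonstration: split a file into parts, then reassemble it with cat.
# All names are made up for illustration.
reassemble() {
  # $1: path prefix shared by the parts, $2: output file
  cat "$1"* > "$2"
}

workdir=$(mktemp -d)
# Create a small payload to stand in for an archive
seq 1 5000 > "$workdir/payload.txt"
# Split into 4096-byte chunks named payload.txt.part_aa, payload.txt.part_ab, ...
split -b 4096 "$workdir/payload.txt" "$workdir/payload.txt.part_"
reassemble "$workdir/payload.txt.part_" "$workdir/restored.txt"

# cmp -s exits 0 only when the two files are byte-identical
if cmp -s "$workdir/payload.txt" "$workdir/restored.txt"; then
  reassembly_status=ok
else
  reassembly_status=mismatch
fi
echo "reassembly: $reassembly_status"
```

On the real archives, running `gzip -t` on a reassembled `.tar.gz` file is one way to check its integrity before extracting it.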
### Reproduction
#### Dense
```bash
# Available languages: ar bn en es fa fi fr hi id ja ko ru sw te th zh de yo
lang=zh
# Generate run
python -m pyserini.search.faiss \
--threads 16 --batch-size 512 \
--encoder-class auto \
--encoder BAAI/bge-m3 \
--pooling cls --l2-norm \
--topics topics-and-qrels/topics.miracl-v1.0-${lang}-dev.tsv \
--index bge-m3_miracl_2cr/dense/${lang} \
--output bge-m3_miracl_2cr/dense/runs/${lang}.txt \
--hits 1000
# Evaluate
## nDCG@10
python -m pyserini.eval.trec_eval \
-c -M 100 -m ndcg_cut.10 \
topics-and-qrels/qrels.miracl-v1.0-${lang}-dev.tsv \
bge-m3_miracl_2cr/dense/runs/${lang}.txt
## Recall@100
python -m pyserini.eval.trec_eval \
-c -m recall.100 \
topics-and-qrels/qrels.miracl-v1.0-${lang}-dev.tsv \
bge-m3_miracl_2cr/dense/runs/${lang}.txt
```
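When collecting scores for several languages, it can be convenient to pull a single number out of the evaluation output. The sketch below assumes the usual three-column `trec_eval` summary line (`<measure> all <value>`); the `extract_metric` helper is hypothetical and the sample values are synthetic, not real results.

```bash
# Pull one metric value out of trec_eval-style output read from stdin.
# Assumes summary lines of the form: <measure> all <value>
extract_metric() {
  # $1: measure name as printed by trec_eval (e.g. ndcg_cut_10)
  awk -v m="$1" '$1 == m && $2 == "all" { print $3 }'
}

# Synthetic output for illustration only; these are not real scores.
sample_output='Results:
ndcg_cut_10             all     0.5000
recall_100              all     0.7500'

printf '%s\n' "$sample_output" | extract_metric ndcg_cut_10
printf '%s\n' "$sample_output" | extract_metric recall_100
```

With a real run, the same helper could be placed at the end of a pipe, for example after one of the `python -m pyserini.eval.trec_eval` commands shown in the block before it.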
#### Sparse
```bash
# Available languages: ar bn en es fa fi fr hi id ja ko ru sw te th zh de yo
lang=zh
# Generate run
python -m pyserini.search.lucene \
--threads 16 --batch-size 128 \
--topics bge-m3_miracl_2cr/sparse/${lang}/query_embd.tsv \
--index bge-m3_miracl_2cr/sparse/${lang}/index \
--output bge-m3_miracl_2cr/sparse/runs/${lang}.txt \
--output-format trec \
--impact --hits 1000
# Evaluate
## nDCG@10
python -m pyserini.eval.trec_eval \
-c -M 100 -m ndcg_cut.10 \
topics-and-qrels/qrels.miracl-v1.0-${lang}-dev.tsv \
bge-m3_miracl_2cr/sparse/runs/${lang}.txt
## Recall@100
python -m pyserini.eval.trec_eval \
-c -m recall.100 \
topics-and-qrels/qrels.miracl-v1.0-${lang}-dev.tsv \
bge-m3_miracl_2cr/sparse/runs/${lang}.txt
```
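Because the sparse run is written with `--output-format trec`, each line should carry the six TREC run columns: `qid Q0 docid rank score tag`. A quick sanity check along those lines might look like the sketch below; the `check_trec_run` helper is hypothetical, and the query and document IDs are invented.

```bash
# Count lines of a run file that do not look like TREC run lines:
#   qid  Q0  docid  rank  score  tag
check_trec_run() {
  # $1: path to a run file; prints the number of malformed lines
  awk 'NF != 6 || $2 != "Q0" || $4 !~ /^[0-9]+$/ { bad++ } END { print bad + 0 }' "$1"
}

run_file=$(mktemp)
# Synthetic run lines for illustration only.
cat > "$run_file" <<'EOF'
q1 Q0 doc-a 1 12.5 demo
q1 Q0 doc-b 2 11.0 demo
q2 Q0 doc-c 1 9.75 demo
EOF

malformed=$(check_trec_run "$run_file")
echo "malformed lines: $malformed"
```

The check is deliberately shallow: it validates the column count, the `Q0` literal, and an integer rank, but not the score format or rank ordering.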
#### Dense+Sparse
**Note**: Before running this step, apply the changes from this [PR](https://github.com/castorini/pyserini/pull/1858) so that `pyserini/fusion` supports multiple alpha settings.
```bash
# Available languages: ar bn en es fa fi fr hi id ja ko ru sw te th zh de yo
lang=zh
# Generate dense run and sparse run
python -m pyserini.search.faiss \
--threads 16 --batch-size 512 \
--encoder-class auto \
--encoder BAAI/bge-m3 \
--pooling cls --l2-norm \
--topics topics-and-qrels/topics.miracl-v1.0-${lang}-dev.tsv \
--index bge-m3_miracl_2cr/dense/${lang} \
--output bge-m3_miracl_2cr/dense/runs/${lang}.txt \
--hits 1000
python -m pyserini.search.lucene \
--threads 16 --batch-size 128 \
--topics bge-m3_miracl_2cr/sparse/${lang}/query_embd.tsv \
--index bge-m3_miracl_2cr/sparse/${lang}/index \
--output bge-m3_miracl_2cr/sparse/runs/${lang}.txt \
--output-format trec \
--impact --hits 1000
# Generate dense+sparse run
mkdir -p bge-m3_miracl_2cr/fusion/runs
python -m pyserini.fusion \
--method interpolation \
--runs bge-m3_miracl_2cr/dense/runs/${lang}.txt bge-m3_miracl_2cr/sparse/runs/${lang}.txt \
--alpha 1 3e-5 \
--output bge-m3_miracl_2cr/fusion/runs/${lang}.txt \
--depth 1000 --k 1000
# Evaluate
## nDCG@10
python -m pyserini.eval.trec_eval \
-c -M 100 -m ndcg_cut.10 \
topics-and-qrels/qrels.miracl-v1.0-${lang}-dev.tsv \
bge-m3_miracl_2cr/fusion/runs/${lang}.txt
## Recall@100
python -m pyserini.eval.trec_eval \
-c -m recall.100 \
topics-and-qrels/qrels.miracl-v1.0-${lang}-dev.tsv \
bge-m3_miracl_2cr/fusion/runs/${lang}.txt
```
Note:
- The hybrid score used for MIRACL in the BGE-M3 paper is `s_dense + 0.3 * s_sparse`. The sparse scores computed here have already been multiplied by 100^2, so the sparse weight must be divided by the same factor: 0.3 / 100^2 = 3e-5. This is why the alpha for the sparse run is 3e-5 instead of 0.3.
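The rescaling of the sparse weight can be checked with a little arithmetic. The sketch below is a toy calculation only, not Pyserini's fusion code: the `fuse` helper is hypothetical and the scores are invented. It compares the paper-style weighting on an unscaled sparse score with the rescaled weighting on the same score multiplied by 100^2.

```bash
# Toy arithmetic: d + a * s, printed with four decimal places.
# This is not how Pyserini implements fusion; it only illustrates the weights.
fuse() {
  # $1: dense score, $2: sparse score, $3: weight applied to the sparse score
  awk -v d="$1" -v s="$2" -v a="$3" 'BEGIN { printf "%.4f\n", d + a * s }'
}

s_dense=0.62            # invented dense score
s_sparse=0.25           # invented sparse score before scaling
s_sparse_scaled=2500    # the same sparse score multiplied by 100^2

# Paper form: s_dense + 0.3 * s_sparse
fuse "$s_dense" "$s_sparse" 0.3
# Rescaled form matching --alpha 1 3e-5
fuse "$s_dense" "$s_sparse_scaled" 3e-5
```

Both calls print the same fused score, which is the point of dividing the sparse weight by 100^2.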
Provider: hanhainebula
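## Summary of Original Information
### Dataset Overview
#### Purpose
This dataset is used to reproduce the `Dense`, `Sparse`, and `Dense+Sparse` evaluation results of the BGE-M3 paper on the MIRACL dev split.
#### Contents
- MIRACL topics and qrels: query topics and relevance judgments for multiple languages.
- Dense and Sparse Index: dense and sparse indexes for multiple languages.
#### Structure
- Dense Index: index files for the following languages: ar, bn, en, es, fa, fi, fr, hi, id, ja, ko, ru, sw, te, th, zh, de, yo.
- Sparse Index: index files for the same languages.
#### Usage
- Dense Index: searched with Faiss; supported languages are ar, bn, en, es, fa, fi, fr, hi, id, ja, ko, ru, sw, te, th, zh, de, yo.
- Sparse Index: searched with Lucene; the same languages are supported.
- Dense+Sparse Index: searched by fusing the dense and sparse results; the code changes that add support for multiple alpha settings must be merged first.
#### Evaluation Metrics
- nDCG@10: measures the ranking quality of the search results.
- Recall@100: measures the coverage of the search results.
#### Notes
- Pyserini must be modified to support multiple alpha settings; see the submitted PR for the required changes.
- When fusing the dense and sparse results, the sparse scores have already been multiplied by 100^2, so alpha is set to 3e-5 rather than 0.3.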



