UTokyo-Yokoya-Lab/arguana_CS-MTEB
Source: Hugging Face. Updated 2026-04-14; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/UTokyo-Yokoya-Lab/arguana_CS-MTEB
Description:
---
configs:
- config_name: default
  data_files:
  - split: test
    path: qrels/test.jsonl
- config_name: corpus
  data_files:
  - split: corpus
    path: corpus.jsonl
- config_name: queries_zh_en
  data_files:
  - split: queries
    path: queries_zh_en.jsonl
- config_name: queries_ja_en
  data_files:
  - split: queries
    path: queries_ja_en.jsonl
- config_name: queries_de_en
  data_files:
  - split: queries
    path: queries_de_en.jsonl
- config_name: queries_es_en
  data_files:
  - split: queries
    path: queries_es_en.jsonl
- config_name: queries_ko_en
  data_files:
  - split: queries
    path: queries_ko_en.jsonl
- config_name: queries_fr_en
  data_files:
  - split: queries
    path: queries_fr_en.jsonl
- config_name: queries_it_en
  data_files:
  - split: queries
    path: queries_it_en.jsonl
- config_name: queries_pt_en
  data_files:
  - split: queries
    path: queries_pt_en.jsonl
- config_name: queries_nl_en
  data_files:
  - split: queries
    path: queries_nl_en.jsonl
dataset_info:
- config_name: default
  features:
  - name: query-id
    dtype: string
  - name: corpus-id
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_examples: 1406
- config_name: corpus
  features:
  - name: _id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: corpus
    num_examples: 8674
- config_name: queries_zh_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_ja_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_de_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_es_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_ko_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_fr_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_it_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_pt_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_nl_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
language:
- en
- zh
- ja
- de
- es
- ko
- fr
- it
- pt
- nl
multilinguality: multilingual
task_categories:
- text-retrieval
task_ids: []
tags:
- mteb
- text
- code-switching
---
<div align="center" style="padding: 40px 20px; background-color: white; border-radius: 12px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.05); max-width: 600px; margin: 0 auto;">
<h1 style="font-size: 3.5rem; color: #1a1a1a; margin: 0 0 20px 0; letter-spacing: 2px; font-weight: 700;">ArguAna CS-MTEB</h1>
<div style="font-size: 1.5rem; color: #4a4a4a; margin-bottom: 5px; font-weight: 300;">An <a href="https://github.com/embeddings-benchmark/mteb" style="color: #2c5282; font-weight: 600; text-decoration: none;">MTEB</a> dataset</div>
<div style="font-size: 0.9rem; color: #2c5282; margin-top: 10px;">Massive Text Embedding Benchmark</div>
</div>
Code-switching version of [mteb/arguana](https://huggingface.co/datasets/mteb/arguana), with queries rewritten in Chinese-English, Japanese-English, German-English, Spanish-English, Korean-English, French-English, Italian-English, Portuguese-English, and Dutch-English code-switching styles.
## Dataset Structure
The dataset contains the following configurations:
**From original dataset (unchanged):**
- `corpus`: Original corpus documents
- `default`: Original relevance judgments (qrels)
**Code-switching queries:**
- `queries_zh_en`: Chinese-English code-switching queries
- `queries_ja_en`: Japanese-English code-switching queries
- `queries_de_en`: German-English code-switching queries
- `queries_es_en`: Spanish-English code-switching queries
- `queries_ko_en`: Korean-English code-switching queries
- `queries_fr_en`: French-English code-switching queries
- `queries_it_en`: Italian-English code-switching queries
- `queries_pt_en`: Portuguese-English code-switching queries
- `queries_nl_en`: Dutch-English code-switching queries
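The query configurations follow a regular `queries_<lang>_en` naming pattern, and the card metadata lists the columns and row counts for every configuration. A minimal sketch of that layout as a lookup table, which may help when sanity-checking a download. The names `LANGS`, `expected_layout`, and `check_config` are hypothetical helpers, not part of the dataset or of the `datasets` library:

```python
# Language codes taken from the configuration list in this card.
LANGS = ["zh", "ja", "de", "es", "ko", "fr", "it", "pt", "nl"]


def expected_layout():
    """Map each config name to (split name, columns, row count) per the card metadata."""
    layout = {
        "default": ("test", ["query-id", "corpus-id", "score"], 1406),
        "corpus": ("corpus", ["_id", "title", "text"], 8674),
    }
    for lang in LANGS:
        layout[f"queries_{lang}_en"] = ("queries", ["_id", "text"], 1406)
    return layout


def check_config(name, columns, num_rows):
    """Return a list of mismatches between a loaded config and the card metadata."""
    _, want_cols, want_rows = expected_layout()[name]
    problems = []
    # Compare column names without regard to order.
    if sorted(columns) != sorted(want_cols):
        problems.append(f"{name}: columns {sorted(columns)} != {sorted(want_cols)}")
    if num_rows != want_rows:
        problems.append(f"{name}: {num_rows} rows != {want_rows}")
    return problems
```

For a loaded split `ds`, calling `check_config("corpus", ds.column_names, ds.num_rows)` should return an empty list when the download matches the metadata above.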
## Usage
```python
from datasets import load_dataset
# Load code-switching queries
queries_zh = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_zh_en")
queries_ja = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_ja_en")
queries_de = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_de_en")
queries_es = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_es_en")
queries_ko = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_ko_en")
queries_fr = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_fr_en")
queries_it = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_it_en")
queries_pt = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_pt_en")
queries_nl = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_nl_en")
# Load original configs
corpus = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "corpus")
qrels = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "default")
```
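Since the corpus and qrels are unchanged from the original dataset, the code-switching queries can presumably be linked to their relevant documents through the shared identifiers: `_id` in the queries and corpus, `query-id` and `corpus-id` in the qrels. A minimal sketch of that join over plain row dictionaries follows. The function name `build_relevance_map`, the `min_score` threshold, and the toy rows are illustrative assumptions, not actual dataset content:

```python
from collections import defaultdict


def build_relevance_map(qrel_rows, min_score=1.0):
    """Group relevant corpus ids by query id.

    qrel_rows: iterable of dicts with "query-id", "corpus-id", and "score"
    keys, matching the columns of the `default` config.
    min_score: assumed relevance threshold; adjust if scores are graded.
    """
    relevant = defaultdict(set)
    for row in qrel_rows:
        if row["score"] >= min_score:
            relevant[row["query-id"]].add(row["corpus-id"])
    return dict(relevant)


# Toy rows standing in for real data (the ids are made up for illustration).
toy_qrels = [
    {"query-id": "q1", "corpus-id": "d1", "score": 1.0},
    {"query-id": "q1", "corpus-id": "d2", "score": 0.0},
    {"query-id": "q2", "corpus-id": "d3", "score": 1.0},
]
relevance = build_relevance_map(toy_qrels)
```

With the real data, `build_relevance_map(qrels["test"])` should behave the same way, because iterating over a `datasets` split yields one dictionary per row.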
## Attribution
Based on [mteb/arguana](https://huggingface.co/datasets/mteb/arguana).
## Citation
If you use this dataset, please also cite the original ArguAna paper and the MTEB benchmark papers:
```bibtex
@inproceedings{wachsmuth2018retrieval,
author = {Henning Wachsmuth and Shahbaz Syed and Benno Stein},
booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
doi = {10.18653/v1/P18-1023},
pages = {239--249},
title = {Retrieval of the Best Counterargument without Prior Topic Knowledge},
year = {2018},
}
@article{enevoldsen2025mmtebmassivemultilingualtext,
title={MMTEB: Massive Multilingual Text Embedding Benchmark},
author={Kenneth Enevoldsen and Isaac Chung and Imene Kerboua and others},
journal={arXiv preprint arXiv:2502.13595},
year={2025},
url={https://arxiv.org/abs/2502.13595},
doi={10.48550/arXiv.2502.13595},
}
@article{muennighoff2022mteb,
author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo\"{\i}c and Reimers, Nils},
title = {MTEB: Massive Text Embedding Benchmark},
journal={arXiv preprint arXiv:2210.07316},
year = {2022},
url = {https://arxiv.org/abs/2210.07316},
doi = {10.48550/ARXIV.2210.07316},
}
```
Provider:
UTokyo-Yokoya-Lab



