
UTokyo-Yokoya-Lab/arguana_CS-MTEB

Source: Hugging Face. Updated 2026-04-14; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/UTokyo-Yokoya-Lab/arguana_CS-MTEB
Description:
---
configs:
- config_name: default
  data_files:
  - split: test
    path: qrels/test.jsonl
- config_name: corpus
  data_files:
  - split: corpus
    path: corpus.jsonl
- config_name: queries_zh_en
  data_files:
  - split: queries
    path: queries_zh_en.jsonl
- config_name: queries_ja_en
  data_files:
  - split: queries
    path: queries_ja_en.jsonl
- config_name: queries_de_en
  data_files:
  - split: queries
    path: queries_de_en.jsonl
- config_name: queries_es_en
  data_files:
  - split: queries
    path: queries_es_en.jsonl
- config_name: queries_ko_en
  data_files:
  - split: queries
    path: queries_ko_en.jsonl
- config_name: queries_fr_en
  data_files:
  - split: queries
    path: queries_fr_en.jsonl
- config_name: queries_it_en
  data_files:
  - split: queries
    path: queries_it_en.jsonl
- config_name: queries_pt_en
  data_files:
  - split: queries
    path: queries_pt_en.jsonl
- config_name: queries_nl_en
  data_files:
  - split: queries
    path: queries_nl_en.jsonl
dataset_info:
- config_name: default
  features:
  - name: query-id
    dtype: string
  - name: corpus-id
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_examples: 1406
- config_name: corpus
  features:
  - name: _id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: corpus
    num_examples: 8674
- config_name: queries_zh_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_ja_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_de_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_es_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_ko_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_fr_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_it_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_pt_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
- config_name: queries_nl_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: queries
    num_examples: 1406
language:
- en
- zh
- ja
- de
- es
- ko
- fr
- it
- pt
- nl
multilinguality: multilingual
task_categories:
- text-retrieval
task_ids: []
tags:
- mteb
- text
- code-switching
---

<div align="center" style="padding: 40px 20px; background-color: white; border-radius: 12px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.05); max-width: 600px; margin: 0 auto;">
  <h1 style="font-size: 3.5rem; color: #1a1a1a; margin: 0 0 20px 0; letter-spacing: 2px; font-weight: 700;">ArguAna CS-MTEB</h1>
  <div style="font-size: 1.5rem; color: #4a4a4a; margin-bottom: 5px; font-weight: 300;">An <a href="https://github.com/embeddings-benchmark/mteb" style="color: #2c5282; font-weight: 600; text-decoration: none;">MTEB</a> dataset</div>
  <div style="font-size: 0.9rem; color: #2c5282; margin-top: 10px;">Massive Text Embedding Benchmark</div>
</div>

Code-switching version of [mteb/arguana](https://huggingface.co/datasets/mteb/arguana), with queries rewritten in Chinese-English, Japanese-English, German-English, Spanish-English, Korean-English, French-English, Italian-English, Portuguese-English, Dutch-English code-switching styles.

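
The metadata above declares the qrels schema (`query-id`, `corpus-id`, `score`) and an `_id` key for both corpus and query rows. A minimal sketch of how such rows could be turned into lookup dictionaries for retrieval evaluation follows. The helper names `build_qrels` and `index_by_id` and the toy rows are hypothetical, and the integer cast on `score` is an assumption about what a downstream evaluator expects.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def build_qrels(rows: Iterable[Mapping]) -> Dict[str, Dict[str, int]]:
    """Group relevance judgments by query: {query-id: {corpus-id: score}}."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for row in rows:
        # The metadata stores scores as float64; casting to int is an
        # assumption for evaluators that expect integer relevance labels.
        qrels[row["query-id"]][row["corpus-id"]] = int(row["score"])
    return dict(qrels)


def index_by_id(rows: Iterable[Mapping]) -> Dict[str, Mapping]:
    """Map each row's `_id` to the row itself (corpus or query rows)."""
    return {row["_id"]: row for row in rows}


# Toy rows that mimic the declared schema; all values are invented.
toy_qrels = [
    {"query-id": "q1", "corpus-id": "d1", "score": 1.0},
    {"query-id": "q2", "corpus-id": "d2", "score": 1.0},
]
toy_queries = [
    {"_id": "q1", "text": "placeholder query text"},
    {"_id": "q2", "text": "placeholder query text"},
]

qrels = build_qrels(toy_qrels)
queries = index_by_id(toy_queries)

# Sanity check: every judged query should have a matching query row.
missing = [qid for qid in qrels if qid not in queries]
```

With the real data, the same helpers could be fed the rows of the `default`, `corpus`, and any `queries_*_en` config, since iterating over a loaded split yields one dictionary per row.
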
## Dataset Structure

The dataset contains the following configurations:

**From original dataset (unchanged):**

- `corpus`: Original corpus documents
- `default`: Original relevance judgments (qrels)

**Code-switching queries:**

- `queries_zh_en`: Chinese-English code-switching queries
- `queries_ja_en`: Japanese-English code-switching queries
- `queries_de_en`: German-English code-switching queries
- `queries_es_en`: Spanish-English code-switching queries
- `queries_ko_en`: Korean-English code-switching queries
- `queries_fr_en`: French-English code-switching queries
- `queries_it_en`: Italian-English code-switching queries
- `queries_pt_en`: Portuguese-English code-switching queries
- `queries_nl_en`: Dutch-English code-switching queries

## Usage

```python
from datasets import load_dataset

# Load code-switching queries
queries_zh = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_zh_en")
queries_ja = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_ja_en")
queries_de = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_de_en")
queries_es = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_es_en")
queries_ko = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_ko_en")
queries_fr = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_fr_en")
queries_it = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_it_en")
queries_pt = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_pt_en")
queries_nl = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "queries_nl_en")

# Load original configs
corpus = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "corpus")
qrels = load_dataset("UTokyo-Yokoya-Lab/arguana_CS-MTEB", "default")
```

## Attribution

Based on [mteb/arguana](https://huggingface.co/datasets/mteb/arguana).

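
Returning to the Usage section: because the nine query configs share the `queries_<lang>_en` naming pattern, they can also be loaded in a loop rather than one call at a time. The sketch below is one way to do that. `REPO_ID`, `LANG_CODES`, `query_config_name`, and `load_all_queries` are hypothetical names introduced here, and the loader is passed in as an argument (expected to be `datasets.load_dataset`) so the helper can be exercised without network access.

```python
from typing import Callable, Dict, List

REPO_ID = "UTokyo-Yokoya-Lab/arguana_CS-MTEB"

# Language codes paired with English, taken from the config list above.
LANG_CODES: List[str] = ["zh", "ja", "de", "es", "ko", "fr", "it", "pt", "nl"]


def query_config_name(lang: str) -> str:
    """Return the config name for a language code, e.g. 'zh' -> 'queries_zh_en'."""
    if lang not in LANG_CODES:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"queries_{lang}_en"


def load_all_queries(loader: Callable[..., object]) -> Dict[str, object]:
    """Load the `queries` split of every code-switching config.

    `loader` is assumed to behave like `datasets.load_dataset`, i.e. to
    accept (repo_id, config_name, split=...).
    """
    return {
        lang: loader(REPO_ID, query_config_name(lang), split="queries")
        for lang in LANG_CODES
    }
```

In practice this would be called as `load_all_queries(load_dataset)` after `from datasets import load_dataset`, giving a dictionary keyed by language code.
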
## Citation

If you use this dataset, please also cite the original:

```bibtex
@inproceedings{wachsmuth2018retrieval,
  author = {Henning Wachsmuth and Shahbaz Syed and Benno Stein},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  doi = {10.18653/v1/P18-1023},
  pages = {239--249},
  title = {Retrieval of the Best Counterargument without Prior Topic Knowledge},
  year = {2018},
}

@article{enevoldsen2025mmtebmassivemultilingualtext,
  title={MMTEB: Massive Multilingual Text Embedding Benchmark},
  author={Kenneth Enevoldsen and Isaac Chung and Imene Kerboua and others},
  journal={arXiv preprint arXiv:2502.13595},
  year={2025},
  url={https://arxiv.org/abs/2502.13595},
  doi={10.48550/arXiv.2502.13595},
}

@article{muennighoff2022mteb,
  author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo\"{\i}c and Reimers, Nils},
  title = {MTEB: Massive Text Embedding Benchmark},
  journal={arXiv preprint arXiv:2210.07316},
  year = {2022},
  url = {https://arxiv.org/abs/2210.07316},
  doi = {10.48550/ARXIV.2210.07316},
}
```

Provider:
UTokyo-Yokoya-Lab