UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB

Name: UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB
Creator: UTokyo-Yokoya-Lab
Published: 2026-04-16 16:50:48
License: MIT

Source: Hugging Face. Updated 2026-04-16. Indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB


Description:

---
configs:
- config_name: corpus
  data_files:
  - path: corpus/corpus-*
    split: corpus
- config_name: default
  data_files:
  - split: test
    path: data/test-*
- config_name: queries_de_en
  data_files:
  - path: queries_de_en/train-*
    split: train
- config_name: queries_es_en
  data_files:
  - path: queries_es_en/train-*
    split: train
- config_name: queries_fr_en
  data_files:
  - path: queries_fr_en/train-*
    split: train
- config_name: queries_it_en
  data_files:
  - path: queries_it_en/train-*
    split: train
- config_name: queries_ja_en
  data_files:
  - path: queries_ja_en/train-*
    split: train
- config_name: queries_ko_en
  data_files:
  - path: queries_ko_en/train-*
    split: train
- config_name: queries_nl_en
  data_files:
  - path: queries_nl_en/train-*
    split: train
- config_name: queries_pt_en
  data_files:
  - path: queries_pt_en/train-*
    split: train
- config_name: queries_zh_en
  data_files:
  - path: queries_zh_en/train-*
    split: train
dataset_info:
- config_name: corpus
  features:
  - name: _id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: corpus
    num_examples: 303732
- config_name: default
  features:
  - name: query-id
    dtype: string
  - name: corpus-id
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 161729
    num_examples: 2849
  download_size: 50929
  dataset_size: 161729
- config_name: queries_de_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 49
- config_name: queries_es_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 49
- config_name: queries_fr_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 49
- config_name: queries_it_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 49
- config_name: queries_ja_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  - name: metadata
    struct:
    - name: description
      dtype: string
    - name: narrative
      dtype: string
  splits:
  - name: train
    num_examples: 49
- config_name: queries_ko_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 49
- config_name: queries_nl_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 49
- config_name: queries_pt_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 49
- config_name: queries_zh_en
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  - name: metadata
    struct:
    - name: description
      dtype: string
    - name: narrative
      dtype: string
  splits:
  - name: train
    num_examples: 49
license: mit
language:
- en
- zh
- ja
- de
- es
- ko
- fr
- it
- pt
- nl
multilinguality: multilingual
tags:
- text-retrieval
- code-switching
task_categories:
- text-retrieval
task_ids:
- document-retrieval
---

<div align="center" style="padding: 40px 20px; background-color: white; border-radius: 12px; box-shadow: 0 2px 10px rgba(0, 0, 0, 0.05); max-width: 600px; margin: 0 auto;">
  <h1 style="font-size: 3.5rem; color: #1a1a1a; margin: 0 0 20px 0; letter-spacing: 2px; font-weight: 700;">Touche2020-v3 CS-MTEB</h1>
  <div style="font-size: 1.5rem; color: #4a4a4a; margin-bottom: 5px; font-weight: 300;">An <a href="https://github.com/embeddings-benchmark/mteb" style="color: #2c5282; font-weight: 600; text-decoration: none;">MTEB</a> dataset</div>
  <div style="font-size: 0.9rem; color: #2c5282; margin-top: 10px;">Massive Text Embedding Benchmark</div>
</div>

Code-switching version of [mteb/webis-touche2020-v3](https://huggingface.co/datasets/mteb/webis-touche2020-v3), with queries rewritten in Chinese-English, Japanese-English, German-English, Spanish-English, Korean-English, French-English, Italian-English, Portuguese-English, Dutch-English code-switching styles.

## Dataset Structure

The dataset contains the following configurations:

**From original dataset (unchanged):**

- `corpus`: Original corpus documents
- `default`: Original relevance judgments (qrels)

**Code-switching queries:**

- `queries_zh_en`: Chinese-English code-switching queries
- `queries_ja_en`: Japanese-English code-switching queries
- `queries_de_en`: German-English code-switching queries
- `queries_es_en`: Spanish-English code-switching queries
- `queries_ko_en`: Korean-English code-switching queries
- `queries_fr_en`: French-English code-switching queries
- `queries_it_en`: Italian-English code-switching queries
- `queries_pt_en`: Portuguese-English code-switching queries
- `queries_nl_en`: Dutch-English code-switching queries

## Usage

```python
from datasets import load_dataset

# Load code-switching queries
queries_zh = load_dataset("UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB", "queries_zh_en")
queries_ja = load_dataset("UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB", "queries_ja_en")
queries_de = load_dataset("UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB", "queries_de_en")
queries_es = load_dataset("UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB", "queries_es_en")
queries_ko = load_dataset("UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB", "queries_ko_en")
queries_fr = load_dataset("UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB", "queries_fr_en")
queries_it = load_dataset("UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB", "queries_it_en")
queries_pt = load_dataset("UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB", "queries_pt_en")
queries_nl = load_dataset("UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB", "queries_nl_en")

# Load original configs
corpus = load_dataset("UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB", "corpus")
qrels = load_dataset("UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB", "default")
```

## Attribution

Based on [mteb/webis-touche2020-v3](https://huggingface.co/datasets/mteb/webis-touche2020-v3).
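
The query and qrels configurations described under Dataset Structure are typically joined on their ID columns before evaluation. The following is a minimal sketch of that join, not part of the original card. The field names (`_id`, `text`, `query-id`, `corpus-id`, `score`) and split names (`train`, `test`) come from the card's metadata; the helper names `build_qrels`, `build_queries`, `unjudged_queries`, and `load_pair` are hypothetical. It also assumes that the code-switched queries keep the `_id` values of the original queries, which the card implies but does not state.

```python
from collections import defaultdict


def build_qrels(rows):
    """Group qrels rows into {query-id: {corpus-id: score}}."""
    qrels = defaultdict(dict)
    for row in rows:
        qrels[row["query-id"]][row["corpus-id"]] = float(row["score"])
    return dict(qrels)


def build_queries(rows):
    """Map each query `_id` to its `text`."""
    return {row["_id"]: row["text"] for row in rows}


def unjudged_queries(queries, qrels):
    """Return the sorted query IDs that have no relevance judgments."""
    return sorted(qid for qid in queries if qid not in qrels)


def load_pair(lang_pair="zh_en"):
    """Fetch one query config plus the shared qrels (needs network access)."""
    # Imported lazily so the helpers above also work without `datasets`.
    from datasets import load_dataset

    repo = "UTokyo-Yokoya-Lab/webis-touche2020-v3_CS-MTEB"
    query_rows = load_dataset(repo, f"queries_{lang_pair}", split="train")
    qrel_rows = load_dataset(repo, "default", split="test")
    return build_queries(query_rows), build_qrels(qrel_rows)


# Toy rows with made-up IDs and placeholder text, only to show the shapes.
toy_qrels = build_qrels([
    {"query-id": "q1", "corpus-id": "d1", "score": 2.0},
    {"query-id": "q1", "corpus-id": "d2", "score": 0.0},
    {"query-id": "q2", "corpus-id": "d3", "score": 1.0},
])
toy_queries = build_queries([
    {"_id": "q1", "text": "placeholder query one"},
    {"_id": "q3", "text": "placeholder query three"},
])
print(unjudged_queries(toy_queries, toy_qrels))
```

With network access, `load_pair("ja_en")` would return the same two structures built from the real rows. Checking `unjudged_queries` on the result is a quick way to test the ID-sharing assumption before running a retrieval evaluation.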

## Citation

If you use this dataset, please also cite the original:

```bibtex
@inproceedings{bondarenko2020overview,
  author = {Alexander Bondarenko and Maik Fr\"{o}be and Meriem Beloucif and Lukas Gienapp and Yamen Ajjour and Alexander Panchenko and Chris Biemann and Benno Stein and Henning Wachsmuth and Martin Potthast and Matthias Hagen},
  booktitle = {Experimental IR Meets Multilinguality, Multimodality, and Interaction. 11th International Conference of the CLEF Association (CLEF 2020)},
  doi = {10.1007/978-3-030-58219-7\_26},
  pages = {384--395},
  title = {Overview of Touch\'{e} 2020: Argument Retrieval},
  year = {2020},
}

@article{enevoldsen2025mmtebmassivemultilingualtext,
  title={MMTEB: Massive Multilingual Text Embedding Benchmark},
  author={Kenneth Enevoldsen and Isaac Chung and Imene Kerboua and others},
  journal={arXiv preprint arXiv:2502.13595},
  year={2025},
  url={https://arxiv.org/abs/2502.13595},
  doi={10.48550/arXiv.2502.13595},
}

@article{muennighoff2022mteb,
  author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo\"{\i}c and Reimers, Nils},
  title = {MTEB: Massive Text Embedding Benchmark},
  journal={arXiv preprint arXiv:2210.07316},
  year = {2022},
  url = {https://arxiv.org/abs/2210.07316},
  doi = {10.48550/ARXIV.2210.07316},
}
```


Provider:

UTokyo-Yokoya-Lab
