jhu-clsp/news21-instructions-mteb
Source: Hugging Face. Updated 2024-11-05; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/jhu-clsp/news21-instructions-mteb
Description:
---
configs:
- config_name: corpus
  data_files:
  - path: corpus/corpus-*
    split: corpus
- config_name: queries
  data_files:
  - path: queries/queries-*
    split: queries
- config_name: instruction
  data_files:
  - path: instruction/instruction-*
    split: instruction
- config_name: default
  data_files:
  - path: data/default-*
    split: test
- config_name: qrel_diff
  data_files:
  - path: qrel_diff/qrel_diff-*
    split: qrel_diff
- config_name: top_ranked
  data_files:
  - path: top_ranked/top_ranked-*
    split: top_ranked
dataset_info:
- config_name: corpus
  features:
  - dtype: string
    name: _id
  - dtype: string
    name: title
  - dtype: string
    name: text
  splits:
  - name: corpus
    num_examples: 30921
- config_name: queries
  features:
  - dtype: string
    name: _id
  - dtype: string
    name: text
  splits:
  - name: queries
    num_examples: 64
- config_name: instruction
  features:
  - dtype: string
    name: query-id
  - dtype: string
    name: instruction
  splits:
  - name: instruction
    num_examples: 64
- config_name: default
  features:
  - dtype: string
    name: query-id
  - dtype: string
    name: corpus-id
  - dtype: float64
    name: score
  splits:
  - name: test
    num_examples: 8554
- config_name: qrel_diff
  features:
  - dtype: string
    name: query-id
  - list: string
    name: corpus-ids
  splits:
  - name: qrel_diff
    num_examples: 32
- config_name: top_ranked
  features:
  - dtype: string
    name: query-id
  - list: string
    name: corpus-ids
  splits:
  - name: top_ranked
    num_examples: 64
language:
- en
multilinguality:
- monolingual
tags:
- text-retrieval
- instruction-retrieval
task_categories:
- text-retrieval
task_ids:
- document-retrieval
---
# news21-instructions-mteb
This is a new version of the news21-instructions dataset, modified to fit the new MTEB format. The changes are:
1. Restructured queries to include both original and changed versions
2. Separated instructions into a dedicated configuration
3. Reorganized qrels into default (original) and qrel_diff configurations
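As a rough illustration of how the separated pieces fit back together, the sketch below nests flat relevance rows into a lookup table and pairs each query with its instruction. The field names (`query-id`, `corpus-id`, `score`, `_id`, `text`, `instruction`) come from the card's metadata. The helper names `build_qrels` and `attach_instructions` are hypothetical, the sample rows are made up, and the code assumes that `query-id` in the instruction config refers to `_id` in the queries config, which the card does not state explicitly.

```python
from collections import defaultdict


def build_qrels(rows):
    """Nest flat (query-id, corpus-id, score) rows as qrels[query_id][corpus_id] = score."""
    qrels = defaultdict(dict)
    for row in rows:
        qrels[row["query-id"]][row["corpus-id"]] = row["score"]
    return dict(qrels)


def attach_instructions(queries, instructions):
    """Pair each query text with its instruction.

    Assumption: `query-id` in the instruction rows matches `_id` in the query rows.
    Queries without a matching instruction get None.
    """
    by_id = {row["query-id"]: row["instruction"] for row in instructions}
    return {q["_id"]: (q["text"], by_id.get(q["_id"])) for q in queries}


# Made-up rows for illustration only; they are not taken from the dataset.
sample_qrels = [
    {"query-id": "q1", "corpus-id": "d1", "score": 1.0},
    {"query-id": "q1", "corpus-id": "d2", "score": 0.0},
    {"query-id": "q2", "corpus-id": "d3", "score": 1.0},
]
sample_queries = [{"_id": "q1", "text": "example query"}]
sample_instructions = [{"query-id": "q1", "instruction": "example instruction"}]

qrels = build_qrels(sample_qrels)
paired = attach_instructions(sample_queries, sample_instructions)
```

The same functions accept rows from the loaded `default`, `queries`, and `instruction` configurations, since iterating a loaded split yields one dictionary per row.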
## Dataset Structure
The dataset contains the following configurations:
- corpus: Original corpus documents
- queries: Queries with both original and changed versions
- instruction: Instructions for both original and changed queries
- default: Original relevance judgments
- qrel_diff: Changes in relevance judgments
- top_ranked: Top ranked documents for each query
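A minimal sketch of loading these configurations with the Hugging Face `datasets` library follows. The config and split names are copied from the card's metadata; note that the `default` config uses a split named `test`, while every other config uses a split named after itself. The names `CONFIG_SPLITS` and `load_all` are hypothetical, and the function needs network access and the `datasets` package when called.

```python
# Config name -> split name, copied from the dataset card's metadata.
CONFIG_SPLITS = {
    "corpus": "corpus",
    "queries": "queries",
    "instruction": "instruction",
    "default": "test",
    "qrel_diff": "qrel_diff",
    "top_ranked": "top_ranked",
}


def load_all(repo_id="jhu-clsp/news21-instructions-mteb"):
    """Load every configuration into a dict keyed by config name.

    Hypothetical helper: requires the `datasets` package and network access.
    """
    from datasets import load_dataset  # imported lazily so the module loads without it

    return {
        name: load_dataset(repo_id, name, split=split)
        for name, split in CONFIG_SPLITS.items()
    }
```

Calling `load_all()` should return six datasets; according to the metadata, the `corpus` entry holds 30921 rows and `queries` holds 64.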
Provider:
jhu-clsp



