maveriq/medi
Source: Hugging Face. Updated 2023-10-22; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/maveriq/medi
Resource description:
---
dataset_info:
  features:
  - name: query
    sequence: string
  - name: pos
    sequence: string
  - name: neg
    sequence: string
  - name: task_name
    dtype: string
  splits:
  - name: train
    num_bytes: 2572523114
    num_examples: 1435000
  download_size: 1232020798
  dataset_size: 2572523114
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- feature-extraction
language:
- en
pretty_name: Multitask Embeddings Data with Instructions (MEDI)
size_categories:
- 1M<n<10M
---
# Disclaimer
I am not the author of the dataset or the paper; I uploaded it only to make it easier to access. For full information, please refer to the [website](https://instructor-embedding.github.io/).
# Dataset Card for "medi"
The MEDI data consists of a collection of 330 datasets from Super-NI (Super-NaturalInstructions), sentence-transformers embedding training data, and KILT, spanning a wide range of domains and tasks.
If you use the dataset, please cite Su et al., 2022, Wang et al., 2022, and Petroni et al., 2021, as well as the sentence-transformers embedding training data at https://huggingface.co/datasets/sentence-transformers/embedding-training-data.
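As a rough illustration of the schema declared in the metadata (`query`, `pos`, and `neg` as string sequences, plus `task_name`), the sketch below turns one record into a (query, positive, negative) text triplet. It assumes each sequence holds an instruction followed by the text it applies to; the card does not state this, and the sample record, `to_triplet`, and `iter_triplets` are hypothetical names made up for this example. Streaming uses `load_dataset` from the third-party `datasets` library, imported lazily so the formatting helper works without it.

```python
from typing import Dict, Iterator, Sequence, Tuple


def to_triplet(record: Dict[str, Sequence[str]], sep: str = " ") -> Tuple[str, str, str]:
    """Join each string sequence of one MEDI-style record into a single text.

    Assumption (not stated on the card): `query`, `pos`, and `neg` each hold
    an instruction followed by the text that instruction applies to.
    """
    return (
        sep.join(record["query"]),
        sep.join(record["pos"]),
        sep.join(record["neg"]),
    )


def iter_triplets(limit: int = 3) -> Iterator[Tuple[str, Tuple[str, str, str]]]:
    """Stream the first `limit` records instead of downloading the full set."""
    # Third-party dependency, imported here so to_triplet has no requirements.
    from datasets import load_dataset

    stream = load_dataset("maveriq/medi", split="train", streaming=True)
    for record in stream.take(limit):
        yield record["task_name"], to_triplet(record)


# Hypothetical record shaped like the declared features (not real MEDI data).
sample = {
    "query": ["Represent the question for retrieval:", "who wrote hamlet"],
    "pos": ["Represent the passage for retrieval:", "Hamlet is a tragedy by Shakespeare."],
    "neg": ["Represent the passage for retrieval:", "Macbeth is set in Scotland."],
    "task_name": "hypothetical_task",
}

anchor, positive, negative = to_triplet(sample)
```

Calling `iter_triplets()` needs network access and the `datasets` package; `to_triplet` can be checked offline against any record with the same three fields.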
# Citation Information
```
@inproceedings{INSTRUCTOR,
title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},
author={Su, Hongjin and Shi, Weijia and Kasai, Jungo and Wang, Yizhong and Hu, Yushi and Ostendorf, Mari and Yih, Wen-tau and Smith, Noah A. and Zettlemoyer, Luke and Yu, Tao},
url={https://arxiv.org/abs/2212.09741},
year={2022},
}
@inproceedings{wang2022super,
title={Super-naturalinstructions: generalization via declarative instructions on 1600+ tasks},
author={Wang, Yizhong and Mishra, Swaroop and Alipoormolabashi, Pegah and Kordi, Yeganeh and Mirzaei, Amirreza and Arunkumar, Anjana and Ashok, Arjun and Dhanasekaran, Arut Selvan and Naik, Atharva and Stap, David and others},
year={2022},
organization={EMNLP}
}
@article{petroni2020kilt,
title={KILT: a benchmark for knowledge intensive language tasks},
author={Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and Lewis, Patrick and Yazdani, Majid and De Cao, Nicola and Thorne, James and Jernite, Yacine and Karpukhin, Vladimir and Maillard, Jean and others},
journal={arXiv preprint arXiv:2009.02252},
year={2020}
}
```
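Provider:
maveriq
Summary of original information
Dataset overview
Dataset information
- Features:
  - query: sequence of strings
  - pos: sequence of strings
  - neg: sequence of strings
  - task_name: string
- Splits:
  - train: 2,572,523,114 bytes, 1,435,000 examples
- Download size: 1,232,020,798 bytes
- Dataset size: 2,572,523,114 bytes
Configuration
- Default configuration:
  - Data file path: data/train-*
Task categories
- Feature extraction
Language
- English
Dataset name
- Multitask Embeddings Data with Instructions (MEDI)
Dataset size category
- 1M<n<10M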
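
The size figures listed for the dataset can be turned into two quick planning numbers: the average size of one example and how much smaller the download is than the prepared dataset. The sketch below only does arithmetic on the values reported on the card; the function and key names are made up for this example.

```python
# Values copied from the dataset card metadata.
DATASET_SIZE_BYTES = 2_572_523_114   # dataset_size (same as train num_bytes)
NUM_EXAMPLES = 1_435_000             # train num_examples
DOWNLOAD_SIZE_BYTES = 1_232_020_798  # download_size


def size_summary(dataset_bytes: int, num_examples: int, download_bytes: int) -> dict:
    """Return average bytes per example and the download-to-dataset size ratio."""
    return {
        "avg_bytes_per_example": dataset_bytes / num_examples,
        "download_ratio": download_bytes / dataset_bytes,
    }


summary = size_summary(DATASET_SIZE_BYTES, NUM_EXAMPLES, DOWNLOAD_SIZE_BYTES)
print(f"average example size: {summary['avg_bytes_per_example']:.0f} bytes")
print(f"download / dataset size: {summary['download_ratio']:.2f}")
```

The average comes out a little under 2 KB per example, and the download is a bit less than half the size of the prepared dataset.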



