
maveriq/medi

Source: Hugging Face. Updated: 2023-10-22. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/maveriq/medi
Resource description:
---
dataset_info:
  features:
  - name: query
    sequence: string
  - name: pos
    sequence: string
  - name: neg
    sequence: string
  - name: task_name
    dtype: string
  splits:
  - name: train
    num_bytes: 2572523114
    num_examples: 1435000
  download_size: 1232020798
  dataset_size: 2572523114
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- feature-extraction
language:
- en
pretty_name: Multitask Embeddings Data with Instructions (MEDI)
size_categories:
- 1M<n<10M
---

# Disclaimer

I am not the author of the dataset or the paper. I have just uploaded it for ease of availability. For all information please refer to the [website](https://instructor-embedding.github.io/)

# Dataset Card for "medi"

The MEDI data consists of a collection of 330 datasets from Super-NI (Super-NaturalInstructions), sentence-transformer embedding training data, and KILT, spanning a wide range of domains and tasks.

If you use the dataset, please cite the following papers including Su et al., 2022, Wang et al., 2022, Petroni et al., 2021 and sentence transformer embedding training data at https://huggingface.co/datasets/sentence-transformers/embedding-training-data.

# Citation Information

```
@inproceedings{INSTRUCTOR,
  title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},
  author={Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, Tao Yu},
  url={https://arxiv.org/abs/2212.09741},
  year={2022},
}

@inproceedings{wang2022super,
  title={Super-naturalinstructions: generalization via declarative instructions on 1600+ tasks},
  author={Wang, Yizhong and Mishra, Swaroop and Alipoormolabashi, Pegah and Kordi, Yeganeh and Mirzaei, Amirreza and Arunkumar, Anjana and Ashok, Arjun and Dhanasekaran, Arut Selvan and Naik, Atharva and Stap, David and others},
  year={2022},
  organization={EMNLP}
}

@article{petroni2020kilt,
  title={KILT: a benchmark for knowledge intensive language tasks},
  author={Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and Lewis, Patrick and Yazdani, Majid and De Cao, Nicola and Thorne, James and Jernite, Yacine and Karpukhin, Vladimir and Maillard, Jean and others},
  journal={arXiv preprint arXiv:2009.02252},
  year={2020}
}
```
Provider:
maveriq
Summary of original information

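Dataset overview

Dataset information

  • Features:
    • query: sequence of strings
    • pos: sequence of strings
    • neg: sequence of strings
    • task_name: string
  • Splits:
    • train: 2,572,523,114 bytes, 1,435,000 examples
  • Download size: 1,232,020,798 bytes
  • Dataset size: 2,572,523,114 bytes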
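
To put the sizes listed above in perspective, the following Python sketch derives a few summary figures from them. The helper names are illustrative only, and the reading of the download size as the bytes fetched over the network (versus the dataset size as the prepared data on disk) is an assumption based on how Hugging Face dataset metadata is usually interpreted.

```python
# Figures copied from the dataset metadata listed above.
NUM_EXAMPLES = 1_435_000
DATASET_SIZE_BYTES = 2_572_523_114   # assumed: size of the prepared data on disk
DOWNLOAD_SIZE_BYTES = 1_232_020_798  # assumed: size of the files fetched over the network


def bytes_per_example(total_bytes: int, num_examples: int) -> float:
    """Average storage cost of one example, in bytes."""
    return total_bytes / num_examples


def download_ratio(download_bytes: int, dataset_bytes: int) -> float:
    """Download size expressed as a fraction of the dataset size."""
    return download_bytes / dataset_bytes


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


print(f"average bytes per example: {bytes_per_example(DATASET_SIZE_BYTES, NUM_EXAMPLES):.1f}")
print(f"download / dataset ratio:  {download_ratio(DOWNLOAD_SIZE_BYTES, DATASET_SIZE_BYTES):.3f}")
print(f"dataset size in GiB:       {to_gib(DATASET_SIZE_BYTES):.2f}")
print(f"download size in GiB:      {to_gib(DOWNLOAD_SIZE_BYTES):.2f}")
```

The example count also falls between one and ten million, which matches the `1M<n<10M` size category declared for the dataset.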

Configuration

  • Default configuration:
    • Data file path: data/train-*
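
As a rough illustration of how the features and the default configuration fit together, the sketch below checks a record against the declared schema and streams the `train` split with the Hugging Face `datasets` library. The function names `is_valid_record` and `stream_medi` are hypothetical, the sample record is invented for illustration, and the card itself does not prescribe any loading method, so treat this as one possible approach rather than the recommended one.

```python
from itertools import islice
from typing import Any, Iterator


def is_valid_record(record: dict[str, Any]) -> bool:
    """Check a record against the declared features:
    query, pos and neg are sequences of strings; task_name is a string."""
    for key in ("query", "pos", "neg"):
        value = record.get(key)
        if not isinstance(value, list):
            return False
        if not all(isinstance(item, str) for item in value):
            return False
    return isinstance(record.get("task_name"), str)


def stream_medi(limit: int) -> Iterator[dict[str, Any]]:
    """Yield the first `limit` records of the train split without a full download."""
    # Imported lazily so the validator above works without the `datasets` package.
    from datasets import load_dataset

    # streaming=True iterates over the remote files instead of downloading them all.
    dataset = load_dataset("maveriq/medi", split="train", streaming=True)
    yield from islice(dataset, limit)


# Hypothetical record, invented for illustration only (not taken from the dataset).
example = {
    "query": ["an instruction for the query", "some query text"],
    "pos": ["an instruction for the document", "a matching document"],
    "neg": ["an instruction for the document", "an unrelated document"],
    "task_name": "illustrative_task",
}
print(is_valid_record(example))
```

With network access and the `datasets` package installed, `for record in stream_medi(3): print(record["task_name"])` would inspect the first few records without fetching the whole download.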

Task categories

  • Feature extraction

Language

  • English

Dataset name

  • Multitask Embeddings Data with Instructions (MEDI)

Dataset size category

  • 1M<n<10M