cmarkea/mmarco-contrastive

Source: Hugging Face. Updated 2024-04-19. Indexed 2024-04-19.
Download link:
https://hf-mirror.com/datasets/cmarkea/mmarco-contrastive
Resource description:
---
license: apache-2.0
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: query
    struct:
    - name: english
      dtype: string
    - name: french
      dtype: string
  - name: positive
    struct:
    - name: english
      dtype: string
    - name: french
      dtype: string
  - name: negatives
    list:
    - name: english
      dtype: string
    - name: french
      dtype: string
    - name: score
      dtype: float64
  splits:
  - name: train
    num_bytes: 30850551179
    num_examples: 398792
  download_size: 15626428403
  dataset_size: 30850551179
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- translation
- text-classification
- feature-extraction
language:
- fr
- en
size_categories:
- 100K<n<1M
---

# mMARCO-contrastive

The dataset is a modification of [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) focusing on French and English parts. The aim is to train a bi-encoder model using all hard negatives from the database. Instead of having a query/positive/negative triplet, we pair all negatives with a query and a positive. However, it's worth noting that there are many false negatives in the dataset. This isn't a big issue with a triplet view because false negatives are much fewer in number, but it's more significant with this arrangement.

Each query/negative pair is scored by the reranking model [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking), assigning a value from 0 to 1. Hence, it's easy to apply a filter to limit false negatives.

Finally, the dataset consists of 398,792 queries with their associated positive contexts and a total of 39,595,191 negative contexts.

## Note

The text encoding of mMARCO is in `latin1`. Converting the text to `utf-8` can be done by re-encoding it as follows:

```python
def to_utf8(txt: str):
    return txt.encode('latin1').decode('utf-8')
```
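The score-based filter mentioned in the description could look roughly like the sketch below. It rests on two assumptions that the dataset card does not state: that a higher reranker score means the passage is more relevant to the query (and so more likely a false negative), and that 0.5 is a sensible cutoff. The helper names `filter_negatives` and `to_triplets` and the sample record are hypothetical and only illustrate the record layout.

```python
from typing import Any


def filter_negatives(record: dict[str, Any], max_score: float = 0.5) -> list[dict[str, Any]]:
    """Keep negatives whose reranker score is at or below max_score.

    Assumption: a high score marks a likely false negative. The 0.5
    default is illustrative, not a value recommended by the dataset card.
    """
    return [neg for neg in record["negatives"] if neg["score"] <= max_score]


def to_triplets(
    record: dict[str, Any], lang: str = "english", max_score: float = 0.5
) -> list[tuple[str, str, str]]:
    """Expand one record into (query, positive, negative) triplets in one language."""
    query = record["query"][lang]
    positive = record["positive"][lang]
    return [(query, positive, neg[lang]) for neg in filter_negatives(record, max_score)]


# Made-up record that follows the documented schema (not real dataset content).
sample = {
    "id": 0,
    "query": {"english": "q en", "french": "q fr"},
    "positive": {"english": "p en", "french": "p fr"},
    "negatives": [
        {"english": "n1 en", "french": "n1 fr", "score": 0.02},
        {"english": "n2 en", "french": "n2 fr", "score": 0.97},  # likely false negative
    ],
}

triplets = to_triplets(sample, lang="french", max_score=0.5)
```

With this sample, only the first negative survives the filter, so `triplets` holds a single French triplet. The threshold is worth tuning against a held-out set rather than taken from this sketch.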
Provider:
cmarkea
Summary of original information

Dataset overview

Dataset information

  • License: Apache-2.0
  • Features:
    • id: integer (int64)
    • query: struct
      • english: string
      • french: string
    • positive: struct
      • english: string
      • french: string
    • negatives: list
      • english: string
      • french: string
      • score: floating point (float64)
  • Splits:
    • train:
      • Bytes: 30,850,551,179
      • Examples: 398,792
  • Download size: 15,626,428,403 bytes
  • Dataset size: 30,850,551,179 bytes
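As a reading aid, the feature list above can be written down as Python type hints. This is only a sketch of the record shape: the class names `BilingualText`, `ScoredNegative`, and `Record` are hypothetical, and the example values are invented.

```python
from typing import TypedDict


class BilingualText(TypedDict):
    """A text available in both languages (used for query and positive)."""
    english: str
    french: str


class ScoredNegative(TypedDict):
    """A negative passage plus its reranker score (float64 in the dataset)."""
    english: str
    french: str
    score: float


class Record(TypedDict):
    """One training example: a query, its positive, and all of its negatives."""
    id: int
    query: BilingualText
    positive: BilingualText
    negatives: list[ScoredNegative]


# Invented values, shown only to illustrate the nesting.
example: Record = {
    "id": 1,
    "query": {"english": "what is x", "french": "qu'est-ce que x"},
    "positive": {"english": "x is ...", "french": "x est ..."},
    "negatives": [{"english": "y is ...", "french": "y est ...", "score": 0.03}],
}
```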

Configuration

  • Default configuration:
    • Data files:
      • Split: train
      • Path: data/train-*
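Since the download is roughly 15.6 GB, streaming the single `train` split may be more practical than fetching everything up front. The sketch below assumes the Hugging Face `datasets` package is installed and that the repository ID from the listing title resolves on the Hub. The helper names `stream_train` and `first_records` are hypothetical.

```python
from itertools import islice
from typing import Any, Iterable

REPO_ID = "cmarkea/mmarco-contrastive"


def stream_train() -> Iterable[Any]:
    """Open the train split in streaming mode (needs network access)."""
    # Imported lazily so first_records() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=True)


def first_records(records: Iterable[Any], n: int) -> list[Any]:
    """Materialize only the first n items of any iterable of records."""
    return list(islice(records, n))


# Usage, assuming each record arrives as nested dicts matching the schema:
# for rec in first_records(stream_train(), 2):
#     print(rec["id"], rec["query"]["english"], len(rec["negatives"]))
```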

Task categories

  • Translation
  • Text classification
  • Feature extraction

Languages

  • French (fr)
  • English (en)

Size category

  • 100K<n<1M