cmarkea/mmarco-contrastive

Name: cmarkea/mmarco-contrastive
Creator: cmarkea
Published: 2024-04-19 16:12:31
License: Apache-2.0

Hugging Face | Updated 2024-04-19 | Indexed 2024-04-19

Download link:

https://hf-mirror.com/datasets/cmarkea/mmarco-contrastive


Resource overview:

---
license: apache-2.0
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: query
    struct:
    - name: english
      dtype: string
    - name: french
      dtype: string
  - name: positive
    struct:
    - name: english
      dtype: string
    - name: french
      dtype: string
  - name: negatives
    list:
    - name: english
      dtype: string
    - name: french
      dtype: string
    - name: score
      dtype: float64
  splits:
  - name: train
    num_bytes: 30850551179
    num_examples: 398792
  download_size: 15626428403
  dataset_size: 30850551179
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- translation
- text-classification
- feature-extraction
language:
- fr
- en
size_categories:
- 100K<n<1M
---

# mMARCO-contrastive

The dataset is a modification of [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) restricted to its French and English parts. The aim is to train a bi-encoder model using all the hard negatives in the dataset. Instead of storing query/positive/negative triplets, each record pairs a query and its positive with all of its negatives. Note, however, that the dataset contains many false negatives. This is a minor issue in the triplet view, where false negatives are far fewer, but it is more significant with this arrangement. Each query/negative pair is scored by the reranking model [cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking), which assigns a value from 0 to 1, so it is easy to apply a filter that limits false negatives. In total, the dataset consists of 398,792 queries with their associated positive contexts and 39,595,191 negative contexts.

## Note

The text encoding of mMARCO is in `latin1`. Converting the text to `utf-8` can be done by re-encoding it as follows:

```python
def to_utf8(txt: str) -> str:
    # Recover the raw bytes via latin1, then decode those bytes as UTF-8.
    return txt.encode('latin1').decode('utf-8')
```
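The score-based filtering mentioned in the description could look roughly like the sketch below. It assumes that a higher `score` means the reranker considers the negative more relevant to the query (and therefore more likely a false negative); the card does not state the direction explicitly. The function name `filter_negatives`, the parameter `max_score`, and the 0.5 cut-off are illustrative choices, not values taken from the card.

```python
from typing import Any


def filter_negatives(example: dict[str, Any], max_score: float = 0.5) -> dict[str, Any]:
    """Return a copy of `example` keeping only negatives scored below `max_score`.

    Assumption: a high reranker score marks a probable false negative.
    """
    kept = [neg for neg in example["negatives"] if neg["score"] < max_score]
    return {**example, "negatives": kept}


# Hand-written record that follows the dataset schema (placeholder texts).
example = {
    "id": 0,
    "query": {"english": "what is a bi-encoder", "french": "qu'est-ce qu'un bi-encodeur"},
    "positive": {"english": "positive passage", "french": "passage positif"},
    "negatives": [
        {"english": "unrelated passage", "french": "passage sans rapport", "score": 0.02},
        {"english": "relevant passage", "french": "passage pertinent", "score": 0.97},
    ],
}

filtered = filter_negatives(example, max_score=0.5)
print(len(filtered["negatives"]))  # prints 1
```

The function returns a new dictionary and leaves the input record unchanged, so the same record can be filtered again with a different threshold.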

Provider:

cmarkea

Summary of original information

Dataset overview

Dataset information

License: Apache-2.0
Features:
- id: integer (int64)
- query: struct
  - english: string
  - french: string
- positive: struct
  - english: string
  - french: string
- negatives: list of structs
  - english: string
  - french: string
  - score: float (float64)
Splits:
- train:
  - Size in bytes: 30,850,551,179
  - Number of examples: 398,792
Download size: 15,626,428,403 bytes
Dataset size: 30,850,551,179 bytes
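As an illustration of the feature layout listed above, the record structure can be mirrored with typed dictionaries, and one record can be expanded back into the query/positive/negative triplets that the card contrasts it with. The class names `Bilingual`, `Negative`, and `Example` and the helper `to_triplets` are hypothetical names introduced for this sketch; they are not part of the dataset or of any library.

```python
from typing import Iterator, List, Literal, Tuple, TypedDict

Lang = Literal["english", "french"]


class Bilingual(TypedDict):
    english: str
    french: str


class Negative(Bilingual):
    score: float  # reranker score between 0 and 1


class Example(TypedDict):
    id: int
    query: Bilingual
    positive: Bilingual
    negatives: List[Negative]


def to_triplets(example: Example, lang: Lang = "french") -> Iterator[Tuple[str, str, str]]:
    """Yield one (query, positive, negative) triplet per negative in the record."""
    query = example["query"][lang]
    positive = example["positive"][lang]
    for negative in example["negatives"]:
        yield query, positive, negative[lang]


record: Example = {
    "id": 1,
    "query": {"english": "q-en", "french": "q-fr"},
    "positive": {"english": "p-en", "french": "p-fr"},
    "negatives": [
        {"english": "n1-en", "french": "n1-fr", "score": 0.1},
        {"english": "n2-en", "french": "n2-fr", "score": 0.4},
    ],
}

triplets = list(to_triplets(record, lang="english"))
print(triplets[0])  # prints ('q-en', 'p-en', 'n1-en')
```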

Configuration

Default configuration:
- Data files:
  - Split: train
  - Path: data/train-*
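With a single `train` split in the default configuration and a download of roughly 15.6 GB, one option is to stream records instead of downloading everything first. The sketch below assumes the Hugging Face `datasets` library is installed and that the repository id is `cmarkea/mmarco-contrastive`, as in the download link; `stream_train` and `preview` are hypothetical helper names.

```python
from itertools import islice
from typing import Any, Iterable, List


def stream_train() -> Iterable[dict[str, Any]]:
    """Open the train split in streaming mode (needs network access)."""
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("cmarkea/mmarco-contrastive", split="train", streaming=True)


def preview(records: Iterable[dict[str, Any]], n: int = 3) -> List[dict[str, Any]]:
    """Return the first `n` records of any iterable of records."""
    return list(islice(records, n))
```

A call such as `preview(stream_train(), 3)` would then fetch only the first few records, which is usually enough to inspect the `query`, `positive`, and `negatives` fields.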

Task categories

Translation
Text classification
Feature extraction

Languages

French (fr)
English (en)

Size category

100K<n<1M
