cmarkea/mmarco-contrastive
Source: Hugging Face. Updated 2024-04-19; indexed 2024-04-19.
Download link:
https://hf-mirror.com/datasets/cmarkea/mmarco-contrastive
Resource description:
---
license: apache-2.0
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: query
    struct:
    - name: english
      dtype: string
    - name: french
      dtype: string
  - name: positive
    struct:
    - name: english
      dtype: string
    - name: french
      dtype: string
  - name: negatives
    list:
    - name: english
      dtype: string
    - name: french
      dtype: string
    - name: score
      dtype: float64
  splits:
  - name: train
    num_bytes: 30850551179
    num_examples: 398792
  download_size: 15626428403
  dataset_size: 30850551179
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- translation
- text-classification
- feature-extraction
language:
- fr
- en
size_categories:
- 100K<n<1M
---
# mMARCO-contrastive
This dataset is a modification of [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) restricted to its French and English parts. The aim is to train a
bi-encoder model using all hard negatives from the dataset. Instead of storing query/positive/negative triplets, each record pairs a query and its positive
with all of its negatives. Note, however, that the dataset contains many false negatives. This is a minor issue in a triplet view, where false negatives
are far fewer, but it is more significant with this arrangement. Each query/negative pair is therefore scored by the reranking model
[cmarkea/bloomz-560m-reranking](https://huggingface.co/cmarkea/bloomz-560m-reranking), which assigns a value from 0 to 1, so a filter on this score can be
applied to limit false negatives.
In total, the dataset consists of 398,792 queries with their associated positive contexts and 39,595,191 negative contexts.
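The score-based filtering described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that a higher reranker score means the passage looks more relevant to the query (and is therefore more likely to be a false negative), and that `negatives` is exposed as a list of dictionaries. The helper names `filter_negatives` and `to_triplets`, the threshold of `0.5`, and the toy record are all hypothetical.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def filter_negatives(record: Record, max_score: float = 0.5) -> Record:
    """Return a copy of `record` keeping only negatives scored at or below `max_score`.

    Assumption: a high score flags a likely false negative. The 0.5 default
    is an arbitrary illustration, not a recommended value.
    """
    kept = [neg for neg in record["negatives"] if neg["score"] <= max_score]
    return {**record, "negatives": kept}


def to_triplets(record: Record, lang: str = "french") -> List[Tuple[str, str, str]]:
    """Expand one record into (query, positive, negative) triplets for one language."""
    query = record["query"][lang]
    positive = record["positive"][lang]
    return [(query, positive, neg[lang]) for neg in record["negatives"]]


# Made-up toy record that mirrors the schema; it is not taken from the dataset.
example = {
    "id": 0,
    "query": {
        "english": "what is the capital of france",
        "french": "quelle est la capitale de la france",
    },
    "positive": {
        "english": "Paris is the capital of France.",
        "french": "Paris est la capitale de la France.",
    },
    "negatives": [
        {"english": "Lyon is a city in France.",
         "french": "Lyon est une ville de France.", "score": 0.12},
        {"english": "The capital of France is Paris.",
         "french": "La capitale de la France est Paris.", "score": 0.97},
    ],
}

filtered = filter_negatives(example, max_score=0.5)
triplets = to_triplets(filtered, lang="french")
```

Here the second negative restates the answer, so it is dropped by the threshold and only one triplet remains. A suitable threshold would need to be chosen by inspecting the score distribution of the real data.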
## Note
The text encoding of mMARCO is in `latin1`. Converting the text to `utf-8` can be done by re-encoding it as follows:
```python
def to_utf8(txt: str) -> str:
    # Recover the original bytes via latin1, then decode them as UTF-8.
    return txt.encode('latin1').decode('utf-8')
```
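The conversion above raises an error on strings that are already correctly decoded. A more defensive variant might look like the sketch below, which falls back to the original string when the round trip fails and applies the fix to every text field of a record. The names `to_utf8_safe` and `fix_record` are hypothetical, and the sketch assumes records follow the schema in the card metadata, with `negatives` as a list of dictionaries.

```python
def to_utf8_safe(txt: str) -> str:
    """Re-encode latin1-decoded text as UTF-8, leaving other strings untouched."""
    try:
        return txt.encode("latin1").decode("utf-8")
    except UnicodeError:
        # Covers both UnicodeEncodeError (characters outside latin1) and
        # UnicodeDecodeError (bytes that are not valid UTF-8).
        return txt


def fix_record(record: dict) -> dict:
    """Return a copy of `record` with every string field passed through `to_utf8_safe`."""
    fixed = dict(record)
    for field in ("query", "positive"):
        fixed[field] = {lang: to_utf8_safe(text) for lang, text in record[field].items()}
    fixed["negatives"] = [
        # Only strings are converted; the float `score` is copied unchanged.
        {key: to_utf8_safe(val) if isinstance(val, str) else val for key, val in neg.items()}
        for neg in record["negatives"]
    ]
    return fixed


print(to_utf8_safe("cafÃ©"))
print(to_utf8_safe("café"))
```

The fallback is a heuristic: a correctly decoded string that happens to form valid UTF-8 after the round trip would still be altered, so it is worth spot-checking a sample of the output.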
Provider:
cmarkea
Original information summary
Dataset overview
Dataset information
- License: Apache-2.0
- Features:
  - id: integer (int64)
  - query: struct
    - english: string
    - french: string
  - positive: struct
    - english: string
    - french: string
  - negatives: list
    - english: string
    - french: string
    - score: float (float64)
- Splits:
  - train:
    - Size in bytes: 30,850,551,179
    - Number of examples: 398,792
- Download size: 15,626,428,403 bytes
- Dataset size: 30,850,551,179 bytes
Configuration
- Default configuration:
  - Data files:
    - Split: train
    - Path: data/train-*
Task categories
- Translation
- Text classification
- Feature extraction
Languages
- French (fr)
- English (en)
Size category
- 100K<n<1M
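
Given the download size of roughly 15.6 GB, it may be convenient to inspect a few records before fetching everything. The sketch below assumes the Hugging Face `datasets` library is installed and that the dataset is reachable under the identifier shown at the top of this page. The helper name `stream_records` is hypothetical.

```python
from itertools import islice

DATASET_ID = "cmarkea/mmarco-contrastive"


def stream_records(limit: int = 3):
    """Yield the first `limit` training records without downloading the full dataset."""
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an iterable that fetches data on demand.
    stream = load_dataset(DATASET_ID, split="train", streaming=True)
    yield from islice(stream, limit)
```

Iterating over `stream_records()` requires network access. Each yielded record should expose the `id`, `query`, `positive` and `negatives` fields listed above.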



