tomaarsen/mining_demo
Source: Hugging Face | Updated: 2024-07-03 | Indexed: 2024-06-29
Download link:
https://hf-mirror.com/datasets/tomaarsen/mining_demo
Description:
This dataset is a collection of query-answer-negative triplets from the gooaq dataset that can be used directly with Sentence Transformers to train embedding models. The negatives were mined with the following parameters: range (skip the 10 most similar samples and consider only the top 20 most similar samples), margin (negative similarity + margin < positive similarity), sampling strategy (random), and number of negatives per question-answer pair (3). Each row has query, answer, and negative fields, all of string type.
Provided by:
tomaarsen
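Original information summary
Dataset overview
Dataset information
- Language: English
- Features:
  query: string
  answer: string
- Splits:
  train: 30286 bytes, 100 examples
- Download size: 24527 bytes
- Dataset size: 30286 bytes
- Configs:
  default: training data files at data/train-*
- Tags:
  sentence-transformers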
Dataset format
- Columns: query, answer, negative
- Column types: string, string, string
- Example:
  {
    "query": "is toprol xl the same as metoprolol?",
    "answer": "Metoprolol succinate is also known by the brand name Toprol XL. It is the extended-release form of metoprolol. Metoprolol succinate is approved to treat high blood pressure, chronic chest pain, and congestive heart failure."
  }
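The column listing can be turned into a small schema check. The sketch below uses only the Python standard library and assumes each row arrives as a plain dictionary. The helper names `validate_row` and `to_triplet` are hypothetical, and the `negative` value is a placeholder, since the example on this page does not show that field.

```python
# Columns and types as listed on this page: three string fields.
EXPECTED_COLUMNS = ("query", "answer", "negative")


def validate_row(row):
    """Return a list of problems for one row; an empty list means it fits the listed schema."""
    problems = []
    for column in EXPECTED_COLUMNS:
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], str):
            problems.append(f"{column} is {type(row[column]).__name__}, expected str")
    return problems


def to_triplet(row):
    """Convert a valid row into a (query, answer, negative) tuple."""
    problems = validate_row(row)
    if problems:
        raise ValueError("; ".join(problems))
    return tuple(row[column] for column in EXPECTED_COLUMNS)


# The query and answer come from the example above (answer shortened);
# the negative is a hypothetical placeholder, not real data.
row = {
    "query": "is toprol xl the same as metoprolol?",
    "answer": "Metoprolol succinate is also known by the brand name Toprol XL.",
    "negative": "<hypothetical mined negative passage>",
}

print(validate_row(row))
print(to_triplet(row)[0])
```

A check like this is mainly useful before handing rows to a training loop, where a missing or non-string field would otherwise surface as a harder-to-read error later.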
Negative mining parameters
- range_min: 10
- range_max: 20
- margin: 0.1
- sampling_strategy: random
- num_negatives: 3
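To make the interaction of these parameters concrete, here is a minimal standard-library sketch of one way such a selection could work for a single query. It is an illustration under stated assumptions, not the procedure actually used to build the dataset: the function `mine_negatives`, its `seed` argument, and the similarity scores are all hypothetical, and only the parameter names and values come from the listing above.

```python
import random


def mine_negatives(positive_score, candidate_scores, range_min=10, range_max=20,
                   margin=0.1, num_negatives=3, seed=None):
    """Pick negatives for one query (illustrative sketch).

    candidate_scores maps candidate text -> similarity to the query.
    """
    # Rank candidates from most to least similar to the query.
    ranked = sorted(candidate_scores.items(), key=lambda item: item[1], reverse=True)
    # Skip the range_min most similar and stop at the range_max-th most similar.
    window = ranked[range_min:range_max]
    # Margin rule from the description: negative similarity + margin < positive similarity.
    eligible = [text for text, score in window if score + margin < positive_score]
    # sampling_strategy "random": draw without replacement from the eligible candidates.
    rng = random.Random(seed)
    return rng.sample(eligible, min(num_negatives, len(eligible)))


# Hypothetical scores: 30 candidates with similarity decreasing from 0.95 in steps of 0.01.
scores = {f"c{i}": 0.95 - 0.01 * i for i in range(30)}
negatives = mine_negatives(positive_score=0.905, candidate_scores=scores, seed=0)
print(negatives)
```

In this toy setup, ranks 10 through 19 form the window, and the margin rule removes the candidates in that window that are still too close to the positive, so fewer than `num_negatives` results can come back when few candidates qualify.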



