wenbopan/RefGPT-Fact-v2-8x
Hugging Face · Updated 2024-03-19 · Indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/wenbopan/RefGPT-Fact-v2-8x
Resource description:
---
license: apache-2.0
dataset_info:
  features:
  - name: dialogue
    dtype: string
  - name: reference
    dtype: string
  - name: language
    dtype: string
  - name: type
    dtype: string
  splits:
  - name: en
    num_bytes: 303791610
    num_examples: 7322
  - name: zh
    num_bytes: 105454150
    num_examples: 7582
  download_size: 232926198
  dataset_size: 409245760
configs:
- config_name: default
  data_files:
  - split: en
    path: data/en-*
  - split: zh
    path: data/zh-*
---
# Dataset Card for "RefGPT-Fact-v2-8x"
This is a lengthened version of [Mutonix/RefGPT-Fact-v2](https://huggingface.co/datasets/Mutonix/RefGPT-Fact-v2). The `reference` field of each sample is, on average, about 8 times as long as in the original sample. Correspondingly, the dataset is subsampled to 1/8 of its original size.
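
For readers who want to inspect the data, here is a minimal loading sketch. It assumes the third-party `datasets` library is installed and that the Hub repository id matches the title of this page; `SPLIT_SIZES` and `load_split` are hypothetical helper names introduced only for illustration.

```python
# Example counts copied from the card metadata (num_examples per split).
SPLIT_SIZES = {"en": 7322, "zh": 7582}


def load_split(name):
    """Load one split ("en" or "zh") of the dataset from the Hugging Face Hub.

    Hypothetical helper: the repository id is taken from the page title.
    """
    if name not in SPLIT_SIZES:
        raise ValueError(f"unknown split {name!r}; expected one of {sorted(SPLIT_SIZES)}")
    # Imported lazily so the module can be read without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("wenbopan/RefGPT-Fact-v2-8x", split=name)
```

Calling `load_split("en")` downloads the English split on first use. Each row should then expose the four string fields listed in the metadata: `dialogue`, `reference`, `language`, and `type`.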
## Data Construction
Each reference is upsampled by a factor of K, where:
$$
K \sim \text{Poisson}(\lambda=8)
$$
To lengthen a reference, the original text is shuffled together with K - 1 additional paragraphs. These K - 1 paragraphs are the nearest samples to the target text, where distances are calculated using OpenAI's `text-embedding-3-small` embeddings and FAISS.
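
The construction described above can be sketched as follows. This is an illustrative reconstruction, not the author's script: a brute-force cosine-distance search stands in for FAISS, plain Python lists stand in for the `text-embedding-3-small` vectors, and the function names, the paragraph separator, and the handling of K = 0 are all assumptions.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest(target, embeddings, n):
    """Indices of the n samples closest to `target` (brute force, FAISS stand-in)."""
    others = [i for i in range(len(embeddings)) if i != target]
    others.sort(key=lambda i: cosine_distance(embeddings[target], embeddings[i]))
    return others[:n]


def lengthen_reference(target, references, embeddings, lam=8.0, rng=None):
    """Shuffle the target reference together with its K - 1 nearest references."""
    rng = rng or random.Random()
    k = sample_poisson(lam, rng)
    # Assumption: when K is 0 or 1, the reference is left unchanged.
    neighbours = nearest(target, embeddings, max(k - 1, 0))
    paragraphs = [references[target]] + [references[i] for i in neighbours]
    rng.shuffle(paragraphs)
    # Assumption: paragraphs are joined with a blank line.
    return "\n\n".join(paragraphs)
```

Because K has mean 8, the lengthened reference contains about 8 paragraphs on average, which matches the "8x" in the dataset name. At the scale of the real dataset an approximate or indexed search such as FAISS replaces the sort over all samples.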
Provider:
wenbopan
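Summary of original information
Dataset overview
Dataset name
RefGPT-Fact-v2-8x
Dataset information
- License: Apache-2.0
- Features:
  - dialogue: string
  - reference: string
  - language: string
  - type: string
- Splits:
  - en: 303791610 bytes, 7322 examples
  - zh: 105454150 bytes, 7582 examples
- Download size: 232926198
- Dataset size: 409245760
Configuration
- Default configuration:
  - en: data files at data/en-*
  - zh: data files at data/zh-*
Data construction
- The reference field is upsampled by a factor K drawn from a Poisson distribution (λ=8). Each reference text is extended with K-1 paragraphs taken from the samples nearest to the target text, with distances computed using OpenAI's text-embedding-3-small and FAISS.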



