dustalov/pierogue
Source: Hugging Face. Updated 2024-03-30; indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/dustalov/pierogue
Description:
---
annotations_creators:
- machine-generated
language:
- en
language_creators:
- machine-generated
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: Pierogue
size_categories:
- n<1K
source_datasets:
- original
tags:
- cosmos
- nature
- music
- technology
- fashion
- education
- qrels
- queries
- documents
task_categories:
- text-retrieval
- feature-extraction
- text-generation
task_ids:
- document-retrieval
- language-modeling
dataset_info:
- config_name: documents
  features:
  - name: document_id
    dtype: int8
  - name: topic
    dtype:
      class_label:
        names:
          '0': cosmos
          '1': nature
          '2': music
          '3': technology
          '4': fashion
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8125
    num_examples: 10
  - name: test
    num_bytes: 6743
    num_examples: 5
- config_name: queries
  features:
  - name: query_id
    dtype: int8
  - name: topic
    dtype:
      class_label:
        names:
          '0': cosmos
          '1': nature
          '2': music
          '3': technology
          '4': fashion
  - name: query
    dtype: string
  splits:
  - name: train
    num_bytes: 2728
    num_examples: 25
  - name: test
    num_bytes: 2280
    num_examples: 10
- config_name: qrels
  features:
  - name: query_id
    dtype: int8
  - name: document_id
    dtype: int8
  - name: relevancy
    dtype: int8
  splits:
  - name: train
    num_bytes: 2109
    num_examples: 375
  - name: test
    num_bytes: 1951
    num_examples: 150
- config_name: embeddings
  features:
  - name: word
    dtype: string
  - name: embedding
    sequence: float32
  splits:
  - name: train
    num_bytes: 300741
    num_examples: 566
- config_name: relatedness
  features:
  - name: word1
    dtype: string
  - name: word2
    dtype: string
  - name: score
    dtype: float64
  - name: rank
    dtype: int16
  splits:
  - name: train
    num_bytes: 6522
    num_examples: 100
  - name: test
    num_bytes: 6294
    num_examples: 100
- config_name: analogies
  features:
  - name: a
    dtype: string
  - name: c
    dtype: string
  - name: b
    dtype: string
  - name: d
    dtype: string
  splits:
  - name: train
    num_bytes: 3598
    num_examples: 8
configs:
- config_name: documents
  data_files:
  - split: train
    path: documents/train*.parquet
  - split: test
    path: documents/test*.parquet
  default: true
- config_name: queries
  data_files:
  - split: train
    path: queries/train*.parquet
  - split: test
    path: queries/test*.parquet
- config_name: qrels
  data_files:
  - split: train
    path: qrels/train*.parquet
  - split: test
    path: qrels/test*.parquet
- config_name: embeddings
  data_files: embeddings.parquet
- config_name: relatedness
  data_files:
  - split: train
    path: relatedness/train*.parquet
  - split: test
    path: relatedness/test*.parquet
- config_name: analogies
  data_files: analogies.parquet
---
# Pierogue
**Pierogue** is a small, openly licensed, machine-generated dataset designed for educational purposes. It contains fifteen short English texts covering five topics, together with relevance judgements (qrels).
- Topics: cosmos, nature, music, technology, fashion
- Splits: `train` (10 documents, 375 qrels) and `test` (5 documents, 150 qrels)
Texts were generated by ChatGPT 3.5. Queries, qrels, and analogies were generated by GPT-4. Words were provided with Word2Vec embeddings based on the Google News dataset.
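
The configurations and splits declared in the card metadata can be loaded with the Hugging Face `datasets` library. The following is a minimal sketch, assuming `datasets` is installed and the Hub is reachable; the helper name `load_pierogue` and the `CONFIG_SPLITS` table are illustrative and not part of the dataset.

```python
# Configurations and their splits, copied from the card metadata.
CONFIG_SPLITS = {
    "documents": ("train", "test"),
    "queries": ("train", "test"),
    "qrels": ("train", "test"),
    "embeddings": ("train",),
    "relatedness": ("train", "test"),
    "analogies": ("train",),
}


def load_pierogue(config="documents", split="train"):
    """Load one configuration and split of dustalov/pierogue (hypothetical helper)."""
    if split not in CONFIG_SPLITS.get(config, ()):
        raise ValueError(f"unknown config/split: {config}/{split}")
    # Imported lazily so the table above is usable without the library.
    from datasets import load_dataset

    return load_dataset("dustalov/pierogue", config, split=split)
```

Under these assumptions, `load_pierogue("qrels", "test")` should return the 150 test judgements listed in the metadata.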

Provider:
dustalov
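Summary of original information
Pierogue dataset overview
Basic information
- Dataset name: Pierogue
- Language: English
- Data creators: machine-generated
- License: CC BY 4.0
- Multilinguality: monolingual
- Dataset size: fewer than 1K examples
- Source datasets: original
- Tags: cosmos, nature, music, technology, fashion, education, qrels, queries, documents
- Task categories: text retrieval, feature extraction, text generation
- Task IDs: document retrieval, language modeling
Dataset configurations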
Documents (documents)
- Features:
  - document_id: integer
  - topic: class label (cosmos, nature, music, technology, fashion)
  - text: string
- Splits:
  - train: 8125 bytes, 10 examples
  - test: 6743 bytes, 5 examples
Queries (queries)
- Features:
  - query_id: integer
  - topic: class label (cosmos, nature, music, technology, fashion)
  - query: string
- Splits:
  - train: 2728 bytes, 25 examples
  - test: 2280 bytes, 10 examples
Relevance judgements (qrels)
- Features:
  - query_id: integer
  - document_id: integer
  - relevancy: integer
- Splits:
  - train: 2109 bytes, 375 examples
  - test: 1951 bytes, 150 examples
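
The qrel counts are consistent with every query being judged against all fifteen documents (25 × 15 = 375 and 10 × 15 = 150), although the card does not state this. As an illustration of how such judgements are typically used, here is a minimal sketch of two standard retrieval metrics. The rows in `QRELS` are made up, and treating `relevancy > 0` as "relevant" is an assumption, since the card only gives the column type.

```python
from collections import defaultdict

# Hypothetical toy judgements as (query_id, document_id, relevancy) rows.
# Assumption: relevancy > 0 means the document is relevant to the query.
QRELS = [
    (0, 0, 1), (0, 1, 0), (0, 2, 1), (0, 3, 0),
    (1, 0, 0), (1, 1, 1), (1, 2, 0), (1, 3, 0),
]


def relevant_documents(qrels):
    """Group the ids of relevant documents by query id."""
    relevant = defaultdict(set)
    for query_id, document_id, relevancy in qrels:
        if relevancy > 0:
            relevant[query_id].add(document_id)
    return relevant


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    hits, total = 0, 0.0
    for rank, document_id in enumerate(ranking, start=1):
        if document_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

Here `ranking` is a list of document ids ordered by a retrieval system for a single query, best first.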
Embeddings (embeddings)
- Features:
  - word: string
  - embedding: sequence of floats
- Splits:
  - train: 300741 bytes, 566 examples
Relatedness (relatedness)
- Features:
  - word1: string
  - word2: string
  - score: float
  - rank: integer
- Splits:
  - train: 6522 bytes, 100 examples
  - test: 6294 bytes, 100 examples
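
Word-pair scores of this kind are commonly used to evaluate embeddings by rank-correlating them with cosine similarities. The sketch below computes a Spearman correlation with the standard library only. The card does not document how `score` and `rank` were produced, so the toy vectors, the pair scores, and the "higher score means more related" reading are all assumptions.

```python
import math

# Hypothetical 2-d vectors and relatedness scores, made up for illustration.
VECTORS = {
    "sun": [1.0, 0.1], "star": [0.9, 0.2],
    "guitar": [0.1, 1.0], "piano": [0.3, 0.9],
}
PAIRS = [("sun", "star", 0.9), ("guitar", "piano", 0.8),
         ("sun", "guitar", 0.1), ("star", "piano", 0.2)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


predicted = [cosine(VECTORS[w1], VECTORS[w2]) for w1, w2, _ in PAIRS]
gold = [score for _, _, score in PAIRS]
correlation = spearman(predicted, gold)
```

A correlation near 1 would indicate that the embeddings order the pairs the same way the scores do.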
Analogies (analogies)
- Features:
  - a: string
  - c: string
  - b: string
  - d: string
- Splits:
  - train: 3598 bytes, 8 examples
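
Analogy sets are often solved with vector arithmetic over word embeddings. The sketch below uses the common offset method, where the answer to "a is to b as c is to d" is the word closest to `b - a + c`. The card does not document what the columns `a`, `b`, `c`, and `d` mean, so this reading is an assumption, and the toy vectors and the name `solve_analogy` are made up.

```python
import math

# Hypothetical 2-d vectors; real Word2Vec embeddings have many more dimensions.
VECTORS = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.5, 0.9],
    "woman": [0.5, 0.3],
    "guitar": [0.1, 0.5],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def solve_analogy(a, b, c, vectors):
    """Return the word d maximizing cosine(d, b - a + c), excluding a, b, and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in {a, b, c})
    return max(candidates, key=lambda word: cosine(vectors[word], target))
```

With these toy vectors, `solve_analogy("man", "king", "woman", VECTORS)` selects `"queen"`.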



