LongLaMP/LongLaMP
Hugging Face · Updated 2024-10-26 · Indexed 2025-04-19
Download link:
https://hf-mirror.com/datasets/LongLaMP/LongLaMP
Resource description:
---
dataset_info:
- config_name: abstract_generation_temporal
  features:
  - name: name
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: profile
    list:
    - name: abstract
      dtype: string
    - name: id
      dtype: string
    - name: title
      dtype: string
    - name: year
      dtype: int64
  splits:
  - name: train
    num_bytes: 3031120943
    num_examples: 22822
  - name: val
    num_bytes: 607783744
    num_examples: 4565
  - name: test
    num_bytes: 599120779
    num_examples: 4564
  download_size: 2141171404
  dataset_size: 4238025466
- config_name: abstract_generation_user
  features:
  - name: name
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: profile
    list:
    - name: abstract
      dtype: string
    - name: id
      dtype: string
    - name: title
      dtype: string
    - name: year
      dtype: int64
  splits:
  - name: train
    num_bytes: 1855402366
    num_examples: 13693
  - name: val
    num_bytes: 612108287
    num_examples: 4562
  - name: test
    num_bytes: 620417451
    num_examples: 4560
  download_size: 1578664376
  dataset_size: 3087928104
- config_name: product_review_temporal
  features:
  - name: reviewerId
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: profile
    list:
    - name: description
      dtype: string
    - name: overall
      dtype: string
    - name: reviewText
      dtype: string
    - name: summary
      dtype: string
  splits:
  - name: train
    num_bytes: 1444326460
    num_examples: 16197
  - name: val
    num_bytes: 167208689
    num_examples: 1831
  - name: test
    num_bytes: 159206958
    num_examples: 1784
  download_size: 1039534145
  dataset_size: 1770742107
- config_name: product_review_user
  features:
  - name: reviewerId
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: profile
    list:
    - name: description
      dtype: string
    - name: overall
      dtype: string
    - name: reviewText
      dtype: string
    - name: summary
      dtype: string
  splits:
  - name: train
    num_bytes: 1222656833
    num_examples: 14745
  - name: val
    num_bytes: 147559402
    num_examples: 1826
  - name: test
    num_bytes: 168661273
    num_examples: 1822
  download_size: 911471721
  dataset_size: 1538877508
- config_name: topic_writing_temporal
  features:
  - name: author
    dtype: string
  - name: output
    dtype: string
  - name: input
    dtype: string
  - name: profile
    list:
    - name: author
      dtype: string
    - name: content
      dtype: string
    - name: id
      dtype: string
    - name: summary
      dtype: string
  splits:
  - name: train
    num_bytes: 1542333778
    num_examples: 16347
  - name: val
    num_bytes: 108126082
    num_examples: 2452
  - name: test
    num_bytes: 113545780
    num_examples: 2452
  download_size: 1064723500
  dataset_size: 1764005640
- config_name: topic_writing_user
  features:
  - name: author
    dtype: string
  - name: output
    dtype: string
  - name: input
    dtype: string
  - name: profile
    list:
    - name: author
      dtype: string
    - name: content
      dtype: string
    - name: id
      dtype: string
    - name: summary
      dtype: string
  splits:
  - name: train
    num_bytes: 1366090142
    num_examples: 11442
  - name: val
    num_bytes: 114610867
    num_examples: 2453
  - name: test
    num_bytes: 114795391
    num_examples: 2452
  download_size: 961434879
  dataset_size: 1595496400
configs:
- config_name: abstract_generation_temporal
  data_files:
  - split: train
    path: abstract_generation_temporal/train-*
  - split: val
    path: abstract_generation_temporal/val-*
  - split: test
    path: abstract_generation_temporal/test-*
- config_name: abstract_generation_user
  data_files:
  - split: train
    path: abstract_generation_user/train-*
  - split: val
    path: abstract_generation_user/val-*
  - split: test
    path: abstract_generation_user/test-*
- config_name: product_review_temporal
  data_files:
  - split: train
    path: product_review_temporal/train-*
  - split: val
    path: product_review_temporal/val-*
  - split: test
    path: product_review_temporal/test-*
- config_name: product_review_user
  data_files:
  - split: train
    path: product_review_user/train-*
  - split: val
    path: product_review_user/val-*
  - split: test
    path: product_review_user/test-*
- config_name: topic_writing_temporal
  data_files:
  - split: train
    path: topic_writing_temporal/train-*
  - split: val
    path: topic_writing_temporal/val-*
  - split: test
    path: topic_writing_temporal/test-*
- config_name: topic_writing_user
  data_files:
  - split: train
    path: topic_writing_user/train-*
  - split: val
    path: topic_writing_user/val-*
  - split: test
    path: topic_writing_user/test-*
task_categories:
- text-generation
- summarization
language:
- en
---
# LongLaMP Dataset
## Dataset Description
- **Repository:** https://longlamp-benchmark.github.io/
- **Paper:** https://www.arxiv.org/abs/2407.11016
## Dataset Summary
LongLaMP is a comprehensive benchmark for personalized long-form text generation. The dataset is designed to evaluate and improve the performance of language models in generating extended, personalized content across various domains and tasks.
Our dataset consists of multiple tasks focusing on different aspects of long-form text generation, including:
1. Personalized Email Completion
2. Personalized Abstract Generation
3. Personalized Review Writing
4. Personalized Topic Writing
Each task in LongLaMP is carefully curated to challenge language models in producing coherent, contextually relevant, and personalized long-form text. The dataset provides a robust framework for testing and developing personalization techniques in language models.
Details about the dataset construction, task specifications, and evaluation metrics can be found in our technical paper: https://www.arxiv.org/abs/2407.11016
## Accessing the Dataset
You can download the dataset using the Hugging Face datasets library. Here's an example of how to load the product review dataset for the user-based setting:
```python
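from datasets import load_dataset

# Load the train split of the product review task, user-based setting.
ds = load_dataset(
    "LongLaMP/LongLaMP",
    "product_review_user",
    split="train",
    token=True,  # replaces the deprecated `use_auth_token`; uses your saved Hugging Face login
)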
```
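Each record pairs a task `input` and a target `output` with a `profile` list of the user's earlier items. As a rough sketch of how the profile might be used, the snippet below prepends a few past reviews to the input to form a personalized prompt. The field names come from the `product_review_user` metadata above; the `build_prompt` helper, the prompt wording, and the toy record are illustrative assumptions, not part of the dataset or the paper's method.

```python
from typing import Any


def build_prompt(record: dict[str, Any], k: int = 2) -> str:
    """Prepend up to k past reviews from `profile` to the task input.

    Hypothetical helper: taking the first k profile items is a simple
    stand-in for whatever retrieval strategy you actually use.
    """
    history = []
    for item in record["profile"][:k]:
        history.append(
            f"Rating: {item['overall']}\n"
            f"Summary: {item['summary']}\n"
            f"Review: {item['reviewText']}"
        )
    context = "\n\n".join(history)
    return f"Past reviews by this user:\n{context}\n\nTask:\n{record['input']}"


# Made-up record that only mirrors the schema; not real dataset content.
toy_record = {
    "reviewerId": "user-0",
    "input": "Write a review for the product described here.",
    "output": "placeholder target review",
    "profile": [
        {"description": "d1", "overall": "5.0", "reviewText": "text one", "summary": "s1"},
        {"description": "d2", "overall": "3.0", "reviewText": "text two", "summary": "s2"},
        {"description": "d3", "overall": "1.0", "reviewText": "text three", "summary": "s3"},
    ],
}

prompt = build_prompt(toy_record, k=2)
print(prompt)
```

With a real split, the same function could be applied to `ds[0]` after loading. Profiles can be long, so some form of selection or truncation is likely needed before the prompt fits a model's context window.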
## Dataset Structure
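The LongLaMP dataset is organized by task and setting. The metadata above lists configurations for three tasks (`abstract_generation`, `product_review`, `topic_writing`), named `<task>_<setting>` (for example, `product_review_user`). Each task has two settings: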
1. User Setting: Evaluates personalized text generation for new users
2. Temporal Setting: Evaluates generating the latest content for previously seen users
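The two settings map onto configuration names, and the user setting implies that test users should not appear in training. A minimal sketch of both ideas follows. The task and setting names are taken from the metadata above; the helper functions are hypothetical, and the choice of `name`, `reviewerId`, and `author` as the per-task user identifier is an assumption based on the field names alone.

```python
TASKS = ("abstract_generation", "product_review", "topic_writing")
SETTINGS = ("user", "temporal")

# Assumption: these top-level fields identify the user in each task.
USER_FIELD = {
    "abstract_generation": "name",
    "product_review": "reviewerId",
    "topic_writing": "author",
}


def config_name(task: str, setting: str) -> str:
    """Build a configuration name such as 'product_review_user'."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting!r}")
    return f"{task}_{setting}"


def all_configs() -> list[str]:
    """List every task/setting combination named in the metadata."""
    return [config_name(t, s) for t in TASKS for s in SETTINGS]


def shared_users(train_ids, test_ids) -> set:
    """Return user IDs that occur in both splits.

    If the user setting really holds out new users, this should be empty
    for a *_user configuration; for *_temporal it is expected to be non-empty.
    """
    return set(train_ids) & set(test_ids)


print(all_configs())
# Toy ID lists for illustration only.
print(shared_users(["u1", "u2"], ["u2", "u3"]))
```

On real data, the ID lists could be obtained with something like `ds_train[USER_FIELD["product_review"]]` for each loaded split.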
## Citation
If you use the LongLaMP dataset in your research, please cite our paper:
```
@misc{kumar2024longlampbenchmarkpersonalizedlongform,
title={LongLaMP: A Benchmark for Personalized Long-form Text Generation},
author={Ishita Kumar and Snigdha Viswanathan and Sushrita Yerra and Alireza Salemi and Ryan A. Rossi and Franck Dernoncourt and Hanieh Deilamsalehy and Xiang Chen and Ruiyi Zhang and Shubham Agarwal and Nedim Lipka and Chien Van Nguyen and Thien Huu Nguyen and Hamed Zamani},
year={2024},
eprint={2407.11016},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.11016},
}
```
Provider:
LongLaMP



