
LongLaMP/LongLaMP

Source: Hugging Face · Updated: 2024-10-26 · Indexed: 2025-04-19
Download link:
https://hf-mirror.com/datasets/LongLaMP/LongLaMP
Resource description:
---
dataset_info:
- config_name: abstract_generation_temporal
  features:
  - name: name
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: profile
    list:
    - name: abstract
      dtype: string
    - name: id
      dtype: string
    - name: title
      dtype: string
    - name: year
      dtype: int64
  splits:
  - name: train
    num_bytes: 3031120943
    num_examples: 22822
  - name: val
    num_bytes: 607783744
    num_examples: 4565
  - name: test
    num_bytes: 599120779
    num_examples: 4564
  download_size: 2141171404
  dataset_size: 4238025466
- config_name: abstract_generation_user
  features:
  - name: name
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: profile
    list:
    - name: abstract
      dtype: string
    - name: id
      dtype: string
    - name: title
      dtype: string
    - name: year
      dtype: int64
  splits:
  - name: train
    num_bytes: 1855402366
    num_examples: 13693
  - name: val
    num_bytes: 612108287
    num_examples: 4562
  - name: test
    num_bytes: 620417451
    num_examples: 4560
  download_size: 1578664376
  dataset_size: 3087928104
- config_name: product_review_temporal
  features:
  - name: reviewerId
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: profile
    list:
    - name: description
      dtype: string
    - name: overall
      dtype: string
    - name: reviewText
      dtype: string
    - name: summary
      dtype: string
  splits:
  - name: train
    num_bytes: 1444326460
    num_examples: 16197
  - name: val
    num_bytes: 167208689
    num_examples: 1831
  - name: test
    num_bytes: 159206958
    num_examples: 1784
  download_size: 1039534145
  dataset_size: 1770742107
- config_name: product_review_user
  features:
  - name: reviewerId
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: profile
    list:
    - name: description
      dtype: string
    - name: overall
      dtype: string
    - name: reviewText
      dtype: string
    - name: summary
      dtype: string
  splits:
  - name: train
    num_bytes: 1222656833
    num_examples: 14745
  - name: val
    num_bytes: 147559402
    num_examples: 1826
  - name: test
    num_bytes: 168661273
    num_examples: 1822
  download_size: 911471721
  dataset_size: 1538877508
- config_name: topic_writing_temporal
  features:
  - name: author
    dtype: string
  - name: output
    dtype: string
  - name: input
    dtype: string
  - name: profile
    list:
    - name: author
      dtype: string
    - name: content
      dtype: string
    - name: id
      dtype: string
    - name: summary
      dtype: string
  splits:
  - name: train
    num_bytes: 1542333778
    num_examples: 16347
  - name: val
    num_bytes: 108126082
    num_examples: 2452
  - name: test
    num_bytes: 113545780
    num_examples: 2452
  download_size: 1064723500
  dataset_size: 1764005640
- config_name: topic_writing_user
  features:
  - name: author
    dtype: string
  - name: output
    dtype: string
  - name: input
    dtype: string
  - name: profile
    list:
    - name: author
      dtype: string
    - name: content
      dtype: string
    - name: id
      dtype: string
    - name: summary
      dtype: string
  splits:
  - name: train
    num_bytes: 1366090142
    num_examples: 11442
  - name: val
    num_bytes: 114610867
    num_examples: 2453
  - name: test
    num_bytes: 114795391
    num_examples: 2452
  download_size: 961434879
  dataset_size: 1595496400
configs:
- config_name: abstract_generation_temporal
  data_files:
  - split: train
    path: abstract_generation_temporal/train-*
  - split: val
    path: abstract_generation_temporal/val-*
  - split: test
    path: abstract_generation_temporal/test-*
- config_name: abstract_generation_user
  data_files:
  - split: train
    path: abstract_generation_user/train-*
  - split: val
    path: abstract_generation_user/val-*
  - split: test
    path: abstract_generation_user/test-*
- config_name: product_review_temporal
  data_files:
  - split: train
    path: product_review_temporal/train-*
  - split: val
    path: product_review_temporal/val-*
  - split: test
    path: product_review_temporal/test-*
- config_name: product_review_user
  data_files:
  - split: train
    path: product_review_user/train-*
  - split: val
    path: product_review_user/val-*
  - split: test
    path: product_review_user/test-*
- config_name: topic_writing_temporal
  data_files:
  - split: train
    path: topic_writing_temporal/train-*
  - split: val
    path: topic_writing_temporal/val-*
  - split: test
    path: topic_writing_temporal/test-*
- config_name: topic_writing_user
  data_files:
  - split: train
    path: topic_writing_user/train-*
  - split: val
    path: topic_writing_user/val-*
  - split: test
    path: topic_writing_user/test-*
task_categories:
- text-generation
- summarization
language:
- en
---

# LongLaMP Dataset

## Dataset Description

- **Repository:** https://longlamp-benchmark.github.io/
- **Paper:** https://www.arxiv.org/abs/2407.11016

## Dataset Summary

LongLaMP is a benchmark for personalized long-form text generation. It is designed to evaluate and improve how well language models generate extended, personalized content across several domains and tasks.

The benchmark consists of multiple tasks, each focusing on a different aspect of long-form text generation:

1. Personalized Email Completion
2. Personalized Abstract Generation
3. Personalized Review Writing
4. Personalized Topic Writing

The configurations listed in the metadata above cover abstract generation, product review writing, and topic writing.

Each task is curated to challenge language models to produce coherent, contextually relevant, and personalized long-form text, and the dataset provides a framework for testing and developing personalization techniques in language models.

Details about the dataset construction, task specifications, and evaluation metrics can be found in our technical paper: https://www.arxiv.org/abs/2407.11016

## Accessing the Dataset

You can download the dataset using the Hugging Face `datasets` library. Here is an example of how to load the train split of the product review dataset for the user-based setting:

```python
from datasets import load_dataset

# The second argument is the config name: "<task>_<setting>".
# Note: newer releases of `datasets` deprecate `use_auth_token` in favor of `token`.
ds = load_dataset(
    "LongLaMP/LongLaMP",
    "product_review_user",
    split="train",
    use_auth_token=True,
)
```

## Dataset Structure

The LongLaMP dataset is organized into tasks and settings. Each task has two settings:

1. User Setting: evaluates personalized text generation for new users
2. Temporal Setting: evaluates generating the latest content for previously seen users

## Citation

If you use the LongLaMP dataset in your research, please cite our paper:

```
@misc{kumar2024longlampbenchmarkpersonalizedlongform,
  title={LongLaMP: A Benchmark for Personalized Long-form Text Generation},
  author={Ishita Kumar and Snigdha Viswanathan and Sushrita Yerra and Alireza Salemi and Ryan A. Rossi and Franck Dernoncourt and Hanieh Deilamsalehy and Xiang Chen and Ruiyi Zhang and Shubham Agarwal and Nedim Lipka and Chien Van Nguyen and Thien Huu Nguyen and Hamed Zamani},
  year={2024},
  eprint={2407.11016},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.11016},
}
```
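
The config names in the metadata follow a `<task>_<setting>` pattern, and every record carries a `profile` list of the user's earlier items. The sketch below illustrates both points. It is not part of the official card: the helper names `config_name`, `load_split`, and `profile_preview` are hypothetical, and the sample record is made up to match the `product_review_*` schema.

```python
from typing import Any

# Task and setting names taken from the config_name entries in the metadata.
TASKS = ("abstract_generation", "product_review", "topic_writing")
SETTINGS = ("user", "temporal")


def config_name(task: str, setting: str) -> str:
    """Build a config name such as 'product_review_user' (hypothetical helper)."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting!r}")
    return f"{task}_{setting}"


def load_split(task: str, setting: str, split: str = "train", **kwargs: Any):
    """Load one split; needs the `datasets` package and network access.

    Extra keyword arguments (for example authentication options) are
    forwarded unchanged to `load_dataset`.
    """
    # Imported lazily so the other helpers work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "LongLaMP/LongLaMP", config_name(task, setting), split=split, **kwargs
    )


def profile_preview(
    example: dict[str, Any], field: str, max_items: int = 3, max_chars: int = 80
) -> list[str]:
    """Return truncated `field` values from the first few profile entries."""
    entries = example["profile"][:max_items]
    return [str(entry.get(field, ""))[:max_chars] for entry in entries]


# Made-up record shaped like the product_review_* schema (assumption, not real data).
example = {
    "reviewerId": "U1",
    "input": "Write a review for the product described here.",
    "output": "Placeholder target review text.",
    "profile": [
        {"description": "d1", "overall": "5.0", "reviewText": "Great value.", "summary": "Good"},
        {"description": "d2", "overall": "2.0", "reviewText": "Stopped working.", "summary": "Bad"},
    ],
}

all_configs = [config_name(t, s) for t in TASKS for s in SETTINGS]
print(all_configs)
print(profile_preview(example, "reviewText"))
```

A record returned by `load_split("product_review", "user")` could be passed to `profile_preview` in the same way. The available split names are `train`, `val`, and `test`.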
Provided by:
LongLaMP