MeetingBank-LLMCompressed

Name: MeetingBank-LLMCompressed
Creator: maas
Published: 2025-12-05 12:12:29
License: No description provided

ModelScope community. Updated: 2025-12-05. Indexed: 2025-07-26.

Download link:

https://modelscope.cn/datasets/microsoft/MeetingBank-LLMCompressed


Resource description:

# Dataset Card for MeetingBank-LLMCompressed

This dataset is introduced in [LLMLingua-2 (Pan _et al._, 2024)](https://arxiv.org/abs/2403.12968), and is collected to construct the training data for LLMLingua-2 compressor. It consists of 5169 instances from [MeetingBank](https://aclanthology.org/2023.acl-long.906/) training split, with their GPT-4 compressed versions.

Given pairs of original texts and their compressed versions, we release the data annotation tool [here](https://github.com/microsoft/LLMLingua/blob/main/experiments/llmlingua2/data_collection/label_word.py) to assign a binary label to each token in the original texts to determine if it should be preserved or discarded after compression.

### 🎯 Usage

```python
from datasets import load_dataset

meeting_bank_comp = load_dataset("microsoft/MeetingBank-LLMCompressed", split="train")

for sample in meeting_bank_comp:
    # concatenation of all chunks
    origin_prompt = sample["prompt"]
    compressed_prompt = sample["compressed_prompt"]
    # chunk list
    origin_prompt_list = sample["prompt_list"]
    compressed_prompt_list = sample["compressed_prompt_list"]
```

### 🔎 Details

We segment the original meeting transcripts into a few chunks and then instruct GPT-4 to compress each chunk independently. Please refer to [LLMLingua-2 (Pan _et al._, 2024)](https://arxiv.org/abs/2403.12968) for the prompt used for compression.

There are 6 fields:

1. `idx: int`: index of the instance.
2. `prompt: str`: original text of meeting transcripts.
3. `prompt_list: List[str]`: a List of chunks corresponding to the original instance in `prompt`.
4. `compressed_prompt_list: List[str]`: a List of compressed chunks. Each chunk is compressed by GPT-4 independently.
5. `compressed_prompt: str`: GPT-4 compressed version of the meeting transcripts. Each instance is a concatenation of all compressed chunks in `compressed_prompt_list`.
6. `summary: str`: summary of the meeting transcript from [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank).
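
To make the field descriptions more concrete, the following is a minimal sketch of two things one might compute from a single instance: a word-level compression ratio, and a rough keep/drop label per word. The helper names `compression_stats` and `naive_word_labels` are hypothetical and are not part of the dataset or of LLMLingua. The labeling here is a simple greedy in-order match on whitespace-split words, assumed purely for illustration; it is not the released `label_word.py` tool, and it will mislabel words whenever the compressed text rephrases or reorders the original.

```python
from typing import Dict, List


def compression_stats(sample: Dict) -> Dict[str, float]:
    """Count whitespace-separated words before and after compression.

    `sample` is expected to carry the `prompt` and `compressed_prompt`
    fields described in the field list above.
    """
    n_orig = len(sample["prompt"].split())
    n_comp = len(sample["compressed_prompt"].split())
    return {
        "orig_words": n_orig,
        "comp_words": n_comp,
        # Fraction of words kept; 0.0 for an empty original.
        "ratio": n_comp / n_orig if n_orig else 0.0,
    }


def naive_word_labels(original: str, compressed: str) -> List[bool]:
    """Label each original word True (kept) or False (dropped).

    Illustrative assumption: the compressed text is an in-order
    subsequence of the original words. Words are compared
    case-insensitively and matched greedily from left to right.
    """
    comp_words = compressed.lower().split()
    labels: List[bool] = []
    j = 0  # position of the next unmatched compressed word
    for word in original.lower().split():
        if j < len(comp_words) and word == comp_words[j]:
            labels.append(True)
            j += 1
        else:
            labels.append(False)
    return labels
```

Inside the loop from the Usage section, `compression_stats(sample)` could be called on each instance, and `naive_word_labels(chunk, compressed_chunk)` on each pair drawn from `prompt_list` and `compressed_prompt_list`, under the assumption that the two lists are aligned chunk by chunk.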
## 📄 Citation Information

```bibtex
@inproceedings{pan2024llmlingua2,
    title={LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression},
    author={Zhuoshi Pan and Qianhui Wu and Huiqiang Jiang and Menglin Xia and Xufang Luo and Jue Zhang and Qingwei Lin and Victor Rühle and Yuqing Yang and Chin-Yew Lin and H. Vicky Zhao and Lili Qiu and Dongmei Zhang},
    year={2024},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
    publisher = {Association for Computational Linguistics}
}
```

## 🧑‍🎓 Contributions

Thanks to [@panzs19](https://pzs19.github.io/), [@qianhuiwu](https://qianhuiwu.github.io/), and [@iofu728](https://cv.wyydsb.com/) for adding this dataset.


Provider: maas

Created: 2025-07-22
