MeetingBank-LLMCompressed
Source: ModelScope community; updated 2025-12-05; indexed 2025-07-26
Download link:
https://modelscope.cn/datasets/microsoft/MeetingBank-LLMCompressed
# Dataset Card for MeetingBank-LLMCompressed
This dataset was introduced in [LLMLingua-2 (Pan _et al._, 2024)](https://arxiv.org/abs/2403.12968) and was collected to construct the training data for the LLMLingua-2 compressor.
It consists of 5169 instances from the [MeetingBank](https://aclanthology.org/2023.acl-long.906/) training split, each paired with its GPT-4 compressed version.
Given pairs of original texts and their compressed versions, we release the data annotation tool [here](https://github.com/microsoft/LLMLingua/blob/main/experiments/llmlingua2/data_collection/label_word.py), which assigns a binary label to each token in the original text indicating whether it should be preserved or discarded after compression.
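To make the labeling idea concrete, here is a minimal sketch of word-level labeling. It is not the released `label_word.py`: the function name `label_words` and the greedy in-order, case-insensitive matching are assumptions chosen for illustration, and the sketch ignores punctuation differences, reordering, and rewording, which the released script should be consulted for.

```python
from typing import List, Tuple


def label_words(original: str, compressed: str) -> List[Tuple[str, bool]]:
    """Hypothetical helper: greedy in-order alignment.

    An original word is labeled True (preserved) when it equals the next
    not-yet-matched word of the compressed text, otherwise False (discarded).
    """
    comp_words = [w.lower() for w in compressed.split()]
    labels = []
    j = 0  # index of the next unmatched compressed word
    for word in original.split():
        keep = j < len(comp_words) and word.lower() == comp_words[j]
        if keep:
            j += 1
        labels.append((word, keep))
    return labels


# Toy example (not taken from the dataset)
labels = label_words(
    "The council will now vote on the budget item",
    "council vote budget item",
)
kept = [word for word, keep in labels if keep]
```

Here `kept` holds the words labeled as preserved, and every other word in the original is labeled as discarded.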
### 🎯 Usage
```python
from datasets import load_dataset

meeting_bank_comp = load_dataset("microsoft/MeetingBank-LLMCompressed", split="train")

for sample in meeting_bank_comp:
    # concatenation of all chunks
    origin_prompt = sample["prompt"]
    compressed_prompt = sample["compressed_prompt"]
    # chunk list
    origin_prompt_list = sample["prompt_list"]
    compressed_prompt_list = sample["compressed_prompt_list"]
```
### 🔎 Details
We segment the original meeting transcripts into a few chunks and then instruct GPT-4 to compress each chunk independently.
Please refer to [LLMLingua-2 (Pan _et al._, 2024)](https://arxiv.org/abs/2403.12968) for the prompt used for compression.
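The chunk-then-compress procedure can be sketched roughly as follows. This card does not state the chunking rule, so the sentence splitter, the `max_words` budget, and the helper names `split_into_chunks` and `compress_chunks` are all assumptions; `compress_fn` is a hypothetical stand-in for the GPT-4 call, whose actual prompt is given in the paper.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 400) -> List[str]:
    """Pack whole sentences into chunks of at most `max_words` words.

    Assumption: chunks end on sentence boundaries. A single sentence that
    exceeds the budget becomes its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def compress_chunks(chunks: List[str], compress_fn: Callable[[str], str]) -> List[str]:
    """Compress each chunk independently of the others."""
    return [compress_fn(chunk) for chunk in chunks]


# Toy example: the lambda keeps the first two words and only stands in
# for a real compressor.
chunks = split_into_chunks("First item. Second item is longer. Third.", max_words=4)
compressed = compress_chunks(chunks, lambda chunk: " ".join(chunk.split()[:2]))
```

Because each chunk is passed to `compress_fn` on its own, no chunk's compressed form depends on any other chunk.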
There are 6 fields:
1. `idx: int`: index of the instance.
2. `prompt: str`: original text of the meeting transcript.
3. `prompt_list: List[str]`: a list of chunks that make up the original instance in `prompt`.
4. `compressed_prompt_list: List[str]`: a list of compressed chunks. Each chunk is compressed by GPT-4 independently.
5. `compressed_prompt: str`: GPT-4 compressed version of the meeting transcript. Each instance is a concatenation of all compressed chunks in `compressed_prompt_list`.
6. `summary: str`: summary of the meeting transcript from [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank).
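As an illustration of how these fields relate, the sketch below computes rough compression ratios for one sample. The helper `compression_stats` is hypothetical, it counts whitespace-separated words rather than model tokens, and it assumes that `prompt_list` and `compressed_prompt_list` are aligned one-to-one. The toy sample is made up and only mirrors the field names.

```python
from typing import Dict, List


def compression_stats(sample: Dict) -> Dict:
    """Hypothetical helper: word-level compression ratios for one sample."""
    orig_chunks: List[str] = sample["prompt_list"]
    comp_chunks: List[str] = sample["compressed_prompt_list"]
    # Ratio of compressed to original length, per chunk (lower = more compressed).
    per_chunk = [
        len(comp.split()) / max(len(orig.split()), 1)
        for orig, comp in zip(orig_chunks, comp_chunks)
    ]
    overall = len(sample["compressed_prompt"].split()) / max(
        len(sample["prompt"].split()), 1
    )
    return {
        "chunks_aligned": len(orig_chunks) == len(comp_chunks),
        "per_chunk_ratio": per_chunk,
        "overall_ratio": overall,
    }


# Toy sample with the same field names as the dataset (content is invented)
toy_sample = {
    "idx": 0,
    "prompt": "a b c d e f g h",
    "prompt_list": ["a b c d", "e f g h"],
    "compressed_prompt_list": ["a c", "e"],
    "compressed_prompt": "a c e",
    "summary": "toy",
}
stats = compression_stats(toy_sample)
```

The same function can be applied to each `sample` yielded by the loader in the Usage section.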
## 📄 Citation Information
```bibtex
@inproceedings{pan2024llmlingua2,
title={LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression},
author={Zhuoshi Pan and Qianhui Wu and Huiqiang Jiang and Menglin Xia and Xufang Luo and Jue Zhang and Qingwei Lin and Victor Rühle and Yuqing Yang and Chin-Yew Lin and H. Vicky Zhao and Lili Qiu and Dongmei Zhang},
year={2024},
booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
publisher = {Association for Computational Linguistics}
}
```
## 🧑‍🎓 Contributions
Thanks to [@panzs19](https://pzs19.github.io/), [@qianhuiwu](https://qianhuiwu.github.io/), and [@iofu728](https://cv.wyydsb.com/) for adding this dataset.
Provider: maas
Created: 2025-07-22



