
MeetingBank-LLMCompressed

ModelScope community · Updated 2025-12-05 · Indexed 2025-07-26
Download link:
https://modelscope.cn/datasets/microsoft/MeetingBank-LLMCompressed
Resource description:
# Dataset Card for MeetingBank-LLMCompressed

This dataset is introduced in [LLMLingua-2 (Pan _et al._, 2024)](https://arxiv.org/abs/2403.12968), and is collected to construct the training data for LLMLingua-2 compressor. It consists of 5169 instances from [MeetingBank](https://aclanthology.org/2023.acl-long.906/) training split, with their GPT-4 compressed versions. Given pairs of original texts and their compressed versions, we release the data annotation tool [here](https://github.com/microsoft/LLMLingua/blob/main/experiments/llmlingua2/data_collection/label_word.py) to assign a binary label to each token in the original texts to determine if it should be preserved or discarded after compression.

### 🎯 Usage

```python
from datasets import load_dataset

meeting_bank_comp = load_dataset("microsoft/MeetingBank-LLMCompressed", split="train")

for sample in meeting_bank_comp:
    # concatenation of all chunks
    origin_prompt = sample["prompt"]
    compressed_prompt = sample["compressed_prompt"]
    # chunk list
    origin_prompt_list = sample["prompt_list"]
    compressed_prompt_list = sample["compressed_prompt_list"]
```

### 🔎 Details

We segment the original meeting transcripts into a few chunks and then instruct GPT-4 to compress each chunk independently. Please refer to [LLMLingua-2 (Pan _et al._, 2024)](https://arxiv.org/abs/2403.12968) for the prompt used for compression.

There are 6 fields:

1. `idx: int`: index of the instance.
2. `prompt: str`: original text of meeting transcripts.
3. `prompt_list: List[str]`: a List of chunks corresponding to the original instance in `prompt`.
4. `compressed_prompt_list: List[str]`: a List of compressed chunks. Each chunk is compressed by GPT-4 independently.
5. `compressed_prompt: str`: GPT-4 compressed version of the meeting transcripts. Each instance is a concatenation of all compressed chunks in `compressed_prompt_list`.
6. `summary: str`: summary of the meeting transcript from [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank).
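
The paired fields `prompt_list` and `compressed_prompt_list` make it straightforward to inspect how aggressively each chunk was compressed, and to get a rough feel for what a per-word keep/discard label looks like. The sketch below is illustrative only: the helper names (`word_count`, `chunk_compression_ratios`, `greedy_keep_labels`) and the toy sample are hypothetical, words are split on whitespace rather than with a model tokenizer, and the greedy in-order matching is a simplified stand-in, not the algorithm implemented in the released `label_word.py` tool.

```python
from typing import Dict, List


def word_count(text: str) -> int:
    """Count whitespace-separated words (an assumption; not a model tokenizer)."""
    return len(text.split())


def chunk_compression_ratios(sample: Dict) -> List[float]:
    """Return compressed/original word-count ratios, one per chunk pair."""
    ratios = []
    for original, compressed in zip(
        sample["prompt_list"], sample["compressed_prompt_list"]
    ):
        n_original = word_count(original)
        # Guard against empty chunks to avoid division by zero.
        ratios.append(word_count(compressed) / n_original if n_original else 0.0)
    return ratios


def greedy_keep_labels(original: str, compressed: str) -> List[int]:
    """Label each original word 1 (kept) or 0 (discarded).

    Simplified stand-in: walk the original left to right and mark a word as
    kept when it matches the next unmatched compressed word (case-insensitive).
    """
    compressed_words = [w.lower() for w in compressed.split()]
    labels = []
    cursor = 0  # index of the next compressed word still waiting for a match
    for word in original.split():
        if cursor < len(compressed_words) and word.lower() == compressed_words[cursor]:
            labels.append(1)
            cursor += 1
        else:
            labels.append(0)
    return labels


# Hypothetical toy sample that mimics the two chunk-list fields of the dataset.
toy_sample = {
    "prompt_list": [
        "The council will now vote on item five",
        "Thank you all very much",
    ],
    "compressed_prompt_list": ["council vote item five", "Thank you"],
}

ratios = chunk_compression_ratios(toy_sample)
labels = greedy_keep_labels(
    toy_sample["prompt_list"][0], toy_sample["compressed_prompt_list"][0]
)
print(ratios)
print(labels)
```

The same two functions can be applied to a real `sample` from the loop in the Usage section, since they only read `prompt_list` and `compressed_prompt_list`. Greedy matching breaks down when GPT-4 reorders or rewrites words, so the released annotation tool remains the reference for producing training labels.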

## 📄 Citation Information

```bibtex
@inproceedings{pan2024llmlingua2,
    title={LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression},
    author={Zhuoshi Pan and Qianhui Wu and Huiqiang Jiang and Menglin Xia and Xufang Luo and Jue Zhang and Qingwei Lin and Victor Rühle and Yuqing Yang and Chin-Yew Lin and H. Vicky Zhao and Lili Qiu and Dongmei Zhang},
    year={2024},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
    publisher = {Association for Computational Linguistics}
}
```

## 🧑‍🎓 Contributions

Thanks to [@panzs19](https://pzs19.github.io/), [@qianhuiwu](https://qianhuiwu.github.io/), and [@iofu728](https://cv.wyydsb.com/) for adding this dataset.

Provider:
maas
Created:
2025-07-22