
MeetingBank-QA-Summary

ModelScope community. Updated 2025-12-10. Indexed 2025-08-16.
Download link:
https://modelscope.cn/datasets/microsoft/MeetingBank-QA-Summary
# Dataset Card for MeetingBank-QA-Summary

<!-- Inspired by the concept of "LLMs as Compressors", we propose a data distillation procedure to derive -->
<!-- knowledge from an LLM (GPT-4) to compress prompts without sacrificing crucial information. -->

This dataset is introduced in [LLMLingua-2 (Pan et al., 2024)](https://arxiv.org/abs/2403.12968) and is designed to assess the performance of compressed meeting transcripts on downstream tasks such as question answering (QA) and summarization. It includes 862 meeting transcripts from the test set of [MeetingBank (Hu et al., 2023)](https://aclanthology.org/2023.acl-long.906/) as the context, together with QA pairs and summaries that were generated by GPT-4 for each context transcript.

## 🎯 Usage

```python
from datasets import load_dataset

meeting_bank_qa = load_dataset("microsoft/MeetingBank-QA-Summary", split="test")
for i, sample in enumerate(meeting_bank_qa):
    origin_prompt = sample["prompt"]  # meeting transcript to be used as the context
    gpt4_summary = sample["gpt4_summary"]  # GPT-4-generated summary corresponding to the context
    qa_pair_list = sample["QA_pairs"]  # GPT-4-generated QA pairs corresponding to the context
    for qa_pair in qa_pair_list:
        q = qa_pair["question"]
        a = qa_pair["answer"]
```

## 🔎 Details

### 1. QA Pair Generation

Initially, we generate 10 question-answer pairs for each meeting transcript using **GPT-4-32K**. The instruction used in generating QA pairs is: "_Create 10 questions/answer pairs from the given meeting transcript. The answer should be short and concise. The question should start with `Q:` and answer should start with `A:` . The meeting transcript is as follows.\n{transcript\_example}_". To ensure the quality of the generated QA pairs, we discard the question-answer pairs with answer lengths exceeding 50 tokens. Subsequently, we carefully examine the remaining QA pairs to ensure that the answers actually appear in the original transcripts, instead of being products of GPT-4's hallucinations. After the aforementioned filtering process, we retain **3 high-quality question-answer pairs for each meeting transcript**.

### 2. Summary Generation

We instruct GPT-4-32K to summarize each meeting transcript. The instruction used here is: "_Summarize the following meeting transcript.\n{transcript\_example}\nSummary:_".

## 📄 Citation Information

```bibtex
@inproceedings{pan2024llmlingua2,
    title={LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression},
    author={Zhuoshi Pan and Qianhui Wu and Huiqiang Jiang and Menglin Xia and Xufang Luo and Jue Zhang and Qingwei Lin and Victor Rühle and Yuqing Yang and Chin-Yew Lin and H. Vicky Zhao and Lili Qiu and Dongmei Zhang},
    year={2024},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
    publisher = {Association for Computational Linguistics}
}
```

## 🧑‍🎓 Contributions

Thanks to [@panzs19](https://pzs19.github.io/), [@qianhuiwu](https://qianhuiwu.github.io/), and [@iofu728](https://cv.wyydsb.com/) for adding this dataset.
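
## 🧪 Illustrative Sketch: QA Pair Filtering

The filtering described under "QA Pair Generation" can be approximated in a few lines. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: the card does not name the tokenizer behind the 50-token limit, so whitespace splitting stands in for it, and the "carefully examine" step is replaced by a crude case-insensitive substring check. All function names (`parse_qa_pairs`, `count_tokens`, `answer_in_transcript`, `filter_qa_pairs`) and the sample text are hypothetical.

```python
import re

MAX_ANSWER_TOKENS = 50  # length limit stated in the dataset card


def parse_qa_pairs(generated_text):
    """Extract question/answer dicts from text whose lines start with `Q:` or `A:`."""
    pairs = []
    question = None
    for line in generated_text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append({"question": question, "answer": line[2:].strip()})
            question = None
    return pairs


def count_tokens(text):
    # Assumption: whitespace-separated words approximate the unnamed tokenizer.
    return len(text.split())


def normalize(text):
    # Lowercase, collapse whitespace, and drop leading/trailing spaces and punctuation.
    return re.sub(r"\s+", " ", text.lower()).strip(" .!?")


def answer_in_transcript(answer, transcript):
    # Assumption: a substring match is a rough proxy for the manual examination.
    needle = normalize(answer)
    return bool(needle) and needle in normalize(transcript)


def filter_qa_pairs(pairs, transcript, keep=3):
    """Drop long or ungrounded answers, then keep at most `keep` pairs."""
    kept = [
        pair
        for pair in pairs
        if count_tokens(pair["answer"]) <= MAX_ANSWER_TOKENS
        and answer_in_transcript(pair["answer"], transcript)
    ]
    return kept[:keep]


# Hypothetical sample data, invented for illustration only.
demo_transcript = (
    "Council member Lee moved to approve the budget. "
    "The motion passed seven to two."
)
demo_generated = (
    "Q: Who moved to approve the budget?\n"
    "A: Council member Lee\n"
    "Q: What was the vote?\n"
    "A: Seven to two.\n"
    "Q: When is the next meeting?\n"
    "A: Next Tuesday"
)
demo_kept = filter_qa_pairs(parse_qa_pairs(demo_generated), demo_transcript)
for pair in demo_kept:
    print(pair["question"], "->", pair["answer"])
```

A substring check rejects correct answers that paraphrase the transcript, so in practice it would more plausibly serve as a first pass before human review than as a replacement for it.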

Provider: maas
Created: 2025-07-22