InstruSum
ModelScope community · Updated 2025-12-05 · Indexed 2025-08-23
Download link:
https://modelscope.cn/datasets/Salesforce/InstruSum
Description:
# InstruSum
This is the dataset corresponding to our paper ["Benchmarking Generation and Evaluation Capabilities of Large Language
Models for Instruction Controllable Summarization"](https://arxiv.org/abs/2311.09184).
### dataset
The `dataset` subset contains 100 data examples that we wrote by hand.
Each example contains an article, a summary instruction, an LLM-generated summary, and a hybrid LLM-human summary.
### human_eval
This subset contains human evaluation results for the 100 examples in the `dataset` subset.
There are 5 systems evaluated: OpenAI's `text-davinci-002`, `text-davinci-003`, `gpt-3.5-turbo-0301`, `gpt-4-0314`, along with the `hybrid` LLM-human summary.
We rated each summary on 4 aspects:
- **Overall Quality**: This rating assesses the overall quality of the summary in relation to the summary requirement.
- **Missing Information**: Does the summary omit any crucial information from the article concerning the summary requirement?
- **Irrelevant Information**: Does the summary include any information that is not relevant to the summary requirement?
- **Factual Consistency**: Is the summary consistent with the facts presented in the article, without contradicting or misrepresenting any information?
### human_eval_pairwise
This subset contains converted pairwise human evaluation results based on the human evaluation results in the `human_eval` subset.
The conversion process is as follows:
- The ranking-based human evaluation results are converted into pairwise comparisons for the *overall quality* aspect.
- Only comparisons where the annotators reached a consensus are included.
- Comparisons that resulted in a tie are excluded.
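The conversion steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the input layout (one dict of ranks per annotator), the convention that rank 1 is best, and the reading of "consensus" as unanimous agreement are all assumptions made for the example.

```python
from itertools import combinations


def to_pairwise(rankings):
    """Convert per-annotator rankings into consensus pairwise comparisons.

    rankings: list of dicts, one per annotator, mapping system name -> rank.
              Assumption: a lower rank number means a better summary.
    Returns a list of (winner, loser) tuples.
    """
    systems = sorted(rankings[0])
    pairs = []
    for a, b in combinations(systems, 2):
        # Per-annotator preference: -1 if a is ranked better, 1 if b is, 0 for a tie.
        votes = {(r[a] > r[b]) - (r[a] < r[b]) for r in rankings}
        if len(votes) != 1:
            continue  # annotators disagree: no consensus (assumed to mean unanimity)
        vote = votes.pop()
        if vote == 0:
            continue  # every annotator rated the pair as tied: excluded
        pairs.append((a, b) if vote < 0 else (b, a))
    return pairs


# Hypothetical rankings from two annotators for three of the evaluated systems.
annotators = [
    {"gpt-4-0314": 1, "text-davinci-003": 2, "hybrid": 1},
    {"gpt-4-0314": 1, "text-davinci-003": 3, "hybrid": 2},
]
print(to_pairwise(annotators))
```

In this toy input the `gpt-4-0314` vs. `hybrid` pair is dropped because one annotator saw a tie and the other did not, so only the two comparisons against `text-davinci-003` survive.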
### llm_eval
This subset contains LLM-based automatic evaluation results for the 100 examples in the `dataset` subset.
We used 11 LLMs in our evaluation and 4 evaluation protocols:
- `LLMRank`: listwise ranking
- `LLMCompare`: pairwise comparison
- `LLMEval`: pointwise scoring by text completion
- `LLMScore`: pointwise scoring by model-predicted log-likelihood
In total, we evaluated 40 LLM-based evaluation methods over three quality aspects:
| LLM | LLMRank | LLMCompare | LLMEval | LLMScore |
|--------------------------|---------|------------|---------|----------|
| `text-davinci-002` | ✅ | ✅ | ✅ | ✅ |
| `text-davinci-003` | ✅ | ✅ | ✅ | ✅ |
| `gpt-3.5-turbo-0301` | ✅ | ✅ | ✅ | ❌ |
| `gpt-3.5-turbo-0613` | ✅ | ✅ | ✅ | ❌ |
| `gpt-3.5-turbo-instruct` | ✅ | ✅ | ✅ | ✅ |
| `gpt-4-0314` | ✅ | ✅ | ✅ | ❌ |
| `gpt-4-1106-preview` | ✅ | ✅ | ✅ | ❌ |
| `llama-2-7b-chat` | ✅ | ✅ | ✅ | ✅ |
| `llama-2-13b-chat` | ✅ | ✅ | ✅ | ✅ |
| `llama-2-70b-chat` | ✅ | ✅ | ✅ | ✅ |
| `mistral-instruct` | ✅ | ✅ | ✅ | ✅ |
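As a quick check of the total, the 40 methods can be enumerated directly from the table: every LLM supports the first three protocols, and `LLMScore` is unavailable for four of the chat models. The variable names below are illustrative only.

```python
PROTOCOLS = ["LLMRank", "LLMCompare", "LLMEval", "LLMScore"]

LLMS = [
    "text-davinci-002", "text-davinci-003",
    "gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-instruct",
    "gpt-4-0314", "gpt-4-1106-preview",
    "llama-2-7b-chat", "llama-2-13b-chat", "llama-2-70b-chat",
    "mistral-instruct",
]

# Models marked with a cross in the LLMScore column of the table.
NO_LLMSCORE = {
    "gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613",
    "gpt-4-0314", "gpt-4-1106-preview",
}

# Every (LLM, protocol) combination marked as supported in the table.
methods = [
    (llm, protocol)
    for llm in LLMS
    for protocol in PROTOCOLS
    if not (protocol == "LLMScore" and llm in NO_LLMSCORE)
]

print(len(methods))  # 11 * 4 - 4 = 40
```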
### system_outputs
This subset contains the system outputs for the 100 examples in the `dataset` subset over 11 LLMs (same as the `llm_eval` subset).
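
A subset could be loaded along these lines. This sketch assumes the data is also available on the Hugging Face Hub under the `Salesforce/InstruSum` ID and that each subset described above is exposed as a configuration of the same name; neither is confirmed by this page, and `load_subset` is a hypothetical helper.

```python
# Subset names as described in this README.
SUBSETS = [
    "dataset",
    "human_eval",
    "human_eval_pairwise",
    "llm_eval",
    "system_outputs",
]


def load_subset(name):
    """Load one InstruSum subset (hypothetical helper, see assumptions above)."""
    if name not in SUBSETS:
        raise ValueError(f"unknown subset {name!r}; expected one of {SUBSETS}")
    # Imported lazily so the module can be used without the library installed.
    from datasets import load_dataset  # pip install datasets

    # Assumption: the repository ID and configuration names match this README.
    return load_dataset("Salesforce/InstruSum", name)
```

Calling `load_subset("human_eval")` would then download that subset on first use, provided the assumptions about the repository hold.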
## Ethical Considerations
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
Provider:
maas
Created:
2025-08-16



