ivanleomk/gpt4-chain-of-density
Source: Hugging Face. Updated 2023-11-12; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ivanleomk/gpt4-chain-of-density
Resource description:
---
license: mit
task_categories:
- summarization
language:
- en
---
# Introduction
This dataset consists of Chain of Density summaries that we generated using GPT-4. The approach is slightly modified to account for GPT-4 timeouts, with some additional validation that we added using the [Instructor](https://github.com/jxnl/instructor) library.
We wrote a short blog post about how we generated this data [here](https://jxnl.github.io/instructor/blog/2023/11/05/better-summaries-by-finetuning-chain-of-density/#results-and-benchmarks).
Here's a quick summary of the individual files:
1. `summarization_20`, `summarization_50` and `summarization_all` are the respective `.jsonl` files that we used to fine-tune our models. They contain 20, 50 and 76 examples respectively.
2. `test.csv`: This is a randomly selected group of 100 test articles, sampled from the original `griffin/chain-of-density` dataset, that was used to create our training sets with GPT-4.
3. `train.csv`: This is a randomly selected group of 20 test articles, sampled from the original `griffin/chain-of-density` dataset, that were not provided to our fine-tuned models. They were then used to evaluate the models' quality and performance.
4. `validation-summaries`: These are the summaries generated by `GPT-4` on the test set. We include the following fields:
   - `text`: The original article that was summarized
   - `model`: This has a single value of `GPT-4`
   - `Summary 1`: The first summary created
   - `Summary 2`: The second rewritten summary
   - `Summary 3`: The third rewritten summary
   - `Summary 4`: The fourth rewritten summary
   - `time`: The time taken for the entire chain of density to be created
5. `vanilla_35.csv`: This contains the summaries generated by a vanilla GPT-3.5 model that was prompted to generate an entity-dense summary.
6. `results.csv`: These are the summaries generated by the individual fine-tuned models. We include the following fields:
   - `Article`: The original article that was summarized
   - `model`: This is either `fine-tuned-20`, `fine-tuned-50` or `fine-tuned-76`, which represent our GPT-3.5 model fine-tuned on 20, 50 or 76 examples respectively
   - `Summary`: The summary generated by the fine-tuned model
   - `Time`: The amount of time it took for the summary to be generated
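
The field lists above can be made concrete with a small sketch. It uses only the Python standard library and assumes the column names match the fields listed for `results.csv` and `validation-summaries` exactly, and that `Time` holds a plain number (the unit is not stated here). The helper names `mean_time_by_model` and `summary_chain`, and the sample rows, are hypothetical and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_time_by_model(csv_text, model_col="model", time_col="Time"):
    """Average the generation time per model label in `results.csv`-style text.

    Assumes the time column parses as a float; the unit is not specified.
    """
    times = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        times[row[model_col]].append(float(row[time_col]))
    return {model: mean(values) for model, values in times.items()}


def summary_chain(row, steps=4):
    """Return `Summary 1` .. `Summary N` from a `validation-summaries` row, in order."""
    return [row[f"Summary {i}"] for i in range(1, steps + 1)]


# Made-up rows in the `results.csv` layout, for illustration only.
SAMPLE = """Article,model,Summary,Time
"Some article text",fine-tuned-20,"A short summary",2.0
"Some article text",fine-tuned-20,"Another summary",4.0
"Some article text",fine-tuned-76,"A third summary",3.0
"""

if __name__ == "__main__":
    print(mean_time_by_model(SAMPLE))
```

To run this on the real file, one option is to read a locally downloaded copy, for example `mean_time_by_model(open("results.csv", newline="").read())`; the local path is an assumption about where the file was saved.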
Provider:
ivanleomk
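Summary of original information
Dataset overview
Dataset contents
File list
- `summarization_20`, `summarization_50`, `summarization_all`
  - Type: `.jsonl` files
  - Description: Data used to fine-tune the models, containing 20, 50 and 76 examples respectively.
- `test.csv`
  - Type: CSV file
  - Description: 100 test articles randomly selected from the original `griffin/chain-of-density` dataset.
- `train.csv`
  - Type: CSV file
  - Description: 20 test articles randomly selected from the original `griffin/chain-of-density` dataset. They were not used to fine-tune the models and serve to evaluate model quality and performance.
- `validation-summaries`
  - Description: Summaries of the test set generated by `GPT-4`, with the following fields:
    - `text`: The original article
    - `model`: The model name, with the value `GPT-4`
    - `Summary 1`: The first summary generated
    - `Summary 2`: The second rewritten summary
    - `Summary 3`: The third rewritten summary
    - `Summary 4`: The fourth rewritten summary
    - `time`: The time taken to generate the entire chain of density
- `vanilla_35.csv`
  - Type: CSV file
  - Description: Entity-dense summaries generated by a vanilla GPT-3.5 model.
- `results.csv`
  - Type: CSV file
  - Description: Summaries generated by the individual fine-tuned models, with the following fields:
    - `Article`: The original article
    - `model`: The model name, one of `fine-tuned-20`, `fine-tuned-50` or `fine-tuned-76`
    - `Summary`: The summary generated by the fine-tuned model
    - `Time`: The time taken to generate the summary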



