allenai/commongen_lite
收藏Hugging Face2024-01-05 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/allenai/commongen_lite
下载链接
链接失效反馈官方服务:
资源简介:
---
configs:
- config_name: all
data_files:
- split: train
- path: commongen_hard_nohuman.json
license: apache-2.0
task_categories:
- text-generation
language:
- en
pretty_name: CommonGen-Lite
size_categories:
- n<1K
datasets:
- commongen
---
# CommonGen-Lite
Evaluating LLMs with [CommonGen](https://inklab.usc.edu/CommonGen/) using [CommonGen-lite](https://huggingface.co/datasets/allenai/commongen_lite) dataset (400 examples + 900 human references).
We use GPT-4 to evaluate the constrained text generation ability of LLMs. Please see more in our [paper](https://arxiv.org/abs/1911.03705).
Github: [https://github.com/allenai/CommonGen-Eval](https://github.com/allenai/CommonGen-Eval)
## Leaderboard
| model | len | cover | pos | win_tie | overall |
|----------------------------|-------|---------|-------|-----------|-----------|
| human | 12.84 | 99.00 | 98.11 | 100.00 | 97.13 |
| gpt-4-0613 | 14.13 | 97.44 | 91.78 | 50.44 | 45.11 |
| gpt-4-1106-preview | 14.90 | 96.33 | 90.11 | 50.78 | 44.08 |
| gpt-3.5-turbo | 12.76 | 92.11 | 83.00 | 49.78 | 38.06 |
| Yi-34b-chat | 13.45 | 80.11 | 75.11 | 39.44 | 23.73 |
| vicuna-13b-v1.5 | 15.02 | 85.89 | 79.56 | 27.44 | 18.75 |
| tulu-2-dpo-70b | 17.89 | 88.78 | 80.11 | 23.00 | 16.36 |
| Mixtral-8x7B-Instruct-v0.1 | 20.15 | 84.11 | 73.33 | 17.89 | 11.03 |
| Llama-2-7b-chat-hf | 16.06 | 88.56 | 76.44 | 15.44 | 10.45 |
| zephyr-7b-beta | 15.76 | 82.44 | 72.78 | 16.89 | 10.13 |
| Yi-6b-chat | 13.32 | 71.67 | 63.56 | 22.11 | 10.07 |
- **length**: the number of words on average in the generated sentences
- **cover**: the percentage of examples where all given concepts are covered by model outputs
- **PoS**: the percentage of examples where the part-of-speech (PoS) of ALL given concepts are correct in model outputs
- **win_tie**: the percentage of examples where GPT-4-turbo prefers the model outputs over the human-written references (or thinks they are equally good)
- **overall**: `cover%` x `pos%` x `win_tie%`
Link: [https://inklab.usc.edu/CommonGen/leaderboard.html](https://inklab.usc.edu/CommonGen/leaderboard.html)
To make your model on the leaderboard, please create an issue or PR to submit the inference script. I'll run the following evaluation script and update the leaderboard.
You will not need to run the evaluation script yourself (the human references are hidden to avoid data leakage).
## Prompts
<details>
<summary>Instruction Prompt Template (2-shots prompting)</summary>
```markdown
# Instruction
Given several concepts (i.e., nouns or verbs), write a short and simple sentence that contains *all* the required words.
The sentence should describe a common scene in daily life, and the concepts should be used in a natural way.
# Examples
## Example 1
- Concepts: "dog(noun), frisbee(noun), catch(verb), throw(verb)"
- Sentence: The dog catches the frisbee when the boy throws it into the air.
## Example 2
- Concepts: "apple(noun), place(verb), tree(noun), pick(verb)"
- Sentence: A girl picks some apples from a tree and places them into her basket.
# Your Task
- Concepts: "{$concept_list}"
- Sentence:
```
</details>
<details>
<summary>Evaluation Prompt Template (Pairwise Comparison)</summary>
```markdown
# Data
Given several concepts (i.e., nouns or verbs), we ask models to write a short and simple sentence that contains *all* the required words.
The sentence should describe a common scene in daily life, and the concepts should be used in a natural way.
Concepts: "{$concept_list}"
Model A: "{$candidate_A}"
Model B: "{$candidate_B}"
# Your Task
Your task is to choose a better sentence from the two candidates. Decide which model's sentence is better in terms of the naturalness and commonness of the scenes they describe.
## Rules:
- A better sentence should describe a common scene in daily life, and all concepts should be used in a natural way.
- You should prefer sentences that use all given concepts with correct part-of-speech tags.
- A simpler and shorter sentence is preferred if it describes the same scene as the other sentence.
- If you think both sentences are equally good or bad, please choose "tie".
Now, please output your choice ("A" or "B" or "tie").
Your choice:
```
</details>
## Case studies
Here are some examples of the generated sentences from the models.
<details>
<summary> Example 1 </summary>
- **Concepts**: `['demonstrate_V', 'machine_N', 'sew_V', 'use_V']`
- **Human Reference**: `She used the sewing machine to demonstrate how well she could sew.`
- **Mixtral-8x7B-Instruct-v0.1**: ` During the fashion design class, the teacher demonstrates how to use the sewing machine to create a stylish outfit.`
--> ```Versus=lose; Cover=False; PoS=False; Len=20```
- **Yi-34b-chat**: `The tailor demonstrates how to use the sewing machine.`
--> ```Versus=win; Cover=False; PoS=False; Len=9```
- **Yi-6b-chat**: `The machine demonstrates how to sew a useful item.`
--> ```Versus=lose; Cover=False; PoS=False; Len=9```
- **Llama-2-7b-chat-hf**: `The teacher demonstrates how to use the sewing machine to sew a new dress, which the student will use to create a beautiful outfit for the school play.`
--> ```Versus=lose; Cover=True; PoS=True; Len=28```
- **zephyr-7b-beta**: `The tailor demonstrates how to use the sewing machine to create garments by allowing the customer to try it out themselves.`
--> ```Versus=lose; Cover=False; PoS=False; Len=21```
- **tulu-2-dpo-70b**: `A woman demonstrates how to use a machine to sew, showing the audience step by step.`
--> ```Versus=lose; Cover=True; PoS=True; Len=16```
- **vicuna-13b-v1.5**: ` The tailor demonstrates how to use the sewing machine to make a dress.`
--> ```Versus=lose; Cover=False; PoS=False; Len=14```
- **gpt-3.5-turbo**: `The tailor uses the sewing machine to demonstrate how to sew a straight line.`
--> ```Versus=lose; Cover=True; PoS=True; Len=14```
- **gpt-4-0613**: `The instructor demonstrates how to use the sewing machine by sewing a piece of fabric.`
--> ```Versus=lose; Cover=True; PoS=True; Len=15```
- **gpt-4-1106-preview**: `The tailor demonstrates how to use the sewing machine by making a small pouch.`
--> ```Versus=win; Cover=False; PoS=False; Len=14```
</details>
<details>
<summary> Example 2 </summary>
- **Concepts**: `['car_N', 'drive_V', 'road_N', 'water_N']`
- **Human Reference**: `A man drives the car slowly because there is water on the road`
- **Mixtral-8x7B-Instruct-v0.1**: ` During the road trip, the family takes breaks to drink water next to the car driving on the open road.`
--> ```Versus=lose; Cover=True; PoS=True; Len=21```
- **Yi-34b-chat**: `The car drives down the road splashing water from the puddles.`
--> ```Versus=win; Cover=True; PoS=True; Len=11```
- **Yi-6b-chat**: `The car drives along the road, avoiding the puddle of water.`
--> ```Versus=win; Cover=True; PoS=True; Len=11```
- **Llama-2-7b-chat-hf**: `The driver drives the car along the road, feeling the water splashing against the tires.`
--> ```Versus=lose; Cover=True; PoS=True; Len=15```
- **zephyr-7b-beta**: `The driver takes the car along the road until they reach a place with water, where they cannot drive anymore.`
--> ```Versus=lose; Cover=True; PoS=True; Len=20```
- **tulu-2-dpo-70b**: `A person drives their car on the road next to a river, where the water is flowing smoothly.`
--> ```Versus=lose; Cover=True; PoS=True; Len=18```
- **vicuna-13b-v1.5**: ` The car drives on the road by the water.`
--> ```Versus=lose; Cover=True; PoS=True; Len=10```
- **gpt-3.5-turbo**: `I drive my car on the road next to the water.`
--> ```Versus=win; Cover=True; PoS=True; Len=11```
- **gpt-4-0613**: `The man drives his car on the road that runs alongside the sparkling water.`
--> ```Versus=lose; Cover=True; PoS=True; Len=14```
- **gpt-4-1106-preview**: `The car drives smoothly on the wet road, splashing water as it goes.`
--> ```Versus=win; Cover=True; PoS=True; Len=13```
</details>
<details>
<summary> Example 3 </summary>
- **Concepts**: `['ceiling_N', 'climb_V', 'gym_N', 'rope_N', 'suspend_V']`
- **Human Reference**: `The boys climbed the rope suspended from the gym ceiling.`
- **Mixtral-8x7B-Instruct-v0.1**: ` During gym class, the students take turns climbing the suspended rope while the others cheer from below, careful not to hit their heads on the ceiling.`
--> ```Versus=lose; Cover=True; PoS=True; Len=27```
- **Yi-34b-chat**: `The gymnast climbs the rope and suspends from the ceiling of the gym.`
--> ```Versus=lose; Cover=True; PoS=True; Len=13```
- **Yi-6b-chat**: `The climber suspends themselves on the rope as they climb the gym's ceiling.`
--> ```Versus=lose; Cover=True; PoS=True; Len=13```
- **Llama-2-7b-chat-hf**: `The gym teacher suspends herself from the ceiling using a rope while climbing to reach the high bar for a workout.`
--> ```Versus=lose; Cover=True; PoS=True; Len=21```
- **zephyr-7b-beta**: `Gymnasts suspend themselves from the ceiling by climbing on the ropes in the gym.`
--> ```Versus=lose; Cover=True; PoS=True; Len=14```
- **tulu-2-dpo-70b**: `In the gym, an athlete climbs a rope suspended from the ceiling to improve their physical fitness.`
--> ```Versus=lose; Cover=True; PoS=True; Len=17```
- **vicuna-13b-v1.5**: ` In the gym, she climbs up to the ceiling using a rope that is suspended from the top.`
--> ```Versus=lose; Cover=True; PoS=True; Len=19```
- **gpt-3.5-turbo**: `In the gym, the athlete climbs the rope and suspends himself from the ceiling.`
--> ```Versus=lose; Cover=True; PoS=True; Len=14```
- **gpt-4-0613**: `In the gym, a man climbs a rope that is suspended from the ceiling.`
--> ```Versus=lose; Cover=True; PoS=True; Len=14```
- **gpt-4-1106-preview**: `In the gym, the rope is suspended from the ceiling for athletes to climb during their workout.`
--> ```Versus=lose; Cover=True; PoS=True; Len=17```
</details>
<details>
<summary> Example 4 </summary>
- **Concepts**: `['hat_N', 'knit_V', 'pile_N', 'rock_N', 'sit_V']`
- **Human Reference**: `The woman sat on a pile of rocks and knitted a hat.`
- **Mixtral-8x7B-Instruct-v0.1**: ` During her break, a woman sits on a rock and starts to knit a pile of hats to sell at the local market.`
--> ```Versus=lose; Cover=True; PoS=True; Len=24```
- **Yi-34b-chat**: `The knitted hat sat on a pile of rocks.`
--> ```Versus=lose; Cover=False; PoS=False; Len=9```
- **Yi-6b-chat**: `The woman knits a cozy hat while sitting on a rock, placing the pile of rocks around her.`
--> ```Versus=lose; Cover=True; PoS=True; Len=18```
- **Llama-2-7b-chat-hf**: `The old man sits on a rock and knits a new hat while piling up the ones he already has, enjoying the peaceful atmosphere.`
--> ```Versus=lose; Cover=True; PoS=False; Len=24```
- **zephyr-7b-beta**: `A woman sits on a rock and knits a hat from a pile of yarn.`
--> ```Versus=lose; Cover=True; PoS=True; Len=15```
- **tulu-2-dpo-70b**: `An elderly woman knits a colorful hat while sitting on a pile of rocks, enjoying the peaceful scenery around her.`
--> ```Versus=lose; Cover=True; PoS=True; Len=20```
- **vicuna-13b-v1.5**: ` The old man sits on a rock and knits a hat in a pile of yarn.`
--> ```Versus=lose; Cover=True; PoS=True; Len=17```
- **gpt-3.5-turbo**: `The old man sits on a rock, wearing a knitted hat, while a pile of leaves gathers around him.`
--> ```Versus=lose; Cover=False; PoS=False; Len=19```
- **gpt-4-0613**: `The woman sits on a rock, knitting a hat, with a pile of yarn beside her.`
--> ```Versus=lose; Cover=True; PoS=True; Len=16```
- **gpt-4-1106-preview**: `She sits on a rock, knitting a hat, and places the finished ones in a colorful pile beside her.`
--> ```Versus=lose; Cover=True; PoS=True; Len=19```
</details>
## Links
- Contact: [Bill Yuchen Lin](https://yuchenlin.xyz/)
- Project website: [https://inklab.usc.edu/CommonGen/](https://inklab.usc.edu/CommonGen/)
- HuggingFace Dataset: [CommonGen-lite](https://huggingface.co/datasets/allenai/commongen_lite)
## Citation
```bibtex
@inproceedings{lin-etal-2020-commongen,
title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
author = "Lin, Bill Yuchen and
Zhou, Wangchunshu and
Shen, Ming and
Zhou, Pei and
Bhagavatula, Chandra and
Choi, Yejin and
Ren, Xiang",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
pages = "1823--1840",
}
```
提供机构:
allenai
原始信息汇总
数据集概述
名称: CommonGen-Lite
数据集大小: 小于1000条记录
语言: 英语
任务类别: 文本生成
许可证: Apache-2.0
数据集包含:
- 训练数据文件:
commongen_hard_nohuman.json
数据集来源:
- 链接: CommonGen-lite
评估方法:
- 使用GPT-4评估大型语言模型(LLMs)的约束文本生成能力。
相关研究:
- 详细信息请参阅论文: 论文链接
数据集评估指标:
- 长度: 生成句子的平均单词数
- 覆盖率: 模型输出中包含所有给定概念的示例百分比
- PoS: 模型输出中所有给定概念的词性正确的示例百分比
- win_tie: GPT-4-turbo偏好模型输出超过人类编写的参考文献(或认为它们同样好)的示例百分比
- 总体:
覆盖率%xPoS%xwin_tie%
评估结果:
- 提供了一个模型性能的排行榜,展示了不同模型在各项指标上的表现。
参与评估:
- 通过创建问题或提交请求来提交推理脚本,评估脚本将由维护者运行并更新排行榜。
评估提示模板:
- 提供了两种类型的提示模板:指令提示模板和评估提示模板,用于指导模型生成和评估过程。
案例研究:
- 提供了多个模型生成句子的示例,以及与人类参考的比较分析。



