
xx18/R2PE

Hugging Face, updated 2024-02-21, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/xx18/R2PE
Resource description:
---
license: mit
task_categories:
- text-classification
language:
- en
configs:
- config_name: GSM8K
  data_files:
  - split: gpt3
    path: data/gsm8k/text-davinci-003/test.jsonl
  - split: gpt3.5
    path: data/gsm8k/gpt-3.5-turbo-1106/test.jsonl
  - split: gpt_instruct
    path: data/gsm8k/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/gsm8k/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/gsm8k/mixtral-8x7b/test.jsonl
  - split: mistral_medium
    path: data/gsm8k/mistral-medium/test.jsonl
- config_name: MATH
  data_files:
  - split: gpt3
    path: data/math/text-davinci-003/test.jsonl
  - split: gpt3.5
    path: data/math/gpt-3.5-turbo-1106/test.jsonl
  - split: gpt_instruct
    path: data/math/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/math/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/math/mixtral-8x7b/test.jsonl
  - split: mistral_medium
    path: data/math/mistral-medium/test.jsonl
- config_name: StrategyQA
  data_files:
  - split: gpt3
    path: data/StrategyQA/text-davinci-003/test.jsonl
  - split: gpt3.5
    path: data/StrategyQA/gpt-3.5-turbo-1106/test.jsonl
  - split: gpt_instruct
    path: data/StrategyQA/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/StrategyQA/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/StrategyQA/mixtral-8x7b/test.jsonl
  - split: mistral_medium
    path: data/StrategyQA/mistral-medium/test.jsonl
- config_name: Play
  data_files:
  - split: gpt3
    path: data/play/text-davinci-003/test.jsonl
  - split: gpt3.5
    path: data/play/gpt-3.5-turbo-1106/test.jsonl
  - split: gpt_instruct
    path: data/play/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/play/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/play/mixtral-8x7b/test.jsonl
  - split: mistral_medium
    path: data/play/mistral-medium/test.jsonl
- config_name: Physics
  data_files:
  - split: gpt3
    path: data/physics/text-davinci-003/test.jsonl
  - split: gpt3.5
    path: data/physics/gpt-3.5-turbo-1106/test.jsonl
  - split: gpt_instruct
    path: data/physics/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/physics/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/physics/mixtral-8x7b/test.jsonl
  - split: mistral_medium
    path: data/physics/mistral-medium/test.jsonl
- config_name: FEVER
  data_files:
  - split: gpt3
    path: data/Fever/text-davinci-003/test.jsonl
  - split: gpt3.5
    path: data/Fever/gpt-3.5-turbo-1106/test.jsonl
  - split: gpt_instruct
    path: data/Fever/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/Fever/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/Fever/mixtral-8x7b/test.jsonl
- config_name: HotpotQA
  data_files:
  - split: gpt3
    path: data/HotpotQA/text-davinci-003/test.jsonl
  - split: gpt4
    path: data/HotpotQA/gpt-4-0314/test.jsonl
  - split: gpt_instruct
    path: data/HotpotQA/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/HotpotQA/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/HotpotQA/mixtral-8x7b/test.jsonl
- config_name: 2WikiMultihop
  data_files:
  - split: gpt3
    path: data/2WikiMultihop/text-davinci-003/test.jsonl
  - split: gpt4
    path: data/2WikiMultihop/gpt-4-0314/test.jsonl
  - split: gpt_instruct
    path: data/2WikiMultihop/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/2WikiMultihop/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/2WikiMultihop/mixtral-8x7b/test.jsonl
pretty_name: R2PE
size_categories:
- 10K<n<100K
---

# Dataset Card for R2PE Benchmark

- GitHub repository: https://github.com/XinXU-USTC/R2PE
- Paper: [Can We Verify Step by Step for Incorrect Answer Detection?](https://arxiv.org/abs/2402.10528)

## Dataset Summary

- This is the R2PE (Relation of Rationales and Performance Evaluation) Benchmark.
- The aim is to explore the connection between the quality of reasoning chains and end-task performance.
- We use CoT-SC to collect responses from 8 reasoning tasks spanning 5 domains with various answer formats, using 6 different LLMs.
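
The configs and splits declared in the card metadata can be selected by name when loading. The following is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` package is installed, that the Hub repository ID is `xx18/R2PE` (taken from the page title), and that the split lists copied from the metadata above are still current. `SPLITS` and `load_r2pe` are hypothetical helper names introduced here for illustration.

```python
# Split names per config, copied from the `configs` metadata of this card.
COMMON_SPLITS = ["gpt3", "gpt3.5", "gpt_instruct", "gemini_pro", "mixtral_8x7b"]
GPT4_SPLITS = ["gpt3", "gpt4", "gpt_instruct", "gemini_pro", "mixtral_8x7b"]

SPLITS = {
    "GSM8K": COMMON_SPLITS + ["mistral_medium"],
    "MATH": COMMON_SPLITS + ["mistral_medium"],
    "StrategyQA": COMMON_SPLITS + ["mistral_medium"],
    "Play": COMMON_SPLITS + ["mistral_medium"],
    "Physics": COMMON_SPLITS + ["mistral_medium"],
    "FEVER": COMMON_SPLITS,          # no mistral_medium split listed
    "HotpotQA": GPT4_SPLITS,         # lists gpt4 instead of gpt3.5
    "2WikiMultihop": GPT4_SPLITS,    # lists gpt4 instead of gpt3.5
}


def load_r2pe(config: str, split: str, repo_id: str = "xx18/R2PE"):
    """Load one (config, split) pair after checking it against the card metadata."""
    if split not in SPLITS.get(config, []):
        raise ValueError(f"split {split!r} is not listed for config {config!r}")
    # Imported lazily so the validation above works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(repo_id, config, split=split)
```

For example, `load_r2pe("GSM8K", "gpt3")` would request the `text-davinci-003` responses for GSM8K; this needs network access to the Hub (or a mirror).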

| Dataset       | Task Type              | Answer Format   | Domain          |
|---------------|------------------------|-----------------|-----------------|
| GSM8K         | Mathematical Reasoning | Numeric         | Mathematics     |
| MATH          | Mathematical Reasoning | Numeric         | Mathematics     |
| StrategyQA    | Common Sense Reasoning | Yes/No          | Commonsense     |
| Play          | Common Sense Reasoning | Yes/No          | Literature      |
| Physics       | Physical Reasoning     | Multiple Choice | Physics         |
| FEVER         | Fact Verification      | Yes/No          | World Knowledge |
| HotpotQA      | Open-Domain QA         | Free Form       | World Knowledge |
| 2WikiMultihop | Open-Domain QA         | Free Form       | World Knowledge |

## Dataset Structure

### Data Fields

| Field Name   | Value          | Description |
| ------------ | -------------- | ----------- |
| question     | string         | The question or claim from the original dataset used to query the LLM. |
| id           | string or int  | ID of 'question' in the original dataset. |
| dataset      | string         | The dataset the question comes from (for example FEVER, HotpotQA, or 2WikiMultihop). |
| llm          | string         | The LLM that was queried. |
| responses    | list           | A list of five responses generated by 'llm' for a 'question' from the 'dataset'. Each response contains a rationale and an answer. |
| rationales   | list           | A list of rationales segmented from 'responses'. |
| answers      | list           | A list of answers segmented from 'responses'. |
| output       | string         | The final answer selected from 'answers' by majority voting. |
| ground-truth | list or string | The ground-truth answer or answer list provided for 'question' in the 'dataset'. |
| label        | Boolean        | {True, False}, indicating whether 'output' matches the 'ground-truth'. |

### Data Instances

An example looks as follows:

```python
{'question': 'Which film was released earlier, Navavadhu or The January Man?',
 'id': '5effec28087111ebbd63ac1f6bf848b6',
 'dataset': '2WikiMultihop',
 'llm': 'text-davinci-003',
 'responses': ["First, Navavadhu was released on 15 February 2019. Second, The January Man was released on 17 February 1989. The answer is The January Man.",
               "First, film Navavadhu was released on 17 August 1979. Second, The January Man was released on 24 August 1989. The answer is Navavadhu.",
               "First, film Navavadhu was released on 8 April 1988. Second, The January Man was released on 11 August 1989. The answer is Navavadhu.",
               "First, film Navavadhu was released on 21 August 1992. Second, The January Man was released on 11 August 1989. The answer is The January Man.",
               "First, film Navavadhu was released on 15 February 2019. Second, The January Man was released on 10 February 1989. The answer is The January Man."],
 'rationales': ["First, Navavadhu was released on 15 February 2019. Second, The January Man was released on 17 February 1989.",
                "First, film Navavadhu was released on 17 August 1979. Second, The January Man was released on 24 August 1989.",
                "First, film Navavadhu was released on 8 April 1988. Second, The January Man was released on 11 August 1989.",
                "First, film Navavadhu was released on 21 August 1992. Second, The January Man was released on 11 August 1989.",
                "First, film Navavadhu was released on 15 February 2019. Second, The January Man was released on 10 February 1989."],
 'answers': ["The January Man", "Navavadhu", "Navavadhu", "The January Man", "The January Man"],
 'output': "The January Man",
 'ground-truth': 'Navavadhu',
 'label': False}
```

The statistics for R2PE are as follows.

| Dataset       | Method | GPT3 | GPT-instruct | GPT-3.5 | Gemini | Mixtral | Mistral |
|---------------|--------|------|--------------|---------|--------|---------|---------|
| GSM8K         | FALSE  | 510  | 300          | 326     | 246    | 389     | 225     |
|               | total  | 1319 | 1319         | 1250    | 1319   | 1278    | 1313    |
| MATH          | FALSE  | 827  | 674          | 380     | 697    | 737     | 719     |
|               | total  | 998  | 1000         | 1000    | 1000   | 999     | 1000    |
| StrategyQA    | FALSE  | 490  | 368          | 399     | 445    | 553     | 479     |
|               | total  | 1000 | 1000         | 1000    | 988    | 1000    | 1000    |
| Play          | FALSE  | 409  | 454          | 487     | 385    | 634     | 448     |
|               | total  | 1000 | 1000         | 1000    | 984    | 1000    | 1000    |
| Physics       | FALSE  | 56   | 50           | 70      | 191    | 107     | 109     |
|               | total  | 227  | 227          | 227     | 227    | 227     | 227     |
| FEVER         | FALSE  | 485  | 432          | 441     | 449    | 570     | -       |
|               | total  | 1000 | 1000         | 1000    | 1000   | 1000    | -       |
| HotpotQA      | FALSE  | 217  | 175          | 192     | 219    | 199     | -       |
|               | total  | 308  | 308          | 308     | 308    | 308     | -       |
| 2WikiMultihop | FALSE  | 626  | 598          | 401     | 629    | 562     | -       |
|               | total  | 1000 | 1000         | 1000    | 1000   | 1000    | -       |

### Citation Information

```bibtex
@misc{xu2024verify,
      title={Can We Verify Step by Step for Incorrect Answer Detection?},
      author={Xin Xu and Shizhe Diao and Can Yang and Yang Wang},
      year={2024},
      eprint={2402.10528},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
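
As a rough illustration of how the `output` and `label` fields relate to `answers` and `ground-truth`, and of how the FALSE counts in the statistics table could be turned into a rate, here is a small sketch. It rests on assumptions the card does not state: ties in the vote go to the answer seen first, and correctness is an exact string match against the ground truth (or membership in the ground-truth list). The authors' actual matching rules may differ. The helper names `majority_vote`, `is_correct`, and `false_rate` are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def is_correct(output, ground_truth):
    """Exact-match check (an assumption); ground truth may be a string or a list."""
    accepted = ground_truth if isinstance(ground_truth, list) else [ground_truth]
    return output in accepted


def false_rate(records):
    """Fraction of records whose 'label' is False."""
    return sum(1 for r in records if not r["label"]) / len(records)


# The 'answers' and 'ground-truth' values from the data instance in this card.
example = {
    "answers": ["The January Man", "Navavadhu", "Navavadhu",
                "The January Man", "The January Man"],
    "ground-truth": "Navavadhu",
}
output = majority_vote(example["answers"])
label = is_correct(output, example["ground-truth"])
print(output, label)
```

On the instance shown in the card, three of the five answers are "The January Man", so the vote selects it and the label comes out `False`, matching the `output` and `label` values listed there.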
Provider:
xx18
