Name: xx18/R2PE
Creator: xx18
Published: 2024-02-21 09:04:00
License: No description provided

Download link:

https://hf-mirror.com/datasets/xx18/R2PE

Resource description:

---
license: mit
task_categories:
- text-classification
language:
- en
configs:
- config_name: GSM8K
  data_files:
  - split: gpt3
    path: data/gsm8k/text-davinci-003/test.jsonl
  - split: gpt3.5
    path: data/gsm8k/gpt-3.5-turbo-1106/test.jsonl
  - split: gpt_instruct
    path: data/gsm8k/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/gsm8k/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/gsm8k/mixtral-8x7b/test.jsonl
  - split: mistral_medium
    path: data/gsm8k/mistral-medium/test.jsonl
- config_name: MATH
  data_files:
  - split: gpt3
    path: data/math/text-davinci-003/test.jsonl
  - split: gpt3.5
    path: data/math/gpt-3.5-turbo-1106/test.jsonl
  - split: gpt_instruct
    path: data/math/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/math/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/math/mixtral-8x7b/test.jsonl
  - split: mistral_medium
    path: data/math/mistral-medium/test.jsonl
- config_name: StrategyQA
  data_files:
  - split: gpt3
    path: data/StrategyQA/text-davinci-003/test.jsonl
  - split: gpt3.5
    path: data/StrategyQA/gpt-3.5-turbo-1106/test.jsonl
  - split: gpt_instruct
    path: data/StrategyQA/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/StrategyQA/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/StrategyQA/mixtral-8x7b/test.jsonl
  - split: mistral_medium
    path: data/StrategyQA/mistral-medium/test.jsonl
- config_name: Play
  data_files:
  - split: gpt3
    path: data/play/text-davinci-003/test.jsonl
  - split: gpt3.5
    path: data/play/gpt-3.5-turbo-1106/test.jsonl
  - split: gpt_instruct
    path: data/play/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/play/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/play/mixtral-8x7b/test.jsonl
  - split: mistral_medium
    path: data/play/mistral-medium/test.jsonl
- config_name: Physics
  data_files:
  - split: gpt3
    path: data/physics/text-davinci-003/test.jsonl
  - split: gpt3.5
    path: data/physics/gpt-3.5-turbo-1106/test.jsonl
  - split: gpt_instruct
    path: data/physics/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/physics/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/physics/mixtral-8x7b/test.jsonl
  - split: mistral_medium
    path: data/physics/mistral-medium/test.jsonl
- config_name: FEVER
  data_files:
  - split: gpt3
    path: data/Fever/text-davinci-003/test.jsonl
  - split: gpt3.5
    path: data/Fever/gpt-3.5-turbo-1106/test.jsonl
  - split: gpt_instruct
    path: data/Fever/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/Fever/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/Fever/mixtral-8x7b/test.jsonl
- config_name: HotpotQA
  data_files:
  - split: gpt3
    path: data/HotpotQA/text-davinci-003/test.jsonl
  - split: gpt4
    path: data/HotpotQA/gpt-4-0314/test.jsonl
  - split: gpt_instruct
    path: data/HotpotQA/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/HotpotQA/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/HotpotQA/mixtral-8x7b/test.jsonl
- config_name: 2WikiMultihop
  data_files:
  - split: gpt3
    path: data/2WikiMultihop/text-davinci-003/test.jsonl
  - split: gpt4
    path: data/2WikiMultihop/gpt-4-0314/test.jsonl
  - split: gpt_instruct
    path: data/2WikiMultihop/gpt-3.5-turbo-instruct/test.jsonl
  - split: gemini_pro
    path: data/2WikiMultihop/gemini-pro/test.jsonl
  - split: mixtral_8x7b
    path: data/2WikiMultihop/mixtral-8x7b/test.jsonl
pretty_name: R2PE
size_categories:
- 10K<n<100K
---

# Dataset Card for R2PE Benchmark

- GitHub repository: https://github.com/XinXU-USTC/R2PE
- Paper: [Can We Verify Step by Step for Incorrect Answer Detection?](https://arxiv.org/abs/2402.10528)

## Dataset Summary

- This is the R2PE (Relation of Rationales and Performance Evaluation) benchmark.
- It is designed to explore the connection between the quality of reasoning chains and end-task performance.
- We use CoT-SC (chain-of-thought with self-consistency) to collect responses for 8 reasoning tasks spanning 5 domains with various answer formats, using 6 different LLMs.
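The configs in the card metadata expose each task as a config and each LLM as a split. A minimal loading sketch follows, assuming the Hugging Face `datasets` library is installed and that the repository id is `xx18/R2PE` as in the download link. The `SPLITS` table and the `load_r2pe` helper are hypothetical names introduced here for illustration; they only mirror the split lists in the metadata.

```python
from typing import Dict, Tuple

# Splits shared by every config, copied from the card metadata.
COMMON: Tuple[str, ...] = ("gpt3", "gpt_instruct", "gemini_pro", "mixtral_8x7b")

# Hypothetical lookup table: which splits each config declares in the metadata.
SPLITS: Dict[str, Tuple[str, ...]] = {
    "GSM8K": COMMON + ("gpt3.5", "mistral_medium"),
    "MATH": COMMON + ("gpt3.5", "mistral_medium"),
    "StrategyQA": COMMON + ("gpt3.5", "mistral_medium"),
    "Play": COMMON + ("gpt3.5", "mistral_medium"),
    "Physics": COMMON + ("gpt3.5", "mistral_medium"),
    "FEVER": COMMON + ("gpt3.5",),
    "HotpotQA": COMMON + ("gpt4",),
    "2WikiMultihop": COMMON + ("gpt4",),
}


def load_r2pe(config: str, split: str):
    """Load one (task, LLM) slice, e.g. load_r2pe("GSM8K", "gpt3")."""
    if config not in SPLITS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in SPLITS[config]:
        raise ValueError(f"config {config!r} has no split {split!r}")
    # Imported lazily so the lookup table is usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("xx18/R2PE", config, split=split)
```

Validating the split name first gives a clearer error than a failed download, since FEVER, HotpotQA, and 2WikiMultihop declare fewer splits than the other configs.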

| Dataset       | Task Type              | Answer Format   | Domain          |
|---------------|------------------------|-----------------|-----------------|
| GSM8K         | Mathematical Reasoning | Numeric         | Mathematics     |
| MATH          | Mathematical Reasoning | Numeric         | Mathematics     |
| StrategyQA    | Common Sense Reasoning | Yes/No          | Commonsense     |
| Play          | Common Sense Reasoning | Yes/No          | Literature      |
| Physics       | Physical Reasoning     | Multiple Choice | Physics         |
| FEVER         | Fact Verification      | Yes/No          | World Knowledge |
| HotpotQA      | Open-Domain QA         | Free Form       | World Knowledge |
| 2WikiMultihop | Open-Domain QA         | Free Form       | World Knowledge |

## Dataset Structure

### Data Fields

| Field Name   | Value          | Description |
| ------------ | -------------- | ----------- |
| question     | string         | The question or claim from the original dataset that is used to query the LLM. |
| id           | string or int  | The id of 'question' in the original dataset. |
| dataset      | string         | The dataset the question comes from (e.g., FEVER, HotpotQA, or 2WikiMultihop). |
| llm          | string         | The LLM that was queried. |
| responses    | list           | A list of five responses generated by 'llm' for a 'question' from the 'dataset'. Each response contains a rationale and an answer. |
| rationales   | list           | A list of rationales segmented from 'responses'. |
| answers      | list           | A list of answers segmented from 'responses'. |
| output       | string         | The final answer selected from 'answers' by majority voting. |
| ground-truth | list or string | The ground-truth answer or answer list provided for 'question' in the 'dataset'. |
| label        | Boolean        | {True, False}, indicating whether 'output' matches the 'ground-truth'. |

### Data Instances

An example looks as follows:

```python
{'question': 'Which film was released earlier, Navavadhu or The January Man?',
 'id': '5effec28087111ebbd63ac1f6bf848b6',
 'dataset': '2WikiMultihop',
 'llm': 'text-davinci-003',
 'responses': ["First, Navavadhu was released on 15 February 2019. Second, The January Man was released on 17 February 1989. The answer is The January Man.",
               "First, film Navavadhu was released on 17 August 1979. Second, The January Man was released on 24 August 1989. The answer is Navavadhu.",
               "First, film Navavadhu was released on 8 April 1988. Second, The January Man was released on 11 August 1989. The answer is Navavadhu.",
               "First, film Navavadhu was released on 21 August 1992. Second, The January Man was released on 11 August 1989. The answer is The January Man.",
               "First, film Navavadhu was released on 15 February 2019. Second, The January Man was released on 10 February 1989. The answer is The January Man."],
 'rationales': ["First, Navavadhu was released on 15 February 2019. Second, The January Man was released on 17 February 1989.",
                "First, film Navavadhu was released on 17 August 1979. Second, The January Man was released on 24 August 1989.",
                "First, film Navavadhu was released on 8 April 1988. Second, The January Man was released on 11 August 1989.",
                "First, film Navavadhu was released on 21 August 1992. Second, The January Man was released on 11 August 1989.",
                "First, film Navavadhu was released on 15 February 2019. Second, The January Man was released on 10 February 1989."],
 'answers': ["The January Man", "Navavadhu", "Navavadhu", "The January Man", "The January Man"],
 'output': "The January Man",
 'ground-truth': 'Navavadhu',
 'label': False}
```

The statistics for R2PE are as follows.

| Dataset       | Count | GPT3 | GPT-instruct | GPT-3.5 | Gemini | Mixtral | Mistral |
|---------------|-------|------|--------------|---------|--------|---------|---------|
| GSM8K         | FALSE | 510  | 300          | 326     | 246    | 389     | 225     |
|               | total | 1319 | 1319         | 1250    | 1319   | 1278    | 1313    |
| MATH          | FALSE | 827  | 674          | 380     | 697    | 737     | 719     |
|               | total | 998  | 1000         | 1000    | 1000   | 999     | 1000    |
| StrategyQA    | FALSE | 490  | 368          | 399     | 445    | 553     | 479     |
|               | total | 1000 | 1000         | 1000    | 988    | 1000    | 1000    |
| Play          | FALSE | 409  | 454          | 487     | 385    | 634     | 448     |
|               | total | 1000 | 1000         | 1000    | 984    | 1000    | 1000    |
| Physics       | FALSE | 56   | 50           | 70      | 191    | 107     | 109     |
|               | total | 227  | 227          | 227     | 227    | 227     | 227     |
| FEVER         | FALSE | 485  | 432          | 441     | 449    | 570     | -       |
|               | total | 1000 | 1000         | 1000    | 1000   | 1000    | -       |
| HotpotQA      | FALSE | 217  | 175          | 192     | 219    | 199     | -       |
|               | total | 308  | 308          | 308     | 308    | 308     | -       |
| 2WikiMultihop | FALSE | 626  | 598          | 401     | 629    | 562     | -       |
|               | total | 1000 | 1000         | 1000    | 1000   | 1000    | -       |

### Citation Information

```bibtex
@misc{xu2024verify,
      title={Can We Verify Step by Step for Incorrect Answer Detection?},
      author={Xin Xu and Shizhe Diao and Can Yang and Yang Wang},
      year={2024},
      eprint={2402.10528},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
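
The card describes `output` as a majority vote over `answers` and `label` as whether `output` matches `ground-truth`. The sketch below shows how those two fields could be recomputed for the example record, and how a FALSE/total pair from the statistics table turns into an error rate. Two points are assumptions, not taken from the card: ties are resolved in favor of the answer seen first, and matching is a case-insensitive exact comparison (the benchmark's own matching rule may differ, especially for free-form answers). The helper names are hypothetical.

```python
from collections import Counter
from typing import List, Union


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first (assumption)."""
    return Counter(answers).most_common(1)[0][0]


def matches(output: str, ground_truth: Union[str, List[str]]) -> bool:
    """Case-insensitive exact match against one answer or any answer in a list (assumption)."""
    candidates = [ground_truth] if isinstance(ground_truth, str) else ground_truth
    normalized = output.strip().lower()
    return any(normalized == str(c).strip().lower() for c in candidates)


# Values copied from the example instance in the card.
record = {
    "answers": ["The January Man", "Navavadhu", "Navavadhu",
                "The January Man", "The January Man"],
    "ground-truth": "Navavadhu",
}

output = majority_vote(record["answers"])
label = matches(output, record["ground-truth"])
print(output, label)  # The January Man False

# Share of incorrect outputs for GSM8K with GPT3: 510 FALSE out of 1319 total.
false_rate = 510 / 1319
print(f"{false_rate:.1%}")  # 38.7%
```

Three of the five sampled answers are "The January Man", so the vote selects it even though the ground truth is "Navavadhu", which is why the example carries `label: False`.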
