huawei-noah/human_rank_eval

Name: huawei-noah/human_rank_eval
Creator: huawei-noah
Published: 2024-07-25 08:30:51
License: No description available

Source: Hugging Face. Updated 2024-07-25. Indexed 2025-04-08.

Download link:

https://hf-mirror.com/datasets/huawei-noah/human_rank_eval


Resource description:

---
language:
- en
license: mit
size_categories:
- 1K<n<10K
task_categories:
- text-generation
- question-answering
configs:
- config_name: default
  data_files:
  - split: HumanRankEvalSoftEng
    path: data/HumanRankEvalSoftEng-*
  - split: HumanRankEvalLanguagesSciences
    path: data/HumanRankEvalLanguagesSciences-*
  - split: HumanRankEvalEnglish
    path: data/HumanRankEvalEnglish-*
  - split: HumanRankEvalMath
    path: data/HumanRankEvalMath-*
  - split: HumanRankEvalUnix
    path: data/HumanRankEvalUnix-*
  - split: HumanRankEvalCPP
    path: data/HumanRankEvalCPP-*
  - split: HumanRankEvalJava
    path: data/HumanRankEvalJava-*
  - split: HumanRankEvalHTML
    path: data/HumanRankEvalHTML-*
  - split: HumanRankEvalAppleAndroid
    path: data/HumanRankEvalAppleAndroid-*
  - split: HumanRankEvalPhysics
    path: data/HumanRankEvalPhysics-*
  - split: HumanRankEvalCSDB
    path: data/HumanRankEvalCSDB-*
  - split: HumanRankEvalPython
    path: data/HumanRankEvalPython-*
  - split: HumanRankEvalStats
    path: data/HumanRankEvalStats-*
  - split: HumanRankEvalLaTeX
    path: data/HumanRankEvalLaTeX-*
dataset_info:
  features:
  - name: question
    dtype: string
  - name: answers
    list:
    - name: text
      dtype: string
    - name: votes
      dtype: string
  splits:
  - name: HumanRankEvalSoftEng
    num_bytes: 1953762
    num_examples: 500
  - name: HumanRankEvalLanguagesSciences
    num_bytes: 2088240
    num_examples: 500
  - name: HumanRankEvalEnglish
    num_bytes: 1253540
    num_examples: 500
  - name: HumanRankEvalMath
    num_bytes: 1794319
    num_examples: 500
  - name: HumanRankEvalUnix
    num_bytes: 1715449
    num_examples: 500
  - name: HumanRankEvalCPP
    num_bytes: 1610271
    num_examples: 500
  - name: HumanRankEvalJava
    num_bytes: 1603095
    num_examples: 500
  - name: HumanRankEvalHTML
    num_bytes: 1415909
    num_examples: 500
  - name: HumanRankEvalAppleAndroid
    num_bytes: 1447166
    num_examples: 500
  - name: HumanRankEvalPhysics
    num_bytes: 2593234
    num_examples: 500
  - name: HumanRankEvalCSDB
    num_bytes: 2391929
    num_examples: 500
  - name: HumanRankEvalPython
    num_bytes: 1493471
    num_examples: 500
  - name: HumanRankEvalStats
    num_bytes: 2410621
    num_examples: 500
  - name: HumanRankEvalLaTeX
    num_bytes: 2125300
    num_examples: 500
  download_size: 15235919
  dataset_size: 25896306
---

# Dataset Card for HumanRankEval

This dataset supports the NAACL 2024 paper **[HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants](https://aclanthology.org/2024.naacl-long.456/)**.

### Dataset Description

Language models (LMs) as conversational assistants recently became popular tools that help people accomplish a variety of tasks. These typically result from adapting LMs pretrained on general domain text sequences through further instruction-tuning and possibly preference optimisation methods. The evaluation of such LMs would ideally be performed using human judgement, however, this is not scalable. On the other hand, automatic evaluation featuring auxiliary LMs as judges and/or knowledge-based tasks is scalable but struggles with assessing conversational ability and adherence to instructions. To help accelerate the development of LMs as conversational assistants, we propose a novel automatic evaluation task: HumanRankEval (HRE). It consists of a large-scale, diverse and high-quality set of questions, each with several answers authored and scored by humans. To perform evaluation, HRE ranks these answers based on their log-likelihood under the LM’s distribution, and subsequently calculates their correlation with the corresponding human rankings. We support HRE’s efficacy by investigating how efficiently it separates pretrained and instruction-tuned LMs of various sizes. We show that HRE correlates well with human judgements and is particularly responsive to model changes following instruction-tuning.

- **Curated by:** Milan Gritta
- **Shared by:** Huawei (London Research Centre)
- **Language(s) (NLP):** Almost all topics are in English.
- **License:** MIT

### Dataset Sources

The data for HumanRankEval was sourced from **StackExchange** and **StackOverflow**.

- **Repository:** [Github Link](https://github.com/huawei-noah/noah-research/tree/master/NLP/HumanRankEval) - visit for code and instructions! Thanks.
- **Paper:** [HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants](https://arxiv.org/pdf/2405.09186)

## Dataset Structure

HumanRankEval contains 14 topics, see paper link above for full details.

## Citation

```
@inproceedings{gritta-etal-2024-humanrankeval,
    title = "{H}uman{R}ank{E}val: Automatic Evaluation of {LM}s as Conversational Assistants",
    author = "Gritta, Milan and Lampouras, Gerasimos and Iacobacci, Ignacio",
    editor = "Duh, Kevin and Gomez, Helena and Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.456",
    pages = "8237--8249",
}
```
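
The evaluation idea in the Dataset Description (rank each question's answers by their log-likelihood under the LM, then correlate that ranking with the human ranking) can be sketched for a single question as follows. This is a minimal illustration, not the authors' implementation: the choice of Spearman rank correlation is an assumption (the card only says "correlation"), the helper names `average_ranks`, `pearson`, `spearman` and `hre_question_score` are hypothetical, and the toy record and log-likelihood values are made up. The record layout (`question`, `answers` with `text` and string-typed `votes`) follows the `dataset_info` features listed in the card.

```python
import math
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every entry tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant (an assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def hre_question_score(record: Dict, log_likelihoods: Sequence[float]) -> float:
    """Correlate human votes with model log-likelihoods for one question."""
    # `votes` is stored as a string in the dataset schema, so convert it first.
    votes = [float(answer["votes"]) for answer in record["answers"]]
    if len(votes) != len(log_likelihoods):
        raise ValueError("need exactly one log-likelihood per answer")
    return spearman(votes, log_likelihoods)


# Toy record in the dataset's schema; the content is invented for illustration.
example = {
    "question": "How do I reverse a list in Python?",
    "answers": [
        {"text": "Use reversed(xs) or xs[::-1].", "votes": "42"},
        {"text": "Write a loop that swaps elements.", "votes": "7"},
        {"text": "Lists cannot be reversed.", "votes": "-3"},
    ],
}
# Made-up log-likelihoods standing in for a real LM's scores of each answer.
made_up_log_likelihoods = [-12.5, -20.1, -35.0]
score = hre_question_score(example, made_up_log_likelihoods)
print(score)
```

In practice the records would come from one of the splits named in the card (for example `HumanRankEvalPython`), and the log-likelihoods from scoring each answer text with the LM under evaluation. How per-question scores are aggregated into a final number is not stated in the card, so it is left out here; the repository linked above has the actual procedure.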

Provided by:

huawei-noah
