Nutanix/cpp_unit_tests_unprocessed_llama3.1_vs_llama3.1_finetuned_gpt_judge

Name: Nutanix/cpp_unit_tests_unprocessed_llama3.1_vs_llama3.1_finetuned_gpt_judge
Creator: Nutanix
Published: 2024-07-31 20:58:39
License: No description provided

Source: Hugging Face | Updated: 2024-07-31 | Indexed: 2025-04-08

Download link:

https://hf-mirror.com/datasets/Nutanix/cpp_unit_tests_unprocessed_llama3.1_vs_llama3.1_finetuned_gpt_judge

Resource overview:

---
dataset_info:
  features:
  - name: Code
    dtype: string
  - name: Unit Test_llama3.1
    dtype: string
  - name: Unit Test_llama3.1_finetuned
    dtype: string
  - name: Unit Test
    dtype: string
  - name: Winning Model
    dtype: string
  - name: Judgement
    dtype: string
  splits:
  - name: train
    num_bytes: 10853630
    num_examples: 201
  download_size: 2697802
  dataset_size: 10853630
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Unit Test Evaluation Results

This repository details the evaluation of unit tests generated by Llama models. It compares the unit tests produced by two models, Llama3.1 8B Instruct and a finetuned Llama3.1 8B Instruct, against the [ground truth data](https://huggingface.co/datasets/Nutanix/cpp-unit-test-benchmarking-dataset). In this evaluation, gpt-4o-mini served as the judge, assessing how closely the unit tests from each model aligned with the ground truth.

## Models Used

### [Llama3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)

- **HuggingFace Link**: [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
- **Precision**: BF16
- **Description**: Base instruct model used to generate the unit tests.

### Llama3.1 8B Instruct - finetuned

- **HuggingFace Link**: [Finetuned LoRA adapter](https://huggingface.co/Nutanix/Meta-Llama-3.1-8B-Instruct_cppunittest_lora_8_alpha_16)
- **Finetune Settings**: LoRA rank = 8, alpha = 16; the Llama3.1 8B Instruct model was finetuned for 2 epochs on [this](https://huggingface.co/datasets/Nutanix/cpp_unit_tests_finetuning_dataset_chat_format_less_than_8k) dataset.
- **Description**: A finetuned model whose unit tests were compared against those generated by the base model.

## Dataset

The evaluation used the [cpp unit test benchmarking dataset](https://huggingface.co/datasets/Nutanix/cpp_unit_tests_eval_dataset_less_than_8k_extracted_code) as the ground truth.
### Dataset Structure

The dataset can be loaded and inspected as follows:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Nutanix/cpp_unit_tests_unprocessed_llama3.1_vs_llama3.1_finetuned_gpt_judge")

# View dataset structure (the only split is "train")
print(dataset["train"])
# Dataset({
#     features: ['Code', 'Unit Test_llama3.1', 'Unit Test_llama3.1_finetuned', 'Unit Test', 'Winning Model', 'Judgement'],
#     num_rows: 201
# })
```

## Features:

- **Code**: The source code for which the unit tests are written.
- **Unit Test_llama3.1**: Unit test generated by the Llama3.1 8B Instruct model.
- **Unit Test_llama3.1_finetuned**: Unit test generated by the finetuned Llama3.1 8B Instruct model.
- **Unit Test**: The benchmark (ground truth) unit test.
- **Winning Model**: The model whose unit test is closer to the ground truth.
- **Judgement**: The judge's evaluation comparing the unit tests.

The results are summarized in the table below:

## Unit Test Evaluation Results

| Outcome                         | Count |
|---------------------------------|-------|
| Llama3.1-8b Instruct finetuned  | 105   |
| Llama3.1-8b Instruct            | 87    |
| Tie                             | 9     |

### Explanation

1. Llama3.1-8b Instruct finetuned wins: the finetuned model aligned more closely with the ground truth in 105 cases.
2. Llama3.1-8b Instruct wins: the base model aligned more closely with the ground truth in 87 cases.
3. Tie: in 9 cases the judge rated the two models as tied.

### Win Rates

- Llama3.1-8b Instruct finetuned Win Percentage: 52.2%
- Llama3.1-8b Instruct Win Percentage: 43.3%
- Tie Percentage: 4.5%

### Framework to generate unit test

<img src="https://cdn-uploads.huggingface.co/production/uploads/6658bb3acf5fc31e3a0bd24a/nFUDNtFeAukk_qLZL24F6.png" alt="image/png" width="600" height="400"/>

### Evaluation Approach

[gpt-4o-mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) was used as the judge to evaluate which unit test was closer to the ground truth provided by the benchmark dataset.
This evaluation highlights the performance differences between the two models and indicates that the finetuned Llama3.1 model aligns more closely with the benchmark unit tests.

Prompt used for evaluation: [Evaluation Prompt](https://huggingface.co/datasets/Nutanix/cpp_unittests_llama8b_vs_llama70b_judge_llama70/blob/main/config_evaluator.yaml)
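
The win rates reported above follow directly from the counts in the results table (201 rows in total). The snippet below is a minimal sketch of that arithmetic, assuming the `Winning Model` column holds one label per row. The label strings used here are hypothetical placeholders, since the actual values stored in the column are not listed in this card.

```python
from collections import Counter


def win_rates(labels):
    """Return {label: (count, percentage)} for an iterable of winner labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        label: (count, round(100 * count / total, 1))
        for label, count in counts.items()
    }


# Hypothetical label strings; the counts (105 / 87 / 9) come from the
# results table in this card.
labels = (
    ["llama3.1_finetuned"] * 105
    + ["llama3.1"] * 87
    + ["tie"] * 9
)

rates = win_rates(labels)
for label, (count, pct) in rates.items():
    print(f"{label}: {count} ({pct}%)")
```

With the dataset loaded as shown earlier, passing the `Winning Model` column of the `train` split to `win_rates` should tally whatever labels the column actually contains; if they differ from the placeholders above, only the dictionary keys change.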

Provider:

Nutanix
