Nutanix/cpp_unit_tests_processed_tinyllama_vs_tinyllama_finetuned_gpt_judge

Name: Nutanix/cpp_unit_tests_processed_tinyllama_vs_tinyllama_finetuned_gpt_judge
Creator: Nutanix
Published: 2024-07-30 22:20:07
License: No description available

Source: Hugging Face. Updated: 2024-07-30. Indexed: 2025-04-08.

Download link:

https://hf-mirror.com/datasets/Nutanix/cpp_unit_tests_processed_tinyllama_vs_tinyllama_finetuned_gpt_judge


Description:

---
dataset_info:
  features:
  - name: Code
    dtype: string
  - name: Unit Test_tinyllama
    dtype: string
  - name: Unit Test_tinyllama_finetuned
    dtype: string
  - name: Unit Test
    dtype: string
  - name: Winning Model
    dtype: string
  - name: Judgement
    dtype: string
  splits:
  - name: train
    num_bytes: 1598198
    num_examples: 212
  download_size: 495600
  dataset_size: 1598198
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Unit Test Evaluation Results

This repository details the evaluation of unit tests generated by the base TinyLlama model and a finetuned TinyLlama model. The unit tests produced by each model are compared against the [groundtruth data](https://huggingface.co/datasets/Nutanix/cpp_unit_tests_processed_data). gpt-4o-mini served as the judge, assessing how closely the unit tests from each model aligned with the ground truth.

## Models Used

### [TinyLLaMA](https://huggingface.co/Doctor-Shotgun/TinyLlama-1.1B-32k-Instruct)

- **HuggingFace Link**: [TinyLlama-1.1B-32k-Instruct](https://huggingface.co/Doctor-Shotgun/TinyLlama-1.1B-32k-Instruct)
- **Precision**: BF16
- **Description**: Base instruct model used to generate the unit tests.

### TinyLLaMA - finetuned

- **HuggingFace Link**: [Finetuned LoRA adapter](https://huggingface.co/Nutanix/TinyLlama-1.1B-32k-Instruct_cppunittestprocessed_lora_16_alpha_16)
- **Finetune Settings**: LoRA rank = 8, alpha = 16; the TinyLlama model was finetuned for 2 epochs on [this](https://huggingface.co/datasets/Nutanix/cpp_unit_tests_processed_data_chat_format) dataset.
- **Description**: Finetuned model whose unit tests were compared against those generated by the base model.

## Dataset

The evaluation used the `val` split of the [processed cpp benchmarking dataset](https://huggingface.co/datasets/Nutanix/cpp_unit_tests_processed_data) as the ground truth.

### Dataset Structure

The dataset can be loaded as follows; the printed structure is shown in the trailing comments:

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("Nutanix/cpp_unit_tests_processed_tinyllama_vs_tinyllama_finetuned_gpt_judge")

# View the dataset structure
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['Code', 'Unit Test_tinyllama', 'Unit Test_tinyllama_finetuned', 'Unit Test', 'Winning Model', 'Judgement'],
#         num_rows: 212
#     })
# })
```

## Features:

- **Code**: The source code for which the unit tests are written.
- **Unit Test_tinyllama**: Unit test generated by the TinyLlama 1.1B model.
- **Unit Test_tinyllama_finetuned**: Unit test generated by the finetuned TinyLlama 1.1B model.
- **Unit Test**: The benchmark (ground truth) unit test.
- **Winning Model**: The model whose unit test is closer to the ground truth.
- **Judgement**: The judge's evaluation comparing the two unit tests.

## Unit Test Evaluation Results

The results are summarized in the table below:

| Outcome                 | Count |
|-------------------------|-------|
| Tinyllama1.1B finetuned | 165   |
| Tinyllama1.1B           | 36    |
| Tie                     | 11    |

### Explanation

1. Tinyllama1.1B finetuned wins: the finetuned Tinyllama1.1B model aligned more closely with the ground truth in 165 cases.
2. Tinyllama1.1B wins: the base Tinyllama1.1B model aligned more closely with the ground truth in 36 cases.
3. Tie: the two models were judged equal in 11 cases.

### Win Rates

- Tinyllama1.1B finetuned Win Percentage: 77.8%
- Tinyllama1.1B Win Percentage: 17.0%
- Tie Percentage: 5.2%

### Framework to generate unit test

<img src="https://cdn-uploads.huggingface.co/production/uploads/6658bb3acf5fc31e3a0bd24a/nFUDNtFeAukk_qLZL24F6.png" alt="image/png" width="600" height="400"/>

### Evaluation Approach

[gpt-4o-mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) was used as the judge to evaluate which unit test was closer to the ground truth provided by the benchmark dataset.
This evaluation highlights the performance difference between the two models and indicates that the finetuned TinyLlama model aligns more closely with the benchmark unit tests.

Prompt used for evaluation: [Evaluation Prompt](https://huggingface.co/datasets/Nutanix/cpp_unittests_llama8b_vs_llama70b_judge_llama70/blob/main/config_evaluator.yaml)
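
The win rates reported above follow directly from the outcome counts (165, 36, and 11 out of 212). As a minimal sketch of that arithmetic, the snippet below tallies a list of outcome labels and converts the counts to percentages. The helper name `win_rates` and the label strings are illustrative assumptions; the values actually stored in the `Winning Model` column may be spelled differently.

```python
from collections import Counter


def win_rates(labels):
    """Return {label: (count, percentage)} for an iterable of outcome labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}


# Rebuild the outcome column from the counts reported in the results table.
# The label strings are placeholders (assumptions), not the dataset's own values.
reported = {"tinyllama_finetuned": 165, "tinyllama": 36, "tie": 11}
labels = [label for label, n in reported.items() for _ in range(n)]

rates = win_rates(labels)
for label, (n, pct) in rates.items():
    print(f"{label}: {n} ({pct}%)")
```

With the dataset loaded as shown in the Dataset Structure section, passing the values of the `Winning Model` column from the `train` split to `win_rates` should give the same tallies, assuming the column holds one outcome label per row.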

Provider:

Nutanix
