Nutanix/cpp_unit_tests_unprocessed_llama3.1_vs_llama3.1_finetuned_gpt_judge
Source: Hugging Face. Updated 2024-07-31. Indexed 2025-04-08.
Download link:
https://hf-mirror.com/datasets/Nutanix/cpp_unit_tests_unprocessed_llama3.1_vs_llama3.1_finetuned_gpt_judge
Resource description:
---
dataset_info:
  features:
  - name: Code
    dtype: string
  - name: Unit Test_llama3.1
    dtype: string
  - name: Unit Test_llama3.1_finetuned
    dtype: string
  - name: Unit Test
    dtype: string
  - name: Winning Model
    dtype: string
  - name: Judgement
    dtype: string
  splits:
  - name: train
    num_bytes: 10853630
    num_examples: 201
  download_size: 2697802
  dataset_size: 10853630
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Unit Test Evaluation Results
This repository details the evaluation of unit tests generated by Llama models. It compares unit tests produced by two models, Llama3.1 8B Instruct and a finetuned Llama3.1 8B Instruct, against the [ground truth data](https://huggingface.co/datasets/Nutanix/cpp-unit-test-benchmarking-dataset). gpt-4o-mini served as the judge, assessing how closely each model's unit tests aligned with the ground truth.
## Models Used
### [Llama3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
- **HuggingFace Link**: [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
- **Precision**: BF16
- **Description**: Base instruct model used to generate the unit tests.
### Llama3.1 8B Instruct - finetuned
- **HuggingFace Link**: [Finetuned LoRA adapter](https://huggingface.co/Nutanix/Meta-Llama-3.1-8B-Instruct_cppunittest_lora_8_alpha_16)
- **Finetune Settings**: LoRA rank = 8, alpha = 16. The Llama3.1 8B Instruct model was finetuned for 2 epochs on [this](https://huggingface.co/datasets/Nutanix/cpp_unit_tests_finetuning_dataset_chat_format_less_than_8k) dataset.
- **Description**: A finetuned model whose unit tests were compared against those generated by the base model.
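
To make the adapter settings concrete, the sketch below works out two quantities that follow from rank 8 and alpha 16: the scaling factor applied to the low-rank update (alpha divided by rank, as in the standard LoRA formulation) and the number of trainable parameters a rank-8 adapter adds to a single weight matrix. The 4096 x 4096 matrix shape is an assumption for illustration only; this card does not state which modules were adapted.

```python
# Settings reported in this card.
LORA_RANK = 8
LORA_ALPHA = 16


def lora_scaling(alpha: int, rank: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA: alpha / rank."""
    return alpha / rank


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for one adapted matrix.

    The update is B @ A, where B is (d_out x rank) and A is (rank x d_in).
    """
    return rank * (d_out + d_in)


# Assumed matrix shape, for illustration only (not stated in the card).
D_OUT, D_IN = 4096, 4096

scaling = lora_scaling(LORA_ALPHA, LORA_RANK)
adapter_params = lora_param_count(D_OUT, D_IN, LORA_RANK)
full_params = D_OUT * D_IN

print(f"scaling = {scaling}")
print(f"adapter params per matrix = {adapter_params}")
print(f"fraction of full matrix = {adapter_params / full_params:.4%}")
```

Under these assumptions the adapter trains well under 1% of the parameters of each adapted matrix.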
## Dataset
The evaluation utilized the [cpp unit test benchmarking dataset](https://huggingface.co/datasets/Nutanix/cpp_unit_tests_eval_dataset_less_than_8k_extracted_code) as the ground truth.
### Dataset Structure
The dataset can be loaded and inspected as follows:
```python
from datasets import load_dataset

# Load the dataset (returns a DatasetDict with a single "train" split)
dataset = load_dataset("Nutanix/cpp_unit_tests_unprocessed_llama3.1_vs_llama3.1_finetuned_gpt_judge")

# View the structure of the train split
print(dataset["train"])
# Dataset({
#     features: ['Code', 'Unit Test_llama3.1', 'Unit Test_llama3.1_finetuned', 'Unit Test', 'Winning Model', 'Judgement'],
#     num_rows: 201
# })
```
## Features:
- **Code**: The source code for which the unit tests are written.
- **Unit Test_llama3.1**: Unit test generated by the Llama3.1 8B Instruct model.
- **Unit Test_llama3.1_finetuned**: Unit test generated by the finetuned Llama3.1 8B Instruct model.
- **Unit Test**: The benchmark or ground truth unit test.
- **Winning Model**: The model whose unit test is closer to the ground truth.
- **Judgement**: The evaluation results comparing the unit tests.
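
As a minimal sketch of how the `Winning Model` column could be aggregated into per-outcome counts, the snippet below tallies labels over a list of row dictionaries. The rows and the label strings (`model_a`, `model_b`, `tie`) are hypothetical placeholders; the actual label values stored in the dataset are not documented in this card.

```python
from collections import Counter


def tally_winners(rows, column="Winning Model"):
    """Count how often each label appears in the given column."""
    return Counter(row[column] for row in rows)


# Hypothetical rows that mimic the dataset schema; label strings are placeholders.
sample_rows = [
    {"Winning Model": "model_b", "Judgement": "placeholder judgement text"},
    {"Winning Model": "model_a", "Judgement": "placeholder judgement text"},
    {"Winning Model": "model_b", "Judgement": "placeholder judgement text"},
    {"Winning Model": "tie", "Judgement": "placeholder judgement text"},
]

winner_counts = tally_winners(sample_rows)
for label, count in winner_counts.most_common():
    print(f"{label}: {count}")
```

Because iterating over a `datasets.Dataset` yields one dictionary per row, the same function should work on the loaded `train` split, for example `tally_winners(dataset["train"])`.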
## Unit Test Evaluation Results
The results are summarized in the table below:
| Outcome | Count |
|---------------------------------|-------|
| Llama3.1-8b Instruct finetuned | 105 |
| Llama3.1-8b Instruct | 87 |
| Tie | 9 |
### Explanation
1. Llama3.1-8b Instruct finetuned Wins: The finetuned Llama3.1-8b Instruct model aligned more closely with the ground truth in 105 cases.
2. Llama3.1-8b Instruct Wins: The base Llama3.1-8b Instruct model aligned more closely with the ground truth in 87 cases.
3. Tie: In 9 cases the results were tied between the two models.
### Win Rates
- Llama3.1-8b Instruct finetuned Win Percentage: 52.2%
- Llama3.1-8b Instruct Win Percentage: 43.3%
- Tie Percentage: 4.5%
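
The percentages above follow directly from the counts in the results table. A short sketch of the arithmetic, using only the numbers reported in this card:

```python
# Outcome counts taken from the results table.
counts = {
    "Llama3.1-8b Instruct finetuned": 105,
    "Llama3.1-8b Instruct": 87,
    "Tie": 9,
}

total = sum(counts.values())

# Percentage of all evaluated examples, rounded to one decimal place.
win_rates = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(f"total examples: {total}")
for name, rate in win_rates.items():
    print(f"{name}: {rate}%")
```

The total matches the 201 rows in the `train` split, so every example is accounted for by exactly one outcome.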
### Framework to generate unit tests
<img src="https://cdn-uploads.huggingface.co/production/uploads/6658bb3acf5fc31e3a0bd24a/nFUDNtFeAukk_qLZL24F6.png" alt="image/png" width="600" height="400"/>
### Evaluation Approach
[gpt-4o-mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) was used as the judge to evaluate which unit test was closer to the ground truth provided by the benchmark dataset. The evaluation highlights the performance difference between the two models and indicates that the finetuned Llama3.1 model aligns more closely with the benchmark unit tests (105 wins versus 87, with 9 ties).
Prompt used for evaluation: [Evaluation Prompt](https://huggingface.co/datasets/Nutanix/cpp_unittests_llama8b_vs_llama70b_judge_llama70/blob/main/config_evaluator.yaml)
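
For readers unfamiliar with pairwise LLM judging, the following is a rough sketch of the general pattern: build a prompt that shows the reference test alongside both candidates, then map the judge's reply to an outcome. The prompt wording, the `A`/`B`/`Tie` verdict format, and both function names are hypothetical. The prompt actually used is the one in the linked config file and is not reproduced here, and no API call is made in this sketch.

```python
# Hypothetical verdict vocabulary; the real evaluator may use a different format.
VERDICTS = {"a": "A", "b": "B", "tie": "Tie"}


def build_judge_prompt(reference: str, test_a: str, test_b: str) -> str:
    """Assemble an illustrative pairwise-comparison prompt (not the real one)."""
    return (
        "Compare two candidate C++ unit tests against a reference unit test.\n"
        "Decide which candidate is more aligned with the reference.\n"
        "End your answer with a final line containing exactly one of: A, B, Tie.\n\n"
        f"Reference:\n{reference}\n\n"
        f"Candidate A:\n{test_a}\n\n"
        f"Candidate B:\n{test_b}\n"
    )


def parse_verdict(reply: str) -> str:
    """Read the verdict from the last line of the judge's reply."""
    lines = reply.strip().splitlines()
    last = lines[-1].strip().strip(".").lower() if lines else ""
    return VERDICTS.get(last, "Unparsed")


prompt = build_judge_prompt("TEST(Ref, Ok) {}", "TEST(A, Ok) {}", "TEST(B, Ok) {}")
print(prompt)
print(parse_verdict("Candidate A covers more of the reference cases.\nA"))
```

Returning an explicit `Unparsed` value, rather than guessing, keeps malformed judge replies visible so they can be reviewed or re-run.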
Provider:
Nutanix



