Nutanix/cpp_unit_tests_processed_tinyllama_vs_tinyllama_finetuned_gpt_judge
Source: Hugging Face. Updated 2024-07-30; indexed 2025-04-08.
Download link:
https://hf-mirror.com/datasets/Nutanix/cpp_unit_tests_processed_tinyllama_vs_tinyllama_finetuned_gpt_judge
Resource overview:
---
dataset_info:
  features:
  - name: Code
    dtype: string
  - name: Unit Test_tinyllama
    dtype: string
  - name: Unit Test_tinyllama_finetuned
    dtype: string
  - name: Unit Test
    dtype: string
  - name: Winning Model
    dtype: string
  - name: Judgement
    dtype: string
  splits:
  - name: train
    num_bytes: 1598198
    num_examples: 212
  download_size: 495600
  dataset_size: 1598198
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Unit Test Evaluation Results
This repository reports an evaluation of unit tests generated by two models, TinyLlama and a finetuned TinyLlama, against the [ground-truth data](https://huggingface.co/datasets/Nutanix/cpp_unit_tests_processed_data). gpt-4o-mini served as the judge, assessing how closely the unit tests from each model aligned with the ground truth.
## Models Used
### [TinyLLaMA](https://huggingface.co/Doctor-Shotgun/TinyLlama-1.1B-32k-Instruct)
- **HuggingFace Link**: [TinyLlama-1.1B-32k-Instruct](https://huggingface.co/Doctor-Shotgun/TinyLlama-1.1B-32k-Instruct)
- **Precision**: BF16
- **Description**: Base instruct model used to generate the unit tests.
### TinyLLaMA - finetuned
- **HuggingFace Link**: [Finetuned LoRA adapter](https://huggingface.co/Nutanix/TinyLlama-1.1B-32k-Instruct_cppunittestprocessed_lora_16_alpha_16)
- **Finetune Settings**: LoRA rank = 8, alpha = 16; the TinyLlama model was finetuned for 2 epochs on [this](https://huggingface.co/datasets/Nutanix/cpp_unit_tests_processed_data_chat_format) dataset.
- **Description**: A finetuned model whose unit tests were compared against those generated by the base model.
## Dataset
The evaluation used the `val` split of the [processed cpp benchmarking dataset](https://huggingface.co/datasets/Nutanix/cpp_unit_tests_processed_data) as the ground truth.
### Dataset Structure
The dataset can be loaded as follows; the expected structure is shown in the trailing comment:
```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Nutanix/cpp_unit_tests_processed_tinyllama_vs_tinyllama_finetuned_gpt_judge")

# View dataset structure
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['Code', 'Unit Test_tinyllama', 'Unit Test_tinyllama_finetuned', 'Unit Test', 'Winning Model', 'Judgement'],
#         num_rows: 212
#     })
# })
```
## Features:
- **Code**: The source code for which the unit tests are written.
- **Unit Test_tinyllama**: Unit test generated by the TinyLlama 1.1B model.
- **Unit Test_tinyllama_finetuned**: Unit test generated by the finetuned TinyLlama 1.1B model.
- **Unit Test**: The benchmark or ground truth unit test.
- **Winning Model**: The model whose unit test is closer to the ground truth.
- **Judgement**: The evaluation results comparing the unit tests.
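As a rough illustration of how the `Winning Model` column could be tallied, the sketch below counts labels over a list of row dictionaries. The helper name `tally_winners` and the label strings in `sample_rows` are assumptions made for this example; the actual strings stored in the column are not documented here and should be inspected before relying on them.

```python
from collections import Counter


def tally_winners(rows, column="Winning Model"):
    """Count how often each label appears in the given column."""
    return Counter(row[column] for row in rows)


# Hypothetical rows: the label strings below are assumed, not taken
# from the dataset.
sample_rows = [
    {"Winning Model": "tinyllama_finetuned"},
    {"Winning Model": "tinyllama"},
    {"Winning Model": "tinyllama_finetuned"},
    {"Winning Model": "tie"},
]

tally = tally_winners(sample_rows)
print(tally.most_common())
```

Iterating over a `datasets.Dataset` yields one dictionary per row, so passing `dataset["train"]` in place of `sample_rows` should work without changes.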
## Unit Test Evaluation Results
The results are summarized in the table below:
| Outcome | Count |
|--------------------------|-------|
| Tinyllama1.1B finetuned | 165 |
| Tinyllama1.1B | 36 |
| Tie | 11 |
### Explanation
1. Tinyllama1.1B finetuned wins: The finetuned Tinyllama1.1B model aligned more closely with the ground truth in 165 cases.
2. Tinyllama1.1B wins: The base Tinyllama1.1B model aligned more closely with the ground truth in 36 cases.
3. Tie: In 11 cases the judge rated the two models as tied.
### Win Rates
- Tinyllama1.1B finetuned Win Percentage: 77.8%
- Tinyllama1.1B Win Percentage: 17.0%
- Tie Percentage: 5.2%
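The percentages follow directly from the counts in the results table, divided by the 212 evaluated examples. A minimal sketch of that arithmetic is below; the function name `win_rates` is chosen for this example only.

```python
# Counts copied from the results table.
counts = {
    "Tinyllama1.1B finetuned": 165,
    "Tinyllama1.1B": 36,
    "Tie": 11,
}


def win_rates(counts):
    """Return each outcome's share of the total as a percentage (1 decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total = sum(counts.values())
rates = win_rates(counts)

print(f"Total examples: {total}")
for name, pct in rates.items():
    print(f"{name}: {pct}%")
```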
### Framework to generate unit tests
<img src="https://cdn-uploads.huggingface.co/production/uploads/6658bb3acf5fc31e3a0bd24a/nFUDNtFeAukk_qLZL24F6.png" alt="image/png" width="600" height="400"/>
### Evaluation Approach
[gpt-4o-mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) was used as the judge to decide which unit test was closer to the ground truth provided by the benchmark dataset. The evaluation highlights the performance difference between the two models and indicates that the finetuned TinyLlama model aligns more closely with the benchmark unit tests.
Prompt used for evaluation: [Evaluation Prompt](https://huggingface.co/datasets/Nutanix/cpp_unittests_llama8b_vs_llama70b_judge_llama70/blob/main/config_evaluator.yaml)
Provider:
Nutanix



