renataaraujoe/Bilingual-LLM-Eval-106
Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/renataaraujoe/Bilingual-LLM-Eval-106
Resource description:
---
license: mit
---
# Bilingual-LLM-Eval-106
## 📌 Overview
**Bilingual-LLM-Eval-106** is a curated, human-annotated evaluation dataset of 106 LLM response pairs in **English and Portuguese**, designed to benchmark model performance across multiple quality dimensions.
The dataset focuses on realistic and challenging evaluation scenarios, including adversarial prompts, ambiguous queries, and hard negatives.
It is intended for:
- LLM evaluation and benchmarking
- LLM-as-a-Judge research
- Multilingual robustness analysis
- Annotation quality research
---
## 🎯 Objectives
This dataset was built to:
- Design and apply a **professional annotation rubric**
- Collect diverse prompt types:
  - Factual
  - Reasoning
  - Adversarial
  - Ambiguous
- Annotate responses using **multi-dimensional quality labels**
- Ensure **bilingual parity (EN + PT)** with equivalent difficulty
- Validate annotation consistency using:
  - Cohen’s Kappa
  - Bilingual calibration
- Study **LLM-as-a-Judge biases** through controlled experiments
---
## 📊 Dataset Structure
The dataset consists of annotated CSV files:
- `annotations_EN_batch1.csv` → English samples
- `annotations_PT_batch1.csv` → Portuguese samples
- `annotations_mirrored_batch.csv` → Cross-lingual mirrored examples
- `annotations_edge_cases.csv` → Adversarial and difficult cases
Each row represents a **prompt + response pair with annotations**.
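
As a minimal sketch of reading these files, the snippet below uses Python's standard `csv` module. It assumes the four CSVs have been downloaded into a local directory and are UTF-8 encoded; the `load_annotations` function and the `source_file` key it adds are hypothetical helpers, not part of the dataset.

```python
import csv
from pathlib import Path

# File names as listed above; the local directory layout is an assumption.
FILES = [
    "annotations_EN_batch1.csv",
    "annotations_PT_batch1.csv",
    "annotations_mirrored_batch.csv",
    "annotations_edge_cases.csv",
]


def load_annotations(data_dir):
    """Read every listed CSV found in data_dir into a list of row dicts."""
    rows = []
    for name in FILES:
        path = Path(data_dir) / name
        if not path.exists():
            continue  # skip files that were not downloaded
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Hypothetical extra key so each row keeps its provenance.
                row["source_file"] = name
                rows.append(row)
    return rows
```

Calling `load_annotations("data")` would return one dictionary per annotated row, keyed by whatever column names the CSV headers actually contain.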
---
## 🧾 Annotation Dimensions
Each response is evaluated across multiple dimensions:
- **Faithfulness** → factual correctness and grounding
- **Relevance** → alignment with the prompt
- **Fluency** → linguistic quality
- **Completeness** → coverage of required information
- **Safety** → harmful or risky content
Labels follow a structured rubric inspired by industry standards used in:
- Scale AI
- Anthropic
- DataAnnotation
---
## 🌍 Languages
- English (EN)
- Portuguese (PT)
The dataset includes:
- Independent annotations per language
- Mirrored examples for cross-lingual consistency analysis
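
One possible way to run a cross-lingual consistency check on the mirrored examples is sketched below. It assumes each mirrored row carries a pair identifier (the `pair_id` column name is hypothetical and may not match the real files) and a `language` value of `EN` or `PT`; labels are compared as raw strings.

```python
from collections import defaultdict

# Annotation dimensions named in this card.
DIMENSIONS = ["faithfulness", "relevance", "fluency", "completeness", "safety"]


def cross_lingual_agreement(rows, pair_key="pair_id"):
    """Share of mirrored pairs whose EN and PT labels match, per dimension."""
    pairs = defaultdict(dict)
    for row in rows:
        # Group rows by the (assumed) pair identifier, then by language.
        pairs[row[pair_key]][row["language"].upper()] = row
    # Keep only pairs that have both an English and a Portuguese row.
    complete = [p for p in pairs.values() if "EN" in p and "PT" in p]
    if not complete:
        return {}
    return {
        dim: sum(p["EN"][dim] == p["PT"][dim] for p in complete) / len(complete)
        for dim in DIMENSIONS
    }
```

A value of 1.0 for a dimension means every mirrored pair received the same label in both languages; lower values point at dimensions where the two language versions were judged differently.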
---
## 🧪 Research Use Cases
This dataset enables:
### 🔍 LLM Evaluation
Benchmark models across multiple qualitative dimensions beyond accuracy.
### ⚖️ LLM-as-a-Judge Analysis
Study bias, inconsistency, and failure modes in automated evaluation systems.
### 🌐 Multilingual Testing
Compare performance across English and Portuguese under equivalent conditions.
### 🎯 Robustness Testing
Evaluate models on:
- Edge cases
- Adversarial prompts
- Ambiguous inputs
---
## 📏 Annotation Quality
- Annotated following a **custom-built professional rubric**
- Includes **hard negatives and adversarial cases**
- Validated with:
  - Inter-annotator agreement (Cohen’s Kappa)
  - Cross-lingual calibration
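
For readers unfamiliar with the agreement statistic, Cohen's Kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The function below is a plain-Python sketch of that formula, not the author's validation script; the function name and the example labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

The statistic is computed per dimension, for example by passing the two annotators' `faithfulness` labels for the same rows.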
---
## 📁 Data Format
Typical columns may include:
- `prompt`
- `response`
- `language`
- `faithfulness`
- `relevance`
- `fluency`
- `completeness`
- `safety`
- `notes` (optional)
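
Because the column list is described as typical rather than guaranteed, a quick header check before analysis can help. The sketch below assumes the column names above are used in lowercase; `missing_columns` is a hypothetical helper.

```python
# Columns listed in this card; `notes` is optional and therefore not required.
EXPECTED_COLUMNS = {
    "prompt",
    "response",
    "language",
    "faithfulness",
    "relevance",
    "fluency",
    "completeness",
    "safety",
}


def missing_columns(header):
    """Return the expected columns absent from a CSV header, sorted by name."""
    # Normalise whitespace and case before comparing.
    present = {name.strip().lower() for name in header}
    return sorted(EXPECTED_COLUMNS - present)
```

An empty list means every expected column is present; anything else names the columns a given file lacks.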
---
## ⚠️ Limitations
- Dataset size is intentionally small (106 samples) for **high-quality evaluation**, not training
- Domain coverage is diverse but not exhaustive
- Some annotations may include subjective judgment despite calibration
---
## 🤝 Contributions
This dataset was created as an independent research and engineering project focused on **LLM evaluation quality and methodology**.
---
## ⭐ Citation
    @dataset{bilingual_llm_eval_106_2026,
      author = {Renata de Araujo},
      title = {Bilingual-LLM-Eval-106: A Human-Annotated Benchmark for English–Portuguese LLM Evaluation},
      year = {2026},
      publisher = {renataaraujoe},
      howpublished = {\url{https://huggingface.co/datasets/renataaraujoe/Bilingual-LLM-Eval-106}},
      note = {Version 1.0}
    }
Provided by:
renataaraujoe



