renataaraujoe/Bilingual-LLM-Eval-106

Name: renataaraujoe/Bilingual-LLM-Eval-106
Creator: renataaraujoe
Published: 2026-03-25 13:04:52
License: not listed on this page (the dataset card declares `mit`)

Hugging Face | Updated 2026-03-25 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/renataaraujoe/Bilingual-LLM-Eval-106

Description:

---
license: mit
---

# Bilingual-LLM-Eval-106

## 📌 Overview

**Bilingual-LLM-Eval-106** is a curated, human-annotated evaluation dataset of 106 LLM response pairs in **English and Portuguese**, designed to benchmark model performance across multiple quality dimensions.

The dataset focuses on realistic and challenging evaluation scenarios, including adversarial prompts, ambiguous queries, and hard negatives.

It is intended for:

- LLM evaluation and benchmarking
- LLM-as-a-Judge research
- Multilingual robustness analysis
- Annotation quality research

---

## 🎯 Objectives

This dataset was built to:

- Design and apply a **professional annotation rubric**
- Collect diverse prompt types:
  - Factual
  - Reasoning
  - Adversarial
  - Ambiguous
- Annotate responses using **multi-dimensional quality labels**
- Ensure **bilingual parity (EN + PT)** with equivalent difficulty
- Validate annotation consistency using:
  - Cohen’s Kappa
  - Bilingual calibration
- Study **LLM-as-a-Judge biases** through controlled experiments

---

## 📊 Dataset Structure

The dataset consists of annotated CSV files:

- `annotations_EN_batch1.csv` → English samples
- `annotations_PT_batch1.csv` → Portuguese samples
- `annotations_mirrored_batch.csv` → Cross-lingual mirrored examples
- `annotations_edge_cases.csv` → Adversarial and difficult cases

Each row represents a **prompt + response pair with annotations**.
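As a rough illustration of how the four CSV files listed above might be read together, here is a minimal sketch using only the Python standard library. The file names come from the dataset card; the local directory, the `load_annotations` helper, and the added `source_file` key are assumptions made for this example and are not part of the dataset itself.

```python
import csv
from pathlib import Path

# File names as listed in the dataset card. Where they live on disk is an
# assumption: pass whatever directory you downloaded them into.
ANNOTATION_FILES = [
    "annotations_EN_batch1.csv",
    "annotations_PT_batch1.csv",
    "annotations_mirrored_batch.csv",
    "annotations_edge_cases.csv",
]


def load_annotations(data_dir):
    """Read every annotation file found in data_dir into one list of dicts.

    Each row gets a hypothetical `source_file` key so the file it came from
    is still known after merging. Files that are missing are skipped.
    """
    rows = []
    for name in ANNOTATION_FILES:
        path = Path(data_dir) / name
        if not path.exists():
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = name
                rows.append(row)
    return rows
```

Calling `load_annotations("path/to/download")` would return one dictionary per prompt + response pair, keyed by whatever column names the CSV headers actually contain.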
---

## 🧾 Annotation Dimensions

Each response is evaluated across multiple dimensions:

- **Faithfulness** → factual correctness and grounding
- **Relevance** → alignment with the prompt
- **Fluency** → linguistic quality
- **Completeness** → coverage of required information
- **Safety** → harmful or risky content

Labels follow a structured rubric inspired by industry standards used in:

- Scale AI
- Anthropic
- DataAnnotation

---

## 🌍 Languages

- English (EN)
- Portuguese (PT)

The dataset includes:

- Independent annotations per language
- Mirrored examples for cross-lingual consistency analysis

---

## 🧪 Research Use Cases

This dataset enables:

### 🔍 LLM Evaluation

Benchmark models across multiple qualitative dimensions beyond accuracy.

### ⚖️ LLM-as-a-Judge Analysis

Study bias, inconsistency, and failure modes in automated evaluation systems.

### 🌐 Multilingual Testing

Compare performance across English and Portuguese under equivalent conditions.

### 🎯 Robustness Testing

Evaluate models on:

- Edge cases
- Adversarial prompts
- Ambiguous inputs

---

## 📏 Annotation Quality

- Annotated following a **custom-built professional rubric**
- Includes **hard negatives and adversarial cases**
- Validated with:
  - Inter-annotator agreement (Cohen’s Kappa)
  - Cross-lingual calibration

---

## 📁 Data Format

Typical columns may include:

- `prompt`
- `response`
- `language`
- `faithfulness`
- `relevance`
- `fluency`
- `completeness`
- `safety`
- `notes` (optional)

---

## ⚠️ Limitations

- Dataset size is intentionally small (106 samples) for **high-quality evaluation**, not training
- Domain coverage is diverse but not exhaustive
- Some annotations may include subjective judgment despite calibration

## 🤝 Contributions

This dataset was created as an independent research and engineering project focused on **LLM evaluation quality and methodology**.
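The Annotation Quality section names Cohen’s Kappa as the inter-annotator agreement check. The card does not say how it was computed or what label scale was used, so the following is only a sketch of the standard two-annotator formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    labels_a and labels_b are equal-length sequences of categorical labels.
    The label values are illustrative; the dataset card does not specify
    its label scale.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, multiply the two annotators'
    # marginal proportions and sum over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For instance, with annotator A giving `["pass", "pass", "fail", "fail"]` and annotator B giving `["pass", "fail", "fail", "fail"]`, the observed agreement is 0.75, the chance agreement is 0.5, and the function returns 0.5. In practice this would be run once per annotation dimension, such as `faithfulness` or `safety`.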
---

## ⭐ Citation

    @dataset{bilingual_llm_eval_106_2026,
      author = {Renata de Araujo},
      title = {Bilingual-LLM-Eval-106: A Human-Annotated Benchmark for English–Portuguese LLM Evaluation},
      year = {2026},
      publisher = {renataaraujoe},
      howpublished = {\url{https://huggingface.co/datasets/renataaraujoe/Bilingual-LLM-Eval-106}},
      note = {Version 1.0}
    }

Provider:

renataaraujoe
