
Mira-Network/ensemble-validation

Hugging Face · Updated 2024-11-12 · Indexed 2025-04-26
Download link:
https://hf-mirror.com/datasets/Mira-Network/ensemble-validation
Resource description:
## Ensemble Evaluation Data

#### Dataset Summary

The Learnrite Evaluation Data is a question bank for evaluating AI models on complex, real-world questions derived from India's Civil Services examination, widely regarded as one of the toughest competitive exams in the world. The dataset consists of multiple-choice questions (MCQs) covering topics such as the Indian Constitution, governance, and administrative functions. The depth and nuance of the content, together with the internal causal consistency that many questions require, make it a particularly challenging benchmark.

#### Key Features

- High Difficulty Level: The questions are modeled after India's Civil Services exam, known for its rigor and the depth of knowledge it requires, which makes the dataset a demanding benchmark for advanced AI models.
- Internal Causal Consistency: Many questions involve logical reasoning and require understanding internal causal relationships, so they are difficult to solve with simple pattern recognition. This tests a model's ability to reason rather than rely on surface-level matching.
- Benchmark Performance: Claude 3.5 Sonnet achieved a score of 73.1%, indicating that even state-of-the-art models find this benchmark challenging.
- AI Generated: The dataset was generated using Claude 3.5 Sonnet.

#### Dataset Structure

The dataset includes the following columns:

- question_id: A unique identifier for each question (e.g., 0001, 0002).
- question_text: The full text of the question.
- question_answer_options: The full text of the multiple-choice answer options.
- expected_correct_answer: The correct answer choice (e.g., A, B, C, D).
- ground_truth: The correct answer choice (e.g., A, B, C, D, or INVALID).
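The column layout described under Dataset Structure suggests a simple way to score a model's answers. The following is a minimal sketch, not the dataset authors' evaluation method. The helper `score_predictions` is hypothetical, the sample rows are made up for illustration rather than taken from the dataset, and two assumptions are built in: `question_id` and `ground_truth` are stored as strings, and rows whose `ground_truth` is `INVALID` are excluded from scoring.

```python
from typing import Iterable, Mapping, Tuple

# Answer letters treated as scoreable; anything else (e.g. INVALID) is skipped.
VALID_CHOICES = {"A", "B", "C", "D"}


def score_predictions(
    rows: Iterable[Mapping[str, str]],
    predictions: Mapping[str, str],
) -> Tuple[int, int]:
    """Return (correct, scored) for predictions keyed by question_id.

    Assumption: rows whose ground_truth is not A-D are left out of the count.
    """
    correct = 0
    scored = 0
    for row in rows:
        truth = row["ground_truth"].strip().upper()
        if truth not in VALID_CHOICES:
            continue  # skip rows marked INVALID (or otherwise unscoreable)
        scored += 1
        predicted = predictions.get(row["question_id"], "").strip().upper()
        if predicted == truth:
            correct += 1
    return correct, scored


# Made-up rows that mimic the documented columns (not real dataset content).
sample_rows = [
    {"question_id": "0001", "ground_truth": "A"},
    {"question_id": "0002", "ground_truth": "C"},
    {"question_id": "0003", "ground_truth": "INVALID"},
]
# Hypothetical model outputs keyed by question_id.
sample_predictions = {"0001": "A", "0002": "B", "0003": "D"}

correct, scored = score_predictions(sample_rows, sample_predictions)
print(f"{correct}/{scored} correct")
```

Whether `INVALID` rows should be dropped or counted as errors is a design choice the dataset card does not settle; dropping them keeps the accuracy denominator limited to questions that have a usable answer key.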
#### Intended Use

This dataset is particularly suitable for:

- Model Evaluation: Assessing the performance of language models on complex, domain-specific knowledge tasks.
- Benchmarking: Providing a challenging test for AI systems aimed at improving output accuracy.

#### Why This Benchmark Matters

This dataset is not only challenging but also practically relevant. Its questions resemble what users might ask of AI in real-world educational and analytical tasks, such as exam preparation, legal analysis, and understanding complex policy matters.

#### Licensing

CC BY 4.0.

#### Usage Example

```python
from datasets import load_dataset

# Load the dataset (replace the placeholder with the dataset's repository ID)
dataset = load_dataset("your-username/learnrite-evaluation-data")

# Display the first few examples; a Dataset has no .head(), so convert to pandas first
print(dataset["train"].to_pandas().head())
```

#### Dataset Size

- Number of Questions: 78 entries.
- File Size: Approximately 41 KB.

#### Limitations

- The dataset focuses on questions about Indian governance and constitutional topics, which may limit its generalizability to broader domains.
- The MCQ format may not capture the full complexity of open-ended reasoning tasks, but it still provides a robust test of logical and factual understanding.
Provider:
Mira-Network