MCINext/PerCoR
Hugging Face · Updated 2025-10-26 · Indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/MCINext/PerCoR
Resource description:
---
license: apache-2.0
task_categories:
- question-answering
- text-classification
language:
- fa
size_categories:
- 100K<n<1M
---
# 📘 PerCoR: Persian Commonsense Reasoning (Multiple-Choice Sentence Completion)
**PerCoR** is a large-scale Persian benchmark for **commonsense reasoning** in a **4-choice sentence-completion** format.
It contains **~106K** examples from **40+** Persian websites across news, culture, lifestyle, tech, religion, travel, and more.
Each instance provides a **prefix** (context) and **four candidate completions**: one correct and three distractors.

---
## 📦 What’s inside
- 🧮 **Total size:** ~106K multiple-choice instances
- 📊 **Splits:** `train` 86,217 • `validation` 10,000 • `test` 10,000
- 🧱 **Format:** single passage/prefix + 4 completions (A–D / 0–3) with one correct answer
- 🧠 **Human accuracy:** ~89% on a random subset
> 💡 *The dataset is designed to be difficult for LLMs while remaining answerable by humans; no LLM text is used to generate distractors (reducing generation-style biases).*
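
The format described above (one prefix, four completions, a single correct answer labeled either A–D or 0–3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dataset's actual schema or an official evaluation script: the `Item` class, its field names (`prefix`, `completions`, `label`), and the helpers `to_index` and `accuracy` are all hypothetical, and the two toy items are invented English placeholders rather than PerCoR data.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union

LETTERS = "ABCD"  # the card lists both A-D and 0-3 as ways to label the four options


@dataclass(frozen=True)
class Item:
    """One 4-choice instance. Field names are hypothetical, not the real schema."""
    prefix: str
    completions: Tuple[str, str, str, str]
    label: int  # index 0-3 of the correct completion


def to_index(label: Union[int, str]) -> int:
    """Normalize a label given as 0-3 or A-D to an integer index in 0-3."""
    if isinstance(label, str):
        label = label.strip().upper()
        if len(label) == 1 and label in LETTERS:
            return LETTERS.index(label)
        label = int(label)  # raises ValueError for anything that is not a number
    if not 0 <= label <= 3:
        raise ValueError(f"label out of range: {label}")
    return label


def accuracy(items: Sequence[Item], predictions: Sequence[Union[int, str]]) -> float:
    """Fraction of items whose predicted option matches the gold label."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        raise ValueError("no items to score")
    correct = sum(to_index(p) == it.label for it, p in zip(items, predictions))
    return correct / len(items)


# Invented placeholder items (NOT taken from PerCoR), used only to exercise the code.
toy = [
    Item("She opened the umbrella because",
         ("it was raining.", "the sun set.", "the bus left.", "the phone rang."), 0),
    Item("He put the kettle on to",
         ("paint the wall.", "make tea.", "fix the car.", "read a map."), 1),
]

print(accuracy(toy, ["A", 1]))    # 1.0
print(accuracy(toy, ["A", "C"]))  # 0.5
```

With four options per item, uniform random guessing scores 25% in expectation, which is the natural floor against which the reported ~89% human accuracy can be read.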
Provider:
MCINext



