DataOrigin/human-feedback-mentor-sessions-india
Source: Hugging Face | Updated 2026-04-06 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/DataOrigin/human-feedback-mentor-sessions-india
Resource description:
---
license: other
task_categories:
- audio-classification
- automatic-speech-recognition
- text-generation
language:
- en
- hi
- bn
- ta
- te
- ml
- mr
- or
- as
- pa
tags:
- education
- rlhf
- human-feedback
- multilingual
- mentor-interaction
- indic
- indic-languages
- upsc
- government-exams
- long-form-audio
pretty_name: Human Feedback Mentor Sessions India
size_categories:
- 1K<n<10K
---
# Human Feedback Mentor Sessions India
## Dataset Description
This dataset is a rare, high-value collection of recorded aspirant-mentor
interactions capturing real guidance, feedback, and reasoning corrections in
the context of Indian government exam preparation. It was produced by Prepp,
India's largest government exam preparation platform, operated by
Collegedunia Web Private Limited.
This dataset captures something almost entirely absent from public AI
training data: real human expert feedback on student reasoning — not
synthetic, not scripted, not annotated after the fact. These are live
mentoring conversations where experts identify misconceptions, correct
reasoning errors, and guide aspirants toward better thinking in real time.
## Dataset Summary
- **Total duration:** 4,500 hours of recorded mentor-aspirant sessions
- **Content type:** Real aspirant-mentor interactions — guidance, feedback,
and reasoning corrections
- **Languages:** English, Hindi, Marathi, Tamil
- **Format:** Audio recordings of structured mentoring conversations
- **Domain:** UPSC Civil Services, State PSC, SSC, and government
competitive exam preparation
- **Interaction type:** One-on-one and small group mentoring sessions
capturing natural pedagogical dialogue
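For planning purposes, the headline figures above can be turned into rough estimates. The sketch below is a back-of-envelope calculation, not a statement about how the recordings are actually stored. The 16 kHz, 16-bit mono PCM format is an assumption (this card only says "audio recordings"), and reading the `size_categories` value `1K<n<10K` from the card metadata as a session count for the full dataset is also an assumption. The function names are illustrative.

```python
TOTAL_HOURS = 4500  # total duration stated in this card


def pcm_size_gb(hours: float,
                sample_rate_hz: int = 16_000,   # assumed, not stated in the card
                bytes_per_sample: int = 2,      # assumed 16-bit samples
                channels: int = 1) -> float:    # assumed mono
    """Uncompressed PCM size in decimal gigabytes for the given duration."""
    seconds = hours * 3600
    total_bytes = seconds * sample_rate_hz * bytes_per_sample * channels
    return total_bytes / 1e9


def mean_session_hours(total_hours: float, n_sessions: int) -> float:
    """Average session length implied by a total duration and a session count."""
    if n_sessions <= 0:
        raise ValueError("n_sessions must be positive")
    return total_hours / n_sessions


if __name__ == "__main__":
    print(f"Uncompressed size: {pcm_size_gb(TOTAL_HOURS):.1f} GB")
    # Bounds of the 1K<n<10K bucket, read here as a session count (assumption).
    for n in (1_000, 10_000):
        print(f"{n} sessions -> mean length {mean_session_hours(TOTAL_HOURS, n):.2f} h")
```

Under these assumptions the uncompressed audio would come to roughly 518 GB, and the mean session length would fall between 0.45 and 4.5 hours. Compressed formats would be considerably smaller.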
## Sample Data
Three sample sessions are available in this repository:
- Sample 1: English medium — UPSC General Studies reasoning correction
- Sample 2: Marathi medium — State PSC conceptual guidance session
- Sample 3: Tamil medium — Aspirant feedback and answer improvement session
## Key Features
- **Genuine RLHF signal:** The recordings capture an expert correcting a
learner's reasoning in real time, the kind of human feedback that
Reinforcement Learning from Human Feedback builds on. Unlike synthetic RLHF
datasets, these are authentic expert-novice interactions with natural
language feedback on reasoning quality.
- **Reasoning correction data:** Mentors explicitly identify where student
thinking goes wrong and model correct reasoning — a high-value signal
for training models to reason better and self-correct.
- **Rare multilingual RLHF:** Human feedback data in Marathi and Tamil
is extraordinarily scarce globally. This dataset contains thousands of
hours of expert feedback in genuinely low-resource languages.
- **Domain expertise:** Mentors are qualified UPSC and government exam
subject matter experts — the feedback reflects deep domain knowledge,
not generic tutoring.
- **Natural conversational structure:** Unlike scripted educational content,
these sessions capture natural back-and-forth dialogue — valuable for
conversational AI and dialogue model training.
- **Long-form interactions:** Sessions range from 30 minutes to several
hours — providing extended context for long-form audio understanding
models.
## Why This Dataset Is Exceptionally Rare
Most RLHF training data is either synthetic, crowdsourced from platforms
like MTurk, or generated by annotators who are not domain experts. This
dataset offers something fundamentally different: thousands of hours of
real expert-novice dialogue where a genuine subject matter expert provides
structured feedback on a student's reasoning, argument construction, and
conceptual understanding — in four languages including two that are
extremely low-resource for AI training.
The combination of authentic human feedback, domain expertise, multilingual
coverage, and volume makes this one of the most distinctive educational AI
training assets available from India.
## Intended Uses
- Reinforcement Learning from Human Feedback (RLHF) model training
- Reasoning correction and self-improvement model development
- Multilingual dialogue and conversational AI training
- Indic language ASR model training — especially Marathi and Tamil
- Educational AI and intelligent tutoring system development
- Expert feedback generation model fine-tuning
- Long-form audio understanding in low-resource Indian languages
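
ASR training pipelines commonly work with short clips, so sessions that run for 30 minutes or longer are usually cut into windows first. The following is a minimal sketch of one common approach, fixed-length windows with overlap. It is not the curators' preprocessing; the function name `window_spans` and the default 30-second window with 5-second overlap are illustrative assumptions.

```python
from typing import List, Tuple


def window_spans(duration_s: float,
                 window_s: float = 30.0,
                 overlap_s: float = 5.0) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering a recording.

    Consecutive windows share `overlap_s` seconds; the last window is
    truncated at the end of the recording.
    """
    if window_s <= 0 or not (0 <= overlap_s < window_s):
        raise ValueError("need window_s > 0 and 0 <= overlap_s < window_s")
    if duration_s <= 0:
        return []

    step = window_s - overlap_s
    spans: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break  # the recording is fully covered
        start += step
    return spans


if __name__ == "__main__":
    # A 30-minute session, the shortest length mentioned in this card.
    spans = window_spans(30 * 60)
    print(len(spans), spans[0], spans[-1])
```

The overlap keeps words that straddle a window boundary intact in at least one window. For conversational data like this, cutting at silences or speaker turns may work better than fixed windows, but that requires voice activity detection or diarisation, which this sketch does not attempt.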
## Privacy and Ethics
All recordings were collected with explicit consent from participating
aspirants and mentors. All personally identifiable information has been
removed or anonymised prior to any licensing. Data handling complies
with applicable Indian data protection frameworks.
## Data Collection and Rights
All content is proprietary to Collegedunia Web Private Limited, collected
through the Prepp platform under explicit consent agreements with
participating aspirants and mentors. Full dataset licensing is available
for commercial AI training purposes under negotiated terms.
## Licensing and Commercial Access
This repository contains sample data only. The full dataset of 4,500
hours of mentor-aspirant interaction recordings is available for
commercial AI training licensing.
**For licensing inquiries contact:**
Ankit Dubey — Head of AI Data Partnerships, Collegedunia
ankit.dubey@collegedunia.com
## Dataset Curator
[Collegedunia Web Private Limited](https://collegedunia.com) |
[Prepp](https://prepp.in)
Gurugram, Haryana, India
Provider:
DataOrigin



