DataOrigin/human-feedback-mentor-sessions-india

Name: DataOrigin/human-feedback-mentor-sessions-india
Creator: DataOrigin
Published: 2026-04-06 12:29:24
License: No description available

Hugging Face | Updated 2026-04-06 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/DataOrigin/human-feedback-mentor-sessions-india



Resource description:

---
license: other
task_categories:
- audio-classification
- automatic-speech-recognition
- text-generation
language:
- en
- hi
- bn
- ta
- te
- ml
- mr
- or
- as
- pa
tags:
- education
- rlhf
- human-feedback
- multilingual
- mentor-interaction
- indic
- indic-languages
- upsc
- government-exams
- long-form-audio
pretty_name: Human Feedback Mentor Sessions India
size_categories:
- 1K<n<10K
---

# Human Feedback Mentor Sessions India

## Dataset Description

A rare and high-value collection of recorded aspirant-mentor interactions capturing real guidance, feedback, and reasoning corrections in the context of Indian government exam preparation. Produced by Prepp, India's largest government exam preparation platform, operated by Collegedunia Web Private Limited.

This dataset captures something almost entirely absent from public AI training data: real human expert feedback on student reasoning that is not synthetic, not scripted, and not annotated after the fact. These are live mentoring conversations where experts identify misconceptions, correct reasoning errors, and guide aspirants toward better thinking in real time.

## Dataset Summary

- **Total duration:** 4,500 hours of recorded mentor-aspirant sessions
- **Content type:** Real aspirant-mentor interactions: guidance, feedback, and reasoning corrections
- **Languages:** English, Hindi, Marathi, Tamil
- **Format:** Audio recordings of structured mentoring conversations
- **Domain:** UPSC Civil Services, State PSC, SSC, and government competitive exam preparation
- **Interaction type:** One-on-one and small group mentoring sessions capturing natural pedagogical dialogue

## Sample Data

Three sample sessions are available in this repository:

- Sample 1: English medium, UPSC General Studies reasoning correction
- Sample 2: Marathi medium, State PSC conceptual guidance session
- Sample 3: Tamil medium, aspirant feedback and answer improvement session

## Key Features

- **Genuine RLHF signal:** This dataset captures the exact interaction type that powers Reinforcement Learning from Human Feedback: an expert correcting a learner's reasoning in real time. Unlike synthetic RLHF datasets, these are authentic expert-novice interactions with natural language feedback on reasoning quality.
- **Reasoning correction data:** Mentors explicitly identify where student thinking goes wrong and model correct reasoning, a high-value signal for training models to reason better and self-correct.
- **Rare multilingual RLHF:** Human feedback data in Marathi and Tamil is extraordinarily scarce globally. This dataset contains thousands of hours of expert feedback in genuinely low-resource languages.
- **Domain expertise:** Mentors are qualified UPSC and government exam subject matter experts, so the feedback reflects deep domain knowledge, not generic tutoring.
- **Natural conversational structure:** Unlike scripted educational content, these sessions capture natural back-and-forth dialogue, which is valuable for conversational AI and dialogue model training.
- **Long-form interactions:** Sessions range from 30 minutes to several hours, providing extended context for long-form audio understanding models.

## Why This Dataset Is Exceptionally Rare

Most RLHF training data is either synthetic, crowdsourced from platforms like MTurk, or generated by annotators who are not domain experts. This dataset offers something fundamentally different: thousands of hours of real expert-novice dialogue where a genuine subject matter expert provides structured feedback on a student's reasoning, argument construction, and conceptual understanding, in four languages including two that are extremely low-resource for AI training.

The combination of authentic human feedback, domain expertise, multilingual coverage, and volume makes this one of the most distinctive educational AI training assets available from India.

## Intended Uses

- Reinforcement Learning from Human Feedback (RLHF) model training
- Reasoning correction and self-improvement model development
- Multilingual dialogue and conversational AI training
- Indic language ASR model training, especially Marathi and Tamil
- Educational AI and intelligent tutoring system development
- Expert feedback generation model fine-tuning
- Long-form audio understanding in low-resource Indian languages

## Privacy and Ethics

All recordings are collected with explicit consent from participating aspirants and mentors. All personally identifiable information has been removed or anonymised prior to any licensing. Data handling complies with applicable Indian data protection frameworks.

## Data Collection and Rights

All content is proprietary to Collegedunia Web Private Limited, collected through the Prepp platform under explicit consent agreements with participating aspirants and mentors. Full dataset licensing is available for commercial AI training purposes under negotiated terms.

## Licensing and Commercial Access

This repository contains sample data only. The full dataset of 4,500 hours of mentor-aspirant interaction recordings is available for commercial AI training licensing.

**For licensing inquiries contact:**
Ankit Dubey, Head of AI Data Partnerships, Collegedunia
ankit.dubey@collegedunia.com

## Dataset Curator

[Collegedunia Web Private Limited](https://collegedunia.com) | [Prepp](https://prepp.in)
Gurugram, Haryana, India
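
## Example: Fetching the Sample Sessions

The card says the repository holds three sample sessions but does not describe its file layout. The following is a minimal sketch of how the samples might be downloaded and listed, not an official loader. It assumes the `huggingface_hub` package is installed, that the repository is reachable without extra access steps, and that the recordings use one of a few common audio extensions. `AUDIO_EXTENSIONS`, `list_audio_files`, and `download_samples` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Repository ID as shown on the dataset page.
REPO_ID = "DataOrigin/human-feedback-mentor-sessions-india"

# Assumption: the card does not name the audio container format,
# so a handful of common extensions is checked.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a", ".ogg"}


def list_audio_files(root):
    """Return every audio file found under `root`, sorted by path."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


def download_samples(local_dir="mentor_sessions_samples"):
    """Download a snapshot of the sample repository and list its audio files.

    Needs network access and `pip install huggingface_hub`. The repository
    may additionally require accepting terms or supplying an access token.
    """
    # Imported lazily so list_audio_files works without huggingface_hub.
    from huggingface_hub import snapshot_download

    snapshot_path = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        local_dir=local_dir,
    )
    return list_audio_files(snapshot_path)
```

Calling `download_samples()` returns a list of `Path` objects that can be passed to an audio library of your choice. `list_audio_files` can also be pointed at any directory that already contains the recordings.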

Provider:

DataOrigin
