five

IndicVault

收藏
魔搭社区2025-12-04 更新2025-05-10 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/IndicVault
下载链接
链接失效反馈
官方服务:
资源简介:
# Indic Vault — everyday Indian language QA pairs, tuned for chatbots & voice agents. <img src="https://cdn-uploads.huggingface.co/production/uploads/64d36fa5df7be14f0a2e3b7b/2VpwFBjjeSEOIu1GZks6t.png" width="600"/> --- ## 🧾 Overview Indic Vault is a high-quality, instruction-tuned dataset featuring question-answer pairs crafted in the contemporary, everyday language spoken across India in 2025. Unlike traditional datasets that lean heavily on formal or outdated linguistic styles, Indic Vault captures the authentic, colloquial expressions used in daily conversations, making it ideal for building AI assistants that feel genuinely relatable. --- ## 🧠 Key Features * **Conversational Language**: Reflects the way people actually speak in 2025, moving beyond rigid textbook Hindi or English. * **Diverse Topics**: Covers 20 essential categories, including Finance, Health, Technology, Relationships, Home & Living, Food & Cooking, Education, Career, Entertainment, Travel, Sports, Culture, Society, Environment, Science, Law & Government, Business, Agriculture, Beauty & Fashion, and Politics. * **High-Quality QA Pairs**: Each entry is thoughtfully curated to provide meaningful and contextually relevant information. * **Model Compatibility**: Optimized for fine-tuning on open-source models ranging from 4B to 27B parameters. --- ## 💬 Sample Entry ### Hindi **Question**: "आजकल सोशल मीडिया का टीनएजर्स के मेंटल हेल्थ पर क्या असर होता है?" **Answer**: "आजकल सोशल मीडिया टीनएजर्स (किशोरों) की ज़िंदगी का एक बहुत अहम हिस्सा बन गया है। इसका उनके मेंटल हेल्थ पर सकारात्मक और नकारात्मक दोनों तरह के प्रभाव पड़ सकते हैं। यह इस बात पर निर्भर करता है कि वे इसका इस्तेमाल कैसे करते हैं, कितना समय देते हैं और किस तरह की सामग्री से जुड़ते हैं। उदाहरण के लिए, ज़्यादा देर स्क्रीन टाइम या दूसरों से तुलना करने की आदत चिंता और आत्म-सम्मान पर असर डाल सकती है, जबकि सकारात्मक कंटेंट और सपोर्ट ग्रुप्स से मदद मिल सकती है।" ### Hinglish (Romanized Hindi + Telugu inserts) **Question**: "Mujhe apna budget kitni baar check aur update karna chahiye?" **Answer**: "Chal, suno yaar, yeh budget ko kitni baar check karna hai aur update karna hai na, yeh bahut important sawaal hai. Basically, har week ek quick check maaro, aur mahine mein ek baar deep dive karo. Telugu mein bolte hain kadaa — 'vaaram vaaramu parikshinchi sarididdandi', ante weekly review chesi correct cheyyandi. Life mein koi major change aaye toh budget turant update karo, simple!" ### Telugu **Question**: "ఇండియాలో మార్జినలైజ్డ్ కమ్యూనిటీస్ కి హెల్థ్ కేర్ ఆక్సెస్ చేయడానికి ఎలాంటి బ్యారియర్స్ ఉన్నాయి?" **Answer**: "ఇండియాలో మార్జినలైజ్డ్ కమ్యూనిటీస్ హెల్త్ కేర్ అందుకోవడానికి చాలా బ్యారియర్స్ ఉన్నాయి. ముఖ్యంగా డబ్బు లేకపోవడం, హాస్పిటల్ లకు దూరం ఉండటం, మరియు సామాజిక వివక్ష (social discrimination) ఇవన్నీ పెద్ద అవరోధాలు. చాలా మందికి హెల్త్ స్కీమ్స్ గురించి తెలియకపోవడం, లేదా హాస్పిటల్స్ లో సరైన సదుపాయాలు లేకపోవడం కూడా ప్రధాన సమస్యలు." These examples demonstrate how the dataset handles: * **Natural, colloquial tone** — responses are written the way people actually speak in casual, real conversations. * **Code-mixed constructs with native script fidelity** — even when English words are used, they’re written in native script (e.g., फ़ोन, బ్యాటరీ), making it voice-friendly. * **Blended cultural references** — Hinglish examples mix South/North Indian idioms, showing real-life cross-lingual fluency. * **Context-rich responses** — not just short-form answers, but full, thoughtful breakdowns suited for actual deployments. * **Multilingual coverage** — answers are crafted per language nuance, not just translated. These qualities make Indic Vault ideal for training: * Chatbots that sound culturally in-tune and emotionally aware * Voice agents that can read responses naturally with Indic TTS * Assistants that respond with empathy, depth, and real-world fluency * Fine-tuned LLMs that don't require further alignment for tone or format how the dataset handles: * **Natural, colloquial tone** * **Code-mixed constructs with native script fidelity** * **Real-life topics written to match 2025 Indian user expectations** --- ## 🌟 Real-World Use & Benefits Indic Vault isn’t just a dataset — it’s how you make your chatbot or voice assistant actually *work* in India. Most datasets sound robotic, too formal, or disconnected from how users actually talk. Indic Vault fixes that. You can fine-tune LLMs like **Gemma-12B**, **LLaMA 3**, or **Mistral** using Indic Vault to: * Build **chatbots** that handle informal, context-heavy queries with the same ease as human agents * Train **voice agents** that feel culturally grounded and natural when speaking * Create **personal assistants** that can understand common queries like “budget kaise banayein?” or “EMI kaunsa best hai?” * Power **creator tools** that generate content in native scripts, using tone and phrasing that actually resonates Fine-tuning on Indic Vault gives your model a massive boost in user relatability, script fluency, and tone correctness — especially for mixed-language Indian audiences. --- ## 🛠️ Technical Details * **Format**: JSON with fields: `question`, `response` * **Languages**: Currently includes Hindi, Hinglish, and Telugu; with Tamil, Marathi, and Bengali planned. * **Script**: Only native scripts used (e.g., Devanagari, Telugu, Tamil). English borrowings appear in transliterated form, never Latin. * **Size**: Approx. **80,000 QA pairs per language**, actively growing with each update --- ## 📈 Performance Insights While Indic Vault is designed to enable natural, conversational AI, we do not claim exact performance parity with Gemini 2.5 or GPT-4o. However, in internal evaluations, open-source models like **Gemma-12B**, **Mistral-7B**, and **LLaMA 3-8B** fine-tuned on Indic Vault were able to handle code-mixed, context-rich prompts with a level of naturalness and relatability that rivaled closed models in **tone and usability**. The key distinction is not raw benchmark scores, but: * **How relatable the responses sound to Indian users** * **How much script-switching, emotional nuance, and day-to-day phrasing** the model can handle post-finetuning This makes Indic Vault a strong fit for building deployable assistants that "sound right" in real user environments. --- ## 📦 Dataset Updates Currently, Indic Vault includes QA pairs in **Hindi**, **Hinglish**, and **Telugu**, with all responses written in native scripts. More languages such as Tamil, Marathi, and Bengali are planned for upcoming releases. Our goal is to continuously expand coverage across Indian languages, while preserving everyday realism in tone and phrasing. ## 🤝 Contribution & Feedback We welcome contributions and feedback to enhance Indic Vault further. If you're interested in adding more topics, languages, or improving existing entries, please reach out or submit a pull request on our GitHub repository. --- This dataset card provides a clear, concise, and engaging overview of Indic Vault, highlighting its unique value in the conversational AI space. By focusing on real-world language usage, Indic Vault is a practical and deployable resource for anyone building Indian-language assistants in 2025.

# Indic Vault — 面向聊天机器人与语音智能体优化的印度日常语言问答对数据集 <img src="https://cdn-uploads.huggingface.co/production/uploads/64d36fa5df7be14f0a2e3b7b/2VpwFBjjeSEOIu1GZks6t.png" width="600"/> --- ## 🧾 概览 Indic Vault是一款高质量、经过指令微调的数据集,收录了2025年印度各地民众日常使用的当代生活化语言编写的问答对。与以往大量依赖正式或过时语言风格的传统数据集不同,Indic Vault捕捉了日常对话中真实的口语化表达,非常适合打造极具真实亲切感的AI助手。 --- ## 🧠 核心特性 * **口语化语言**:贴合2025年民众的真实说话习惯,摒弃刻板的教科书式印地语或英语表达。 * **多元主题**:涵盖20大类核心话题,包括金融、健康、科技、人际关系、家居生活、美食烹饪、教育、职业、娱乐、旅行、体育、文化、社会、环境、科学、法律与政务、商业、农业、美妆时尚以及政治。 * **高质量问答对**:每一条数据均经过精心筛选,确保信息具备实际意义且贴合上下文语境。 * **模型兼容性**:针对从40亿到270亿参数的开源模型微调进行了优化。 --- ## 💬 示例条目 ### 印地语 **问题**: "आजकल सोशल मीडिया का टीनएजर्स के मेंटल हेल्थ पर क्या असर होता है?" **答案**: "आजकल सोशल मीडिया टीनएजर्स (किशोरों) की ज़िंदगी का एक बहुत अहम हिस्सा बन गया है। इसका उनके मेंटल हेल्थ पर सकारात्मक और नकारात्मक दोनों तरह के प्रभाव पड़ सकते हैं। यह इस बात पर निर्भर करता है कि वे इसका इस्तेमाल कैसे करते हैं, कितना समय देते हैं और किस तरह की सामग्री से जुड़ते हैं। उदाहरण के लिए, ज़्यादा देर स्क्रीन टाइम या दूसरों से तुलना करने की आदत चिंता और आत्म-सम्मान पर असर डाल सकती है, जबकि सकारात्मक कंटेंट और सपोर्ट ग्रुप्स से मदद मिल सकती है।" ### 印式英语(Hinglish) **问题**: "Mujhe apna budget kitni baar check aur update karna chahiye?" **答案**: "Chal, suno yaar, yeh budget ko kitni baar check karna hai aur update karna hai na, yeh bahut important sawaal hai. Basically, har week ek quick check maaro, aur mahine mein ek baar deep dive karo. Telugu mein bolte hain kadaa — 'vaaram vaaramu parikshinchi sarididdandi', ante weekly review chesi correct cheyyandi. Life mein koi major change aaye toh budget turant update karo, simple!" ### 泰卢固语 **问题**: "ఇండియాలో మార్జినలైజ్డ్ కమ్యూనిటీస్ కి హెల్థ్ కేర్ ఆక్సెస్ చేయడానికి ఎలాంటి బ్యారియర్స్ ఉన్నాయి?" **答案**: "ఇండియాలో మార్జినలైజ్డ్ కమ్యూనిటీస్ హెల్త్ కేర్ అందుకోవడానికి చాలా బ్యారియర్స్ ఉన్నాయి. ముఖ్యంగా డబ్బు లేకపోవడం, హాస్పిటల్ లకు దూరం ఉండటం, మరియు సామాజిక వివక్ష (social discrimination) ఇవన్నీ పెద్ద అవరోధాలు. చాలా మందికి హెల్త్ స్కీమ్స్ గురించి తెలియకపోవడం, లేదా హాస్పిటల్స్ లో సరైన సదుపాయాలు లేకపోవడం కూడా ప్రధాన సమస్యలు." 这些示例展示了该数据集在以下方面的处理能力: * **自然口语化语气**:回复完全贴合日常闲聊时的真实表达习惯。 * **混合语言结构且保留原生脚本规范**:即使使用英语借词,也会以音译形式写入原生脚本(例如 फ़ोन、బ్యాటరీ),适配语音合成场景。 * **融合文化参考**:印式英语(Hinglish)示例融合了南北印度的本土习语,展现真实的跨语言流畅度。 * **上下文丰富的回复**:并非简短的格式化回答,而是完整且经过深思熟虑的细节拆解,适配实际部署需求。 * **多语言覆盖**:回复会根据语言细节与文化语境进行定制,而非单纯的直译。 这些特性让Indic Vault非常适合训练以下模型: * 具备文化适配性与情感感知能力的聊天机器人 * 可使用印度语言文本到语音(Indic TTS)系统自然朗读回复的语音智能体 * 能够以共情、深度且贴合现实语境的方式回应的助手 * 无需额外进行语气与格式对齐即可直接部署的微调后大语言模型(Large Language Model,简称LLM) 该数据集还在以下方面具备核心优势: * **自然口语化语气** * **混合语言结构且保留原生脚本规范** * **贴合2025年印度用户预期的现实话题** --- ## 🌟 实际应用与价值 Indic Vault绝非普通的数据集——它正是让你的聊天机器人或语音智能体真正适配印度本土使用场景的解决方案。绝大多数现有数据集的话术生硬刻板、过于正式,与用户的真实表达脱节,而Indic Vault完美解决了这一痛点。 你可以使用Indic Vault对**Gemma-12B**、**LLaMA 3**或**Mistral**等大语言模型进行微调,以实现以下目标: * 打造能够轻松处理非正式、上下文丰富的查询的聊天机器人,其应答能力媲美人工客服 * 训练具备文化适配性、表达自然的语音智能体 * 开发可理解“budget kaise banayein?”(如何制定预算)或“EMI kaunsa best hai?”(哪种分期还款最划算)等日常查询的个人助手 * 构建能够生成贴合本土语境的原生脚本内容的创作者工具,其语气与措辞真正能够引发用户共鸣 在Indic Vault上进行微调,可大幅提升模型的用户亲切感、脚本熟练度与语气准确性,尤其针对多语言混杂的印度用户群体效果显著。 --- ## 🛠️ 技术细节 * **格式**:采用JSON格式,包含`question`(问题)与`response`(回复)两个字段 * **支持语言**:目前涵盖印地语、印式英语(Hinglish)与泰卢固语,计划后续新增泰米尔语、马拉地语与孟加拉语 * **脚本规范**:仅使用原生书写系统(例如天城文(Devanagari)、泰卢固文(Telugu)、泰米尔文(Tamil)),英语借词仅以音译形式呈现,绝不使用拉丁字母 * **数据规模**:每种语言约含**80000条问答对**,并将随版本更新持续增长 --- ## 📈 性能分析 尽管Indic Vault旨在打造自然对话式AI,但我们并未宣称其性能可完全媲美Gemini 2.5或GPT-4o。不过在内部评估中,经过Indic Vault微调的开源模型,如**Gemma-12B**、**Mistral-7B**与**LLaMA 3-8B**,在处理混合语言、上下文丰富的提示词时,其自然度与亲切感可与闭源模型在**语气与可用性**层面相媲美。 其核心优势并非单纯的基准测试分数,而是: * **回复对印度用户的亲切感程度** * **微调后模型可处理的脚本切换、情感细节与日常用语表达能力** 这让Indic Vault成为打造可在真实用户场景中“适配自然”的可部署助手的理想选择。 --- ## 📦 数据集更新 目前,Indic Vault包含印地语、印式英语(Hinglish)与泰卢固语的问答对,所有回复均采用原生脚本书写。计划在后续版本中新增泰米尔语、马拉地语(Marathi)与孟加拉语(Bengali)。我们的目标是持续拓展印度语言覆盖范围,同时保留日常对话的真实语气与措辞风格。 --- ## 🤝 贡献与反馈 我们欢迎各类贡献与反馈,以进一步完善Indic Vault。若你有兴趣新增话题、语言或优化现有条目,请联系我们或前往我们的GitHub仓库提交拉取请求。 --- 本数据集卡片清晰、简洁且生动地介绍了Indic Vault,凸显了其在对话式AI领域的独特价值。通过聚焦真实的日常语言使用场景,Indic Vault为2025年印度语言助手的开发者提供了一款实用且可直接部署的优质资源。
提供机构:
maas
创建时间:
2025-05-08
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作