five

Finance-Curriculum-Edu-Multilingual

收藏
魔搭社区2025-12-05 更新2025-09-06 收录
下载链接:
https://modelscope.cn/datasets/Josephgflowers/Finance-Curriculum-Edu-Multilingual
下载链接
链接失效反馈
官方服务:
资源简介:
## 💡 Abstract I present a cleaned, multilingual version of the *Finance Curriculum Edu* Q‑A dataset, comprising **7,941** entries spanning **60+ languages**, generated by translating and expanding upon the 7,794‑row English finance‑curriculum topics list. Every question is paired with a nuanced, domain‑rich answer in its target language. All entries are provided in a single **CSV file**. --- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6328952f798f8d122ce62a44/S_y4ZUMO-3RGhTFWEFuN9.png) ## 📚 Datasets & Links - **All the datasets have unique entries, they are not direct translations.** - **Master topics list (seed)**: [Finance Curriculum Topics list at Hugging Face](https://huggingface.co/datasets/Josephgflowers/finance_curriculum_topics) — a 7.79 k‑row CSV of curated finance topics used to guide question generation :contentReference[oaicite:4]{index=4} - **English version** (~6.87 k entries): [Josephgflowers/Finance_Curriculum_Edu_English] dataset in CSV format :contentReference[oaicite:5]{index=5} - **Arabic version** (~4.83 k entries): [Josephgflowers/Finance-Curriculum-Edu-Arabic] CSV dataset :contentReference[oaicite:6]{index=6} - **Uzbek version** (~2.23 k entries): [Josephgflowers/Finance-Curriculum-Edu-Uzbek] cleaned CSV dataset :contentReference[oaicite:7]{index=7} --- ## 📄 Dataset Overview | Property | Detail | |-----------------------|------------------------------------------| | **Languages** | ~60 (including English, Arabic, Uzbek) | | **Total size** | ~7,941 QA pairs | | **File format** | CSV (UTF‑8 encoded, cleaned) | | **Sponsor field** | Sanitized (values trimmed, typos fixed) | | **License** | MIT (open access) | | **Topics used** | 7,794 seed topics from master list | --- **Full Languages List Used:** "Arabic", "Amharic", "Azerbaijani", "Bengali", "Burmese", "Chinese (Simplified)", "Chinese (Traditional)", "Czech", "Danish", "Dutch", "English", "Finnish", "French", "Georgian", "German", "Greek", "Gujarati", "Haitian Creole", "Hausa", "Hebrew", "Hindi", "Hungarian", "Igbo", "Indonesian", "Italian", "Japanese", "Javanese", "Kazakh", "Khmer", "Korean", "Lao", "Malay", "Marathi", "Persian", "Polish", "Portuguese", "Punjabi", "Quechua", "Romanian", "Russian", "Serbian/Croatian/Bosnian", "Sinhala", "Somali", "Spanish", "Swahili", "Swedish", "Tagalog", "Tamil", "Telugu", "Thai", "Turkish", "Turkmen", "Ukrainian", "Urdu", "Uzbek", "Vietnamese", "Yoruba", "Zulu" ### 🛑 The Problem Despite rapid advances in large language models, **finance-domain Q\&A coherence outside English remains very poor**—especially for small and mid-sized models. * Most open datasets cover only basic finance, lack conceptual depth, or are English-only. * Community and business users report that models struggle with domain reasoning in Arabic, Uzbek, Chinese, and dozens of other languages. * For global applications, educational tools, and real financial tech products, this linguistic gap is a major bottleneck—leading to hallucinations, shallow answers, and poor user experience in non-English contexts. --- ### ✅ The Solution **Finance-Curriculum-Edu-Multilingual** directly addresses this by: * **Expanding the scope** of QA data to 60+ languages, not just English or a few major world languages. * **Grounding every question/answer in a curated finance curriculum**, ensuring conceptual richness across corporate finance, fintech, policy, risk, personal finance, and more. * **Cleaning and standardizing outputs** (removing sponsor artefacts, checking for consistency) to maximize utility for fine-tuning, benchmarking, and research. * Providing a large, *open-access*, CSV-formatted dataset with nearly 8,000 diverse, multilingual QA pairs—ready for use in both training and evaluation. * Enabling the community to benchmark and improve models’ reasoning and instruction-following across language boundaries, making finance LMs more equitable and globally useful. --- **Summary:** This dataset closes a critical gap for anyone building or testing AI for global finance, education, or fintech—bringing robust multilingual coverage and real conceptual depth to a domain where it was previously missing. --- ## 🔁 Generation & Cleaning Process 1. Each topic from the **master list** was translated or paired with a finance‑domain question in the target language via Pollinations.AI. 2. Conceptual, structured answers were generated using a finance‑expert-style template emphasizing frameworks like Basel III, CAPM, DCF, ESG, Monte Carlo, etc. 3. A post-processing pass removed or standardized sponsor entries (e.g. Pollinations.AI sponsor metadata), improving dataset hygiene without impairing content fidelity. 4. Output is consolidated into one **CSV file**, with consistent headers: `task_type`, `language`, `instruction_type`, `reasoning_tags`, `contains_code`, `topic`, `system`, `user`, `assistant`. --- ## 🎯 Intended Use Cases - Fine‑tuning compact multilingual finance LMs - Benchmarking conceptual finance reasoning across languages - Curriculum design for finance education – especially non‑English training - Probing how reasoning degrades in low‑resource finance scenarios --- ## ⚠️ Limitations & Responsible Use - **Automatically generated**: not fact‑checked; liable to subtle errors. Human verification recommended for high‑stake uses. - **Language imbalance**: mapping between translated and source topics might vary in nuance. - **Ethical caution**: meant for **research and educational demo purposes only**, especially regarding financial advice—real clients should rely on human experts. --- ## 📝 Citation & Contact **BibTeX:** ```bibtex @misc{Flowers2025FinanceEduMulti, title = {Finance Curriculum Edu – Multilingual QA (7,941 entries)}, author = {Joseph G. Flowers}, year = {2025}, howpublished = {\\url{https://huggingface.co/datasets/Josephgflowers/Finance-Curriculum-Edu-Multilingual}}, license = {MIT} } ```` Questions, corrections, or language‑specific input welcome in the Hugging Face discussion or dataset issue tracker. --- ## 🗂 Comparison with Per‑Language Releases | Version | Format | Entry Count | Notes | | ------------- | ------ | ----------- | ------------------------------------------------------------------------------- | | English | CSV | \~6.87 k | Pollinations‌‑generated content in English ([Hugging Face][1]) | | Arabic | CSV | \~4.83 k | Arabic translations / generations, cleaned sponsor entries ([Hugging Face][2]) | | Uzbek | CSV | \~2.23 k | Uzbek‑only dataset with cleaned CSV ([Hugging Face][3]) | | Master topics | CSV | 7.79 k | Pre‑QA seed list of finance topics \~ broad domain coverage ([Hugging Face][4]) | [1]: https://huggingface.co/Josephgflowers/datasets?utm_source=chatgpt.com "Josephgflowers (Joseph G Flowers)" [2]: https://huggingface.co/datasets/Josephgflowers/Finance-Curriculum-Edu-Arabic/tree/main "Josephgflowers/Finance-Curriculum-Edu-Arabic at main" [3]: https://huggingface.co/datasets/Josephgflowers/Finance-Curriculum-Edu-Uzbek "Josephgflowers/Finance-Curriculum-Edu-Uzbek · Datasets at Hugging Face" [4]: https://huggingface.co/datasets/Josephgflowers/finance_curriculum_topics "Josephgflowers/finance_curriculum_topics · Datasets at Hugging Face"

## 💡 摘要 本文提供了经过清洗的多语言版《Finance Curriculum Edu》问答(QA)数据集,包含7941条数据,覆盖60余种语言,其构建基础为包含7794条条目的英文金融课程主题列表,通过翻译与拓展生成。每个问题均配有目标语言下兼具细节深度与领域专业性的答案,所有数据均整合至单个CSV文件中。 --- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6328952f798f8d122ce62a44/S_y4ZUMO-3RGhTFWEFuN9.png) ## 📚 数据集与链接 - **所有数据集均包含独有条目,并非简单的直接翻译产物。** - **核心主题列表(种子集)**:[Hugging Face平台上的《Finance Curriculum Topics》数据集](https://huggingface.co/datasets/Josephgflowers/finance_curriculum_topics) — 该数据集为包含7794条条目的CSV文件,收录经筛选的金融主题,用于引导问答生成 :contentReference[oaicite:4]{index=4} - **英文版本**(约6870条数据):CSV格式的[Josephgflowers/Finance_Curriculum_Edu_English]数据集 :contentReference[oaicite:5]{index=5} - **阿拉伯语版本**(约4830条数据):CSV格式的[Josephgflowers/Finance-Curriculum-Edu-Arabic]数据集 :contentReference[oaicite:6]{index=6} - **乌兹别克语版本**(约2230条数据):清洗后的CSV格式数据集[Josephgflowers/Finance-Curriculum-Edu-Uzbek] :contentReference[oaicite:7]{index=7} --- ## 📄 数据集概览 | 属性 | 详情 | |-----------------------|------------------------------------------| | **语言覆盖** | 约60种(含英语、阿拉伯语、乌兹别克语等) | | **总规模** | 约7941组问答对 | | **文件格式** | UTF-8编码的CSV格式,已完成清洗 | | **赞助商字段** | 已标准化(值已修剪、错误已修正) | | **授权协议** | MIT开源协议 | | **所用主题** | 核心主题列表中的7794条种子主题 | --- **完整语言覆盖列表:** "阿拉伯语", "阿姆哈拉语", "阿塞拜疆语", "孟加拉语", "缅甸语", "简体中文", "繁体中文", "捷克语", "丹麦语", "荷兰语", "英语", "芬兰语", "法语", "格鲁吉亚语", "德语", "希腊语", "古吉拉特语", "海地克里奥尔语", "豪萨语", "希伯来语", "印地语", "匈牙利语", "伊博语", "印度尼西亚语", "意大利语", "日语", "爪哇语", "哈萨克语", "高棉语", "韩语", "老挝语", "马来语", "马拉地语", "波斯语", "波兰语", "葡萄牙语", "旁遮普语", "克丘亚语", "罗马尼亚语", "俄语", "塞尔维亚语/克罗地亚语/波斯尼亚语", "僧伽罗语", "索马里语", "西班牙语", "斯瓦希里语", "瑞典语", "他加禄语", "泰米尔语", "泰卢固语", "泰语", "土耳其语", "土库曼语", "乌克兰语", "乌尔都语", "乌兹别克语", "越南语", "约鲁巴语", "祖鲁语" ### 🛑 现存问题 尽管大语言模型(Large Language Model, LLM)领域发展迅猛,但**非英语环境下的金融领域问答一致性仍极差**,对于中小型模型而言尤为突出。 * 多数开源数据集仅覆盖基础金融知识,缺乏概念深度,或仅支持英语。 * 社区与商业用户反馈,模型在阿拉伯语、乌兹别克语、中文等数十种语言的金融领域推理任务中表现不佳。 * 对于全球应用、教育工具及实际金融科技产品而言,这一语言鸿沟已成为关键瓶颈——导致非英语场景下出现模型幻觉、答案肤浅及用户体验不佳等问题。 --- ### ✅ 解决方案 **Finance-Curriculum-Edu-Multilingual** 数据集针对性解决了上述问题,具体方式如下: * **拓展覆盖范围**:将问答数据覆盖至60余种语言,而非仅支持英语或少数主流语言。 * **锚定领域基准**:所有问答均基于经筛选的金融课程体系,确保覆盖公司金融、金融科技、监管政策、风险管理、个人理财等多个领域的概念深度。 * **清洗与标准化处理**:移除或标准化赞助商元数据等冗余信息,检查并统一格式,在不损害内容真实性的前提下提升数据集的整洁度,最大化其在模型微调、基准测试与研究中的实用性。 * **开放可即用格式**:提供包含近8000组多样化多语言问答对的大型开源CSV数据集,可直接用于模型训练与评估。 * **推动公平化发展**:助力社区在多语言场景下基准测试并优化模型的推理与指令遵循能力,让金融领域大语言模型更具公平性与全球适用性。 --- **总结:** 本数据集填补了一项关键空白:为所有构建或测试面向全球金融、教育或金融科技的AI的从业者提供支持,为此前缺乏相关资源的领域带来了可靠的多语言覆盖与真正的概念深度。 --- ## 🔁 生成与清洗流程 1. 通过Pollinations.AI平台,将**核心主题列表**中的每个主题翻译为目标语言,或为其生成目标语言下的金融领域专属问题。 2. 采用金融专家风格的模板生成兼具概念性与结构化的答案,重点覆盖《巴塞尔协议Ⅲ》(Basel III)、资本资产定价模型(Capital Asset Pricing Model, CAPM)、折现现金流模型(Discounted Cash Flow, DCF)、环境、社会和公司治理(Environmental, Social, Governance, ESG,简称ESG)、蒙特卡洛模拟(Monte Carlo)等金融框架。 3. 执行后处理流程,移除或标准化赞助商相关条目(如Pollinations.AI的赞助元数据),在不损害内容真实性的前提下提升数据集的整洁度。 4. 将最终结果整合至单个**CSV文件**中,采用统一的字段名:`任务类型`(task_type)、`语言`(language)、`指令类型`(instruction_type)、`推理标签`(reasoning_tags)、`是否包含代码`(contains_code)、`主题`(topic)、`系统提示`(system)、`用户输入`(user)、`助手回复`(assistant)。 --- ## 🎯 预期应用场景 - 针对轻量化多语言金融大语言模型进行微调 - 跨语言基准测试金融领域概念推理能力 - 金融教育课程设计,尤其是非英语语言的教学素材构建 - 探究低资源金融场景下模型推理能力的退化规律 --- ## ⚠️ 局限性与合规使用说明 - **自动生成产物**:未经过事实核查,可能存在细微错误,高风险场景下建议进行人工验证。 - **语言分布不均**:翻译后主题与源主题之间的细节匹配度可能存在差异。 - **伦理警示**:本数据集仅用于**研究与教育演示**,尤其不可作为金融建议的依据——真实用户应咨询专业金融从业者。 --- ## 📝 引用与联系方式 **BibTeX引用格式:** bibtex @misc{Flowers2025FinanceEduMulti, title = {Finance Curriculum Edu – Multilingual QA (7,941 entries)}, author = {Joseph G. Flowers}, year = {2025}, howpublished = {\url{https://huggingface.co/datasets/Josephgflowers/Finance-Curriculum-Edu-Multilingual}}, license = {MIT} } 欢迎在Hugging Face平台的讨论区或数据集问题追踪器中提交疑问、修正建议或特定语言的输入内容。 --- ## 🗂 与单语言发布版本的对比 | 版本 | 格式 | 条目数 | 备注 | | ------------- | ------ | ----------- | ------------------------------------------------------------------------------- | | 英文版本 | CSV | ~6.87 k | 由Pollinations.AI生成的英文内容,[Hugging Face链接][1] | | 阿拉伯语版本 | CSV | ~4.83 k | 阿拉伯语翻译/生成内容,已清理赞助商条目,[Hugging Face链接][2] | | 乌兹别克语版本 | CSV | ~2.23 k | 仅含乌兹别克语内容的清洗后CSV数据集,[Hugging Face链接][3] | | 核心主题列表 | CSV | 7.79 k | 预问答阶段的金融主题种子列表,覆盖广泛领域,[Hugging Face链接][4] | [1]: https://huggingface.co/Josephgflowers/datasets?utm_source=chatgpt.com "Josephgflowers (Joseph G Flowers)" [2]: https://huggingface.co/datasets/Josephgflowers/Finance-Curriculum-Edu-Arabic/tree/main "Josephgflowers/Finance-Curriculum-Edu-Arabic at main" [3]: https://huggingface.co/datasets/Josephgflowers/Finance-Curriculum-Edu-Uzbek "Josephgflowers/Finance-Curriculum-Edu-Uzbek · Datasets at Hugging Face" [4]: https://huggingface.co/datasets/Josephgflowers/finance_curriculum_topics "Josephgflowers/finance_curriculum_topics · Datasets at Hugging Face"
提供机构:
maas
创建时间:
2025-08-31
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作