FinLang/investopedia-instruction-tuning-dataset
收藏数据集卡片 for investopedia-instruction-tuning 数据集
我们使用一种新的技术从 Investopedia 中筛选出与金融相关的大规模数据集,该技术利用非结构化的抓取数据和大型语言模型(LLM)生成适合微调嵌入模型的结构化数据。数据集生成采用了一种新的自验证方法,确保生成的问答对在很大程度上不是由LLM幻觉产生的。
数据集描述
每个数据点包含以下字段:
Topic:问题和答案生成所围绕的主题的一般分类。Title:更详细的描述或从其中生成问题和答案的段落的标题。Context:从互联网上抓取的文本的真实/未编辑段落。这可以用于RAG应用中的微调,以减少幻觉。Question-Answer:用于在完整问答对上进行SFT的拼接问答。Question:用于训练聊天模型的指令。Answer:用于训练聊天模型的响应。bge-large-en-v1.5-correlation:在BGE大型嵌入模型下计算问题和答案的相关性。尽管BGE对金融的理解可能有限,但它作为基线是有用的。
示例:
json { "Topic": "mortgage", "Title": "<title>How to Use a Home Equity Loan for a Remodel</title>", "Context": "Here are some to keep in mind: Because home equity loans are secured by your home, they generally offer low interest rates when compared to unsecured options like personal loans. These low rates can help you save money on your home remodel costs. Unlike home equity lines of credit (HELOCs), home equity loans have fixed interest rates. This means you aren’t vulnerable to rate changes, and your monthly payment will remain stable for your entire repayment term. Home equity loans can be used for a wide variety of purposes, including home improvements. While your lender may ask what the money will be used for, you generally won’t have to provide any sort of documentation", "Question-Answer": "Question: What are some advantages of using a home equity loan for a home remodel compared to unsecured options like personal loans? Answer: The passage highlights two main advantages: home equity loans generally offer low interest rates when compared to unsecured options like personal loans, which can help save money on home remodel costs. Additionally, they have fixed interest rates, providing stability in monthly payments and protection from rate changes during the entire repayment term.", "Question": "What are some advantages of using a home equity loan for a home remodel compared to unsecured options like personal loans?", "Answer": "The passage highlights two main advantages: home equity loans generally offer low interest rates when compared to unsecured options like personal loans, which can help save money on home remodel costs. Additionally, they have fixed interest rates, providing stability in monthly payments and protection from rate changes during the entire repayment term.", "bge-large-en-v1.5-correlation": 0.915799 }
数据集来源
- 来源数据:从 Investopedia 收集。
数据集结构
我们创建了一个90-10的训练和测试数据集分割。
数据集创建
筛选理由
在金融领域,语言模型的普及存在三个关键限制:首先,没有适合语言和嵌入模型微调的大型(数百万个令牌)公开可用数据集,这是由于大型公司如Bloomberg等出于金钱和隐私利益保护内部数据的结果;其次,当前的语言模型在遇到复杂的金融缩写时表现不佳,这再次指向了训练模型数据不足的问题;第三,尽管互联网上有大量关于金融的数据,如Investopedia、Yahoo Finance等网站,但很难以适合指令调优或嵌入训练的形式获取数据,因为注释非结构化数据集将因需要高薪专家而产生巨大成本。
许可证
由于用于生成数据集的数据是非商业性的,因此我们以cc-by-nc-4.0许可证发布此数据集。




