selyobkj/investopedia-embedding-dataset

Name: selyobkj/investopedia-embedding-dataset
Creator: selyobkj
Published: 2026-03-06 12:54:35
License: No description available

Hugging Face · Updated 2026-03-06 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/selyobkj/investopedia-embedding-dataset


Resource description:

---
license: cc-by-nc-4.0
---

# Dataset Card for investopedia-embedding dataset

We curate a finance dataset of substantial size from Investopedia using a new technique that combines unstructured scraped data with an LLM to generate structured data suitable for fine-tuning embedding models. Dataset generation uses a new self-verification method that ensures, with high probability, that the generated question-answer pairs are not hallucinated by the LLM.

### Dataset Description

Each data point in the dataset consists of the following fields:

* `Topic`: A general classification of the topic around which the questions and answers are generated.
* `Title`: A more detailed description or heading for the passage from which the questions and answers are generated.
* `Question`: The sentence1 of your training dataset for the embedding model, also referred to as the anchor.
* `Answer`: The sentence2 of your training dataset for the embedding model, also referred to as a positive sample.

Example:

```
{
  'Topic': 'mortgage',
  'Title': '<title>How to Use a Home Equity Loan for a Remodel</title>',
  'Question': 'What are some advantages of using a home equity loan for a home remodel compared to unsecured options like personal loans?',
  'Answer': 'The passage highlights two main advantages: home equity loans generally offer low interest rates when compared to unsecured options like personal loans, which can help save money on home remodel costs. Additionally, they have fixed interest rates, providing stability in monthly payments and protection from rate changes during the entire repayment term.'
}
```

- **Curated by:** FinLang Team
- **Language(s) (NLP):** English
- **License:** cc-by-nc-4.0

### Dataset Sources [optional]

- **Repository:** https://huggingface.co/datasets/FinLang
- **Paper:** [Coming Soon]

## Dataset Structure

We split the dataset 90-10 into training and test sets.
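As a rough illustration of how the fields above might feed an embedding-model training loop, the sketch below maps each record to an (anchor, positive) pair and makes a 90-10 split. The card does not say how the published split was produced, so the shuffle, the seed, and the helper names `to_pairs` and `split_90_10` are assumptions made for this example, and the records are placeholders rather than real dataset rows.

```python
import random


def to_pairs(records):
    """Map each record to an (anchor, positive) tuple: Question -> anchor, Answer -> positive."""
    return [(r["Question"], r["Answer"]) for r in records]


def split_90_10(items, seed=0):
    """Shuffle a copy deterministically and return (train, test) with about 90% in train.

    Assumption: a seeded random shuffle; the card does not describe the actual procedure.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 9) // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Placeholder records (NOT real dataset rows), only to exercise the helpers.
records = [
    {
        "Topic": "placeholder",
        "Title": f"<title>Placeholder {i}</title>",
        "Question": f"placeholder question {i}",
        "Answer": f"placeholder answer {i}",
    }
    for i in range(20)
]

pairs = to_pairs(records)
train, test = split_90_10(pairs)
print(len(train), len(test))  # prints "18 2"
```

Pairs in this shape are what contrastive objectives that rely on in-batch negatives typically expect, which would explain why the card names only an anchor and a positive sample.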
## Dataset Creation

### Curation Rationale

Three key limitations hinder the widespread use of language models in the financial domain. First, there are no large (on the order of a million tokens) publicly available datasets suited to fine-tuning language and embedding models, a direct consequence of large companies such as Bloomberg protecting their internal data for monetary and privacy reasons. Second, current language models falter on complex financial abbreviations, which are common in financial documents, again pointing to a lack of training data. Third, even with an abundance of finance data on the internet, on websites such as Investopedia and Yahoo Finance, it is hard to obtain data in a form suitable for instruction tuning or embedding training: annotating the unstructured data requires experts, and their compensation is costly given the high pay typical of the financial sector.

### Source Data

The source data is collected from [Investopedia](https://www.investopedia.com/).

### License

Because non-commercial data was used to generate the dataset, we release it under cc-by-nc-4.0.

## Citation

[Coming Soon]


Provider:

selyobkj
