
Appendix for Text Analysis and Trading Strategies with Multi-Source Financial Large Language Models

Source: DataCite Commons. Updated: 2026-03-20. Indexed: 2026-05-05.
Download link:
https://www.scidb.cn/detail?dataSetId=24c1430399284bb5820248fbbc138ebd
Resource description:
This document is a supplementary appendix to "Text Analysis and Trading Strategy of Multi Source Data Driven Financial Big Language Model". Because of confidentiality constraints, the original data cannot be fully disclosed. The main information is as follows.

The sample period runs from January 1, 2020 to December 31, 2024, and the research objects are the constituent stocks of the CSI 300 (Shanghai and Shenzhen 300) Index, comprising 459 historical constituent stocks in total. On the one hand, the selected stocks span a wide range of industries and attract high market attention, providing sufficient data to reflect the impact of multi-source information on asset prices. On the other hand, the period covers key stages such as the epidemic shock, economic recovery, and policy adjustment, which helps in evaluating the model's adaptability across market environments. Given the computational demands of LLMs and the complexity of multi-source data processing, the five-year sample period balances representativeness against computational feasibility.

The news text data come from ChinaScope's SmarTag news database, which covers a large volume of financial, industry, and government news reports from across the web and carries key tags for companies, individuals, events, industries, and products. After company relevance screening and text similarity deduplication, the final sample covers 459 stocks with a total of 827,800 news articles over 60 months. Each stock has about 31 news articles per month on average, with an average length of 335 words per article.

The company basic information text comes from the listed-company basic information database provided by DataYes, which covers company introductions, main businesses, and basic information for all A-share listed companies. The average text length across the sample is 435 words; the longest basic information text is 2,858 words and the shortest only 23 words.

The stock price data come from the daily trading data provided by the Wind database. To ensure comparability and continuity, forward-adjusted opening and closing prices are used. After data cleaning, a total of 353,185 daily stock return observations were obtained.

The macroeconomic research report data come from DataYes and include the New Fortune Best Analyst selection data and the research report text. After multiple rounds of screening, 869 valid samples were obtained, covering the entire sample period. The statistics show, first, that the average report length is 8.9 pages, with the abstract section averaging 2,230 characters.
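The description mentions text similarity deduplication of the news articles but does not say which similarity measure or cutoff was applied. As a rough illustration only, the sketch below drops near-duplicate texts using the character-level similarity ratio from Python's standard `difflib` module. The function name `deduplicate`, the `threshold` value of 0.9, and the sample headlines are all hypothetical and are not taken from the appendix.

```python
from difflib import SequenceMatcher


def deduplicate(articles, threshold=0.9):
    """Keep the first article from each group of near-duplicates.

    Assumption: similarity is difflib's ratio (0.0 to 1.0) and 0.9 is an
    illustrative cutoff; the source does not state the measure or cutoff.
    """
    kept = []
    for text in articles:
        # Compare the candidate against every article already kept.
        is_duplicate = any(
            SequenceMatcher(None, text, earlier).ratio() >= threshold
            for earlier in kept
        )
        if not is_duplicate:
            kept.append(text)
    return kept


# Made-up headlines: the second differs from the first only in punctuation.
news = [
    "Company A reports record quarterly profit on strong export demand.",
    "Company A reports record quarterly profit on strong export demand!",
    "Regulator announces new disclosure rules for listed companies.",
]
unique_news = deduplicate(news)
print(len(unique_news))
```

Pairwise comparison grows quadratically with the number of articles, so a corpus of the size reported here would likely need a more scalable scheme such as hashing-based near-duplicate detection; this sketch only shows the idea.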
Provider:
Science Data Bank
Created:
2026-03-20