Annual Reports Assessment Dataset
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/7536331
Description:
This dataset will help investors, merchant bankers, credit rating agencies, and the community of equity research analysts explore annual reports in a more automated way, saving them time.
The dataset contains the following sub-datasets:
a) PDFs and corresponding OCR text of 100 Indian annual reports
These 100 annual reports are for the 100 largest companies listed on the Bombay Stock Exchange.
The OCRed text contains 12.25 million words in total.
b) A few example sentences with corresponding classes
The author defined 16 topics widely used in the investment community as classes:
Accounting Standards
Accounting for Revenue Recognition
Corporate Social Responsibility
Credit Ratings
Diversity Equity and Inclusion
Electronic Voting
Environment and Sustainability
Hedging Strategy
Intellectual Property Infringement Risk
Litigation Risk
Order Book
Related Party Transaction
Remuneration
Research and Development
Talent Management
Whistle Blower Policy
These classes should help generate ideas and investment decisions, as well as identify red flags and early warning signs of trouble when everything appears to be proceeding smoothly.
ABOUT DATA:
"scrips.json" is a JSON file listing the companies, with these fields:
"SC_CODE" is the BSE scrip ID
"SC_NAME" is the listed company's name
"NET_TURNOV" is the turnover on the day of consideration
"source_pdf" is a folder containing both the PDFs and the OCR output from Tesseract
"raw_pdf.zip" contains the raw PDFs and can be used to try another OCR engine.
"ocr.zip" contains a JSON file (annual_report_content.json) with the OCR text for each PDF.
"annual_report_content.json" is an array of 100 elements; each element has two keys, "file_name" and "content".
"classif_data_rank_freezed.json" is used for evaluation of results; it contains each "sentence" and its corresponding "class".
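The file layout described above can be explored with a few lines of Python. This is a minimal sketch, not the author's tooling. It assumes that "content" holds the OCR text as a single string, and that "classif_data_rank_freezed.json" is an array of objects with "sentence" and "class" keys; neither detail is confirmed by the description. The helper names (load_json, count_words, class_distribution, summarize) and the file paths are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_json(path):
    """Read a UTF-8 JSON file and return the parsed object."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def count_words(reports):
    """Map each report's file_name to its whitespace-delimited word count.

    Assumes every element has "file_name" and a string-valued "content".
    """
    return {r["file_name"]: len(r["content"].split()) for r in reports}


def class_distribution(examples):
    """Count how many labeled sentences fall under each class.

    Assumes a list of objects with "sentence" and "class" keys.
    """
    return Counter(e["class"] for e in examples)


def summarize(report_path, labels_path):
    """Print a short overview of the OCR text and the labeled sentences."""
    word_counts = count_words(load_json(report_path))
    print(f"{len(word_counts)} reports, {sum(word_counts.values())} words in total")
    for label, n in class_distribution(load_json(labels_path)).most_common():
        print(f"{label}: {n}")
```

After extracting "ocr.zip", a call such as summarize("annual_report_content.json", "classif_data_rank_freezed.json") would print the report count, the total word count, and the number of sentences per class. Whitespace splitting is only a rough word count, so the total may differ from the 12.25 million figure quoted above.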
Created:
2023-01-14



