scholarly360/indian_ipo_prospectus_data_with_pageno

Name: scholarly360/indian_ipo_prospectus_data_with_pageno
Creator: scholarly360
Published: 2023-08-15 13:27:37
License: No description available

Hugging Face | Updated 2023-08-15 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/scholarly360/indian_ipo_prospectus_data_with_pageno

Resource overview:

---
{}
---

# Dataset Card for Dataset Name

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Prospectus text mining helps the investor community identify major risk factors and evaluate how the amount raised during an IPO will be used. For this dataset, the author downloaded 100 prospectuses from the Indian market regulator's website. The dataset contains the URL and OCR text for each of the 100 prospectuses. The author has also released a RoBERTa language model and a sentence transformer for use with the data. This version of the dataset also contains page numbers, for use in Retrieval Augmented Generation.

### Supported Tasks and Leaderboards

Retrieval Augmented Generation

### Languages

English

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

There are 4 columns:

- `title_prospectus`: Title of the IPO prospectus
- `href_prospectus`: Location of HTML
- `pdf_prospectus`: PDF of the prospectus
- `content_whole_prospectus`: OCR text for the whole prospectus

### Data Splits

N.A.

## Dataset Creation

### Curation Rationale

Prospectus text mining

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

This dataset will help investors and the merchant banking community explore prospectuses in a more automated way, saving time.

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```bibtex
@misc{mishra2022roberta,
  author = {Abhishek Mishra and Yogendra Sisodia},
  title = {ROBERTA GOES FOR IPO: PROSPECTUS ANALYSIS WITH LANGUAGE MODELS FOR INDIAN INITIAL PUBLIC OFFERINGS},
  year = {2022},
  url = {https://aircconline.com/csit/papers/vol12/csit121905.pdf},
}
```

### Contributions

Made by the author, [Scholarly360](https://github.com/Scholarly360).

Provided by:

scholarly360

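Summary of original information

Dataset Card for Dataset Name

Dataset description

Dataset summary

Prospectus text mining helps the investor community identify major risk factors and evaluate how the amount raised during an IPO will be used. The author of this dataset downloaded 100 prospectuses from the Indian market regulator's website. The dataset contains the URL and OCR text for each of the 100 prospectuses. The author has also released a RoBERTa language model and a sentence transformer for use with the data. The dataset also contains page numbers, for use in Retrieval Augmented Generation.

Supported tasks and leaderboards

Retrieval Augmented Generation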
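
As a rough sketch of how page numbers can support Retrieval Augmented Generation, the snippet below ranks pages against a question and builds a prompt that cites the page each passage came from. It assumes page-level text is available as `(page number, text)` pairs; the card does not document how page numbers are stored, so that layout, the helper names `retrieve_pages` and `build_prompt`, and the sample pages are all hypothetical. Plain word overlap stands in for a real embedding-based retriever here.

```python
import re
from typing import List, Set, Tuple

# Hypothetical layout: (page number, page text). The dataset card does not
# specify how page numbers are stored, so this is an assumption.
Page = Tuple[int, str]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_pages(query: str, pages: List[Page], top_k: int = 2) -> List[Page]:
    """Rank pages by word overlap with the query (a stand-in for embeddings)."""
    query_tokens = tokenize(query)
    scored = []
    for page_no, text in pages:
        overlap = len(query_tokens & tokenize(text))
        if overlap > 0:
            scored.append((overlap, page_no, text))
    # Highest overlap first; ties go to the earlier page.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(page_no, text) for _, page_no, text in scored[:top_k]]


def build_prompt(query: str, hits: List[Page]) -> str:
    """Assemble a prompt whose context lines carry their page numbers."""
    context = "\n".join(f"[page {page_no}] {text}" for page_no, text in hits)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Made-up pages, for illustration only.
pages = [
    (1, "Cover page of the draft prospectus."),
    (12, "Risk factors: the company depends on a few key customers."),
    (30, "Objects of the issue: proceeds will fund working capital."),
]

question = "What are the key risk factors?"
hits = retrieve_pages(question, pages)
prompt = build_prompt(question, hits)
print(prompt)
```

Because no stop words are filtered, common words such as "the" still count as overlap; a real pipeline would more likely score pages with the released sentence transformer or another embedding model.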

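Languages

English

Dataset structure

Data instances

[More Information Needed]

Data fields

The dataset contains 4 columns:

title_prospectus: Title of the IPO prospectus
href_prospectus: Location of HTML
pdf_prospectus: PDF of the prospectus
content_whole_prospectus: OCR text for the whole prospectus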
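
As a minimal sketch of working with the four columns listed above, the snippet below defines a typed record and a completeness check, and runs them on a hand-written placeholder row. The class `ProspectusRow`, the helpers `missing_fields` and `to_row`, and every example value are assumptions for illustration and are not taken from the dataset; real rows would first have to be fetched from the Hugging Face Hub, which is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, List

# Column names as listed in the dataset card.
COLUMNS = (
    "title_prospectus",
    "href_prospectus",
    "pdf_prospectus",
    "content_whole_prospectus",
)


@dataclass
class ProspectusRow:
    """Hypothetical typed view of one row; not part of the dataset itself."""

    title_prospectus: str
    href_prospectus: str
    pdf_prospectus: str
    content_whole_prospectus: str


def missing_fields(raw: Dict[str, object]) -> List[str]:
    """Return the documented columns that are absent or empty in a raw row."""
    return [name for name in COLUMNS if not raw.get(name)]


def to_row(raw: Dict[str, object]) -> ProspectusRow:
    """Build a ProspectusRow, ignoring any keys beyond the four columns."""
    missing = missing_fields(raw)
    if missing:
        raise ValueError(f"row is missing columns: {missing}")
    return ProspectusRow(**{name: str(raw[name]) for name in COLUMNS})


# Placeholder row with made-up values, for illustration only.
example = {
    "title_prospectus": "Example Ltd - Draft Prospectus",
    "href_prospectus": "https://example.invalid/prospectus.html",
    "pdf_prospectus": "https://example.invalid/prospectus.pdf",
    "content_whole_prospectus": "RISK FACTORS ...",
    "extra_column": 1,
}

row = to_row(example)
print(row.title_prospectus)
```

Extra keys are ignored on purpose: the dataset name suggests a page-number field may exist beyond the four documented columns, and the card does not say how it is represented.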

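Data splits

N.A.

Dataset creation

Curation rationale

Prospectus text mining

Source data

Initial data collection and normalization

[More Information Needed]

Source language producers

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Annotators

[More Information Needed]

Personal and sensitive information

[More Information Needed]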

Considerations for using the data

Social impact of the dataset

This dataset will help investors and the merchant banking community explore prospectuses in a more automated way, saving time.

Discussion of biases

[More Information Needed]

Other known limitations

[More Information Needed]

Additional information

Dataset curators

[More Information Needed]

Licensing information

[More Information Needed]

Citation information

@misc{mishra2022roberta,
  author = {Abhishek Mishra and Yogendra Sisodia},
  title = {ROBERTA GOES FOR IPO: PROSPECTUS ANALYSIS WITH LANGUAGE MODELS FOR INDIAN INITIAL PUBLIC OFFERINGS},
  year = {2022},
  url = {https://aircconline.com/csit/papers/vol12/csit121905.pdf},
}

Contributions

Made by the author, Scholarly360.
