chaannwooff/Dartdoc
Source: Hugging Face. Updated 2026-04-22; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/chaannwooff/Dartdoc
Description:
Dartdoc - Korean Financial Disclosure Text Dataset is a dataset for training Korean large language models (LLMs), collected through the OpenAPI of DART, the electronic disclosure system of Korea's Financial Supervisory Service. It contains high-quality Korean text extracted from public filings such as business reports and securities filings. The collection period runs from 2020 to 2025, for a total of 256,548 records and approximately 460 million characters, averaging about 1,794 characters per chunk. Extraction focuses on the higher-quality sections of each document, such as "II. Business Content" and "IV. Directors' Management Diagnosis and Analysis Opinion". Records are stored in JSON format with multiple fields, including receipt number, company name, and report type. The text is filtered under strict criteria: lines that are less than 20% Korean are removed, as are lines made up solely of digits or special characters and chunks shorter than 200 characters.
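The filtering criteria in the description can be illustrated with a minimal Python sketch. This is not the dataset author's code. The function names, the choice to measure the Korean ratio over non-whitespace characters, the restriction to precomposed Hangul syllables, and the order of the steps (line filters first, then the chunk length check) are all assumptions made for illustration; only the 20% and 200-character thresholds come from the description.

```python
import re
from typing import Optional

# Assumption: "Korean" means precomposed Hangul syllables (U+AC00..U+D7A3).
HANGUL = re.compile(r"[\uac00-\ud7a3]")
MIN_KOREAN_RATIO = 0.20   # threshold stated in the description
MIN_CHUNK_CHARS = 200     # threshold stated in the description


def korean_ratio(line: str) -> float:
    """Share of non-whitespace characters that are Hangul syllables."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if HANGUL.match(c)) / len(chars)


def is_symbol_only(line: str) -> bool:
    """True when the line contains no letters (only digits, symbols, spaces)."""
    return not any(c.isalpha() for c in line)


def clean_chunk(text: str) -> Optional[str]:
    """Drop low-quality lines, then reject the chunk if it is too short."""
    kept = [
        line
        for line in text.splitlines()
        if not is_symbol_only(line) and korean_ratio(line) >= MIN_KOREAN_RATIO
    ]
    cleaned = "\n".join(kept)
    return cleaned if len(cleaned) >= MIN_CHUNK_CHARS else None
```

In this sketch, `clean_chunk` returns the filtered text, or `None` when the remaining text falls below the length threshold. A real pipeline might instead measure the ratio over all characters or apply the length check before line filtering; the description does not say.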
Provider:
chaannwooff



