anonymous-1501/Open-Korean-Historical-Corpus
Source: Hugging Face. Updated: 2025-10-22. Indexed: 2025-11-15.
Download link:
https://hf-mirror.com/datasets/anonymous-1501/Open-Korean-Historical-Corpus
Description:
The Open Korean Historical Corpus is a large-scale, openly licensed dataset created to address the lack of accessible data for Korean NLP and historical linguistics. It contains 17.7 million documents (5.1 billion tokens) compiled from 19 distinct archives, spanning 1,300 years from the 7th century to 2025. The corpus is linguistically diverse, covering Korean (Middle, Early Modern, Modern, and North Korean), Classical Chinese, and Japanese. It provides the first large-scale open resource for under-represented writing systems such as Korean-style Sinitic (Idu) and Hanja-Hangul mixed script.
Provider:
anonymous-1501



