FREEMmax
arXiv. Updated 2022-02-19. Indexed 2024-08-06.
Download link:
http://arxiv.org/abs/2202.09452v1
Resource description:
FREEMmax is a large corpus of Early Modern French (16th to 18th centuries), created by the Institut National de Recherche en Informatique et en Automatique (INRIA) and other institutions. It contains approximately 186 million tokens drawn from a range of document collections and research projects, such as FRANTEXT and Electronic Enlightenment. The metadata were prepared manually and the texts are stored as XML TEI files to keep the data consistent and usable. The corpus is primarily intended to support digital humanities and linguistic research, in particular natural language processing tasks for Early Modern French such as part-of-speech tagging and text normalization, where historical language is complex to process and resources are scarce.
Provided by:
Institut National de Recherche en Informatique et en Automatique (INRIA)
Created:
2022-02-19



