Preprocessed Dataset
Source: www.doi.org (indexed 2025-03-25)
Download link:
https://www.doi.org/10.11922/sciencedb.j00133.00005
Description:
The XML-format corpus was parsed and denoised. For each sentence, the article ID, chapter ID, sentence text, sentence label, and sentence sequence number were extracted. Invalid noise entries were then removed, such as a small number of human errors (for example, sentences that are too short, or sentences whose content is formula symbols). Preprocessing yields 34,590 valid records.
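The parsing and denoising steps described above could be sketched roughly as follows. This is a minimal illustration, not the dataset authors' actual pipeline: the XML element and attribute names (`article`, `chapter`, `sentence`, `id`, `tag`), the minimum-length threshold, and the "mostly non-letter characters" heuristic for formula-like sentences are all assumptions, since the listing does not document the corpus schema or the exact filtering rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real corpus schema is not documented in the listing.
SAMPLE_XML = """
<corpus>
  <article id="A1">
    <chapter id="C1">
      <sentence tag="background">Prior work has studied sentence classification in papers.</sentence>
      <sentence tag="method">E = m * c ^ 2 + ( 1 / 2 ) * k</sentence>
      <sentence tag="method">Ok.</sentence>
      <sentence tag="result">The proposed model improves accuracy on the test set.</sentence>
    </chapter>
  </article>
</corpus>
"""

MIN_CHARS = 10           # assumed threshold for "too short"
MIN_LETTER_RATIO = 0.5   # assumed threshold for "formula-like"


def is_noise(text: str) -> bool:
    """Return True for sentences that are too short or look like formulas."""
    text = text.strip()
    if len(text) < MIN_CHARS:
        return True
    non_space = [c for c in text if not c.isspace()]
    letters = sum(c.isalpha() for c in non_space)
    # Formula-like: fewer than half of the visible characters are letters.
    return letters / len(non_space) < MIN_LETTER_RATIO


def preprocess(xml_text: str) -> list[dict]:
    """Extract one record per sentence and drop noisy ones."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for article in root.iter("article"):
        seq = 0  # sequence number within the article (an assumption)
        for chapter in article.iter("chapter"):
            for sent in chapter.iter("sentence"):
                seq += 1  # numbered before filtering, so gaps mark removed items
                text = (sent.text or "").strip()
                if is_noise(text):
                    continue
                rows.append({
                    "article_id": article.get("id"),
                    "chapter_id": chapter.get("id"),
                    "text": text,
                    "tag": sent.get("tag"),
                    "sentence_no": seq,
                })
    return rows


rows = preprocess(SAMPLE_XML)
for row in rows:
    print(row["sentence_no"], row["tag"], row["text"])
```

In this sketch, sequence numbers are assigned before filtering, so each kept sentence retains its original position in the article. Whether the published dataset numbers sentences this way is not stated in the listing.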
Provider:
www.doi.org



