Evaluation data set of Word Segmentation technology in Minority Languages (MLWS2021)

Name: Evaluation data set of Word Segmentation technology in Minority Languages (MLWS2021)
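Creator: National Language Resources Monitoring and Research Center for Minority Languages; Qinghai Normal University; Mieradilijiang Maimaiti; Tibet University; State Key Laboratory of Tibetan Intelligent Information Processing and Application; Minzu University of China; Tsinghua University; Hohhot Minzu College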
Published: 2022-01-28 00:00:00
License: No description available

Science Data Bank; updated 2022-01-28; indexed 2026-04-23

Download link:

https://www.scidb.cn/en/detail?dataSetId=09ebc19c041f4e23ba2aee9b91a16494

Description:

The MLWS2021 word segmentation evaluation data set covers Mongolian, Tibetan, and Uyghur. It targets the core technology of automatic word segmentation for these three languages. Building on MLWS2017, the data set extends coverage from the news domain alone to news, economy, law, entertainment, and other domains, and grows from more than 30,000 sentences to 155,000 sentences. It comprises three data files: (1) Tibetan.Zip, Tibetan word segmentation and annotation data, 25,000 sentences, 1.52 MB; (2) Mongolian.Zip, Mongolian word segmentation and annotation data, 65,000 sentences, 3.16 MB; (3) Uyghur.Zip, Uyghur word segmentation and annotation data, 65,000 sentences, 5.12 MB.
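
As a quick sanity check on the figures in the description, the per-file sentence counts can be tallied against the stated total. The sketch below is illustrative only: the `MANIFEST` structure and its field names are assumptions made here, not part of the data set, and only the numbers are taken from the description.

```python
# Figures copied from the description; the manifest layout itself is
# a hypothetical structure used only for this check.
MANIFEST = {
    "Tibetan.Zip": {"language": "Tibetan", "sentences": 25_000, "size_mb": 1.52},
    "Mongolian.Zip": {"language": "Mongolian", "sentences": 65_000, "size_mb": 3.16},
    "Uyghur.Zip": {"language": "Uyghur", "sentences": 65_000, "size_mb": 5.12},
}

STATED_TOTAL_SENTENCES = 155_000  # total given in the description

# Sum the per-file figures.
total_sentences = sum(entry["sentences"] for entry in MANIFEST.values())
total_size_mb = round(sum(entry["size_mb"] for entry in MANIFEST.values()), 2)

# Report each file's share of the whole data set.
for name, entry in MANIFEST.items():
    share = entry["sentences"] / total_sentences
    print(f"{name}: {entry['sentences']} sentences ({share:.1%}), {entry['size_mb']} MB")

print(f"Total: {total_sentences} sentences, {total_size_mb} MB")
print("Matches stated total:", total_sentences == STATED_TOTAL_SENTENCES)
```

The per-file counts sum to the stated 155,000 sentences, so the description is internally consistent.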

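Provided by:

National Language Resources Monitoring and Research Center for Minority Languages; Qinghai Normal University; Mieradilijiang Maimaiti; Tibet University; State Key Laboratory of Tibetan Intelligent Information Processing and Application; Minzu University of China; Tsinghua University; Hohhot Minzu College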

Created:

2021-12-15