
Evaluation data set of Word Segmentation technology in Minority Languages (MLWS2021)

Source: Science Data Bank. Updated 2022-01-28; indexed 2026-04-23.
Download link:
https://www.scidb.cn/en/detail?dataSetId=09ebc19c041f4e23ba2aee9b91a16494
Description:
The MLWS2021 word segmentation evaluation data set covers Mongolian, Tibetan, and Uyghur. It is intended for evaluating the core technology of automatic word segmentation in these three languages. Building on MLWS2017, the data set extends coverage from the news field alone to news, economy, law, entertainment, and other fields, and grows from more than 30,000 sentences to 155,000 sentences. It consists of three data files: (1) Tibetan.Zip, Tibetan word segmentation and annotation data, with 25,000 sentences and a file size of 1.52 MB; (2) Mongolian.Zip, Mongolian word segmentation and annotation data, with 65,000 sentences and a file size of 3.16 MB; (3) Uyghur.Zip, Uyghur word segmentation and annotation data, with 65,000 sentences and a file size of 5.12 MB.
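As a quick consistency check on the figures in the description, the per-file sentence counts can be tallied against the stated total of 155,000. This is only a minimal sketch: the dictionary layout and the function names are illustrative assumptions, not part of the data set, and the numbers are copied from the description rather than read from the archives.

```python
# Per-file figures as stated in the description.
# The structure and names here are hypothetical, chosen for illustration only.
FILES = {
    "Tibetan.Zip": {"sentences": 25_000, "size_mb": 1.52},
    "Mongolian.Zip": {"sentences": 65_000, "size_mb": 3.16},
    "Uyghur.Zip": {"sentences": 65_000, "size_mb": 5.12},
}


def total_sentences(files):
    """Sum the stated sentence counts across all files."""
    return sum(info["sentences"] for info in files.values())


def total_size_mb(files):
    """Sum the stated file sizes, rounded to two decimals."""
    return round(sum(info["size_mb"] for info in files.values()), 2)


if __name__ == "__main__":
    print("sentences:", total_sentences(FILES))
    print("size (MB):", total_size_mb(FILES))
```

The three stated counts add up to the 155,000 sentences quoted for the data set as a whole.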
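Provided by:
National Language Resources Monitoring and Research Center for Minority Languages; Qinghai Normal University; Mieradilijiang Maimaiti; Tibet University; State Key Laboratory of Tibetan Intelligent Information Processing and Application; Minzu University of China; Tsinghua University; Hohhot Minzu College
Created: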
2021-12-15