A Method of Tokenization Based on Neural Network Natural Language Processing Model
Source: Science Data Bank | Updated: 2021-12-10 | Indexed: 2026-04-23
Download link:
https://www.scidb.cn/en/detail?dataSetId=ad9dcad705704c5495737b1d57d2e769
Resource description:
This report reviews tokenization methods in neural-network-based natural language processing. First, I explain the out-of-vocabulary (OOV) problem caused by the closed vocabulary of neural NLP models and introduce two common methods for solving it, BPE and WordPiece, along with a derivative method, SentencePiece. Although BPE is simple and efficient, some subwords may be insufficiently learned; I introduce BPE-Dropout to address this. Character-based BPE and WordPiece still suffer from the OOV problem when facing large multilingual character sets (especially CJK languages). I introduce an effective method for this problem: tokenization based on UTF-8 bytes (mainly BBPE), and its derivative, BBPE-based SentencePiece. Finally, I introduce VOLT, a general vocabulary size optimization technique proposed in an ACL 2021 best paper.
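The OOV point in the description can be illustrated with a minimal Python sketch. It is not taken from the report: the function names `char_tokens` and `byte_tokens`, the toy English-only vocabulary, and the sample string are all assumptions made for illustration. It shows only the base-vocabulary idea behind byte-level methods such as BBPE (any text reduces to UTF-8 byte values in the range 0 to 255), not the merge-learning step.

```python
def char_tokens(text, vocab):
    """Look up each character in a closed vocabulary; unseen ones become <unk>."""
    return [ch if ch in vocab else "<unk>" for ch in text]


def byte_tokens(text):
    """Split text into UTF-8 byte values; each value lies in 0..255."""
    return list(text.encode("utf-8"))


# Hypothetical closed character vocabulary built from English-only data.
vocab = set("abcdefghijklmnopqrstuvwxyz ")

text = "nlp 分词"
chars = char_tokens(text, vocab)  # the two CJK characters fall outside the vocabulary
ids = byte_tokens(text)           # every byte is covered by a 256-entry base vocabulary

print(chars)
print(ids)
print(bytes(ids).decode("utf-8"))  # the byte sequence decodes back to the input
```

With a character vocabulary, each CJK character missing from the training data maps to `<unk>`, whereas the byte-level view needs only 256 base symbols and loses no information, at the cost of longer sequences (three bytes per CJK character here).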
Provided by:
Huawei Noah's Ark Lab
Created:
2021-12-08



