
A Method of Tokenization Based on Neural Network Natural Language Processing Model

Source: Science Data Bank. Updated: 2021-12-10. Indexed: 2026-04-23.
Download link:
https://www.scidb.cn/en/detail?dataSetId=ad9dcad705704c5495737b1d57d2e769
Description:
This report reviews tokenization methods for neural-network-based natural language processing. First, I explain the out-of-vocabulary (OOV) problem caused by the closed vocabulary of neural NLP models, and introduce two common methods for solving it, BPE and WordPiece, along with a derivative method, SentencePiece. Although BPE is simple and efficient, some subwords may be learned insufficiently; I introduce BPE-Dropout to address this. Character-based BPE and WordPiece still suffer from the OOV problem when facing the large character sets of multilingual text (especially CJK languages). I introduce an effective method for solving this problem: tokenization based on UTF-8 bytes (mainly BBPE), and its derivative, BBPE-based SentencePiece. Finally, I introduce VOLT, a general vocabulary size optimization technique proposed in an ACL 2021 best paper.
Provider:
Huawei Noah's Ark Lab
Created:
2021-12-08
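
The description's contrast between character-based and byte-based tokenization can be illustrated with a minimal sketch. This is not code from the report: the function names `char_tokens` and `byte_tokens`, the `<unk>` symbol, and the tiny vocabulary are all hypothetical, chosen only to show why a closed character vocabulary produces OOV tokens on CJK text while a UTF-8 byte vocabulary (the base alphabet that BBPE merges are learned over) does not.

```python
def char_tokens(text, vocab):
    """Keep each character if it is in the closed vocabulary, else emit '<unk>'."""
    return [ch if ch in vocab else "<unk>" for ch in text]


def byte_tokens(text):
    """Encode text as UTF-8 byte values; every value lies in the fixed range 0..255."""
    return list(text.encode("utf-8"))


# Hypothetical closed character vocabulary, built from a tiny English-only sample.
train_vocab = set("tokenization of text")

sample = "token 分词"

# Character level: the two CJK characters were never seen, so they become '<unk>'.
print(char_tokens(sample, train_vocab))

# Byte level: each CJK character becomes three bytes, all inside the 256-symbol alphabet.
ids = byte_tokens(sample)
print(ids)

# The byte sequence is lossless: decoding recovers the original string.
print(bytes(ids).decode("utf-8") == sample)
```

The trade-off visible here is sequence length: the byte sequence is longer than the character sequence, which is the cost a byte-level method pays for having no OOV symbols before any merges are applied.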