A Method of Tokenization Based on Neural Network Natural Language Processing Model

Name: A Method of Tokenization Based on Neural Network Natural Language Processing Model
Creator: Huawei Noah's Ark Lab
Published: 2021-12-10 00:00:00
License: No description available

Science Data Bank. Updated 2021-12-10; indexed 2026-04-23.

Download link:

https://www.scidb.cn/en/detail?dataSetId=ad9dcad705704c5495737b1d57d2e769

Resource description:

This report reviews tokenization methods for neural-network-based natural language processing. First, I explain the out-of-vocabulary (OOV) problem caused by the closed vocabulary of neural NLP models, and introduce two common methods for solving it, BPE and WordPiece, along with a derivative method, SentencePiece. Although BPE is simple and efficient, some subwords may be insufficiently learned; I introduce BPE-Dropout to address this. Character-based BPE and WordPiece still suffer from the OOV problem when facing the large character sets of multilingual text (especially CJK languages). I introduce an effective method for solving this problem: tokenization based on UTF-8 bytes (mainly BBPE), and its derivative, BBPE-based SentencePiece. Finally, I introduce VOLT, a general vocabulary-size optimization technique proposed in the ACL 2021 best paper.
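To make the description above concrete, here is a minimal sketch of how BPE learns merges from a character-level start, and of why a byte-level start (as in BBPE) avoids OOV symbols. It is an illustration only, not the report's implementation: the function names `learn_bpe` and `utf8_byte_symbols` and the toy corpus are hypothetical, and ties between equally frequent pairs are broken by first occurrence, which is an assumption of this sketch.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} mapping.

    Hypothetical helper for illustration; real toolkits add end-of-word
    markers, pre-tokenization, and faster pair counting.
    """
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the pair seen first (assumption).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


def utf8_byte_symbols(text):
    """Split text into single UTF-8 bytes, the base symbols of byte-level BPE."""
    return [bytes([b]) for b in text.encode("utf-8")]


# Toy corpus, invented for this example.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(corpus, 2))      # [('e', 's'), ('es', 't')]
print(utf8_byte_symbols("中"))   # three one-byte symbols
```

With a character-level base vocabulary, any character not seen during training is out of vocabulary. With a byte-level base vocabulary there are only 256 possible base symbols, so every string, including CJK text, can be represented before any merges are applied.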

Provider:

Huawei Noah's Ark Lab

Created:

2021-12-08
