A corpus of Chinese abbreviation

Name: A corpus of Chinese abbreviation
Creator: OpenDataLab
Published: 2026-05-24 12:30:38
License: No description available

OpenDataLab | Updated 2026-05-24 | Indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/A_corpus_of_Chinese_abbreviation

Resource description:

We constructed a dataset of phrases and terms. It contains 10,786 full forms, including 8,015 positive full forms and 2,661 negative full forms. The phrases include noun phrases, verb phrases, organization names, location names, and others; their distribution is shown in Table 2. For the experiments, we randomly sampled 7,551 examples for the training set, 1,078 for the development set, and 2,157 for the test set. We also counted the words and characters in the data, including duplicates.
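The random split described above could be reproduced along the following lines. This is only a minimal sketch: the description does not say how the sampling was done, so the function name `split_corpus`, the fixed seed, and the placeholder records are all assumptions made for illustration. Only the three split sizes come from the description.

```python
import random

# Split sizes reported in the dataset description.
TRAIN_SIZE, DEV_SIZE, TEST_SIZE = 7551, 1078, 2157


def split_corpus(samples, seed=0):
    """Shuffle the samples and cut them into train/dev/test lists.

    `seed` is an assumed parameter; the description does not mention one.
    """
    total = TRAIN_SIZE + DEV_SIZE + TEST_SIZE
    if len(samples) != total:
        raise ValueError(f"expected {total} samples, got {len(samples)}")
    shuffled = list(samples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:TRAIN_SIZE]
    dev = shuffled[TRAIN_SIZE:TRAIN_SIZE + DEV_SIZE]
    test = shuffled[TRAIN_SIZE + DEV_SIZE:]
    return train, dev, test


# Hypothetical placeholder records standing in for the real full forms.
placeholder = [f"full_form_{i}" for i in range(10786)]
train, dev, test = split_corpus(placeholder)
print(len(train), len(dev), len(test))
```

Shuffling once and slicing keeps the three sets disjoint, and a fixed seed makes the split repeatable across runs.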

Provider:

OpenDataLab

Created:

2023-03-30

Collection summary

Dataset introduction