Diachronic Corpuses of Overseas Chinese Letters

Name: Diachronic Corpuses of Overseas Chinese Letters
Creator: Science Data Bank
Published: 2026-04-07 10:47:06
License: No description available

DataCite Commons | Updated 2026-04-07 | Indexed 2026-05-05

Download link:

https://www.scidb.cn/detail?dataSetId=22e4ec4042aa49fea6cd71380d7762cb


Resource description:

The annotated database of overseas Chinese letters constructed in this study is drawn from authoritative archival publications, and every selected letter has a clearly established date. The dataset spans the late Qing Dynasty to the period after 1978 and is divided into six fine-grained time periods. Geographically, the letters come mainly from the southern Fujian region of China and from Chinese settlements in Southeast Asian countries. Spatially, the dataset distinguishes the overseas place of dispatch from the domestic place of receipt, although not every letter records both.

The corpus was entered manually by graduate students majoring in linguistics and rigorously proofread. During input, all traditional Chinese characters were converted to simplified Chinese characters for consistency, and obvious non-text markers such as file codes and serial numbers were removed. Each letter is treated as an independent unit, yielding 2,283 text files in total, stored in plain text format (UTF-8 encoding).

The dataset is divided into six subcorpora, one per historical period. Each period contains between 197 and 488 letters, with a total character count per period of approximately 14,000 to 88,000. Text length is unevenly distributed across periods, and individual letters range from 5 to 1,883 words. Because manual input and traditional-to-simplified conversion may introduce slight errors, multiple rounds of proofreading were conducted to minimize them.
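The per-period figures quoted in the description (letter counts, total characters, shortest and longest letter) could be rechecked after download with a short script. The following is only a minimal sketch: the directory layout (one subfolder per period, one `.txt` file per letter) and the function name `summarize_corpus` are assumptions for illustration, not something the dataset page specifies, and length is counted here in non-whitespace characters, which may differ from the counting unit the authors used.

```python
from pathlib import Path


def summarize_corpus(root):
    """Return {period_name: (letter_count, total_chars, min_len, max_len)}.

    Assumed (hypothetical) layout: `root` holds one subdirectory per
    period, and each subdirectory holds one UTF-8 .txt file per letter.
    """
    summary = {}
    period_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for period_dir in period_dirs:
        lengths = []
        for letter_path in sorted(period_dir.glob("*.txt")):
            text = letter_path.read_text(encoding="utf-8")
            # Count non-whitespace characters so line breaks and spaces
            # do not inflate the length of a letter.
            lengths.append(sum(1 for ch in text if not ch.isspace()))
        if lengths:  # skip period folders that contain no letters
            summary[period_dir.name] = (
                len(lengths),
                sum(lengths),
                min(lengths),
                max(lengths),
            )
    return summary
```

Calling `summarize_corpus("path/to/corpus")` on the unpacked files would give one tuple per period, which can then be compared with the ranges stated above.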

Provider:

Science Data Bank

Created:

2026-04-07
