CCAE
Source: arXiv | Updated: 2023-10-09 | Indexed: 2024-06-21
Download link:
https://huggingface.co/datasets/CCAE/CCAE-Corpus
Overview:
CCAE (Corpus of Chinese-based Asian Englishes) is a multi-variety corpus created by the University of Science and Technology Beijing. It covers six Chinese-based Asian English varieties and totals 340 million words drawn from 448,000 web documents. The dataset aims to be the first publicly accessible resource for research on Asian Englishes, Chinese Englishes in particular, and supports building variety-specific language models as well as downstream tasks such as language variety identification and lexical variation identification. The research team used a customized data collection and cleaning pipeline to ensure data quality, and kept document sources traceable in order to comply with the GDPR. Applications include language model training and automatic language variety identification, with the broader goal of better understanding the diversity and complexity of World Englishes.
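As a rough sense of scale, the corpus-level figures quoted in the description imply an average document length. The sketch below only does that arithmetic; the constant and function names are illustrative, the inputs are the rounded totals from the description, and no assumption is made about how words are distributed across the six varieties.

```python
# Rounded totals as quoted in the description (not exact counts).
TOTAL_WORDS = 340_000_000
TOTAL_DOCUMENTS = 448_000


def average_words_per_document(total_words: int, total_documents: int) -> float:
    """Return the mean number of words per document."""
    if total_documents <= 0:
        raise ValueError("total_documents must be positive")
    return total_words / total_documents


avg_words = average_words_per_document(TOTAL_WORDS, TOTAL_DOCUMENTS)
print(f"about {avg_words:.0f} words per document on average")
```

Because both inputs are rounded, the result is an order-of-magnitude estimate rather than a statistic of the corpus itself.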
Provided by:
University of Science and Technology Beijing
Created:
2023-10-09



