CCAE

Source: arXiv · Updated: 2023-10-09 · Indexed: 2024-06-21
Download link:
https://huggingface.co/datasets/CCAE/CCAE-Corpus
Resource overview:
CCAE (Corpus of Chinese-based Asian Englishes) is a multi-variety corpus created by the University of Science and Technology Beijing. It covers six Chinese-based Asian English varieties and totals 340 million words drawn from 448,000 web documents. It is intended as the first publicly accessible resource for research on Asian Englishes, Chinese Englishes in particular, and supports building variety-specific language models as well as downstream tasks such as language variety identification and lexical variation identification. The team built the corpus with a customized collection and cleaning pipeline to ensure data quality, and keeps each document traceable to its source in order to comply with GDPR. Applications include language model training and automatic language variety identification, with the broader goal of better understanding the diversity and complexity of World Englishes.
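For readers who want to try the corpus from the download link, the following is a minimal sketch rather than an official loading recipe. It derives the repository id from the listed URL and hands it to `load_dataset` from the Hugging Face `datasets` library. The helper names `repo_id_from_url` and `open_corpus` are hypothetical, and it is an assumption that the repository loads with its default configuration; the listing does not document configurations, splits, or field names.

```python
from urllib.parse import urlparse

# Download link taken from the listing above.
DATASET_URL = "https://huggingface.co/datasets/CCAE/CCAE-Corpus"


def repo_id_from_url(url: str) -> str:
    """Extract the '<owner>/<name>' repository id from a Hugging Face dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return "/".join(parts[1:3])


def open_corpus(url: str = DATASET_URL):
    """Open the corpus in streaming mode.

    Requires network access and `pip install datasets`.
    Assumption: the repository is loadable with its default configuration.
    """
    # Imported lazily so repo_id_from_url works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id_from_url(url), streaming=True)


if __name__ == "__main__":
    print(repo_id_from_url(DATASET_URL))
```

Streaming mode is used here because the listing reports roughly 340 million words, so reading records lazily avoids fetching the whole corpus before a first inspection.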
Provider:
University of Science and Technology Beijing
Created:
2023-10-09