Afrikaans-English code-switched sentences with language identification and part-of-speech tags

Figshare · Updated 2026-01-20 · Indexed 2026-04-28

Download link:

https://figshare.com/articles/dataset/Afrikaans-English_code-switched_sentences_with_language_identification_and_part-of-speech_tags/31077619

Description:

As highlighted in recent surveys, one of the biggest barriers to progress in code-switching (CS) research is the limited availability, in both quantity and quality, of annotated code-switched text (Doğruöz et al., 2023; Mondal et al., 2022). Winata et al. (2022) further show that this issue is particularly acute in the South African context: their survey reports that only a small body of CS research exists for South African languages and that data resources remain limited or absent. Notably, for Afrikaans–English specifically, there are no publicly available code-switched datasets.

Traditionally, researchers have relied on naturally occurring CS data from social media, speech recordings, and manual or automated transcriptions. While these sources offer authenticity and sociolinguistic richness, they introduce challenges such as ethical and privacy concerns, noise and inconsistency, costly annotation processes, and domain imbalance. Parallel corpora and substitutive generation methods also offer avenues for artificial CS creation, but for Afrikaans–English these corpora are either highly domain-specific or overly general and do not necessarily reflect naturally occurring switching patterns.

Motivated by these limitations, the thesis explores an alternative and increasingly relevant solution: generating synthetic CS data using multilingual large language models (LLMs). A controlled prompting framework is developed using GPT-4o and Gemini 2.0 Flash to produce aligned Afrikaans, English, correct CS and incorrect CS sentences (together constituting a sentence set) across diverse topics. Beyond the data scarcity gap, evaluating the quality of synthetically generated sentences remains a challenge, so the thesis further aims to develop quality evaluation strategies for synthetic data to be used in language learning applications. For this purpose, word-level language identification (LID) and part-of-speech (POS) tagging are required.

Tagging was done using a combination of existing taggers, LLMs and human annotation. The final dataset used for training and testing consists of 5750 unvalidated sentence sets and 624 validated sentence sets, both with LID and POS tags. The validated portion is made up of 464 grammar-based sentence sets (generated specifically to add diversity to the types of incorrect CS sentences), 60 human-validated sentence sets and 100 gold-standard sentence sets used for model testing.
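
To make the word-level annotation and the split sizes concrete, here is a minimal Python sketch. It is an illustration only: the `Token` class, the `afr`/`eng` language labels, the UPOS-style POS labels, the `switch_points` helper and the example sentence are all assumptions invented here, and the dataset's actual file format and tag inventory are not described above. Only the counts are taken from the description.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    lang: str  # word-level LID label; "afr"/"eng" are assumed label names
    pos: str   # POS tag; UPOS-style labels are an assumption


# Hypothetical code-switched sentence, invented for illustration
# (not taken from the dataset).
sentence = [
    Token("Ek", "afr", "PRON"),
    Token("het", "afr", "AUX"),
    Token("die", "afr", "DET"),
    Token("meeting", "eng", "NOUN"),
    Token("gemis", "afr", "VERB"),
]


def switch_points(tokens):
    """Return indices of tokens whose language differs from the previous token."""
    return [
        i for i in range(1, len(tokens))
        if tokens[i].lang != tokens[i - 1].lang
    ]


# Split sizes as stated in the description.
validated = {"grammar_based": 464, "human_validated": 60, "gold_standard": 100}
unvalidated_sets = 5750

validated_total = sum(validated.values())        # 464 + 60 + 100
all_sets = unvalidated_sets + validated_total    # every tagged sentence set

print(switch_points(sentence))
print(validated_total, all_sets)
```

The three validated subsets sum to the 624 validated sentence sets reported above, giving 6374 tagged sentence sets in total when the 5750 unvalidated sets are included.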

Created:

2026-01-20
