Code Mixed Dataset (Indonesian-English)

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://data.mendeley.com/datasets/4w6ry5rfxs

Description:

This code-mixed dataset comprises Indonesian-English bilingual text and is designed to support sentiment analysis and POS tagging tasks. It was compiled from diverse sources, including 22 applications available on the Google Play Store, yielding a total of 42,145 data points. The dataset is partitioned into three subsets that serve different phases of model training and evaluation: a pretrained language model corpus (29,529 data points), a sentiment analysis corpus (11,791 data points), and a POS tagging corpus (1,825 data points). To increase the linguistic diversity of the pretrained language model corpus, additional data were incorporated from Indonesian Wikipedia and English Wikipedia.

Sentiment Analysis Corpus

The sentiment analysis corpus, consisting of 19,767 annotated data points, is tailored for classifying the sentiment expressed in code-mixed text. Each data point is annotated with one of two categories, positive or negative, which makes this a binary sentiment classification task. The data exhibit a natural mix of Indonesian and English, often featuring colloquial expressions, abbreviations, and syntactic blending, such as "Sudah update, sudah clear cache, masih aja lemot. Sorry to say dear Shopee, very bad user experience," which reflects how code-mixed text is written in practice. Preprocessing steps, including tokenization with the BERT vocabulary, lowercase conversion, and removal of special characters, were applied to standardize the dataset while preserving its linguistic nuances.
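The preprocessing steps named above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the description does not say which characters count as "special", so the sketch assumes everything other than ASCII letters, digits, and whitespace is removed. The function names and `TOY_VOCAB` are hypothetical, and the tiny vocabulary stands in for a real BERT vocabulary file. The subword step uses the greedy longest-match-first rule of WordPiece.

```python
import re

# Hypothetical toy vocabulary standing in for a real BERT vocabulary file.
TOY_VOCAB = {"sudah", "update", "clear", "cache", "masih", "aja", "lem", "##ot"}


def normalize(text: str) -> str:
    """Lowercase, then drop characters assumed to be 'special'."""
    text = text.lower()
    # Assumption: anything that is not a-z, 0-9, or whitespace is special.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def wordpiece(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedy longest-match-first subword split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no vocabulary entry covers this position
        pieces.append(piece)
        start = end
    return pieces


def preprocess(text: str, vocab: set) -> list:
    """Normalize the text, then split every word into subword pieces."""
    tokens = []
    for word in normalize(text).split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


if __name__ == "__main__":
    print(preprocess("Sudah update, sudah clear cache, masih aja lemot.", TOY_VOCAB))
```

In practice the subword step would normally be delegated to an existing BERT tokenizer loaded with the chosen vocabulary; the hand-written version here only makes the mechanics visible.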
POS Tagging Corpus

The POS tagging corpus comprises 11,184 data points, tokenized into 185,424 individual tokens, and is annotated with a set of 14 POS categories: ADJ (adjective), ADP (preposition), ADV (adverb), AUX (auxiliary), CCONJ (coordinating conjunction), DET (determiner), NOUN (noun), NUM (number), PART (particle), PRON (general pronoun), PROPN (proper noun), PUNCT (punctuation), SCONJ (subordinating conjunction), and VERB (verb). The corpus includes examples of language switching, such as "Aku suka this app because of its features," which are difficult for traditional POS tagging models because Indonesian and English syntax interact within a single sentence. Similar preprocessing techniques were applied to keep the dataset consistent while preserving dependencies that span both languages.
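As a small sketch of how the 14-category tag set could be used when loading the corpus, the snippet below checks that every tag in a tagged sentence belongs to the set and counts tag frequencies. The on-disk format of the corpus is not described here, so the list of (token, tag) pairs is an assumption, and the tags on the example sentence are hand-written for illustration, not taken from the dataset's annotations.

```python
from collections import Counter

# The 14 POS categories listed in the dataset description.
POS_TAGS = {
    "ADJ": "adjective",
    "ADP": "preposition",
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "NOUN": "noun",
    "NUM": "number",
    "PART": "particle",
    "PRON": "general pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "VERB": "verb",
}


def invalid_tags(tagged):
    """Return the (token, tag) pairs whose tag is outside the 14-tag set."""
    return [(token, tag) for token, tag in tagged if tag not in POS_TAGS]


def tag_counts(tagged):
    """Count how often each tag occurs in a tagged sentence."""
    return Counter(tag for _, tag in tagged)


# Illustrative tagging only; these labels are an assumption, not corpus data.
sentence = [
    ("Aku", "PRON"), ("suka", "VERB"), ("this", "DET"), ("app", "NOUN"),
    ("because", "SCONJ"), ("of", "ADP"), ("its", "PRON"), ("features", "NOUN"),
]

if __name__ == "__main__":
    print(invalid_tags(sentence))
    print(tag_counts(sentence).most_common())
```

A check like this is cheap to run over a whole file and catches stray labels before a tagger is trained on them.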

Created:

2025-08-20
