michsethowusu/afri-bigrams

Source: Hugging Face. Updated 2025-11-26. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/michsethowusu/afri-bigrams
Resource description:
# Afri-Bigrams

This dataset contains a large-scale collection of word bigrams extracted from text across 154 African languages. It is designed to support general research on and analysis of African languages, providing foundational data based on two-word sequences.

## **Background**

The dataset provides word bigram frequencies for African languages. Structured data of this kind underpins many computational linguistics tasks and lets researchers explore lexical co-occurrence patterns across the covered languages.

## **Approach Overview**

The dataset includes only word bigrams (sequences of two words) and gives the raw frequency count of each bigram per language. Word bigrams are useful for analyzing lexical co-occurrence and for capturing basic syntactic patterns in the corpus used for extraction.

## **Dataset Structure**

Each row represents one word bigram and its frequency for a specific language.

| Column      | Description                                                                        |
| ----------- | ---------------------------------------------------------------------------------- |
| **bigram**  | Extracted sequence (word pair).                                                    |
| **count**   | Frequency of the bigram within that language.                                      |
| **lang**    | Language code (the list under Languages uses three-letter codes such as `afr`).    |
| **lang_id** | Numeric identifier for the language.                                               |

## **Use Cases**

- Statistical analysis of word co-occurrence across African languages.
- Development of language models and embedding techniques.
- Linguistic pattern analysis focusing on common lexical sequences.
- Supporting general multilingual NLP research.

## **Languages**

The dataset covers the following languages. Note that the amount of data may vary across languages.

Acholi (ach), Dangme (ada), Afrikaans (afr), Akan (aka), Alur (alz), Amharic (amh), Algerian Arabic (arq), Moroccan Arabic (ary), Egyptian Arabic (arz), Bamanankan (bam), Basaa (bas), Baoulé (bci), Bemba (bem), Edo (bin), Bulu (bum), Bilen (byn), Chopi (cce), Chuwabu (chw), Chokwe (cjk), Coptic (cop), Seychelles French Creole (crs), Southwestern Dinka (dik), Dinka (din), Zarma (dje), Lukpa (dop), Duala (dua), Jula (dyu), Efik (efi), Éwé (ewe), Fon (fon), Pulaar (fuc), Fulah (ful), Nigerian Fulfulde (fuv), Ga (gaa), Gun (guw), Hausa (hau), Herero (her), Igbo (ibo), Esan (ish), Isoko (iso), Kabyle (kab), Kamba (kam), Kanuri (kau), Kabiyè (kbp), Kabuverdianu (kea), Gikuyu (kik), Kinyarwanda (kin), Kimbundu (kmb), Kongo (kon), Konzo (koo), Kaonde (kqn), Krio (kri), Kisi, Southern (kss), Oshiwambo (kua), Kwangali (kwn), Kikongo (kwy), Lamba (lam), Lingala (lin), Lozi (loz), Luba-Kasai (lua), Luba-Katanga (lub), Luvale (lue), Ganda (lug), Lunda (lun), Dholuo (luo), Mende (men), Morisyen (mfe), Mambwe-Lungu (mgr), Malagasy (mlg), Moore (mos), Mozambican Sign Language (mzy), Min Nan Chinese (nan), Nyemba (nba), Ndau (ndc), Ndonga (ndo), Lomwe (ngl), Northern Sotho (nso), Chichewa (nya), Nyaneka (nyk), Nyankore (nyn), Nyungwe (nyu), Nzema (nzi), Okpe (oke), Oromo (orm), Nigerian Pidgin (pcm), Merina Malagasy (plt), Tarifit (rif), Ruund (rnd), Rundi (run), Sango (sag), Sena (seh), South African Sign Language (sfs), Tachelhit (shi), Sidaama (sid), Shona (sna), Somali (som), Songe (sop), Swati (ssw), Swahili (swa), Congo Swahili (swc), Swahili (swh), Tigrigna (tir), Tiv (tiv), Tetela (tll), Tamashek (tmh), Tonga (tog), Tonga (toh), Tonga (toi), Tswa (tsc), Setswana (tsn), Tsonga (tso), Tooro (ttj), Tumbuka (tum), Umbundu (umb), Urhobo (urh), Venda (ven), Makhuwa (vmw), Wolaytta (wal), Cameroon Pidgin (wes), Wolof (wol), Xhosa (xho), Yao (yao), Yoruba (yor), Zimbabwe Sign Language (zib), Zande (zne), Zambian Sign Language (zsl), Zulu (zul)
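
As an illustration of the language-modeling use case, the sketch below turns raw bigram counts into maximum-likelihood estimates of P(second word | first word) for each language. It assumes rows shaped like the documented columns (`bigram`, `count`, `lang`, `lang_id`) and assumes the two words in `bigram` are separated by whitespace. The sample rows, their counts, their `lang_id` values, and the function name `conditional_probs` are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical rows following the documented columns.
# All values here are invented for illustration only.
rows = [
    {"bigram": "habari yako", "count": 6, "lang": "swh", "lang_id": 0},
    {"bigram": "habari gani", "count": 3, "lang": "swh", "lang_id": 0},
    {"bigram": "habari za", "count": 1, "lang": "swh", "lang_id": 0},
    {"bigram": "goeie more", "count": 4, "lang": "afr", "lang_id": 1},
]


def conditional_probs(rows):
    """Estimate P(second | first) per language by maximum likelihood."""
    pair_counts = defaultdict(int)    # (lang, first, second) -> count
    first_totals = defaultdict(int)   # (lang, first) -> total count
    for row in rows:
        words = row["bigram"].split()
        if len(words) != 2:
            # Skip rows that do not split into exactly two words
            # (the whitespace separator is an assumption).
            continue
        first, second = words
        pair_counts[(row["lang"], first, second)] += row["count"]
        first_totals[(row["lang"], first)] += row["count"]
    return {
        key: count / first_totals[(key[0], key[1])]
        for key, count in pair_counts.items()
    }


probs = conditional_probs(rows)
for (lang, first, second), p in sorted(probs.items()):
    print(f"{lang}: P({second} | {first}) = {p:.2f}")
```

Because the estimate is normalized per language and per first word, the probabilities for a given first word sum to 1 within each language. Unsmoothed estimates like these assign zero probability to unseen pairs, so a real model would likely add smoothing.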
Provided by:
michsethowusu