haajidheere/Somali-Dictionary
Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/haajidheere/Somali-Dictionary
Resource description:
---
pretty_name: Somali Dictionary
language:
- so
- en
- it
license: cc-by-4.0
task_categories:
- text-retrieval
tags:
- dictionary
- lexicon
- somali
- multilingual
- nlp
- terminology
size_categories:
- 100K<n<1M
---
# Somali Dictionary
## Dataset Summary
This dataset is a multilingual Somali lexical resource containing Somali terms with corresponding Italian and English glosses. It is designed to support Natural Language Processing (NLP), translation systems, and Somali language technology development.
The dataset currently consists of approximately **239,000 entries**, each stored as a single text string that combines an abbreviation, a Somali term, and its Italian and English translations.
This project aims to evolve into a structured, high-quality Somali linguistic dataset for AI applications.
---
## Dataset Structure
### Current Format
The dataset is currently stored as a single column:
- `text` (string): Combined lexical entry
Example entries:

    a Amar imperativo imperative
    ac aan cayinnayn indefinito indefinite
    bot. Botani botanica botany
    dhaq Dhaqaale economia economics

---
### Planned Structured Format
Future versions of this dataset will include structured columns:
| abbreviation | somali | italian | english |
|-------------|------------------|-------------|-------------|
| a | Amar | imperativo | imperative |
| ac | aan cayinnayn | indefinito | indefinite |
| bot. | Botani | botanica | botany |
| dhaq | Dhaqaale | economia | economics |
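
A minimal sketch of how a raw `text` entry might be split into these four columns. It assumes that the abbreviation is the first whitespace-separated token and that the Italian and English glosses are each a single final word. That holds for the four example entries but is not guaranteed for the rest of the dataset, and the function name `split_entry` is hypothetical.

```python
from typing import Optional


def split_entry(text: str) -> Optional[dict]:
    """Heuristically split one raw entry into the four planned columns.

    Assumption (not confirmed by the dataset authors): the first token is the
    abbreviation, the last token is the English gloss, the second-to-last is
    the Italian gloss, and everything in between is the Somali term.
    """
    tokens = text.split()
    if len(tokens) < 4:
        return None  # too short to contain all four fields
    return {
        "abbreviation": tokens[0],
        "somali": " ".join(tokens[1:-2]),
        "italian": tokens[-2],
        "english": tokens[-1],
    }


rows = [
    "a Amar imperativo imperative",
    "ac aan cayinnayn indefinito indefinite",
]
parsed = [split_entry(r) for r in rows]
```

Entries with multi-word Italian or English glosses would be split incorrectly by this heuristic, so its output should be treated as a first pass that still needs review.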
---
## Intended Use
This dataset is suitable for:
- Somali dictionary and lookup systems
- Terminology extraction and linguistic analysis
- NLP preprocessing and tokenization
- Training translation models (after structuring)
- Building Somali AI tools (chatbots, search, spell-checkers)
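
As an illustration of the lookup use case, the following sketch searches raw `text` entries for a whole-word match. It works on a plain Python list of strings; how the rows are loaded is left out, and the helper name `lookup` is hypothetical.

```python
def lookup(entries, query):
    """Return entries containing `query` as a whole token, ignoring case."""
    needle = query.casefold()
    return [e for e in entries if needle in e.casefold().split()]


entries = [
    "a Amar imperativo imperative",
    "ac aan cayinnayn indefinito indefinite",
    "bot. Botani botanica botany",
    "dhaq Dhaqaale economia economics",
]
matches = lookup(entries, "botany")
```

Because the entries are unstructured, a match does not reveal which language the query was found in; that distinction only becomes possible once the columns are separated.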
---
## Limitations
- The dataset is currently **unstructured** (single text field)
- Parsing into structured columns may introduce errors due to:
  - variable-length phrases
  - inconsistent formatting
- Some entries may contain:
  - abbreviations that are not standardized
  - duplicate or ambiguous terms
- Translation quality has not yet been fully validated
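
One rough way to gauge how many entries are affected by variable-length phrases is to count tokens. The sketch below flags any entry that does not have exactly four whitespace-separated tokens, on the assumption that an unambiguous entry has one token per field. The name `needs_review` is hypothetical, and a four-token entry is not guaranteed to be correct either.

```python
def needs_review(text: str) -> bool:
    """Flag entries whose token count differs from the four expected fields."""
    return len(text.split()) != 4


entries = [
    "a Amar imperativo imperative",
    "ac aan cayinnayn indefinito indefinite",
    "bot. Botani botanica botany",
    "dhaq Dhaqaale economia economics",
]
flagged = [e for e in entries if needs_review(e)]
```

Here only the second entry is flagged, because its Somali term spans two words.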
---
## Licensing
This dataset is released under the **CC-BY-4.0 license**, allowing reuse with attribution.
If any part of the dataset originates from third-party sources, licensing may need to be reviewed and clarified in future versions.
---
## Citation
If you use this dataset, please cite:
Somali Dictionary Dataset (2026)
---
## Contributions
Contributions to improve this dataset are welcome. This project aims to build a high-quality Somali lexical resource for NLP and AI applications.
### Areas for contribution:
- **Data structuring**: Convert the raw `text` field into structured columns (`abbreviation`, `somali`, `italian`, `english`)
- **Data cleaning**: Fix inconsistencies, duplicates, and formatting issues
- **Normalization**: Standardize abbreviations and terminology
- **Validation**: Review and correct translations for accuracy
- **Expansion**: Add new entries or additional language pairs (e.g., Arabic, Amharic)
- **Annotation**: Add linguistic metadata (part-of-speech, domain tags, etc.)
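
For the data cleaning task, a possible starting point is to remove entries that differ only in whitespace or letter case. This is a sketch under that assumption; whether case differences are meaningful (for example in abbreviations) has not been established, and the function name `deduplicate` is hypothetical.

```python
def deduplicate(entries):
    """Drop repeated entries, keeping the first occurrence of each.

    Entries are compared after collapsing whitespace and folding case.
    """
    seen = set()
    unique = []
    for entry in entries:
        collapsed = " ".join(entry.split())
        key = collapsed.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(collapsed)
    return unique


raw = [
    "a Amar imperativo imperative",
    "a  Amar imperativo  imperative ",
    "A amar Imperativo imperative",
    "bot. Botani botanica botany",
]
cleaned = deduplicate(raw)
```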
### How to contribute:
- Submit pull requests via GitHub (if mirrored)
- Open issues for errors or suggestions
- Share improvements or derived datasets on Hugging Face
---
## Future Work
- Release a fully structured version of the dataset
- Add sentence-level Somali-English translation pairs
- Expand into speech datasets (audio + transcription)
- Provide API access for developers via the BitBirr platform
---
*This dataset is an evolving resource and will be progressively improved through community contributions.*
Provider:
haajidheere



