mrrtmob/english-khmer-dictionary
Source: Hugging Face. Updated 2026-02-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mrrtmob/english-khmer-dictionary
Resource description:
---
language:
- en
- km
license: cc-by-4.0
task_categories:
- translation
tags:
- dictionary
- english
- khmer
- bilingual
- nlp
- cambodia
- definitions
- part-of-speech
pretty_name: English-Khmer Dictionary
size_categories:
- 100K<n<1M
---
# 📖 English–Khmer Dictionary Dataset
A bilingual English–Khmer (ភាសាខ្មែរ) dictionary dataset in CSV format containing **170,000+ entries**. Each entry includes the English headword, its Khmer translation, and full definitions in both languages; a part-of-speech tag and bilingual example sentences are included where available. This makes it one of the richer English–Khmer lexical resources available for NLP and language learning.
## Dataset Description
This dataset provides structured dictionary entries pairing English words with Khmer translations. With over **170,000 entries** covering a wide range of vocabulary, it goes beyond simple word pairs by including part-of-speech tags, detailed definitions in both English and Khmer, and bilingual example sentences. It is well-suited for machine translation, language learning tools, and linguistic research on Khmer — a low-resource language spoken primarily in Cambodia.
## Dataset Structure
The dataset is a single CSV file with the following columns:
| Column | Type | Description |
|--------|------|-------------|
| `word` | string | The English headword |
| `word_km` | string | The Khmer translation of the headword (in Khmer script) |
| `pos` | string | Part of speech (e.g., `noun`, `verb`, `prep.`, `adj.`) — may be empty |
| `definition_en` | string | Full definition in English |
| `definition_km` | string | Full definition in Khmer script |
| `example_en` | string | Example sentence in English (may be empty) |
| `example_km` | string | Example sentence in Khmer (may be empty) |
> Note: Some entries have multiple rows for the same headword, each representing a different sense or meaning.
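
Because one headword can span several rows, it is often convenient to group rows by `word` before working with them. The following is a minimal sketch using only the standard library. The function name `group_senses` is hypothetical, and the inline rows are copied from this card's sample data with the example columns left empty.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows copied from this card's sample data (example columns left empty).
SAMPLE_CSV = """word,word_km,pos,definition_en,definition_km,example_en,example_km
A,ក,,"An adjective, commonly called the indefinite article...",គុណនាម ដែលជាទូទៅគេហៅថា អត្ថបទមិនកំណត់...,,
A,ក,prep.,In; on; at; by.,នៅក្នុង; នៅលើ; នៅ; ដោយ.,,
A,ក,,Of.,នៃ។,,
"""

def group_senses(csv_file):
    """Map each headword to its senses (one per row), in file order."""
    senses = defaultdict(list)
    for row in csv.DictReader(csv_file):
        senses[row["word"]].append(
            {"pos": row["pos"], "definition_en": row["definition_en"]}
        )
    return dict(senses)

senses = group_senses(io.StringIO(SAMPLE_CSV))
for sense in senses["A"]:
    # An empty `pos` is shown as "(no pos)".
    print(sense["pos"] or "(no pos)", "|", sense["definition_en"])
```

To run this on the real file, pass an open file handle instead of the in-memory sample, for example `open("dictionary.csv", encoding="utf-8", newline="")`, assuming the CSV is saved locally under that name.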
### Sample Data
| word | word_km | pos | definition_en | definition_km |
|------|---------|-----|---------------|---------------|
| A | ក | | An adjective, commonly called the indefinite article... | គុណនាម ដែលជាទូទៅគេហៅថា អត្ថបទមិនកំណត់... |
| A | ក | prep. | In; on; at; by. | នៅក្នុង; នៅលើ; នៅ; ដោយ. |
| A | ក | | Of. | នៃ។ |
## Usage
```python
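from datasets import load_dataset
# Fetch the dataset from the Hugging Face Hub (cached locally after the first call).
dataset = load_dataset("mrrtmob/english-khmer-dictionary")
# Print the first entry as a dict keyed by column name.
print(dataset["train"][0])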
```
Or load directly with pandas:
```python
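import pandas as pd
# keep_default_na=False stops pandas from reading headwords such as "null" or "NA"
# as missing values, and leaves empty cells (e.g. a missing `pos`) as "".
df = pd.read_csv("dictionary.csv", keep_default_na=False)
# Look up a word (case-insensitive exact match on the headword)
results = df[df["word"].str.lower() == "hello"]
print(results[["word", "word_km", "pos", "definition_en"]])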
```
## Languages
- **Source language:** English (`en`)
- **Target language:** Khmer (`km`) — spoken by ~16 million people, primarily in Cambodia
## Potential Use Cases
- Training or fine-tuning English↔Khmer machine translation models
- Building Khmer dictionary or language learning applications
- Seeding part-of-speech lexicons for Khmer NLP pipelines (the `pos` tags come from the English source dictionary, so projecting them onto Khmer needs validation)
- Augmenting low-resource Khmer NLP datasets
- Linguistic and lexicographic research on the Khmer language
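
For the machine translation use case, one plausible first step is to pull sentence pairs out of the `example_en` and `example_km` columns, skipping rows where either side is empty. The sketch below assumes the column names documented above. The function name `parallel_examples` is hypothetical, and the rows are placeholders, not dataset content.

```python
import csv
import io

# Placeholder rows for illustration only; none of these strings come from the dataset.
SAMPLE_CSV = """word,word_km,pos,definition_en,definition_km,example_en,example_km
alpha,<km word 1>,noun,<en definition 1>,<km definition 1>,<en example 1>,<km example 1>
beta,<km word 2>,verb,<en definition 2>,<km definition 2>,,
gamma,<km word 3>,,<en definition 3>,<km definition 3>,<en example 3>,
"""

def parallel_examples(csv_file):
    """Yield (English, Khmer) example pairs, skipping rows where either side is blank."""
    for row in csv.DictReader(csv_file):
        en = (row.get("example_en") or "").strip()
        km = (row.get("example_km") or "").strip()
        if en and km:
            yield en, km

pairs = list(parallel_examples(io.StringIO(SAMPLE_CSV)))
print(len(pairs), "usable pair(s)")
```

The same filter applies to the definition columns if definition-level pairs are wanted. Since the Khmer side is machine-generated, such pairs are probably better treated as silver-standard data than as a gold reference.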
## Data Collection
The English headwords and definitions were sourced from **OPTED (The Online Plain Text English Dictionary) v0.03**, a public domain English dictionary maintained by the Australian National University (ANU) at `https://www.mso.anu.edu.au/~ralph/OPTED/v003/`. OPTED itself is based on *Webster's Unabridged Dictionary* (1913 edition), which is in the public domain.
The Khmer translations (`word_km`, `definition_km`, `example_km`) were generated using the **Kiri translation model** by [Blizzer.tech](https://blizzer.tech), an AI-powered translation service specializing in Southeast Asian languages including Khmer.
> ⚠️ Because the Khmer translations are machine-generated, some entries may contain translation errors or unnatural phrasing. Human review and correction are encouraged before production use.
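
One cheap sanity check before human review is to flag rows whose Khmer field contains no Khmer-script characters at all, which may indicate text the model left untranslated. This is a heuristic sketch, not part of the dataset's tooling. The function name `rows_without_khmer` is hypothetical, and the `hello,hello` row is an invented failure case. The check only tests for the presence of characters from the main Khmer Unicode block (U+1780 to U+17FF) and says nothing about translation quality.

```python
import csv
import io
import re

# Any character from the main Khmer Unicode block (U+1780 to U+17FF).
KHMER_CHAR = re.compile(r"[\u1780-\u17FF]")

# Two rows echo this card's sample data; the "hello" row is an invented failure case.
SAMPLE_CSV = """word,word_km,pos
A,ក,
hello,hello,
A,នៃ។,
"""

def rows_without_khmer(csv_file, column="word_km"):
    """Return (record_number, row) for rows whose `column` is non-empty
    but contains no Khmer-script character. Records are numbered from 1."""
    flagged = []
    for record_no, row in enumerate(csv.DictReader(csv_file), start=1):
        value = (row.get(column) or "").strip()
        if value and not KHMER_CHAR.search(value):
            flagged.append((record_no, row))
    return flagged

flagged = rows_without_khmer(io.StringIO(SAMPLE_CSV))
for record_no, row in flagged:
    print(f"record {record_no}: {row['word']!r} -> {row['word_km']!r}")
```

Passing `column="definition_km"` or `column="example_km"` runs the same check on the other Khmer fields.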
## License
This dataset is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.
You are free to share and adapt the data as long as appropriate credit is given.
## Contributions
Contributions, corrections, and additions are welcome! Feel free to open an issue or pull request on the dataset repository.
## Contact
For questions or feedback, please reach out via the Hugging Face community tab.
Provider:
mrrtmob



