mrrtmob/english-khmer-dictionary

Source: Hugging Face. Updated 2026-02-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mrrtmob/english-khmer-dictionary
Description:
---
language:
- en
- km
license: cc-by-4.0
task_categories:
- translation
tags:
- dictionary
- english
- khmer
- bilingual
- nlp
- cambodia
- definitions
- part-of-speech
pretty_name: English-Khmer Dictionary
size_categories:
- 100K<n<1M
---

# 📖 English–Khmer Dictionary Dataset

A comprehensive bilingual English–Khmer (ភាសាខ្មែរ) dictionary dataset in CSV format containing **170,000+ entries**. Each entry includes the original English word, its Khmer translation, part of speech, full definitions in both languages, and example sentences, making it one of the richer English–Khmer lexical resources available for NLP and language learning.

## Dataset Description

This dataset provides structured dictionary entries pairing English words with Khmer translations. With over **170,000 entries** covering a wide range of vocabulary, it goes beyond simple word pairs by including part-of-speech tags, detailed definitions in both English and Khmer, and bilingual example sentences. It is well-suited for machine translation, language learning tools, and linguistic research on Khmer, a low-resource language spoken primarily in Cambodia.

## Dataset Structure

The dataset is a single CSV file with the following columns:

| Column | Type | Description |
|--------|------|-------------|
| `word` | string | The English headword |
| `word_km` | string | The Khmer translation of the headword (in Khmer script) |
| `pos` | string | Part of speech, e.g. `noun`, `verb`, `prep.`, `adj.` (may be empty) |
| `definition_en` | string | Full definition in English |
| `definition_km` | string | Full definition in Khmer script |
| `example_en` | string | Example sentence in English (may be empty) |
| `example_km` | string | Example sentence in Khmer (may be empty) |

> Note: Some entries have multiple rows for the same headword, each representing a different sense or meaning.

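
Because one headword can span several rows, a lookup usually needs to collect all of its senses rather than take the first match. The following is a minimal sketch of one way to do that, assuming the documented column names; the helper `senses_by_headword` is hypothetical and not part of the dataset, and the toy rows reuse the card's sample entries for `A` plus one made-up placeholder row.

```python
import pandas as pd

# Toy rows using the documented columns. The three "A" rows come from the
# card's sample data; the "Placeholder" row is invented for illustration.
rows = [
    {"word": "A", "pos": "", "definition_en": "An adjective, commonly called the indefinite article..."},
    {"word": "A", "pos": "prep.", "definition_en": "In; on; at; by."},
    {"word": "A", "pos": "", "definition_en": "Of."},
    {"word": "Placeholder", "pos": "noun", "definition_en": "Hypothetical entry."},
]
df = pd.DataFrame(rows)


def senses_by_headword(frame: pd.DataFrame) -> dict[str, list[dict]]:
    """Group every sense row under its case-folded headword (hypothetical helper)."""
    grouped: dict[str, list[dict]] = {}
    for record in frame.to_dict(orient="records"):
        key = str(record["word"]).casefold()
        grouped.setdefault(key, []).append(record)
    return grouped


senses = senses_by_headword(df)
for sense in senses["a"]:
    # An empty `pos` is printed as "?" since the column may be blank.
    print(sense["pos"] or "?", "|", sense["definition_en"])
```

Case-folding the key is a design choice for case-insensitive lookup; keep the original casing instead if capitalized and lowercase headwords should be treated as distinct entries.
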
### Sample Data

| word | word_km | pos | definition_en | definition_km |
|------|---------|-----|---------------|---------------|
| A | ក | | An adjective, commonly called the indefinite article... | គុណនាម ដែលជាទូទៅគេហៅថា អត្ថបទមិនកំណត់... |
| A | ក | prep. | In; on; at; by. | នៅក្នុង; នៅលើ; នៅ; ដោយ. |
| A | ក | | Of. | នៃ។ |

## Usage

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub and print the first entry.
dataset = load_dataset("mrrtmob/english-khmer-dictionary")
print(dataset["train"][0])
```

Or load directly with pandas:

```python
import pandas as pd

# Read a local copy of the CSV file.
df = pd.read_csv("dictionary.csv")

# Look up a word (case-insensitive match on the English headword)
results = df[df["word"].str.lower() == "hello"]
print(results[["word", "word_km", "pos", "definition_en"]])
```

## Languages

- **Source language:** English (`en`)
- **Target language:** Khmer (`km`), spoken by ~16 million people, primarily in Cambodia

## Potential Use Cases

- Training or fine-tuning English↔Khmer machine translation models
- Building Khmer dictionary or language learning applications
- Part-of-speech tagging and annotation for Khmer NLP pipelines
- Augmenting low-resource Khmer NLP datasets
- Linguistic and lexicographic research on the Khmer language

## Data Collection

The English headwords and definitions were sourced from **OPTED (The Online Plain Text English Dictionary) v0.03**, a public domain English dictionary maintained by the Australian National University (ANU) at `https://www.mso.anu.edu.au/~ralph/OPTED/v003/`. OPTED itself is based on *Webster's Unabridged Dictionary* (1913 edition), which is in the public domain.

The Khmer translations (`word_km`, `definition_km`, `example_km`) were generated using the **Kiri translation model** by [Blizzer.tech](https://blizzer.tech), an AI-powered translation service specializing in Southeast Asian languages including Khmer.

> ⚠️ As the Khmer translations are machine-generated, some entries may contain translation errors or unnatural phrasing. Human review and correction are encouraged for production use.

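
Since the Khmer columns are machine-generated, a cheap automated pass can help decide which rows to send for human review first. The sketch below flags rows whose Khmer fields contain no character from the Unicode Khmer block (U+1780 to U+17FF). It is only a heuristic under stated assumptions: the helpers `has_khmer` and `flag_suspect_rows` are hypothetical, the third toy row is invented, and passing this check says nothing about translation quality.

```python
import pandas as pd

KHMER_START, KHMER_END = 0x1780, 0x17FF  # Unicode "Khmer" block


def has_khmer(text) -> bool:
    """True if the value is a string with at least one Khmer-block character."""
    if not isinstance(text, str):
        return False  # missing values (None/NaN) count as "no Khmer"
    return any(KHMER_START <= ord(ch) <= KHMER_END for ch in text)


def flag_suspect_rows(frame: pd.DataFrame,
                      columns=("word_km", "definition_km")) -> pd.DataFrame:
    """Return rows where any listed column lacks Khmer script (hypothetical helper)."""
    mask = pd.Series(False, index=frame.index)
    for col in columns:
        mask |= ~frame[col].map(has_khmer).astype(bool)
    return frame[mask]


# Toy frame: the first two rows reuse the card's sample data; the third row
# is a made-up example of an untranslated entry.
toy = pd.DataFrame({
    "word": ["A", "A", "Zzz"],
    "word_km": ["ក", "ក", "zzz"],
    "definition_km": ["នៅក្នុង; នៅលើ; នៅ; ដោយ.", "នៃ។", None],
})

suspects = flag_suspect_rows(toy)
print(suspects["word"].tolist())
```

The example columns are left out of the default check because the card states they may be empty; add `example_km` to `columns` only after filtering to rows where an example exists.
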
## License

This dataset is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license. You are free to share and adapt the data as long as appropriate credit is given.

## Contributions

Contributions, corrections, and additions are welcome! Feel free to open an issue or pull request on the dataset repository.

## Contact

For questions or feedback, please reach out via the Hugging Face community tab.

Provider:
mrrtmob