
wave101828228/Kaikki-Wiktionary-Ultimate-SQLite

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/wave101828228/Kaikki-Wiktionary-Ultimate-SQLite
Resource description:
---
license: cc-by-sa-3.0
language:
- id
- jv
- su
- la
- sa
- en
tags:
- linguistics
- etymology
- dictionary
- sqlite
- wiktextract
repo_metadata: true
pretty_name: Kaikki Wiktionary Ultimate SQLite
size_categories:
- 10M<n<100M
---

# Kaikki Ultimate Raw SQLite - World Dictionary Database (2026 Edition)

## 📌 Overview

This dataset is a high-performance **SQLite conversion** of the massive Kaikki.org (Wiktextract) raw data. It contains millions of lexical entries across thousands of languages, preserved in its **absolute raw JSON format** to ensure zero data loss.

This database is designed for developers, linguists, and AI researchers who need a structured, indexed, and offline-ready version of the world's most comprehensive dictionary.

## 🚀 Key Features

* **Zero Data Loss:** Every single detail from the original `raw-wiktextract-data.jsonl` is preserved in the `full_json` column.
* **Massive Scale:** Approximately **25 GB** of structured data covering etymology, senses, pronunciations, and translations.
* **Performance Optimized:** Includes pre-built **B-Tree Indexes** on the `word` and `lang` columns for millisecond query response times.
* **Single File Convenience:** No need to parse millions of JSON lines; just connect with `sqlite3` and start querying.

## 📊 Database Schema

The database contains a single table named `dict` with the following structure:

| Column | Type | Description |
|---|---|---|
| **word** | TEXT | The headword/entry (Indexed). |
| **lang** | TEXT | Language name (e.g., "Indonesian", "Latin", "Sanskrit") (Indexed). |
| **pos** | TEXT | Part of Speech (e.g., "noun", "verb", "adj"). |
| **full_json** | TEXT | The original raw JSON string containing all metadata. |

## 🛠 How to Use

You can access this dataset using Python or any SQLite-compatible tool.

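To make the schema concrete, here is a minimal sketch that recreates the `dict` table in an in-memory database and checks that a lookup by `word` goes through an index. The exact DDL of the published file is not given in this card, so the `CREATE TABLE` statement, the index names (`idx_dict_word`, `idx_dict_lang`), and the sample row are all assumptions made for illustration.

```python
import json
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table shaped like the documented schema (assumed DDL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dict (word TEXT, lang TEXT, pos TEXT, full_json TEXT)"
    )
    # Hypothetical index names; the card only says word and lang are indexed.
    conn.execute("CREATE INDEX idx_dict_word ON dict (word)")
    conn.execute("CREATE INDEX idx_dict_lang ON dict (lang)")

    # Made-up sample entry, not taken from the real dataset.
    entry = {
        "word": "surya",
        "lang": "Indonesian",
        "pos": "noun",
        "etymology_text": "placeholder etymology",
    }
    conn.execute(
        "INSERT INTO dict (word, lang, pos, full_json) VALUES (?, ?, ?, ?)",
        (entry["word"], entry["lang"], entry["pos"], json.dumps(entry)),
    )
    conn.commit()
    return conn


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple) -> list[str]:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


if __name__ == "__main__":
    demo = build_demo_db()
    lookup = "SELECT full_json FROM dict WHERE word = ?"
    for step in query_plan(demo, lookup, ("surya",)):
        print(step)
    demo.close()
```

Running the same `EXPLAIN QUERY PLAN` statement against the real file is a quick way to confirm that its indexes are actually being used before issuing heavier queries.
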
### Quick Python Example:

```python
import sqlite3
import json

# Connect to the database
conn = sqlite3.connect("kaikki_ultimate_raw.db")
cursor = conn.cursor()

# Search for a word
word_to_find = "surya"
cursor.execute("SELECT full_json FROM dict WHERE word = ?", (word_to_find,))
results = cursor.fetchall()

# Each row holds one raw JSON entry; decode it and print selected fields
for row in results:
    data = json.loads(row[0])
    print(f"Language: {data.get('lang')}")
    print(f"Etymology: {data.get('etymology_text')}")

conn.close()
```

## 🌐 Target Languages Included

This database is a **universal** collection, including but not limited to:

* **Modern Languages:** Indonesian, English, Spanish, Mandarin, etc.
* **Regional Languages:** Javanese, Sundanese, Minangkabau, Balinese.
* **Classical/Root Languages:** Latin, Ancient Greek, Sanskrit, Old Javanese, Arabic.

## 📜 Source & Attribution

The data is sourced from **Kaikki.org**, which extracts data from **Wiktionary**. All data is licensed under **Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0)**.

## 💡 Potential Use Cases

1. **Etymology Research:** Trace back the roots of Indonesian words to Sanskrit or Latin.
2. **AI Training:** Fine-tune LLMs for translation or morphological analysis.
3. **App Development:** Build powerful offline dictionary apps (e.g., "HafalKuat").
4. **Linguistic Analysis:** Statistical analysis of word formations across different language families.
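
As a rough illustration of the etymology research use case, the sketch below collects the `etymology_text` field for one headword across several languages. It assumes the `dict` schema described in this card and that `lang` stores language names; the helper names `etymologies_by_language` and `make_sample_db`, and all sample rows, are hypothetical and exist only so the example runs without the 25 GB file.

```python
import json
import sqlite3


def etymologies_by_language(
    conn: sqlite3.Connection, word: str, languages: list[str]
) -> dict[str, list[str]]:
    """Map each language to the etymology texts found for `word`."""
    if not languages:
        return {}
    # One "?" per language keeps the query fully parameterized.
    placeholders = ", ".join("?" for _ in languages)
    sql = (
        "SELECT lang, full_json FROM dict "
        f"WHERE word = ? AND lang IN ({placeholders})"
    )
    found: dict[str, list[str]] = {}
    for lang, raw in conn.execute(sql, (word, *languages)):
        text = json.loads(raw).get("etymology_text")
        if text:  # entries without an etymology are skipped
            found.setdefault(lang, []).append(text)
    return found


def make_sample_db() -> sqlite3.Connection:
    """Small in-memory stand-in for the real database (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dict (word TEXT, lang TEXT, pos TEXT, full_json TEXT)"
    )
    samples = [
        {"word": "surya", "lang": "Indonesian", "pos": "noun",
         "etymology_text": "made-up etymology text"},
        {"word": "surya", "lang": "Javanese", "pos": "noun"},
        {"word": "sol", "lang": "Latin", "pos": "noun",
         "etymology_text": "another made-up etymology"},
    ]
    conn.executemany(
        "INSERT INTO dict (word, lang, pos, full_json) VALUES (?, ?, ?, ?)",
        [(s["word"], s["lang"], s["pos"], json.dumps(s)) for s in samples],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    db = make_sample_db()
    print(etymologies_by_language(db, "surya", ["Indonesian", "Javanese"]))
    db.close()
```

Against the real file, replace `make_sample_db()` with `sqlite3.connect("kaikki_ultimate_raw.db")`; filtering on both `word` and `lang` lets SQLite narrow the search with the indexed columns before any JSON is decoded.
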
Provider:
wave101828228