wave101828228/Kaikki-Wiktionary-Ultimate-SQLite
Source: Hugging Face. Updated 2026-04-21. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/wave101828228/Kaikki-Wiktionary-Ultimate-SQLite
Resource description:
---
license: cc-by-sa-3.0
language:
- id
- jv
- su
- la
- sa
- en
tags:
- linguistics
- etymology
- dictionary
- sqlite
- wiktextract
repo_metadata: true
pretty_name: Kaikki Wiktionary Ultimate SQLite
size_categories:
- 10M<n<100M
---
# Kaikki Ultimate Raw SQLite - World Dictionary Database (2026 Edition)
## 📌 Overview
This dataset is a **SQLite conversion** of the raw Kaikki.org (Wiktextract) data dump. It contains millions of lexical entries across thousands of languages, and each entry keeps its **original raw JSON** so that no data is lost in the conversion.
This database is designed for developers, linguists, and AI researchers who need a structured, indexed, and offline-ready version of the world's most comprehensive dictionary.
## 🚀 Key Features
* **Zero Data Loss:** Every detail from the original `raw-wiktextract-data.jsonl` is preserved in the `full_json` column.
* **Massive Scale:** Approximately **25 GB** of structured data covering etymology, senses, pronunciations, and translations.
* **Performance Optimized:** Includes pre-built **B-Tree indexes** on the `word` and `lang` columns, so exact-match lookups on those columns can avoid a full table scan.
* **Single File Convenience:** No need to parse millions of JSON lines; just connect with `sqlite3` and start querying.
## 📊 Database Schema
The database contains a single table named `dict` with the following structure:
| Column | Type | Description |
|---|---|---|
| **word** | TEXT | The headword/entry (Indexed). |
| **lang** | TEXT | Language name (e.g., "Indonesian", "Latin", "Sanskrit") (Indexed). |
| **pos** | TEXT | Part of Speech (e.g., "noun", "verb", "adj"). |
| **full_json** | TEXT | The original raw JSON string containing all metadata. |
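The card does not publish the actual DDL or index names, so the following is only a minimal sketch. It assumes a table equivalent to the one described above and uses hypothetical index names (`idx_word`, `idx_lang`), built in memory so it runs without the 25 GB file. It shows how `PRAGMA table_info` and `EXPLAIN QUERY PLAN` can be used to check the columns and to confirm that a word lookup is served by an index.

```python
import sqlite3

# Assumed DDL: column names come from the schema table above; the index
# names idx_word and idx_lang are hypothetical, not read from the real file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dict (word TEXT, lang TEXT, pos TEXT, full_json TEXT);
CREATE INDEX idx_word ON dict(word);
CREATE INDEX idx_lang ON dict(lang);
""")

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(dict)")]

# EXPLAIN QUERY PLAN yields (id, parent, notused, detail); the detail text
# says whether SQLite searches an index or scans the whole table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT full_json FROM dict WHERE word = ?", ("surya",)
).fetchall()
plan_text = " ".join(row[3] for row in plan)

print(columns)
print(plan_text)
conn.close()
```

Running the same two statements against the real database file would show its actual column list and which index, if any, the lookup uses.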
## 🛠 How to Use
You can access this dataset using Python or any SQLite-compatible tool.
### Quick Python Example:
```python
import sqlite3
import json

# Connect to the database
conn = sqlite3.connect("kaikki_ultimate_raw.db")
cursor = conn.cursor()

# Search for a word
word_to_find = "surya"
cursor.execute("SELECT full_json FROM dict WHERE word = ?", (word_to_find,))
results = cursor.fetchall()

# Each row holds one raw Wiktextract entry as a JSON string
for row in results:
    data = json.loads(row[0])
    print(f"Language: {data.get('lang')}")
    print(f"Etymology: {data.get('etymology_text')}")

conn.close()
```
```
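Because the same spelling can exist in several languages, it is often useful to filter on `lang` as well and to read the definitions out of `full_json`. The sketch below assumes the Wiktextract layout in which each entry has a `senses` list whose items may carry a `glosses` list. The helper name `lookup_glosses` and the in-memory sample row are illustrative stand-ins, not part of the dataset.

```python
import json
import sqlite3

def lookup_glosses(conn, word, lang):
    """Return a list of (pos, [glosses]) pairs for one word in one language."""
    rows = conn.execute(
        "SELECT pos, full_json FROM dict WHERE word = ? AND lang = ?",
        (word, lang),
    )
    results = []
    for pos, raw in rows:
        entry = json.loads(raw)
        # Assumed layout: "senses" is a list of objects, each with an
        # optional "glosses" list of definition strings.
        glosses = [
            gloss
            for sense in entry.get("senses", [])
            for gloss in sense.get("glosses", [])
        ]
        results.append((pos, glosses))
    return results

# Illustrative in-memory stand-in for the real file; the row is a made-up sample.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dict (word TEXT, lang TEXT, pos TEXT, full_json TEXT)")
sample = {"word": "surya", "lang": "Indonesian", "pos": "noun",
          "senses": [{"glosses": ["sun"]}]}
conn.execute(
    "INSERT INTO dict VALUES (?, ?, ?, ?)",
    ("surya", "Indonesian", "noun", json.dumps(sample)),
)

found = lookup_glosses(conn, "surya", "Indonesian")
print(found)
conn.close()
```

With the real database, replace the in-memory connection with `sqlite3.connect("kaikki_ultimate_raw.db")` and drop the sample insert.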
## 🌐 Target Languages Included
This database is a **universal** collection, including but not limited to:
* **Modern Languages:** Indonesian, English, Spanish, Mandarin, etc.
* **Regional Languages:** Javanese, Sundanese, Minangkabau, Balinese.
* **Classical/Root Languages:** Latin, Ancient Greek, Sanskrit, Old Javanese, Arabic.
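To check which of these languages are actually present, and how many entries each has, a grouped count on the `lang` column is one option. This is a minimal sketch run on a tiny in-memory table with placeholder rows. The helper name `count_entries_by_language` is hypothetical, and it assumes the `lang` values match the language names exactly as written (for example `"Old Javanese"`).

```python
import sqlite3

def count_entries_by_language(conn, languages):
    """Return {lang: entry_count} for the requested language names."""
    placeholders = ",".join("?" for _ in languages)
    query = (
        "SELECT lang, COUNT(*) FROM dict "
        f"WHERE lang IN ({placeholders}) GROUP BY lang ORDER BY lang"
    )
    return dict(conn.execute(query, list(languages)).fetchall())

# Placeholder rows standing in for the real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dict (word TEXT, lang TEXT, pos TEXT, full_json TEXT)")
conn.executemany(
    "INSERT INTO dict VALUES (?, ?, ?, ?)",
    [
        ("surya", "Indonesian", "noun", "{}"),
        ("sol", "Latin", "noun", "{}"),
        ("aqua", "Latin", "noun", "{}"),
    ],
)

counts = count_entries_by_language(conn, ["Indonesian", "Latin", "Balinese"])
print(counts)
conn.close()
```

Languages with no rows are simply absent from the result, which makes missing or differently spelled language names easy to spot.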
## 📜 Source & Attribution
The data is sourced from **Kaikki.org**, which extracts data from **Wiktionary**.
All data is licensed under **Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0)**.
## 💡 Potential Use Cases
1. **Etymology Research:** Trace back the roots of Indonesian words to Sanskrit or Latin.
2. **AI Training:** Fine-tune LLMs for translation or morphological analysis.
3. **App Development:** Build powerful offline dictionary apps (e.g., "HafalKuat").
4. **Linguistic Analysis:** Statistical analysis of word formations across different language families.
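As a starting point for the etymology use case, the structured etymology data inside `full_json` can be read directly. The sketch below assumes the Wiktextract field `etymology_templates`, where each item has a template `name` and an `args` mapping with string keys. It further assumes the Wiktionary convention that, for the `bor`, `der`, and `inh` templates, argument `"2"` is the source language code and `"3"` is the source term. The helper name `source_terms` and the sample record are illustrative only, so verify the field layout against real rows before relying on it.

```python
import json

# Wiktionary etymology templates: bor = borrowed, der = derived, inh = inherited.
ETYMOLOGY_TEMPLATES = {"bor", "der", "inh"}

def source_terms(full_json):
    """Return (source_lang_code, source_term) pairs found in one entry."""
    entry = json.loads(full_json)
    pairs = []
    for template in entry.get("etymology_templates", []):
        if template.get("name") in ETYMOLOGY_TEMPLATES:
            args = template.get("args", {})
            # Assumed positions: "2" = source language code, "3" = source term.
            pairs.append((args.get("2"), args.get("3")))
    return pairs

# Made-up sample record shaped like a Wiktextract entry.
sample = json.dumps({
    "word": "surya",
    "lang": "Indonesian",
    "etymology_templates": [
        {"name": "bor", "args": {"1": "id", "2": "sa", "3": "सूर्य"}}
    ],
})

pairs = source_terms(sample)
print(pairs)
```

Applied to each `full_json` value returned by a query on `dict`, this yields candidate source-language links that can then be followed by looking up the source term in turn.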
Provider:
wave101828228



