wave101828228/Kaikki-Wiktionary-Ultimate-SQLite

Name: wave101828228/Kaikki-Wiktionary-Ultimate-SQLite
Creator: wave101828228
Published: 2026-04-21 03:14:01
License: no description provided

Hugging Face. Updated 2026-04-21; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/wave101828228/Kaikki-Wiktionary-Ultimate-SQLite

Resource description:

---
license: cc-by-sa-3.0
language:
- id
- jv
- su
- la
- sa
- en
tags:
- linguistics
- etymology
- dictionary
- sqlite
- wiktextract
repo_metadata: true
pretty_name: Kaikki Wiktionary Ultimate SQLite
size_categories:
- 10M<n<100M
---

# Kaikki Ultimate Raw SQLite - World Dictionary Database (2026 Edition)

## 📌 Overview

This dataset is a high-performance **SQLite conversion** of the massive Kaikki.org (Wiktextract) raw data. It contains millions of lexical entries across thousands of languages, preserved in its **absolute raw JSON format** to ensure zero data loss.

This database is designed for developers, linguists, and AI researchers who need a structured, indexed, and offline-ready version of the world's most comprehensive dictionary.

## 🚀 Key Features

* **Zero Data Loss:** Every single detail from the original `raw-wiktextract-data.jsonl` is preserved in the `full_json` column.
* **Massive Scale:** Approximately **25 GB** of structured data covering etymology, senses, pronunciations, and translations.
* **Performance Optimized:** Includes pre-built **B-Tree Indexes** on the `word` and `lang` columns for millisecond query response times.
* **Single File Convenience:** No need to parse millions of JSON lines; just connect with `sqlite3` and start querying.

## 📊 Database Schema

The database contains a single table named `dict` with the following structure:

| Column | Type | Description |
|---|---|---|
| **word** | TEXT | The headword/entry (Indexed). |
| **lang** | TEXT | Language name (e.g., "Indonesian", "Latin", "Sanskrit") (Indexed). |
| **pos** | TEXT | Part of Speech (e.g., "noun", "verb", "adj"). |
| **full_json** | TEXT | The original raw JSON string containing all metadata. |

## 🛠 How to Use

You can access this dataset using Python or any SQLite-compatible tool.
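Before opening the full 25 GB file, it can help to see how the schema and the indexes on `word` and `lang` behave in miniature. The sketch below is a toy under stated assumptions: it builds an in-memory table with the same four columns, the index names `idx_dict_word` and `idx_dict_lang` are hypothetical (the real file's index names are not documented here), and the three rows are hand-written stand-ins rather than real Kaikki records.

```python
import json
import sqlite3


def build_sample_db() -> sqlite3.Connection:
    """Create an in-memory stand-in with the same four columns as `dict`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dict (word TEXT, lang TEXT, pos TEXT, full_json TEXT)"
    )
    # Hypothetical index names; the real database may name them differently.
    conn.execute("CREATE INDEX idx_dict_word ON dict(word)")
    conn.execute("CREATE INDEX idx_dict_lang ON dict(lang)")
    rows = [
        {"word": "surya", "lang": "Indonesian", "pos": "noun"},
        {"word": "surya", "lang": "Javanese", "pos": "noun"},
        {"word": "sol", "lang": "Latin", "pos": "noun"},
    ]
    conn.executemany(
        "INSERT INTO dict (word, lang, pos, full_json) VALUES (?, ?, ?, ?)",
        [(r["word"], r["lang"], r["pos"], json.dumps(r)) for r in rows],
    )
    return conn


def lookup(conn, word, lang):
    """Return decoded entries for one headword in one language."""
    cur = conn.execute(
        "SELECT full_json FROM dict WHERE word = ? AND lang = ?",
        (word, lang),
    )
    return [json.loads(row[0]) for row in cur]


def query_plan(conn, word, lang):
    """Return the detail text of SQLite's plan for the lookup query."""
    cur = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT full_json FROM dict WHERE word = ? AND lang = ?",
        (word, lang),
    )
    return [row[-1] for row in cur]  # the last column holds the plan detail


conn = build_sample_db()
entries = lookup(conn, "surya", "Indonesian")
plan = query_plan(conn, "surya", "Indonesian")
print(entries)
print(plan)
conn.close()
```

Running `EXPLAIN QUERY PLAN` against the real file in the same way is a quick check that a query is served by an index rather than a full table scan.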
### Quick Python Example:

```python
import sqlite3
import json

# Connect to the database
conn = sqlite3.connect("kaikki_ultimate_raw.db")
cursor = conn.cursor()

# Search for a word
word_to_find = "surya"
cursor.execute("SELECT full_json FROM dict WHERE word = ?", (word_to_find,))
results = cursor.fetchall()

for row in results:
    data = json.loads(row[0])
    print(f"Language: {data.get('lang')}")
    print(f"Etymology: {data.get('etymology_text')}")

conn.close()
```

## 🌐 Target Languages Included

This database is a **universal** collection, including but not limited to:

* **Modern Languages:** Indonesian, English, Spanish, Mandarin, etc.
* **Regional Languages:** Javanese, Sundanese, Minangkabau, Balinese.
* **Classical/Root Languages:** Latin, Ancient Greek, Sanskrit, Old Javanese, Arabic.

## 📜 Source & Attribution

The data is sourced from **Kaikki.org**, which extracts data from **Wiktionary**. All data is licensed under **Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0)**.

## 💡 Potential Use Cases

1. **Etymology Research:** Trace back the roots of Indonesian words to Sanskrit or Latin.
2. **AI Training:** Fine-tune LLMs for translation or morphological analysis.
3. **App Development:** Build powerful offline dictionary apps (e.g., "HafalKuat").
4. **Linguistic Analysis:** Statistical analysis of word formations across different language families.
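For the etymology-research use case, one possible starting point is to collect the `etymology_text` field of a headword across every language that records it. This is a minimal sketch, not the dataset author's method: the helper name `etymologies_by_language` is hypothetical, the in-memory table only mimics the `dict` schema, and the etymology strings are placeholders rather than real Wiktionary content.

```python
import json
import sqlite3
from collections import defaultdict


def etymologies_by_language(conn, word):
    """Map each language to the etymology texts recorded for `word`."""
    found = defaultdict(list)
    cur = conn.execute(
        "SELECT lang, full_json FROM dict WHERE word = ?", (word,)
    )
    for lang, raw in cur:
        entry = json.loads(raw)
        text = entry.get("etymology_text")
        if text:  # not every entry carries an etymology
            found[lang].append(text)
    return dict(found)


# Toy stand-in for the real database; etymology strings are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dict (word TEXT, lang TEXT, pos TEXT, full_json TEXT)")
samples = [
    ("surya", "Indonesian", "noun", {"etymology_text": "placeholder A"}),
    ("surya", "Javanese", "noun", {"etymology_text": "placeholder B"}),
    ("surya", "Sundanese", "noun", {}),  # entry without an etymology
]
conn.executemany(
    "INSERT INTO dict VALUES (?, ?, ?, ?)",
    [(w, l, p, json.dumps(j)) for w, l, p, j in samples],
)

result = etymologies_by_language(conn, "surya")
print(result)
conn.close()
```

Against the real file, the same function should work after replacing the in-memory connection with `sqlite3.connect("kaikki_ultimate_raw.db")`, provided the entries carry `etymology_text` as in the quick example above.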
