cstr/de-wiktionary-sqlite-full
Source: Hugging Face. Updated 2025-11-21. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/cstr/de-wiktionary-sqlite-full
Description:
---
license: cc-by-sa-4.0
task_categories:
- text-retrieval
language:
- de
tags:
- wiktionary
- dictionary
- german
- linguistics
- morphology
- semantics
- normalized
- lossless
size_categories:
- 1M<n<10M
---
# German Wiktionary - FULL Normalized SQLite Database
This is a **complete, lossless, and fully normalized** SQLite database of German Wiktionary, capturing 100% of the structured data from the `cstr/de-wiktionary-extracted` dataset.
It is designed for production-ready applications, complex linguistic analysis, and mobile apps (Flutter, React Native) that require a comprehensive local dictionary.
## 🎯 Key Features
- **✅ 100% Lossless**: All 30+ top-level and nested fields from the source JSONL are preserved.
- **⚡ Fast Queries**: Fully indexed schema for sub-20ms queries.
- **🔗 Full Semantic Web**: Includes all semantic relations (synonyms, antonyms, **hypernyms, hyponyms, meronyms, holonyms, coordinate_terms**).
- **🗣️ Rich Content**: Includes **expressions, proverbs, and entry notes** in addition to definitions and examples.
- **📱 Mobile-ready**: Optimized for `sqflite` (Flutter) and other local DB use cases.
- **(and all features from the standard DB: forms, translations, sounds, etc.)**
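The query-speed claim above can be checked on your own hardware. The following is a minimal sketch, not a benchmark from the dataset authors: `time_query` is a hypothetical helper, and the in-memory table and index name are stand-ins (the `entries.word` and `entries.lang` columns are taken from the example query in the Usage section). Point it at the real connection to measure actual latency.

```python
import sqlite3
import time


def time_query(conn, sql, params=()):
    """Run a query once and return (rows, elapsed time in milliseconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return rows, elapsed_ms


# Tiny in-memory stand-in so the sketch runs without the real database file.
# The index name below is hypothetical; the real schema may name it differently.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, word TEXT, lang TEXT)")
demo.execute("CREATE INDEX idx_entries_word ON entries(word)")
demo.execute("INSERT INTO entries (word, lang) VALUES ('Hund', 'Deutsch')")

rows, ms = time_query(demo, "SELECT id FROM entries WHERE word = ?", ("Hund",))
print(f"{len(rows)} row(s) in {ms:.3f} ms")
```

A single run includes cold-cache effects, so repeating the call several times and taking the median gives a more representative number.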
## 📊 Database Statistics
- **Entries**: 970,801
- **Word Senses**: 3,098,364
- **Definitions (Glosses)**: 3,087,300
- **Translations**: 1,131,251
- **Word Forms (Inflections)**: 6,100,090
- **Form Tags (Total)**: 25,966,680
- **Pronunciations (Sounds)**: 2,327,762
- **Usage Examples**: 427,322
- **Synonyms**: 161,563
- **Antonyms**: 76,054
- **Hypernyms**: 133,059
- **Hyponyms**: 217,179
- **Proverbs**: 1,078
- **Expressions**: 13,138
- **Descendants**: 211
- **Entry Notes**: 16,536
- **Unique Tags**: 185
- **Unique Topics**: 58
- **Unique Categories**: 352
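To compare a downloaded copy against the figures above, one option is to count the rows in every table. This sketch reads the table names from `sqlite_master`, so it does not assume any particular table name; `table_row_counts` is a hypothetical helper, and the in-memory tables are stand-ins. Which table corresponds to which statistic is not stated on this card, so the mapping has to be judged from the names.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        if not row[0].startswith("sqlite_")  # skip SQLite's internal tables
    ]
    counts = {}
    for name in names:
        # Identifiers cannot be bound as parameters, so quote the name instead.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Stand-in tables so the sketch runs without the real database file.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, word TEXT)")
demo.execute("CREATE TABLE hypernyms (entry_id INTEGER, hypernym_word TEXT)")
demo.executemany("INSERT INTO entries (word) VALUES (?)", [("Hund",), ("Katze",)])
demo.execute("INSERT INTO hypernyms VALUES (1, 'Tier')")

for name, count in table_row_counts(demo).items():
    print(f"{name}: {count:,}")
```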
## 🏗️ Database Schema (Full)
This schema includes all tables from the standard `cstr/de-wiktionary-sqlite-normalized` dataset, plus the following additions:
- **entries**: (New Columns)
  - `title`: The Wiktionary page title.
  - `redirect`: The page this entry redirects to (if any).
- **entry_notes**: (New Table) Free-text notes associated with an entry (e.g., "Es gibt etliche Belege für die Steigerung...").
- **other_pos**: (New Table) Alternative part-of-speech values for this word.
- **entry_raw_tags**: (New Table) Unparsed, raw tags from Wiktionary.
- **descendants**: (New Table) Words in other languages descended from this word.
- **hypernyms**: (New Table) Broader terms that the word is a kind of ("is-a"; e.g., "Tier" is a hypernym of "Hund").
- **hyponyms**: (New Table) Narrower terms that are kinds of the word (e.g., "Hund" is a hyponym of "Tier").
- **holonyms**: (New Table) Wholes that the word is a part of ("part-of"; e.g., "Hand" is a holonym of "Finger").
- **meronyms**: (New Table) Parts that the word has ("has-a"; e.g., "Finger" is a meronym of "Hand").
- **coordinate_terms**: (New Table) Sibling terms (e.g., "Hund" and "Katze" are coordinate terms under "Haustier").
- **expressions**: (New Table) Idiomatic expressions using the word (linked to `sense_id`).
- **proverbs**: (New Table) Proverbs using the word (linked to `sense_id`).
*(For the standard schema, see the `cstr/de-wiktionary-sqlite-normalized` dataset card)*
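This card lists the new tables but not their columns, so it can help to inspect the schema directly before writing queries. The sketch below uses SQLite's `sqlite_master` table and `PRAGMA table_info`; `describe_schema` is a hypothetical helper, and the in-memory table is only a stand-in whose columns mirror the example query in the Usage section.

```python
import sqlite3


def describe_schema(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for user tables."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        if table.startswith("sqlite_"):
            continue  # skip SQLite's internal tables
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        info = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        schema[table] = [(col[1], col[2]) for col in info]
    return schema


# Stand-in table so the sketch runs without the real database file.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE hypernyms (id INTEGER PRIMARY KEY, entry_id INTEGER, hypernym_word TEXT)"
)

for table, columns in describe_schema(demo).items():
    print(table, columns)
```

Running `describe_schema(conn)` on the real connection shows the actual column names for tables such as `expressions` or `coordinate_terms`.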
## 📖 Usage
### Download
```python
from huggingface_hub import hf_hub_download
import sqlite3
import gzip
import shutil

# Download the database file (filename as given on this card)
db_path = hf_hub_download(
    repo_id="cstr/de-wiktionary-sqlite-full",
    filename="de_wiktionary_normalized_full.db",
    repo_type="dataset"
)

# Decompress only if the downloaded file is gzip-compressed (.gz)
if db_path.endswith('.gz'):
    db_gz_path, db_path = db_path, db_path[:-len('.gz')]
    with gzip.open(db_gz_path, 'rb') as f_in:
        with open(db_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

# Connect
conn = sqlite3.connect(db_path)
```
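For a dictionary that is only ever read, opening the file in read-only mode guards against accidental writes. This is an optional sketch rather than part of the dataset's documented usage: `open_read_only` is a hypothetical helper built on SQLite's `mode=ro` URI parameter, and the temporary database is a stand-in for the downloaded file.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(db_path):
    """Open an existing SQLite file read-only, with rows addressable by column name."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    conn.row_factory = sqlite3.Row
    return conn


# Stand-in database file so the sketch runs without the real download.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.db")
writer = sqlite3.connect(demo_path)
writer.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, word TEXT, lang TEXT)")
writer.execute("INSERT INTO entries (word, lang) VALUES ('Hund', 'Deutsch')")
writer.commit()
writer.close()

ro = open_read_only(demo_path)
row = ro.execute("SELECT word, lang FROM entries").fetchone()
print(row["word"], row["lang"])

# Any write attempt on the read-only connection raises OperationalError.
try:
    ro.execute("INSERT INTO entries (word, lang) VALUES ('Katze', 'Deutsch')")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
print("write blocked:", write_blocked)
```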
### Example Query (New Tables)
```python
# Get all hypernyms (parent categories) for "Hund"
cursor = conn.cursor()  # `conn` is the connection opened in the Download step
cursor.execute('''
    SELECT h.hypernym_word
    FROM entries e
    JOIN hypernyms h ON e.id = h.entry_id
    WHERE e.word = ? AND e.lang = 'Deutsch'
''', ('Hund',))
print("Hypernyms of 'Hund':", [row[0] for row in cursor.fetchall()])
```
## 🔗 Source
Original data: [cstr/de-wiktionary-extracted](https://huggingface.co/datasets/cstr/de-wiktionary-extracted)
## 📜 License
CC-BY-SA 4.0 (same as source)
Provider: cstr



