cstr/en-wiktionary-sqlite-full
Hugging Face · Updated 2025-11-22 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/cstr/en-wiktionary-sqlite-full
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- text-retrieval
language:
- en
tags:
- wiktionary
- dictionary
- english
- linguistics
- morphology
- semantics
- normalized
size_categories:
- 1M<n<10M
---
# English Wiktionary - Normalized SQLite Database
This is a normalized SQLite database of English Wiktionary that captures every field from the `cstr/en-wiktionary-extracted` dataset.
Note that it does **not** contain everything wiktextract can extract with `--all`: translations, examples, and similar data are absent.
## 🎯 Key Features
- **Fields captured**, including:
  - 🔗 **Wikilinks** in definitions (semantic connections)
  - 📝 **Qualifiers** (e.g., "archaic", "US", "informal")
  - 🏷️ **Sense IDs** (unique identifiers)
  - 🌐 **Wikidata IDs** (for semantic web linking)
  - 📚 **Attestations** (historical citations)
  - 🎭 **Head templates** (morphological data)
  - 📖 **Info templates** (structured metadata)
- **⚡ Fast Queries**: Fully indexed schema for sub-20ms queries
- **🔗 Semantic Web**: relations preserved with sense-level granularity
- **📱 Mobile-ready**: Optimized for sq(f)lite (Flutter) and local DB use cases
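One way to check the indexing claim on your own copy is to ask SQLite for its query plan. The sketch below is illustrative and not part of the dataset's tooling: `full_scans` is a hypothetical helper, and the demo runs against a tiny in-memory table with an assumed index name (`idx_entries_word`) rather than the real database.

```python
import sqlite3


def full_scans(conn, sql, params=()):
    """Return the query-plan steps that read a whole table or index."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each plan row is (id, parent, notused, detail). "SCAN ..." marks a full pass,
    # while "SEARCH ..." means an index or rowid lookup narrows the rows.
    return [row[3] for row in plan if row[3].startswith("SCAN")]


# Stand-in for the real database: one table, one assumed index.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE entries (id INTEGER PRIMARY KEY, word TEXT, lang TEXT);
    CREATE INDEX idx_entries_word ON entries (word);
""")

indexed = full_scans(demo, "SELECT id FROM entries WHERE word = ?", ("dog",))
unindexed = full_scans(demo, "SELECT id FROM entries WHERE lang = ?", ("English",))
print(indexed)    # steps that scan when filtering on the indexed column
print(unindexed)  # steps that scan when filtering on the unindexed column
```

Passing the real connection and one of your own queries to `full_scans` shows whether any step falls back to a full scan, which is usually where latency comes from.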
## 📊 Database Statistics
- **Entries**: 1,243,200
- **Word Senses**: 1,361,968
- **Definitions (Glosses)**: 1,381,486
- **Wikilinks**: 2,585,821
- **Sense IDs**: 1,361,968
- **Qualifiers**: Embedded in senses
- **Translations**: 0
- **Word Forms**: 700,191
- **Head Templates**: 1,237,679
- **Pronunciations**: 0
- **Examples**: 0
- **Attestations**: 4,295
- **Wikidata IDs**: 2,309
- **Synonyms**: 214,838
- **Antonyms**: 11,816
- **Hypernyms**: 9,818
- **Hyponyms**: 22,649
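The figures above can be cross-checked against a downloaded copy by counting rows per table. This is a minimal sketch: `table_row_counts` is a hypothetical helper, it reads table names from `sqlite_master` so it assumes nothing about the real schema, and the demo uses a made-up two-table database.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Names come from sqlite_master; quote them as identifiers before use.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Tiny in-memory stand-in (not the real data).
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE entries (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE translations (id INTEGER PRIMARY KEY, entry_id INTEGER);
    INSERT INTO entries (word) VALUES ('dog'), ('cat');
""")
demo_counts = table_row_counts(demo)
print(demo_counts)
```

Tables that report 0 rows, as translations, pronunciations, and examples do in the statistics above, exist in the schema but carry no data.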
## 🏗️ Database Schema (40+ Tables)
### New Tables (vs Previous Versions)
- **head_templates**: Morphological templates
- **entry_wikipedia**: Wikipedia cross-references
- **sense_links**: Wikilinks in definitions
- **sense_raw_tags**: Unstructured tags
- **sense_wikidata**: Wikidata identifiers
- **sense_wikipedia**: Wikipedia at sense level
- **attestations**: Historical citations
- **info_templates**: Structured metadata
### Core Tables
- **entries**: Core word data with etymology
- **senses**: Definitions with qualifier, senseid, head_nr
- **translations**: Multi-language translations (0 rows in this database)
- **examples**: Usage examples (0 rows in this database)
- **semantic relations**: hypernyms/hyponyms/meronyms/holonyms/coordinate_terms
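Since only some tables are described here, it can help to inspect columns and indexes directly before writing queries. The following is a sketch using SQLite's `PRAGMA table_info` and `PRAGMA index_list`. `describe_table` is a hypothetical helper, and the demo table and its index name are assumptions for illustration, not the real schema.

```python
import sqlite3


def describe_table(conn, table):
    """Return the column (name, type) pairs and index names of one table."""
    quoted = '"' + table.replace('"', '""') + '"'
    # table_info rows: (cid, name, type, notnull, dflt_value, pk)
    columns = [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({quoted})")]
    # index_list rows: (seq, name, unique, origin, partial)
    indexes = [row[1] for row in conn.execute(f"PRAGMA index_list({quoted})")]
    return {"columns": columns, "indexes": indexes}


# Tiny in-memory stand-in (column set and index name are assumed).
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE senses (id INTEGER PRIMARY KEY, entry_id INTEGER, qualifier TEXT);
    CREATE INDEX idx_senses_entry ON senses (entry_id);
""")
info = describe_table(demo, "senses")
print(info)
```

Running `describe_table(conn, "senses")` on the real connection lists the actual columns, which may differ from the stand-in above.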
## 📖 Usage
### Download
```python
import gzip
import shutil
import sqlite3

from huggingface_hub import hf_hub_download

# Download the compressed database (stored in the local Hugging Face cache)
db_gz_path = hf_hub_download(
    repo_id="cstr/en-wiktionary-sqlite-full",
    filename="en_wiktionary_normalized_full.db.gz",
    repo_type="dataset",
)

# Decompress next to the download; strip only the trailing ".gz" (Python 3.9+)
db_path = db_gz_path.removesuffix(".gz")
with gzip.open(db_gz_path, "rb") as f_in, open(db_path, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# Connect
conn = sqlite3.connect(db_path)
```
### Example Queries
```python
# Uses the `conn` opened in the Download step
cursor = conn.cursor()

# Get definitions with wikilinks for "dog"
cursor.execute('''
    SELECT g.gloss_text, GROUP_CONCAT(l.link_text, ', ') AS links
    FROM entries e
    JOIN senses s ON e.id = s.entry_id
    JOIN glosses g ON s.id = g.sense_id
    LEFT JOIN sense_links l ON s.id = l.sense_id
    WHERE e.word = ? AND e.lang = 'English'
    GROUP BY g.id
''', ('dog',))
for gloss_text, links in cursor.fetchall():
    print(gloss_text, '|', links)

# Get words with a specific qualifier (e.g., "archaic")
cursor.execute('''
    SELECT e.word, s.qualifier, g.gloss_text
    FROM entries e
    JOIN senses s ON e.id = s.entry_id
    JOIN glosses g ON s.id = g.sense_id
    WHERE s.qualifier LIKE '%archaic%'
    LIMIT 10
''')
print(cursor.fetchall())

# Find Wikidata IDs for the senses of a word
cursor.execute('''
    SELECT e.word, w.wikidata_id
    FROM entries e
    JOIN senses s ON e.id = s.entry_id
    JOIN sense_wikidata w ON s.id = w.sense_id
    WHERE e.word = ?
''', ('cat',))
print(cursor.fetchall())
```
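For application code, the joins above can be wrapped in a small function. The sketch below reuses only the table and column names that appear in the example queries. `lookup_glosses` is a hypothetical helper, and the demo rows are placeholders rather than real Wiktionary content.

```python
import sqlite3


def lookup_glosses(conn, word, lang="English"):
    """Return one dict per gloss of `word`, ordered by sense and gloss id."""
    rows = conn.execute(
        """
        SELECT s.id, s.qualifier, g.gloss_text
        FROM entries e
        JOIN senses s ON e.id = s.entry_id
        JOIN glosses g ON s.id = g.sense_id
        WHERE e.word = ? AND e.lang = ?
        ORDER BY s.id, g.id
        """,
        (word, lang),
    ).fetchall()
    return [{"sense_id": sid, "qualifier": q, "gloss": text} for sid, q, text in rows]


# In-memory stand-in with placeholder rows (the real tables may have more columns).
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE entries (id INTEGER PRIMARY KEY, word TEXT, lang TEXT);
    CREATE TABLE senses (id INTEGER PRIMARY KEY, entry_id INTEGER, qualifier TEXT);
    CREATE TABLE glosses (id INTEGER PRIMARY KEY, sense_id INTEGER, gloss_text TEXT);
    INSERT INTO entries VALUES (1, 'dog', 'English');
    INSERT INTO senses VALUES (1, 1, NULL), (2, 1, 'archaic');
    INSERT INTO glosses VALUES (1, 1, 'placeholder gloss A'), (2, 2, 'placeholder gloss B');
""")
results = lookup_glosses(demo, "dog")
print(results)
```

Returning plain dicts keeps the caller independent of column order. A word with no matching entry simply yields an empty list.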
## 📜 License
CC-BY-SA 4.0 (same as source)
## 🔄 Version
This version is **truly lossless** with respect to its source: it captures all 40+ fields of the `cstr/en-wiktionary-extracted` data.
Provider:
cstr



