Kartmaan/french-dictionary

Name: Kartmaan/french-dictionary
Creator: Kartmaan
Published: 2026-03-31 09:27:38
License: no description provided

Hugging Face · Updated 2026-03-31 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/Kartmaan/french-dictionary

Description:

---
license: cc-by-sa-4.0
language:
- fr
tags:
- dictionary
- french
- sqlite
- parquet
- dictionnaire
- français
- database
- nlp
- wiktionary
- offline
- french-language
pretty_name: French Dictionary
size_categories:
- 100M<n<1B
---

# French Dictionary

A ready-to-use offline French language dictionary derived from the French Wiktionary. Available in two formats to suit different use cases: **SQLite** for desktop applications and real-time querying, and **Parquet** for data science and machine learning pipelines.

Contains nearly **900,000 distinct word forms** including conjugated verb forms, with structured definitions, usage examples, and rich linguistic metadata.

---

## Acknowledgements

This dataset would not exist without the foundational work of **Franck Sajous**, CNRS research engineer and lecturer in the Language Sciences department at the University of Toulouse. He performed the complete parsing of the French Wiktionary into a structured XML resource, WiktionaryX, which this database is derived from.

The original XML resource is available at:
👉 http://redac.univ-tlse2.fr/lexiques/wiktionaryx.html

This dataset is published under **CC BY-SA 4.0**, in accordance with the license of the source material (French Wiktionary).

---

## Available Formats

| File | Format | Size | Best suited for |
|---|---|---|---|
| `french_dict.db` | SQLite 3 | ~280 MB | Desktop apps, real-time search, SQL queries |
| `french_dict.parquet` | Parquet (zstd) | ~22 MB | Data science, ML pipelines, Pandas / PyArrow |

Both files contain the same data. The format difference reflects a deliberate trade-off between query performance and compression efficiency.

---

## Contents

| Property | Value |
|---|---|
| Language | French |
| Total rows | 1,256,143 |
| Distinct word forms | 895,090 |
| Includes conjugated forms | ✅ Yes |
| Definitions | ✅ Yes |
| Gender | ✅ Yes |
| Usage examples | ✅ Yes |
| Register tags | ✅ Yes (familier, vieux, littéraire…) |
| Semantic tags | ✅ Yes (figuré, par extension…) |
| Domain tags | ✅ Yes (musique, informatique…) |
| Etymologies | ❌ Not included |
| Translations | ❌ Not included |

---

## SQLite Format (`french_dict.db`)

### Schema

```sql
CREATE TABLE mots (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    forme       TEXT NOT NULL,      -- lowercase word form (search key)
    pos         TEXT,               -- part of speech
    definitions TEXT NOT NULL,      -- JSON-serialized array of definitions
    gender      TEXT DEFAULT NULL   -- gender: "m", "f", "e", or NULL
);

CREATE INDEX idx_forme ON mots(forme);
```

A word with multiple parts of speech (e.g. *lire* as both a verb and a noun) appears as **multiple rows** sharing the same `forme`. Always query with `SELECT * FROM mots WHERE forme = ?` to retrieve all senses.
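
As a minimal sketch of why a lookup on `forme` should expect several rows, the snippet below builds a throwaway in-memory database with the documented columns and inserts two hand-written rows for *lire*. The sample glosses and the `lookup` helper are illustrative assumptions, not data or code taken from the dataset.

```python
import json
import sqlite3

# Throwaway in-memory database using the documented columns.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE mots (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    forme       TEXT NOT NULL,
    pos         TEXT,
    definitions TEXT NOT NULL,
    gender      TEXT DEFAULT NULL
);
CREATE INDEX idx_forme ON mots(forme);
""")

# Hand-written sample rows (illustrative glosses, NOT taken from the dataset).
samples = [
    ("lire", "V", [{"gloss": "Déchiffrer un texte écrit."}], None),
    ("lire", "N", [{"gloss": "Ancienne monnaie italienne."}], "f"),
]
conn.executemany(
    "INSERT INTO mots (forme, pos, definitions, gender) VALUES (?, ?, ?, ?)",
    [(f, p, json.dumps(d, ensure_ascii=False), g) for f, p, d, g in samples],
)


def lookup(word):
    """Hypothetical helper: (pos, gender, first gloss) for every row of a form."""
    rows = conn.execute(
        "SELECT pos, gender, definitions FROM mots WHERE forme = ? ORDER BY id",
        (word.lower(),),  # forms are stored in lowercase
    ).fetchall()
    return [
        (r["pos"], r["gender"], json.loads(r["definitions"])[0]["gloss"])
        for r in rows
    ]


results = lookup("Lire")
for pos, gender, gloss in results:
    print(f"[{pos}] ({gender}) {gloss}")
```

Lowercasing the input before the query matches the lowercase `forme` key; one row comes back per part of speech, so callers should iterate rather than call `fetchone()`.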

### Definition JSON structure

Each row's `definitions` field contains a JSON array:

```json
[
  {
    "gloss": "Cérémonie ou prestation réservée à un nouvel arrivant...",
    "register": "Familier",
    "semantic": "Figuré",
    "domain": null,
    "exemples": ["Ils lui ont fait un accueil chaleureux."],
    "sous_definitions": [
      {
        "gloss": "Sous-définition optionnelle.",
        "register": null,
        "semantic": null,
        "domain": null,
        "exemples": []
      }
    ]
  }
]
```

### Part-of-speech values

| Code | Meaning |
|---|---|
| `N` | Noun (Nom) |
| `V` | Verb (Verbe) |
| `ADJ` | Adjective (Adjectif) |
| `ADV` | Adverb (Adverbe) |
| `PRO` | Pronoun (Pronom) |
| `DET` | Determiner (Déterminant) |
| `PRE` | Preposition (Préposition) |
| `CON` | Conjunction (Conjonction) |
| `INT` | Interjection |

### Usage example

```python
import json
import sqlite3

conn = sqlite3.connect("french_dict.db")
conn.row_factory = sqlite3.Row  # access columns by name

# Exact search
rows = conn.execute(
    "SELECT pos, definitions FROM mots WHERE forme = ?",
    ("lire",)
).fetchall()

for row in rows:
    defs = json.loads(row["definitions"])
    print(f"[{row['pos']}] {defs[0]['gloss']}")

# Prefix-based suggestions
candidates = conn.execute(
    "SELECT DISTINCT forme FROM mots WHERE forme LIKE ? LIMIT 300",
    ("dict%",)
).fetchall()
print([r["forme"] for r in candidates])
```

---

## Parquet Format (`french_dict.parquet`)

The Parquet file flattens the nested JSON structure into **one row per definition**, making each field directly accessible as a column. Sub-definitions are included as separate rows with a non-null `sub_index`.
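
The flattening described above can be sketched roughly as follows. This is not the author's `db_to_parquet.py`; `flatten_definitions` and `_row` are hypothetical helpers, and the choices to number sub-definitions from 1 and to set `has_sub` to `False` on sub-definition rows are assumptions made for illustration.

```python
import json


def _row(forme, pos, entry, def_index, sub_index, has_sub):
    """Build one flat record from a definition or sub-definition dict."""
    return {
        "forme": forme,
        "pos": pos,
        "def_index": def_index,
        "sub_index": sub_index,  # None for top-level definitions
        "gloss": entry["gloss"],
        "register": entry.get("register"),
        "semantic": entry.get("semantic"),
        "domain": entry.get("domain"),
        "examples": " | ".join(entry.get("exemples") or []),
        "has_sub": has_sub,
    }


def flatten_definitions(forme, pos, definitions_json):
    """Hypothetical helper: yield one flat row per definition and sub-definition."""
    for def_index, definition in enumerate(json.loads(definitions_json), start=1):
        subs = definition.get("sous_definitions") or []
        yield _row(forme, pos, definition, def_index, None, bool(subs))
        # Assumption: sub-definitions are numbered from 1 within their parent.
        for sub_index, sub in enumerate(subs, start=1):
            yield _row(forme, pos, sub, def_index, sub_index, False)


sample = json.dumps([
    {
        "gloss": "Cérémonie ou prestation réservée à un nouvel arrivant...",
        "register": "Familier",
        "semantic": "Figuré",
        "domain": None,
        "exemples": ["Ils lui ont fait un accueil chaleureux."],
        "sous_definitions": [
            {"gloss": "Sous-définition optionnelle.", "register": None,
             "semantic": None, "domain": None, "exemples": []}
        ],
    }
])

flat_rows = list(flatten_definitions("accueil", "N", sample))
for r in flat_rows:
    print(r["def_index"], r["sub_index"], r["gloss"])
```

A list of such records can be passed to `pandas.DataFrame(...)` to obtain a table shaped like the Parquet file.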

### Schema

| Column | Type | Description |
|---|---|---|
| `forme` | string | Lowercase word form |
| `pos` | category | Part of speech (N, V, ADJ…) |
| `gender` | category | Gender of the noun (m, f, e, NULL) |
| `def_index` | int16 | Definition index within the lexeme (1-based) |
| `sub_index` | Int16 | Sub-definition index (null for top-level definitions) |
| `gloss` | string | Definition text |
| `register` | category | Usage register (Familier, Littéraire, vieux…) |
| `semantic` | category | Semantic annotation (Figuré, Par extension…) |
| `domain` | category | Subject domain (Musique, Informatique…) |
| `examples` | string | Usage examples joined by ` \| ` |
| `has_sub` | bool | Whether this definition has sub-definitions |

Categorical columns (`pos`, `register`, `semantic`, `domain`) use dictionary encoding: each unique value is stored once and referenced by an integer, which contributes significantly to the compression ratio (270 MB → 22 MB with zstd).

### Usage example

```python
import pandas as pd

# Load full dataset
df = pd.read_parquet("french_dict.parquet")

# Load specific columns only (columnar read, very fast)
df = pd.read_parquet(
    "french_dict.parquet",
    columns=["forme", "pos", "gloss", "register"]
)

# Filter nouns with a familiar register
nouns = df[(df["pos"] == "N") & (df["register"] == "Familier")]

# Load directly from Hugging Face
from datasets import load_dataset

ds = load_dataset("Kartmaan/french-dictionary")
df = ds["train"].to_pandas()
```

---

## How This Dataset Was Built

### Source

The raw data comes from **WiktionaryX**, a structured XML dump of the French Wiktionary produced by Franck Sajous at the CNRS / University of Toulouse.

### Parsing pipeline

```
french_dict.xml  (341 MB)
        │
        ▼  xml_to_json.py: iterparse streaming, extracts gloss/examples/metadata
        │
french_dict.json (265 MB)
        │
        ▼  json_to_sqlite.py: batch inserts (5,000 rows/batch), VACUUM + ANALYZE
        │
french_dict.db   (270 MB) ✅
        │
        ▼  db_to_parquet.py: flattens JSON definitions, zstd compression
        │
french_dict.parquet (22 MB) ✅
```

**Note:** Grammatical genders were added to `french_dict.db` by merging the data with a French lexical database. This reference dataset was retrieved from lexique.org and originates from the OpenLexicon project, which is distributed under the CC BY-SA 4.0 license.

**Key implementation choices:**

- `iterparse` with `elem.clear()` after each `<entry>` to keep memory usage flat
- Attributes `register`, `semantic`, and `domain` preserved for rich downstream filtering
- B-tree index on `forme` for sub-millisecond exact lookups in SQLite
- Categorical encoding in Parquet for columns with low cardinality

---

## Projects Using This Dataset

- **Lexika-fr**: French language desktop tools (dictionary, lexicon, quiz, analyzer)
  👉 https://github.com/Kartmaan/lexika-fr

---

## License

**CC BY-SA 4.0**, derived from the French Wiktionary via WiktionaryX.

You are free to use, share, and adapt this dataset for any purpose, provided you give appropriate credit and distribute any derivative works under the same license.

- Original source: French Wiktionary - https://fr.wiktionary.org
- WiktionaryX parser: Franck Sajous - http://redac.univ-tlse2.fr/lexiques/wiktionaryx.html
- OpenLexicon (for adding the gender of nouns) - https://openlexicon.fr/

Provider:

Kartmaan
