cargilcm/frwiktionary
Hugging Face, updated 2026-01-16, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/cargilcm/frwiktionary
Resource description:
---
license: mit
language:
- en
- fr
tags:
- code
pretty_name: frwiktionary-20230501-pam-pcre_parsed.db
size_categories:
- 10K<n<100K
---
# Open database:
`$ sqlite3 frwiktionary-20230501-pages-articles-multistream.xml-pcre_parsed.db`
# See create table schema
<b>sqlite></b> ```.schema joined_trans```<br>
<br>
```CREATE TABLE `joined_trans` (```<br>
``` `id` integer NOT NULL```<br>
```, `page_id` integer DEFAULT NULL```<br>
```, `page_title` varchar(255) DEFAULT NULL```<br>
```, `rev_id` integer DEFAULT NULL```<br>
```, `rev_page` integer DEFAULT NULL```<br>
```, `old_id` integer DEFAULT NULL```<br>
```, `text_transd` varchar(1255) DEFAULT NULL```<br>
```);```
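
The schema can also be read from a script instead of the `sqlite3` shell. What follows is only a minimal sketch using Python's standard `sqlite3` module; the helper name `table_schema` is hypothetical, and the usage comment assumes the database file sits in the working directory under the name used in the command above.

```python
import sqlite3


def table_schema(conn, table):
    """Return the CREATE TABLE statement SQLite stored for `table`, or None."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row[0] if row else None


# Usage (assumes the .db file is present locally):
#   conn = sqlite3.connect(
#       "frwiktionary-20230501-pages-articles-multistream.xml-pcre_parsed.db")
#   print(table_schema(conn, "joined_trans"))
```

Reading from `sqlite_master` returns the same statement that `.schema` prints, which makes it easy to check column names before writing queries.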
## Num rows:
The table has 2,281,460 rows in total. A more accurate figure is the number of rows that actually carry a translation text: ```select count(*) from joined_trans where `text_transd` != 'NULL';```<br>
<b><ul>37010*</ul></b><br>
*This number is significantly smaller than an English dictionary, which usually has upwards of 60k words. It reflects the shortage of words that have been assigned a translation text on the French Wiktionary, as described in the paper and its referenced papers (download from [sci-hub.ee](https://sci-hub.ee/10.1109/i-Society.2016.7854182) to avoid the paywall).<br>
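
To reproduce both figures from a script, something like the following could work. It is a sketch only: `row_counts` is a hypothetical helper, and it reuses the same `!= 'NULL'` filter as the query above rather than guessing at how missing translations are stored.

```python
import sqlite3


def row_counts(conn):
    """Return (total_rows, rows_with_translation) for joined_trans."""
    total = conn.execute("SELECT count(*) FROM joined_trans").fetchone()[0]
    # Same filter as the query above: rows holding the literal string 'NULL'
    # are excluded. Real SQL NULL values also fail this comparison, so they
    # are excluded as well.
    translated = conn.execute(
        "SELECT count(*) FROM joined_trans WHERE text_transd != 'NULL'"
    ).fetchone()[0]
    return total, translated
```

Called on a connection to the published file, the two values should correspond to the total row count and the translated row count quoted above.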
# Garbled text or query returns no result
A lesson learned: when I imported the Wiktionary data I used phpMyAdmin, which on my system had a default collation of `latin1_swedish_ci`. To make this demo data match what is truly in the Wiktionary, I would have to work out what each special character from Wiktionary was mapped to under that collation, set the collation to UTF-8, and then replace those strings in my table with their actual special characters. More optimally, I could run through the whole process again from scratch with the correct collation, but why forgo the headaches of sorting out UTF quirks, parity notwithstanding ;)<br><br>
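
One possible shortcut for the repair is sketched below. The garbled title used in the example query on this card, `├¬tre` for être, is exactly what the UTF-8 bytes of `ê` (0xC3 0xAA) look like when displayed through code page 437 or 850, so re-encoding with that code page and decoding as UTF-8 recovers the original. This is an assumption drawn from that single example, not a confirmed description of how the data was damaged; `repair_mojibake` and its `wrong_codec` parameter are hypothetical names, and other rows may need `cp850` or a different codec.

```python
def repair_mojibake(text, wrong_codec="cp437"):
    """Undo UTF-8 bytes that were displayed through a single-byte code page.

    wrong_codec is an assumption ("cp850" is another plausible candidate).
    The input is returned unchanged if it does not round-trip cleanly.
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Usage: repair_mojibake("├¬tre") gives "être"
```

Returning the input unchanged on failure keeps already-correct titles intact, so the function can be applied to a whole column without first separating damaged rows from clean ones.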
# Example query
[SELECT *
FROM train
WHERE page_title = '├¬tre'
LIMIT 100;](https://huggingface.co/datasets/cargilcm/frwiktionary/viewer?views%5B%5D=train&sql=SELECT+*+%0AFROM+train+%0AWHERE+page_title+%3D+%27%E2%94%9C%C2%ACtre%27%0ALIMIT+100%3B) [query to view être; the title has to be written in its garbled form, `├¬tre`, because of the UTF-8/latin1_swedish_ci garbling]<br><br>
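
Until the titles are repaired, a lookup against the local SQLite file has to use the garbled spelling as well. The sketch below derives that spelling from the clean title and then queries `joined_trans`. It assumes the garbling is equivalent to reading UTF-8 bytes as code page 437, which matches the être example but is not confirmed for every character; `garble` and `find_entries` are hypothetical helpers.

```python
import sqlite3


def garble(title, wrong_codec="cp437"):
    """Return `title` as it appears when its UTF-8 bytes are read as wrong_codec."""
    return title.encode("utf-8").decode(wrong_codec)


def find_entries(conn, title, limit=100):
    """Look up rows in joined_trans by the garbled form of `title`."""
    return conn.execute(
        "SELECT * FROM joined_trans WHERE page_title = ? LIMIT ?",
        (garble(title), limit),
    ).fetchall()


# Usage: garble("être") gives "├¬tre", the spelling used in the query above.
```

Passing the title as a bound parameter avoids quoting problems with apostrophes, which are common in French headwords.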
I will likely have a go at reversing the garbling by string-replacing all the erroneous special characters on or before June 1, 2026, but as I'm a dad such plans are liable to be derailed; that's my disclaimer. I've dragged my feet on publishing this data since I completed it in 2023-2024 (the paper was completed in 2016), so I shall hasten to deliver the final product.
Provider:
cargilcm



