
cargilcm/frwiktionary

Hugging Face · updated 2026-01-16 · indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/cargilcm/frwiktionary
Description:
---
license: mit
language:
- en
- fr
tags:
- code
pretty_name: frwiktionary-20230501-pam-pcre_parsed.db
size_categories:
- 10K<n<100K
---

# Open database

`$ sqlite3 frwiktionary-20230501-pages-articles-multistream.xml-pcre_parsed.db`

# See create table schema

At the `sqlite>` prompt, run `.schema joined_trans`:

```
CREATE TABLE `joined_trans` (
  `id` integer NOT NULL
, `page_id` integer DEFAULT NULL
, `page_title` varchar(255) DEFAULT NULL
, `rev_id` integer DEFAULT NULL
, `rev_page` integer DEFAULT NULL
, `old_id` integer DEFAULT NULL
, `text_transd` varchar(1255) DEFAULT NULL
);
```

## Num rows

2,281,460, or more accurately, the result of ```select count(*) from joined_trans where `text_transd` != 'NULL';```:

**37010**\*

\*This number is significantly smaller than an English dictionary, which usually has upward of ~60k words. The gap reflects how few words are assigned a translation text on the French Wiktionary, as described in the paper and its referenced papers (download from [sci-hub.ee](https://sci-hub.ee/10.1109/i-Society.2016.7854182) to avoid the paywall).

# Garbled text or query returns no result

A lesson learned: when I imported the Wiktionary data, I used phpMyAdmin, which on my system had a default collation setting of `latin1_swedish_ci`. To make this demo/data match what's truly in the Wiktionary, I'd have to see what the special characters from Wiktionary were mapped to under `latin1_swedish_ci` and, after setting the collation to UTF, replace those strings in my table with their actual special characters. More optimally, I could run through the whole process again from scratch with the correct collation, but why forgo the headaches of sorting out UTF quirks ;)

# Example query

[SELECT * FROM train WHERE page_title = '├¬tre' LIMIT 100;](https://huggingface.co/datasets/cargilcm/frwiktionary/viewer?views%5B%5D=train&sql=SELECT+*+%0AFROM+train+%0AWHERE+page_title+%3D+%27%E2%94%9C%C2%ACtre%27%0ALIMIT+100%3B) [query to view être; the title is stored as `├¬tre` because of the utf/latin1_swedish_ci garbling]

I will likely have a go at reversing the garbling, by string-replacing all the erroneous special characters, on or before June 1, 2026. As I'm a Dad, such plans are liable to be derailed; that's my disclaimer. I've dragged my feet publishing this data since I completed it in 2023-2024, and the paper was completed in 2016, so I shall hasten to deliver the final product.
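
The garbling reversal described above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the author's method: the one sample in the example query, `├¬tre`, is what the UTF-8 bytes of `être` (`C3 AA` for `ê`) look like when read as a DOS code page such as CP850, so the sketch assumes that is the wrong codec. Other rows may have been damaged differently. The helper names `repair_mojibake` and `repair_titles` are hypothetical, and the demo runs on a small in-memory table rather than the real database file.

```python
import sqlite3


def repair_mojibake(text, wrong_codec="cp850"):
    """Undo a UTF-8-read-as-`wrong_codec` garbling, if the round trip applies.

    `wrong_codec` is an assumption inferred from a single sample; strings that
    cannot make the round trip are returned unchanged.
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def repair_titles(conn, wrong_codec="cp850"):
    """Rewrite garbled `page_title` values in `joined_trans`; return rows changed."""
    rows = conn.execute(
        "SELECT id, page_title FROM joined_trans WHERE page_title IS NOT NULL"
    ).fetchall()
    changed = 0
    for row_id, title in rows:
        fixed = repair_mojibake(title, wrong_codec)
        if fixed != title:
            conn.execute(
                "UPDATE joined_trans SET page_title = ? WHERE id = ?",
                (fixed, row_id),
            )
            changed += 1
    conn.commit()
    return changed


# Demo on an in-memory table with two of the real schema's columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE joined_trans ("
    "id integer NOT NULL, page_title varchar(255) DEFAULT NULL)"
)
conn.executemany(
    "INSERT INTO joined_trans VALUES (?, ?)",
    [(1, "├¬tre"), (2, "chat")],
)
changed = repair_titles(conn)
titles = [t for (t,) in conn.execute(
    "SELECT page_title FROM joined_trans ORDER BY id"
)]
print(changed, titles)
```

Working on a copy of the `.db` file and spot-checking a handful of repaired titles against the live French Wiktionary would be a sensible precaution before trusting the codec guess across all rows.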

Provider:
cargilcm