snagbreac/russian-reverse-dictionary-full-data
Hugging Face | Updated 2024-05-22 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/snagbreac/russian-reverse-dictionary-full-data
Description:
---
dataset_info:
  features:
  - name: word
    dtype: string
  - name: definition
    dtype: string
  - name: df
    dtype: string
  splits:
  - name: train
    num_bytes: 37600665
    num_examples: 295504
  download_size: 15206081
  dataset_size: 37600665
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
language:
- ru
---
This dataset contains the Russian-language data I collected for training reverse dictionaries. The data consists of Russian words and their definitions. Each word-definition pair is also labeled with its source, of which there are three:
- 'efremova' (circa 211K) refers to Efremova's New Explanatory-Morphological Dictionary (2000), an authoritative Russian dictionary chosen because it contains no usage examples (which made data collection easier) and covers a large number of words (circa 140K);
- 'codwoe' (circa 50K) refers to the dataset created by the organizers of the CODWOE (COmparing Definitions and WOrd Embeddings) track of SemEval-2022, available here: https://codwoe.atilf.fr/. This part of the dataset only contains definitions for nouns, verbs, adjectives and adverbs. Notably, the original dataset also contains (usually several) usage examples for every word; I have not retained them here, but if you need usage examples in your training (for instance, to generate embeddings), they are freely available at the link above;
- 'absite' (circa 35K) refers to absite.com, a Russian-language crossword website, from which I scraped words and their clues. Unlike the other parts of the dataset, 'absite' contains definitions for nouns only; but since the definitions here are crossword clues rather than dictionary definitions, they are written in a more everyday style of Russian, which is closer to how a hypothetical user of a reverse dictionary would likely phrase their queries.
There are circa 296K datapoints in total.
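As a rough illustration of the source breakdown above, the sketch below tallies rows per source label. It assumes the label is stored in the `df` column: the metadata lists `word`, `definition`, and `df`, but does not say which column holds the source. The helper names `load_full_data` and `count_sources` are hypothetical, and the sample rows are made up for demonstration rather than taken from the dataset.

```python
from collections import Counter

REPO_ID = "snagbreac/russian-reverse-dictionary-full-data"


def load_full_data():
    """Load the train split from the Hugging Face Hub.

    Needs the `datasets` package and network access, so it is not called below.
    """
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset(REPO_ID, split="train")


def count_sources(rows, source_column="df"):
    """Count rows per source label.

    `source_column="df"` is an assumption about the schema, not a documented fact.
    """
    return Counter(row[source_column] for row in rows)


# Made-up rows that mimic the assumed schema (not real dataset entries).
sample_rows = [
    {"word": "дом", "definition": "здание для жилья", "df": "efremova"},
    {"word": "кот", "definition": "домашнее животное", "df": "absite"},
    {"word": "бежать", "definition": "быстро передвигаться", "df": "codwoe"},
    {"word": "дом", "definition": "жилище", "df": "efremova"},
]

sample_counts = count_sources(sample_rows)
print(sample_counts.most_common())
```

On the real data, `count_sources(load_full_data())` should give roughly 211K, 50K, and 35K rows for the three labels if the assumption about `df` holds.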
Note: this dataset has _not_ been filtered to remove the dictionary definitions of words in the test data that I collected (available here: https://huggingface.co/datasets/snagbreac/russian-reverse-dictionary-test-data). This allows you to work with the full volume of data I collected; however, evaluating on that test data after training on this dataset may be ill-advised, as some of the test data is contained in this dataset. The filtered dataset is available here: https://huggingface.co/datasets/snagbreac/russian-reverse-dictionary-train-data.
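If you would rather filter the full data yourself, one possible approach is sketched below: drop every training row whose word also appears in the test data. This is not necessarily the procedure used to build the published filtered dataset. The function name `drop_test_words`, the case-insensitive matching, and the assumption that test words are available as plain strings are all illustrative choices, and the sample rows are made up.

```python
def drop_test_words(train_rows, test_words):
    """Return the training rows whose `word` does not occur among the test words.

    Matching is case-insensitive and ignores surrounding whitespace; both are
    assumptions made for this sketch.
    """
    held_out = {word.strip().lower() for word in test_words}
    return [
        row for row in train_rows
        if row["word"].strip().lower() not in held_out
    ]


# Made-up rows and test words for demonstration only.
train_rows = [
    {"word": "дом", "definition": "здание для жилья", "df": "efremova"},
    {"word": "кот", "definition": "домашнее животное", "df": "absite"},
    {"word": "Дом", "definition": "жилище", "df": "codwoe"},
]
test_words = ["дом"]

filtered_rows = drop_test_words(train_rows, test_words)
print([row["word"] for row in filtered_rows])
```

Filtering by word removes every definition of a held-out word regardless of source, which is stricter than filtering dictionary definitions only; adjust the condition if you want to keep some sources.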
I sincerely hope that someone finds this dataset useful for training reverse dictionaries, both Russian-language and multilingual.
Provider:
snagbreac
Summary of original information
Dataset overview
Dataset information
- Features:
  - word: string
  - definition: string
  - df: string
- Splits:
  - train: 37600665 bytes, 295504 examples
- Download size: 15206081 bytes
- Dataset size: 37600665 bytes
- Configs:
  - default: train data files at data/train-*
- License: MIT
- Language: Russian
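Dataset contents
- The dataset contains Russian words and their definitions; each word-definition pair is labeled with its source:
  - efremova: about 211K samples from Efremova's New Explanatory-Morphological Dictionary (2000), chosen because it contains no usage examples (which made data collection easier) and covers a large vocabulary (about 140K words).
  - codwoe: about 50K samples from the dataset created by the organizers of the CODWOE track of SemEval-2022; it contains definitions for nouns, verbs, adjectives and adverbs only.
  - absite: about 35K samples from the Russian-language crossword website absite.com; it contains definitions for nouns only, written in an everyday style of Russian.
- About 296K samples in total.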
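Notes
- This dataset has not been filtered to remove the dictionary definitions of words in the test data, so it may contain some of the test samples. The filtered dataset is available at https://huggingface.co/datasets/snagbreac/russian-reverse-dictionary-train-data.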



