snagbreac/russian-reverse-dictionary-full-data

Name: snagbreac/russian-reverse-dictionary-full-data
Creator: snagbreac
Published: 2024-05-22 12:22:49
License: MIT

Source: Hugging Face. Updated 2024-05-22; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/snagbreac/russian-reverse-dictionary-full-data

Resource description:

---
dataset_info:
  features:
  - name: word
    dtype: string
  - name: definition
    dtype: string
  - name: df
    dtype: string
  splits:
  - name: train
    num_bytes: 37600665
    num_examples: 295504
  download_size: 15206081
  dataset_size: 37600665
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
language:
- ru
---

This dataset contains the Russian-language data I collected for training reverse dictionaries. The data consists of Russian words and their definitions. Each word-definition pair is also labeled with its source, of which there are three:

- 'efremova' (circa 211K) refers to Efremova's New Explanatory-Morphological Dictionary (2000), an authoritative Russian dictionary that was chosen for its lack of examples (allowing for easier data collection) and the large number of words represented (circa 140K);
- 'codwoe' (circa 50K) refers to the dataset created by the organizers of the CODWOE (COmparing Definitions and WOrd Embeddings) track of SemEval-2022, available here: https://codwoe.atilf.fr/. This part of the dataset only contains definitions for nouns, verbs, adjectives and adverbs. Notably, the original dataset also contains (usually several) examples of use for every word; I have not retained them here, but if you need examples of use in your training (for instance to generate embeddings) they are freely available there;
- 'absite' (circa 35K) refers to absite.com, a Russian-language crossword website, from which I scraped words and their clues. Unlike the other parts of the dataset, 'absite' contains only definitions for nouns; but since the definitions here are crossword clues and not dictionary definitions, they are written in a more everyday style of Russian, which corresponds to how a hypothetical user of a reverse dictionary would likely phrase their queries.

There are circa 296K datapoints in total.

Note: the dictionary definitions of words in the test data that I collected (available here: https://huggingface.co/datasets/snagbreac/russian-reverse-dictionary-test-data) have _not_ been filtered out of this dataset. This allows you to work with the full volume of data I collected; however, evaluating on the test data after training on this dataset is ill-advised, as some of the test data is contained in it. The filtered dataset is available here: https://huggingface.co/datasets/snagbreac/russian-reverse-dictionary-train-data. I sincerely hope that someone finds this dataset useful for training reverse dictionaries, both Russian-language and multilingual.
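
The overlap described in the note can be illustrated with a small sketch. The card does not say exactly how the filtered dataset was produced, so this is only one plausible approach: drop every row whose word appears in the test set, ignoring case. The function name `drop_test_words` and the sample rows are invented for illustration and are not taken from the dataset.

```python
def drop_test_words(rows, test_words):
    """Return the rows whose 'word' is not among test_words (case-insensitive).

    Assumption: overlap is decided by the word alone; the dataset author's
    actual filtering criterion is not documented in the card.
    """
    blocked = {w.strip().lower() for w in test_words}
    return [row for row in rows if row["word"].strip().lower() not in blocked]


# Invented rows that only mimic the (word, definition, df) schema.
sample_rows = [
    {"word": "дом", "definition": "Здание для жилья.", "df": "efremova"},
    {"word": "кот", "definition": "Домашнее животное.", "df": "absite"},
    {"word": "Дом", "definition": "Жилище.", "df": "codwoe"},
]

filtered = drop_test_words(sample_rows, ["дом"])
print([row["word"] for row in filtered])  # ['кот']
```

If the goal is simply a leakage-free training set, using the author's already filtered train-data repository is probably the safer choice than re-deriving the filter.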

Provided by:

snagbreac

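Original information summary

Dataset overview

Dataset information

Features:
- word: type string
- definition: type string
- df: type string
Splits:
- train: 37600665 bytes, 295504 examples
Download size: 15206081 bytes
Dataset size: 37600665 bytes
Configs:
- the default config contains the train data files, at path data/train-*
License: MIT
Language: Russian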
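
As a minimal sketch of how the single `train` split might be loaded, the snippet below uses `load_dataset` from the Hugging Face `datasets` library with the repository ID from this page. The helper names `load_full_data` and `has_expected_schema` are hypothetical, and the download itself needs the `datasets` package and network access, so it is wrapped in a function rather than run at import time.

```python
REPO_ID = "snagbreac/russian-reverse-dictionary-full-data"
EXPECTED_COLUMNS = ("word", "definition", "df")


def load_full_data():
    """Download the train split (requires `datasets` and network access)."""
    # Imported lazily so the schema helper below works without the package.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train")


def has_expected_schema(row):
    """Check that a row has exactly the three string features listed above."""
    return set(row) == set(EXPECTED_COLUMNS) and all(
        isinstance(row[column], str) for column in EXPECTED_COLUMNS
    )
```

After `ds = load_full_data()`, the value of `ds.num_rows` can be compared with the 295504 examples stated in the card, and `has_expected_schema(ds[0])` gives a quick sanity check on the columns.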

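Dataset contents

The dataset contains Russian words and their definitions; each word-definition pair is labeled with its source:
- efremova: about 211K examples, from Efremova's New Explanatory-Morphological Dictionary (2000), chosen for its lack of usage examples (which made data collection easier) and its large vocabulary (about 140K words).
- codwoe: about 50K examples, from the dataset created by the organizers of the CODWOE track of SemEval-2022; it contains definitions only for nouns, verbs, adjectives and adverbs.
- absite: about 35K examples, from the Russian-language crossword website absite.com; it contains definitions of nouns, written in an everyday style of Russian.
About 296K examples in total.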
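
The rounded per-source figures add up as expected: 211K + 50K + 35K = 296K, consistent with the 295504 examples in the train split. A short sketch of how the breakdown could be recomputed from the `df` column follows; the function names and the sample rows are invented for illustration, and the real rows would come from the loaded dataset.

```python
from collections import Counter


def count_by_source(rows):
    """Count rows per source label stored in the 'df' column."""
    return Counter(row["df"] for row in rows)


def source_shares(rows):
    """Return each source's fraction of all rows."""
    counts = count_by_source(rows)
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}


# Invented rows that only mimic the (word, definition, df) schema.
sample_rows = [
    {"word": "дом", "definition": "Здание для жилья.", "df": "efremova"},
    {"word": "бег", "definition": "Быстрое передвижение.", "df": "efremova"},
    {"word": "кот", "definition": "Домашнее животное.", "df": "absite"},
    {"word": "сад", "definition": "Участок с деревьями.", "df": "codwoe"},
]

print(count_by_source(sample_rows)["efremova"])  # 2
print(source_shares(sample_rows)["absite"])      # 0.25
```

The same two functions accept any iterable of dict-like rows, so they should also work on a dataset object that yields rows as dictionaries.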

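Notes

The dictionary definitions of words in the test data have not been filtered out of this dataset, so it may contain some of the test samples. The filtered dataset is available here: https://huggingface.co/datasets/snagbreac/russian-reverse-dictionary-train-data.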
