Maximax67/English-Valid-Words

Name: Maximax67/English-Valid-Words
Creator: Maximax67
Published: 2024-04-09 09:43:57
License: No description provided

Hugging Face, updated 2024-04-09, indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/Maximax67/English-Valid-Words


Resource description:

---
license: unlicense
language:
- en
pretty_name: English Valid Words List
size_categories:
- 100K<n<1M
configs:
- config_name: sorted_by_frequency
  data_files: "valid_words_sorted_by_frequency.csv"
- config_name: sorted_alphabetically
  data_files: "valid_words_sorted_alphabetically.csv"
- config_name: valid_words
  data_files: "valid_words.txt"
---

# English Valid Words

This repository contains CSV files with valid English words along with their frequency, stem, and stem valid probability.

Dataset Github link: https://github.com/Maximax67/English-Valid-Words

## Files included

1. **valid_words_sorted_alphabetically.csv**:
    * N: Counter for each word entry.
    * Word: The English word itself.
    * Frequency count: The number of occurrences of the word in the 1-grams dataset.
    * Stem: The stem of the word.
    * Stem valid probability: Probability indicating the validity of the stem within the English language.

2. **valid_words_sorted_by_frequency.csv**:
    * Rank: The ranking of the word based on its frequency count.
    * Word: The English word.
    * Frequency count: The number of occurrences of the word in the 1-grams dataset.
    * Stem: The stem of the word.
    * Stem valid probability: Probability indicating the validity of the stem within the English language.

3. **valid_words.txt**: Text file containing the valid words, one word per line for convenient reading and use.

## Data Collection Process

To curate a comprehensive dataset of valid English words, the following steps were undertaken:

1. **Initial Dataset**: I was searching for a list of valid English words for my personal project and found [this GitHub repo](https://github.com/dwyl/english-words). However, a filtering process was necessary to refine the dataset to meet my project specifications.

2. **Words Filtering**: I wrote the Words-filter.ipynb notebook to remove words with non-alphabetical characters and words exceeding 25 characters.

3. **Frequency Data Collection**: To enrich the dataset with frequency information, the 1-grams dataset provided by Google was employed. Words with a frequency count less than 10,000 were removed.

4. **Stemming and Probability Calculation**: I used NLTK's Porter, Lancaster, and Snowball stemmers, along with a custom prefix stemmer. For each word, the stem chosen was the one with the highest frequency among all stemmers' outputs that also existed in the dataset. Additionally, the probability of stem validity was calculated based on the frequencies of the original word and its stem.

For further insights into the data curation process, please refer to the Valid-Word-List-Maker.ipynb file.

## License

This repository is released under the Unlicense. You are free to use, modify, and distribute the contents of this repository for any purpose without any restrictions.

## Acknowledgments

I would like to acknowledge the contributions of the following resources:

- [Word list by infochimps (archived)](https://web.archive.org/web/20131118073324/https://www.infochimps.com/datasets/word-list-350000-simple-english-words-excel-readable)
- [English words github repo by dwyl](https://github.com/dwyl/english-words)
- [The Google Books Ngram Viewer (used 1-grams dataset, version 20200217)](https://books.google.com/ngrams/)
- [NLTK (Natural Language Toolkit)](https://www.nltk.org/)
- [WordNet](https://wordnet.princeton.edu/)

Provider:

Maximax67

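Summary of original information

English Valid Words

Dataset overview

The dataset contains CSV files of valid English words together with their frequency, stem, and stem valid probability.

File list

valid_words_sorted_alphabetically.csv:
- N: Counter for each word entry.
- Word: The English word itself.
- Frequency count: The number of occurrences of the word in the 1-grams dataset.
- Stem: The stem of the word.
- Stem valid probability: Probability indicating the validity of the stem within the English language.
valid_words_sorted_by_frequency.csv:
- Rank: The ranking of the word based on its frequency count.
- Word: The English word.
- Frequency count: The number of occurrences of the word in the 1-grams dataset.
- Stem: The stem of the word.
- Stem valid probability: Probability indicating the validity of the stem within the English language.
valid_words.txt: Text file containing the valid words, one word per line for convenient reading and use.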
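As a rough illustration of how the CSV layout described above might be consumed, the following sketch parses rows with Python's standard `csv` module. The header names are assumed to match the column descriptions exactly, the sample rows are made up for demonstration and are not taken from the dataset, and `read_words` is a hypothetical helper name.

```python
import csv
import io

# Illustrative rows only: the values are invented, not copied from the dataset.
# The header names are assumed to match the column descriptions exactly.
SAMPLE_CSV = """Rank,Word,Frequency count,Stem,Stem valid probability
1,alpha,3000,alpha,1.0
2,betas,2000,beta,0.5
"""


def read_words(handle):
    """Parse rows from a file-like object into dicts with typed fields."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "word": row["Word"],
            "frequency": int(row["Frequency count"]),
            "stem": row["Stem"],
            "stem_valid_probability": float(row["Stem valid probability"]),
        })
    return rows


rows = read_words(io.StringIO(SAMPLE_CSV))
print(rows[0]["word"], rows[0]["frequency"])
```

To read the real file instead of the in-memory sample, the same function could be given `open("valid_words_sorted_by_frequency.csv", newline="", encoding="utf-8")`, assuming the file has been downloaded locally and is UTF-8 encoded.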

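Data collection process

To build a comprehensive dataset of valid English words, the following steps were taken:

Initial dataset: An initial word list was taken from a GitHub repository and filtered to meet the project's requirements.
Word filtering: The Words-filter.ipynb notebook removes words containing non-alphabetic characters and words longer than 25 characters.
Frequency data collection: Using the 1-grams dataset provided by Google, words with a frequency count below 10,000 were removed.
Stemming and probability calculation: NLTK's Porter, Lancaster, and Snowball stemmers, together with a custom prefix stemmer, were used to obtain the stem with the highest frequency among all stemmers, and the stem valid probability was then calculated.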
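The filtering and stem-selection steps above can be sketched roughly as follows. This is not the code from the author's notebooks: the helper names `keep_word` and `pick_stem` are hypothetical, "alphabetic" is assumed to mean ASCII letters only, falling back to the word itself when no candidate stem is known is an assumption, and the candidate stems are passed in by hand rather than produced by NLTK. The stem valid probability is left out because the summary does not give its formula.

```python
MAX_LENGTH = 25          # words longer than this are dropped
MIN_FREQUENCY = 10_000   # words with a lower 1-gram count are dropped


def keep_word(word, frequency):
    """Return True if the word passes the character, length, and frequency filters."""
    return (
        word.isascii()
        and word.isalpha()
        and len(word) <= MAX_LENGTH
        and frequency >= MIN_FREQUENCY
    )


def pick_stem(word, candidates, frequencies):
    """Pick the candidate stem with the highest frequency that exists in the word list.

    Falls back to the word itself when no candidate is present (an assumption).
    """
    known = [c for c in candidates if c in frequencies]
    if not known:
        return word
    return max(known, key=lambda c: frequencies[c])


# Made-up counts for demonstration; not real 1-gram frequencies.
counts = {"walk": 50_000, "walking": 20_000, "wal": 12_000, "x-ray": 90_000, "rare": 50}
valid = {w: f for w, f in counts.items() if keep_word(w, f)}

# In the described process the candidates would come from several stemmers;
# here they are written out by hand.
stem = pick_stem("walking", ["walk", "wal", "walkin"], valid)
print(sorted(valid), stem)
```

Passing the candidates in explicitly keeps the sketch independent of any particular stemmer, so the selection rule can be checked on its own.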

License

The dataset is released under the Unlicense. You are free to use, modify, and distribute its contents without any restrictions.
