AiresPucrs/stopwords-pt

Name: AiresPucrs/stopwords-pt
Creator: AiresPucrs
Published: 2024-10-13 20:08:44
License: not specified

Source: Hugging Face (updated 2024-10-13, indexed 2024-06-11)

Download link:

https://hf-mirror.com/datasets/AiresPucrs/stopwords-pt

Description:

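The stopwords-pt dataset contains a list of stopwords commonly used in Portuguese. These words usually carry little signal in text classification tasks, so they are typically removed during preprocessing and when training shallow models. The dataset has a single column containing every letter of the Roman alphabet, the numbers 1 to 10, and common Portuguese words such as "de", "que", "em", and "para".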
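As a rough illustration of what removing stopwords during preprocessing looks like, here is a minimal sketch. The `SAMPLE_STOPWORDS` set and the `remove_stopwords` helper are hypothetical names introduced for this example, and the set holds only the four words quoted above, not the full list from the dataset.

```python
import re

# Illustrative subset only: the four example words quoted in the description.
# The full list would come from the dataset's single column.
SAMPLE_STOPWORDS = {"de", "que", "em", "para"}


def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(remove_stopwords("casa de pedra em Lisboa"))
```

In practice the sample set would be replaced by the full stopword list, converted to a `set` so that membership checks stay fast.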

Provider:

AiresPucrs

Summary of original information

Dataset overview

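```python
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the stopword column and convert it to a plain list for scikit-learn.
stopwords = list(load_dataset("AiresPucrs/stopwords-pt", split="train")["stopwords"])

vectorizer = TfidfVectorizer(min_df=10, max_features=100000, analyzer="word",
                             ngram_range=(1, 2), stop_words=stopwords, lowercase=True)

# `dataset` is your own corpus with a "text" column; it is not defined in this snippet.
vectorizer.fit(dataset["text"])
```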
