Finnish-NLP/Reddit_fi_2006_2022

Name: Finnish-NLP/Reddit_fi_2006_2022
Creator: Finnish-NLP
Published: 2023-11-26 09:06:04
License: No description available

Source: Hugging Face. Updated 2023-11-26. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Finnish-NLP/Reddit_fi_2006_2022

Resource description:

---
dataset_info:
  features:
  - name: subreddit
    dtype: string
  - name: created_utc
    dtype: int64
  - name: score
    dtype: int32
  - name: body
    dtype: string
  - name: predicted_language
    dtype: string
  - name: probability
    dtype: float64
  - name: year
    dtype: float64
  - name: day
    dtype: float64
  - name: month
    dtype: float64
  - name: time
    dtype: string
  - name: label_identity_attack
    dtype: float64
  - name: label_insult
    dtype: float64
  - name: label_obscene
    dtype: float64
  - name: label_severe_toxicity
    dtype: float64
  - name: label_threat
    dtype: float64
  - name: label_toxicity
    dtype: float64
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 1878988954
    num_examples: 4524360
  download_size: 1059710799
  dataset_size: 1878988954
license: apache-2.0
task_categories:
- text-generation
- conversational
language:
- fi
tags:
- social
- reddit
- Finnish
size_categories:
- 1M<n<10M
---

# Dataset Card for "Reddit_fi_2006_2022"

## Dataset Description

- **Point of Contact:** [RASMUS](https://www.linkedin.com/in/rasmustoivanen/)
- **Size of the CSV files on disk:** 1542.75 MB
- **Size of the generated parquet files:** 970 MB

### Dataset Summary

Reddit_fi is a filtered and post-processed corpus consisting of comments from [Reddit](https://reddit.com/).

A word of caution at this stage, however: unlike in ScandiReddit, subreddits were not filtered to exclude those that could contain hate speech, toxicity, or bias. Be careful when training language models with this data, and curate your dataset properly.

All Reddit comments from January 2006 up until December 2022 were downloaded through [PushShift](https://files.pushshift.io/reddit/comments/), after which they were filtered with the FastText language detection model, using a confidence score of 70% as the threshold. We also filtered out messages whose `body` field is shorter than 30 characters. After these filters we end up with 4 524 360 unique messages.
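The two filters described in the summary can be sketched roughly as follows. This is only an illustration, not the curator's actual pipeline: the function name `keep_comment` is hypothetical, the label string `__label__fi` is taken from the card's example record, and treating both thresholds as inclusive is an assumption, since the card does not say how boundary cases were handled.

```python
MIN_CONFIDENCE = 0.70          # "confidence score of 70%" from the summary
MIN_BODY_CHARS = 30            # messages shorter than this were dropped
FINNISH_LABEL = "__label__fi"  # label format seen in the card's example record


def keep_comment(body: str, predicted_language: str, probability: float) -> bool:
    """Return True if a comment passes the language and length filters.

    Assumption: both thresholds are inclusive; the card does not specify this.
    """
    if predicted_language != FINNISH_LABEL:
        return False
    if probability < MIN_CONFIDENCE:
        return False
    # Length is measured on the raw body text, in characters.
    return len(body) >= MIN_BODY_CHARS


# The card's example record passes both filters.
example_body = "oatlyn iKaffe on maitoa parempaa kahvissa, en jois pelkästään kuitenkaan"
print(keep_comment(example_body, "__label__fi", 0.9783772230148317))
```

Because `predicted_language` and `probability` are stored in every record, a stricter confidence threshold can be applied to the published data without rerunning language detection.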
This project was inspired by [alexandrainst/scandi-reddit](https://huggingface.co/datasets/alexandrainst/scandi-reddit) and its creator, https://www.saattrupdan.com/. Kudos to you!

### Filtering disclaimer: toxicity and bias

The dataset is provided as is and very likely includes toxic, biased, or otherwise problematic material. You should carefully curate this dataset for your needs. To label toxic messages, we used the Finnish toxicity classifier [TurkuNLP/bert-large-finnish-cased-toxicity](https://huggingface.co/TurkuNLP/bert-large-finnish-cased-toxicity) released by TurkuNLP. This dataset includes 6 different toxicity labels with their predicted scores for each message. You can use those labels and scores to filter out toxic messages.

We evaluated subreddits with over 500 messages and, based on a quick analysis, suggest filtering out the following: [FinlandOnlyfans, Warframe, Finnishbitches, vitunluurangot, WTF, SaatananTeletapit, FinnishWhores, pics, iidapiiroinen123, okkamuretardi, FinnishGenderCritical, onlyfanssuomi, SuomiBannatut, jumalattaret, jumalattaret2, jumalattaretPro, HommaInAction, snappisensuroimaton]

### Supported Tasks and Leaderboards

Training language models is the intended task for this dataset. You can also use this dataset for various data analysis tasks.

### Languages

The dataset is available in Finnish.

### Data Instances

An example from the dataset looks as follows.

```
{
  "subreddit": "arkisuomi",
  "created_utc": 1671152007,
  "score": 1,
  "body": "oatlyn iKaffe on maitoa parempaa kahvissa, en jois pelkästään kuitenkaan",
  "predicted_language": "__label__fi",
  "probability": 0.9783772230148317,
  "year": 2022.0,
  "day": 16.0,
  "month": 12.0,
  "time": "00:53:27",
  "label_identity_attack": 0.00018978118896484375,
  "label_insult": 0.00058746337890625,
  "label_obscene": 0.00142669677734375,
  "label_severe_toxicity": 6.723403930664062e-05,
  "label_threat": 0.0004100799560546875,
  "label_toxicity": 0.01025390625
}
```

### Data Fields

The data fields are the same among all splits.

- `subreddit`: `string`
- `created_utc`: `int64`
- `score`: `int32`
- `body`: `string`
- `predicted_language`: `string`
- `probability`: `float64`
- `year`: `float64`
- `day`: `float64`
- `month`: `float64`
- `time`: `string`
- `label_identity_attack`: `float64`
- `label_insult`: `float64`
- `label_obscene`: `float64`
- `label_severe_toxicity`: `float64`
- `label_threat`: `float64`
- `label_toxicity`: `float64`

### Language Distribution

- fi: 4,561,192

### Top-5 Subreddit Distribution

- Suomi: 3 601 806
- snappijuorut: 483 558
- LakkoPostaukset: 58 613
- snappisensuroimaton: 56 157
- mina_irl: 50 696

## Dataset Creation

### Curation Rationale

The Finnish language does not have many open-source social media datasets. One notable dataset is Suomi24, but it has restricted access.

### Source Data

The raw Reddit data was collected through [PushShift](https://files.pushshift.io/reddit/comments/).

## Additional Information

1. Edit on 11/25/2023: Added the missing data for October 2021. User @sannamyl found that I had missed October 2021 in the initial processing. I had deleted the original source files, but I was able to retrieve the October 2021 source file and redo the processing. It was added to the dataset and uploaded on 11/25/2023.
2. Edit on 11/26/2023: I spotted that I had forgotten to add the toxicity predictions and had accidentally overwritten them. I took the previous dataset version with the toxicity predictions from the history, ran the predictions on the October 2021 data, and then combined and re-uploaded everything.

### Dataset Curators

[Rasmus Toivanen](https://www.linkedin.com/in/rasmustoivanen/) curated this dataset.

### Licensing Information

The dataset is licensed under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).

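Provided by:

Finnish-NLP

Summary of original information

Dataset overview

Dataset name: Reddit_fi_2006_2022

Dataset description: Reddit_fi is a filtered and post-processed corpus of comments collected from Reddit. The dataset covers all Reddit comments from January 2006 to December 2022. After being downloaded through PushShift, the comments were filtered with the FastText language detection model, keeping only those predicted to be Finnish with a confidence above 70%. Messages shorter than 30 characters were also excluded.

Dataset features:

subreddit: string
created_utc: integer
score: integer
body: string
predicted_language: string
probability: float
year: float
day: float
month: float
time: string
label_identity_attack: float
label_insult: float
label_obscene: float
label_severe_toxicity: float
label_threat: float
label_toxicity: float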
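The `year`, `month`, `day`, and `time` features appear to be derived from `created_utc`. The card does not say how they were computed, so the following is only a sketch under the assumption that the timestamp is interpreted as seconds since the epoch in UTC; the helper name `derive_time_fields` is hypothetical. The example record in the card (`created_utc` 1671152007, `time` "00:53:27", 16.12.2022) is consistent with that reading.

```python
from datetime import datetime, timezone


def derive_time_fields(created_utc: int) -> dict:
    """Split a Unix timestamp into the date/time features listed above.

    Assumption: the timestamp is in seconds and interpreted as UTC.
    """
    dt = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return {
        # The dataset stores these three as floats, so mirror that here.
        "year": float(dt.year),
        "month": float(dt.month),
        "day": float(dt.day),
        "time": dt.strftime("%H:%M:%S"),
    }


fields = derive_time_fields(1671152007)
print(fields)
# {'year': 2022.0, 'month': 12.0, 'day': 16.0, 'time': '00:53:27'}
```

Since the three date features are stored as floats, casting them to integers after loading is usually convenient for grouping by year or month.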

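Dataset size:

Training set size: 4,524,360 records
Total dataset size: 1,878,988,954 bytes
Download size: 1,059,710,799 bytes

License: Apache-2.0

Language: Finnish

Task categories:

Text generation
Conversational

Dataset usage: Mainly intended for training language models; it can also be used for various data analysis tasks.

Notes: The dataset may contain toxic, biased, and similar material and should be curated carefully before use. Six toxicity labels with their predicted scores are provided for filtering out toxic messages.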
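A minimal sketch of such filtering is shown below, working on plain dictionaries shaped like the example record in the dataset card. The subreddit set is the list suggested in the card's filtering disclaimer. The cutoff `max_score=0.5` and the function name `is_clean` are assumptions made for illustration; the card does not recommend a specific threshold, so it should be tuned for the use case.

```python
# The six score fields provided with every record.
TOXICITY_LABELS = (
    "label_identity_attack", "label_insult", "label_obscene",
    "label_severe_toxicity", "label_threat", "label_toxicity",
)

# Subreddits the dataset card suggests filtering out.
EXCLUDED_SUBREDDITS = {
    "FinlandOnlyfans", "Warframe", "Finnishbitches", "vitunluurangot", "WTF",
    "SaatananTeletapit", "FinnishWhores", "pics", "iidapiiroinen123",
    "okkamuretardi", "FinnishGenderCritical", "onlyfanssuomi", "SuomiBannatut",
    "jumalattaret", "jumalattaret2", "jumalattaretPro", "HommaInAction",
    "snappisensuroimaton",
}


def is_clean(record: dict, max_score: float = 0.5) -> bool:
    """Keep a record only if its subreddit is allowed and every score is low.

    Assumption: max_score=0.5 is an arbitrary illustrative cutoff.
    """
    if record["subreddit"] in EXCLUDED_SUBREDDITS:
        return False
    return all(record[label] <= max_score for label in TOXICITY_LABELS)


# Scores copied from the example record in the dataset card.
example = {
    "subreddit": "arkisuomi",
    "label_identity_attack": 0.00018978118896484375,
    "label_insult": 0.00058746337890625,
    "label_obscene": 0.00142669677734375,
    "label_severe_toxicity": 6.723403930664062e-05,
    "label_threat": 0.0004100799560546875,
    "label_toxicity": 0.01025390625,
}
print(is_clean(example))  # True
```

Checking all six labels rather than only `label_toxicity` is a conservative choice; a looser filter could test a single label or use per-label thresholds.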

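Dataset creation: The dataset was curated by Rasmus Toivanen, and the raw data was collected through PushShift. Its creation was inspired by alexandrainst/scandi-reddit.

License information: The dataset is released under the CC BY 4.0 license.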
