Finnish-NLP/Reddit_fi_2006_2022
Source: Hugging Face. Updated 2023-11-26; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Finnish-NLP/Reddit_fi_2006_2022
---
dataset_info:
  features:
  - name: subreddit
    dtype: string
  - name: created_utc
    dtype: int64
  - name: score
    dtype: int32
  - name: body
    dtype: string
  - name: predicted_language
    dtype: string
  - name: probability
    dtype: float64
  - name: year
    dtype: float64
  - name: day
    dtype: float64
  - name: month
    dtype: float64
  - name: time
    dtype: string
  - name: label_identity_attack
    dtype: float64
  - name: label_insult
    dtype: float64
  - name: label_obscene
    dtype: float64
  - name: label_severe_toxicity
    dtype: float64
  - name: label_threat
    dtype: float64
  - name: label_toxicity
    dtype: float64
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 1878988954
    num_examples: 4524360
  download_size: 1059710799
  dataset_size: 1878988954
license: apache-2.0
task_categories:
- text-generation
- conversational
language:
- fi
tags:
- social
- reddit
- Finnish
size_categories:
- 1M<n<10M
---
# Dataset Card for "Reddit_fi_2006_2022"
## Dataset Description
- **Point of Contact:** [RASMUS](https://www.linkedin.com/in/rasmustoivanen/)
- **Size of the CSV files on disk:** 1542.75 MB
- **Size of the generated parquet files:** 970 MB
### Dataset Summary
Reddit_fi is a filtered and post-processed corpus consisting of comments from [Reddit](https://reddit.com/).
A word of caution: unlike in ScandiReddit, no subreddits were excluded, so the corpus may contain hate speech, toxicity, and biased content. Be careful when training language models on this data, and curate your dataset properly.
All Reddit comments from January 2006 through December 2022 were downloaded through [PushShift](https://files.pushshift.io/reddit/comments/) and then filtered with the FastText language detection model, using a confidence score of 70% as the limit.
We also filter out messages whose `body` field is shorter than 30 characters.
After these filters, we end up with 4 524 360 unique messages.
This project was inspired by the [ScandiReddit dataset](https://huggingface.co/datasets/alexandrainst/scandi-reddit) and its creator (https://www.saattrupdan.com/). Kudos to you!
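The filtering steps described above can be sketched roughly as follows. This is a minimal illustration rather than the curator's actual pipeline: the function names are hypothetical, the `__label__fi` label format follows the example record in this card, and treating both limits as inclusive and deduplicating on the `body` text are assumptions, since the card does not say how uniqueness was determined.

```python
from typing import Iterable, Iterator

MIN_PROBABILITY = 0.70         # FastText confidence limit stated in the card
MIN_BODY_CHARS = 30            # minimum comment length stated in the card
FINNISH_LABEL = "__label__fi"  # label format seen in the card's example record


def keep_comment(record: dict) -> bool:
    """Return True if a record passes the language and length filters."""
    return (
        record["predicted_language"] == FINNISH_LABEL
        and record["probability"] >= MIN_PROBABILITY  # assumed inclusive
        and len(record["body"]) >= MIN_BODY_CHARS
    )


def filter_comments(records: Iterable[dict]) -> Iterator[dict]:
    """Yield comments that pass the filters, skipping repeated bodies.

    Deduplicating on the body text is an assumption made for this sketch.
    """
    seen = set()
    for record in records:
        if keep_comment(record) and record["body"] not in seen:
            seen.add(record["body"])
            yield record
```

Because `filter_comments` is a generator, it can be applied to a stream of records without holding the whole corpus in memory, although the `seen` set still grows with the number of distinct comments.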
### Filtering disclaimer. Toxicity and bias
The dataset is provided as is and very likely includes toxic, biased, and similar material. You should carefully curate this dataset for your needs. To label toxic messages, we used the Finnish toxicity classifier [TurkuNLP/bert-large-finnish-cased-toxicity](https://huggingface.co/TurkuNLP/bert-large-finnish-cased-toxicity) released by TurkuNLP. The dataset includes 6 different toxicity labels with their predicted scores for each message. You can use those labels and scores to filter out toxic messages.
We evaluated subreddits with more than 500 messages and, based on a quick analysis, suggest filtering out the following:
[FinlandOnlyfans,
Warframe,
Finnishbitches,
vitunluurangot,
WTF,
SaatananTeletapit,
FinnishWhores,
pics,
iidapiiroinen123,
okkamuretardi,
FinnishGenderCritical,
onlyfanssuomi,
SuomiBannatut,
jumalattaret,
jumalattaret2,
jumalattaretPro,
HommaInAction,
snappisensuroimaton]
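One way to apply the subreddit list and the toxicity scores together is sketched below. The subreddit names and label names are taken from this card; the helper name `is_clean` and the `max_score` threshold of 0.5 are hypothetical choices for illustration, and a suitable threshold depends on your use case.

```python
# Subreddits the card suggests filtering out.
EXCLUDED_SUBREDDITS = {
    "FinlandOnlyfans", "Warframe", "Finnishbitches", "vitunluurangot", "WTF",
    "SaatananTeletapit", "FinnishWhores", "pics", "iidapiiroinen123",
    "okkamuretardi", "FinnishGenderCritical", "onlyfanssuomi", "SuomiBannatut",
    "jumalattaret", "jumalattaret2", "jumalattaretPro", "HommaInAction",
    "snappisensuroimaton",
}

# The six toxicity score columns in the dataset.
TOXICITY_LABELS = (
    "label_identity_attack", "label_insult", "label_obscene",
    "label_severe_toxicity", "label_threat", "label_toxicity",
)


def is_clean(record: dict, max_score: float = 0.5) -> bool:
    """Return True if the record is outside the excluded subreddits and
    every toxicity score is at or below max_score (an assumed threshold)."""
    if record["subreddit"] in EXCLUDED_SUBREDDITS:
        return False
    return all(record[label] <= max_score for label in TOXICITY_LABELS)
```

Checking all six labels is the stricter option; filtering on `label_toxicity` alone would be a looser alternative.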
### Supported Tasks and Leaderboards
Training language models is the intended task for this dataset.
You can also use this dataset for various data analysis tasks.
### Languages
The dataset is in Finnish.
### Data Instances
An example from the dataset looks as follows.
```
{
"subreddit": "arkisuomi",
"created_utc": 1671152007,
"score": 1,
"body": "oatlyn iKaffe on maitoa parempaa kahvissa, en jois pelkästään kuitenkaan",
"predicted_language": "__label__fi",
"probability": 0.9783772230148317,
"year": 2022.0,
"day": 16.0,
"month": 12.0,
"time": "00:53:27",
"label_identity_attack": 0.00018978118896484375,
"label_insult": 0.00058746337890625,
"label_obscene": 0.00142669677734375,
"label_severe_toxicity": 6.723403930664062e-05,
"label_threat": 0.0004100799560546875,
"label_toxicity": 0.01025390625
}
```
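In the example above, the `year`, `month`, `day`, and `time` fields are consistent with `created_utc` interpreted as a Unix timestamp in UTC. The sketch below reproduces them for that record; the function name `split_timestamp` is hypothetical, and it is an assumption that every row in the dataset was derived the same way.

```python
from datetime import datetime, timezone


def split_timestamp(created_utc: int) -> dict:
    """Split a Unix timestamp into the date and time fields used in the dataset."""
    moment = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return {
        "year": float(moment.year),    # stored as float64 in the dataset
        "month": float(moment.month),
        "day": float(moment.day),
        "time": moment.strftime("%H:%M:%S"),
    }


print(split_timestamp(1671152007))
# {'year': 2022.0, 'month': 12.0, 'day': 16.0, 'time': '00:53:27'}
```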
### Data Fields
The data fields are the same among all splits.
- `subreddit`: `string`
- `created_utc`: `int64`
- `score`: `int32`
- `body`: `string`
- `predicted_language`: `string`
- `probability`: `float64`
- `year`: `float64`
- `day`: `float64`
- `month`: `float64`
- `time`: `string`
- `label_identity_attack`: `float64`
- `label_insult`: `float64`
- `label_obscene`: `float64`
- `label_severe_toxicity`: `float64`
- `label_threat`: `float64`
- `label_toxicity`: `float64`
### Language Distribution
- fi: 4,561,192
### Top-5 Subreddit Distribution
- Suomi: 3 601 806
- snappijuorut: 483 558
- LakkoPostaukset: 58 613
- snappisensuroimaton: 56 157
- mina_irl: 50 696
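A distribution like the one above can be recomputed from the `subreddit` column. The following is a minimal sketch using only the standard library; the function name `top_subreddits` is hypothetical, and the records are assumed to be dictionaries with a `subreddit` key, as in the example instance.

```python
from collections import Counter
from typing import Iterable


def top_subreddits(records: Iterable[dict], n: int = 5) -> list:
    """Return the n most frequent subreddits as (name, count) pairs."""
    counts = Counter(record["subreddit"] for record in records)
    return counts.most_common(n)
```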
## Dataset Creation
### Curation Rationale
The Finnish language does not have that many open source social media datasets. One notable dataset is Suomi24 but it has restricted access.
### Source Data
The raw Reddit data was collected through [PushShift](https://files.pushshift.io/reddit/comments/).
## Additional Information
1. Edit on 11/25/2023: added the missing data for October 2021.
User @sannamyl found out that I had missed October 2021 in the initial processing.
I had deleted the original source files, but I was able to retrieve the October 2021 source file and redo the processing. It was added to the dataset and uploaded on 11/25/2023.
2. Edit on 11/26/2023: I noticed that I had forgotten to add the toxicity predictions and had accidentally overwritten them. I took the previous version of the dataset, which had the toxicity predictions, from the history, ran the predictions on the October 2021 data, and then combined and re-uploaded everything.
### Dataset Curators
[Rasmus Toivanen](https://www.linkedin.com/in/rasmustoivanen/)
curated this dataset.
### Licensing Information
The dataset is licensed under the [CC BY 4.0
license](https://creativecommons.org/licenses/by/4.0/).
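Provider:
Finnish-NLP
Summary of original information
Dataset overview
Dataset name: Reddit_fi_2006_2022
Dataset description: Reddit_fi is a filtered and post-processed corpus of comments collected from Reddit. It covers all Reddit comments from January 2006 to December 2022, downloaded through PushShift and then filtered with the FastText language detection model, keeping only comments predicted to be Finnish with a confidence above 70%. Messages shorter than 30 characters were also excluded.
Dataset features:
- subreddit: string
- created_utc: integer
- score: integer
- body: string
- predicted_language: string
- probability: float
- year: float
- day: float
- month: float
- time: string
- label_identity_attack: float
- label_insult: float
- label_obscene: float
- label_severe_toxicity: float
- label_threat: float
- label_toxicity: float
Dataset size:
- Training set size: 4,524,360 records
- Total dataset size: 1,878,988,954 bytes
- Download size: 1,059,710,799 bytes
License: Apache-2.0
Language: Finnish
Task categories:
- Text generation
- Conversational
Intended use: mainly for training language models; it can also be used for various data analysis tasks.
Notes: the dataset may contain toxic, biased, and similar material and should be curated carefully before use. Six different toxicity labels with their predicted scores are provided for filtering out toxic messages.
Dataset creation: the dataset was curated by Rasmus Toivanen, and the raw data was collected through PushShift. Its creation was inspired by alexandrainst/scandi-reddit.
License information: the dataset is released under the CC BY 4.0 license.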



