Reddit EU language dataset
Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/5346799
Description:
This dataset was created for a personal project on recognizing the native language of someone writing in English.
Origin
The dataset was crawled from the subreddit r/europe and contains around 1.5 million posts in its raw form.
Structure
This repo contains both the raw data and the cleaned data. The cleaned data excludes deleted comments and comments that could not be linked to the writer's country of origin. It contains around 450k data points and has the following structure:
body: the text content of the comment
country_name: extended name of the country
permalink: link to the comment
author: username of the creator
created_utc: creation time in UTC
datetime: date and time of creation
alpha2: ISO country alpha2 code
alpha3: ISO country alpha3 code
numeric: ISO numeric country code
apolitical_name: apolitical country name
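
The field list above can be checked when loading the cleaned data. The following is a minimal sketch only: it assumes the cleaned data is available as a CSV file with a header row, which the description does not confirm, and the sample rows (including the value formats of created_utc, datetime and permalink) are made up for illustration and are not taken from the dataset. The helper names read_comments and comments_per_country are hypothetical.

```python
import csv
import io
from collections import Counter

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "body", "country_name", "permalink", "author", "created_utc",
    "datetime", "alpha2", "alpha3", "numeric", "apolitical_name",
]


def read_comments(handle):
    """Read comment rows from a CSV file object and verify the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


def comments_per_country(rows):
    """Count comments per ISO alpha2 country code."""
    return Counter(row["alpha2"] for row in rows)


# Made-up sample rows (NOT real data) standing in for the cleaned file.
SAMPLE = (
    "body,country_name,permalink,author,created_utc,datetime,"
    "alpha2,alpha3,numeric,apolitical_name\n"
    "First example comment,Italy,/r/europe/comments/example1,user_a,"
    "1630000000,2021-08-26 17:46:40,IT,ITA,380,Italy\n"
    "Second example comment,Italy,/r/europe/comments/example2,user_b,"
    "1630000060,2021-08-26 17:47:40,IT,ITA,380,Italy\n"
    "Third example comment,Germany,/r/europe/comments/example3,user_c,"
    "1630000120,2021-08-26 17:48:40,DE,DEU,276,Germany\n"
)

rows = read_comments(io.StringIO(SAMPLE))
counts = comments_per_country(rows)
print(counts.most_common())
```

To run this against the real data, replace io.StringIO(SAMPLE) with an opened file, for example open(path, newline="", encoding="utf-8"), where path points to wherever the cleaned file was downloaded. If the file is not CSV, the reader would need to be swapped for the matching format.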
Created:
2021-08-31



