BEE-spoke-data/upvoteweb-posts

Name: BEE-spoke-data/upvoteweb-posts
Creator: BEE-spoke-data
Published: 2024-07-13 21:26:17
License: No description available

Source: Hugging Face. Updated: 2024-07-13. Indexed: 2024-07-13.

Download link:

https://hf-mirror.com/datasets/BEE-spoke-data/upvoteweb-posts

Summary:

This dataset contains Reddit post data from UpVoteWeb, covering multiple languages (e.g., English, Portuguese, German, French, and Spanish). It provides several configurations, each processed differently. The `default` configuration contains the raw data; the `eduscored` configuration adds educational-score predictions from a Hugging Face classifier; the `en-clean` configuration keeps only English text and applies text cleaning; and the `image-dataset-sample` configuration selects high-scoring posts and loads their image data. Features include post ID, text, URL, date, author, subreddit, score, language, and others.
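
As a minimal sketch of how one of these configurations might be loaded, the snippet below wraps `datasets.load_dataset` with a check against the configuration names listed on this page. The helper name `load_config` is hypothetical, and the sketch assumes the `datasets` library is installed and that the repository is reachable on the Hugging Face Hub under the ID shown in the title.

```python
from typing import Any

# Configuration names as listed on this page.
CONFIGS = (
    "default",
    "eduscore-1",
    "eduscore-2",
    "eduscored",
    "en-clean",
    "image-dataset-sample",
)

REPO_ID = "BEE-spoke-data/upvoteweb-posts"


def load_config(name: str = "default", streaming: bool = True) -> Any:
    """Load the train split of one configuration (hypothetical helper)."""
    if name not in CONFIGS:
        raise ValueError(f"unknown configuration: {name!r}")
    # Imported lazily so the name check works without `datasets` installed.
    from datasets import load_dataset

    # streaming=True iterates rows lazily instead of downloading the
    # whole multi-gigabyte split before the first row is available.
    return load_dataset(REPO_ID, name, split="train", streaming=streaming)
```

Calling `load_config("en-clean")` would return a streaming dataset whose rows can be iterated one at a time. Streaming is chosen as the default here because most configurations are several gigabytes in size.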

Provider:

BEE-spoke-data

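Original information summary

Dataset overview

Basic information

Languages:
- English (en)
- Portuguese (pt)
- German (de)
- French (fr)
- Spanish (es)
License: odc-by
Dataset size: 10M<n<100M
Source: OpenCo7/UpVoteWeb
Task categories:
- Text generation
- Feature extraction
- Image-to-text
- Text-to-image
- Fill-mask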

Configuration details

Configuration: `default`

Features:
- post_id: string
- text: string
- url: string
- date: string
- author: string
- subreddit: string
- score: int64
- token_count: int64
- language: string
- language_score: float64
- media_urls: string
Splits:
- train:
  - num_bytes: 9259550876
  - num_examples: 16056485
Download size: 5885641617 bytes
Dataset size: 9259550876 bytes
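
The feature list above can be turned into a simple row check. The following is a sketch only: the mapping from `string`/`int64`/`float64` to Python types is the obvious one, but treating `None` as acceptable in any column is an assumption, and the sample row is invented for illustration rather than taken from the dataset.

```python
# Column names and types for the `default` configuration, as listed above.
DEFAULT_FEATURES = {
    "post_id": str,
    "text": str,
    "url": str,
    "date": str,
    "author": str,
    "subreddit": str,
    "score": int,
    "token_count": int,
    "language": str,
    "language_score": float,
    "media_urls": str,
}

# Invented example row; none of these values come from the dataset.
SAMPLE_ROW = {
    "post_id": "abc123",
    "text": "An example post body.",
    "url": "https://example.com/post/abc123",
    "date": "2024-01-01",
    "author": "example_user",
    "subreddit": "example",
    "score": 42,
    "token_count": 5,
    "language": "en",
    "language_score": 0.98,
    "media_urls": "",
}


def schema_errors(row: dict) -> list[str]:
    """Return the problems found in one row; an empty list means it conforms."""
    errors = []
    for name, expected in DEFAULT_FEATURES.items():
        if name not in row:
            errors.append(f"missing column: {name}")
        # Assumption: null values are allowed in every column.
        elif row[name] is not None and not isinstance(row[name], expected):
            errors.append(
                f"{name}: expected {expected.__name__}, "
                f"got {type(row[name]).__name__}"
            )
    for name in row:
        if name not in DEFAULT_FEATURES:
            errors.append(f"unexpected column: {name}")
    return errors
```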

Configuration: `eduscore-1`

Features:
- post_id: string
- text: string
- url: string
- date: string
- author: string
- subreddit: string
- eduscore: float64
- token_count: int64
- language: string
- language_score: float64
- media_urls: string
- eduscore_int: int64
Splits:
- train:
  - num_bytes: 1106382730.0
  - num_examples: 661861
Download size: 702006131 bytes
Dataset size: 1106382730.0 bytes

Configuration: `eduscore-2`

Features:
- post_id: string
- text: string
- url: string
- date: string
- author: string
- subreddit: string
- eduscore: float64
- token_count: int64
- language: string
- language_score: float64
- media_urls: string
- eduscore_int: int64
Splits:
- train:
  - num_bytes: 33822539.0
  - num_examples: 17332
Download size: 20729551 bytes
Dataset size: 33822539.0 bytes

Configuration: `eduscored`

Features:
- post_id: string
- text: string
- url: string
- date: string
- author: string
- subreddit: string
- eduscore: float64
- token_count: int64
- language: string
- language_score: float64
- media_urls: string
- eduscore_int: int64
Splits:
- train:
  - num_bytes: 9388002756
  - num_examples: 16056485
Download size: 5937232561 bytes
Dataset size: 9388002756 bytes
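
This page does not state how `eduscore-1` and `eduscore-2` relate to `eduscored`. Their names and much smaller row counts suggest subsets selected by a minimum `eduscore_int`, but that is an assumption. The sketch below computes the share of `eduscored` rows each subset retains, using only the example counts listed above, and shows what such a threshold filter could look like on in-memory rows. The helper `filter_by_eduscore` and the sample rows are hypothetical.

```python
# Example counts copied from the split statistics on this page.
EDUSCORED_ROWS = 16056485
EDUSCORE_1_ROWS = 661861
EDUSCORE_2_ROWS = 17332

share_1 = EDUSCORE_1_ROWS / EDUSCORED_ROWS  # roughly 4% of all rows
share_2 = EDUSCORE_2_ROWS / EDUSCORED_ROWS  # roughly 0.1% of all rows


def filter_by_eduscore(rows, min_score_int):
    """Keep rows whose integer score meets the threshold.

    Assumption: this mirrors how the thresholded subsets were built;
    the page itself does not describe the selection rule.
    """
    return [row for row in rows if row["eduscore_int"] >= min_score_int]


# Invented rows carrying only the two score columns.
sample_rows = [
    {"post_id": "a", "eduscore": 0.3, "eduscore_int": 0},
    {"post_id": "b", "eduscore": 1.2, "eduscore_int": 1},
    {"post_id": "c", "eduscore": 2.4, "eduscore_int": 2},
]

kept_at_1 = filter_by_eduscore(sample_rows, 1)
kept_at_2 = filter_by_eduscore(sample_rows, 2)
```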

Configuration: `en-clean`

Features:
- post_id: string
- text: string
- url: string
- date: string
- author: string
- subreddit: string
- score: int64
- token_count: int64
- language: string
- language_score: float64
- media_urls: string
Splits:
- train:
  - num_bytes: 7830835057
  - num_examples: 13019754
Download size: 4956674820 bytes
Dataset size: 7830835057 bytes
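
The summary says `en-clean` keeps English text and applies text cleaning, without listing the cleaning steps. The sketch below illustrates the general idea on in-memory rows: it filters on the `language` column and uses whitespace normalization as a stand-in for the unspecified cleaning. The functions `clean_text` and `english_clean`, the `min_language_score` parameter, and the sample rows are all hypothetical.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Collapse whitespace runs; a placeholder for the real cleaning steps."""
    return _WHITESPACE.sub(" ", text).strip()


def english_clean(rows, min_language_score=0.0):
    """Keep rows tagged as English and return copies with cleaned text."""
    kept = []
    for row in rows:
        if row["language"] != "en":
            continue
        if row["language_score"] < min_language_score:
            continue
        kept.append({**row, "text": clean_text(row["text"])})
    return kept


# Invented rows for illustration.
sample_rows = [
    {"language": "en", "language_score": 0.95, "text": "  hello \n world "},
    {"language": "de", "language_score": 0.99, "text": "hallo welt"},
]
cleaned = english_clean(sample_rows)

# Share of `default` rows that remain in `en-clean`, from the listed counts.
en_share = 13019754 / 16056485
```

Based on the listed example counts, `en-clean` retains about 81% of the rows in `default`.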

Configuration: `image-dataset-sample`

Features:
- post_id: string
- text: string
- date: string
- author: string
- subreddit: string
- score: int64
- token_count: int64
- language: string
- language_score: float64
- image: image
Splits:
- train:
  - num_bytes: 97242067379.125
  - num_examples: 122087
Download size: 96955619502 bytes
Dataset size: 97242067379.125 bytes
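
The split statistics listed for the configurations above are easier to compare once they are reduced to per-example figures. The sketch below does that arithmetic using only numbers copied from this page; the function name `summarize` and the two derived metrics are choices made here for illustration.

```python
# (num_bytes, num_examples, download_size) per configuration, from this page.
SPLIT_STATS = {
    "default": (9259550876, 16056485, 5885641617),
    "eduscore-1": (1106382730, 661861, 702006131),
    "eduscore-2": (33822539, 17332, 20729551),
    "eduscored": (9388002756, 16056485, 5937232561),
    "en-clean": (7830835057, 13019754, 4956674820),
    "image-dataset-sample": (97242067379.125, 122087, 96955619502),
}


def summarize(name: str) -> dict:
    """Derive average row size and download-to-dataset size ratio."""
    num_bytes, num_examples, download_size = SPLIT_STATS[name]
    return {
        "avg_bytes_per_example": num_bytes / num_examples,
        "download_ratio": download_size / num_bytes,
    }


if __name__ == "__main__":
    for config in SPLIT_STATS:
        stats = summarize(config)
        print(
            f"{config:22s} "
            f"{stats['avg_bytes_per_example']:12.1f} B/example  "
            f"download/dataset = {stats['download_ratio']:.3f}"
        )
```

The text configurations average a few hundred bytes per example and download at roughly two thirds of their in-memory size, while `image-dataset-sample` averages several hundred kilobytes per example and its download is nearly as large as the dataset itself.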

Configuration file paths

default: data/train-*
eduscore-1: eduscore-1/train-*
eduscore-2: eduscore-2/train-*
eduscored: eduscored/train-*
en-clean: en-clean/train-*
image-dataset-sample: image-dataset-sample/train-*
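
Each path above is a glob pattern selecting the shard files of one configuration. As a rough illustration of how such patterns partition a repository, the sketch below matches file names against them with Python's `fnmatch`. The Hub resolves patterns with its own logic, so this is an approximation, and the shard file names used in the example are hypothetical.

```python
from fnmatch import fnmatch

# Glob patterns copied from the list above.
CONFIG_PATHS = {
    "default": "data/train-*",
    "eduscore-1": "eduscore-1/train-*",
    "eduscore-2": "eduscore-2/train-*",
    "eduscored": "eduscored/train-*",
    "en-clean": "en-clean/train-*",
    "image-dataset-sample": "image-dataset-sample/train-*",
}


def config_for_file(path: str):
    """Return the configuration whose pattern matches `path`, or None."""
    for name, pattern in CONFIG_PATHS.items():
        if fnmatch(path, pattern):
            return name
    return None


# Hypothetical shard names; the real file names are not given on this page.
example_default = config_for_file("data/train-00000-of-00019.parquet")
example_edu = config_for_file("eduscored/train-00003-of-00019.parquet")
example_other = config_for_file("README.md")
```

Note that the `default` configuration lives under `data/` rather than a directory named after the configuration, which is why an explicit mapping is needed.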
