UpVoteWeb
Source: ModelScope community. Updated 2025-11-20; indexed 2024-08-31.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/UpVoteWeb
Description:
<center>
<img src="https://static.grassfoundation.io">
</center>
# Dataset Summary
This dataset is a filtered collection of posts and comments from Reddit in the year 2024. It has been prepared for research and educational purposes. This dataset includes public web data from various subreddits, providing a snapshot of the discussions happening on the platform during this period. The dataset has been processed to anonymize any personal information found in the posts and comments, specifically email addresses and IP addresses, ensuring the privacy of individuals while maintaining the integrity and context of the data.
### Supported Tasks and Leaderboards
The dataset may be used for a variety of natural language processing (NLP) tasks including:
- Text Classification: Classifying comments and posts into categories based on sentiment, topic, or subreddit.
- Language Modeling: Training language models to understand and generate conversational text.
- Sentiment Analysis: Analyzing the sentiment of comments and posts across different subreddits and topics.
- Topic Modeling: Identifying and modeling topics discussed in the posts and comments.
### Languages
The primary language of the dataset is English, as the majority of users post in English. However, posts in other languages may also be present, reflecting the diverse user base of the platform.
# Dataset Structure
### Data Instances
Each data instance represents a post or comment and includes the following fields:
- id: A unique identifier for the comment or post.
- parent_id: The identifier of the parent comment or post. The prefixes are defined as follows:
  - t5: subreddit
  - t3: post
  - t1: comment
- text: The content of the comment or post, with email addresses and IP addresses anonymized.
- url: The URL of the original thread on Reddit.
- date: The timestamp of the comment or post in UTC.
- language: The detected language of the text.
- language_score: The confidence score of the language detection.
- token_count: The number of tokens in the text, as determined by the GPT-2 tokenizer.
- score: The score (upvotes minus downvotes) of the comment or post.
- subreddit: The subreddit where the comment or post was made.
- author: The username of the author of the comment or post.
- media_urls: An array of links to any multimedia included in the comment or post.
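
The prefix convention for `parent_id` can be handled with a small helper. The sketch below assumes values follow a `<prefix>_<id>` layout such as `t3_abc123`, which the field list does not spell out. The names `ParentRef` and `parse_parent_id` are hypothetical and not part of the dataset.

```python
from typing import NamedTuple

# Prefix meanings as listed in the field description above.
PARENT_KINDS = {"t1": "comment", "t3": "post", "t5": "subreddit"}


class ParentRef(NamedTuple):
    kind: str     # "comment", "post", "subreddit", or "unknown"
    bare_id: str  # identifier with the prefix removed


def parse_parent_id(parent_id: str) -> ParentRef:
    """Split a parent_id into its kind and bare identifier.

    Assumes the "<prefix>_<id>" layout. Values without an underscore are
    returned unchanged with kind "unknown".
    """
    prefix, sep, rest = parent_id.partition("_")
    if not sep:
        return ParentRef("unknown", parent_id)
    return ParentRef(PARENT_KINDS.get(prefix, "unknown"), rest)
```

A record whose `parent_id` resolves to kind `post` would be a top-level comment, while kind `comment` marks a reply to another comment.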
### Data Fields
- id: string
- parent_id: string
- text: string
- url: string
- date: string
- language: string
- language_score: float
- token_count: int
- score: int
- subreddit: string
- author: string
- media_urls: array
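
The field types above can be checked against a loaded record before further processing. This is a minimal sketch that assumes "array" corresponds to a Python `list` and that `float` and `int` map to the Python built-ins. `FIELD_TYPES` and `schema_problems` are illustrative names, not part of the dataset or any library.

```python
from typing import Any, Dict, List

# Expected Python type per field, following the list above.
# Assumption: "array" maps to list; float/int map to the Python built-ins.
FIELD_TYPES: Dict[str, type] = {
    "id": str,
    "parent_id": str,
    "text": str,
    "url": str,
    "date": str,
    "language": str,
    "language_score": float,
    "token_count": int,
    "score": int,
    "subreddit": str,
    "author": str,
    "media_urls": list,
}


def schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return human-readable schema mismatches for one record (empty if none)."""
    problems = []
    for name, expected in FIELD_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```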
# Data Preprocessing
The dataset has undergone several preprocessing steps to ensure the quality and privacy of the data:
1. Personal Information Anonymization: Email addresses and IP addresses have been replaced with `[EMAIL]` and `[IP]` placeholders, respectively.
2. Language Detection: Each text instance has been processed using FastText to detect its language and assign a confidence score.
3. Tokenization: Text instances have been tokenized using the GPT-2 tokenizer to provide a token count.
4. NSFW Filtering: The dataset has been filtered to exclude content marked as NSFW, utilizing the NSFW metadata provided by Reddit's moderation.
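
The anonymization step can be illustrated with regular expressions. The patterns below are assumptions chosen for illustration: the card does not publish the actual rules, and this sketch covers only common email shapes and dotted IPv4 addresses (no IPv6). The function name `anonymize` is hypothetical.

```python
import re

# Deliberately simple patterns. These are illustrative assumptions,
# not the dataset authors' actual rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")


def anonymize(text: str) -> str:
    """Replace email addresses with [EMAIL] and IPv4 addresses with [IP]."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return IPV4_RE.sub("[IP]", text)
```

Emails are replaced first so that a numeric domain inside an address is not partially rewritten as an IP placeholder.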
### Usage Example
Here is an example of how to load and use the dataset in Python.
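```
from datasets import load_dataset

# Stream the training split instead of downloading the whole dataset first.
dataset = load_dataset("OpenCo7/UpVoteWeb", split="train", streaming=True)

# Inspect the first record.
print(next(iter(dataset)))
```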
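
Once records are streaming, the annotation fields can be used to select a subset. The sketch below filters on `language`, `language_score`, `score`, and `token_count`. The thresholds are arbitrary, and the language code `"en"` is an assumption about how languages are encoded in the `language` field. `select_records` is a hypothetical helper that works on any iterable of record dictionaries.

```python
from typing import Any, Dict, Iterable, Iterator


def select_records(
    records: Iterable[Dict[str, Any]],
    language: str = "en",             # assumed language code format
    min_language_score: float = 0.9,  # arbitrary example threshold
    min_score: int = 1,               # arbitrary example threshold
    max_tokens: int = 512,            # arbitrary example threshold
) -> Iterator[Dict[str, Any]]:
    """Yield records that pass simple language, score, and length filters."""
    for record in records:
        if (
            record["language"] == language
            and record["language_score"] >= min_language_score
            and record["score"] >= min_score
            and record["token_count"] <= max_tokens
        ):
            yield record
```

Because the function is a generator, it can wrap the streamed `dataset` directly, for example `itertools.islice(select_records(dataset), 5)` to take the first five matches without reading the full stream.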
# Dataset Creation
### Curation Rationale
Reddit hosts public web content on a diverse range of topics, all presented in a conversational format, which has made it a training resource for some of the highest-profile LLMs to date. UpVoteWeb is a large, clean pretraining dataset built from this content, intended for developing open source models for research and educational purposes.
### Source Data
This dataset is a filtered collection of posts and comments from Reddit in the year 2024.
### Annotations
We augment the scraped data with the language, language_score, and token_count annotations. The language and language_score annotations are generated using FastText, and token_count is generated using the GPT-2 tokenizer.
### Personal and Sensitive Information
The dataset has been processed to anonymize personal information, specifically email addresses and IP addresses, ensuring the privacy of individuals while maintaining the integrity and context of the data.
# Considerations for Using the Data
### Social Impact of Dataset
With the release of this dataset, we aim to make this development resource available to the community at large.
### Discussion of Biases
We tried to minimize the amount of NSFW and toxic content in the dataset by filtering at the URL level.
# Additional Information
### Licensing Information
The dataset is released under the [Open Data Commons Attribution License (ODC-By) v1.0](https://opendatacommons.org/licenses/by/1-0/). Its availability is not an invitation to use any of the information for any illegal or unlawful purpose, or outside the scope of research or educational purposes.
### Future Work
Grass is a network for the acquisition of public web data, and we plan to continue building high-quality, structured datasets for use in AI/ML research. In addition to future offerings, we will also continue to improve UpVoteWeb in future iterations.
### Citation Information
If you use this dataset in your research or project, please cite it as follows:
```
@dataset{UpVoteWeb,
  title = {UpVoteWeb-24-600M},
  year = {2024},
  publisher = {OpenCo},
  url = {https://huggingface.co/datasets/OpenCo7/UpVoteWeb}
}
```
Provider:
maas
Created:
2024-07-06



