reddit_dataset_149

Hugging Face · Updated 2024-11-30 · Indexed 2024-12-12

Download link:

https://huggingface.co/datasets/LadyMia/reddit_dataset_149

Resource overview:

This dataset is part of the Bittensor Subnet 13 decentralized network, comprising preprocessed Reddit data. The data is continuously updated by network miners, delivering a real-time stream of Reddit content suitable for various analytics and machine learning tasks. The dataset includes Reddit posts and comments, with fields including text, tags, data type, community name, datetime, encoded usernames, and encoded URLs. The data is primarily in English but may also be multilingual. This dataset is released under the MIT License and is subject to Reddit's Terms of Service. Users should be aware of potential biases and limitations, such as varying data quality and temporal bias.

Created:

2024-11-23

Original Information Summary

Bittensor Subnet 13 Reddit Dataset

Dataset Description

Repository: LadyMia/reddit_dataset_149
Subnet: Bittensor Subnet 13
Miner Hotkey: 5ER93P7YrerwowGELtpnnkqoK7poR1Q8mca3f84k7b3nig3D

Dataset Summary

This dataset is part of the Bittensor Subnet 13 decentralized network, containing preprocessed Reddit data. The data is continuously updated by network miners, providing a real-time stream of Reddit content for various analytical and machine learning tasks.

Supported Tasks

Sentiment Analysis
Topic Modeling
Community Analysis
Content Categorization

Languages

Primary language: Mostly English, but the data can be multilingual because it is collected in a decentralized way.

Dataset Structure

Data Instances

Each instance represents a single Reddit post or comment with the following fields:

Data Fields

text (string): The main content of the Reddit post or comment.
label (string): Sentiment or topic category of the content.
dataType (string): Indicates whether the entry is a post or a comment.
communityName (string): The name of the subreddit where the content was posted.
datetime (string): The date when the post or comment was published.
username_encoded (string): An encoded version of the username to maintain user privacy.
url_encoded (string): An encoded version of any URLs included in the content.
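
As a rough illustration of the field list above, one row can be modelled as a small typed record. The names `RedditRecord` and `from_row` and all sample values are hypothetical; the card does not specify value formats (for example, the exact strings used in dataType), so this sketch only checks that the listed fields are present.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RedditRecord:
    """One dataset row; the card lists every field as a string."""
    text: str
    label: str
    dataType: str
    communityName: str
    datetime: str
    username_encoded: str
    url_encoded: str


def from_row(row: dict) -> RedditRecord:
    """Build a record from a dict, failing loudly if a listed field is missing."""
    names = [f.name for f in fields(RedditRecord)]
    missing = [n for n in names if n not in row]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    return RedditRecord(**{n: row[n] for n in names})


# Made-up sample values, for illustration only.
sample = {
    "text": "Example comment body",
    "label": "example-label",
    "dataType": "comment",
    "communityName": "r/AskReddit",
    "datetime": "2024-11-25",
    "username_encoded": "<encoded>",
    "url_encoded": "<encoded>",
}
record = from_row(sample)
```

Extra keys in a row are ignored, so the sketch keeps working if the real data carries columns the card does not list.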

Data Splits

This dataset is continuously updated and does not have fixed splits. Users should create their own splits based on their requirements and the data's timestamps.
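
A minimal sketch of a timestamp-based split, assuming the datetime field holds ISO 8601 strings (the card only says it is a string). The helper names `parse_ts` and `split_by_time`, the cutoff, and the sample rows are illustrative.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 string; naive values are treated as UTC (an assumption)."""
    ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def split_by_time(rows, cutoff: str):
    """Rows strictly before `cutoff` go to train, the rest to test."""
    limit = parse_ts(cutoff)
    train, test = [], []
    for row in rows:
        (train if parse_ts(row["datetime"]) < limit else test).append(row)
    return train, test


# Made-up rows inside the date range reported in the card.
rows = [
    {"text": "a", "datetime": "2024-11-23T10:00:00Z"},
    {"text": "b", "datetime": "2024-11-27"},
    {"text": "c", "datetime": "2024-11-29T23:59:59Z"},
]
train, test = split_by_time(rows, "2024-11-28T00:00:00Z")
```

Splitting on time rather than at random keeps later content out of the training set, which matters for data that is collected as a stream.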

Dataset Creation

Source Data

Data is collected from public posts and comments on Reddit, adhering to the platform's terms of service and API usage guidelines.

Personal and Sensitive Information

All usernames and URLs are encoded to protect user privacy. The dataset does not intentionally include personal or sensitive information.

Considerations for Using the Data

Social Impact and Biases

Users should be aware of potential biases inherent in Reddit data, including demographic and content biases. This dataset reflects the content and opinions expressed on Reddit and should not be considered a representative sample of the general population.

Limitations

Data quality may vary due to the nature of media sources.
The dataset may contain noise, spam, or irrelevant content typical of social media platforms.
Temporal biases may exist due to real-time collection methods.
The dataset is limited to public subreddits and does not include private or restricted communities.

Additional Information

Licensing Information

The dataset is released under the MIT license. Use of this dataset is also subject to Reddit's Terms of Use.

Citation Information

If you use this dataset in your research, please cite it as follows:

@misc{LadyMia2024datauniversereddit_dataset_149,
    title={The Data Universe Datasets: The finest collection of social media data the web has to offer},
    author={LadyMia},
    year={2024},
    url={https://huggingface.co/datasets/LadyMia/reddit_dataset_149},
}

Contributions

To report issues or contribute to the dataset, please contact the miner or use the Bittensor Subnet 13 governance mechanisms.

Dataset Statistics

Total Instances: 37221287
Date Range: 2024-11-23T00:00:00Z to 2024-11-30T00:00:00Z
Last Updated: 2024-11-30T08:35:12Z

Data Distribution

Posts: 6.09%
Comments: 93.91%

Top 10 Subreddits

Rank	Subreddit	Total Count	Percentage
1	r/AskReddit	327580	0.88%
2	r/CFB	191250	0.51%
3	r/AITAH	181540	0.49%
4	r/nfl	167411	0.45%
5	r/politics	119413	0.32%
6	r/Pixelary	116224	0.31%
7	r/NoStupidQuestions	102496	0.28%
8	r/teenagers	99724	0.27%
9	r/repost	88206	0.24%
10	r/mildlyinfuriating	79357	0.21%
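
The Percentage column appears to be each subreddit's count divided by the Total Instances figure (37221287), rounded to two decimals. The card does not state this, so the check below is a sketch under that assumption, using only the figures from the table.

```python
TOTAL_INSTANCES = 37221287

# (count, listed percentage) per subreddit, copied from the table.
top_subreddits = {
    "r/AskReddit": (327580, 0.88),
    "r/CFB": (191250, 0.51),
    "r/AITAH": (181540, 0.49),
    "r/nfl": (167411, 0.45),
    "r/politics": (119413, 0.32),
    "r/Pixelary": (116224, 0.31),
    "r/NoStupidQuestions": (102496, 0.28),
    "r/teenagers": (99724, 0.27),
    "r/repost": (88206, 0.24),
    "r/mildlyinfuriating": (79357, 0.21),
}


def share(count: int, total: int = TOTAL_INSTANCES) -> float:
    """Percentage of the whole dataset, rounded to two decimals."""
    return round(100 * count / total, 2)


# Subreddits whose recomputed share disagrees with the listed one.
mismatches = [
    name for name, (count, listed) in top_subreddits.items()
    if share(count) != listed
]

# Combined share of the ten listed subreddits.
top10_count = sum(count for count, _ in top_subreddits.values())
top10_share = share(top10_count)
```

Under this assumption every listed percentage is reproduced, and the ten subreddits together account for just under 4% of all instances, so the dataset is spread across a very large number of communities.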

Update History

Date	New Instances	Total Instances
2024-11-23T08:11:02Z	763287	763287
2024-11-26T20:17:50Z	18353497	19116784
2024-11-30T08:35:12Z	18104503	37221287
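
The Total Instances column reads as a running sum of New Instances. A short check of that reading, using the three rows above:

```python
from itertools import accumulate

# (timestamp, new instances, total instances) copied from the update history.
updates = [
    ("2024-11-23T08:11:02Z", 763287, 763287),
    ("2024-11-26T20:17:50Z", 18353497, 19116784),
    ("2024-11-30T08:35:12Z", 18104503, 37221287),
]

# Running sum of the "New Instances" column.
running_totals = list(accumulate(new for _, new, _ in updates))

# True when every reported total equals the running sum up to that row.
consistent = running_totals == [total for _, _, total in updates]
```

The final running total also matches the Total Instances figure reported under Dataset Statistics.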

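AI-Compiled Summary

Dataset Introduction

Construction

The dataset is built on the Bittensor Subnet 13 decentralized network and is continuously updated by network miners, bringing together preprocessed Reddit data. The data comes from public posts and comments on Reddit, and collection adheres to Reddit's terms of service and API usage guidelines. To protect user privacy, all usernames and URLs are encoded so that personal or sensitive information is not intentionally included. Continuous updating lets the dataset track changes in Reddit community content in near real time, giving researchers an ongoing data source.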

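Features

The dataset is mainly English, but because collection is decentralized it may contain content in other languages. Its structured records include the text of the post or comment, a sentiment or topic label, the posting time, the community name, and other fields, supporting natural language processing tasks such as sentiment analysis, topic classification, and community analysis. Continuous updating also suits scenarios that need real-time data, though users should note that data quality may fluctuate, as is typical of social platforms.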

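Usage

Users can apply the dataset to natural language processing tasks such as sentiment analysis, topic modeling, and content categorization, according to their research or business needs. Because the dataset has no fixed splits, users must create their own training, validation, and test sets based on timestamps or other criteria. Be mindful of potential biases in the data and interpret results in light of Reddit's characteristics. The MIT license permits broad use, but use must also comply with Reddit's Terms of Use.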
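
One possible way to read the data is to stream it with the Hugging Face `datasets` library and filter rows on the fly. This is a sketch, not a documented recipe: it assumes the repository is loadable with `load_dataset`, that it exposes a split named `train`, and that dataType uses values such as "post" and "comment"; none of these is stated in the card. The helper names `stream_rows` and `filter_rows` are hypothetical.

```python
from itertools import islice


def stream_rows(repo_id: str = "LadyMia/reddit_dataset_149", split: str = "train"):
    """Lazily iterate over rows without downloading the full dataset.

    Requires `pip install datasets`; the split name is an assumption.
    """
    # Imported here so the offline demo below runs without the library.
    from datasets import load_dataset
    return load_dataset(repo_id, split=split, streaming=True)


def filter_rows(rows, community: str, data_type: str):
    """Keep rows from one subreddit with one dataType value."""
    return (
        r for r in rows
        if r["communityName"] == community and r["dataType"] == data_type
    )


# Offline demonstration with made-up rows; pass stream_rows() instead for real data.
demo_rows = [
    {"text": "q1", "communityName": "r/AskReddit", "dataType": "post"},
    {"text": "c1", "communityName": "r/AskReddit", "dataType": "comment"},
    {"text": "c2", "communityName": "r/nfl", "dataType": "comment"},
]
first_comments = list(islice(filter_rows(demo_rows, "r/AskReddit", "comment"), 5))
```

Streaming avoids materializing tens of millions of rows locally, and `islice` caps how many matching rows are pulled while exploring.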

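Background and Challenges

Background Overview

reddit_dataset_149 belongs to the Bittensor Subnet 13 decentralized network and was created by LadyMia, who is listed as its main contributor. It focuses on collecting and preprocessing public posts and comments from Reddit. Its core aim is to support natural language processing tasks such as sentiment analysis, topic modeling, and community analysis with continuously updated social media data. Created in 2024, the dataset offers the field of social media data analysis a rich resource, particularly for exploring social dynamics and content categorization.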

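Current Challenges

reddit_dataset_149 faces several challenges in its construction. First, data quality fluctuates considerably: the noise and spam inherent in social media platforms can make the content inconsistent. Second, privacy protection is a prominent concern; although usernames and URLs are encoded, how to further safeguard user privacy during collection remains an open question. Third, continuous updating introduces temporal bias, and users must handle data splitting themselves to avoid imbalance over time. Finally, the diversity and multilingual nature of Reddit data add complexity to processing, especially for cross-lingual tasks.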

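Common Use Cases

Classic Use Cases

In social network analysis, reddit_dataset_149 serves as a tool for exploring social media dynamics, thanks to its rich content and the range of tasks it supports. It is particularly suited to sentiment analysis, topic modeling, and community analysis. By analyzing Reddit posts and comments, researchers can study users' sentiment, identify popular topics, and reveal interaction patterns across communities.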

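Derived and Related Work

Researchers have built a variety of work on reddit_dataset_149. For example, some studies have used it to build high-accuracy sentiment analysis models for automated public-opinion monitoring, while others have applied topic modeling to reveal the distribution of popular topics across Reddit communities. The dataset has also been used to develop models that predict community dynamics, providing new tools for real-time social media analysis.

Recent Research on the Dataset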