Hiraishin/Reddit-Malaysia
Source: Hugging Face. Updated: 2024-01-22. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Hiraishin/Reddit-Malaysia
Resource description:
---
license: apache-2.0
language:
- en
- ms
---
# Reddit Crawler on Malaysia Subreddit using Selenium
This Hugging Face dataset repository is the data store for an Extract, Transform, Load (ETL) pipeline built with MageAI. The pipeline harvests data from the Malaysia subreddit on Reddit. Using Selenium, it systematically collects information from five sections of the subreddit: Hot, New, Rising, Controversial, and Top.
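The extraction step described above can be sketched as follows. This is a minimal illustration, not the author's actual pipeline code: the base URL `https://www.reddit.com/r/malaysia/`, the helper names `section_urls` and `fetch_section_html`, and the use of headless Chrome are all assumptions. The MageAI block wiring and the transform and load steps are not shown because the source does not describe them.

```python
from urllib.parse import urljoin

# Assumed base URL for the Malaysia subreddit (not stated in the dataset card).
SUBREDDIT_URL = "https://www.reddit.com/r/malaysia/"

# The five listing sections named in the dataset card.
SECTIONS = ("hot", "new", "rising", "controversial", "top")


def section_urls(base_url=SUBREDDIT_URL, sections=SECTIONS):
    """Map each section name to its listing URL (hypothetical helper)."""
    return {name: urljoin(base_url, f"{name}/") for name in sections}


def fetch_section_html(url):
    """Load one listing page in headless Chrome and return its HTML.

    Hypothetical helper: requires the `selenium` package and a local
    Chrome installation. The import is deferred so the URL helper above
    can be used without Selenium installed.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always release the browser, even if the page load fails.
        driver.quit()


if __name__ == "__main__":
    for name, url in section_urls().items():
        print(name, url)
```

Calling `fetch_section_html(url)` for each entry would yield the raw page HTML for the extract step. Parsing posts out of that HTML depends on Reddit's current page markup, so it is left out of this sketch.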
# Usage
This dataset is curated for users who want to train large language models (LLMs) on data from the Malaysia subreddit. It is intended to support language understanding and generation by giving models examples of the nuances and dynamics of online discussions.
Provider:
Hiraishin
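Summary of original information
Reddit Crawler on Malaysia Subreddit using Selenium
Overview
This dataset stores the output of an ETL (extract, transform, load) pipeline built with MageAI. The pipeline collects data from the Malaysia subreddit on Reddit. Using Selenium, the ETL process systematically gathers information from five sections of the subreddit: Hot, New, Rising, Controversial, and Top.
Usage
The dataset is intended for users who want to train large language models (LLMs) on data from the Malaysia subreddit. With its focus on language understanding and generation, it is a resource for training LLMs that capture the nuances and dynamics of online discussions.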



