
The Reddit Self-Post Classification Task (RSPCT)

Alibaba Cloud Tianchi. Updated 2026-06-07. Indexed 2024-03-07.
Download link:
https://tianchi.aliyun.com/dataset/94101
Resource description:
Welcome to the Reddit Self-Post Classification Task (RSPCT)! This dataset aims to pose an interesting, large-scale text classification problem over many classes that does not suffer from the label sparsity affecting most comparable datasets. The objective is to classify each self-post into the subreddit where it was originally posted. Considerable effort went into selecting a set of "good" subreddits so that content overlap between classes is minimized. We recommend reading the accompanying blog post before working with the data; for more detailed questions, a draft of the paper is also available.
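To make the task concrete, here is a minimal sketch of a baseline that assigns a post to a subreddit. It uses a multinomial naive Bayes classifier with add-one smoothing, which is a technique chosen here for illustration and is not stated in the dataset description. The `(subreddit, text)` pair layout, the function names, and the toy posts are all assumptions; the real files may be organized differently.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased whitespace split; a real pipeline would likely need more care.
    return text.lower().split()


def train(examples):
    """Count words per class from (subreddit, text) pairs (assumed layout)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in examples:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy stand-in data with hypothetical subreddit names and posts.
toy = [
    ("r/python", "how do I sort a list of tuples"),
    ("r/python", "list comprehension versus generator expression"),
    ("r/cooking", "how long should I roast a chicken"),
    ("r/cooking", "best pan for searing a steak"),
]
model = train(toy)
print(predict(model, "roast a steak in a pan"))
```

With many classes and plenty of examples per class, a word-count baseline like this is mainly useful as a sanity check before trying stronger models.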
Provider:
Alibaba Cloud Tianchi
Created:
2021-03-11
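Background overview
The dataset is designed as a large multi-class text classification task that avoids the label sparsity problem. It contains 1.013 million self-posts covering 1,013 subreddits, with 1,000 examples per class, and ships with a subreddit information file to assist classification.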
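The figures above imply a perfectly balanced label distribution (1,013 classes × 1,000 posts = 1,013,000 posts). A small sketch of how one might verify that after loading the labels follows; the function name `check_balance` and the toy label list are assumptions for illustration, not part of the dataset.

```python
from collections import Counter


def check_balance(labels, expected_classes, expected_per_class):
    """Return True if every class has exactly the expected number of rows."""
    counts = Counter(labels)
    return (len(counts) == expected_classes
            and all(n == expected_per_class for n in counts.values()))


# Figures from the summary: 1,013 subreddits with 1,000 posts each.
NUM_CLASSES = 1013
PER_CLASS = 1000
print(NUM_CLASSES * PER_CLASS)  # total number of posts implied by the summary

# Toy stand-in for the real label column (hypothetical subreddit names).
labels = ["r/a"] * 3 + ["r/b"] * 3
print(check_balance(labels, expected_classes=2, expected_per_class=3))
```

On the real data, the same call would use `expected_classes=NUM_CLASSES` and `expected_per_class=PER_CLASS`.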
The summary above was collected and generated by 遇见数据集.