20-million-bluesky-posts
Source: ModelScope community. Updated 2025-05-25; indexed 2024-12-07.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/20-million-bluesky-posts
Resource description:
## 20 Million Bluesky Posts
This dataset contains 20 million public posts collected from Bluesky Social's firehose API.
The data is **anonymized** (VERY IMPORTANT POINT). The full dataset could lead to legal issues. If you want your posts removed, I currently have no way to do that, because I can't map a DID (Decentralized Identifier) to its posts.
## Dataset Description
This dataset consists of 20 million **public** (AS IN OPENLY AVAILABLE) posts from Bluesky Social, collected through the platform's firehose API.
- **Language(s) (NLP)**: Multiple (primarily English); a language prediction is included in the dataset
- **License**: Dataset usage is subject to Bluesky's Terms of Service
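
Since the dataset carries a language prediction per post, a per-language tally is a natural first step toward the kind of graph mentioned on this card. The snippet below is only a minimal sketch on a few made-up records: the field names `text` and `lang` and the helper `count_languages` are assumptions for illustration, not the dataset's documented schema.

```python
from collections import Counter

# Hypothetical records: the field names "text" and "lang" are assumptions
# made for this sketch, not the dataset's documented schema.
sample_posts = [
    {"text": "good morning bluesky", "lang": "en"},
    {"text": "bonjour tout le monde", "lang": "fr"},
    {"text": "another english post", "lang": "en"},
]


def count_languages(posts, lang_field="lang"):
    """Tally posts per predicted language, skipping records with no label."""
    return Counter(p[lang_field] for p in posts if p.get(lang_field))


language_counts = count_languages(sample_posts)
print(language_counts.most_common())
```

If the real column is named differently, pass it through `lang_field`; the resulting `Counter` can be handed to any plotting library.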
## Restricted access
The requests will be accepted automatically. I just want to know who has access to this data.
## Bias, Risks, and Limitations
This is not intended to hurt anyone, and I created it because I love making funny graphs. I do not intend to do any other NLP on this data.
Provider:
maas
Created:
2024-11-30



