two-million-bluesky-posts
Hosted on the ModelScope community. Updated 2025-11-12; indexed 2024-11-30.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/two-million-bluesky-posts
Description:
## 2 Million Bluesky Posts
This dataset contains 2 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data.
The with-language-predictions config contains the same data as the default config but with language predictions added using the glotlid model.
## Dataset Details
### Dataset Description
This dataset consists of 2 million public posts from Bluesky Social, collected through the platform's firehose API. Each post contains text content, metadata, and information about media attachments and reply relationships.
- **Curated by**: Alpin Dale
- **Language(s) (NLP)**: Multiple (primarily English)
- **License**: Dataset usage is subject to Bluesky's Terms of Service
## Uses
This dataset could be used for:
- Training and testing language models on social media content
- Analyzing social media posting patterns
- Studying conversation structures and reply networks
- Research on social media content moderation
- Natural language processing tasks using social media data
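As a small illustration of the posting-pattern analysis mentioned above, the sketch below counts posts per hour of day. It assumes each record is a dict carrying the `created_at` field described under Dataset Structure, and that the timestamp is an ISO 8601 string; the card does not state the exact format, so that is an assumption. The helper name `posts_per_hour` and the sample records are hypothetical.

```python
from collections import Counter
from datetime import datetime


def posts_per_hour(posts):
    """Count posts by the hour component of their created_at timestamp."""
    counts = Counter()
    for post in posts:
        # Assumption: ISO 8601 timestamps; a trailing "Z" is rewritten so that
        # datetime.fromisoformat also accepts it on older Python versions.
        stamp = post["created_at"].replace("Z", "+00:00")
        counts[datetime.fromisoformat(stamp).hour] += 1
    return counts


# Hypothetical sample records, made up for illustration only.
sample_posts = [
    {"created_at": "2024-11-27T07:53:47Z"},
    {"created_at": "2024-11-27T07:10:00Z"},
    {"created_at": "2024-11-27T18:00:01Z"},
]

hourly = posts_per_hour(sample_posts)
```

The hour is taken in whatever UTC offset the timestamp itself carries, so normalize to a single time zone first if the real data mixes offsets.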
## Dataset Structure
The dataset is available in two configurations:
### Default Configuration
Contains the following fields for each post:
- **text**: The main content of the post
- **created_at**: Timestamp of post creation
- **author**: The Bluesky handle of the post author
- **uri**: Unique identifier for the post
- **has_images**: Boolean indicating if the post contains images
- **reply_to**: URI of the parent post if this is a reply (null otherwise)
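The fields above can be sketched as a typed record, together with one way to recover reply relationships from `reply_to`. This is a minimal sketch and not the author's tooling: the `Post` type, the `group_replies` helper, and every sample value (handles, URIs, timestamps) are hypothetical, and the Python types are guesses based on the field descriptions.

```python
from collections import defaultdict
from typing import Optional, TypedDict


class Post(TypedDict):
    """One record of the default configuration (types are assumptions)."""
    text: str
    created_at: str
    author: str
    uri: str
    has_images: bool
    reply_to: Optional[str]


def group_replies(posts):
    """Map each parent URI to the URIs of its direct replies."""
    children = defaultdict(list)
    for post in posts:
        if post["reply_to"] is not None:
            children[post["reply_to"]].append(post["uri"])
    return dict(children)


# Hypothetical sample records, made up for illustration only.
root_uri = "at://alice.example/app.bsky.feed.post/1"
sample_posts = [
    Post(text="hello world", created_at="2024-11-27T07:53:47Z",
         author="alice.example", uri=root_uri,
         has_images=False, reply_to=None),
    Post(text="hi alice", created_at="2024-11-27T07:55:02Z",
         author="bob.example", uri="at://bob.example/app.bsky.feed.post/2",
         has_images=False, reply_to=root_uri),
    Post(text="me too", created_at="2024-11-27T07:56:10Z",
         author="carol.example", uri="at://carol.example/app.bsky.feed.post/3",
         has_images=True, reply_to=root_uri),
]

reply_index = group_replies(sample_posts)
```

A parent URI found in `reply_to` may not itself appear among the collected posts, so code that walks a thread should tolerate missing parents.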
### With Language Predictions Configuration
Contains all fields from the default configuration plus:
- **predicted_language**: The predicted language code (e.g., eng_Latn, deu_Latn)
- **language_confidence**: Confidence score for the language prediction (0-1)
Language predictions were added using the [glotlid](https://huggingface.co/cis-lmu/glotlid) model via fasttext.
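A common first step with this configuration is to keep only posts in one language whose prediction is reasonably confident. The sketch below assumes records are dicts with the two fields above, and that labels follow the `language_Script` pattern seen in the examples (`eng_Latn`). The helper names, the `0.8` threshold, and the sample records are hypothetical choices for illustration, not values recommended by the dataset author.

```python
def split_label(label):
    """Split a label such as 'eng_Latn' into ('eng', 'Latn')."""
    language, _, script = label.partition("_")
    return language, script


def filter_by_language(posts, language, min_confidence=0.8):
    """Keep posts predicted as `language` with at least `min_confidence`."""
    return [
        post for post in posts
        if split_label(post["predicted_language"])[0] == language
        and post["language_confidence"] >= min_confidence
    ]


# Hypothetical sample records, made up for illustration only.
sample_posts = [
    {"text": "good morning", "predicted_language": "eng_Latn",
     "language_confidence": 0.97},
    {"text": "guten Morgen", "predicted_language": "deu_Latn",
     "language_confidence": 0.97},
    {"text": "ok", "predicted_language": "eng_Latn",
     "language_confidence": 0.31},
]

confident_english = filter_by_language(sample_posts, "eng")
```

Very short posts tend to receive low confidence scores, so the threshold trades coverage against label quality and is worth tuning on a sample.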
## Bias, Risks, and Limitations
The goal of this dataset is for you to have fun :)
Provider:
maas
Created:
2024-11-28



