two-million-bluesky-posts

Name: two-million-bluesky-posts
Creator: maas
Published: 2025-11-12 16:18:14
License: No description available

ModelScope Community. Updated: 2025-11-12. Indexed: 2024-11-30.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/two-million-bluesky-posts

Resource overview:

## 2 Million Bluesky Posts

This dataset contains 2 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data.

The `with-language-predictions` config contains the same data as the default config, with language predictions added using the glotlid model.

## Dataset Details

### Dataset Description

This dataset consists of 2 million public posts from Bluesky Social, collected through the platform's firehose API. Each post contains text content, metadata, and information about media attachments and reply relationships.

- **Curated by**: Alpin Dale
- **Language(s) (NLP)**: Multiple (primarily English)
- **License**: Dataset usage is subject to Bluesky's Terms of Service

## Uses

This dataset could be used for:

- Training and testing language models on social media content
- Analyzing social media posting patterns
- Studying conversation structures and reply networks
- Research on social media content moderation
- Natural language processing tasks using social media data

## Dataset Structure

The dataset is available in two configurations:

### Default Configuration

Contains the following fields for each post:

- **text**: The main content of the post
- **created_at**: Timestamp of post creation
- **author**: The Bluesky handle of the post author
- **uri**: Unique identifier for the post
- **has_images**: Boolean indicating if the post contains images
- **reply_to**: URI of the parent post if this is a reply (null otherwise)

### With Language Predictions Configuration

Contains all fields from the default configuration plus:

- **predicted_language**: The predicted language code (e.g., `eng_Latn`, `deu_Latn`)
- **language_confidence**: Confidence score for the language prediction (0-1)

Language predictions were added using the [glotlid](https://huggingface.co/cis-lmu/glotlid) model via fasttext.

## Bias, Risks, and Limitations

The goal of this dataset is for you to have fun :)
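As an illustration of the `uri` and `reply_to` fields described under Dataset Structure, the following minimal sketch groups posts into reply threads. The sample records, their values, and the helper names `build_reply_index` and `thread_size` are hypothetical and exist only for this example; they are not taken from the dataset or from any official tooling.

```python
from collections import defaultdict

# Hypothetical records shaped like the default configuration.
# All values are invented for illustration.
posts = [
    {"text": "Hello Bluesky!", "created_at": "2024-11-27T10:00:00Z",
     "author": "alice.example", "uri": "at://alice.example/post/1",
     "has_images": False, "reply_to": None},
    {"text": "Welcome!", "created_at": "2024-11-27T10:05:00Z",
     "author": "bob.example", "uri": "at://bob.example/post/2",
     "has_images": False, "reply_to": "at://alice.example/post/1"},
    {"text": "Thanks :)", "created_at": "2024-11-27T10:07:00Z",
     "author": "alice.example", "uri": "at://alice.example/post/3",
     "has_images": False, "reply_to": "at://bob.example/post/2"},
    {"text": "Sunset photo", "created_at": "2024-11-27T11:00:00Z",
     "author": "carol.example", "uri": "at://carol.example/post/4",
     "has_images": True, "reply_to": None},
]


def build_reply_index(records):
    """Map each parent URI to the URIs of its direct replies."""
    children = defaultdict(list)
    for record in records:
        parent = record["reply_to"]
        if parent is not None:
            children[parent].append(record["uri"])
    return dict(children)


def thread_size(root_uri, reply_index):
    """Count the root post plus every reply reachable from it."""
    seen = set()
    stack = [root_uri]
    while stack:
        uri = stack.pop()
        if uri in seen:
            continue  # guard against malformed, cyclic data
        seen.add(uri)
        stack.extend(reply_index.get(uri, []))
    return len(seen)


reply_index = build_reply_index(posts)
root_uris = [p["uri"] for p in posts if p["reply_to"] is None]
sizes = {uri: thread_size(uri, reply_index) for uri in root_uris}

for uri, size in sizes.items():
    print(uri, size)
```

In the real data a parent post may fall outside the 2 million collected posts, so a reply's `reply_to` can point at a URI that never appears as a record; the sketch treats only posts with a null `reply_to` as thread roots and would not count such orphaned replies.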

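The `predicted_language` and `language_confidence` fields of the with-language-predictions configuration lend themselves to simple filtering. The sketch below shows one possible approach on hypothetical records; the 0.8 threshold, the sample values, and the helper name `confident_posts` are assumptions made for illustration, not recommendations from the dataset authors.

```python
from collections import Counter

# Hypothetical records carrying the two extra fields from the
# with-language-predictions configuration. Values are invented.
records = [
    {"text": "Good morning", "predicted_language": "eng_Latn",
     "language_confidence": 0.98},
    {"text": "Guten Morgen", "predicted_language": "deu_Latn",
     "language_confidence": 0.91},
    {"text": "ok", "predicted_language": "eng_Latn",
     "language_confidence": 0.35},
    {"text": "Nice photo", "predicted_language": "eng_Latn",
     "language_confidence": 0.87},
]

MIN_CONFIDENCE = 0.8  # assumed cutoff; tune it for your own use case


def confident_posts(rows, language=None, min_confidence=MIN_CONFIDENCE):
    """Keep rows whose prediction meets the cutoff, optionally for one language."""
    kept = []
    for row in rows:
        if row["language_confidence"] < min_confidence:
            continue
        if language is not None and row["predicted_language"] != language:
            continue
        kept.append(row)
    return kept


english = confident_posts(records, language="eng_Latn")
language_counts = Counter(
    row["predicted_language"] for row in confident_posts(records)
)

print([row["text"] for row in english])
print(dict(language_counts))
```

Very short posts tend to receive low-confidence predictions, which is why a cutoff like this can be useful before computing per-language statistics.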

Provider:

maas

Created:

2024-11-28

Collection summary

Dataset introduction