one-million-bluesky-posts

ModelScope community. Updated 2024-12-01; indexed 2024-11-30.
Download link: https://modelscope.cn/datasets/AI-ModelScope/one-million-bluesky-posts
Description:
# Note

> I've removed the data from this dataset since there was a lot of community pushback about its creation/uploading. I will leave the dataset repository up to allow room for discussion of how datasets can be used to help improve Bluesky and allow people to build the tools they need to build their own open models and approaches to creating feeds that work for their needs. Please feel free to continue to leave feedback in the discussions [here](https://huggingface.co/datasets/bluesky-community/one-million-bluesky-posts/discussions).

# Dataset Card for 1 Million Bluesky Posts

This dataset contains 1 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data.

The `with-language-predictions` config contains the same data as the default config but with language predictions added using the [glotlid model](https://huggingface.co/cis-lmu/glotlid).

## Dataset Details

### Dataset Description

This dataset consists of 1 million public posts from Bluesky Social, collected through the platform's firehose API. Each post contains text content, metadata, and information about media attachments and reply relationships.

- **Curated by:** Daniel van Strien
- **Language(s) (NLP):** Multiple (primarily English)
- **License:** Dataset usage is subject to Bluesky's Terms of Service

### Dataset Sources

- **Source:** Bluesky Social firehose API
- **Collection Method:** Python script ([get_data.py](get_data.py)) using the [atproto library](https://github.com/MarshalX/atproto) to connect to the firehose API and collect posts

## Uses

### Direct Use

This dataset could be used for:

- Training and testing language models on social media content
- Analyzing social media posting patterns
- Studying conversation structures and reply networks
- Research on social media content moderation
- Natural language processing tasks using social media data

### Out-of-Scope Use

This dataset should not be used for:

- Building automated posting systems for Bluesky
- Creating fake or impersonated content
- Extracting personal information about users
- Any purpose that violates Bluesky's Terms of Service

## Dataset Structure

The dataset is available in two configurations:

### Default Configuration

Contains the following fields for each post:

- `text`: The main content of the post
- `created_at`: Timestamp of post creation
- `author`: The Bluesky handle of the post author
- `uri`: Unique identifier for the post
- `has_images`: Boolean indicating if the post contains images
- `reply_to`: URI of the parent post if this is a reply (null otherwise)

### With Language Predictions Configuration

Contains all fields from the default configuration plus:

- `predicted_language`: The predicted language code (e.g., eng_Latn, deu_Latn)
- `language_confidence`: Confidence score for the language prediction (0-1)

Language predictions were added using the [glotlid model](https://huggingface.co/cis-lmu/glotlid) via fasttext, with the process documented in [predict_language.ipynb](predict_language.ipynb).

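As a minimal sketch of the record layout described in the two configurations, the snippet below models one post as a dataclass and parses a single JSONL line. Only the field names come from the dataset card. The names `Post`, `parse_post`, and `is_confident`, the `0.8` threshold, and every value in the sample record are hypothetical and chosen purely for illustration. The two language fields are optional so that one type can cover both configurations.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Post:
    """One record, using the field names listed in the dataset card."""
    text: str
    created_at: str
    author: str
    uri: str
    has_images: bool
    reply_to: Optional[str] = None  # null unless the post is a reply
    # Present only in the with-language-predictions configuration.
    predicted_language: Optional[str] = None
    language_confidence: Optional[float] = None


def parse_post(line: str) -> Post:
    """Parse one JSONL line into a Post, ignoring unknown keys (hypothetical helper)."""
    raw = json.loads(line)
    known = {f.name for f in fields(Post)}
    return Post(**{key: value for key, value in raw.items() if key in known})


def is_confident(post: Post, language: str, threshold: float = 0.8) -> bool:
    """True if the post is predicted as `language` with at least `threshold` confidence.

    The 0.8 default is an arbitrary illustrative choice, not a recommended value.
    """
    confidence = post.language_confidence or 0.0
    return post.predicted_language == language and confidence >= threshold


# Made-up sample record; not taken from the (now removed) dataset.
sample_line = json.dumps({
    "text": "hello world",
    "created_at": "2024-11-27T12:00:00Z",
    "author": "example.bsky.social",
    "uri": "at://did:plc:example/app.bsky.feed.post/abc123",
    "has_images": False,
    "reply_to": None,
    "predicted_language": "eng_Latn",
    "language_confidence": 0.97,
})
post = parse_post(sample_line)
```

Filtering on `language_confidence` is one way to reduce the effect of unreliable predictions on short or mixed-language posts; a suitable threshold would need to be chosen empirically.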
## Dataset Creation

### Curation Rationale

This dataset was created to provide researchers and developers with a large sample of Bluesky posts for machine learning experimentation and social media analysis.

### Source Data

#### Data Collection and Processing

Posts were collected using a Python script that connects to Bluesky's firehose API using the atproto library. The script:

- Processes the real-time feed of public posts
- Extracts relevant fields from each post
- Saves posts in batches of 100,000 to JSONL files
- Includes basic metadata and structural information about each post

#### Who are the source data producers?

The data comes from public posts made by Bluesky Social users. These users represent a diverse group of individuals and organizations who have chosen to share content publicly on the platform.

### Personal and Sensitive Information

The dataset contains public posts and their associated public metadata. While all data is publicly available through Bluesky's API, users should:

- Respect user privacy and platform Terms of Service
- Not attempt to de-anonymize or aggregate user information
- Use the data responsibly and ethically

## Bias, Risks, and Limitations

- The dataset represents a snapshot in time and may not reflect current platform usage
- Content may be biased towards more active users or specific time periods
- Posts are not filtered for content or quality
- The dataset may contain biases present in the Bluesky user base
- Language distribution may not be representative of all Bluesky users, especially since the posts were collected over a brief period when users in some time zones were asleep
- Language predictions are automated and may contain errors, especially for short texts or mixed-language content
- The language detection model may have its own biases and limitations in detecting certain languages or scripts
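
The batch-saving step listed under Data Collection and Processing can be illustrated with a short, self-contained sketch. This is not the author's `get_data.py`: it does not connect to the firehose or use the atproto library, and the function names `batched` and `write_batches` and the `posts_batch_NNNN.jsonl` file naming are assumptions made for illustration. It only shows how an arbitrary stream of post dictionaries could be grouped into fixed-size batches and written as JSONL, with the batch size defaulting to the 100,000 mentioned in the card.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator, List, Union


def batched(records: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield lists of up to `batch_size` records; the last list may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_batches(
    records: Iterable[Dict],
    out_dir: Union[str, Path],
    batch_size: int = 100_000,
) -> List[Path]:
    """Write each batch to its own JSONL file and return the file paths.

    File names follow a hypothetical posts_batch_NNNN.jsonl pattern.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for index, batch in enumerate(batched(records, batch_size)):
        path = out_path / f"posts_batch_{index:04d}.jsonl"
        with path.open("w", encoding="utf-8") as handle:
            for record in batch:
                # One JSON object per line; keep non-ASCII text readable.
                handle.write(json.dumps(record, ensure_ascii=False) + "\n")
        written.append(path)
    return written
```

Because `batched` consumes any iterable lazily, the same function would work with a generator fed by a live stream, holding at most one batch in memory at a time.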

Provider: maas
Created: 2024-11-27