akarrouch-mohamed/bluesky
Source: Hugging Face. Updated 2026-04-11; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/akarrouch-mohamed/bluesky
Resource description:
---
language:
- en
pretty_name: Curated English Bluesky Corpus
tags:
- bluesky
- social-media
- english
- text-corpus
- dataset-curation
size_categories:
- 10M<n<100M
license: apache-2.0
---
# Curated English Bluesky Corpus
## Dataset summary
This dataset is a large English-only Bluesky text corpus curated for tokenizer and language-model experiments.
The corpus was built in two stages:
1. **Stage 1:** a full download of the Hugging Face dataset `Roronotalt/bluesky`, filtered to English posts and deduplicated by post URI.
2. **Stage 2:** additional English Bluesky posts collected directly from Bluesky repositories using a custom extractor, again deduplicated by URI and checked against Stage 1 to avoid overlap.
The final merged corpus contains **90,831,330 rows** and **1,500,002,598 whitespace-counted words**.
Only two columns are kept in the released dataset:
- `uri`: the Bluesky AT URI of the post
- `text`: the post text
## Why this dataset was created
This corpus was curated to provide a large English social-media corpus for tokenizer fitting, with a target size of roughly **1.5 billion words**. It is suitable for tokenizer training and related downstream experiments.
## Data fields
- `uri` (`string`): unique post identifier in AT URI format.
- `text` (`string`): raw post text.
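Because every row is keyed by its AT URI, it can be useful to split that string into its parts. The sketch below assumes the `at://<authority>/<collection>/<rkey>` layout and that the authority is a DID. The helper `parse_at_uri` and the example value are hypothetical and not part of the dataset tooling.

```python
from typing import NamedTuple


class AtUri(NamedTuple):
    did: str         # repository owner, e.g. "did:plc:..."
    collection: str  # record type; "app.bsky.feed.post" for posts
    rkey: str        # record key within the collection


def parse_at_uri(uri: str) -> AtUri:
    """Split an AT URI of the form at://<did>/<collection>/<rkey>."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://<did>/<collection>/<rkey>, got {uri!r}")
    return AtUri(*parts)


# Hypothetical example value; the DID and record key are made up.
example = parse_at_uri("at://did:plc:abc123/app.bsky.feed.post/3kxyz")
print(example.did, example.collection, example.rkey)
```

Grouping rows by `did` in this way is one possible starting point for building author-disjoint splits.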
## Dataset splits
The final release is stored as a `DatasetDict` with three splits:
- **train:** 89,062,186 rows
- **validation:** 891,793 rows
- **test:** 877,351 rows
Split creation was based on **cumulative whitespace word counts**, not random row sampling:
- validation was filled first until it reached approximately **1%** of total words,
- test was filled next until it reached approximately **1%** of total words,
- train contains the remainder.
Final split word counts are:
- **train:** 1,470,002,529 words
- **validation:** 15,000,031 words
- **test:** 15,000,038 words
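The sequential fill described above can be sketched as follows. This is a minimal illustration, not the released splitting script: the function name, the in-memory lists, and the assumption that rows are consumed in stored order are all hypothetical. Whole rows are added while a split is still below its word budget, so the final counts can slightly exceed the 1% target, as they do in the reported figures.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def split_by_word_budget(
    rows: Iterable[Row],
    total_words: int,
    val_frac: float = 0.01,
    test_frac: float = 0.01,
) -> Tuple[Dict[str, List[Row]], Dict[str, int]]:
    """Fill validation, then test, by cumulative word count; rest is train."""
    budgets = {
        "validation": total_words * val_frac,
        "test": total_words * test_frac,
    }
    splits: Dict[str, List[Row]] = {"validation": [], "test": [], "train": []}
    words = {"validation": 0, "test": 0, "train": 0}
    for row in rows:
        n_words = len(row["text"].split())  # whitespace word count
        if words["validation"] < budgets["validation"]:
            name = "validation"
        elif words["test"] < budgets["test"]:
            name = "test"
        else:
            name = "train"
        splits[name].append(row)
        words[name] += n_words
    return splits, words


# Toy data: 25 rows of 4 words each (100 words), with 10% budgets.
toy_rows = [
    {"uri": f"at://did:plc:demo/app.bsky.feed.post/{i}", "text": "a b c d"}
    for i in range(25)
]
toy_splits, toy_words = split_by_word_budget(
    toy_rows, total_words=100, val_frac=0.1, test_frac=0.1
)
print({name: len(part) for name, part in toy_splits.items()}, toy_words)
```

Note that `total_words` has to be known before splitting, which implies a counting pass over the corpus first.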
## Curation pipeline
### Stage 0: source download
The initial source used for Stage 1 was the full `train` split of **`Roronotalt/bluesky`**, downloaded from Hugging Face in non-streaming mode and saved locally before filtering.
### Stage 1: filtering and deduplication of the Hugging Face source
The downloaded source was processed row by row with the following rules:
1. Keep a row only if its `langs` metadata contains `en` or a value starting with `en-`.
2. Drop rows with missing or empty `uri`.
3. Drop rows with missing or empty `text`.
4. Count words using whitespace splitting and drop rows with fewer than 1 word (that is, text consisting only of whitespace).
5. Deduplicate by `uri`.
6. Keep only the columns `uri` and `text`.
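The six rules above could be expressed roughly as follows. This is a sketch under assumptions rather than the original script: the function names are hypothetical, the language match is assumed to be case-sensitive, and an in-memory `set` stands in for whatever structure was actually used to track tens of millions of URIs.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def is_english(langs: Optional[List[str]]) -> bool:
    """Rule 1: `langs` contains 'en' or a tag starting with 'en-'."""
    return any(tag == "en" or tag.startswith("en-") for tag in (langs or []))


def filter_stage1(rows: Iterable[Dict]) -> Iterator[Dict[str, str]]:
    seen_uris = set()
    for row in rows:
        if not is_english(row.get("langs")):
            continue                      # rule 1: not English
        uri, text = row.get("uri"), row.get("text")
        if not uri:
            continue                      # rule 2: missing or empty uri
        if not text:
            continue                      # rule 3: missing or empty text
        if len(text.split()) < 1:
            continue                      # rule 4: no whitespace-delimited words
        if uri in seen_uris:
            continue                      # rule 5: duplicate uri
        seen_uris.add(uri)
        yield {"uri": uri, "text": text}  # rule 6: keep only uri and text


# Toy rows with made-up URIs.
sample = [
    {"uri": "at://did:plc:a/app.bsky.feed.post/1", "text": "hello world", "langs": ["en"]},
    {"uri": "at://did:plc:a/app.bsky.feed.post/2", "text": "bonjour", "langs": ["fr"]},
    {"uri": "at://did:plc:a/app.bsky.feed.post/3", "text": "   ", "langs": ["en-US"]},
    {"uri": "at://did:plc:a/app.bsky.feed.post/1", "text": "hello again", "langs": ["en"]},
    {"uri": "at://did:plc:a/app.bsky.feed.post/4", "text": "colour", "langs": ["fr", "en-GB"]},
    {"uri": "", "text": "no uri", "langs": ["en"]},
]
kept = list(filter_stage1(sample))
print(kept)
```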
Stage 1 statistics:
- **Processed source samples:** 94,967,071
- **English rows detected:** 60,854,982
- **Rows dropped for missing URI:** 0
- **Rows dropped for missing text:** 2,305,001
- **Rows dropped for fewer than 1 word:** 2,640
- **Rows dropped as duplicate URI:** 0
- **Final Stage 1 rows kept:** 58,547,341
- **Final Stage 1 words kept:** 966,013,061
### Stage 2: direct Bluesky extraction to extend the corpus
After Stage 1, a second collection pass was run to add more English Bluesky posts until the corpus reached the target scale.
This step used a custom Go extractor with the following logic:
1. Export all Stage 1 URIs into a lookup file.
2. Query the Bluesky relay at `https://bsky.network` using `SyncListReposByCollection("app.bsky.feed.post")` to discover repositories containing posts.
3. For each discovered DID, resolve its PDS endpoint.
4. Fetch the repository snapshot with `SyncGetRepo`.
5. Stream repository records and keep only records whose path starts with `app.bsky.feed.post/`.
6. Read post text and `langs` metadata from each record.
7. Apply the same English rule as in Stage 1: keep posts only if `langs` contains `en` or a tag starting with `en-`.
8. Drop posts with missing text, missing URI, or fewer than 1 whitespace-counted word.
9. Drop posts whose URI already appeared in Stage 1.
10. Deduplicate within Stage 2 by URI.
11. Stop collection once the additional corpus exceeded the target number of new words needed to bring the merged corpus to roughly 1.5B words.
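The original extractor is written in Go and is not reproduced here. As an illustration of steps 5 to 11 only, the Python sketch below works on records that have already been fetched and decoded, represented as hypothetical `(path, record)` pairs per DID. The function names, the order of the checks, and the choice to test the stopping condition after each repository are assumptions.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

POST_PREFIX = "app.bsky.feed.post/"
Record = Tuple[str, Dict]  # (record path inside the repo, decoded record)


def is_english(langs) -> bool:
    return any(t == "en" or t.startswith("en-") for t in (langs or []))


def extract_posts(
    did: str, records: Iterable[Record], stage1_uris: Set[str], seen_uris: Set[str]
) -> Iterator[Dict[str, str]]:
    """Yield new English posts from one repository (steps 5 to 10)."""
    for path, record in records:
        if not path.startswith(POST_PREFIX):
            continue                       # step 5: posts only
        text = record.get("text")          # step 6: read text and langs
        if not text:
            continue                       # step 8: missing text
        if not is_english(record.get("langs")):
            continue                       # step 7: English rule
        if len(text.split()) < 1:
            continue                       # step 8: no words
        uri = f"at://{did}/{path}"
        if uri in stage1_uris or uri in seen_uris:
            continue                       # steps 9 and 10: duplicates
        seen_uris.add(uri)
        yield {"uri": uri, "text": text}


def collect_until(
    repos: Iterable[Tuple[str, Iterable[Record]]],
    stage1_uris: Set[str],
    target_new_words: int,
) -> Tuple[List[Dict[str, str]], int]:
    """Step 11: stop once enough new words have been collected."""
    seen_uris: Set[str] = set()
    kept: List[Dict[str, str]] = []
    new_words = 0
    for did, records in repos:
        for post in extract_posts(did, records, stage1_uris, seen_uris):
            kept.append(post)
            new_words += len(post["text"].split())
        if new_words >= target_new_words:
            break
    return kept, new_words


# Toy repositories with made-up DIDs and record keys.
repos = [
    ("did:plc:aaa", [
        ("app.bsky.feed.post/1", {"text": "one two three", "langs": ["en"]}),
        ("app.bsky.feed.like/9", {"subject": "ignored"}),
        ("app.bsky.feed.post/2", {"text": "hola", "langs": ["es"]}),
        ("app.bsky.feed.post/3", {"text": "already in stage one", "langs": ["en"]}),
    ]),
    ("did:plc:bbb", [("app.bsky.feed.post/1", {"text": "four five", "langs": ["en-US"]})]),
    ("did:plc:ccc", [("app.bsky.feed.post/1", {"text": "never reached", "langs": ["en"]})]),
]
stage1 = {"at://did:plc:aaa/app.bsky.feed.post/3"}
kept_posts, new_words = collect_until(repos, stage1, target_new_words=5)
print(len(kept_posts), new_words)
```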
Stage 2 statistics:
- **Stage 1 URIs exported for duplicate checking:** 58,547,341
- **Repositories processed:** 8,977
- **Posts seen:** 51,955,104
- **English posts seen:** 33,219,791
- **Dropped for missing text:** 2,611,525
- **Dropped for non-English:** 16,123,788
- **Dropped for fewer than 1 word:** 13,837
- **Dropped for missing URI:** 0
- **Dropped as duplicate with Stage 1:** 921,965
- **Dropped as duplicate within Stage 2:** 0
- **Final Stage 2 rows kept:** 32,283,989
- **Final Stage 2 words kept:** 533,989,535
### Final merge and split materialization
The Stage 1 and Stage 2 datasets were concatenated, then deduplicated once more by `uri` during the final merge step. No additional duplicate URIs were found at this stage.
Final merge statistics:
- **Rows loaded from Stage 1:** 58,547,341
- **Rows loaded from Stage 2:** 32,283,989
- **Rows after concatenation:** 90,831,330
- **Rows dropped for empty URI during final merge:** 0
- **Rows dropped for empty text during final merge:** 0
- **Rows dropped as duplicate URI during final merge:** 0
- **Final total rows:** 90,831,330
- **Final total words:** 1,500,002,598
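As a quick sanity check, the counters reported in this card can be recombined with plain arithmetic. The snippet below only restates figures from the Stage 1, Stage 2, merge, and split statistics; the dictionary layout is an illustrative choice.

```python
# Each entry maps a description to (value recomputed from parts, reported value).
checks = {
    "stage 1 rows kept": (60_854_982 - 2_305_001 - 2_640, 58_547_341),
    "stage 2 posts seen": (33_219_791 + 16_123_788 + 2_611_525, 51_955_104),
    "stage 2 rows kept": (33_219_791 - 13_837 - 921_965, 32_283_989),
    "merged rows": (58_547_341 + 32_283_989, 90_831_330),
    "split rows": (89_062_186 + 891_793 + 877_351, 90_831_330),
    "split words": (1_470_002_529 + 15_000_031 + 15_000_038, 1_500_002_598),
}

mismatches = {name: pair for name, pair in checks.items() if pair[0] != pair[1]}
for name, (computed, reported) in checks.items():
    status = "ok" if computed == reported else "MISMATCH"
    print(f"{name}: computed={computed:,} reported={reported:,} {status}")

# The per-stage word totals are compared separately rather than asserted.
stage_word_gap = 1_500_002_598 - (966_013_061 + 533_989_535)
print(f"final words minus sum of stage words: {stage_word_gap}")
```

All six row and split checks agree with the reported values. The per-stage word totals sum to 1,500,002,596, two fewer than the reported final total; the card does not explain this difference.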
## Curation rules at a glance
- **Language filter:** keep only rows whose language metadata contains `en` or an `en-*` variant.
- **Minimum length:** at least 1 whitespace-counted word.
- **Deduplication:** URI-based.
- **Released columns:** `uri`, `text`.
- **Word counting:** whitespace-based (`split()` / `strings.Fields`).
## Important notes and limitations
1. **English detection relies on metadata, not an external language identification model.** In both stages, English selection was based on the `langs` field attached to the source record.
2. **Deduplication is URI-based, not text-based.** Near-duplicates or repeated text with different URIs may still remain.
3. **The validation and test splits are convenience splits.** They were created by sequential cumulative word count, not by time-based, user-based, or author-disjoint partitioning.
4. **Stage 2 extraction skipped unavailable repositories.** During direct extraction, some repositories were unavailable, deactivated, taken down, not found, or timed out. These failures were logged and skipped rather than retried indefinitely.
5. **The corpus is intended primarily as a large-scale text resource.** It is well suited for tokenizer fitting and corpus-level experimentation, but the provided validation/test splits should not automatically be treated as a benchmark design for all downstream tasks.
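Regarding note 2, users who need text-level deduplication could add their own pass on top of the release. The sketch below is not part of the curation pipeline; the normalization (lowercasing and collapsing whitespace) and the function names are assumptions, and it only catches texts that are identical after that normalization, not fuzzier near-duplicates.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def text_key(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace runs."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def dedupe_by_text(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Keep the first row for each normalized text; drop later repeats."""
    seen_keys = set()
    for row in rows:
        key = text_key(row["text"])
        if key in seen_keys:
            continue
        seen_keys.add(key)
        yield row


# Toy rows with made-up URIs: the first two differ only in case and spacing.
toy = [
    {"uri": "at://did:plc:a/app.bsky.feed.post/1", "text": "Good morning  Bluesky"},
    {"uri": "at://did:plc:b/app.bsky.feed.post/7", "text": "good morning bluesky"},
    {"uri": "at://did:plc:c/app.bsky.feed.post/2", "text": "Something else"},
]
unique_rows = list(dedupe_by_text(toy))
print([row["uri"] for row in unique_rows])
```

Storing fixed-size hashes instead of full strings keeps the `seen_keys` set smaller, though at the scale of this corpus an on-disk or sharded approach may still be needed.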
## Recommended use
This dataset is especially appropriate for:
- tokenizer training,
- corpus statistics,
- vocabulary analysis,
- language-model pretraining or continued pretraining experiments,
- studies of social-media language variation.
## Source attribution
Stage 1 was derived from the Hugging Face dataset **`Roronotalt/bluesky`**.
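Stage 2 was collected directly from Bluesky repositories with the custom Go extractor described in the curation pipeline (repository discovery through the relay at `https://bsky.network`, followed by per-repository `SyncGetRepo` fetches), with the same English filtering and URI-level deduplication rules applied.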
## Acknowledgement
If you use this dataset, please acknowledge both:
1. the original `Roronotalt/bluesky` source used for Stage 1, and
2. this curated release, which adds English filtering, URI-based deduplication, direct Bluesky augmentation, and train/validation/test split materialization.
Provided by:
akarrouch-mohamed



