akarrouch-mohamed/bluesky

Name: akarrouch-mohamed/bluesky
Creator: akarrouch-mohamed
Published: 2026-04-11 12:27:14
License: No description available

Hugging Face | Updated 2026-04-11 | Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/akarrouch-mohamed/bluesky

Resource description:

---
language:
- en
pretty_name: Curated English Bluesky Corpus
tags:
- bluesky
- social-media
- english
- text-corpus
- dataset-curation
size_categories:
- 10M<n<100M
license: apache-2.0
---

# Curated English Bluesky Corpus

## Dataset summary

This dataset is a large English-only Bluesky text corpus curated for tokenizer and language-model experiments.

The corpus was built in two stages:

1. **Stage 1:** a full download of the Hugging Face dataset `Roronotalt/bluesky`, filtered to English posts and deduplicated by post URI.
2. **Stage 2:** additional English Bluesky posts collected directly from Bluesky repositories using a custom extractor, again deduplicated by URI and checked against Stage 1 to avoid overlap.

The final merged corpus contains **90,831,330 rows** and **1,500,002,598 whitespace-counted words**.

Only two columns are kept in the released dataset:

- `uri`: the Bluesky AT URI of the post
- `text`: the post text

## Why this dataset was created

This corpus was curated to provide a large English social-media corpus, with a target size of roughly **1.5 billion words**, for tokenizer fitting and related downstream experiments.

## Data fields

- `uri` (`string`): unique post identifier in AT URI format.
- `text` (`string`): raw post text.

## Dataset splits

The final release is stored as a `DatasetDict` with three splits:

- **train:** 89,062,186 rows
- **validation:** 891,793 rows
- **test:** 877,351 rows

Split creation was based on **cumulative whitespace word counts**, not random row sampling:

- validation was filled first until it reached approximately **1%** of total words,
- test was filled next until it reached approximately **1%** of total words,
- train contains the remainder.
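The sequential fill described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual code: the function name `assign_splits`, its parameters, the in-memory list of rows, and the exact stopping rule (stop filling a split once its running word count reaches the target) are all assumptions. The row order used for the real release is not stated in the card.

```python
def assign_splits(rows, val_fraction=0.01, test_fraction=0.01):
    """Assign rows to validation, test, and train by cumulative word count.

    Hypothetical sketch: rows are dicts with a "text" key, processed in the
    order given. Validation is filled first, then test, then train.
    """
    rows = list(rows)
    # Whitespace-based word counts, as in the card (`split()`).
    word_counts = [len(row["text"].split()) for row in rows]
    total_words = sum(word_counts)

    val_target = total_words * val_fraction
    test_target = total_words * test_fraction

    splits = {"validation": [], "test": [], "train": []}
    val_words = 0
    test_words = 0

    for row, n_words in zip(rows, word_counts):
        if val_words < val_target:
            # Keep filling validation until its word budget is reached.
            splits["validation"].append(row)
            val_words += n_words
        elif test_words < test_target:
            # Then fill test in the same way.
            splits["test"].append(row)
            test_words += n_words
        else:
            # Everything else goes to train.
            splits["train"].append(row)
    return splits
```

Because each split stops only after its target is reached, the last row added can push the count slightly past 1% of the total. That would be consistent with the reported validation and test word counts sitting just above 15,000,000.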
Final split word counts are:

- **train:** 1,470,002,529 words
- **validation:** 15,000,031 words
- **test:** 15,000,038 words

## Curation pipeline

### Stage 0: source download

The initial source used for Stage 1 was the full `train` split of **`Roronotalt/bluesky`**, downloaded from Hugging Face in non-streaming mode and saved locally before filtering.

### Stage 1: filtering and deduplication of the Hugging Face source

The downloaded source was processed row by row with the following rules:

1. Keep a row only if its `langs` metadata contains `en` or a value starting with `en-`.
2. Drop rows with missing or empty `uri`.
3. Drop rows with missing or empty `text`.
4. Count words using whitespace splitting and drop rows with fewer than 1 word.
5. Deduplicate by `uri`.
6. Keep only the columns `uri` and `text`.

Stage 1 statistics:

- **Processed source samples:** 94,967,071
- **English rows detected:** 60,854,982
- **Rows dropped for missing URI:** 0
- **Rows dropped for missing text:** 2,305,001
- **Rows dropped for fewer than 1 word:** 2,640
- **Rows dropped as duplicate URI:** 0
- **Final Stage 1 rows kept:** 58,547,341
- **Final Stage 1 words kept:** 966,013,061

### Stage 2: direct Bluesky extraction to extend the corpus

After Stage 1, a second collection pass was run to add more English Bluesky posts until the corpus reached the target scale.

This step used a custom Go extractor with the following logic:

1. Export all Stage 1 URIs into a lookup file.
2. Query the Bluesky relay at `https://bsky.network` using `SyncListReposByCollection("app.bsky.feed.post")` to discover repositories containing posts.
3. For each discovered DID, resolve its PDS endpoint.
4. Fetch the repository snapshot with `SyncGetRepo`.
5. Stream repository records and keep only records whose path starts with `app.bsky.feed.post/`.
6. Read post text and `langs` metadata from each record.
7. Apply the same English rule as in Stage 1: keep posts only if `langs` contains `en` or a tag starting with `en-`.
8. Drop posts with missing text, missing URI, or fewer than 1 whitespace-counted word.
9. Drop posts whose URI already appeared in Stage 1.
10. Deduplicate within Stage 2 by URI.
11. Stop collection once the additional corpus exceeded the target number of new words needed to bring the merged corpus to roughly 1.5B words.

Stage 2 statistics:

- **Stage 1 URIs exported for duplicate checking:** 58,547,341
- **Repositories processed:** 8,977
- **Posts seen:** 51,955,104
- **English posts seen:** 33,219,791
- **Dropped for missing text:** 2,611,525
- **Dropped for non-English:** 16,123,788
- **Dropped for fewer than 1 word:** 13,837
- **Dropped for missing URI:** 0
- **Dropped as duplicate with Stage 1:** 921,965
- **Dropped as duplicate within Stage 2:** 0
- **Final Stage 2 rows kept:** 32,283,989
- **Final Stage 2 words kept:** 533,989,535

### Final merge and split materialization

The Stage 1 and Stage 2 datasets were concatenated, then deduplicated once more by `uri` during the final merge step. No additional duplicate URIs were found at this stage.

Final merge statistics:

- **Rows loaded from Stage 1:** 58,547,341
- **Rows loaded from Stage 2:** 32,283,989
- **Rows after concatenation:** 90,831,330
- **Rows dropped for empty URI during final merge:** 0
- **Rows dropped for empty text during final merge:** 0
- **Rows dropped as duplicate URI during final merge:** 0
- **Final total rows:** 90,831,330
- **Final total words:** 1,500,002,598

## Curation rules at a glance

- **Language filter:** keep only rows whose language metadata contains `en` or an `en-*` variant.
- **Minimum length:** at least 1 whitespace-counted word.
- **Deduplication:** URI-based.
- **Released columns:** `uri`, `text`.
- **Word counting:** whitespace-based (`split()` / `strings.Fields`).

## Important notes and limitations

1. **English detection relies on metadata, not an external language identification model.** In both stages, English selection was based on the `langs` field attached to the source record.
2. **Deduplication is URI-based, not text-based.** Near-duplicates or repeated text with different URIs may still remain.
3. **The validation and test splits are convenience splits.** They were created by sequential cumulative word count, not by time-based, user-based, or author-disjoint partitioning.
4. **Stage 2 extraction skipped unavailable repositories.** During direct extraction, some repositories were unavailable, deactivated, taken down, not found, or timed out. These failures were logged and skipped rather than retried indefinitely.
5. **The corpus is intended primarily as a large-scale text resource.** It is well suited for tokenizer fitting and corpus-level experimentation, but the provided validation/test splits should not automatically be treated as a benchmark design for all downstream tasks.

## Recommended use

This dataset is especially appropriate for:

- tokenizer training,
- corpus statistics,
- vocabulary analysis,
- language-model pretraining or continued pretraining experiments,
- studies of social-media language variation.

## Source attribution

Stage 1 was derived from the Hugging Face dataset **`Roronotalt/bluesky`**.

Stage 2 was collected directly from Bluesky repositories through a custom extraction pipeline over Bluesky infrastructure, with the same English filtering and URI-level deduplication rules applied.

## Acknowledgement

If you use this dataset, please acknowledge both:

1. the original `Roronotalt/bluesky` source used for Stage 1, and
2. this curated release, which adds English filtering, URI-based deduplication, direct Bluesky augmentation, and train/validation/test split materialization.
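As a rough illustration of the curation rules this card describes (metadata-based language filter, minimum length of one whitespace-counted word, URI-based deduplication, and keeping only `uri` and `text`), the following Python sketch applies them to an in-memory list of records. It is not the author's pipeline: the names `is_english`, `curate`, and `known_uris`, the counter keys, and the dict-shaped records are hypothetical. The checks follow the order listed for Stage 1; the Stage 2 extractor may have applied them in a different order.

```python
from collections import Counter


def is_english(langs):
    """Return True if any language tag is `en` or starts with `en-`."""
    return any(
        isinstance(tag, str) and (tag == "en" or tag.startswith("en-"))
        for tag in (langs or [])
    )


def curate(records, known_uris=frozenset()):
    """Filter and deduplicate records; return (kept_rows, drop_statistics).

    `known_uris` stands in for the Stage 1 URI lookup used during Stage 2
    (an assumption about how that lookup could be represented).
    """
    seen = set()
    stats = Counter()
    kept = []
    for rec in records:
        stats["processed"] += 1
        if not is_english(rec.get("langs")):
            stats["non_english"] += 1
            continue
        uri = rec.get("uri")
        if not uri:
            stats["missing_uri"] += 1
            continue
        text = rec.get("text")
        if not text:
            stats["missing_text"] += 1
            continue
        # Whitespace-only text is non-empty but contains zero words.
        if len(text.split()) < 1:
            stats["too_short"] += 1
            continue
        if uri in known_uris or uri in seen:
            stats["duplicate_uri"] += 1
            continue
        seen.add(uri)
        # Keep only the two released columns.
        kept.append({"uri": uri, "text": text})
    return kept, stats
```

Note that `is_english` matches on the tag prefix `en-` rather than on any string starting with `en`, so a tag such as `eng` would not pass under this reading of the rule.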

Provided by:

akarrouch-mohamed
