ruvimx/UkrLM-social
Hugging Face. Updated 2026-04-15. Indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/ruvimx/UkrLM-social
---
language:
- uk
license: cc-by-4.0
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- ukrainian
- social
- telegram
- reddit
- nlp
- corpus
size_categories:
- 100K<n<1M
---
# UkrLM Social Corpus
A curated corpus of Ukrainian-language text collected from public social media platforms, designed for language model pretraining and fine-tuning.
This dataset is part of the **UkrLM** initiative — an open effort to build foundational NLP resources for the Ukrainian language.
---
## Overview
| Property | Value |
|---|---|
| Language | Ukrainian (`uk`) |
| Sources | Telegram, Reddit |
| License | CC BY 4.0 |
| Format | Parquet |
| Task | Language Modeling |
---
## Sources
**Telegram** — public Ukrainian-language channels covering analysis, culture, and community discussion.
**Reddit** — public posts and comments from Ukrainian-language and Ukraine-focused subreddits including r/ukraine, r/ukr, and others.
---
## Processing Pipeline
Raw text was passed through a multi-stage cleaning pipeline before inclusion:
**Language & quality filtering**
- Ukrainian only — pre-filtered at collection time
- Minimum 15 characters, maximum 10 000 characters
- Minimum 35% alphabetic character ratio
- Minimum 5 Cyrillic characters per record
- Records consisting primarily of anonymization tokens are discarded
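
The thresholds above can be expressed as a single predicate. The following is a minimal sketch, not the project's actual code: the function name `passes_quality_filter`, the choice of total character count as the ratio denominator, and the reading of "primarily anonymization tokens" as "more than half of the characters" are all assumptions made for illustration.

```python
import re

MIN_CHARS = 15
MAX_CHARS = 10_000
MIN_ALPHA_RATIO = 0.35
MIN_CYRILLIC = 5

# The Cyrillic block U+0400-U+04FF covers Ukrainian letters, including і, ї, є, ґ.
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")
# Placeholder tokens named in the anonymization step of this card.
TOKEN_RE = re.compile(r"<(?:PERSON|USER|PHONE|CARD|EMAIL|URL)>")


def passes_quality_filter(text: str, max_token_share: float = 0.5) -> bool:
    """Return True if a record survives the length, ratio, and token checks."""
    n = len(text)
    if n < MIN_CHARS or n > MAX_CHARS:
        return False
    # Assumption: the alphabetic ratio is measured against all characters.
    alpha = sum(ch.isalpha() for ch in text)
    if alpha / n < MIN_ALPHA_RATIO:
        return False
    if len(CYRILLIC_RE.findall(text)) < MIN_CYRILLIC:
        return False
    # Assumption: "primarily tokens" means tokens exceed max_token_share of the text.
    token_chars = sum(len(tok) for tok in TOKEN_RE.findall(text))
    return token_chars / n <= max_token_share
```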
**Noise removal**
- Markdown stripped: `**bold**`, `_italic_`, `~~strikethrough~~`, `> quotes`, headings, inline code
- Ad tails removed line-by-line from the bottom of posts (channel promotions, social links)
- Emoji-heavy lines removed (>50% emoji by character count)
- Separator lines removed: `————`, `___`, etc.
- HTML entities decoded: `&gt;` → `>`, `&amp;` → `&`, etc.
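
As a rough illustration of these noise-removal steps, the sketch below chains them in one function. It is an assumption-laden example rather than the pipeline itself: the ad markers in `AD_LINE`, the emoji code point ranges, the exclusion of whitespace from the emoji ratio, and the order of the steps are all hypothetical choices. Separator lines are dropped before Markdown is stripped so that a line such as `___` is not mistaken for italic markup.

```python
import html
import re

MD_INLINE = [
    (re.compile(r"\*\*(.+?)\*\*"), r"\1"),         # **bold**
    (re.compile(r"~~(.+?)~~"), r"\1"),             # ~~strikethrough~~
    (re.compile(r"(?<!\w)_(.+?)_(?!\w)"), r"\1"),  # _italic_
    (re.compile(r"`([^`]+)`"), r"\1"),             # inline code
]
MD_LINE_PREFIX = re.compile(r"^\s*(?:#{1,6}\s+|>\s?)")  # headings and quotes
# Lines made only of separator characters (hyphen, underscore, long dashes, ...).
SEPARATOR_LINE = re.compile(r"^[-_=*~.\u2014\u2013]{3,}$")
# Hypothetical ad markers, for illustration only.
AD_LINE = re.compile(r"підписатися|підписуйтесь|t\.me/|instagram|<URL>", re.IGNORECASE)


def is_emoji(ch: str) -> bool:
    """Rough range check; a dedicated emoji library would be more complete."""
    cp = ord(ch)
    return (0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF
            or 0x1F1E6 <= cp <= 0x1F1FF)


def is_emoji_heavy(line: str, threshold: float = 0.5) -> bool:
    chars = [ch for ch in line if not ch.isspace()]  # assumption: ignore whitespace
    if not chars:
        return False
    return sum(is_emoji(ch) for ch in chars) / len(chars) > threshold


def strip_ad_tail(lines: list) -> list:
    """Pop blank and ad-like lines from the bottom until real content is reached."""
    lines = list(lines)
    while lines and (not lines[-1].strip() or AD_LINE.search(lines[-1])):
        lines.pop()
    return lines


def clean_text(raw: str) -> str:
    kept = []
    for line in strip_ad_tail(raw.splitlines()):
        stripped = line.strip()
        if SEPARATOR_LINE.match(stripped) or is_emoji_heavy(stripped):
            continue
        line = MD_LINE_PREFIX.sub("", line, count=1)
        for pattern, repl in MD_INLINE:
            line = pattern.sub(repl, line)
        kept.append(html.unescape(line))  # &gt; becomes >, &amp; becomes &
    return "\n".join(kept).strip()
```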
**Anonymization**
- Person names → `<PERSON>`
- Usernames → `<USER>`
- Phone numbers → `<PHONE>`
- Card numbers → `<CARD>`
- Email addresses → `<EMAIL>`
- URLs → `<URL>`
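
The pattern-based replacements can be sketched with regular expressions. This is a simplified illustration under stated assumptions: the patterns and their order are hypothetical and not taken from the actual pipeline, and person names are left out because detecting them usually requires a named-entity recognition model rather than a regex.

```python
import re

# Order matters: URLs and emails are replaced before usernames so that the
# "@" inside an address is not read as a handle; cards go before phones so a
# 16-digit number is not consumed by the looser phone pattern.
RULES = [
    (re.compile(r"https?://\S+|www\.\S+"), "<URL>"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "<EMAIL>"),
    (re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"), "<CARD>"),
    (re.compile(r"(?<!\w)\+?\d[\d\s()-]{8,}\d(?!\w)"), "<PHONE>"),
    (re.compile(r"(?<!\w)@\w{3,}"), "<USER>"),
]


def anonymize(text: str) -> str:
    """Replace pattern-detectable identifiers with placeholder tokens."""
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return text


sample = ("Пишіть @ivan_k або на ivan@example.com, тел. +380 (67) 123-45-67, "
          "картка 4149 4999 1234 5678, сайт https://example.com/page")
print(anonymize(sample))
```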
**Deduplication**
- Exact match deduplication on lowercased, whitespace-normalized text
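
Exact-match deduplication on normalized text can be sketched as follows. The helper names are hypothetical; hashing the normalized text with SHA-256 (to keep the seen-set small) and keeping the first occurrence of each duplicate are assumptions, since the card does not say which copy is retained.

```python
import hashlib


def dedup_key(text: str) -> str:
    """Lowercase, collapse all whitespace runs to one space, then hash."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records: list) -> list:
    """Keep the first record for each distinct normalized text."""
    seen = set()
    unique = []
    for rec in records:
        key = dedup_key(rec["text"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```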
---
## Format
Each record is a flat object with the following fields:
```json
{"text": "...", "source": "telegram", "channel": "nazva_kanalu", "date": "2025-08-05T16:19:38+00:00", "lang": "uk"}
{"text": "...", "source": "reddit", "subreddit": "ukraine", "date": "2024-11-12T10:00:00+00:00", "score": 42, "lang": "uk"}
```
| Field | Description |
|---|---|
| `text` | Cleaned Ukrainian text |
| `source` | Origin platform: `telegram` or `reddit` |
| `channel` | Telegram channel identifier (telegram only) |
| `subreddit` | Subreddit name (reddit only) |
| `date` | ISO 8601 timestamp (may be empty) |
| `score` | Reddit upvote score (reddit only) |
| `lang` | Language code — always `uk` |
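
A small validator can make the per-source field rules concrete. The sketch below is illustrative only: `validate_record` and `parse_date` are hypothetical helpers, and it assumes that fields belonging to the other platform are either absent or null, and that an empty `date` may appear as an empty string or as null.

```python
import json
from datetime import datetime
from typing import Optional

COMMON_FIELDS = {"text", "source", "lang"}
SOURCE_FIELDS = {"telegram": {"channel"}, "reddit": {"subreddit", "score"}}


def parse_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp; empty or missing values yield None."""
    if not value:
        return None
    return datetime.fromisoformat(value)


def validate_record(rec: dict) -> dict:
    source = rec.get("source")
    if source not in SOURCE_FIELDS:
        raise ValueError(f"unknown source: {source!r}")
    required = COMMON_FIELDS | SOURCE_FIELDS[source]
    missing = sorted(f for f in required if rec.get(f) is None)
    if missing:
        raise ValueError(f"missing fields for {source}: {missing}")
    if rec["lang"] != "uk":
        raise ValueError(f"unexpected lang: {rec['lang']!r}")
    # Fields of the other platform must be absent or null.
    other = set().union(*SOURCE_FIELDS.values()) - SOURCE_FIELDS[source]
    stray = sorted(f for f in other if rec.get(f) is not None)
    if stray:
        raise ValueError(f"unexpected fields for {source}: {stray}")
    parse_date(rec.get("date"))  # raises ValueError on a malformed timestamp
    return rec


# The two example records shown in this card, one JSON object per line.
SAMPLE_LINES = [
    '{"text": "...", "source": "telegram", "channel": "nazva_kanalu", '
    '"date": "2025-08-05T16:19:38+00:00", "lang": "uk"}',
    '{"text": "...", "source": "reddit", "subreddit": "ukraine", '
    '"date": "2024-11-12T10:00:00+00:00", "score": 42, "lang": "uk"}',
]
records = [validate_record(json.loads(line)) for line in SAMPLE_LINES]
```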
---
## Part of UkrLM
This dataset is one component of the broader **UkrLM** project, which aims to produce open Ukrainian-language datasets and eventually a trained language model.
Related datasets in this initiative:
- `ruvimx/UkrLM-social` — social corpus (this dataset)
- `ruvimx/UkrLM-wiki` — Ukrainian Wikipedia
---
## License
This dataset is released under [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) and is intended for **research purposes**.
Source platforms (Telegram, Reddit) retain their own terms of service over the original content.
> Built with ❤️ for the Ukrainian NLP community.

Provider: ruvimx



