
ruvimx/UkrLM-social

Hugging Face · Updated 2026-04-15 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/ruvimx/UkrLM-social
Description:
---
language:
- uk
license: cc-by-4.0
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- ukrainian
- social
- telegram
- reddit
- nlp
- corpus
size_categories:
- 100K<n<1M
---

# UkrLM Social Corpus

A curated corpus of Ukrainian-language text collected from public social media platforms, designed for language model pretraining and fine-tuning.

This dataset is part of the **UkrLM** initiative, an open effort to build foundational NLP resources for the Ukrainian language.

---

## Overview

| Property | Value |
|---|---|
| Language | Ukrainian (`uk`) |
| Sources | Telegram, Reddit |
| License | CC BY 4.0 |
| Format | Parquet |
| Task | Language Modeling |

---

## Sources

**Telegram**: public Ukrainian-language channels covering analysis, culture, and community discussion.

**Reddit**: public posts and comments from Ukrainian-language and Ukraine-focused subreddits including r/ukraine, r/ukr, and others.

---

## Processing Pipeline

Raw text was passed through a multi-stage cleaning pipeline before inclusion:

**Language & quality filtering**
- Ukrainian only, pre-filtered at collection time
- Minimum 15 characters, maximum 10 000 characters
- Minimum 35% alphabetic character ratio
- Minimum 5 Cyrillic characters per record
- Records consisting primarily of anonymization tokens are discarded

**Noise removal**
- Markdown stripped: `**bold**`, `_italic_`, `~~strikethrough~~`, `> quotes`, headings, inline code
- Ad tails removed line-by-line from the bottom of posts (channel promotions, social links)
- Emoji-heavy lines removed (>50% emoji by character count)
- Separator lines removed: `————`, `___`, etc.
- HTML entities decoded: `&gt;` → `>`, `&amp;` → `&`, etc.
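
The numeric thresholds in the language and quality filtering rules can be illustrated with a minimal sketch. This is not the project's actual pipeline code: the function name `passes_quality_filter`, the choice of total record length as the denominator of the alphabetic ratio, and the definition of "Cyrillic" as the basic Unicode block U+0400 to U+04FF are all assumptions made for illustration. The rule about records made up mostly of anonymization tokens is left out because the card does not define a threshold for it.

```python
import re

# Thresholds taken from the dataset card.
MIN_CHARS = 15
MAX_CHARS = 10_000
MIN_ALPHA_RATIO = 0.35
MIN_CYRILLIC = 5

# Assumption: "Cyrillic" means the basic Unicode Cyrillic block U+0400-U+04FF.
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")


def passes_quality_filter(text: str) -> bool:
    """Return True if a record satisfies the length, alphabetic-ratio and
    Cyrillic-count rules (hypothetical helper, not the original code)."""
    n = len(text)
    if n < MIN_CHARS or n > MAX_CHARS:
        return False
    # Assumption: the ratio is alphabetic characters over all characters,
    # whitespace and punctuation included.
    alpha = sum(ch.isalpha() for ch in text)
    if alpha / n < MIN_ALPHA_RATIO:
        return False
    return len(CYRILLIC_RE.findall(text)) >= MIN_CYRILLIC
```

Under these assumptions, a short Ukrainian sentence passes, while a six-character greeting, an English-only sentence, or a record dominated by digits and punctuation is rejected.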

**Anonymization**
- Person names → `<PERSON>`
- Usernames → `<USER>`
- Phone numbers → `<PHONE>`
- Card numbers → `<CARD>`
- Email addresses → `<EMAIL>`
- URLs → `<URL>`

**Deduplication**
- Exact match deduplication on lowercased, whitespace-normalized text

---

## Format

Each record is a flat object with the following fields:

```json
{"text": "...", "source": "telegram", "channel": "nazva_kanalu", "date": "2025-08-05T16:19:38+00:00", "lang": "uk"}
{"text": "...", "source": "reddit", "subreddit": "ukraine", "date": "2024-11-12T10:00:00+00:00", "score": 42, "lang": "uk"}
```

| Field | Description |
|---|---|
| `text` | Cleaned Ukrainian text |
| `source` | Origin platform: `telegram` or `reddit` |
| `channel` | Telegram channel identifier (telegram only) |
| `subreddit` | Subreddit name (reddit only) |
| `date` | ISO 8601 timestamp (may be empty) |
| `score` | Reddit upvote score (reddit only) |
| `lang` | Language code, always `uk` |

---

## Part of UkrLM

This dataset is one component of the broader **UkrLM** project, which aims to produce open Ukrainian-language datasets and eventually a trained language model.

Related datasets in this initiative:
- `ruvimx/UkrLM-social`: social corpus (this dataset)
- `ruvimx/UkrLM-wiki`: Ukrainian Wikipedia

---

## License

This dataset is released under [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) for **research purposes**. Source platforms (Telegram, Reddit) retain their own terms of service over the original content.

> Built with ❤️ for the Ukrainian NLP community.
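
As a rough illustration of the anonymization and deduplication steps listed under the processing pipeline, the sketch below replaces a few kinds of identifiers with the placeholder tokens named in the card and then drops exact duplicates on lowercased, whitespace-normalized text. The regular expressions and the names `anonymize`, `dedup_key`, and `deduplicate` are hypothetical. The card does not say how matching was actually done, and person names are not covered here because detecting them would likely need a named-entity recognizer rather than a pattern.

```python
import re

# Hypothetical patterns, applied in this order; none come from the dataset card.
# Cards are matched before phones so a 16-digit number is not tagged <PHONE>.
PATTERNS = [
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<EMAIL>"),
    (re.compile(r"(?<!\w)@\w{3,}"), "<USER>"),
    (re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"), "<CARD>"),
    (re.compile(r"\+?\d(?:[ ()-]{0,2}\d){9,14}"), "<PHONE>"),
]


def anonymize(text: str) -> str:
    """Replace matched identifiers with the card's placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text


def dedup_key(text: str) -> str:
    """Lowercase and collapse all whitespace runs to single spaces."""
    return " ".join(text.lower().split())


def deduplicate(texts):
    """Keep the first occurrence of each record, comparing by dedup_key."""
    seen = set()
    unique = []
    for text in texts:
        key = dedup_key(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

Keeping the first occurrence is an assumption as well; the card only states that matching is exact on the normalized text, not which copy is retained.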
Provider:
ruvimx