
ruvimx/UkrLM-wiki

Source: Hugging Face. Updated 2026-04-15. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ruvimx/UkrLM-wiki
Description:
---
license: cc-by-sa-4.0
pretty_name: UkrLM — Wikipedia Corpus
configs:
- config_name: wikipedia
  data_files:
  - split: train
    path: wikipedia/train-*
tags:
- ukrainian
- pretraining
- wikipedia
size_categories:
- 100M<n<1B
dataset_info:
- config_name: wikipedia
  features:
  - name: text
    dtype: large_string
  - name: title
    dtype: large_string
  - name: source
    dtype: large_string
  - name: license
    dtype: large_string
  - name: date
    dtype: timestamp[us]
  splits:
  - name: train
    num_bytes: 5647322338
    num_examples: 1134416
  download_size: 2631278461
  dataset_size: 5647322338
language:
- uk
---

# UkrLM — Wikipedia Corpus

A large-scale Ukrainian Wikipedia corpus for language model pretraining. Includes cleaned Ukrainian Wikipedia (uk.wikipedia.org).

## Subsets

| Config | Description | Records |
|--------|-------------|---------|
| `wikipedia` | Articles from uk.wikipedia.org, cleaned from wikimarkup | 1,134,432 |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("ruvimx/UkrLM-wiki", "wikipedia")
```

## Fields

| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Article text |
| `title` | string | Article title |
| `source` | string | Source identifier (`wikipedia_uk`) |
| `license` | string | Content license (`CC-BY-SA-4.0`) |
| `date` | string | Collection date |

## Known Limitations

Some articles contain recurring templated sections that are structurally similar but not identical across entries.

## License

Content is licensed under **CC-BY-SA 4.0** in accordance with Wikipedia's licensing terms. See [Creative Commons](https://creativecommons.org/licenses/by-sa/4.0/).

## Citation

```bibtex
@dataset{savytskyi2026ukrlm,
  author = {Savytskyi, Ruvim},
  title = {UkrLM: Ukrainian Wikipedia Corpus},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/ruvimx/UkrLM-wiki}
}
```
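
## Streaming example (sketch)

The card metadata lists a download size of roughly 2.6 GB, so it may be convenient to iterate over the corpus without materializing it locally. The following is a minimal sketch, not part of the original card: it assumes the `streaming=True` option of `datasets.load_dataset` and relies on the `text` and `title` fields listed under Fields. The helper names `iter_articles` and `stream_wikipedia` and the `min_chars` threshold are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator


def iter_articles(records: Iterable[Dict], min_chars: int = 200) -> Iterator[Dict]:
    """Yield records whose `text` field holds at least `min_chars` characters.

    `min_chars` is an illustrative threshold, not a value from the dataset card.
    """
    for record in records:
        # Treat a missing or null `text` as an empty article.
        text = record.get("text") or ""
        if len(text.strip()) >= min_chars:
            yield record


def stream_wikipedia():
    """Open the `wikipedia` config in streaming mode (requires network access)."""
    # Imported lazily so the filter above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "ruvimx/UkrLM-wiki", "wikipedia", split="train", streaming=True
    )
```

A typical call would be `for article in iter_articles(stream_wikipedia()): print(article["title"])`. Because `iter_articles` accepts any iterable of dict-like records, the same filter also works on a fully downloaded split.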
Provider:
ruvimx