itsSHAS/clean_ukrainian-news

Name: itsSHAS/clean_ukrainian-news
Creator: itsSHAS
Published: 2026-01-25 00:39:38
License: No description available

Source: Hugging Face. Updated 2026-01-25. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/itsSHAS/clean_ukrainian-news

Description:

---
license: unknown
task_categories:
- text-generation
language:
- uk
pretty_name: ukr-news
size_categories:
- 10M<n<100M
tags:
- news
---

# Ukrainian News Dataset

This is a dataset of news articles downloaded from various Ukrainian websites and Telegram channels.

The dataset contains 22 567 099 JSON objects (news items), about 67GB in total, each with the following fields:

```json
title: The title of the news article
text: The text of the news article, which may contain HTML tags (e.g., paragraphs, links, images, etc.)
url: The URL of the news article
datetime: The time of publication or when the article was parsed and added to the dataset
owner: The name of the website that published the news article
```

- Count of news from websites: 16 022 416
- Count of Telegram posts: 6 544 683

The JSON objects are divided into parts, and the dataset is available for download via Hugging Face. The terms of use state that all data in this dataset is under the copyright of the owners of the respective websites.

## Accessing the Dataset

The dataset is available for download via the Hugging Face datasets library. You can install the library via pip:

```bash
pip install datasets
```

Once you have installed the library, you can load the dataset using the following code:

```python
from datasets import load_dataset

dataset = load_dataset('zeusfsx/ukrainian-news')
```

This loads the entire dataset, which is about 67GB. If you prefer to load only a subset of the data, you can specify the split argument:

```python
# Load only the first 10,000 examples from the "train" split
dataset = load_dataset('zeusfsx/ukrainian-news', split='train[:10000]')
```

## Contacts

If you have any questions or comments about this dataset, please contact me by email at zeusfsxtmp@gmail.com. I will do my best to respond to your inquiry as soon as possible.

## License

The dataset is made available under the terms of use specified by the owners of the respective websites. Please consult the individual websites for more information on their terms of use.
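
## Example: stripping HTML from the `text` field

Because the `text` field may contain HTML tags, a cleaning step is often useful before text-generation work. The following is a minimal sketch using only the Python standard library. The helper name `strip_html`, the set of block-level tags, and the sample record are assumptions made for illustration; they are not part of the dataset or its tooling.

```python
from html.parser import HTMLParser

# Tags treated as block-level here (an assumption for this sketch):
# a space is inserted where they open so adjacent blocks do not run together.
BLOCK_TAGS = {"p", "br", "div", "li"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Collapse the whitespace left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def strip_html(html_text):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.text()


# Hypothetical record shaped like the fields described above.
record = {
    "title": "Example title",
    "text": "<p>Hello, <a href='https://example.com'>world</a>!</p>",
    "url": "https://example.com/article",
    "datetime": "2023-01-01T00:00:00",
    "owner": "example.com",
}
plain = strip_html(record["text"])
print(plain)
```

With a dataset loaded as shown earlier, the same helper could be applied per record, for example with `dataset.map(lambda r: {"plain_text": strip_html(r["text"])})`, where `plain_text` is an illustrative column name.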

Provider:

itsSHAS
