
shamotskyi/ukr_pravda_2y

Hugging Face, updated 2024-02-15, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/shamotskyi/ukr_pravda_2y
Description:

---
license: cc-by-nc-4.0
language:
- uk
- en
- ru
pretty_name: Ukrainska Pravda articles in ukr/rus/eng published on or after 01.01.2022
multilinguality:
- multilingual
---

This dataset contains the articles from [Ukrainska Pravda](https://www.pravda.com.ua/) of the years 2022-2023, in all translations.

The dataset was created as part of my Master's Thesis, better documentation will follow. For now:

### Basics

One row of the dataset contains one article in up to three languages (ukr-rus-eng), each with its corresponding title, author and tags.

Different translations of the same article often have inconsistent tags, so the main `tags` column contains the representations of the tags from all languages (each tag is named after its URI on the UP website).

The mapping of each tag to its URIs and names in all the languages it's present in is found in the `tags_mapping.json` file in the metadata. The list of URIs for all downloaded articles can be found there as well.

### Files

- Two versions:
  - The version 0.0.1 (split name `incomplete`) covers articles from 01.01.2022 until 12.12.2023 and is kept for now because it's used in some other datasets
  - **The version 0.0.2 (split name `train`) is the one you need** and contains all articles from 01.01.2022 till 31.12.2023
- File structure:
  - `data/train` is the full 2y 0.0.2 dataset, the one you need
  - `data/incomplete` is the old 0.0.1 version
  - `metadata/` contains the tags mappings and list of downloaded URIs for both versions

### The rest

- **<https://serhii.net/dtb/2023-12-13-231213-1710-ukrainska-pravda-dataset/>** is the draft of the relevant thesis section
- **[pchr8/up_crawler](https://github.com/pchr8/up_crawler)** is the crawler I wrote to gather this dataset

For any questions, my first name is Serhii, and my email is my_first_name@my_first_name.net.
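The two versions and their split names can be turned into a small loading helper. This is a minimal sketch, assuming the repository loads through the Hugging Face `datasets` library with its default configuration; the helper names `split_for` and `load_articles` are made up for illustration and are not part of the dataset.

```python
# Repository id and split names are taken from the dataset card.
REPO_ID = "shamotskyi/ukr_pravda_2y"
SPLIT_FOR_VERSION = {
    "0.0.1": "incomplete",  # 01.01.2022 until 12.12.2023
    "0.0.2": "train",       # 01.01.2022 till 31.12.2023, the recommended one
}


def split_for(version: str = "0.0.2") -> str:
    """Return the split name that the card documents for a version."""
    return SPLIT_FOR_VERSION[version]


def load_articles(version: str = "0.0.2"):
    """Download one version of the dataset (hypothetical helper).

    Assumption: the default configuration of the repository exposes the
    splits named in the card. Requires `pip install datasets` and network
    access, so the import is deferred until the function is called.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split_for(version))


if __name__ == "__main__":
    articles = load_articles()  # defaults to version 0.0.2 / split "train"
    print(articles)
```

Calling `load_articles("0.0.1")` would fetch the older `incomplete` split instead, which is only useful for reproducing work that was built on it.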
Provided by:
shamotskyi
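Summary of original information

Dataset overview

Basic information

  • License: cc-by-nc-4.0
  • Languages: uk, en, ru
  • Name: Ukrainska Pravda articles in ukr/rus/eng published on or after 01.01.2022
  • Multilinguality: multilingual

Data content

  • Source: Ukrainska Pravda
  • Time range: 2022-2023
  • Content: articles with title, author and tags, in up to three languages (ukr-rus-eng)
  • Tag handling: the main tags column contains the tag representations from all languages; each tag is named after its URI
  • Tag mapping: the tags_mapping.json file maps each tag to its URIs and names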
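The tag handling described above can be illustrated with a short lookup. The internal layout of `tags_mapping.json` is not documented here, so the structure below (tag key, then language, then `name` and `uri`) is a hypothetical stand-in, as are the sample tag and the `tag_names` helper; check the real file before relying on it.

```python
import json

# HYPOTHETICAL layout and sample data: the real tags_mapping.json may be
# organised differently. Only the idea (tag -> URIs and names per language)
# comes from the dataset description.
SAMPLE_MAPPING_JSON = """
{
  "example-tag": {
    "ukr": {"name": "Приклад", "uri": "https://example.org/tags/example-tag/"},
    "eng": {"name": "Example", "uri": "https://example.org/eng/tags/example-tag/"}
  }
}
"""


def tag_names(tags, mapping, lang):
    """Translate tag keys into display names for one language.

    Tags that are missing from the mapping, or that have no entry for the
    requested language, are returned unchanged.
    """
    names = []
    for tag in tags:
        entry = mapping.get(tag, {}).get(lang)
        names.append(entry["name"] if entry else tag)
    return names


mapping = json.loads(SAMPLE_MAPPING_JSON)
print(tag_names(["example-tag", "unknown-tag"], mapping, "eng"))
```

Falling back to the raw tag key keeps rows usable when a tag exists in only some of the three languages, which the description says happens often.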

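File structure

  • Versions:
    • 0.0.1 (incomplete): articles from 01.01.2022 to 12.12.2023
    • 0.0.2 (train): articles from 01.01.2022 to 31.12.2023
  • File paths:
    • data/train: the full 0.0.2 dataset
    • data/incomplete: the old 0.0.1 dataset
    • metadata/: contains the tag mappings and the list of downloaded article URIs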