
sdadas/8tags

Hugging Face | Updated 2024-01-19 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/sdadas/8tags
Resource description:
---
language:
- pl
license:
- cc-by-nc-sa-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
task_categories:
- text-classification
task_ids:
- topic-classification
- multi-class-classification
pretty_name: 8TAGS
dataset_info:
  features:
  - name: sentence
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          0: film
          1: history
          2: food
          3: medicine
          4: motorization
          5: work
          6: sport
          7: technology
  splits:
  - name: train
    num_bytes: 3765325
    num_examples: 40001
  - name: validation
    num_bytes: 467676
    num_examples: 5000
  - name: test
    num_bytes: 416311
    num_examples: 4372
---

# 8TAGS

### Dataset Summary

A Polish topic classification dataset consisting of headlines from social media posts. It contains about 50,000 sentences annotated with 8 topic labels: film, history, food, medicine, motorization, work, sport and technology. This dataset was created automatically by extracting sentences from headlines and short descriptions of articles posted on Polish social networking site **wykop.pl**. The service allows users to annotate articles with one or more tags (categories). Dataset represents a selection of article sentences from 8 popular categories. The resulting corpus contains cleaned and tokenized, unambiguous sentences (tagged with only one of the selected categories), and longer than 30 characters.

### Data Instances

Example instance:

```
{
  "sentence": "Kierowca był nieco zdziwiony że podróżując sporo ponad 200 km / h zatrzymali go policjanci.",
  "label": "4"
}
```

### Data Fields

- sentence: sentence text
- label: label identifier corresponding to one of 8 topics

### Citation Information

```
@inproceedings{dadas-etal-2020-evaluation,
    title = "Evaluation of Sentence Representations in {P}olish",
    author = "Dadas, Slawomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.207",
    pages = "1674--1680",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
```
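The example instance stores its label as the string "4", while the card metadata lists label names by integer id. A minimal sketch of mapping one to the other follows; the helper name `label_name` and the constant `LABEL_NAMES` are illustrative and not part of the dataset or any library.

```python
import json

# Label ids and names copied from the class_label metadata in the dataset card.
LABEL_NAMES = ["film", "history", "food", "medicine",
               "motorization", "work", "sport", "technology"]


def label_name(label):
    """Return the topic name for a label given as an int or a numeric string."""
    return LABEL_NAMES[int(label)]


# The example instance from the card, written here as a JSON string.
raw = ('{"sentence": "Kierowca był nieco zdziwiony że podróżując sporo ponad '
       '200 km / h zatrzymali go policjanci.", "label": "4"}')
instance = json.loads(raw)

topic = label_name(instance["label"])
print(topic)  # motorization
```

Converting with `int(...)` lets the same helper accept both the string form shown in the example and an integer id.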
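Provided by:
sdadas
Summary of original information

8TAGS dataset overview

Basic dataset information

  • Language: Polish (pl)
  • License: CC-BY-NC-SA-4.0
  • Multilinguality: monolingual
  • Size: 10,000 < n < 100,000
  • Task category: text classification
  • Task IDs:
    • topic classification
    • multi-class classification
  • Pretty name: 8TAGS

Dataset features

  • Features:
    • sentence: string
    • label: class label, one of the following classes: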
      • 0: film
      • 1: history
      • 2: food
      • 3: medicine
      • 4: motorization
      • 5: work
      • 6: sport
      • 7: technology

Dataset splits

  • Training set:
    • num_bytes: 3,765,325
    • num_examples: 40,001
  • Validation set:
    • num_bytes: 467,676
    • num_examples: 5,000
  • Test set:
    • num_bytes: 416,311
    • num_examples: 4,372
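
As a quick check of the "about 50,000" figure, the split sizes listed above can be summed and expressed as shares of the whole. This is a small arithmetic sketch using only the numbers in the listing; the variable names are illustrative.

```python
# Example counts per split, copied from the listing above.
SPLIT_SIZES = {"train": 40001, "validation": 5000, "test": 4372}

total = sum(SPLIT_SIZES.values())
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

print(total)  # 49373
for name, share in shares.items():
    # Print each split's share of all examples as a percentage.
    print(f"{name}: {share:.1%}")
```

The total is 49,373 examples, split roughly 81% / 10% / 9% across train, validation and test.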

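Dataset description

  • Content: about 50,000 sentences from headlines of Polish social media posts, annotated with 8 topic labels.
  • Source: extracted automatically from headlines and short descriptions of articles on the Polish social networking service wykop.pl.
  • Characteristics: sentences are cleaned and tokenized, longer than 30 characters, and each is tagged with only one topic category.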
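
The selection rule described above (exactly one of the 8 selected categories, more than 30 characters) can be sketched as a filter. This is an illustration under stated assumptions, not the authors' extraction pipeline: the function `select`, the tag strings, and the sample records are hypothetical, and the real wykop.pl tags may be spelled differently.

```python
# The 8 selected categories, using the label names from the dataset card.
# Assumption: the actual wykop.pl tag strings may differ from these names.
SELECTED_TAGS = {"film", "history", "food", "medicine",
                 "motorization", "work", "sport", "technology"}
MIN_CHARS = 30  # sentences must be longer than this many characters


def select(sentence, tags):
    """Return the single matching topic if the sentence qualifies, else None."""
    matched = SELECTED_TAGS.intersection(tags)
    if len(matched) == 1 and len(sentence) > MIN_CHARS:
        return next(iter(matched))
    return None


# Sentence taken from the dataset card; the tag lists below are made up.
example = ("Kierowca był nieco zdziwiony że podróżując sporo ponad "
           "200 km / h zatrzymali go policjanci.")

kept = select(example, ["motorization", "police"])     # one selected tag
ambiguous = select(example, ["motorization", "work"])  # two selected tags
too_short = select("Krótki tytuł", ["film"])           # 12 characters

print(kept, ambiguous, too_short)  # motorization None None
```

Tags outside the 8 selected categories (such as the hypothetical "police") are ignored, so a sentence is only rejected as ambiguous when it carries two or more of the selected ones.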
Collected summary
Dataset introduction
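Background and challenges
Background overview
8tags is a Polish topic classification dataset of about 50,000 sentences annotated with 8 topics (film, history, food, and others) for text classification. It was built automatically from Polish social media posts; the sentences are cleaned and tokenized, instances are in JSON format, and the dataset is suited to multi-class classification research.
The above content was collected and summarized by 遇见数据集.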
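
For readers who want to fetch the data, a minimal loading sketch follows. It assumes the Hugging Face `datasets` package is installed and that the repository loads with the standard `load_dataset` call; neither is confirmed by this page. The `use_mirror` flag and the function name `load_8tags` are illustrative. The `HF_ENDPOINT` environment variable is the usual way to point the Hugging Face client at a mirror, and it needs to be set before the client libraries are first imported.

```python
import os

DATASET_ID = "sdadas/8tags"                # repository id from this page
MIRROR_ENDPOINT = "https://hf-mirror.com"  # host of the download link above


def load_8tags(use_mirror=False):
    """Load all splits of 8TAGS (needs network access and `datasets`)."""
    if use_mirror:
        # Only takes effect if the Hugging Face libraries are not imported yet.
        os.environ.setdefault("HF_ENDPOINT", MIRROR_ENDPOINT)
    # Imported lazily so that defining this function has no side effects.
    from datasets import load_dataset
    return load_dataset(DATASET_ID)
```

Calling `load_8tags()` should return the train, validation and test splits if the repository is laid out as the card metadata suggests; whether the `label` column arrives as an integer class id or as a string (as in the example instance) is worth checking after loading.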