sdadas/8tags
Hugging Face | Updated 2024-01-19 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/sdadas/8tags
Resource description:
---
language:
- pl
license:
- cc-by-nc-sa-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
task_categories:
- text-classification
task_ids:
- topic-classification
- multi-class-classification
pretty_name: 8TAGS
dataset_info:
  features:
  - name: sentence
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          0: film
          1: history
          2: food
          3: medicine
          4: motorization
          5: work
          6: sport
          7: technology
  splits:
  - name: train
    num_bytes: 3765325
    num_examples: 40001
  - name: validation
    num_bytes: 467676
    num_examples: 5000
  - name: test
    num_bytes: 416311
    num_examples: 4372
---
# 8TAGS
### Dataset Summary
A Polish topic classification dataset built from headlines of social media posts. It contains about 50,000 sentences, each annotated with one of 8 topic labels: film, history, food, medicine, motorization, work, sport and technology. The dataset was created automatically by extracting sentences from the headlines and short descriptions of articles posted on the Polish social networking site **wykop.pl**. The service lets users annotate articles with one or more tags (categories). The dataset is a selection of article sentences from 8 popular categories. The resulting corpus contains only cleaned and tokenized sentences that are unambiguous (tagged with exactly one of the selected categories) and longer than 30 characters.
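The selection rule described above (exactly one of the eight selected tags, and more than 30 characters) could be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the `keep_sentence` helper and its input shape are hypothetical, the English label names stand in for whatever tags wykop.pl actually uses, and the cleaning and tokenization steps are not reproduced.

```python
# Hypothetical sketch of the selection rule; not the authors' actual pipeline.
# Assumption: English label names stand in for the site's real tag names.
SELECTED_TAGS = {
    "film", "history", "food", "medicine",
    "motorization", "work", "sport", "technology",
}
MIN_CHARS = 30  # the summary says sentences are "longer than 30 characters"


def keep_sentence(sentence, tags):
    """Return the single selected tag if the sentence qualifies, else None."""
    matched = SELECTED_TAGS.intersection(tags)
    if len(matched) != 1:
        # Either no selected tag, or several (ambiguous): drop the sentence.
        return None
    if len(sentence) <= MIN_CHARS:
        # Too short: the corpus keeps only sentences longer than 30 characters.
        return None
    return next(iter(matched))
```

Tags outside the eight selected categories are ignored here, so a sentence tagged `sport` plus an unrelated tag still counts as unambiguous. Whether the original extraction treated such cases the same way is not stated in the summary.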
### Data Instances
Example instance:
```
{
  "sentence": "Kierowca był nieco zdziwiony że podróżując sporo ponad 200 km / h zatrzymali go policjanci.",
  "label": "4"
}
```
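The label in the instance above is an identifier, not a topic name. A minimal sketch of mapping it back to a name, using the class order from the card's metadata, might look like this. The `label_name` helper is hypothetical, and it accepts both integers and numeric strings because the example shows the label as the string `"4"`.

```python
# Topic names in the order given by the class_label metadata (ids 0..7).
LABEL_NAMES = [
    "film", "history", "food", "medicine",
    "motorization", "work", "sport", "technology",
]


def label_name(label):
    """Map a label id (int or numeric string such as "4") to its topic name."""
    index = int(label)
    if not 0 <= index < len(LABEL_NAMES):
        raise ValueError(f"label id out of range: {label!r}")
    return LABEL_NAMES[index]


instance = {
    "sentence": "Kierowca był nieco zdziwiony że podróżując sporo ponad "
                "200 km / h zatrzymali go policjanci.",
    "label": "4",
}
print(label_name(instance["label"]))  # motorization
```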
### Data Fields
- sentence: sentence text
- label: label identifier corresponding to one of 8 topics
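As a rough sketch, the two fields and the three splits could be loaded and sanity-checked as below. It assumes the third-party `datasets` library is installed and that the Hub identifier `sdadas/8tags` resolves as on the dataset page. The helper names (`load_8tags`, `split_fractions`, `check_split_sizes`) are made up for illustration, and the expected sizes are copied from the card's metadata.

```python
# Split sizes as stated in the dataset card metadata.
EXPECTED_SPLITS = {"train": 40001, "validation": 5000, "test": 4372}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_8tags():
    """Load the dataset from the Hugging Face Hub (requires network access)."""
    # Imported lazily so the other helpers work without the library installed.
    from datasets import load_dataset  # third-party: pip install datasets
    return load_dataset("sdadas/8tags")


def check_split_sizes(dataset):
    """Compare the length of each split against the card's stated size."""
    return {
        name: len(dataset[name]) == expected
        for name, expected in EXPECTED_SPLITS.items()
    }
```

A typical use would be `ds = load_8tags()` followed by `check_split_sizes(ds)`. The stated sizes add up to 49,373 examples, which matches the "about 50,000" figure in the summary, with roughly 81% of them in the training split.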
### Citation Information
```
@inproceedings{dadas-etal-2020-evaluation,
title = "Evaluation of Sentence Representations in {P}olish",
author = "Dadas, Slawomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}",
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.207",
pages = "1674--1680",
language = "English",
ISBN = "979-10-95546-34-4",
}
```
Provided by:
sdadas
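Summary of original information
8TAGS dataset overview
Basic dataset information
- Language: Polish (pl)
- License: CC-BY-NC-SA-4.0
- Multilinguality: monolingual
- Size: 10,000 < n < 100,000
- Task category: text classification
- Task IDs:
  - topic classification
  - multi-class classification
- Pretty name: 8TAGS
Dataset features
- Features:
  - sentence: string
  - label: class label, one of the following:
    - 0: film
    - 1: history
    - 2: food
    - 3: medicine
    - 4: motorization
    - 5: work
    - 6: sport
    - 7: technology
Dataset splits
- Training set:
  - num_bytes: 3,765,325
  - num_examples: 40,001
- Validation set:
  - num_bytes: 467,676
  - num_examples: 5,000
- Test set:
  - num_bytes: 416,311
  - num_examples: 4,372
Dataset description
- Content: about 50,000 headlines from Polish social media posts, annotated with 8 topic labels.
- Source: extracted automatically from article headlines and short descriptions on the Polish social networking service wykop.pl.
- Characteristics: sentences are cleaned and tokenized, are longer than 30 characters, and each carries only one topic category.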

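Background and challenges
Background overview
8tags is a Polish topic classification dataset of about 50,000 sentences labeled with 8 topics (such as film, history and food) for text classification tasks. It was built automatically from Polish social media posts; the sentences are cleaned and tokenized, stored in JSON format, and suited to multi-class classification research.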



