MLRS/maltese_news_categories
Source: Hugging Face. Updated 2024-04-23; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/MLRS/maltese_news_categories
Summary:
Maltese News Categories is a multi-label topic classification dataset for Maltese news articles. The dataset is sourced from the `press_mt` subset of Korpus Malti v4.0, and has undergone cleaning and filtering to remove JavaScript, CSS, and duplicated non-Maltese subheadings. The labels are derived from the `category` field of the original corpus, with further merging and filtering applied. The dataset consists of training, validation, and test splits, which contain 10,784, 2,293, and 2,297 samples respectively.
Provider:
MLRS
Original information summary
Maltese News Categories dataset overview
Dataset information
Language
- Maltese (mt)
License
- CC-BY-NC-SA-4.0
Size category
- 10K < n < 100K
Task category
- Text classification
Features
- url: string
- title: string
- base_url: string
- text: string
- labels: sequence of class labels
  - Class label names:
    - 0: court
    - 1: covid
    - 2: culture
    - 3: eu
    - 4: economy
    - 5: education
    - 6: entertainment
    - 7: environment
    - 8: health
    - 9: immigration
    - 10: international
    - 11: opinion
    - 12: politics
    - 13: religion
    - 14: social
    - 15: sports
    - 16: transport
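Because `labels` is a sequence of class ids, most classifiers need it converted to a fixed-length multi-hot vector. The following is a minimal sketch of that conversion; the helper names `to_names` and `to_multi_hot` and the example ids are illustrative and not part of the dataset.

```python
# Label names in index order, copied from the feature listing above.
LABEL_NAMES = [
    "court", "covid", "culture", "eu", "economy", "education",
    "entertainment", "environment", "health", "immigration",
    "international", "opinion", "politics", "religion", "social",
    "sports", "transport",
]


def to_names(label_ids):
    """Map a sequence of class ids to their label names."""
    return [LABEL_NAMES[i] for i in label_ids]


def to_multi_hot(label_ids, num_labels=len(LABEL_NAMES)):
    """Turn a sequence of class ids into a fixed-length 0/1 vector."""
    vector = [0] * num_labels
    for label_id in label_ids:
        vector[label_id] = 1
    return vector


# Hypothetical article tagged with two labels (ids 1 and 8).
example_labels = [1, 8]
print(to_names(example_labels))
print(to_multi_hot(example_labels))
```

A multi-hot target of this shape pairs naturally with a per-label sigmoid output, since an article may carry more than one label.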
Data splits
- train:
  - Bytes: 19700614
  - Examples: 10784
- validation:
  - Bytes: 4286743
  - Examples: 2293
- test:
  - Bytes: 4560168
  - Examples: 2297
Download size
- 16511339 bytes
Dataset size
- 28547525 bytes
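As a quick consistency check on the figures above, the per-split byte counts should add up to the reported dataset size, and the example counts give the relative size of each split. This is a small arithmetic sketch that uses only the numbers listed here; the variable names are illustrative.

```python
# Per-split figures copied from the listing above.
SPLITS = {
    "train": {"bytes": 19700614, "examples": 10784},
    "validation": {"bytes": 4286743, "examples": 2293},
    "test": {"bytes": 4560168, "examples": 2297},
}
DATASET_SIZE_BYTES = 28547525

total_bytes = sum(split["bytes"] for split in SPLITS.values())
total_examples = sum(split["examples"] for split in SPLITS.values())

# Share of all examples held by each split, in percent.
split_share = {
    name: round(100 * split["examples"] / total_examples, 1)
    for name, split in SPLITS.items()
}

print(total_bytes == DATASET_SIZE_BYTES)
print(total_examples)
print(split_share)
```

The shares work out to roughly a 70/15/15 split across train, validation, and test.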
Configuration
- default:
  - Data file paths:
    - train: data/train-*
    - validation: data/validation-*
    - test: data/test-*
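With a single `default` configuration and three split patterns, the dataset can most likely be loaded through the Hugging Face `datasets` library without naming a configuration. The sketch below assumes that library is installed and that the repository is reachable from your environment (access may require accepting terms or authenticating); `load_splits` is a hypothetical helper name.

```python
REPO_ID = "MLRS/maltese_news_categories"
SPLIT_NAMES = ("train", "validation", "test")


def load_splits(repo_id=REPO_ID):
    """Download the dataset and return a dict of split name -> Dataset."""
    # Imported lazily so that defining this helper does not require
    # the `datasets` package or network access.
    from datasets import load_dataset

    dataset = load_dataset(repo_id)  # uses the default configuration
    return {name: dataset[name] for name in SPLIT_NAMES}


# Example usage (requires network access):
#     splits = load_splits()
#     for name, split in splits.items():
#         print(name, split.num_rows)
```

If the download succeeds, the row counts printed by the usage example should match the split sizes listed above.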
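Data collection
The data is sourced from the `press_mt` subset of Korpus Malti v4.0. Article text was cleaned to remove JavaScript, CSS, and repeated non-Maltese subheadings. The labels are based on the corpus's `category` field.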
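Data cleaning steps
- Documents with generic categories (such as `News`, `Local`, `Headlines`, `Uncategorised`, `Archived`) are ignored.
- Some categories are merged to unify the labels assigned by different news portals (for example, `European Union` and `Unjoni Ewropea` are merged into `EU`).
- Categories with fewer than 100 articles are ignored.
- Categories that co-occur with other categories more than 75% of the time are removed.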
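The steps above can be sketched as a small label-cleaning function. This is not the authors' code: the order of the steps, the exact definition of co-occurrence (here, the share of a category's articles that also carry one specific other category), and the decision to drop a document only when no label survives are all assumptions made for illustration. The merge map contains only the one example given above.

```python
from collections import Counter

# Generic categories and the single merge example named in the steps above.
GENERIC = {"News", "Local", "Headlines", "Uncategorised", "Archived"}
MERGE_MAP = {"European Union": "EU", "Unjoni Ewropea": "EU"}


def clean_labels(docs, min_articles=100, max_cooccurrence=0.75):
    """Clean raw category sets; `docs` is a list of sets of category strings.

    Returns the cleaned label sets, omitting documents left without labels.
    """
    # Strip generic categories and merge portal-specific spellings.
    normalised = [
        {MERGE_MAP.get(c, c) for c in cats if c not in GENERIC}
        for cats in docs
    ]

    # Keep only categories with at least `min_articles` articles.
    counts = Counter(c for cats in normalised for c in cats)
    kept = {c for c, n in counts.items() if n >= min_articles}

    # Count how often each ordered pair of kept categories co-occurs.
    pair_counts = Counter()
    for cats in normalised:
        present = cats & kept
        for a in present:
            for b in present:
                if a != b:
                    pair_counts[(a, b)] += 1

    # Assumed rule: drop category `a` if more than `max_cooccurrence` of
    # its articles also carry some other single category `b`. Two
    # categories that always appear together would both be dropped.
    redundant = {
        a for (a, b), n in pair_counts.items()
        if n / counts[a] > max_cooccurrence
    }
    kept -= redundant

    cleaned = [cats & kept for cats in normalised]
    return [cats for cats in cleaned if cats]


# Toy corpus with a lowered threshold so the effect is visible.
toy_docs = [
    {"News", "Sports"},
    {"Sports"},
    {"European Union"},
    {"Unjoni Ewropea", "Football", "Sports"},
    {"Football", "Sports"},
    {"Local"},
    {"Rare"},
]
print(clean_labels(toy_docs, min_articles=2))
```

In the toy corpus, the hypothetical `Football` category always appears alongside `Sports`, so it is removed as redundant, while `Rare` falls below the article threshold.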
Label distribution
| Tag | Token Count | Article Count | Article Percentage |
|---|---|---|---|
| Court | 404,329 | 860 | 5.59 |
| Covid | 458,120 | 1,735 | 11.29 |
| Culture | 750,406 | 2,186 | 14.22 |
| EU | 93,227 | 240 | 1.56 |
| Economy | 112,972 | 321 | 2.09 |
| Education | 56,084 | 191 | 1.24 |
| Entertainment | 837,248 | 3,147 | 20.47 |
| Environment | 38,522 | 147 | 0.96 |
| Health | 81,630 | 290 | 1.89 |
| Immigration | 29,665 | 120 | 0.78 |
| International | 784,878 | 3,957 | 25.74 |
| Opinion | 231,266 | 321 | 2.09 |
| Politics | 682,007 | 1,294 | 8.42 |
| Religion | 186,300 | 465 | 3.02 |
| Social | 98,127 | 203 | 1.32 |
| Sports | 835,484 | 3,066 | 19.94 |
| Transport | 74,959 | 241 | 1.57 |
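The percentages in the table are consistent with dividing each article count by the total number of articles across the three splits (10,784 + 2,293 + 2,297 = 15,374). That denominator is inferred from the arithmetic rather than stated in the source. The short check below recomputes the column from the counts; the variable names are illustrative.

```python
# Article counts and reported percentages copied from the table above.
ARTICLE_COUNTS = {
    "Court": 860, "Covid": 1735, "Culture": 2186, "EU": 240,
    "Economy": 321, "Education": 191, "Entertainment": 3147,
    "Environment": 147, "Health": 290, "Immigration": 120,
    "International": 3957, "Opinion": 321, "Politics": 1294,
    "Religion": 465, "Social": 203, "Sports": 3066, "Transport": 241,
}
REPORTED_PERCENT = {
    "Court": 5.59, "Covid": 11.29, "Culture": 14.22, "EU": 1.56,
    "Economy": 2.09, "Education": 1.24, "Entertainment": 20.47,
    "Environment": 0.96, "Health": 1.89, "Immigration": 0.78,
    "International": 25.74, "Opinion": 2.09, "Politics": 8.42,
    "Religion": 3.02, "Social": 1.32, "Sports": 19.94, "Transport": 1.57,
}

# Assumed denominator: all articles in train + validation + test.
TOTAL_ARTICLES = 10784 + 2293 + 2297

computed_percent = {
    tag: round(100 * count / TOTAL_ARTICLES, 2)
    for tag, count in ARTICLE_COUNTS.items()
}

# Label assignments exceed the article count because the task is multi-label.
total_assignments = sum(ARTICLE_COUNTS.values())
labels_per_article = total_assignments / TOTAL_ARTICLES

print(computed_percent)
print(total_assignments, round(labels_per_article, 2))
```

Since articles can carry several labels, the percentage column sums to more than 100.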



