arubenruben/portuguese-language-identification-raw
Source: Hugging Face. Updated 2024-01-08; indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/arubenruben/portuguese-language-identification-raw
Resource description:
---
dataset_info:
- config_name: journalistic
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 1312620204.0
    num_examples: 1845205
  download_size: 869968625
  dataset_size: 1312620204.0
- config_name: legal
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 1338097227.0
    num_examples: 5211975
  download_size: 821524458
  dataset_size: 1338097227.0
- config_name: literature
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 33472546
    num_examples: 82744
  download_size: 21387497
  dataset_size: 33472546
- config_name: politics
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 64856376.0
    num_examples: 47344
  download_size: 37697313
  dataset_size: 64856376.0
- config_name: social_media
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 372374266.0
    num_examples: 3074774
  download_size: 267382814
  dataset_size: 372374266.0
- config_name: web
  features:
  - name: text
    dtype: string
  - name: domain
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 915101054.0
    num_examples: 200000
  download_size: 485541943
  dataset_size: 915101054.0
configs:
- config_name: journalistic
  data_files:
  - split: train
    path: journalistic/train-*
- config_name: legal
  data_files:
  - split: train
    path: legal/train-*
- config_name: literature
  data_files:
  - split: train
    path: literature/train-*
- config_name: politics
  data_files:
  - split: train
    path: politics/train-*
- config_name: social_media
  data_files:
  - split: train
    path: social_media/train-*
- config_name: web
  data_files:
  - split: train
    path: web/train-*
---
Provider:
arubenruben
Summary of original information
Dataset overview
Dataset configurations
1. News (journalistic)
- Features:
  - text: string
  - label: class label (pt-PT, pt-BR)
- Splits:
  - train: 1,845,205 examples, 1,312,620,204 bytes
- Download size: 869,968,625 bytes
- Dataset size: 1,312,620,204 bytes
2. Legal (legal)
- Features:
  - text: string
  - label: 64-bit integer
- Splits:
  - train: 5,211,975 examples, 1,338,097,227 bytes
- Download size: 821,524,458 bytes
- Dataset size: 1,338,097,227 bytes
3. Literature (literature)
- Features:
  - text: string
  - label: 64-bit integer
- Splits:
  - train: 82,744 examples, 33,472,546 bytes
- Download size: 21,387,497 bytes
- Dataset size: 33,472,546 bytes
4. Politics (politics)
- Features:
  - text: string
  - label: 64-bit integer
- Splits:
  - train: 47,344 examples, 64,856,376 bytes
- Download size: 37,697,313 bytes
- Dataset size: 64,856,376 bytes
5. Social media (social_media)
- Features:
  - text: string
  - label: class label (pt-PT, pt-BR)
- Splits:
  - train: 3,074,774 examples, 372,374,266 bytes
- Download size: 267,382,814 bytes
- Dataset size: 372,374,266 bytes
6. Web (web)
- Features:
  - text: string
  - domain: string
  - label: 64-bit integer
- Splits:
  - train: 200,000 examples, 915,101,054 bytes
- Download size: 485,541,943 bytes
- Dataset size: 915,101,054 bytes
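To compare the six configurations at a glance, the per-config figures listed above can be aggregated with a few lines of Python. This is a minimal sketch: the numbers are copied from the card's metadata, while the names `CONFIG_STATS` and `summarize` are introduced here purely for illustration.

```python
# Per-config statistics copied from the dataset card's metadata.
# Each entry: (num_examples, dataset_size_bytes, download_size_bytes).
CONFIG_STATS = {
    "journalistic": (1_845_205, 1_312_620_204, 869_968_625),
    "legal": (5_211_975, 1_338_097_227, 821_524_458),
    "literature": (82_744, 33_472_546, 21_387_497),
    "politics": (47_344, 64_856_376, 37_697_313),
    "social_media": (3_074_774, 372_374_266, 267_382_814),
    "web": (200_000, 915_101_054, 485_541_943),
}


def summarize(stats):
    """Return total examples, total download bytes, and mean bytes per example by config."""
    total_examples = sum(n for n, _, _ in stats.values())
    total_download = sum(d for _, _, d in stats.values())
    mean_bytes = {name: size / n for name, (n, size, _) in stats.items()}
    return total_examples, total_download, mean_bytes


if __name__ == "__main__":
    total_examples, total_download, mean_bytes = summarize(CONFIG_STATS)
    print(f"total examples: {total_examples:,}")
    print(f"total download: {total_download / 1e9:.2f} GB")
    # List configs from shortest to longest average example.
    for name, value in sorted(mean_bytes.items(), key=lambda kv: kv[1]):
        print(f"{name:>12}: {value:,.0f} bytes/example")
```

By these figures, `web` examples are the largest on average and `social_media` examples the smallest, which may matter when choosing truncation lengths or batch sizes.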
Data file paths
- News (journalistic): journalistic/train-*
- Legal (legal): legal/train-*
- Literature (literature): literature/train-*
- Politics (politics): politics/train-*
- Social media (social_media): social_media/train-*
- Web (web): web/train-*
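Each configuration exposes a single `train` split, so one way to read a configuration is through the Hugging Face `datasets` library. The sketch below is hedged: it assumes `datasets` is installed and the repository is reachable, and the helper names `decode_label` and `stream_config` are hypothetical. The card declares the `pt-PT`/`pt-BR` label names only for `journalistic` and `social_media`; applying the same mapping to the `int64` labels of the other configurations is an assumption, not something the card states.

```python
# Label names declared on the card for `journalistic` and `social_media`.
# ASSUMPTION: the int64 labels in the other configs use the same encoding.
LABEL_NAMES = {0: "pt-PT", 1: "pt-BR"}

REPO_ID = "arubenruben/portuguese-language-identification-raw"
CONFIGS = ("journalistic", "legal", "literature", "politics", "social_media", "web")


def decode_label(label):
    """Map an integer label to its variety tag, or raise on unknown values."""
    try:
        return LABEL_NAMES[int(label)]
    except KeyError:
        raise ValueError(f"unexpected label: {label!r}") from None


def stream_config(config_name):
    """Return a streaming iterator over the train split of one config."""
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config: {config_name!r}")
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    # streaming=True avoids downloading the whole config before iterating.
    return load_dataset(REPO_ID, config_name, split="train", streaming=True)
```

Iterating over `stream_config("literature")` should yield dictionaries with `text` and `label` keys (plus `domain` for `web`), and `decode_label(row["label"])` turns the integer into a readable tag. Streaming is chosen here because several configurations are hundreds of megabytes to download.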



