udmurtNLP/udmurt_glotcc

Name: udmurtNLP/udmurt_glotcc
Creator: udmurtNLP
Published: 2024-06-16 10:19:34
License: no description available

Source: Hugging Face. Updated 2024-06-16; indexed 2024-06-22.

Download link:

https://hf-mirror.com/datasets/udmurtNLP/udmurt_glotcc

Overview:

Each record carries the page content plus WARC metadata (target URI, date, record ID), along with quality warnings, categories, language identification (label, probability, and consistency), script percentage, sentence count, content length, and a TLSH hash. The dataset has a single train split of 722 samples totalling 5,756,800 bytes. The language of the dataset is Udmurt.

Provider:

udmurtNLP

Original information summary

Dataset overview

Dataset information

Features

content: string
warc-target-uri: string
warc-date: string
warc-record-id: string
quality-warnings: sequence of strings
categories: sequence of strings
identification-language: string
identification-prob: float
identification-consistency: float
script-percentage: float
num-sents: integer
content-length: integer
tlsh: string
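
The feature list above can be read as a record schema. The following is a minimal sketch, using only the Python standard library, of how one might check that a record has the listed fields with the listed types. The field names and types come from the list; the helper `validate_record` and every value in `sample_record` are hypothetical and only illustrate the shape of a record.

```python
from typing import Any

# Field names and types as listed in the feature list.
EXPECTED_FIELDS = {
    "content": str,
    "warc-target-uri": str,
    "warc-date": str,
    "warc-record-id": str,
    "quality-warnings": list,
    "categories": list,
    "identification-language": str,
    "identification-prob": float,
    "identification-consistency": float,
    "script-percentage": float,
    "num-sents": int,
    "content-length": int,
    "tlsh": str,
}

# Fields described as sequences of strings.
SEQUENCE_FIELDS = {"quality-warnings", "categories"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record fits."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
        elif name in SEQUENCE_FIELDS and not all(isinstance(v, str) for v in value):
            problems.append(f"{name}: expected every item to be str")
    return problems


# Hypothetical record: all values are invented for illustration only.
sample_record = {
    "content": "example text",
    "warc-target-uri": "https://example.org/page",
    "warc-date": "2024-01-01T00:00:00Z",
    "warc-record-id": "<urn:uuid:00000000-0000-0000-0000-000000000000>",
    "quality-warnings": [],
    "categories": ["example"],
    "identification-language": "udm",
    "identification-prob": 0.99,
    "identification-consistency": 0.95,
    "script-percentage": 1.0,
    "num-sents": 3,
    "content-length": 12,
    "tlsh": "T1EXAMPLE",
}

print(validate_record(sample_record))
```

A check like this is mainly useful when filtering or re-exporting records, where a field silently changing type (for example `num-sents` arriving as a string) would otherwise go unnoticed.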

Splits

train:
- Bytes: 5756800
- Samples: 722

Dataset size

Download size: 2822526 bytes
Dataset size: 5756800 bytes
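
The size figures above allow two quick derived numbers: the average size of a sample and the ratio of download size to dataset size. The sketch below only does arithmetic on the three values stated in this listing. It assumes, as is usual for Hugging Face dataset metadata, that the dataset size refers to the decoded data and the download size to the stored files; the listing itself does not say so.

```python
# Figures taken from the listing.
DATASET_BYTES = 5_756_800
DOWNLOAD_BYTES = 2_822_526
NUM_SAMPLES = 722

# Average size of one sample in the train split.
avg_bytes_per_sample = DATASET_BYTES / NUM_SAMPLES

# Fraction of the dataset size that actually has to be downloaded.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"average sample size: {avg_bytes_per_sample:.0f} bytes")
print(f"download / dataset size: {download_ratio:.1%}")
```

This works out to roughly 8 KB per sample, with the download being about 49% of the dataset size.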

Configuration

config_name: default
data_files:
  - split: train
    path: data/train-*
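
The configuration above maps the train split to every file matching `data/train-*`. The sketch below approximates that matching with the standard-library `fnmatch` module and shows one way to load the split with the Hugging Face `datasets` library. The file names in `hypothetical_files` are invented for illustration, `fnmatch` is only a stand-in for however the hub resolves the pattern, and `load_train` needs the third-party `datasets` package plus network access, so it is defined but not called here.

```python
from fnmatch import fnmatch

REPO_ID = "udmurtNLP/udmurt_glotcc"
TRAIN_PATTERN = "data/train-*"  # path pattern from the configuration


def matching_files(paths, pattern=TRAIN_PATTERN):
    """Return the paths that match the split's glob pattern."""
    return [p for p in paths if fnmatch(p, pattern)]


def load_train():
    """Load the train split from the Hugging Face Hub.

    Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset  # imported lazily: third-party dependency

    return load_dataset(REPO_ID, split="train")


# Hypothetical repository listing, for illustration only.
hypothetical_files = [
    "README.md",
    "data/train-00000-of-00001.parquet",
    "data/test-00000-of-00001.parquet",
]

print(matching_files(hypothetical_files))
```

Calling `load_train()` should return a dataset whose length matches the 722 samples stated above, provided the hub copy has not changed since this listing was indexed.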
