pstroe/cc100-latin

Name: pstroe/cc100-latin
Creator: pstroe
Published: 2022-11-02 14:28:12
License: no description available

Hugging Face: updated 2022-11-02, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/pstroe/cc100-latin


Resource summary:

This dataset contains the Latin part of the CC100 corpus and is intended for training RoBERTa-based language models. The data has been preprocessed: pseudo-Latin text was removed, sentences were segmented and normalized with CLTK, only lines made up of Latin letters, digits, and certain punctuation marks were kept, and duplicates were removed. The dataset is organized into a training set and a test set, each containing Latin text.

Provider:

pstroe

Summary of original information

Dataset overview

Dataset name

Latin part of cc100 corpus

Intended use

Training RoBERTa-based language models.

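Preprocessing steps

Removal of all "pseudo-Latin" text (for example, "Lorem ipsum ...").
Sentence segmentation and normalization with CLTK.
Retention of only those lines that contain Latin letters, digits, and certain punctuation marks.
Deduplication.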
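
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the creator's actual pipeline: the allowed-character pattern, the "lorem ipsum" marker, and the function names are assumptions, and a naive regex splitter stands in for CLTK's sentence segmentation and normalization.

```python
import re

# Assumed character whitelist (hypothetical): the source only says
# "Latin letters, digits, and certain punctuation marks".
ALLOWED_LINE = re.compile(r"[A-Za-z0-9.,;:?!\- ]+")

# Assumed marker for pseudo-Latin filler text.
PSEUDO_LATIN = re.compile(r"lorem ipsum", re.IGNORECASE)


def split_sentences(text):
    """Naive stand-in for CLTK: split after '.', '?' or '!' plus whitespace,
    then collapse runs of whitespace inside each sentence."""
    parts = re.split(r"(?<=[.?!])\s+", text)
    return [" ".join(part.split()) for part in parts if part.strip()]


def preprocess(documents):
    """Drop pseudo-Latin documents, split into sentences, keep only
    whitelisted lines, and remove exact duplicates (first one wins)."""
    seen = set()
    kept = []
    for doc in documents:
        if PSEUDO_LATIN.search(doc):
            continue  # step 1: remove pseudo-Latin text
        for sentence in split_sentences(doc):  # step 2: segment and normalize
            if not ALLOWED_LINE.fullmatch(sentence):
                continue  # step 3: character filter
            if sentence in seen:
                continue  # step 4: deduplication
            seen.add(sentence)
            kept.append(sentence)
    return kept
```

Deduplicating on the normalized sentence rather than on the raw line means that copies differing only in whitespace are also collapsed; whether the original pipeline did this is not stated.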

Dataset size

Approximately 390 million tokens.

Dataset structure

train: contains multiple text samples.
test: contains multiple text samples.
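
As a rough illustration of working with the two splits, the sketch below loads the dataset with the Hugging Face `datasets` library and reports how many samples each split holds. It assumes the repository loads with its default configuration and that `datasets` is installed; the helper names `load_cc100_latin` and `split_sizes` are hypothetical.

```python
def split_sizes(dataset_dict):
    """Return {split_name: number_of_samples} for a mapping of splits."""
    return {name: len(rows) for name, rows in dataset_dict.items()}


def load_cc100_latin():
    """Download the dataset from the Hugging Face Hub (network access needed).

    Assumes the repository loads with its default configuration.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset("pstroe/cc100-latin")
```

Calling `split_sizes(load_cc100_latin())` should then return one entry per split, which is expected to be `train` and `test` according to the structure described above.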

Contact

Contact person: Phillip Ströbel
Email: pstroebel@cl.uzh.ch
Twitter: CLingophil
