projecte-aina/CATalog

Name: projecte-aina/CATalog
Creator: projecte-aina
Published: 2025-07-23 04:39:10
License: Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International

Source: Hugging Face. Updated 2025-07-23. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/projecte-aina/CATalog

Overview:

CATalog is a diverse, open-source Catalan corpus for language modeling. It contains text documents from 26 distinct sources, including web crawls, news, forums, digital libraries, and public institutions, totaling 17.45 billion words. Supported tasks are fill-mask, text generation, and language modeling. The data is distributed in JSONL format; each document carries a document identifier, its text, a quality score, the strategy used to evaluate it, its language, and a URL when available. The corpus was built from filtered CommonCrawl snapshots and manually selected source-specific corpora, processed with the CURATE pipeline for deduplication, language identification, and heuristic scoring. Its main goal is to provide a large-scale, flexible, neutrally scored corpus to support the training of multilingual language models.

Provider:

projecte-aina

Summary of original information

Dataset overview

Basic information

Name: CATalog
Language: Catalan (ca)
License: Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International
Multilinguality: monolingual
Size: 10B<n<100B

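Dataset structure

Format: JSONL
Fields:
- id: document identifier
- text: document text
- score: document quality score
- strategy: strategy used to evaluate document quality
- languages: document language
- url: document URL (if available)
Splits:
- train: 34,314,510 examples, 115,827,685,843 bytes in total (about 115.8 GB)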
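
As an illustration of the record layout above, here is a minimal Python sketch that reads JSONL lines and keeps documents whose score meets a threshold. The helper name `iter_documents`, the sample records, the `strategy` and `languages` values, and the 0.5 threshold are all assumptions made for the example; the listing does not state the actual score range or the exact value types of these fields.

```python
import io
import json
from typing import Iterator, TextIO


def iter_documents(stream: TextIO, min_score: float = 0.0) -> Iterator[dict]:
    """Yield JSONL records whose 'score' is at least min_score.

    Hypothetical helper: field names follow the listing above, but the
    threshold semantics are an assumption, not part of the dataset card.
    """
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        doc = json.loads(line)
        if doc.get("score", 0.0) >= min_score:
            yield doc


# Two made-up records that mimic the documented fields (values are illustrative).
sample = io.StringIO(
    '{"id": "doc-0", "text": "Bon dia.", "score": 0.9, '
    '"strategy": "example", "languages": "ca", "url": null}\n'
    '{"id": "doc-1", "text": "Text curt.", "score": 0.2, '
    '"strategy": "example", "languages": "ca", "url": null}\n'
)

kept = list(iter_documents(sample, min_score=0.5))
print([doc["id"] for doc in kept])
```

Reading line by line keeps memory use flat, which matters for a split of this size; the same generator could be pointed at an open file handle instead of the in-memory sample.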