
proxectonos/corpusnos

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/proxectonos/corpusnos
Resource description:
---
pretty_name: CorpusNÓS v.3.0
language:
- gl
license: other
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- galician
- corpus
- language-modeling
- pretraining
- jsonl
- web-corpus
- ocr
- low-resource-nlp
size_categories:
- 1M<n<10M
configs:
- config_name: public_data_encyclopedic
  data_files:
  - split: train
    path:
    - public_data/encyclopedic/*.jsonl
    - public_data/encyclopedic/**/*.jsonl
- config_name: public_data_press_and_blogs
  data_files:
  - split: train
    path:
    - public_data/press_and_blogs/*.jsonl
    - public_data/press_and_blogs/**/*.jsonl
- config_name: public_data_translation_corpora
  data_files:
  - split: train
    path:
    - public_data/translation_corpora/*.jsonl
    - public_data/translation_corpora/**/*.jsonl
- config_name: public_data_web_crawls
  data_files:
  - split: train
    path:
    - public_data/web_crawls/*.jsonl
    - public_data/web_crawls/**/*.jsonl
- config_name: dta_books
  data_files:
  - split: train
    path:
    - data_transfer_agreement/books/*.jsonl
    - data_transfer_agreement/books/**/*.jsonl
- config_name: dta_encyclopedic
  data_files:
  - split: train
    path:
    - data_transfer_agreement/encyclopedic/*.jsonl
    - data_transfer_agreement/encyclopedic/**/*.jsonl
- config_name: dta_governmental
  data_files:
  - split: train
    path:
    - data_transfer_agreement/governmental/*.jsonl
    - data_transfer_agreement/governmental/**/*.jsonl
- config_name: dta_press_and_blogs
  data_files:
  - split: train
    path:
    - data_transfer_agreement/press_and_blogs/*.jsonl
    - data_transfer_agreement/press_and_blogs/**/*.jsonl
- config_name: dta_research_articles
  data_files:
  - split: train
    path:
    - data_transfer_agreement/research_articles/*.jsonl
    - data_transfer_agreement/research_articles/**/*.jsonl
- config_name: dta_web_contents
  data_files:
  - split: train
    path:
    - data_transfer_agreement/web_contents/*.jsonl
    - data_transfer_agreement/web_contents/**/*.jsonl
---

# CorpusNÓS v.3.0

## Dataset description

CorpusNÓS is a massive Galician corpus primarily devised for training large language models.
It is composed of texts from a wide range of sources and genres, including books, research articles, press, governmental texts, encyclopedic data, web contents, web crawls, blogs, and translation corpora.

This release corresponds to an updated JSONL-based version of the corpus. Compared with previous releases, this version incorporates improvements in the text cleaning and processing pipeline, stronger deduplication, improved OCR for materials originating from PDF sources, and the reprocessing of part of the data in order to improve overall quality. Unlike earlier releases, this version is distributed only in JSONL format. Each document is represented as an individual JSON object, which facilitates downstream filtering, cleaning, and metadata enrichment.

Some newly incorporated data from *Praza Pública*, *Diario Nós* and *Wikipedia* are also included in this version. These sources are expected to be extended in future releases as more material is processed.

## Dataset format

Each document is stored as a JSON object in JSONL format.
Typical entries may have the following structure:

```json
{"id": 0, "text": "Abades: Parroquia do concello de Baltar baixo a advocación de san Paio.", "num_words": 12}
{"id": 581, "text": "Feliz 2008 a tódolos nosos lectores\nAgora que remata 2007, un ano cheo de novidades tecnolóxicas que difundimos a través deste espazo dixital, queremos desexar a tódolos que non seguen con fidelidade unha boa despedida do ano e un feliz aninovo.\nNós volveremos o mércores, 2 de xaneiro, á nosa actividade ordinaria, cumprindo coa nosa labor informativa para que as novas tecnolóxicas de Galicia e en galego cheguen ós nosos lectores puntualmente.", "num_words": 72, "pyplexity_score": 717.7585757844212, "lang": "gl"}
```

The core fields are:

- `id`: document identifier
- `text`: textual content of the document
- `num_words`: number of words in the document

Some entries may also include additional metadata such as:

- `pyplexity_score`
- `lang`

## Differences from previous versions

This version differs from earlier releases in several ways:

- It includes improvements in the text cleaning and processing pipeline
- It includes improved OCR for files originating from PDF sources
- It includes reprocessed data to improve quality
- It incorporates new data from *Praza Pública* and *Diario Nós*, which will be expanded in future versions
- It includes a cleaner and updated version of Galician Wikipedia data, which will be expanded in future versions

## Dataset composition

The corpus is organized into two major subcorpora:

- **Data obtained via transfer agreement**
- **Public data**

## Data source and creation

CorpusNÓS was compiled from multiple heterogeneous sources in Galician and related multilingual resources relevant to language model training. The corpus combines public data and data made available through transfer agreements.
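
The two subcorpora correspond to the config names declared in the card's front matter: `public_data_*` for public data and `dta_*` for data obtained via transfer agreement. The following is a minimal sketch of that mapping, not an official loader. It assumes the Hugging Face `datasets` library and the repository id `proxectonos/corpusnos` (taken from the download link); the helper names `group_by_subcorpus` and `load_genre` are hypothetical. Whether the `dta_*` configs need additional access steps is not stated in this card.

```python
# Config names copied from the `configs` block of this dataset card.
CONFIG_NAMES = [
    "public_data_encyclopedic",
    "public_data_press_and_blogs",
    "public_data_translation_corpora",
    "public_data_web_crawls",
    "dta_books",
    "dta_encyclopedic",
    "dta_governmental",
    "dta_press_and_blogs",
    "dta_research_articles",
    "dta_web_contents",
]

# Prefix -> subcorpus directory, as used in the `path` globs of the card.
PREFIXES = {
    "public_data_": "public_data",
    "dta_": "data_transfer_agreement",
}


def group_by_subcorpus(names):
    """Group config names into the two subcorpora by their prefix (hypothetical helper)."""
    groups = {subcorpus: [] for subcorpus in PREFIXES.values()}
    for name in names:
        for prefix, subcorpus in PREFIXES.items():
            if name.startswith(prefix):
                groups[subcorpus].append(name[len(prefix):])
                break
    return groups


def load_genre(config_name):
    """Stream the train split of one config (hypothetical helper).

    Assumes `pip install datasets` and network access to the Hub.
    """
    # Imported lazily so the grouping above works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "proxectonos/corpusnos", config_name, split="train", streaming=True
    )


if __name__ == "__main__":
    for subcorpus, genres in group_by_subcorpus(CONFIG_NAMES).items():
        print(subcorpus, genres)
```

Streaming is used here only as a design choice to avoid downloading a whole genre before inspecting it; calling `load_genre("public_data_encyclopedic")` and iterating over the result would yield one document at a time.
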
This version was created through an updated processing pipeline that includes:

- improved text cleaning
- stronger deduplication
- improved OCR for PDF-derived materials
- reprocessing of selected resources to improve quality
- conversion and standardization into JSONL format

The aim of this release is to provide a cleaner, more structured, and more easily processable corpus for large-scale language modeling and related NLP tasks in Galician.

## Current statistics

### Subcorpus: Data obtained via transfer agreement

| Genre | Nº tokens | Nº documents |
|------|----------:|-------------:|
| Books | 7,074,354 | 103 |
| Research articles | 3,005,739 | 499 |
| Press and blogs | 98,086,871 | 204,598 |
| Governmental | 260,333,471 | 408,607 |
| Web contents | 15,390,104 | 41,276 |
| Encyclopedic | 4,799,208 | 47,396 |
| **Subtotal** | **388,689,747** | **702,479** |

### Subcorpus: Public data

| Genre | Nº tokens | Nº documents |
|------|----------:|-------------:|
| Press and blogs | 142,144,734 | 598,375 |
| Encyclopedic | 75,630,525 | 226,964 |
| Web crawls | 1,205,208,148 | 2,850,604 |
| Translation corpora | 105,523,634 | 3,544,025 |
| **Subtotal** | **1,528,507,041** | **7,219,968** |

### Total

| Total tokens | Total documents |
|-------------:|----------------:|
| 1,917,196,788 | 7,922,447 |

## Intended uses

This dataset can be used for:

- continued pretraining of language models in Galician
- corpus-based analysis of Galician text
- low-resource NLP research
- multilingual and cross-lingual experiments involving Galician
- data selection, filtering, and quality analysis for LLM training

## Limitations

- Some files referenced in the corpus may still be absent in this version due to pending transfer agreements and may be included in future releases. Other files might be unavailable due to licensing issues.
- The corpus contains materials from heterogeneous sources and genres, which implies variation in style, register, and quality.
- Although this version includes stronger cleaning, deduplication, OCR improvement, and reprocessing, some noise may still remain.
- Some subcorpora are subject to their original licenses and restrictions.

## Licensing

This dataset includes materials under different licenses depending on the original source. Please refer to the original source licenses.

In particular, the following subcorpora retain their original licenses:

- TED2020: CC BY-NC-ND 4.0
- mC4: Apache License 2.0
- OSCAR: CC0

All other subcorpora that do not have a previously established original license are released under CC BY-SA 4.0.

Users are responsible for checking the license conditions of each subcorpus before use.

## Citation

Please refer to our paper for more details:

```bibtex
@inproceedings{de-dios-flores-etal-2024-corpusnos,
    title = "{C}orpus{N{\'O}S}: A massive {G}alician corpus for training large language models",
    author = "de-Dios-Flores, Iria and
      Su{\'a}rez, Silvia Paniagua and
      P{\'e}rez, Cristina Carbajal and
      Outeiri{\~n}o, Daniel Bardanca and
      Garcia, Marcos and
      Gamallo, Pablo",
    editor = "Gamallo, Pablo and
      Claro, Daniela and
      Teixeira, Ant{\'o}nio and
      Real, Livy and
      Garcia, Marcos and
      Oliveira, Hugo Gon{\c{c}}alo and
      Amaro, Raquel",
    booktitle = "Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1",
    month = mar,
    year = "2024",
    address = "Santiago de Compostela, Galicia/Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.propor-1.66/",
    pages = "593--599"
}
```

## Acknowledgements

This corpus was compiled and developed within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública and by the European Union through NextGenerationEU, within the framework of the [ILENIA project](https://proyectoilenia.es/) (reference 2022/TL22/00215336).
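
## Usage sketch

As a usage note on the entry structure described under Dataset format: `pyplexity_score` and `lang` appear only on some entries, so code reading the JSONL files should treat them as optional. The sketch below is one way to do that, under stated assumptions. The function name `iter_documents`, its parameters, and the decision to keep entries that lack a metadata field are all hypothetical choices, not part of the corpus tooling. The second sample line is invented for illustration.

```python
import json

# First line is the example from this card; the second is a made-up entry
# (hypothetical) that carries the optional metadata fields.
SAMPLE_LINES = [
    '{"id": 0, "text": "Abades: Parroquia do concello de Baltar baixo a advocación de san Paio.", "num_words": 12}',
    '{"id": 1, "text": "Texto de exemplo", "num_words": 3, "pyplexity_score": 717.76, "lang": "gl"}',
    "",
]


def iter_documents(lines, lang=None, max_perplexity=None):
    """Yield parsed entries, optionally filtering on the optional metadata.

    Assumption: entries without `lang` or `pyplexity_score` are kept rather
    than dropped, since the card says only some entries carry these fields.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        doc = json.loads(line)
        doc_lang = doc.get("lang")
        if lang is not None and doc_lang is not None and doc_lang != lang:
            continue
        score = doc.get("pyplexity_score")
        if max_perplexity is not None and score is not None and score > max_perplexity:
            continue
        yield doc


if __name__ == "__main__":
    for doc in iter_documents(SAMPLE_LINES, lang="gl", max_perplexity=500.0):
        print(doc["id"], doc["num_words"], doc["text"][:40])
```

The same generator accepts an open file handle in place of `SAMPLE_LINES`, since iterating a text file yields one line at a time. The threshold `500.0` is arbitrary; the card does not recommend a cutoff for `pyplexity_score`.
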
Provider:
proxectonos