tommasobonomo/ITAGutenberg

Name: tommasobonomo/ITAGutenberg
Creator: tommasobonomo
Published: 2025-12-10 17:56:30
License: apache-2.0

Source: Hugging Face. Updated 2025-12-10; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/tommasobonomo/ITAGutenberg

Description:

---
dataset_info:
  features:
  - name: gutenberg_id
    dtype: int64
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: tokenized_length
    dtype: int64
  - name: metadata
    struct:
    - name: authors
      sequence: string
    - name: bookshelves
      sequence: string
    - name: encoding
      dtype: string
    - name: languages
      sequence: string
    - name: subjects
      sequence: string
    - name: summaries
      sequence: 'null'
    - name: url
      dtype: string
  splits:
  - name: train
    num_bytes: 480844883
    num_examples: 1084
  download_size: 295576071
  dataset_size: 480844883
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
language:
- it
tags:
- project_gutenberg
- text
pretty_name: ITAGutenberg
---

# ITAGutenberg

A collection of all books written in Italian that appear on [Project Gutenberg](https://www.gutenberg.org/), meant for pretraining Large Language Models.
We collected the plaintext version of each book and lightly processed it to remove the licensing text that Project Gutenberg usually includes in its files.

## 🚀 Quickstart

Load the dataset as you would any other Hugging Face dataset:

```python
from datasets import load_dataset

dataset = load_dataset("tommasobonomo/ITAGutenberg")
```

## 📊 Data Schema

The dataset is organized with the following schema:

* **`gutenberg_id`** *(int)*: Project Gutenberg key identifying the book.
* **`title`** *(str)*: Title of the book.
* **`text`** *(str)*: Full text of the book from Project Gutenberg.
* **`tokenized_length`** *(int)*: Length of the book in tokens, computed with the `cl100k_base` encoding from [`openai/tiktoken`](https://github.com/openai/tiktoken).
* **`metadata`** *(dict)*: Additional contextual information about the book, including:
  * **`authors`** *(list[str])*: Name of the book's author(s).
  * **`bookshelves`** *(list[str])*: Names of the bookshelves that include this book, as per Project Gutenberg.
  * **`encoding`** *(str)*: Name of the encoding of the original book source.
  * **`languages`** *(list[str])*: Languages that appear in the book.
  * **`subjects`** *(list[str])*: Subjects listed for the book, as per Project Gutenberg.
  * **`summaries`** *(list[str])*: Summaries collected from Project Gutenberg for the book.
  * **`url`** *(str)*: URL from which the text of the book was downloaded.
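
As an illustration of the schema above, here is a minimal sketch that sums `tokenized_length` over a set of records and filters them by an entry in `metadata.authors`. The helper names (`total_tokens`, `books_by_author`) and the sample records are hypothetical and not part of the dataset or of any library; the sample values are made up so that the snippet runs without downloading anything.

```python
from typing import Any, Iterable

Record = dict[str, Any]


def total_tokens(records: Iterable[Record]) -> int:
    """Sum the precomputed `tokenized_length` field over all records."""
    return sum(r["tokenized_length"] for r in records)


def books_by_author(records: Iterable[Record], name_fragment: str) -> list[Record]:
    """Keep records where any entry of `metadata.authors` contains the fragment."""
    needle = name_fragment.casefold()
    return [
        r for r in records
        if any(needle in author.casefold() for author in r["metadata"]["authors"])
    ]


# Hypothetical records that mimic the schema; all values are invented.
sample_records: list[Record] = [
    {"gutenberg_id": 1, "title": "Libro A", "text": "...", "tokenized_length": 1200,
     "metadata": {"authors": ["Autore Uno"], "languages": ["it"]}},
    {"gutenberg_id": 2, "title": "Libro B", "text": "...", "tokenized_length": 800,
     "metadata": {"authors": ["Autore Due", "Autore Uno"], "languages": ["it"]}},
    {"gutenberg_id": 3, "title": "Libro C", "text": "...", "tokenized_length": 500,
     "metadata": {"authors": ["Autore Tre"], "languages": ["it", "la"]}},
]

print(total_tokens(sample_records))  # 2500
print([r["gutenberg_id"] for r in books_by_author(sample_records, "uno")])  # [1, 2]
```

With the real data, the `train` split returned by `load_dataset` could presumably be passed in place of `sample_records`, since iterating a Hugging Face `Dataset` yields one dict per row. Keep in mind that this would also read every book's full `text` into memory as it iterates.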

Provided by:

tommasobonomo
