
tommasobonomo/ITAGutenberg

Source: Hugging Face. Updated 2025-12-10; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/tommasobonomo/ITAGutenberg
Resource description:
---
dataset_info:
  features:
  - name: gutenberg_id
    dtype: int64
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: tokenized_length
    dtype: int64
  - name: metadata
    struct:
    - name: authors
      sequence: string
    - name: bookshelves
      sequence: string
    - name: encoding
      dtype: string
    - name: languages
      sequence: string
    - name: subjects
      sequence: string
    - name: summaries
      sequence: 'null'
    - name: url
      dtype: string
  splits:
  - name: train
    num_bytes: 480844883
    num_examples: 1084
  download_size: 295576071
  dataset_size: 480844883
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
language:
- it
tags:
- project_gutenberg
- text
pretty_name: ITAGutenberg
---

# ITAGutenberg

A collection of all books written in Italian that appear on [Project Gutenberg](https://www.gutenberg.org/), intended for pretraining Large Language Models. We collected the plaintext version of each book and lightly processed it to remove the licensing text that Project Gutenberg usually includes in its book files.

## 🚀 Quickstart

Load the dataset as you would any other Hugging Face dataset:

```python
from datasets import load_dataset

dataset = load_dataset("tommasobonomo/ITAGutenberg")
```

## 📊 Data Schema

The dataset is organized with the following schema:

* **`gutenberg_id`** *(int)*: Project Gutenberg key identifying the book.
* **`title`** *(str)*: Title of the book.
* **`text`** *(str)*: Full text of the book from Project Gutenberg.
* **`tokenized_length`** *(int)*: Length of the book in tokens, computed with the `cl100k_base` encoding from [`openai/tiktoken`](https://github.com/openai/tiktoken).
* **`metadata`** *(dict)*: Additional contextual information about the book, including:
  * **`authors`** *(list[str])*: Name of the book's author(s).
  * **`bookshelves`** *(list[str])*: Names of the Project Gutenberg bookshelves that include this book.
  * **`encoding`** *(str)*: Name of the encoding of the original book source.
  * **`languages`** *(list[str])*: Languages that appear in the book.
  * **`summaries`** *(list[str])*: Summaries collected from Project Gutenberg for the book.
  * **`url`** *(str)*: URL from which the text of the book was downloaded.
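
To make the schema above more concrete, here is a minimal sketch of two helpers that work on records shaped like the dataset rows. The helper names (`books_by_author`, `total_tokens`) and the two sample records are hypothetical and exist only for illustration; the field names are taken from the feature list in the dataset card, and the sample values are made up rather than drawn from the dataset.

```python
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]


def books_by_author(records: Iterable[Record], name_fragment: str) -> Iterator[Record]:
    """Yield records whose metadata.authors contains name_fragment (case-insensitive)."""
    needle = name_fragment.lower()
    for record in records:
        # Treat a missing or None author list as empty.
        authors = record["metadata"].get("authors") or []
        if any(needle in author.lower() for author in authors):
            yield record


def total_tokens(records: Iterable[Record]) -> int:
    """Sum the precomputed tokenized_length field over all records."""
    return sum(record["tokenized_length"] for record in records)


# Hypothetical records that only mimic the schema; none of these values
# come from the real dataset.
sample = [
    {
        "gutenberg_id": 1,
        "title": "Libro A",
        "text": "Testo di esempio.",
        "tokenized_length": 1200,
        "metadata": {
            "authors": ["Mario Rossi"],
            "bookshelves": [],
            "encoding": "utf-8",
            "languages": ["it"],
            "subjects": [],
            "summaries": [],
            "url": "https://example.org/1",
        },
    },
    {
        "gutenberg_id": 2,
        "title": "Libro B",
        "text": "Altro testo di esempio.",
        "tokenized_length": 800,
        "metadata": {
            "authors": ["Anna Bianchi"],
            "bookshelves": [],
            "encoding": "utf-8",
            "languages": ["it"],
            "subjects": [],
            "summaries": [],
            "url": "https://example.org/2",
        },
    },
]

rossi_titles = [record["title"] for record in books_by_author(sample, "rossi")]
token_budget = total_tokens(sample)
```

Iterating over a split loaded with `load_dataset("tommasobonomo/ITAGutenberg", split="train")` should yield dictionaries of the same shape, so the helpers can presumably be applied to it directly. Note that `total_tokens` relies on the stored `tokenized_length` values, so it counts `cl100k_base` tokens, not tokens under whichever tokenizer a given model uses.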
Provider:
tommasobonomo