LeroyDyer/Mahabharata

Name: LeroyDyer/Mahabharata
Creator: LeroyDyer
Published: 2024-05-21 15:07:30
License: No description available

Hugging Face · Updated 2024-05-21 · Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/LeroyDyer/Mahabharata


Resource description:

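```python
from datasets import load_dataset

# Load the first 5,000 rows of the train split.
dataset = load_dataset("LeroyDyer/Mahabharata", split="train[:5000]")

# `tokenizer` is assumed to be defined already; this snippet does not create it.
EOS_TOKEN = tokenizer.eos_token


def formatting_func(example):
    max_seq_length = 4098  # Maximum chunk length, in characters
    text = example["Text"] + EOS_TOKEN
    # Slice the text into consecutive chunks of at most max_seq_length characters.
    chunks = [text[i:i + max_seq_length] for i in range(0, len(text), max_seq_length)]
    formatted_examples = [{"Text": chunk} for chunk in chunks]
    return formatted_examples
```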

This dataset contains text from the Mahabharata. The snippet above loads the first 5,000 rows of the train split, appends an EOS token to each text, and splits the result into chunks of at most 4098 characters, so the EOS token lands at the end of the final chunk of each text rather than at the end of every chunk.

Provided by:

LeroyDyer

Summary of original information

Dataset overview

Dataset name

Name: Mahabharata
Source: LeroyDyer

Dataset loading

Library: datasets
Load command: load_dataset("LeroyDyer/Mahabharata", split = "train[:5000]")

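Data processing

Chunking: the text is split into chunks with a maximum sequence length of 4098.
Chunking method: a list comprehension takes 4098 characters at a time; any shorter remainder becomes its own chunk.
Chunk format: each chunk is a single key-value pair whose key is "Text" and whose value is the corresponding text chunk.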
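
The chunking described above can be sketched on a toy string. This is a minimal illustration, not the dataset author's code: the helper name chunk_text and the small chunk size of 4 are assumptions chosen so the result is easy to read, while the slicing logic mirrors the list comprehension in the dataset card.

```python
def chunk_text(text: str, max_seq_length: int = 4098) -> list[dict[str, str]]:
    """Split text into consecutive character chunks of at most max_seq_length.

    chunk_text is a hypothetical helper name used only for this illustration.
    """
    return [
        {"Text": text[i:i + max_seq_length]}
        for i in range(0, len(text), max_seq_length)
    ]


# Toy input: 10 characters with a chunk size of 4 gives chunks of 4, 4, and 2 characters.
chunks = chunk_text("abcdefghij", max_seq_length=4)
print(chunks)
# [{'Text': 'abcd'}, {'Text': 'efgh'}, {'Text': 'ij'}]
```

Note that the limit counts characters, not tokens, so a 4098-character chunk will usually tokenize to fewer than 4098 tokens.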

Special tokens

EOS_TOKEN: end-of-text marker, used when formatting the text.
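
Where the EOS marker ends up depends on whether it is appended before or after chunking. The sketch below contrasts the two orders on a toy string. The "</s>" value is a placeholder assumption (in practice it would come from tokenizer.eos_token), and both function names are hypothetical; only the first function follows the order used in the dataset card.

```python
EOS_TOKEN = "</s>"  # Assumption: placeholder value; normally tokenizer.eos_token


def chunk_eos_once(text: str, max_len: int) -> list[str]:
    """Append EOS once, then chunk (the order used in the dataset card)."""
    text = text + EOS_TOKEN
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def chunk_eos_each(text: str, max_len: int) -> list[str]:
    """Chunk first, then append EOS to every chunk (a variant, not the author's code)."""
    return [text[i:i + max_len] + EOS_TOKEN for i in range(0, len(text), max_len)]


once = chunk_eos_once("abcdefgh", 4)
each = chunk_eos_each("abcdefgh", 4)
print(once)  # ['abcd', 'efgh', '</s>']
print(each)  # ['abcd</s>', 'efgh</s>']
```

With the first order, the marker appears only in the last chunk and can be split across a chunk boundary when the text length does not line up with the limit. With the second, each chunk exceeds the limit by the length of the marker.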
