doublecringe123/dialoguesum-npc-dialoguesum-stemmed-augmented
Source: Hugging Face. Updated: 2024-04-04. Indexed: 2024-06-11.
Download link:
https://hf-mirror.com/datasets/doublecringe123/dialoguesum-npc-dialoguesum-stemmed-augmented
Resource description:
---
dataset_info:
  features:
  - name: inp
    dtype: string
  - name: target
    dtype: string
  splits:
  - name: train
    num_bytes: 47766276
    num_examples: 59070
  - name: validation
    num_bytes: 2437693
    num_examples: 3000
  - name: test
    num_bytes: 5709383
    num_examples: 7000
  download_size: 26987968
  dataset_size: 55913352
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---
# Dialoguesum-NPC-dialog-Stemmed-Augmented Dataset
## Overview
This dataset, named "Dialoguesum-NPC-dialog-Stemmed-Augmented," is a custom text summarization dataset created by combining the "npc-engine/light-batch-summarize-dialogue" and "knkarthick/dialogsum" datasets. Its goal is to provide a resource for training and evaluating text summarization models tailored to dialogues, including NPC dialogues.
## Dataset Composition
- **Original Sources**:
  - [npc-engine/light-batch-summarize-dialogue](https://huggingface.co/datasets/npc-engine/light-batch-summarize-dialogue)
  - [knkarthick/dialogsum](https://huggingface.co/datasets/knkarthick/dialogsum)
- **Combination Method**: The datasets were concatenated to form a unified corpus.
- **Preprocessing Steps**:
  - Stop Word Removal: Common stop words were removed to focus on meaningful content.
  - Stemming: Words were stemmed to their base forms to reduce variation.
  - Synonym Replacement: Words were replaced with synonyms to increase variety in the dataset.
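The preprocessing steps listed above can be illustrated with a minimal, standard-library-only sketch. It is not the pipeline used to build this dataset: the stop-word set, the synonym map, and the suffix-stripping `toy_stem` function are all hypothetical stand-ins (a real pipeline would more likely use a proper stop-word list, an established stemmer, and a thesaurus), and the order in which the steps are applied is an assumption.

```python
import re
from typing import List

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "in", "on", "at"}
SYNONYMS = {"big": "large", "happy": "glad", "buy": "purchase"}
SUFFIXES = ("ing", "ed", "es", "s")  # toy suffix list, not a real stemming algorithm


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str, replace_synonyms: bool = False) -> str:
    """Lowercase, tokenize, drop stop words, optionally swap synonyms, then stem."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    if replace_synonyms:
        # Assumption: synonyms are looked up before stemming, on the full word.
        kept = [SYNONYMS.get(t, t) for t in kept]
    return " ".join(toy_stem(t) for t in kept)


def augment(text: str) -> List[str]:
    """Return the preprocessed text plus a synonym-replaced variant if it differs."""
    base = preprocess(text)
    variant = preprocess(text, replace_synonyms=True)
    return [base] if variant == base else [base, variant]


if __name__ == "__main__":
    for line in augment("The guard is walking to the big gate"):
        print(line)
```

In this sketch, augmentation means emitting an extra synonym-replaced copy of a text alongside the plain preprocessed one. Whether the dataset keeps both variants or only one is not stated in this card.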
## Data Format
- **Input Format**: Each input instance (the `inp` field, a string) is a dialogue taken from one of the two source datasets.
- **Output Format**: The corresponding summary for each input instance (the `target` field, a string).
## Example Usage
```python
from datasets import load_dataset
# Load the custom dataset
dataset = load_dataset("doublecringe123/dialoguesum-npc-dialoguesum-stemmed-augmented")
# Access the training split
train_data = dataset["train"]
# Sample input-output pair
sample = train_data[0]
input_text = sample["inp"]
output_summary = sample["target"]
```
I also recommend using `datasets` version 2.18.0 or later:
```
pip install -q "datasets>=2.18.0"
```
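As a quick consistency check, the split figures declared in this card's metadata can be summed and compared against the declared `dataset_size`. The snippet below is a small sketch that uses only the numbers copied from the metadata. It does not download or inspect the dataset itself, and the variable names are illustrative.

```python
# Figures copied from the dataset_info metadata of this card.
SPLITS = {
    "train": {"num_bytes": 47766276, "num_examples": 59070},
    "validation": {"num_bytes": 2437693, "num_examples": 3000},
    "test": {"num_bytes": 5709383, "num_examples": 7000},
}
DECLARED_DATASET_SIZE = 55913352

total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())

# Share of examples per split, as a fraction of all examples.
shares = {name: s["num_examples"] / total_examples for name, s in SPLITS.items()}

# The per-split byte counts should add up to the declared dataset size.
assert total_bytes == DECLARED_DATASET_SIZE

for name, share in shares.items():
    print(f"{name}: {SPLITS[name]['num_examples']} examples ({share:.1%})")
```

The three splits together contain 69070 examples, with roughly 85.5% in train, 4.3% in validation, and 10.1% in test.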
Provider:
doublecringe123
Original information summary

Dialoguesum-NPC-dialog-Stemmed-Augmented Dataset

Dataset Composition
- Features:
  - inp: string
  - target: string
- Splits:
  - train: 59070 examples, 47766276 bytes
  - validation: 3000 examples, 2437693 bytes
  - test: 7000 examples, 5709383 bytes
- Download Size: 26987968 bytes
- Dataset Size: 55913352 bytes

Data Files
- Config: default
  - train: data/train-*
  - validation: data/validation-*
  - test: data/test-*



