Labagaite/fr-summarizer-dataset
Source: Hugging Face. Updated: 2024-04-13. Indexed: 2024-06-11.
Download link:
https://hf-mirror.com/datasets/Labagaite/fr-summarizer-dataset
Resource overview:
---
dataset_info:
  features:
  - name: fr-summarizer-dataset
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 13739369
    num_examples: 1968
  - name: validation
    num_bytes: 2957786
    num_examples: 440
  download_size: 7646820
  dataset_size: 16697155
configs:
- config_name: string
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
license: mit
task_categories:
- summarization
- text-generation
- text2text-generation
language:
- fr
tags:
- code
- summarizer
- dataset
- llm
- fr
pretty_name: fr-summarizer-dataset
size_categories:
- 1K<n<10K
---
# training data
- Dataset : [fr-summarizer-dataset](https://huggingface.co/datasets/Labagaite/fr-summarizer-dataset)
- Data-size : 7.65 MB
- train : 1.97k rows
- validation : 440 rows
- roles : user , assistant
- Format : chatml, with the mapping "role": "role", "content": "content", "user": "user", "assistant": "assistant"
<br>
*French audio podcast transcription*
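As an illustration of the layout described above, the sketch below renders one user/assistant pair into ChatML text by hand. The `<|im_start|>` / `<|im_end|>` markers follow the usual ChatML convention; the helper name `render_chatml` and the sample strings are hypothetical and not taken from the dataset. The tokenizer's own chat template remains the authority on the exact output.

```python
# Hand-rolled ChatML rendering, for illustration only (assumed helper, not part of the project).
def render_chatml(messages):
    """Render a list of {"role", "content"} dicts as ChatML text."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    return "".join(parts)


# Hypothetical sample pair; real rows hold a podcast transcription and its summary.
sample = [
    {"role": "user", "content": "Résume cette transcription : ..."},
    {"role": "assistant", "content": "Résumé : ..."},
]
rendered = render_chatml(sample)
print(rendered)
```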
# Project details
[<img src="https://avatars.githubusercontent.com/u/116890814?v=4" width="100"/>](https://github.com/WillIsback/Report_Maker)
The model is fine-tuned on French audio podcast transcriptions for the summarization task, so it is able to summarize this kind of data.
The model will be used in an AI application, [Report Maker](https://github.com/WillIsback/Report_Maker), which is a powerful tool designed to automate the process of transcribing and summarizing meetings.
It leverages state-of-the-art machine learning models to provide detailed and accurate reports.
# Building the dataset:
The dataset was built from OpenAI GPT-3.5-Turbo's generated responses to a summarization task. That model is already competent at this task, handles French, and has a large context window.
The max_new_token_length was set to 1024 to fit the training of smaller models.
Very small models such as TinyLlama need to truncate their input, which affects the context and the quality of the training result.
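The cost of truncation can be sketched as follows. This is only an approximation under stated assumptions: whitespace-separated words stand in for tokens (a real tokenizer counts differently), the transcript is synthetic, and `truncate_to_budget` is a hypothetical helper rather than code from the project.

```python
# Assumption: words approximate tokens; a real tokenizer would give different counts.
def truncate_to_budget(text, max_tokens):
    """Keep at most max_tokens words and report how many were dropped."""
    words = text.split()
    kept = words[:max_tokens]
    return " ".join(kept), len(words) - len(kept)


# Synthetic 1500-word "transcript" used only to show the effect of the budget.
transcript = " ".join(f"mot{i}" for i in range(1500))
truncated, dropped = truncate_to_budget(transcript, 1024)
print(len(truncated.split()), "words kept,", dropped, "words dropped")
```

Everything after the budget is invisible to the model, so a summary trained on the truncated text cannot reflect the end of the transcript.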
Check the [prompt](https://github.com/WillIsback/Report_Maker/blob/main/Utils/prompts.py) structure, designed to perform three summarization tasks:
- Summarize (simple)
- Map reduce summarize
- Refine summarize
Also check the [code](https://github.com/WillIsback/Report_Maker/blob/main/Utils/summarize_dataset_builder.py) used to generate the responses for this dataset.
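The three summarization strategies listed above can be sketched in a few lines. This is a generic illustration, not the project's implementation: the prompt strings, the word-based chunking, and the `fake_llm` stand-in (which only counts calls) are all assumptions made so the example runs without a model.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_words: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def simple_summarize(text: str, llm: Callable[[str], str]) -> str:
    # One call over the whole text.
    return llm(f"Summarize:\n{text}")


def map_reduce_summarize(text: str, llm: Callable[[str], str], chunk_words: int) -> str:
    # Map: summarize each chunk independently. Reduce: merge the partial summaries.
    partials = [llm(f"Summarize:\n{chunk}") for chunk in split_into_chunks(text, chunk_words)]
    return llm("Combine these summaries:\n" + "\n".join(partials))


def refine_summarize(text: str, llm: Callable[[str], str], chunk_words: int) -> str:
    # Carry a running summary forward and update it with each new chunk.
    summary = ""
    for chunk in split_into_chunks(text, chunk_words):
        summary = llm(f"Current summary:\n{summary}\nRefine it with:\n{chunk}")
    return summary


# Hypothetical stand-in for a model: records each prompt and returns a numbered label.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary#{len(calls)}"


text = "un deux trois quatre cinq six sept huit neuf dix"
print(simple_summarize(text, fake_llm))
print(map_reduce_summarize(text, fake_llm, 4))
print(refine_summarize(text, fake_llm, 4))
```

With three chunks, the simple strategy makes one call, map reduce makes one call per chunk plus a final merge, and refine makes one call per chunk in sequence.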
# Formatting data for [unsloth](https://github.com/unslothai/unsloth)/[Summarize](https://github.com/WillIsback/LLM_Summarizer_Trainer) training:
```Python
from datasets import load_dataset, Dataset
import pandas as pd
from unsloth.chat_templates import get_chat_template


class ChatTemplate():
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def formating_messages(self, example):
        # Turn one grouped row into a two-message conversation
        user_chat = {"role": example["user"]["role"], "content": example["user"]["content"]}
        assistant_chat = {"role": example["assistant"]["role"], "content": example["assistant"]["content"]}
        return {"messages": [user_chat, assistant_chat]}

    def formatting_prompts_func(self, examples):
        # Render each conversation to text with the tokenizer's chat template
        convos = examples["messages"]
        texts = [self.tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
        return { "text" : texts, }

    def load_data(self):
        self.tokenizer = get_chat_template(
            self.tokenizer,
            chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
            mapping = {"role": "role", "content": "content", "user": "user", "assistant": "assistant"}, # ShareGPT style
            map_eos_token = True, # Maps <|im_end|> to </s> instead
        )
        dataset_train = load_dataset("Labagaite/fr-summarizer-dataset", split = "train")
        dataset_val = load_dataset("Labagaite/fr-summarizer-dataset", split = "validation")
        # Group the data: rows alternate, so row i is the user turn and row i+1 the assistant turn
        grouped_data_train = [{"user": dataset_train[i], "assistant": dataset_train[i+1]} for i in range(0, len(dataset_train), 2)]
        grouped_data_val = [{"user": dataset_val[i], "assistant": dataset_val[i+1]} for i in range(0, len(dataset_val), 2)]
        # Convert the list of dictionaries to a DataFrame
        df_train = pd.DataFrame(grouped_data_train)
        df_val = pd.DataFrame(grouped_data_val)
        # Create a new Dataset object
        dataset_train = Dataset.from_pandas(df_train)
        dataset_val = Dataset.from_pandas(df_val)
        # Build the "messages" column, then the rendered "text" column
        dataset_train = dataset_train.map(self.formating_messages, batched = False)
        dataset_train = dataset_train.map(self.formatting_prompts_func, batched = True)
        dataset_val = dataset_val.map(self.formating_messages, batched = False)
        dataset_val = dataset_val.map(self.formatting_prompts_func, batched = True)
        return dataset_train, dataset_val
```
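The grouping step above relies on rows strictly alternating between a user turn and an assistant turn. A minimal sketch of the same pairing with explicit checks is shown below; it uses plain Python lists instead of the `datasets` objects, and both the helper name `pair_rows` and the sample rows are hypothetical.

```python
def pair_rows(rows):
    """Group alternating user/assistant rows into two-message conversations."""
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of rows (user/assistant pairs)")
    conversations = []
    for i in range(0, len(rows), 2):
        user, assistant = rows[i], rows[i + 1]
        # Fail early if the alternation assumption does not hold.
        if user["role"] != "user" or assistant["role"] != "assistant":
            raise ValueError(f"rows {i} and {i + 1} are not a user/assistant pair")
        conversations.append({"messages": [user, assistant]})
    return conversations


# Hypothetical rows in the same {"role", "content"} shape as the dataset.
rows = [
    {"role": "user", "content": "Transcription A"},
    {"role": "assistant", "content": "Résumé A"},
    {"role": "user", "content": "Transcription B"},
    {"role": "assistant", "content": "Résumé B"},
]
conversations = pair_rows(rows)
print(len(conversations))
```

Checking the roles costs little and avoids silently training on misaligned pairs if a row is ever missing or reordered.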
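Provided by:
Labagaite
Original information summary
Dataset overview
Dataset name
- Name: fr-summarizer-dataset
Dataset features
- Feature name: content
- Data type: string
Dataset splits
- Train:
  - Number of examples: 1968
  - Data size: 13739369 bytes
- Validation:
  - Number of examples: 440
  - Data size: 2957786 bytes
Dataset size
- Download size: 7646820 bytes
- Total dataset size: 16697155 bytes
Configuration
- Config name: string
- Data file paths:
  - Train: data/train-*
  - Validation: data/validation-*
License
- License type: mit
Task categories
- summarization
- text-generation
- text2text-generation
Language
- Language: fr
Tags
- code
- summarizer
- dataset
- llm
- fr
Dataset alias
- Alias: fr-summarizer-dataset
Dataset size category
- Size range: 1K<n<10K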



