
Labagaite/fr-summarizer-dataset

Source: Hugging Face | Updated: 2024-04-13 | Indexed: 2024-06-11
Download link:
https://hf-mirror.com/datasets/Labagaite/fr-summarizer-dataset
Resource description:
---
dataset_info:
  features:
  - name: fr-summarizer-dataset
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 13739369
    num_examples: 1968
  - name: validation
    num_bytes: 2957786
    num_examples: 440
  download_size: 7646820
  dataset_size: 16697155
configs:
- config_name: string
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
license: mit
task_categories:
- summarization
- text-generation
- text2text-generation
language:
- fr
tags:
- code
- summarizer
- dataset
- llm
- fr
pretty_name: fr-summarizer-dataset
size_categories:
- 1K<n<10K
---

# Training data
- Dataset : [fr-summarizer-dataset](https://huggingface.co/datasets/Labagaite/fr-summarizer-dataset)
- Data-size : 7.65 MB
- train : 1.97k rows
- validation : 440 rows
- roles : user , assistant
- Format chatml "role": "role", "content": "content", "user": "user", "assistant": "assistant"
<br>

*French audio podcast transcription*

# Project details
[<img src="https://avatars.githubusercontent.com/u/116890814?v=4" width="100"/>](https://github.com/WillIsback/Report_Maker)

Fine-tuned on French audio podcast transcription data for the summarization task. As a result, the model is able to summarize French audio podcast transcriptions.
The model will be used for an AI application: [Report Maker](https://github.com/WillIsback/Report_Maker), which is a tool designed to automate the transcription and summarization of meetings. It leverages state-of-the-art machine learning models to provide detailed and accurate reports.

# Building the dataset:
The dataset was built from OpenAI GPT-3.5-Turbo generative responses to a summarization task. GPT-3.5-Turbo was chosen because it is already competent at that task and in French, and has a large context window.
The max_new_token_length was set to 1024 to fit the training of smaller models. Very small models such as TinyLlama need to truncate, which will affect the context and the quality of the training result.
Check the [prompt](https://github.com/WillIsback/Report_Maker/blob/main/Utils/prompts.py) structure, made to perform 3 summarization tasks:
- Summarize (simple)
- Map reduce summarize
- Refine summarize

Check also the [code](https://github.com/WillIsback/Report_Maker/blob/main/Utils/summarize_dataset_builder.py) used to generate the responses for this dataset.

# Formatting data for [unsloth](https://github.com/unslothai/unsloth)/[Summarize](https://github.com/WillIsback/LLM_Summarizer_Trainer) training:
```Python
from datasets import load_dataset, Dataset
import pandas as pd
from unsloth.chat_templates import get_chat_template

class ChatTemplate():
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def formating_messages(self, example):
        # Build one two-turn conversation from a grouped (user, assistant) pair of rows.
        # Each row is expected to expose a "role" and a "content" field.
        user_chat = {"role": example["user"]["role"], "content": example["user"]["content"]}
        assistant_chat = {"role": example["assistant"]["role"], "content": example["assistant"]["content"]}
        return {"messages": [user_chat, assistant_chat]}

    def formatting_prompts_func(self, examples):
        # Render each conversation into a single training string using the chat template
        convos = examples["messages"]
        texts = [self.tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
        return { "text" : texts, }

    def load_data(self):
        self.tokenizer = get_chat_template(
            self.tokenizer,
            chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
            mapping = {"role": "role", "content": "content", "user": "user", "assistant": "assistant"}, # ShareGPT style
            map_eos_token = True, # Maps <|im_end|> to </s> instead
        )
        dataset_train = load_dataset("Labagaite/fr-summarizer-dataset", split = "train")
        dataset_val = load_dataset("Labagaite/fr-summarizer-dataset", split = "validation")
        # Group the data: rows alternate user / assistant, so pair them two by two.
        # Stopping at len - 1 avoids an IndexError if a split ever has an odd row count.
        grouped_data_train = [{"user": dataset_train[i], "assistant": dataset_train[i+1]} for i in range(0, len(dataset_train) - 1, 2)]
        grouped_data_val = [{"user": dataset_val[i], "assistant": dataset_val[i+1]} for i in range(0, len(dataset_val) - 1, 2)]
        # Convert the list of dictionaries to a DataFrame
        df_train = pd.DataFrame(grouped_data_train)
        df_val = pd.DataFrame(grouped_data_val)
        # Create a new Dataset object
        dataset_train = Dataset.from_pandas(df_train)
        dataset_val = Dataset.from_pandas(df_val)

        # First build the "messages" column, then render it to a "text" column
        dataset_train = dataset_train.map(self.formating_messages, batched = False)
        dataset_train = dataset_train.map(self.formatting_prompts_func, batched = True)
        dataset_val = dataset_val.map(self.formating_messages, batched = False)
        dataset_val = dataset_val.map(self.formatting_prompts_func, batched = True)

        return dataset_train, dataset_val
```
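To make the pairing and ChatML steps of the loader easier to follow without downloading anything, here is a minimal, dependency-free sketch. The sample rows, `pair_rows`, and `to_chatml` are hypothetical illustrations, not part of the dataset or of unsloth, and the hand-written ChatML rendering only approximates what the tokenizer's chat template produces (for instance, it ignores the end-of-sequence token mapping).

```python
def pair_rows(rows):
    """Group consecutive rows two by two into [user, assistant] conversations."""
    return [[rows[i], rows[i + 1]] for i in range(0, len(rows) - 1, 2)]


def to_chatml(messages):
    """Approximate ChatML rendering: one <|im_start|>...<|im_end|> block per turn."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


# Hypothetical rows: the real dataset alternates user prompts and assistant summaries.
rows = [
    {"role": "user", "content": "Résume cette transcription de podcast."},
    {"role": "assistant", "content": "Le podcast présente trois idées principales."},
    {"role": "user", "content": "Résume cette réunion."},
    {"role": "assistant", "content": "La réunion a fixé deux priorités."},
]

conversations = pair_rows(rows)
texts = [to_chatml(convo) for convo in conversations]
print(texts[0])
```

Each resulting string plays the role of one entry in the `text` column that the loader above returns.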
Provider:
Labagaite
Summary of original information

Dataset overview

Dataset name

  • Name: fr-summarizer-dataset

Dataset features

  • Feature name: content
  • Data type: string

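Dataset splits

  • Training set:
    • Number of examples: 1968
    • Data size: 13739369 bytes
  • Validation set:
    • Number of examples: 440
    • Data size: 2957786 bytes

Dataset size

  • Download size: 7646820 bytes
  • Total dataset size: 16697155 bytes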
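The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch; the variable names are illustrative, and the pair counts assume that rows alternate between user and assistant turns, as the loader code in the resource description does.

```python
# Figures copied from the dataset metadata
train_bytes, train_rows = 13_739_369, 1_968
val_bytes, val_rows = 2_957_786, 440
download_bytes = 7_646_820

# The total dataset size is the sum of the two splits
total_bytes = train_bytes + val_bytes

# Download size in decimal megabytes, matching the "7.65 MB" on the card
download_mb = download_bytes / 1_000_000

# Share of rows that fall in the training split
train_share = train_rows / (train_rows + val_rows)

# Assumption: two rows (user + assistant) form one conversation
train_pairs, val_pairs = train_rows // 2, val_rows // 2

print(total_bytes, round(download_mb, 2), round(train_share, 3), train_pairs, val_pairs)
```

The summed split sizes match the reported total of 16697155 bytes, and the train split holds roughly 82% of the rows.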

Configuration

  • Configuration name: string
  • Data file paths:
    • Training set: data/train-*
    • Validation set: data/validation-*

License

  • License type: mit
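As a rough illustration of how the two path patterns route files to splits, the sketch below matches them against file names with Python's `fnmatch`. The shard file names and the `assign_splits` helper are hypothetical, and `fnmatch` is only an approximation of the pattern resolution that the Hugging Face `datasets` library performs.

```python
from fnmatch import fnmatch

# Patterns copied from the dataset configuration
split_patterns = {"train": "data/train-*", "validation": "data/validation-*"}

# Hypothetical repository files, for illustration only
files = [
    "data/train-00000-of-00001.parquet",
    "data/validation-00000-of-00001.parquet",
    "README.md",
]


def assign_splits(files, patterns):
    """Return, for each split, the files whose path matches that split's pattern."""
    return {
        split: [f for f in files if fnmatch(f, pattern)]
        for split, pattern in patterns.items()
    }


split_files = assign_splits(files, split_patterns)
print(split_files)
```

Files that match neither pattern, such as the README, are simply left out of both splits.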

Task categories

  • summarization
  • text-generation
  • text2text-generation

Language

  • Language: fr

Tags

  • code
  • summarizer
  • dataset
  • llm
  • fr

Dataset alias

  • Alias: fr-summarizer-dataset

Dataset size category

  • Size range: 1K<n<10K