machinelearnear/multiturn_chat_milei_gpt
收藏Hugging Face2024-05-21 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/machinelearnear/multiturn_chat_milei_gpt
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
task_categories:
- question-answering
- text2text-generation
- text-generation
language:
- es
pretty_name: milei-gpt
size_categories:
- n<1K
---
# Milei-GPT Dataset
Che y si queremos hacer un LLM que hable de la misma forma que un famoso ... como hacemos? Este repo es una excusa para aprender a preparar un dataset para fine-tunear algún LLM, aprender como evaluarlo, como tokenizarlo, como extenderlo de formar sintética, y tantas otras cosas. Al final, si todo sale bien, vamos a tener un modelo que va a hablar como la persona que elegimos, y le podemos poner un RAG (retrieval augmented generation) encima para que nos traiga un contexto correcto y factual en las respuestas. Por ahora, la idea es hacerlo sobre Llama3-8B y usando APIs públicas para procesar la data, sobre mas de 300 horas de entrevistas.
## Paso a paso, que vamos a hacer
- Encontrar todas las entrevistas en YT de algún famoso/a
- Transcribir las entrevistas
- Preparar un dataset (convertir a `ChatML`, tokenization, data sintética, etc.)
- Elegir un modelo base eg. `Llama3-8B` o `Phi-3-mini-128k-instruct`
- Fine-tuning del LLM, evaluación del modelo, y push to HF.
- Armar un RAG indexando las entrevistas y meterle este LLM
Mas info en: https://github.com/machinelearnear/milei-gpt
## Dataset details
**Dataset type:**
The Milei-GPT Dataset is a collection of transcriptions and speaker diarizations from various interviews of Javier Milei. It is intended to facilitate the fine-tuning of large language models (LLMs) to emulate the speaking style and content of Javier Milei.
* **huggingface_dataset.parquet** contains the processed dataset with detailed metadata and speaker-aware transcripts.
**Dataset date:**
The dataset was collected and processed in May 2024.
**Project or resources for more information:** https://github.com/machinelearnear/milei-gpt
**License:**
Apache-2.0
## Intended use
**Primary intended uses:**
The primary use of the Milei-GPT Dataset is to train, fine-tune, and evaluate language models that emulate the speaking style of Javier Milei. This dataset is particularly useful for developing conversational AI systems and chatbots that require a specific speaking style and tone.
**Primary intended users:**
The primary intended users of this dataset are researchers and developers in the fields of natural language processing, machine learning, and artificial intelligence, particularly those focusing on conversational models and dialogue systems.
## Dataset Construction
**Features:**
- **video_id**: Unique identifier for each video
- **channel_id**: The ID of the YouTube channel
- **channel**: The name of the YouTube channel
- **uploader_url**: URL of the uploader's YouTube profile
- **url**: URL of the YouTube video
- **title**: Title of the YouTube video
- **duration**: Duration of the video in seconds
- **view_count**: Number of views of the video
- **candidate_name**: Name of the candidate (Javier Milei)
- **messages**: List of transcribed and diarized messages from the video
**Collection Process:**
1. **Data Collection**: Interviews of Javier Milei were collected from YouTube.
2. **Transcription and Diarization**: Audio was transcribed using WhisperX and diarized using NVIDIA NeMo to distinguish between different speakers.
3. **Dataset Preparation**: The dataset was structured and formatted to include detailed metadata and speaker-aware transcripts, ready for use in training and fine-tuning LLMs.
**Intended Workflow:**
1. **Finding Interviews**: Use scripts to gather interview data from YouTube.
2. **Transcribing and Diarizing**: Transcribe and diarize the audio to identify and separate different voices.
3. **Preparing the Dataset**: Structure the data for model training, including tokenization and synthetic data generation.
4. **Model Selection and Training**: Choose a base model (e.g., Llama3-8B) and fine-tune it with the prepared dataset.
5. **Evaluation and Deployment**: Evaluate the model's performance and deploy it with a RAG system for enhanced response accuracy.
## Citation
If you use this dataset in your research, please cite it as follows:
```
@dataset{milei_gpt_2024,
author = {machinelearnear},
title = {Milei-GPT Dataset},
year = {2024},
url = {https://huggingface.co/datasets/machinelearnear/milei-gpt},
}
```
提供机构:
machinelearnear
原始信息汇总
Milei-GPT Dataset 概述
数据集基本信息
- 许可证: Apache-2.0
- 任务类别:
- 问答
- 文本到文本生成
- 文本生成
- 语言: 西班牙语 (es)
- 数据集名称: Milei-GPT
- 数据集大小: 小于1千
数据集详情
- 数据集类型: Milei-GPT Dataset 包含 Javier Milei 的各种访谈的转录和发言人分割。旨在帮助微调大型语言模型 (LLMs),以模仿 Javier Milei 的讲话风格和内容。
- 数据集文件:
huggingface_dataset.parquet包含处理后的数据集,具有详细的元数据和发言人感知的转录。 - 数据集日期: 2024年5月收集和处理。
- 项目资源: https://github.com/machinelearnear/milei-gpt
数据集用途
- 主要用途: 用于训练、微调和评估模仿 Javier Milei 讲话风格的语言模型。特别适用于开发需要特定讲话风格和语调的对话式AI系统和聊天机器人。
- 主要用户: 自然语言处理、机器学习和人工智能领域的研究人员和开发者,特别是专注于对话模型和对话系统的人士。
数据集构建
-
特征:
video_id: 视频的唯一标识符channel_id: YouTube频道的IDchannel: YouTube频道的名称uploader_url: 上传者的YouTube个人资料URLurl: YouTube视频的URLtitle: YouTube视频的标题duration: 视频的时长(秒)view_count: 视频的观看次数candidate_name: 候选人姓名(Javier Milei)messages: 来自视频的转录和分割消息列表
-
收集过程:
- 数据收集: 从YouTube收集Javier Milei的访谈。
- 转录和分割: 使用WhisperX进行音频转录,使用NVIDIA NeMo进行发言人分割。
- 数据集准备: 结构化和格式化数据集,包括详细的元数据和发言人感知的转录,以供训练和微调LLMs使用。
-
预期工作流程:
- 查找访谈: 使用脚本从YouTube收集访谈数据。
- 转录和分割: 转录和分割音频以识别和分离不同的声音。
- 准备数据集: 为模型训练结构化数据,包括令牌化和合成数据生成。
- 模型选择和训练: 选择基础模型(例如,Llama3-8B)并使用准备好的数据集进行微调。
- 评估和部署: 评估模型的性能并在RAG系统中部署以提高响应的准确性。
引用信息
如果您在研究中使用此数据集,请按以下方式引用:
@dataset{milei_gpt_2024, author = {machinelearnear}, title = {Milei-GPT Dataset}, year = {2024}, url = {https://huggingface.co/datasets/machinelearnear/milei-gpt}, }



