bunkalab/topic_based_chatml_dpo_pairs

Name: bunkalab/topic_based_chatml_dpo_pairs
Creator: bunkalab
Published: 2024-01-14 15:26:46
License: No description provided

Source: Hugging Face. Updated 2024-01-14. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/bunkalab/topic_based_chatml_dpo_pairs


Resource description:

---
license: apache-2.0
language:
- en
---

# DPO Pairs

This is a preprocessed version of [mlabonne/chatml_dpo_pairs](https://huggingface.co/datasets/mlabonne/chatml_dpo_pairs) that uses [Bunkatopics](https://github.com/charlesdedampierre/BunkaTopics) to extract meaningful topics that help models converge with less data.

The objective was to create a smaller dataset than the original while keeping its efficiency. To achieve this, we compared the two datasets used to train the reward model in [mlabonne/chatml_dpo_pairs](https://huggingface.co/datasets/mlabonne/chatml_dpo_pairs): the rejected Llama answers and the accepted ChatGPT answers from the DPO dataset. We then conducted topic modeling on both datasets, keeping only the topics that existed in the accepted dataset but not in the rejected one. Our hypothesis is that these topics encapsulate the main differences between the two answering styles.

This method allows for quicker convergence with significantly less data (around 1/6 of the initial dataset). See the page of the model test [here](https://huggingface.co/charlesdedampierre/TopicNeuralHermes-2.5-Mistral-7B).

# Topic Analysis

We applied the topic modeling method to both datasets, extracting 30 topics from each. These topics were characterized using the 10 most specific unigrams or bigrams. We then compared the two sets of topics (30 from each dataset) and retained those in the accepted dataset that shared fewer than 2 terms with any topic in the rejected dataset.

We found the following 13 distinctive topics, each described by 10 terms:

**Emotional Dynamics**: feelings, Quinn, Austin, minority women, teaching, schools, individual, personality, backgrounds, triggers.

**Global Knowledge Queries**: question, information, geography, news articles, Step, answer, capital city, pipeline system, country, analogy.

**Digital Interactions and Queries**: questions, question, PersonX, modem, answers, effect relationship, Quora, browser, answer, e-commerce.

**Business and Cybersecurity**: email, businesses, initiatives, innovation, advertising papers, spam, breaches, antivirus, payments, prospects.

**Lifestyle and Wellness**: sleep, exercise, gifts, shopping, Casey, stores, stress, headaches, options, mood.

**Wildlife Ecology**: birds, prey, animals, species, infection, nest, eggs, bacteria, insects, kitty condo.

**Environmental Science and Climate**: temperature, gases, greenhouse, emissions, perturbation, sulfur, dioxide, climate change, water, heat.

**Maritime and Mechanical Engineering**: ship, bowling, propulsion, beam width, Filing cabinet, LED, lane, containment area, lawnmower, rotors.

**Cultural and Social Dynamics**: Lindsey, museum, Kate, Rachel, Jason, Alex, Erin, conversation, Laura, exhibits.

**Political Media Analysis**: media platforms, election, politics, teenagers, elections, White House, Barack Obama, nation, Confederate, depression.

**International Relations and Policy**: cooperation, EU, nations, alliance, NATO, European Union, member states, policy, monarch, Brexit.

**Astrophysics and Physical Sciences**: electrons, km, Moon, acceleration, orbit, friction, current, asteroid, electron, collector emitter.

**Film Critique and Analysis**: movie review, film, reviewer, sentiment, critic, flaws, DVD, plot, opinion, originality.

While those topics are not domain-specific, they did not appear right away in the rejected dataset. Further research is needed to understand the reason behind the prominence of those topics in the accepted dataset.

# Load Dataset

```python
from datasets import load_dataset
dataset = load_dataset("bunkalab/topic_based_chatml_dpo_pairs")['train']
```

Provider:

bunkalab

Summary of Original Information

DPO Pairs

Dataset Overview

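DPO Pairs is a preprocessed version of mlabonne/chatml_dpo_pairs that uses Bunkatopics to extract meaningful topics, helping models converge with less data. The goal was to create a dataset smaller than the original while preserving its efficiency.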

Data Processing Method

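The two datasets used to train the reward model (the rejected Llama answers and the accepted ChatGPT answers) were compared by running topic modeling on both and keeping only the topics present in the accepted dataset but absent from the rejected one. This approach lets models converge faster with about 1/6 of the initial data.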

Topic Analysis

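Topic modeling was applied to both datasets, extracting 30 topics from each, with each topic described by its 10 most specific unigrams or bigrams. The two topic sets were then compared, and topics in the accepted dataset that shared fewer than 2 terms with any topic in the rejected dataset were retained. This yielded 13 distinctive topics, each described by 10 terms.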
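
The comparison rule described above (keep an accepted topic only if it shares fewer than 2 terms with every rejected topic) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' actual pipeline: the function name `distinctive_topics`, the `max_shared` parameter, the toy topic names "Generic QA", "Instruction Following", and "Pets", and the shortened term sets are all hypothetical, and terms are compared as exact, case-sensitive strings.

```python
from typing import Dict, List, Set


def distinctive_topics(
    accepted: Dict[str, Set[str]],
    rejected: Dict[str, Set[str]],
    max_shared: int = 2,
) -> List[str]:
    """Return names of accepted topics that share fewer than `max_shared`
    terms with every topic in `rejected` (hypothetical helper)."""
    kept = []
    for name, terms in accepted.items():
        # Set intersection counts the terms two topics have in common.
        if all(len(terms & other) < max_shared for other in rejected.values()):
            kept.append(name)
    return kept


# Toy inputs: real topics in the card have 10 terms each; these are shortened.
accepted_topics = {
    "Wildlife Ecology": {"birds", "prey", "animals", "species"},
    "Generic QA": {"question", "answer", "task", "step"},
}
rejected_topics = {
    "Instruction Following": {"question", "answer", "instructions", "response"},
    "Pets": {"animals", "dog", "cat", "food"},
}

print(distinctive_topics(accepted_topics, rejected_topics))
# ['Wildlife Ecology']
```

In the toy data, "Wildlife Ecology" shares at most one term ("animals") with any rejected topic and is kept, while "Generic QA" shares two terms with "Instruction Following" and is dropped.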

Topic List

Emotional Dynamics: feelings, Quinn, Austin, minority women, teaching, schools, individual, personality, backgrounds, triggers.
Global Knowledge Queries: question, information, geography, news articles, Step, answer, capital city, pipeline system, country, analogy.
Digital Interactions and Queries: questions, question, PersonX, modem, answers, effect relationship, Quora, browser, answer, e-commerce.
Business and Cybersecurity: email, businesses, initiatives, innovation, advertising papers, spam, breaches, antivirus, payments, prospects.
Lifestyle and Wellness: sleep, exercise, gifts, shopping, Casey, stores, stress, headaches, options, mood.
Wildlife Ecology: birds, prey, animals, species, infection, nest, eggs, bacteria, insects, kitty condo.
Environmental Science and Climate: temperature, gases, greenhouse, emissions, perturbation, sulfur, dioxide, climate change, water, heat.
Maritime and Mechanical Engineering: ship, bowling, propulsion, beam width, Filing cabinet, LED, lane, containment area, lawnmower, rotors.
Cultural and Social Dynamics: Lindsey, museum, Kate, Rachel, Jason, Alex, Erin, conversation, Laura, exhibits.
Political Media Analysis: media platforms, election, politics, teenagers, elections, White House, Barack Obama, nation, Confederate, depression.
International Relations and Policy: cooperation, EU, nations, alliance, NATO, European Union, member states, policy, monarch, Brexit.
Astrophysics and Physical Sciences: electrons, km, Moon, acceleration, orbit, friction, current, asteroid, electron, collector emitter.
Film Critique and Analysis: movie review, film, reviewer, sentiment, critic, flaws, DVD, plot, opinion, originality.

Loading the Dataset

from datasets import load_dataset
dataset = load_dataset("bunkalab/topic_based_chatml_dpo_pairs")["train"]
