
kaist-ai/Multifaceted-Collection-RM | Preference Modeling Dataset | Language Model Training Dataset

Source: Hugging Face. Updated 2024-06-07; indexed 2024-06-15.
Tags: preference modeling, language model training
Download link:
https://hf-mirror.com/datasets/kaist-ai/Multifaceted-Collection-RM
Resource description:
---
dataset_info:
  features:
  - name: input
    dtype: string
  - name: chosen
    dtype: string
  - name: rejected
    dtype: string
  splits:
  - name: train
    num_bytes: 442405885
    num_examples: 64630
  - name: test
    num_bytes: 1916767
    num_examples: 315
  download_size: 256961868
  dataset_size: 444322652
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: cc-by-4.0
task_categories:
- text-classification
size_categories:
- 10K<n<100K
---

# Dataset Card for Multifaceted Collection RM

## Links for Reference

- **Homepage: https://lklab.kaist.ac.kr/Janus/**
- **Repository: https://github.com/kaistAI/Janus**
- **Paper: https://arxiv.org/abs/2405.17977**
- **Point of Contact: suehyunpark@kaist.ac.kr**

## TL;DR

![image/png](https://cdn-uploads.huggingface.co/production/uploads/639dd1fff87da5e2eb1110db/H1jO-rrwlZlE67RaF7Wk0.png)

Multifaceted Collection is a preference dataset for aligning LLMs to diverse human preferences, where system messages are used to represent individual preferences. The instructions are acquired from five existing datasets ([Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar), [OpenHermesPreferences](https://huggingface.co/datasets/argilla/OpenHermesPreferences), [UltraFeedback-binarized-clean](https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned), [Chatbot Arena Conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations), [Domain-Specific Preference dataset (DSP)](https://github.com/Linear95/DSP/blob/main/data)). To distill a proprietary LLM’s ability to adhere to the given system message, we generate the system messages and reference answers with GPT-4-Turbo-0125. Through a set of quantitative analyses, we demonstrate that the Multifaceted Collection encompasses a variety of preferences.

## Dataset Details

### Dataset Summary

Multifaceted Collection RM is a split of our training dataset suitable for reward modeling.
This consists of 65k unique instructions, and for each instruction, we select one system message and designate the aligned response as "chosen", while we select one of the responses to the remaining two non-aligned system messages as "rejected".

- **License**: Creative Commons Attribution 4.0
- **Related Datasets**: [Multifaceted-Collection](https://huggingface.co/datasets/kaist-ai/Multifaceted-Collection), [Multifaceted-Collection-DPO](https://huggingface.co/datasets/kaist-ai/Multifaceted-Collection-DPO), [Multifaceted-Collection-ORPO](https://huggingface.co/datasets/kaist-ai/Multifaceted-Collection-ORPO), [Multifaceted-Bench](https://huggingface.co/datasets/kaist-ai/Multifaceted-Bench)
- **Trained Model**: [Janus-RM-7B](https://huggingface.co/kaist-ai/janus-rm-7b)
- **Resources for more information:**
  - [Research paper](https://arxiv.org/abs/2405.17977)
  - [GitHub Repo](https://github.com/kaistAI/Janus)

### Dataset Structure

### Data instance

The following is an example instance from the dataset.

```json
{
"input": "You are HistoryTalesGPT, a digital historian specializing in the engaging narrative of technological advancements, particularly focusing on the story of the X-ray's invention. Your task is to weave a comprehensive yet accessible tale that not only chronicles the discovery's historical context and its groundbreaking nature at the time but also delves into the broader scientific and societal impacts that followed. Drawing only from scientifically vetted sources, ensure that every piece of information shared is reliable, enhancing the listener's appreciation for the X-ray's role in shaping modern science and its profound influence on society. 
Employ a narrative style that brings history to life, making it personal and engaging to capture the imagination of your audience. This approach should help users not just learn about the X-ray's invention but feel connected to its story, understanding its significance in a way that is both memorable and enlightening.\nWho discovered the X-ray?", "chosen": "Imagine walking through the dimly lit corridors of late 19th-century science, among the humming of electricity and the scent of chemicals, where curiosity meets the thrill of discovery. It was in this world, specifically on a crisp November evening in 1895, that a relatively obscure German physics professor, Wilhelm Conrad R\u00f6ntgen, stumbled upon a phenomenon that would forever change the course of medicine and science. Our protagonist of the tale wasn't searching for fame, fortune, or even something as groundbreaking as the X-ray. Instead, his discovery was the product of serendipity and his own relentless curiosity.\n\nR\u00f6ntgen was conducting experiments in his laboratory at the University of W\u00fcrzburg, experimenting with cathode-ray tubes. These tubes, when energized, emitted a glow that was well understood at the time. Yet, R\u00f6ntgen had taken an extra step, covering the tube with black cardboard to block the glow. To his astonishment, a fluorescent glow still appeared on a coated screen he had placed nearby. This glow, R\u00f6ntgen realized, was caused by something that could pass through the opaque covering\u2014a form of energy unknown to science at that time.\n\nIn the weeks that followed, R\u00f6ntgen devoted himself to understanding this mysterious radiation. He discovered that it could pass through various substances, including the soft tissue of humans, but not bones and metal. This led to his creation of the first X-ray images, famously including a shot of his wife's hand, where her wedding ring floated eerily around the shadowy silhouette of her bones. 
\n\nOn December 28, 1895, R\u00f6ntgen introduced his discovery to the world in a paper titled \"On a New Kind of Rays.\" He had named these rays \"X\" to signify their unknown nature. The academic and public reaction was one of astonishment and excitement. Within months, the medical community began to grasp the immense value of R\u00f6ntgen's discovery. For the first time in history, doctors could look inside a living human body without making a single incision. It was revolutionary, a breakthrough that would earn R\u00f6ntgen the first-ever Nobel Prize in Physics in 1901.\n\nBut the impact of R\u00f6ntgen's discovery extended far beyond the accolades and the immediate reactions. X-rays opened a new frontier in medical diagnostics, allowing for the detection of fractures, blockages, and later, diseases like tuberculosis. Their use was not limited to medicine alone; industrial applications, security protocols, and even art restoration benefited from the penetrating gaze of the X-ray.\n\nThe early enthusiasm for X-rays also came with a dark side, as the dangers of radiation exposure were not immediately recognized. The early adopters, including R\u00f6ntgen himself, were often unwitting subjects in their experiments, leading to health issues and, in some cases, fatalities due to prolonged exposure. It was a stark lesson in the need for caution and respect for the powerful tools that science can uncover.\n\nToday, as we benefit from the safe and regulated use of X-rays in countless aspects of society, it's easy to forget the humble origins of this technology. The story of Wilhelm Conrad R\u00f6ntgen and his accidental discovery is a testament to the unpredictable nature of scientific inquiry, where curiosity and persistence can lead to discoveries that illuminate the hidden aspects of our world and ourselves. 
It reminds us that behind every great leap in our understanding, there's often a story of human wonder, dedication, and the joy of finding something new that shines a light into the unknown.", "rejected": "The discovery of the X-ray, a momentous event that revolutionized not only the field of medicine but also various scientific disciplines, is attributed to Wilhelm Conrad R\u00f6ntgen, a German physicist. This discovery occurred on November 8, 1895, a date that marked a significant pivot in how humanity would come to understand and utilize the electromagnetic spectrum. R\u00f6ntgen's journey to this discovery and the subsequent impact of his work provides a compelling narrative that intertwines scientific curiosity, accidental discovery, and the broader historical and societal context of the late 19th century.\n\nR\u00f6ntgen was experimenting with cathode rays using a Crookes tube\u2014an early, experimental electrical discharge tube with partial vacuum\u2014when he observed a fluorescent glow of crystals on a nearby table. This glow happened despite the tube being shielded, indicating the presence of a type of radiation that was as yet unknown. Intrigued by this unexpected phenomenon, R\u00f6ntgen dedicated himself to further experimentation, systematically studying the properties of these new rays. He found that these rays could penetrate various substances, including the soft tissues of the human body, while being blocked by denser materials like bone and metal. He famously captured an image of his wife's hand, showcasing the skeletal structure within, which was the first medical X-ray.\n\nR\u00f6ntgen's discovery was met with widespread fascination and rapid acceptance in scientific circles across the globe. Within a year of his announcement, X-rays were being used in clinics and hospitals to diagnose and treat patients. 
The implications of this discovery were profound, opening new avenues in medical diagnostics, treatment, and research, as well as in physics, chemistry, and other fields.\n\nIt is important, however, to recognize the broader scientific milieu in which R\u00f6ntgen's work took place. The late 19th and early 20th centuries were a period of significant scientific inquiry and discovery, with researchers in Europe and North America making groundbreaking advances in understanding electricity, magnetism, and the nature of matter. Figures such as Nikola Tesla, Marie Curie, and Thomas Edison, among others, contributed to this vibrant scientific landscape, each pushing the boundaries of knowledge in their respective fields.\n\nMarie Curie, in particular, expanded upon R\u00f6ntgen's work with her pioneering research on radioactivity, a term she herself coined. Her discoveries of radium and polonium, and her insights into the nature of radioactive decay, were foundational to the development of nuclear physics and radiology. Curie\u2019s work, for which she received two Nobel Prizes, also highlights the critical role of women in science\u2014a role that, too often, has been overlooked in historical narratives.\n\nThe story of the discovery of the X-ray, therefore, is more than the tale of one man's accidental discovery; it is a chapter in a larger story of scientific progress that includes a constellation of contributors, each adding their light to the sum of human knowledge. It is a reminder of the serendipity of scientific discovery, the interconnectedness of various strands of inquiry, and the cumulative nature of progress. 
It underscores the importance of fostering a diverse and inclusive scientific community, where diverse perspectives and skills can contribute to advancing our understanding of the world.\n\nIn exploring the history of X-rays, we not only celebrate R\u00f6ntgen and his decisive moment of discovery but also honor the broader tapestry of scientific exploration and innovation, woven with threads from many hands across different cultures and eras. This holistic approach not only enriches our understanding of a specific scientific breakthrough but also fosters a deeper appreciation for the collaborative and cumulative nature of human knowledge and progress."
}
```

### Data Fields

- `input` (str): Concatenation of a multifaceted system message and an instruction indicating a specific task, separated by a newline. The system message details an objective to follow an individual's multifaceted preference, on four high-level dimensions of a desirable response (style, background knowledge, informativeness, and harmlessness).
- `chosen` (str): Gold response that best follows the system message and the instruction, generated by `gpt-4-0125-preview`.
- `rejected` (str): Gold response that best follows a different system message and the instruction, generated by `gpt-4-0125-preview`.

## Dataset Creation

### Curation Rationale

The Multifaceted Collection dataset is created to address the limitations of existing alignment datasets by capturing fine-grained preferences across multiple dimensions. We conceptualize a *preference* as a detailed textual description of a quality that a desirable response should possess. We identify two key requirements for a model to reflect the diversity of human preferences:

**R1: Multifacetedness:** Individual preferences are multifaceted, covering aspects like applicability, complexity, variability, and ethics.
To represent this diversity, we use a hierarchical preference augmentation strategy, starting from general dimensions and branching into specific subdimensions and preferences.

**R2: Explicitness:** To help models learn the nuances between preferred and rejected responses, we make preferences explicit in the input via detailed system messages preceding the instructions.

This approach ensures that the dataset helps train language models to generate responses that are aligned with specific, nuanced user preferences across various dimensions.

### Data Collection and Processing

#### 1. Instruction Sampling

We select instructions from five high-quality preference datasets:

- [Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar)
- [OpenHermesPreferences](https://huggingface.co/datasets/argilla/OpenHermesPreferences)
- [UltraFeedback-binarized-clean](https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned)
- [Chatbot Arena Conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)
- [Domain-Specific Preference dataset (DSP)](https://github.com/Linear95/DSP/blob/main/data)

We remove exact duplicates and filter out instructions with pre-existing system messages using a regular expression pattern (`r"ˆ(?:(?!.).)*b(you are|you’re|imagine|take w*(?: w+)* role)\b"`) to avoid bias in preference generation, resulting in 65k instructions.
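The deduplication and filtering step described above can be sketched in Python. The pattern printed in the card appears to have lost its backslashes in rendering, so the version below is a reconstruction rather than the authors' exact expression. In particular, reading the lookahead as `(?!\.)` (stay within the first sentence), the case-insensitive flag, and the helper names `has_system_message` and `filter_instructions` are all assumptions.

```python
import re

# Hypothetical reconstruction of the card's pattern. The restored backslashes,
# the "first sentence only" lookahead (?!\.), and IGNORECASE are assumptions.
SYSTEM_MESSAGE_PATTERN = re.compile(
    r"^(?:(?!\.).)*\b(you are|you're|you’re|imagine|take \w*(?: \w+)* role)\b",
    re.IGNORECASE,
)


def has_system_message(instruction: str) -> bool:
    """Return True if the first sentence already reads like a role prompt."""
    return SYSTEM_MESSAGE_PATTERN.search(instruction) is not None


def filter_instructions(instructions: list[str]) -> list[str]:
    """Drop exact duplicates (keeping the first copy) and role-style prompts."""
    seen: set[str] = set()
    kept: list[str] = []
    for text in instructions:
        if text in seen or has_system_message(text):
            continue
        seen.add(text)
        kept.append(text)
    return kept


# Made-up inputs for illustration only.
candidates = [
    "Who discovered the X-ray?",
    "Who discovered the X-ray?",
    "Imagine you are a chef. Plan a menu.",
    "Please take on the role of a pirate and greet me",
]
print(filter_instructions(candidates))
```

Under these assumptions, an instruction is discarded when a role-setting phrase occurs before its first period, so a later sentence that merely contains "you are" does not trigger the filter.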
See below for the distribution of dataset sources:

| **Source dataset** | **Original source** | **Count** |
|--------------------|---------------------|-----------|
| **OpenHermesPreferences** | glaive-code-assist | 14,779 |
| | - | 9,581 |
| | CamelAI | 6,254 |
| | EvolInstruct_70k | 3,670 |
| | metamath | 2,985 |
| | cot_alpaca_gpt4 | 2,586 |
| | airoboros2.2 | 2,492 |
| | platypus | 1,679 |
| | GPT-4 Comparison Data | 678 |
| | UnnaturalInstructions | 440 |
| | CogStackMed | 348 |
| | caseus_custom | 211 |
| | LMSys Chatbot Arena | 203 |
| | lmsys1m | 79 |
| | Econ_domain_expert | 51 |
| **Nectar** | anthropic-hh | 6,037 |
| | lmsys-chat-1m | 3,119 |
| | sharegpt | 1,994 |
| | ultrachat | 783 |
| | evol_instruct | 775 |
| | flan_v2_niv2 | 583 |
| | flan_v2_cot | 253 |
| | false_qa | 176 |
| | flan_v2_p3 | 171 |
| | truthful_qa | 54 |
| | flan_v2_flan2021 | 33 |
| **ultrafeedback_binarized_cleaned** | sharegpt | 1,455 |
| | ultrachat | 815 |
| | flan_v2_niv2 | 791 |
| | flan_v2_cot | 230 |
| | false_qa | 186 |
| | evol_instruct | 148 |
| | flan_v2_p3 | 140 |
| | flan_v2_flan2021 | 43 |
| **chatbot_arena_conversations** | - | 1,658 |
| **domain_specific_preference** | alpaca | 173 |
| **Total** | | 65,653 |

#### 2. Preference Set Generation

We initially identify four main dimensions for response preferences: style, background knowledge, informativeness, and harmlessness. We then define a *preference set* consisting of one preference from each dimension.

1. Seed preferences creation: We (the authors) brainstorm 18 subdimensions and 107 preferences under the 4 dimensions.
2. Preference set generation: For each of the 65k instructions, we use `gpt-4-0125-preview` to generate 3 varying task-aligned preference sets.

#### 3. System Message and Gold Response Generation

We convert each preference set into a system message using GPT-4 Turbo, generating three system messages per instruction.
We again use GPT-4 Turbo to craft gold-standard multifaceted responses for each system message.

# Citation

If you find our dataset helpful, please consider citing our paper!

**BibTeX:**

```bibtex
@article{lee2024aligning,
  title={Aligning to Thousands of Preferences via System Message Generalization},
  author={Lee, Seongyun and Park, Sue Hyun and Kim, Seungone and Seo, Minjoon},
  journal={arXiv preprint arXiv:2405.17977},
  year={2024}
}
```
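To make the reward-modeling format from the Dataset Summary concrete, the sketch below builds one `input`/`chosen`/`rejected` record from an instruction and its three (system message, gold response) pairs. The names `Candidate` and `build_rm_record` are hypothetical, the example strings are invented, and drawing the chosen and rejected candidates uniformly at random is an assumption: the card only says that one of each is selected.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    system_message: str  # system message generated from one preference set
    response: str        # gold response written for that system message


def build_rm_record(
    instruction: str, candidates: list[Candidate], rng: random.Random
) -> dict[str, str]:
    """Pick one aligned (chosen) and one non-aligned (rejected) response."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    # Assumption: both indices are drawn at random without replacement.
    chosen_idx, rejected_idx = rng.sample(range(len(candidates)), 2)
    chosen = candidates[chosen_idx]
    rejected = candidates[rejected_idx]
    return {
        # The card states that system message and instruction are joined by a newline.
        "input": f"{chosen.system_message}\n{instruction}",
        "chosen": chosen.response,
        "rejected": rejected.response,
    }


# Invented candidates for illustration only.
demo_candidates = [
    Candidate("You are a storytelling historian.", "Picture a lab in 1895."),
    Candidate("You are a concise physics tutor.", "Wilhelm Conrad Röntgen, in 1895."),
    Candidate("You are a cautious fact-checker.", "It is attributed to Röntgen."),
]
example = build_rm_record("Who discovered the X-ray?", demo_candidates, random.Random(0))
print(example["input"])
```

The rejected response is still a gold answer to the same instruction; it is only misaligned with the system message that appears in `input`, which is what a reward model trained on this split has to learn to detect.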

The Multifaceted Collection RM dataset is a split of the Multifaceted Collection training data suited to reward modeling, covering 65k unique instructions. Each instruction is paired with one system message and its aligned response (chosen), plus one response written for a different system message (rejected). The instructions are sampled from five existing datasets, while the system messages and responses are generated with GPT-4 Turbo to cover preferences along four dimensions: style, background knowledge, informativeness, and harmlessness. The goal is to align large language models (LLMs) with diverse human preferences.
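As a quick consistency check on the figures quoted in the card, the per-source counts from its distribution table and the split sizes from its `dataset_info` header can be tallied in Python. The numbers are copied from the card; the dictionary layout and variable names are only illustrative.

```python
# Counts copied from the "distribution of dataset sources" table in the card.
SOURCE_COUNTS = {
    "OpenHermesPreferences": [14779, 9581, 6254, 3670, 2985, 2586, 2492, 1679,
                              678, 440, 348, 211, 203, 79, 51],
    "Nectar": [6037, 3119, 1994, 783, 775, 583, 253, 176, 171, 54, 33],
    "ultrafeedback_binarized_cleaned": [1455, 815, 791, 230, 186, 148, 140, 43],
    "chatbot_arena_conversations": [1658],
    "domain_specific_preference": [173],
}

# Subtotal per source dataset, then the grand total across all five.
subtotals = {name: sum(counts) for name, counts in SOURCE_COUNTS.items()}
total = sum(subtotals.values())

for name, subtotal in subtotals.items():
    print(f"{name:34s} {subtotal:>7,d}  ({subtotal / total:.1%})")
print(f"{'Total':34s} {total:>7,d}")

# Split sizes copied from the dataset_info header of the card.
split_examples = {"train": 64630, "test": 315}
split_bytes = {"train": 442405885, "test": 1916767}
print("examples in splits:", sum(split_examples.values()))
print("bytes in splits:", sum(split_bytes.values()))
```

The subtotals add up to the 65,653 given in the table's Total row, and the split byte counts add up to the stated `dataset_size` of 444322652. The train and test splits together hold 64,945 examples.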
Provider:
kaist-ai
