PersonaHub
Source: ModelScope community. Updated: 2025-12-04. Indexed: 2024-08-31.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/PersonaHub
Description:
# Scaling Synthetic Data Creation with 1,000,000,000 Personas
This repo releases data introduced in our paper [Scaling Synthetic Data Creation with 1,000,000,000 Personas](https://arxiv.org/pdf/2406.20094):
We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce **PERSONA HUB** – a collection of **1 billion diverse personas** automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing PERSONA HUB’s use cases in synthesizing high-quality **mathematical and logical reasoning** problems, **instructions** (i.e., user prompts), **knowledge-rich texts**, **game NPCs** and **tools** (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.
<div align="center">
<img src="./assets/persona_overview.png" width="90%">
</div>
## Data Release
### Synthetic Data Samples
To facilitate research in persona-driven data synthesis, we are initially releasing the following synthetic data samples, created with various personas:
* **50,000 math problems**
* **50,000 logical reasoning problems**
* **50,000 instructions**
* **10,000 knowledge-rich texts**
* **10,000 game NPCs**
* **5,000 tools (functions)**
### Persona Hub
We also release a subset of our PERSONA HUB, including:
* **200,000 personas (early preview)**
* **370,000,000 elite personas (added in Feb 2025)**
## Run Demo
One can try the demo to synthesize data with PERSONA HUB simply by running code in https://github.com/tencent-ailab/persona-hub:
```bash
# ensure that you have installed datasets and openai (pip install datasets openai) and configured the openai_api_key before running
bash demo_openai_synthesize.sh # using GPT-4o to synthesize data with PERSONA HUB
```
or
```bash
# ensure that you have installed datasets, transformers and vllm (pip install datasets transformers vllm) before running
bash demo_vllm_synthesize.sh # using open-source models to synthesize data with PERSONA HUB
```
Note that the data synthesis prompt templates we provide are for reference only. You can customize your desired prompts in `code/prompt_templates.py`.
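As a rough illustration of what a customized persona-driven prompt could look like, here is a minimal sketch. The template wording, the `MATH_TEMPLATE` name, and the `build_prompts` helper are all hypothetical; they are not taken from `code/prompt_templates.py`, whose actual templates may differ. The sketch only shows the general idea of inserting a persona description into a task prompt before sending it to a model.

```python
from string import Template

# Hypothetical template for illustration only. The repository's real
# templates live in code/prompt_templates.py and may be worded differently.
MATH_TEMPLATE = Template(
    "Create a challenging math problem that the following persona "
    "might encounter.\n"
    "Persona: $persona\n"
    "Respond with the problem only."
)


def build_prompts(personas, template=MATH_TEMPLATE):
    """Return one prompt per non-empty persona description."""
    prompts = []
    for persona in personas:
        persona = persona.strip()
        if not persona:
            continue  # skip blank persona rows
        prompts.append(template.substitute(persona=persona))
    return prompts


if __name__ == "__main__":
    # Made-up persona descriptions, not rows from the released data.
    sample_personas = [
        "a structural engineer who inspects suspension bridges",
        "a high-school chemistry teacher preparing lab safety drills",
    ]
    for prompt in build_prompts(sample_personas):
        print(prompt)
        print("---")
```

Each resulting prompt can then be passed to whichever model backend you use; swapping the template changes the kind of data synthesized while the persona list stays the same.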
## Argilla
You can also access this dataset in [Argilla space](https://argilla-data-explorers.hf.space/), as introduced in the following video:
* Video: https://youtu.be/timmCn8Nr6g?feature=shared
## Contact
* Please email `sggetao@gmail.com` or `dyu@global.tencent.com`
* Github page: https://github.com/tencent-ailab/persona-hub
## Disclaimer
PERSONA HUB can facilitate synthetic data creation at billion scale to simulate diverse inputs (i.e., use cases) from a wide variety of real-world users. If this data is used as input to query a target LLM to obtain its outputs at scale, there is a high risk that the LLM's knowledge, intelligence and capabilities will be dumped and easily replicated, thereby challenging the leading position of the most powerful LLMs. It is crucial to avoid misuse and ensure ethical and responsible application to prevent privacy violations and other ethical concerns.
The released data is all generated by publicly available models (GPT-4, Llama-3 and Qwen), and is intended for research purposes only. Users must also comply with the respective license agreements and usage policies of these models when using the synthesized data. The data may contain inaccuracies, unsafe content, or biases, for which we cannot be held responsible. Please evaluate its accuracy and suitability before use. Tencent and its licensors provide the data AS-IS, without warranty of any kind, express or implied. The views and opinions expressed in the data do not necessarily reflect those of Tencent.
Provider:
maas
Created:
2024-07-02



