
AceGPT-v2-AlignmentData

Source: ModelScope community · Updated 2025-11-02 · Indexed 2025-01-25
Download link:
https://modelscope.cn/datasets/FreedomIntelligence/AceGPT-v2-AlignmentData
Description:
## Introduction

To efficiently achieve native alignment in AceGPT-v2, this dataset was constructed to train a small alignment model that filters the entire pre-training dataset. It was built through the following steps:

1. Randomly select 96K samples from [ArabicText 2022](https://data.baai.ac.cn/details/ArabicText-2022).
2. Use **GPT-4-turbo** to rewrite the extracted data according to the provided prompt.
3. Organize the rewritten data into pairs to create training data for the alignment LLM.

## System Prompt for Arabic Data Alignment Rewriting

```
### Revised Prompt for rewriting Arabic Data for LLM Training

**Objective**:Rewriting Arabic text data. These rewriting Arabic text data will assist in filtering and refining data for training large language models.

**Criteria for rewriting**:
- **Grammar and Syntax**: Revising the text to ensure it adheres to standard grammatical rules and language norms of Arabic.
- **Cultural Appropriateness**: Identify and exclude any content that is illegal or culturally offensive in Arabic contexts.
- **Noise**: Remove extraneous elements such as advertisements, web links, Garbled Characters , URLs, and any irrelevant content from the text. If the entire text is junk content, discard the whole segment.
- **Consistency**: Ensuring consistency in language and style throughout the sentence.
- **Mathematical Formula Formatting**: If there are mathematical formulas in the text, standardize the formatting of the formulas for clarity.
- **Code Formatting**: If there is code in the text, standardize code snippets for readability.

**Instructions**:
1. Read the text carefully.
2. Analyze the text against the listed criteria and output the analysis of the text.
3. If the given paragraph is entirely incorrect and difficult to rewrite, the rewritten text directly output 'None'.
4. If there are no errors, the rewritten text directly output the content of the Arabic text.
5. Please refer to the example to output the analysis and the rewritten text.
6. Please do not output any content after Rewritten text.

for example:
### Arabic text data rewriting
Arabic text:ودعا الى اجراء اصلاح للاطار التشريعي والمؤسسي الذي ينظّم الحكومات المحلية وادارة الاراضي في زامبيا بغية تلبية احتياجات السكان في المناطق الحضرية المتسارعة النمو وجمع موارد مالية لتحسين تنفيذ الخدمات.
Analysis:
Grammar and Syntax: Minor corrections needed for better clarity.
Cultural Appropriateness: Content is appropriate.
Noise: No extraneous elements found.
Consistency: The text is consistent in style and language.
Mathematical Formula Formatting: Not applicable.
Code Formatting: Not applicable.
Rewritten text:ودعا إلى إجراء إصلاح للإطار التشريعي والمؤسسي الذي ينظم الحكومات المحلية وإدارة الأراضي في زامبيا بهدف تلبية احتياجات السكان في المناطق الحضرية السريعة النمو وجمع الموارد المالية لتحسين تنفيذ الخدمات.

### Arabic text data rewriting
Arabic text:{prompt}
Analysis:
Rewritten text:
```

## Paper

For more details, please refer to the [paper](https://huggingface.co/FreedomIntelligence/AceGPT-v2-70B-Chat/blob/main/Alignment_at_Pre_training__a_Case_Study_of_Aligning_LLMs_in_Arabic.pdf).

### BibTeX entry and citation info

```
@inproceedings{liang2024alignment,
  title={Alignment at Pre-training! Towards Native Alignment for Arabic {LLM}s},
  author={Juhao Liang and Zhenyang Cai and Jianqing Zhu and Huang Huang and Kewei Zong and Bang An and Mosen Alharthi and Juncai He and Lian Zhang and Haizhou Li and Benyou Wang and Jinchao Xu},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=woRFmNJiLp}
}
```

```
@article{zhu2024second,
  title={Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion},
  author={Zhu, Jianqing and Huang, Huang and Lin, Zhihang and Liang, Juhao and Tang, Zhengyang and Almubarak, Khalid and Alharthi, Mosen and An, Bang and He, Juncai and Wu, Xiangbo and Yu, Fei and Chen, Junying and Ma, Zhuoheng and Du, Yuhao and Hu, Yan and Zhang, He and Alghamdi, Emad A. and Zhang, Lian and Sun, Ruoyu and Li, Haizhou and Wang, Benyou and Xu, Jinchao},
  journal={},
  year={2024}
}
```
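
## Example: building rewrite pairs (illustrative sketch)

The three construction steps in the Introduction (sample, rewrite, pair) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `rewrite_fn` is a hypothetical callable standing in for a GPT-4-turbo request, the `original`/`rewritten` field names are assumptions, the line breaks in the template tail are reconstructed, and dropping samples whose rewrite is `'None'` is one possible policy that the dataset card does not confirm.

```python
import random
from typing import Callable, Optional

# Tail of the system prompt from the dataset card. The criteria, instructions,
# and worked example would precede it; line breaks here are an assumption.
PROMPT_TAIL = (
    "### Arabic text data rewriting\n"
    "Arabic text:{prompt}\n"
    "Analysis:\n"
    "Rewritten text:\n"
)

MARKER = "Rewritten text:"


def build_prompt(text: str) -> str:
    """Fill the {prompt} slot; str.replace avoids issues with braces in text."""
    return PROMPT_TAIL.replace("{prompt}", text)


def parse_rewrite(response: str) -> Optional[str]:
    """Return the text after the last 'Rewritten text:' marker.

    Returns None when the marker is missing, the rewrite is empty, or the
    model answered 'None' (instruction 3 of the prompt).
    """
    idx = response.rfind(MARKER)
    if idx == -1:
        return None
    rewritten = response[idx + len(MARKER):].strip()
    if not rewritten or rewritten.strip("'\"") == "None":
        return None
    return rewritten


def build_pairs(
    corpus: list[str],
    rewrite_fn: Callable[[str], str],  # hypothetical wrapper around the LLM call
    k: int,
    seed: int = 0,
) -> list[dict[str, str]]:
    """Randomly pick k texts, rewrite each, and keep (original, rewritten) pairs."""
    rng = random.Random(seed)
    pairs = []
    for text in rng.sample(corpus, k):
        rewritten = parse_rewrite(rewrite_fn(build_prompt(text)))
        if rewritten is not None:  # assumption: discarded samples are skipped
            pairs.append({"original": text, "rewritten": rewritten})
    return pairs
```

In use, `rewrite_fn` would send the full system prompt plus the filled template to the model and return its raw reply; for the actual dataset, `k` would correspond to the 96K samples mentioned above. If discarded samples should instead be kept as training targets, `parse_rewrite` could return the literal string `'None'` rather than skipping them.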

Provider: maas
Created: 2025-01-20