
community-alignment-dataset

ModelScope Community | Updated 2025-12-05 | Indexed 2025-07-19
Download link:
https://modelscope.cn/datasets/facebook/community-alignment-dataset
Resource description:
<h1 align="center"> Community Alignment </h1>

<h3 align="center"> <a href="https://github.com/facebookresearch/community-alignment-dataset">Github</a> &nbsp; | &nbsp; <a href="https://arxiv.org/abs/2507.09650">Paper</a> </h3>

## Dataset

*Community Alignment* is a large-scale, open-source, multilingual, multi-turn preference dataset for aligning LLMs with human preferences across cultures. It features prompt-level overlap in annotators, enabling social-choice-based and distributional approaches to LLM alignment, as well as natural language explanations for choices.

* [**Large-scale**] ~200,000 comparisons of LLM responses, collected from >3,000 unique annotators who provided feedback at an individual level.
* [**Multilingual**] Contains comparisons in English, French, Italian, Hindi, and Portuguese. 63% of comparisons are non-English.
* [**Prompt-level overlap**] 2,599 prompts feature at least 10 annotations per comparison, where annotators overlap across prompts.
* [**High-quality natural language explanations**] For 27% of prompts, annotators provided detailed explanations of why they preferred one response over another.

### Splits

* [Full dataset](https://huggingface.co/datasets/facebook/community-alignment-dataset/blob/main/data/full_community_alignment.csv) - May include conversations with missing turns within a 4-turn exchange. The `assigned_lang` field indicates the language assigned to the annotator; however, in certain cases, the annotator may have used a different language.
* [Filtered dataset](https://huggingface.co/datasets/facebook/community-alignment-dataset/blob/main/data/filtered_community_alignment.csv) - Contains only conversations without missing turns in the middle. An LLM judge is used to filter out conversations where the detected language of the user prompts does not match the assigned language. Additionally, prompts are filtered out if the annotator incorrectly used the prompt field for their own natural language explanation or asked the LLM to choose its preferred response.

## License

Community Alignment is released under the Creative Commons Attribution 4.0 International License (CC-BY-4.0).

## Codebook

Please see [Appendix H of the paper](https://arxiv.org/abs/2507.09650) for the codebook.

## Usage

In ~27% of the conversations in our dataset, annotators initiate the dialogue with their own prompts. These prompts do not reflect the position of Meta or its employees. Users must implement appropriate filtering and moderation measures when using this dataset for training, to ensure that the generated outputs adhere to their own content standards. The user-initiated conversations can be easily filtered out of the dataset using the `is_pregenerated_first_prompt` flag.

## Attribution

When using this dataset in any publications or research output, please cite the accompanying paper.

For BibTex, use

```BibTex
@article{zhang2025cultivating,
  title = {Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset},
  author = {Lily Hong Zhang and Smitha Milli and Karen Jusko and Jonathan Smith and Brandon Amos and Wassim Bouaziz and Manon Revel and Jack Kussman and Lisa Titus and Bhaktipriya Radharapu and Jane Yu and Vidya Sarma and Kris Rose and Maximilian Nickel},
  year = {2025},
  journal = {arXiv preprint arXiv:2507.09650}
}
```

For in-text citations, use

```Text
Zhang, L. H., Milli, S., Jusko, K., Smith, J., Amos, B., Bouaziz, W., Revel, M., Kussmann, J., Titus, L., Radharapu, B., Yu, J., Sarma, V., Rose, K., Nickel, M. (2025). Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset.
```

## Feedback

If you use Community Alignment, we would love to know (a) what you found valuable in it and (b) what features you wish it had (as well as any other feedback you may have). This will help support and guide us in future projects of this kind. Additionally, if you encounter any issues, such as the presence of personal or private information (PII) or requests from participants for data removal, please let us know. You can contact us at [communityalignment@meta.com](mailto:communityalignment@meta.com).
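
The filtering step described under Usage can be sketched as follows. This is a minimal illustration, not the authors' code: only the `is_pregenerated_first_prompt` and `assigned_lang` field names come from the dataset card. The `conversation_id` column, the sample rows, and the assumption that the flag is stored as `True`/`False` text (with a truthy value meaning the first prompt was pregenerated, not written by the annotator) are hypothetical and should be checked against the actual CSV and the codebook.

```python
import csv
import io

# Assumption: the flag may be serialized as text in the CSV; these spellings
# are treated as "first prompt was pregenerated".
TRUTHY = {"true", "1", "yes"}


def is_pregenerated(row):
    """Return True when the row's first prompt is flagged as pregenerated."""
    value = str(row.get("is_pregenerated_first_prompt", ""))
    return value.strip().lower() in TRUTHY


def keep_pregenerated(rows):
    """Drop user-initiated conversations, keeping only flagged rows."""
    return [row for row in rows if is_pregenerated(row)]


def count_by_language(rows):
    """Count rows per assigned annotator language."""
    counts = {}
    for row in rows:
        lang = row.get("assigned_lang", "")
        counts[lang] = counts.get(lang, 0) + 1
    return counts


# Synthetic sample standing in for the real CSV; `conversation_id` and the
# language spellings are hypothetical.
SAMPLE_CSV = """conversation_id,assigned_lang,is_pregenerated_first_prompt
c1,English,True
c2,French,False
c3,Hindi,True
c4,French,True
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept = keep_pregenerated(rows)
print(len(kept), count_by_language(kept))
```

To apply the same functions to a downloaded split, replace the `io.StringIO(SAMPLE_CSV)` source with an open file handle for the CSV. Filtering on this flag removes annotator-written first prompts only; it does not replace the content moderation the Usage section asks for.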

Provider:
maas
Created:
2025-07-16