menlo

Name: menlo
Creator: maas
Published: 2025-12-05 12:14:53
License: No description provided

ModelScope community. Updated 2025-12-05; indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/facebook/menlo

Resource description:

<img src="https://cdn-uploads.huggingface.co/production/uploads/68da8c7ff071f8164ec27f32/31bgvd4QTMBG740lT0yz0.png" alt="description" width="500">

This dataset is released as part of **[MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 languages](https://arxiv.org/abs/2509.26601)**.

## MENLO

**tl;dr**: Massively multilingual preference evaluation, reward modeling, and post-training to improve LLMs' language proficiency

Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt–response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.

For more details, please refer to our [MENLO](https://arxiv.org/abs/2509.26601) paper.

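As a rough illustration of what pairwise evaluation against human preferences can look like, the sketch below scores a judge by how often it picks the same response as the human annotator. It is not the MENLO evaluation code: the `PreferencePair` fields, the `pairwise_agreement` helper, the length-based judge, and the toy pairs are all hypothetical names and data chosen for this example, and the real dataset schema may differ.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PreferencePair:
    """One prompt with two candidate responses (hypothetical schema)."""
    prompt: str
    response_a: str
    response_b: str
    human_preference: str  # "a" or "b", as chosen by a human annotator


def pairwise_agreement(
    pairs: Iterable[PreferencePair],
    judge: Callable[[PreferencePair], str],
) -> float:
    """Fraction of pairs where the judge picks the human-preferred response."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    hits = sum(1 for pair in pairs if judge(pair) == pair.human_preference)
    return hits / len(pairs)


def longer_response_judge(pair: PreferencePair) -> str:
    """Placeholder judge: prefers the longer response. A real judge would be an LLM."""
    return "a" if len(pair.response_a) >= len(pair.response_b) else "b"


# Toy pairs, invented for illustration only (not taken from the dataset).
toy_pairs = [
    PreferencePair("Greet me.", "Hello, how are you doing today?", "Hi.", "a"),
    PreferencePair("Confirm.", "Yes.", "Yes, that works well for me.", "b"),
    PreferencePair("Thank me.", "Thanks a lot for your help!", "Thx", "b"),
    PreferencePair("Agree.", "OK", "Sure, sounds good", "a"),
]

if __name__ == "__main__":
    score = pairwise_agreement(toy_pairs, longer_response_judge)
    print(f"agreement with human preference: {score:.2f}")
```

Swapping `longer_response_judge` for a function that calls an LLM judge keeps the scoring logic unchanged, which is the main reason for passing the judge in as a callable.
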
## Citation

If you use the MENLO dataset from our work, please cite with the following BibTeX entry:

```bibtex
@article{whitehouse2025menlo,
  title={MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages},
  author={Chenxi Whitehouse and Sebastian Ruder and Tony Lin and Oksana Kurylo and Haruka Takagi and Janice Lam and Nicolò Busetto and Denise Diaz},
  year={2025},
  journal={arXiv preprint arXiv:2509.26601},
  url={https://arxiv.org/abs/2509.26601},
}
```

## License

Use of this repository and related resources is governed by the MENLO Research License.

Provider:

maas

Created:

2025-10-04
