POLLUX

Name: POLLUX
Creator: maas
Published: 2025-12-05 16:40:08
License: No description available

ModelScope community; updated 2025-12-05; indexed 2025-07-05

Download link:

https://modelscope.cn/datasets/ai-forever/POLLUX

Description:

![banner](images/logo_pollux_horiz_short_WHITEBG.png)

# Dataset Card for the POLLUX dataset

The POLLUX dataset provides a quantitative and qualitative assessment of LLMs’ generative capabilities in Russian across a variety of tasks and evaluation criteria.

## Dataset Details

### Dataset Description

The POLLUX dataset is built upon two comprehensive taxonomies: generative tasks and evaluation criteria. The generative task taxonomy encompasses nearly 400 tasks originally derived from user requests to LLM services, with each task featuring three distinct complexity levels. Complementing this, the evaluation criteria taxonomy comprises over 300 criteria organized into five categories: Critical, General, Subjective, Domain-specific, and Task-specific criteria. Each criterion includes a detailed description and scoring rubrics to ensure consistent evaluation.

The dataset contains 2,100 unique, manually created instructions that are evenly distributed across all tasks in the taxonomy. These instructions were written entirely from scratch by domain experts who were prohibited from consulting internet sources or any printed or digital materials, ensuring originality and authenticity.

For each instruction, responses were generated by seven LLMs that were leading at the time of the work: [OpenAI o1](https://openai.com/o1/), [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet), [OpenAI GPT-4o](https://openai.com/index/hello-gpt-4o/), [Llama 3.1 405B](https://huggingface.co/meta-llama/Llama-3.1-405B), [GigaChat Max](https://giga.chat/), [YandexGPT 4 Pro](https://ya.ru/ai/gpt), and [T-pro-it-1.0](https://huggingface.co/t-tech/T-pro-it-1.0), resulting in nearly 11,500 total responses.

Each response is evaluated against a carefully curated set of criteria (averaging 16 criteria per instruction) that includes all Critical, Subjective, and General criteria, along with the relevant Task-specific and Domain-specific criteria.
The annotation process involved multiple experts evaluating each response, with at least two experts assigned to assess each criterion in the corresponding set. Experts provided both numerical scores and rationales for their assessments. This annotation procedure yielded 471,515 individual point estimates with accompanying comments, which were then aggregated across overlapping annotations to produce 161,076 final consolidated estimates.

- **Language(s) (NLP):** Russian
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Dataset Sources

- **Repository:** [POLLUX code base](https://github.com/ai-forever/POLLUX)
- **Paper:** [ArXiv preprint](https://arxiv.org/pdf/2505.24616)
- **Demo:** [Interactive demo](https://ai-forever.github.io/POLLUX/)

## Uses

### Direct Use

The POLLUX dataset is specifically designed to comprehensively assess the generative capabilities of language models. The evaluation framework operates on a straightforward principle: language models generate responses to dataset instructions, and these responses are subsequently evaluated against carefully selected sets of corresponding criteria.

Beyond its primary function as an evaluation tool, POLLUX serves as a versatile benchmark for LM-as-a-Judge methodologies. The dataset provides all essential components required for such applications: original instructions, diverse responses from multiple state-of-the-art language models, corresponding numerical scores, and detailed textual commentary from expert evaluators.

### Out-of-Scope Use

While the POLLUX dataset could potentially serve as a valuable addition to supervised fine-tuning (SFT) datasets, using any portion of it for training purposes is strongly discouraged. Instead, POLLUX should be preserved in its intended role as a high-quality, large-scale evaluation benchmark that comprehensively covers the majority of generative tasks and linguistic phenomena specific to the Russian language.
The primary rationale for this recommendation lies in maintaining the integrity and reliability of the evaluation framework. Using POLLUX data for training would compromise its effectiveness as an independent assessment tool, potentially leading to inflated performance metrics and undermining the validity of comparative analyses. By keeping POLLUX exclusively as an evaluation resource, researchers can ensure unbiased and meaningful assessments of model capabilities across diverse Russian language generation tasks.

## Dataset Structure

The POLLUX dataset consists of samples that represent aggregated numerical evaluations of language model responses. Each sample provides a quantitative assessment of how well a language model's answer performs against a specific evaluation criterion when responding to a given instruction. Each sample is described by the following fields:

- `instruction`: `str`, the original instruction, meaning the prompt itself together with its context, if any;
- `reference_answer`: `str`, the correct answer to a given instruction; only present for instructions that permit a definite correct answer;
- `answer`: `str`, an answer given by a language model;
- `model_id`: `str`, the identity of a language model; one of `o1` ([OpenAI o1](https://openai.com/o1/)), `gpt4` ([OpenAI GPT-4o](https://openai.com/index/hello-gpt-4o/)), `claude-3.5-sonnet` ([Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet)), `llama 405B` ([Llama 3.1 405B](https://huggingface.co/meta-llama/Llama-3.1-405B)), `gigachat_max` ([GigaChat Max](https://giga.chat/)), `yandexgpt_pro` ([YandexGPT 4 Pro](https://ya.ru/ai/gpt)), and `tpro` ([T-pro-it-1.0](https://huggingface.co/t-tech/T-pro-it-1.0));
- `task_type`: `str`, first level of the generative tasks taxonomy; see the full taxonomy in Appendix O of the [preprint](https://arxiv.org/pdf/2505.24616);
- `task_subtype`: `str`, second level of the generative tasks taxonomy;
- `task_subsubtype`: `str`, third level of the generative tasks taxonomy;
- `difficulty`: `str`, complexity level; one of `Easy`, `Medium`, or `Hard` for all tasks except `Решить задачу (STEM)` ("Solve a problem (STEM)"), which uses the `High School` and `University` complexity levels;
- `domain`: `str`, functional style of the instruction;
- `is_provocative`: `bool`, whether the instruction encourages the model to elaborate on sensitive topics;
- `criteria_name`: `str`, name of the evaluation aspect;
- `criteria_description`: `str`, description of the corresponding evaluation aspect;
- `rubrics`: `str`, a list of numerical scores, each accompanied by detailed guidelines for when to assign that specific value;
- `rubrics_example`: `str`, an example of numerical score assignment;
- `annotations`: `List[Dict[str, int | str]]`, a list of point estimates; each point estimate consists of a numerical score and an expert rationale;
- `criterion_score`: `float`, the numerical criterion evaluation averaged over the annotations.

## Dataset Creation

### Curation Rationale

The POLLUX dataset is designed with the primary objective of establishing a systematic framework for evaluating the generative capabilities of Russian-language models. By providing high-quality annotation data that encompasses both qualitative insights and quantitative metrics, this dataset addresses a critical gap in Russian NLP evaluation resources. The systematic approach enables researchers to conduct rigorous assessments of model performance, while the dual-layered annotation structure, which combines numerical scores with detailed qualitative feedback, offers nuanced perspectives on model strengths and limitations.

### Source Data

#### Data Collection and Processing

For each (instruction, answer) pair we assembled a set of evaluation criteria (Critical, Subjective, General, and the relevant Domain- and Task-specific criteria). The source instructions were developed by domain experts possessing specialized expertise tailored to each specific task category.
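The record layout from the field list under Dataset Structure can be sketched as follows. This is a minimal illustration rather than the official loader: the sample record is invented, and the annotation keys `score` and `comment` are assumed names, since the card does not state which keys are used inside `annotations`. The sketch shows how `criterion_score` relates to the per-expert point estimates as a plain mean.

```python
from statistics import mean
from typing import Any

# Invented record following the field list in Dataset Structure; all values are
# placeholders. The keys "score" and "comment" inside each annotation are assumptions.
sample: dict[str, Any] = {
    "instruction": "<prompt text>",
    "answer": "<model answer>",
    "model_id": "gpt4",
    "difficulty": "Medium",
    "is_provocative": False,
    "criteria_name": "<criterion name>",
    "annotations": [
        {"score": 2, "comment": "<expert rationale>"},
        {"score": 1, "comment": "<expert rationale>"},
    ],
}


def recompute_criterion_score(record: dict[str, Any], score_key: str = "score") -> float:
    """Average the numerical scores over all annotations of one record."""
    scores = [annotation[score_key] for annotation in record["annotations"]]
    if not scores:
        raise ValueError("record has no annotations")
    return float(mean(scores))


sample["criterion_score"] = recompute_criterion_score(sample)
print(sample["criterion_score"])  # 1.5
```

If the real annotation dictionaries use different key names, only the `score_key` argument needs to change.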
A total of 50 samples were created per task group (representing the first level of our taxonomy), with complexity levels systematically distributed across three tiers: 10 easy, 15 medium, and 25 hard instructions. Each instruction underwent rigorous validation to ensure compliance with the corresponding task definitions and complexity level requirements. To maintain originality and prevent bias, all instructions were created entirely from scratch, with experts explicitly prohibited from consulting internet resources or any published materials.

Seven leading large language models were evaluated using this instruction set: [OpenAI o1](https://openai.com/o1/), [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet), [OpenAI GPT-4o](https://openai.com/index/hello-gpt-4o/), [Llama 3.1 405B](https://huggingface.co/meta-llama/Llama-3.1-405B), [GigaChat Max](https://giga.chat/), [YandexGPT 4 Pro](https://ya.ru/ai/gpt), and [T-pro-it-1.0](https://huggingface.co/t-tech/T-pro-it-1.0). All models were tested across the complete instruction dataset, with the exception of STEM, coding, and QA tasks, which were evaluated using only three models: GigaChat Max, YandexGPT 4 Pro, and OpenAI GPT-4o. To ensure consistency and comparability, all models were executed using their default inference hyperparameters.

For comprehensive evaluation, each instruction-answer pair was assessed using a multi-dimensional criteria framework encompassing Critical, Subjective, and General evaluation metrics, supplemented by relevant Domain-specific and Task-specific criteria tailored to the particular instruction category.

#### Who are the source data producers?

For instruction creation and criteria annotation, 10 expert panels were formed: 5 for the functional styles, panels for editors and translators, and separate panels for code-related tasks, Science, Technology, Engineering, and Mathematics (STEM) problems, and information seeking.
See Appendix K in the [preprint](https://arxiv.org/pdf/2505.24616) for a description of the panels and Table 21 in Appendix L for profiles of the 45 experts who developed the instructions. We carefully selected experts who possessed both relevant academic credentials and practical experience within each panel's area of expertise.

### Annotations

#### Annotation process

Each instruction-answer pair was evaluated using an average of nearly 16 criteria. Experts assigned numerical scores and provided detailed textual reasoning for their assessments. The annotation overlap (the number of experts independently assessing the same criterion for the same response) varied by criteria type: Domain- and Task-specific criteria had an overlap of two, General and Subjective criteria had an overlap of three, and Critical criteria had the highest overlap of five. Inter-annotator agreement was consistently strong, ranging from 0.71 to 0.97 (detailed results are presented in Table 20, Appendix J of the [preprint](https://arxiv.org/pdf/2505.24616)).

#### Who are the annotators?

For criteria annotation we employed expert panels in the same way as for instruction creation. See Table 22 in Appendix L of the [preprint](https://arxiv.org/pdf/2505.24616) for profiles of the experts involved. Appendix L also contains aggregate sociodemographic statistics, some of which are shown below:

![banner](images/gender.png)
![banner](images/age.png)
![banner](images/education.png)
![banner](images/field.png)
![banner](images/professions.png)

#### Personal and Sensitive Information

All instructions, prompts, scenarios, and directives contained within the POLLUX dataset are completely fictional. No instruction represents real requests, actual communications, or genuine directives from any individual or organization. This dataset contains no sensitive personal information, private data, or confidential material from real individuals. All names, personal details, organizations, and identifying information referenced within the dataset are entirely fictional.
All content has been written specifically for research purposes. No real-world communications, documents, or data sources were used in the compilation of this dataset. Any resemblance to real persons (living or deceased), actual events, existing organizations, or genuine circumstances is purely coincidental and unintentional. All content serves exclusively as test material for evaluating language model responses and capabilities within controlled academic research environments.

## DATASET DISCLAIMER AND TERMS OF USE

This dataset ("POLLUX") is provided exclusively for academic research and language model evaluation purposes. By accessing, downloading, or using this dataset, you acknowledge and agree to the following terms:

- This dataset is intended for testing, evaluating, and researching language models and natural language processing systems.
- Any content within this dataset that addresses sensitive, controversial, or potentially objectionable topics is included exclusively for the purpose of evaluating how language models respond to such material. The inclusion of such content does not constitute endorsement, advocacy, or promotion of any particular viewpoint, ideology, or action.
- The dataset creators make no claims that the content represents factual information, authoritative statements, or calls to action. All content is provided as-is for computational analysis purposes only.
- Users must comply with their institutional review board requirements and applicable research ethics guidelines when utilizing this dataset.
- The dataset is provided "as is" without warranty of any kind. The creators disclaim all liability for any consequences arising from the use of this dataset.
- The instructions and scenarios presented in this dataset are not intended for real-world implementation and should not be interpreted as actionable guidance for any actual situation.
By using this dataset, you confirm that you will use the data responsibly in accordance with these terms.

## Citation

```
@misc{martynov2025eyejudgementdissectingevaluation,
  title={Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX},
  author={Nikita Martynov and Anastasia Mordasheva and Dmitriy Gorbetskiy and Danil Astafurov and Ulyana Isaeva and Elina Basyrova and Sergey Skachkov and Victoria Berestova and Nikolay Ivanov and Valeriia Zanina and Alena Fenogenova},
  year={2025},
  eprint={2505.24616},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.24616},
}
```
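For the LM-as-a-Judge use described under Direct Use, a judge model's scores can be compared with the expert `criterion_score` values. The sketch below shows one possible way to do that and is not the procedure from the paper: the rows are made-up numbers standing in for a hypothetical judge's output, the criterion names are placeholders, and mean absolute error is chosen here only for illustration.

```python
from collections import defaultdict

# Toy rows: (criteria_name, expert criterion_score, hypothetical judge score).
# All names and values are invented for illustration.
rows = [
    ("criterion_a", 2.0, 2.0),
    ("criterion_a", 1.0, 2.0),
    ("criterion_b", 0.5, 0.0),
    ("criterion_b", 1.5, 1.0),
]


def mae_per_criterion(rows):
    """Mean absolute error between judge and expert scores, grouped by criterion."""
    errors = defaultdict(list)
    for name, expert, judge in rows:
        errors[name].append(abs(judge - expert))
    return {name: sum(errs) / len(errs) for name, errs in errors.items()}


result = mae_per_criterion(rows)
print(result)  # {'criterion_a': 0.5, 'criterion_b': 0.5}
```

Grouping by criterion keeps disagreements on one evaluation aspect from being hidden by agreement on another; other agreement measures could be swapped in for the absolute error.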


Provider:

maas

Created:

2025-07-01
