rwq-elo/rwq-battle-records

Name: rwq-elo/rwq-battle-records
Creator: rwq-elo
Published: 2024-03-06 11:36:37
License: cc-by-nc-4.0

Source: Hugging Face. Updated 2024-03-06; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/rwq-elo/rwq-battle-records

Description:

---
language:
- en
license: cc-by-nc-4.0
---

# RWQ battle records dataset

This dataset stores the battle records of 24 popular LLMs that conduct pairwise Elo battles on [RWQ questions](https://huggingface.co/datasets/rwq-elo/rwq-questions), with GPT-4 acting as the judge that determines the winner of each round of QA.

## Columns

| Column Name | Data Type | Description |
| --- | --- | --- |
| question | string | The question posed to the LLM. |
| model | string | The id/name of the LLM. |
| model_a | string | The id/name of the first model in a pairwise battle; both models answer the same question. |
| model_b | string | The id/name of the second model in a pairwise battle; both models answer the same question. |
| winner | string | The outcome of one pairwise battle, one of `model_a`, `model_b`, `tie`, or `tie(all bad)`. |
| judger | string | The name and version of the GPT judge, such as gpt-4-turbo. |
| tstamp | string | The time the battle took place, formatted as `2023-11-23 02:56:34.433226`. |
| answer_a | string | The answer of model_a. |
| answer_b | string | The answer of model_b. |
| gpt_4_response | string | The response text from GPT-4, acting as judge, that evaluates the answers and scores the better LLM. |
| gpt_4_score | string | The scores of model_a and model_b as JSON text, e.g., `{'model_a': '0', 'model_b': '1'}`. |
| is_valid | boolean | Whether the row is valid. Set to false when GPT-4 declines to evaluate because of policy. |
| elo_rating | float | The Elo rating of the LLM. |

## Citation

TODO
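The `gpt_4_score` example in the column description uses single quotes, so a strict JSON parser would reject it as written. The following is a minimal sketch of one way to read such a value, assuming every row follows the format of that one documented example; `parse_score` and `winner_from_score` are hypothetical helper names, not part of the dataset or any library.

```python
import ast


def parse_score(raw: str) -> dict[str, float]:
    """Parse a gpt_4_score string such as "{'model_a': '0', 'model_b': '1'}"."""
    # The documented example is single-quoted, which json.loads does not accept,
    # so the string is read as a Python literal instead.
    parsed = ast.literal_eval(raw)
    # In the example the score values are strings; convert them to numbers.
    return {model: float(value) for model, value in parsed.items()}


def winner_from_score(score: dict[str, float]) -> str:
    """Hypothetical helper: derive a winner label from the two scores."""
    if score["model_a"] > score["model_b"]:
        return "model_a"
    if score["model_b"] > score["model_a"]:
        return "model_b"
    # Equal scores cannot distinguish `tie` from `tie(all bad)`.
    return "tie"


score = parse_score("{'model_a': '0', 'model_b': '1'}")
print(score, winner_from_score(score))
```

Because equal scores do not say which kind of tie occurred, the `winner` column remains the more reliable source for the battle outcome.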

Provider:

rwq-elo

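Summary of original information

RWQ battle records dataset

The dataset stores the records of 24 popular LLMs conducting pairwise Elo battles on RWQ questions, with GPT-4 used as the judge to determine the winner of each round of QA.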

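Column information

Column name	Data type	Description
question	string	The question posed to the LLM.
model	string	The ID or name of the LLM.
model_a	string	The ID or name of the LLM acting as model 1 in the pairwise battle.
model_b	string	The ID or name of the LLM acting as model 2 in the pairwise battle.
winner	string	The outcome of the pairwise battle: `model_a`, `model_b`, `tie`, or `tie(all bad)`.
judger	string	The GPT name and version, for example gpt-4-turbo.
tstamp	string	The time the battle took place, formatted as `2023-11-23 02:56:34.433226`.
answer_a	string	The answer of model_a.
answer_b	string	The answer of model_b.
gpt_4_response	string	The response text in which GPT-4, acting as judge, evaluates and scores the answers.
gpt_4_score	string	The scores of model_a and model_b as JSON text, for example `{'model_a': '0', 'model_b': '1'}`.
is_valid	boolean	Whether the row is valid. Set to false when GPT-4 declines to evaluate because of policy.
elo_rating	float	The Elo rating of the LLM.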
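The `model_a`, `model_b`, `winner`, and `is_valid` columns carry enough information to recompute ratings from the raw battles. The sketch below applies the standard sequential Elo update to a few made-up rows. The starting rating of 1000, the step size `K = 32`, the model names, and the choice to count both tie variants as half a win are all assumptions; the dataset card does not say how its `elo_rating` values were produced, so the results need not match that column.

```python
from collections import defaultdict

K = 32            # assumed update step; the dataset card does not state one
INITIAL = 1000.0  # assumed starting rating for every model

# Score credited to model_a for each documented `winner` value.
# Counting both tie variants as half a win is an assumption.
SCORE_A = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5, "tie(all bad)": 0.5}


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that the player rated r_a beats r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def compute_elo(battles):
    """Apply sequential Elo updates over an iterable of battle rows (dicts)."""
    ratings = defaultdict(lambda: INITIAL)
    for row in battles:
        # Skip rows flagged invalid, e.g. when the judge declined to evaluate.
        if not row.get("is_valid", True):
            continue
        a, b = row["model_a"], row["model_b"]
        s_a = SCORE_A[row["winner"]]
        e_a = expected_score(ratings[a], ratings[b])
        # Zero-sum update: what model_a gains, model_b loses.
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * (e_a - s_a)
    return dict(ratings)


# Made-up rows with hypothetical model names, for illustration only.
battles = [
    {"model_a": "model-x", "model_b": "model-y", "winner": "model_a", "is_valid": True},
    {"model_a": "model-x", "model_b": "model-y", "winner": "tie", "is_valid": True},
    {"model_a": "model-x", "model_b": "model-y", "winner": "model_b", "is_valid": False},
]
print(compute_elo(battles))
```

Sequential Elo depends on the order in which battles are processed, so sorting rows by `tstamp` before calling `compute_elo` is one reasonable choice when working with the real records.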
