
multilingual-llm-jokes-4o-claude-gemini

ModelScope community, updated 2025-12-05, indexed 2025-07-05
Download link:
https://modelscope.cn/datasets/Rapidata/multilingual-llm-jokes-4o-claude-gemini
Description:
<style>
figure {
  display: flex;
  flex-direction: column;
  align-items: center; /* center contents */
  margin: 0; /* remove any figure margins */
  padding: 0;
}
figure img {
  display: block; /* remove inline-image whitespace below */
  max-width: 100%;
  height: auto;
}
figure figcaption {
  margin-block-start: 0; /* no gap above caption */
  margin-block-end: 0; /* no gap below caption */
  padding-top: 0.1em; /* tiny, adjustable gap */
  font-style: italic;
  color: #555;
  text-align: center;
}
</style>

# Rapidata Generated Joke Preference Dataset

<a href="https://www.rapidata.ai">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66f5624c42b853e73e0738eb/jfxR79bOztqaC6_yNNnGU.jpeg" width="400" alt="Dataset visualization">
</a>

We collected 1'000'000+ human opinions on jokes generated by state-of-the-art LLMs to decide which model is the funniest. Labelers are shown a joke in their language and asked to answer 'Yes' or 'No' to the question 'Is this joke funny?'. It took us less than 5 days to collect all of the responses. The jokes are evenly distributed across 5 languages (English, Arabic, Japanese, Vietnamese, Portuguese) and across 4 model configurations.

## Joke Generation Process

We prompt each model in a given language to tell a joke on a given topic. Topics are simple objects such as bicycle, ghost, or elephant. We want to see which jokes a model can generate on each topic, so for each topic we try to generate 5 different jokes. In some cases, models struggle to generate different jokes. If a model fails 10 times to generate a different joke, we stop the generation process.

So, if all of the combinations resulted in 5 unique jokes, we would have \\(5 (\mathrm{languages}) \times 4 (\mathrm{models}) \times 100 (\mathrm{topics}) \times 5 (\mathrm{samples}) = 10\textrm{'}000\\) jokes. However, there are fewer, only \\(9\textrm{'}978\\) jokes in the dataset, because sometimes models failed to produce 5 different jokes on a given topic in English.
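The generation procedure and the joke count above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `generate_joke` is a hypothetical callable standing in for a model API call, duplicates are detected by a simple normalized string comparison, and the limit of 10 is interpreted as 10 consecutive failed attempts. All three are assumptions, since the description does not specify them.

```python
from typing import Callable, List

JOKES_PER_TOPIC = 5        # target number of distinct jokes per topic (from the text)
MAX_FAILED_ATTEMPTS = 10   # give up after this many failed attempts (from the text)

# Upper bound on dataset size: languages x models x topics x samples.
EXPECTED_TOTAL = 5 * 4 * 100 * 5
# Jokes missing from the dataset relative to that bound.
MISSING = EXPECTED_TOTAL - 9978


def collect_unique_jokes(
    generate_joke: Callable[[str, str], str],  # hypothetical: (language, topic) -> joke text
    language: str,
    topic: str,
    target: int = JOKES_PER_TOPIC,
    max_failures: int = MAX_FAILED_ATTEMPTS,
) -> List[str]:
    """Collect up to `target` distinct jokes for one (language, topic) pair."""
    jokes: List[str] = []
    seen = set()
    failures = 0
    while len(jokes) < target and failures < max_failures:
        joke = generate_joke(language, topic)
        # Assumption: two jokes count as "the same" if they match after
        # lowercasing and collapsing whitespace.
        key = " ".join(joke.lower().split())
        if key in seen:
            failures += 1
            continue
        seen.add(key)
        jokes.append(joke)
        failures = 0  # assumption: the failure counter resets after each new joke
    return jokes
```

With a generator that can only produce three distinct jokes, the function returns those three and stops after ten further duplicates, which is the kind of shortfall that would leave the dataset below the 10'000 bound.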
For other languages, all of the jokes were generated.

## Columns

- `model`: the model used for joke generation. Unless specified otherwise, we run each model with the default API parameters. The model configurations are:
  - `gpt-4o`
  - `claude-sonnet-4-20250514`
  - `gemini-2.5-flash-non-thinking` (no thinking tokens)
  - `gemini-2.5-flash-thinking` (default number of thinking tokens)
- `joke_english`: the `joke` translated into English using GPT-4o
- `yes_ratios`: fraction of "Yes" responses
- `yes_user_score_ratios`: share of the total user_score (an internal metric that indicates trust in a user) contributed by users who responded "Yes"

<!-- Or, more formally, `yes_user_score_ratios` = \\(\frac{\sum_{r \in R}{\mathbb{1}\{r_{\mathrm{ans}} = \text{"Yes"}\} \cdot r_{\mathrm{user\_score}}}}{\sum_{r \in R}{r_{\mathrm{user\_score}}}}\\) -->

## Analysis Notebook

We prepared a [notebook](https://colab.research.google.com/drive/1bkc_ITpOZhQDNp5bJbOWCHxrrjFe9UXk?usp=sharing) with a simple data analysis of our results. Here are the key graphs from it.

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/67054b6963cd3c1d5da67fc5/a7lZSJyaICsp5M7N7QAlR.png" alt="drawing" width="450"/>
<figcaption>Model scores. The non-thinking version of Gemini has the lowest score, while the other model configurations perform similarly</figcaption>
</center>

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/67054b6963cd3c1d5da67fc5/rVfReF7JEaoi7_Xl7R07T.png" alt="drawing" width="450"/>
<figcaption>Language scores. We observe a significant difference in average scores across languages. Quite unsurprisingly, the language with the highest-quality jokes is English</figcaption>
</center>

To get a better feel for the types of jokes models tend to produce in different languages, we made an interactive plot of joke embeddings. We project embeddings of `joke_english` onto a 2D space and draw them on a canvas.
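As a rough illustration of the projection step just described, the sketch below reduces a matrix of embeddings to 2D canvas coordinates. The description does not say which embedding model or projection method was used, so the choice of PCA (computed via SVD) and the random stand-in embeddings are assumptions made purely for illustration.

```python
import numpy as np


def project_to_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project row-vector embeddings onto their first two principal components."""
    # Center each dimension so the projection captures variance, not the mean offset.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Stand-in data: 200 random 64-dimensional vectors in place of real
# `joke_english` embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 64))
points = project_to_2d(fake_embeddings)  # one (x, y) pair per joke
```

Each row of `points` could then be drawn as a dot on the canvas, with the joke text and its score attached as hover metadata.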
<figure>
<img src="https://cdn-uploads.huggingface.co/production/uploads/67054b6963cd3c1d5da67fc5/cxmKoOZEyBMs0JuLc_6iG.png" alt="A Japanese joke">
<figcaption>Screenshot of the canvas, hovering over an example of a top 1% Japanese joke. The success of this joke illustrates the style of jokes preferred by Japanese labelers</figcaption>
</figure>

For all of the graphs, check out the [notebook](https://colab.research.google.com/drive/1bkc_ITpOZhQDNp5bJbOWCHxrrjFe9UXk?usp=sharing).

These are just the first steps in exploring the quality of generated humour and its reception by different cultures. We can't wait to see what you can do with this dataset!

## About Rapidata

Rapidata's technology makes collecting human feedback at scale faster and more accessible than ever before. Visit [rapidata.ai](https://www.rapidata.ai/) to learn more about how we're revolutionizing human feedback collection for AI development.
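For readers who want to work with the two score columns described under Columns, the sketch below shows how `yes_ratios` and `yes_user_score_ratios` could be computed from raw responses, following the formula given in the README's comment. The `(answer, user_score)` record layout and the example values are hypothetical; the dataset ships the already aggregated ratios.

```python
from typing import List, Tuple

# Hypothetical record layout: (answer, user_score) for a single response.
Response = Tuple[str, float]


def yes_ratio(responses: List[Response]) -> float:
    """Unweighted fraction of "Yes" responses (assumes a non-empty list)."""
    yes_count = sum(1 for answer, _ in responses if answer == "Yes")
    return yes_count / len(responses)


def yes_user_score_ratio(responses: List[Response]) -> float:
    """Fraction of total user_score contributed by "Yes" responses."""
    total_score = sum(score for _, score in responses)
    yes_score = sum(score for answer, score in responses if answer == "Yes")
    return yes_score / total_score


# Made-up example: two trusted "Yes" voters and two less trusted "No" voters.
example = [("Yes", 0.9), ("No", 0.3), ("Yes", 0.6), ("No", 0.2)]
plain = yes_ratio(example)                # half of the responses are "Yes"
weighted = yes_user_score_ratio(example)  # about three quarters of the trust weight
```

In this example the weighted ratio is higher than the plain one because the "Yes" voters carry more trust, which is the effect the weighted column is meant to capture.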

Provider:
maas
Created:
2025-07-04