
MaLA-LM/PolyWrite

Source: Hugging Face. Updated 2024-09-27. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/MaLA-LM/PolyWrite
Description:
---
license: odc-by
task_categories:
- text-generation
pretty_name: PolyWrite
---

PolyWrite is a novel multilingual dataset developed for evaluating open-ended generation across 240 languages. We use ChatGPT to create diverse prompts in English, and then use Google Translate to translate these prompts into various languages, enabling models to generate creative content in multilingual settings. The benchmark includes 31 writing tasks, such as storytelling and email writing, across 155 unique prompts. To ensure translation quality, we back-translate the multilingual prompts into English and calculate BLEU scores between the original and back-translated versions, filtering out any translations with BLEU scores below 20. The final dataset contains a total of 35,751 prompts.

# Metadata

- **category**: The type of task or content.
- **name**: The unique identifier or title of the specific prompt or task within the dataset.
- **prompt_en**: The English version of the prompt that initiates the writing task.
- **lang_script**: The language and script of the translated prompt, so that both are identified unambiguously for multilingual tasks.
- **prompt_translated**: The prompt translated into the target language.
- **prompt_backtranslated**: The back-translated version of the prompt, obtained by translating the target-language prompt back into English.
- **bleu**: A numeric field holding the BLEU score of the back-translated text compared to the original English prompt.
- **chrf++**: A second evaluation metric, chrF++, used to evaluate the quality of the back-translated text compared to the original English prompt.
- **uuid**: A universally unique identifier (UUID) assigned to each prompt or task, so that every entry can be distinctly referenced within the dataset.

## Citation

This dataset was first used in the following paper.
```
@article{ji2024emma500enhancingmassivelymultilingual,
  title={{EMMA}-500: Enhancing Massively Multilingual Adaptation of Large Language Models},
  author={Shaoxiong Ji and Zihao Li and Indraneil Paul and Jaakko Paavola and Peiqin Lin and Pinzhen Chen and Dayyán O'Brien and Hengyu Luo and Hinrich Schütze and Jörg Tiedemann and Barry Haddow},
  year={2024},
  journal={arXiv preprint 2409.17892},
  url={https://arxiv.org/abs/2409.17892},
}
```
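The quality filter described in the dataset card (dropping translations whose back-translation BLEU falls below 20) can be illustrated with a minimal sketch. The field names `name`, `lang_script`, and `bleu` come from the metadata list. The sample rows, the `lang_script` values, and the helper names `filter_by_bleu` and `count_by_lang_script` are hypothetical and exist only for illustration. This is not the authors' pipeline.

```python
from collections import Counter

BLEU_THRESHOLD = 20.0  # cutoff stated in the dataset card

# Hypothetical rows that mimic the documented fields; all values are made up.
records = [
    {"name": "story_1", "lang_script": "fra_Latn", "bleu": 54.2},
    {"name": "story_1", "lang_script": "xyz_Latn", "bleu": 12.7},
    {"name": "email_1", "lang_script": "fra_Latn", "bleu": 20.0},
]


def filter_by_bleu(rows, threshold=BLEU_THRESHOLD):
    """Keep rows whose back-translation BLEU is at or above the threshold."""
    return [row for row in rows if row["bleu"] >= threshold]


def count_by_lang_script(rows):
    """Count the surviving prompts per language/script code."""
    return Counter(row["lang_script"] for row in rows)


kept = filter_by_bleu(records)
counts = count_by_lang_script(kept)
print(len(kept), dict(counts))
```

A row with a BLEU score of exactly 20 is kept here, since the card only says that scores below 20 are filtered out. The same two helpers could be applied to the real rows after loading the dataset with a tool of your choice.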
Provider:
MaLA-LM