
ilsp/flores200_en-el

Source: Hugging Face. Updated 2024-01-23. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ilsp/flores200_en-el
Resource description:
---
language:
- en
- el
license: cc-by-sa-4.0
size_categories:
- 1K<n<10K
task_categories:
- translation
dataset_info:
  features:
  - name: en
    dtype: string
  - name: el
    dtype: string
  splits:
  - name: validation
    num_bytes: 406555
    num_examples: 997
  - name: test
    num_bytes: 427413
    num_examples: 1012
  download_size: 481524
  dataset_size: 833968
configs:
- config_name: default
  data_files:
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# FLORES-200 EN-EL with prompts for translation by LLMs

Based on the [FLORES-200](https://huggingface.co/datasets/Muennighoff/flores200) dataset.

Publication:

@article{nllb2022,
  author = {NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang},
  title = {No Language Left Behind: Scaling Human-Centered Machine Translation},
  year = {2022}
}

Number of examples: 1012

## FLORES-200 for EN to EL with 0-shot prompts

Contains 2 prompt variants:

- EN:\n\[English Sentence\]\nEL:
- English:\n\[English Sentence\]\nΕλληνικά:

## FLORES-200 for EL to EN with 0-shot prompts

Contains 2 prompt variants:

- EL:\n\[Greek Sentence\]\nEL:
- Ελληνικά:\n\[Greek Sentence\]\nEnglish:

## How to load datasets

```python
from datasets import load_dataset

input_file = 'flores200.en2el.test.0-shot.json'
# Records are read from the list stored under the top-level "examples" key.
dataset = load_dataset(
    'json',
    data_files=input_file,
    field='examples',
    split='train'
)
```

## How to generate translation results with different configurations

```python
from multiprocessing import cpu_count

def generate_translations(datapoint, config, config_name):
    for idx, variant in enumerate(datapoint["prompts_results"]):
        # REPLACE generate WITH ACTUAL FUNCTION WHICH TAKES GENERATION CONFIG
        result = generate(variant["prompt"], config=config)
        # Store the output next to the prompt, keyed by the configuration name.
        datapoint["prompts_results"][idx].update({config_name: result})
    return datapoint

# config and config_name must be defined by the caller before this point.
dataset = dataset.map(
    function=generate_translations,
    fn_kwargs={"config": config, "config_name": config_name},
    keep_in_memory=False,
    num_proc=min(len(dataset), cpu_count()),
)
```

## How to push updated datasets to hub

```python
from huggingface_hub import HfApi

input_file = "flores200.en2el.test.0-shot.json"
model_name = "meltemi-v0.2"
output_file = input_file.replace(".json", ".{}.json".format(model_name))

dataset.to_json(output_file, force_ascii=False, indent=4, orient="index")

api = HfApi()
api.upload_file(
    path_or_fileobj=output_file,
    path_in_repo="results/{}/{}".format(model_name, output_file),
    repo_id="ilsp/flores200-en-el-prompt",
    repo_type="dataset",
)
```
Provider:
ilsp
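Summary of the original information

Dataset overview

Basic information

  • Languages: English (en) and Greek (el)
  • License: cc-by-sa-4.0
  • Size category: 1K<n<10K
  • Task category: translation

Dataset structure

  • Features:
    • en: string
    • el: string
  • Splits:
    • validation: 406555 bytes, 997 examples
    • test: 427413 bytes, 1012 examples
  • Download size: 481524 bytes
  • Dataset size: 833968 bytes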
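The size figures above are internally consistent, and a short check makes that explicit. This is only an illustrative sketch: the numbers are copied from the dataset card, while the variable names are made up for the example.

```python
# Split metadata copied from the dataset card.
splits = {
    "validation": {"num_bytes": 406555, "num_examples": 997},
    "test": {"num_bytes": 427413, "num_examples": 1012},
}

total_bytes = sum(s["num_bytes"] for s in splits.values())
total_examples = sum(s["num_examples"] for s in splits.values())

print(total_bytes)     # 833968, matching the reported dataset size
print(total_examples)  # 2009, inside the 1K<n<10K size category
```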

Configuration

  • Default configuration:
    • validation: data path data/validation-*
    • test: data path data/test-*

Dataset contents

  • Number of examples: 1012
  • Prompt variants:
    • English to Greek:
      • EN: [English Sentence] EL:
      • English: [English Sentence] Ελληνικά:
    • Greek to English:
      • EL: [Greek Sentence] EL:
      • Ελληνικά: [Greek Sentence] English:
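The prompt variants listed above can be sketched as plain string templates. The helper `build_prompts` and the `PROMPT_TEMPLATES` table are hypothetical names, not part of the dataset or any library. The newline separators follow the `\n` markers in the dataset card, and the first Greek-to-English variant is copied exactly as listed, including its trailing `EL:` label.

```python
# Zero-shot templates as listed on the dataset card; "{sentence}" marks the slot
# the card writes as [English Sentence] / [Greek Sentence].
PROMPT_TEMPLATES = {
    "en2el": ["EN:\n{sentence}\nEL:", "English:\n{sentence}\nΕλληνικά:"],
    # The first EL-to-EN variant ends in "EL:" on the card; it is copied unchanged.
    "el2en": ["EL:\n{sentence}\nEL:", "Ελληνικά:\n{sentence}\nEnglish:"],
}


def build_prompts(sentence: str, direction: str) -> list[str]:
    """Return every prompt variant for one source sentence (hypothetical helper)."""
    return [t.format(sentence=sentence) for t in PROMPT_TEMPLATES[direction]]


for prompt in build_prompts("The weather is nice today.", "en2el"):
    print(repr(prompt))
```

Keeping the templates in one table makes it straightforward to add or correct a variant without touching the code that fills them in.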

Loading the dataset

```python
from datasets import load_dataset

input_file = 'flores200.en2el.test.0-shot.json'
dataset = load_dataset(
    'json',
    data_files=input_file,
    field='examples',
    split='train'
)
```
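The `field='examples'` argument means the records are read from a list stored under a top-level `examples` key. The sketch below writes and reads back a tiny file with that layout using only the standard library. The record contents are an assumption inferred from the generation code on this page (`prompts_results` entries carrying a `prompt`), and the file name `toy.0-shot.json` is made up; the real files may contain additional keys.

```python
import json
import os
import tempfile

# Toy payload mirroring the assumed layout: a top-level "examples" list whose
# records each hold a list of prompt variants.
payload = {
    "examples": [
        {"prompts_results": [{"prompt": "EN:\nHello.\nEL:"}]},
        {"prompts_results": [{"prompt": "EN:\nGood morning.\nEL:"}]},
    ]
}

path = os.path.join(tempfile.mkdtemp(), "toy.0-shot.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(payload, f, ensure_ascii=False, indent=4)

# Reading the "examples" key by hand corresponds to what field='examples' selects.
with open(path, encoding="utf-8") as f:
    records = json.load(f)["examples"]

print(len(records))  # 2
```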

Generating translation results

```python
from multiprocessing import cpu_count

def generate_translations(datapoint, config, config_name):
    for idx, variant in enumerate(datapoint["prompts_results"]):
        # Replace generate with the actual function that takes a generation config.
        result = generate(variant["prompt"], config=config)
        datapoint["prompts_results"][idx].update({config_name: result})
    return datapoint

dataset = dataset.map(
    function=generate_translations,
    fn_kwargs={"config": config, "config_name": config_name},
    keep_in_memory=False,
    num_proc=min(len(dataset), cpu_count()),
)
```
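To see what the mapping function does to a single record, it can be run on a plain dictionary with a stand-in for `generate`. Everything below other than `generate_translations` itself is an assumption for illustration: the stub `generate`, the `config` contents, and the configuration name `"greedy"` are hypothetical, and no model is called.

```python
def generate(prompt, config):
    """Stand-in for a real model call (hypothetical); it does not translate."""
    return "stub output ({} tokens max)".format(config["max_new_tokens"])


def generate_translations(datapoint, config, config_name):
    for idx, variant in enumerate(datapoint["prompts_results"]):
        result = generate(variant["prompt"], config=config)
        # Each prompt variant gains a new key named after the configuration.
        datapoint["prompts_results"][idx].update({config_name: result})
    return datapoint


datapoint = {
    "prompts_results": [
        {"prompt": "EN:\nHello.\nEL:"},
        {"prompt": "English:\nHello.\nΕλληνικά:"},
    ]
}
config = {"max_new_tokens": 64}  # hypothetical generation settings

updated = generate_translations(datapoint, config, "greedy")
print(updated["prompts_results"][0])
```

Because results are stored under `config_name`, running the function again with a different configuration adds a second key beside the first instead of overwriting it.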

Pushing the updated dataset

```python
from huggingface_hub import HfApi

input_file = "flores200.en2el.test.0-shot.json"
model_name = "meltemi-v0.2"
output_file = input_file.replace(".json", ".{}.json".format(model_name))

dataset.to_json(output_file, force_ascii=False, indent=4, orient="index")

api = HfApi()

api.upload_file(
    path_or_fileobj=output_file,
    path_in_repo="results/{}/{}".format(model_name, output_file),
    repo_id="ilsp/flores200-en-el-prompt",
    repo_type="dataset",
)
```
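The file naming in the upload step can be checked without touching the Hub. This sketch reuses the two string expressions from the snippet above and only prints the resulting names; nothing is written or uploaded.

```python
input_file = "flores200.en2el.test.0-shot.json"
model_name = "meltemi-v0.2"

# Insert the model name before the ".json" suffix.
output_file = input_file.replace(".json", ".{}.json".format(model_name))
# Results are grouped under a per-model folder in the target repository.
path_in_repo = "results/{}/{}".format(model_name, output_file)

print(output_file)   # flores200.en2el.test.0-shot.meltemi-v0.2.json
print(path_in_repo)  # results/meltemi-v0.2/flores200.en2el.test.0-shot.meltemi-v0.2.json
```

Note that `str.replace` substitutes every occurrence of `.json`, so this naming scheme relies on the suffix appearing only once in the input file name.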
