
visheratin/laion-coco-nllb

Source: Hugging Face. Updated: 2024-04-11. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/visheratin/laion-coco-nllb
Resource description:
---
language:
- ace
- acm
- acq
- aeb
- af
- ajp
- ak
- als
- am
- apc
- ar
- ars
- ary
- arz
- as
- ast
- awa
- ayr
- azb
- azj
- ba
- bm
- ban
- be
- bem
- bn
- bho
- bjn
- bo
- bs
- bug
- bg
- ca
- ceb
- cs
- cjk
- ckb
- crh
- cy
- da
- de
- dik
- dyu
- dz
- el
- en
- eo
- et
- eu
- ee
- fo
- fj
- fi
- fon
- fr
- fur
- fuv
- gaz
- gd
- ga
- gl
- gn
- gu
- ht
- ha
- he
- hi
- hne
- hr
- hu
- hy
- ig
- ilo
- id
- is
- it
- jv
- ja
- kab
- kac
- kam
- kn
- ks
- ka
- kk
- kbp
- kea
- khk
- km
- ki
- rw
- ky
- kmb
- kmr
- knc
- kg
- ko
- lo
- lij
- li
- ln
- lt
- lmo
- ltg
- lb
- lua
- lg
- luo
- lus
- lvs
- mag
- mai
- ml
- mar
- min
- mk
- mt
- mni
- mos
- mi
- my
- nl
- nn
- nb
- npi
- nso
- nus
- ny
- oc
- ory
- pag
- pa
- pap
- pbt
- pes
- plt
- pl
- pt
- prs
- quy
- ro
- rn
- ru
- sg
- sa
- sat
- scn
- shn
- si
- sk
- sl
- sm
- sn
- sd
- so
- st
- es
- sc
- sr
- ss
- su
- sv
- swh
- szl
- ta
- taq
- tt
- te
- tg
- tl
- th
- ti
- tpi
- tn
- ts
- tk
- tum
- tr
- tw
- tzm
- ug
- uk
- umb
- ur
- uzn
- vec
- vi
- war
- wo
- xh
- ydd
- yo
- yue
- zh
- zsm
- zu
license: cc-by-nc-4.0
size_categories:
- 100K<n<1M
task_categories:
- image-to-text
- translation
pretty_name: LAION-COCO translated to 200 languages
dataset_info:
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: eng_caption
    dtype: string
  - name: captions
    sequence:
      sequence: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 271360114
    num_examples: 14906
  - name: train
    num_bytes: 15986931307
    num_examples: 878978
  download_size: 10358151216
  dataset_size: 16258291421
language_details: ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab, aka_Latn, amh_Ethi, apc_Arab,
  arb_Arab, ars_Arab, ary_Arab, arz_Arab, asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn,
  bak_Cyrl, bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn, bod_Tibt,
  bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn, cjk_Latn, ckb_Arab, crh_Latn, cym_Latn,
  dan_Latn, deu_Latn, dik_Latn, dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn,
  ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn, fur_Latn, fuv_Latn, gla_Latn,
  gle_Latn, glg_Latn, grn_Latn, guj_Gujr, hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn,
  hun_Latn, hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn, jpn_Jpan, kab_Latn,
  kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva, kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbp_Latn,
  kea_Latn, khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang, kmr_Latn, lao_Laoo,
  lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn,
  luo_Latn, lus_Latn, mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn, mlt_Latn,
  mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr, nld_Latn, nno_Latn, nob_Latn, npi_Deva,
  nso_Latn, nus_Latn, nya_Latn, oci_Latn, gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn,
  por_Latn, prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn, san_Deva, sat_Beng,
  scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn,
  spa_Latn, als_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn, tam_Taml,
  tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi, taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn,
  tso_Latn, tuk_Latn, tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab,
  uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr, yor_Latn, yue_Hant, zho_Hans,
  zho_Hant, zul_Latn
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
  - split: train
    path: data/train-*
---

# LAION COCO translated into 200 languages

This dataset contains the samples of the [LAION-COCO](https://huggingface.co/datasets/laion/laion-coco) dataset translated to 200 languages using the largest [NLLB-200 model](https://huggingface.co/facebook/nllb-200-3.3B) (3.3B parameters).

## Fields description

1. `id` - unique ID of the image.
2. `url` - original URL of the image from the LAION-COCO dataset.
3. `eng_caption` - original English caption from the LAION-COCO dataset.
4. `captions` - a list of captions translated to the languages from the Flores 200 dataset. Every item in the list is a list where the first element is a BCP-47 language code, and the second one is a caption in this language. The list of all language codes for the Flores 200 dataset can be found [here](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200).
5. `score` - aesthetic score generated using [LAION aesthetic predictor](https://github.com/christophschuhmann/improved-aesthetic-predictor/). The images in the dataset have the score of 4.5+.

## Images

The dataset was filtered to contain only working image URLs. However, the availability may change in the future. Because of that, all images from this dataset are available at [https://nllb-data.com/](https://nllb-data.com/). To get the image, use the following format:

```
https://nllb-data.com/{id}.jpg
```

## Paper

The dataset was used to train the models in the paper: "[NLLB-CLIP - train performant multilingual image retrieval model on a budget](https://arxiv.org/abs/2309.01859)".
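
The image URL format described in the Images section can be sketched as a small helper. This is a minimal illustration, not code from the dataset authors: the function name `image_url` and the sample ID `"12345"` are hypothetical, and percent-encoding the ID is a defensive assumption rather than a documented requirement.

```python
from urllib.parse import quote

# Mirror host named in the dataset card.
IMAGE_HOST = "https://nllb-data.com"


def image_url(sample_id: str) -> str:
    """Build the mirror URL for a record, following the {id}.jpg pattern."""
    # quote() is a precaution in case an ID contains URL-unsafe characters
    # (an assumption; the card does not say what characters IDs may contain).
    return f"{IMAGE_HOST}/{quote(str(sample_id), safe='')}.jpg"


# "12345" is a made-up ID used only for illustration.
print(image_url("12345"))
```

In practice the `id` value would come from a dataset record, for instance one obtained with the Hugging Face `datasets` library via `load_dataset("visheratin/laion-coco-nllb", split="test", streaming=True)`.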
Provider:
visheratin
Summary of original information

Dataset overview

Dataset name

LAION-COCO translated to 200 languages

License

cc-by-nc-4.0

Language support

The dataset covers many languages, including but not limited to: ace, acm, acq, aeb, af, ajp, ak, als, am, apc, ar, ars, ary, arz, as, ast, awa, ayr, azb, azj, ba, bm, ban, be, bem, bn, bho, bjn, bo, bs, bug, bg, ca, ceb, cs, cjk, ckb, crh, cy, da, de, dik, dyu, dz, el, en, eo, et, eu, ee, fo, fj, fi, fon, fr, fur, fuv, gaz, gd, ga, gl, gn, gu, ht, ha, he, hi, hne, hr, hu, hy, ig, ilo, id, is, it, jv, ja, kab, kac, kam, kn, ks, ka, kk, kbp, kea, khk, km, ki, rw, ky, kmb, kmr, knc, kg, ko, lo, lij, li, ln, lt, lmo, ltg, lb, lua, lg, luo, lus, lvs, mag, mai, ml, mar, min, mk, mt, mni, mos, mi, my, nl, nn, nb, npi, nso, nus, ny, oc, ory, pag, pa, pap, pbt, pes, plt, pl, pt, prs, quy, ro, rn, ru, sg, sa, sat, scn, shn, si, sk, sl, sm, sn, sd, so, st, es, sc, sr, ss, su, sv, swh, szl, ta, taq, tt, te, tg, tl, th, ti, tpi, tn, ts, tk, tum, tr, tw, tzm, ug, uk, umb, ur, uzn, vec, vi, war, wo, xh, ydd, yo, yue, zh, zsm, zu

Dataset size

  • Download size: 10358151216 bytes
  • Dataset size: 16258291421 bytes

Task categories

  • image-to-text
  • translation

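Dataset features

  • id: string; unique ID of the image.
  • url: string; original URL of the image from the LAION-COCO dataset.
  • eng_caption: string; original English caption from the LAION-COCO dataset.
  • captions: sequence; a list of captions translated into the languages of the Flores 200 dataset. Each item is a list whose first element is a BCP-47 language code and whose second element is the caption in that language.
  • score: float; aesthetic score generated with the LAION aesthetic predictor. All images in the dataset score 4.5 or higher.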
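
Because `captions` is stored as a list of two-element lists rather than a mapping, looking up one language means scanning the pairs. The sketch below shows one way to do that. The record is invented for illustration, the helper names are hypothetical, and the `deu_Latn` / `fra_Latn` code format is assumed from the `language_details` field of the dataset card.

```python
from typing import Dict, List


def captions_to_dict(captions: List[List[str]]) -> Dict[str, str]:
    """Turn [[code, caption], ...] into {code: caption}, skipping malformed items."""
    lookup: Dict[str, str] = {}
    for item in captions:
        if len(item) == 2:
            code, text = item
            lookup[code] = text
    return lookup


def caption_for(record: dict, lang_code: str) -> str:
    """Return the caption in lang_code, falling back to the English caption."""
    return captions_to_dict(record["captions"]).get(lang_code, record["eng_caption"])


# Hypothetical record shaped like the features listed above (not real data).
sample = {
    "id": "0001",
    "url": "https://example.com/cat.jpg",
    "eng_caption": "A cat sitting on a sofa.",
    "captions": [
        ["deu_Latn", "Eine Katze sitzt auf einem Sofa."],
        ["fra_Latn", "Un chat assis sur un canapé."],
    ],
    "score": 5.1,
}

print(caption_for(sample, "deu_Latn"))
print(caption_for(sample, "zul_Latn"))  # code absent in this sample, so English is used
```

Falling back to `eng_caption` is a design choice made for this sketch; raising a `KeyError` may be preferable when a missing translation should be treated as an error.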

Dataset splits

  • test: 14906 examples, 271360114 bytes in total.
  • train: 878978 examples, 15986931307 bytes in total.
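
The split and size figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed on this page; the function and key names are illustrative choices. The two splits together hold 893884 examples, and their byte counts sum exactly to the stated dataset size of 16258291421 bytes.

```python
# Figures copied from the split and size listings on this page.
SPLITS = {
    "test": {"num_examples": 14906, "num_bytes": 271360114},
    "train": {"num_examples": 878978, "num_bytes": 15986931307},
}
DOWNLOAD_SIZE_BYTES = 10358151216
DATASET_SIZE_BYTES = 16258291421


def summarize(splits: dict, download_size: int, dataset_size: int) -> dict:
    """Derive totals and simple ratios from the published metadata."""
    total_examples = sum(s["num_examples"] for s in splits.values())
    total_bytes = sum(s["num_bytes"] for s in splits.values())
    return {
        "total_examples": total_examples,
        "total_bytes": total_bytes,
        "test_fraction": splits["test"]["num_examples"] / total_examples,
        "avg_bytes_per_example": total_bytes / total_examples,
        "download_gib": download_size / 2**30,
        "dataset_gib": dataset_size / 2**30,
        "download_to_dataset_ratio": download_size / dataset_size,
    }


summary = summarize(SPLITS, DOWNLOAD_SIZE_BYTES, DATASET_SIZE_BYTES)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The test split is a small held-out portion, a bit under 2% of all examples, and the download is noticeably smaller than the in-memory dataset size.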