
RobotsMaliAI/bayelemabaga

Hugging Face · Updated 2023-04-24 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/RobotsMaliAI/bayelemabaga
Resource description:
---
task_categories:
- translation
- text-generation
language:
- bm
- fr
size_categories:
- 10K<n<100K
---

# BAYƐLƐMABAGA: Parallel French - Bambara Dataset for Machine Learning

## Overview

The Bayelemabaga dataset is a collection of 46976 aligned, machine-translation-ready Bambara-French lines, originating from [Corpus Bambara de Reference](http://cormande.huma-num.fr/corbama/run.cgi/first_form). The dataset is constituted of text extracted from **264** text files, varying across periodicals, books, short stories, blog posts, and parts of the Bible and the Quran.

## Snapshot: 46976

| | |
|:---|---:|
| **Lines** | **46976** |
| French Tokens (spacy) | 691312 |
| Bambara Tokens (daba) | 660732 |
| French Types | 32018 |
| Bambara Types | 29382 |
| Avg. Fr line length | 77.6 |
| Avg. Bam line length | 61.69 |
| Number of text sources | 264 |

## Data Splits

| | | |
|:-----:|:---:|------:|
| Train | 80% | 37580 |
| Valid | 10% | 4698 |
| Test | 10% | 4698 |

## Remarks

* We are working on resolving some last-minute misalignment issues.

### Maintenance

* This dataset is intended to be actively maintained.

### Benchmarks:

- `Coming soon`

### Sources

- [`sources`](./bayelemabaga/sources.txt)

### To note:

- ʃ => (sh/shy) sound: this symbol is left in the dataset, although it is part of neither Bambara nor French orthography.

## License

- `CC-BY-SA-4.0`

## Version

- `1.0.1`

## Citation

```
@misc{bayelemabagamldataset2022,
  title={Machine Learning Dataset Development for Manding Languages},
  author={ Valentin Vydrin and Jean-Jacques Meric and Kirill Maslinsky and Andrij Rovenchak and Allahsera Auguste Tapo and Sebastien Diarra and Christopher Homan and Marco Zampieri and Michael Leventhal },
  howpublished = {\url{https://github.com/robotsmali-ai/datasets}},
  year={2022}
}
```

## Contacts

- `sdiarra <at> robotsmali <dot> org`
- `aat3261 <at> rit <dot> edu`
Provided by:
RobotsMaliAI
Summary of original information

BAYƐLƐMABAGA: Parallel French - Bambara Dataset for Machine Learning

Overview

  • Lines: 46976
  • Sources: 264 text files, including periodicals, books, short stories, blog posts, and parts of the Bible and the Quran.
  • Origin: Extracted from Corpus Bambara de Reference.

Snapshot

| Metric | Value |
|:---|---:|
| Lines | 46976 |
| French Tokens (spacy) | 691312 |
| Bambara Tokens (daba) | 660732 |
| French Types | 32018 |
| Bambara Types | 29382 |
| Avg. Fr line length | 77.6 |
| Avg. Bam line length | 61.69 |
| Number of text sources | 264 |
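
A few derived figures follow directly from the snapshot counts. The sketch below is illustrative only: the helper name `derived_stats` and the choice of metrics (tokens per line, type-token ratio) are assumptions made for this example and do not come from the dataset card.

```python
# Counts copied from the Snapshot figures above.
SNAPSHOT = {
    "lines": 46976,
    "fr_tokens": 691312,
    "bm_tokens": 660732,
    "fr_types": 32018,
    "bm_types": 29382,
}


def derived_stats(snapshot):
    """Return tokens per line and type-token ratio for each language.

    Hypothetical helper: the metric choice is an assumption, not part of the card.
    """
    stats = {}
    for lang in ("fr", "bm"):
        tokens = snapshot[f"{lang}_tokens"]
        types = snapshot[f"{lang}_types"]
        stats[lang] = {
            "tokens_per_line": tokens / snapshot["lines"],
            "type_token_ratio": types / tokens,
        }
    return stats


if __name__ == "__main__":
    for lang, values in derived_stats(SNAPSHOT).items():
        print(lang, {name: round(value, 4) for name, value in values.items()})
```

The card does not state the unit of the two average line lengths, so they are left out of this calculation.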

Data Splits

| Split | Share | Lines |
|:-----:|:---:|------:|
| Train | 80% | 37580 |
| Valid | 10% | 4698 |
| Test | 10% | 4698 |
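
The split counts can be checked against the total of 46976 lines. The card does not say how the split was produced, so the rounding rule below (floor the 80% training share, then halve the remainder) is an assumption that happens to reproduce the listed counts; `split_sizes` is a hypothetical helper name.

```python
def split_sizes(total):
    """Compute 80/10/10 split sizes using integer arithmetic.

    Assumed rule: floor the training share, then divide the remainder
    between validation and test.
    """
    train = total * 8 // 10      # floor of 80%
    remainder = total - train
    valid = remainder // 2
    test = remainder - valid     # test absorbs any odd leftover line
    return {"train": train, "valid": valid, "test": test}


sizes = split_sizes(46976)
print(sizes)
# The three splits account for every line in the dataset.
assert sum(sizes.values()) == 46976
```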

License

  • CC-BY-SA-4.0

Version

  • 1.0.1