RobotsMaliAI/bayelemabaga

Name: RobotsMaliAI/bayelemabaga
Creator: RobotsMaliAI
Published: 2023-04-24 16:56:24
License: CC-BY-SA-4.0 (as stated in the dataset card; the listing itself gives no description)

Source: Hugging Face. Updated 2023-04-24. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/RobotsMaliAI/bayelemabaga

Description:

---
task_categories:
- translation
- text-generation
language:
- bm
- fr
size_categories:
- 10K<n<100K
---

# BAYƐLƐMABAGA: Parallel French - Bambara Dataset for Machine Learning

## Overview

The Bayelemabaga dataset is a collection of 46976 aligned, machine-translation-ready Bambara-French lines, originating from the [Corpus Bambara de Reference](http://cormande.huma-num.fr/corbama/run.cgi/first_form). The dataset consists of text extracted from **264** text files, ranging from periodicals, books, short stories, and blog posts to parts of the Bible and the Quran.

## Snapshot: 46976

| | |
|:---|---:|
| **Lines** | **46976** |
| French Tokens (spacy) | 691312 |
| Bambara Tokens (daba) | 660732 |
| French Types | 32018 |
| Bambara Types | 29382 |
| Avg. Fr line length | 77.6 |
| Avg. Bam line length | 61.69 |
| Number of text sources | 264 |

## Data Splits

| | | |
|:-----:|:---:|------:|
| Train | 80% | 37580 |
| Valid | 10% | 4698 |
| Test | 10% | 4698 |

## Remarks

* We are working on resolving some last-minute misalignment issues.

### Maintenance

* This dataset is supposed to be actively maintained.

### Benchmarks:

- `Coming soon`

### Sources

- [`sources`](./bayelemabaga/sources.txt)

### To note:

- ʃ => (sh/shy) sound: the symbol is left in the dataset, although it is part of neither Bambara nor French orthography.

## License

- `CC-BY-SA-4.0`

## Version

- `1.0.1`

## Citation

```
@misc{bayelemabagamldataset2022,
  title={Machine Learning Dataset Development for Manding Languages},
  author={Valentin Vydrin and Jean-Jacques Meric and Kirill Maslinsky and Andrij Rovenchak and Allahsera Auguste Tapo and Sebastien Diarra and Christopher Homan and Marco Zampieri and Michael Leventhal},
  howpublished={\url{https://github.com/robotsmali-ai/datasets}},
  year={2022}
}
```

## Contacts

- `sdiarra <at> robotsmali <dot> org`
- `aat3261 <at> rit <dot> edu`

Provided by:

RobotsMaliAI

Summary of original information

BAYƐLƐMABAGA: Parallel French - Bambara Dataset for Machine Learning

Overview

Lines: 46976
Sources: 264 text files, including periodicals, books, short stories, blog posts, part of the Bible and the Quran.
Origin: Extracted from Corpus Bambara de Reference.

Snapshot


Lines	46976
French Tokens (spacy)	691312
Bambara Tokens (daba)	660732
French Types	32018
Bambara Types	29382
Avg. Fr line length	77.6
Avg. Bam line length	61.69
Number of text sources	264
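
The snapshot figures can be combined into a couple of derived statistics that the card does not report directly. The following is a minimal sketch: the counts are copied from the table above, while the helper `derived_stats` and the choice of metrics (tokens per line, type/token ratio) are illustrative assumptions rather than anything defined by the dataset authors.

```python
# Figures copied from the Snapshot table; nothing is read from the dataset itself.
LINES = 46976
STATS = {
    "French": {"tokens": 691312, "types": 32018},   # tokenized with spacy, per the card
    "Bambara": {"tokens": 660732, "types": 29382},  # tokenized with daba, per the card
}


def derived_stats(tokens: int, types: int, lines: int) -> dict:
    """Hypothetical helper: tokens per aligned line and type/token ratio."""
    return {
        "tokens_per_line": tokens / lines,
        "type_token_ratio": types / tokens,
    }


summary = {
    lang: derived_stats(s["tokens"], s["types"], LINES)
    for lang, s in STATS.items()
}

for lang, s in summary.items():
    print(
        f"{lang}: {s['tokens_per_line']:.2f} tokens/line, "
        f"type/token ratio {s['type_token_ratio']:.4f}"
    )
```

Because the two sides were tokenized with different tools, these ratios are only roughly comparable across languages.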

Data Splits


Train	80%	37580
Valid	10%	4698
Test	10%	4698
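
As a quick consistency check on the split table, the sketch below confirms that the three split sizes add up to the stated line count and computes each split's actual share. The function name `split_shares` is a hypothetical helper introduced here for illustration; the numbers are taken from the tables above.

```python
# Sizes copied from the Data Splits table; total copied from the Snapshot table.
TOTAL_LINES = 46976
SPLITS = {"train": 37580, "valid": 4698, "test": 4698}


def split_shares(splits: dict, total: int) -> dict:
    """Hypothetical helper: verify the splits cover the corpus, then return each share."""
    covered = sum(splits.values())
    if covered != total:
        raise ValueError(f"splits cover {covered} lines, expected {total}")
    return {name: size / total for name, size in splits.items()}


shares = split_shares(SPLITS, TOTAL_LINES)

for name, share in shares.items():
    print(f"{name}: {SPLITS[name]} lines ({share:.2%})")
```

The shares land very close to the nominal 80/10/10, with the small deviation coming from rounding the split sizes to whole lines.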

License

CC-BY-SA-4.0

Version

1.0.1
