Propicto/propicto-commonvoice

Name: Propicto/propicto-commonvoice
Creator: Propicto
Published: 2025-04-02 11:46:06
License: not stated in the listing (the dataset card below specifies cc0-1.0)

Hugging Face, updated 2025-04-02, indexed 2025-11-01

Download link:

https://hf-mirror.com/datasets/Propicto/propicto-commonvoice


Resource description:

---
license: cc0-1.0
task_categories:
- translation
language:
- fr
tags:
- pictograms
- AAC
pretty_name: Propicto-commonvoice
---

# Propicto-commonvoice

## 📝 Dataset Description

- **Public:** True
- **Tasks:** MT

Propicto-commonvoice is a French dataset of aligned speech IDs, transcriptions, and pictogram sequences, where each pictogram is given as the identifier of an ARASAAC pictogram. It was created from the CommonVoice-15.0 French corpus.

Propicto-commonvoice contains three CSV files, `train`, `valid`, and `test`, with the following statistics:

| **Split** | **Number of utterances** |
|:-----------:|:-----------------------:|
| train | 527,544 |
| valid | 16,130 |
| test | 16,132 |

## ⚒️ Dataset Structure

Each file contains the following information:

```csv
clips : the unique identifier of the utterance, which corresponds to a unique audio clip file (in mp3) from the commonvoice dataset
text : the transcription of the audio clip
pictos : the sequence of ARASAAC pictogram IDs
tokens : the sequence of tokens, each being the keyword associated with the corresponding ARASAAC pictogram ID
```

## 💡 Dataset example

For the given sample:

```csv
clips : common_voice_fr_24683664.mp3
text : l'auteur est connu comme auteur de romans policiers
pictos : [8476, 11258, 8456, 12313, 11258, 7074, 2450, 5547]
tokens : le écrivain connaître comme écrivain de livre agent_de_police_municipale
```

- The `text` is the associated transcription; in English: “the author is known as a writer of detective novels”.
- `pictos` is the sequence of pictogram IDs. Each pictogram can be retrieved by its ID, for example 8476 = https://static.arasaac.org/pictograms/8476/8476_2500.png
- `tokens` are retrieved from a specific lexicon and can be used to train translation models.

![Example](example.png)

## ℹ️ Dataset Sources

- **Repository:** [cv-corpus-15.0-2023-09-08-fr](https://commonvoice.mozilla.org/fr/datasets)
- **Papers:**
  - [Common Voice: A Massively-Multilingual Speech Corpus](https://aclanthology.org/2020.lrec-1.520/) (Ardila et al., LREC 2020)

## 💻 Uses

Propicto-CommonVoice is intended for training Speech-to-Pictogram and Text-to-Pictogram translation models. It can also be used to fine-tune large language models for translation into pictograms.

## ⚙️ Dataset Creation

The dataset was created using a specific formalism that converts French oral transcriptions into corresponding sequences of pictograms. This formalism incorporates grammatical rules to handle specific phenomena in French (e.g., negation, named entities, pronominal forms, plural forms), as well as a dictionary associating each ARASAAC pictogram ID with a set of keywords (tokens). It was presented in [A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation](https://aclanthology.org/2024.lrec-main.76/) (Macaire et al., LREC-COLING 2024).

**Source Data:** Read speech (oral transcriptions).

## ⁉️ Limitations

Translations may contain inaccuracies due to incorrect or missing mappings of words to pictograms.

## 💡 Information

- **Curated by:** Cécile MACAIRE
- **Funded by:** [PROPICTO ANR-20-CE93-0005](https://anr.fr/Projet-ANR-20-CE93-0005)
- **Language(s) (NLP):** French
- **License:** cc0-1.0

## 📌 Citation

```bibtex
@inproceedings{macaire24_interspeech,
  title     = {Towards Speech-to-Pictograms Translation},
  author    = {Cécile Macaire and Chloé Dion and Didier Schwab and Benjamin Lecouteux and Emmanuelle Esperança-Rodier},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {857--861},
  doi       = {10.21437/Interspeech.2024-490},
  issn      = {2958-1796},
}
```

## 👩‍🏫 Dataset Card Authors

**Cécile MACAIRE, Chloé DION, Emmanuelle ESPÉRANÇA-RODIER, Benjamin LECOUTEUX, Didier SCHWAB**
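To illustrate the record layout from the Dataset Structure section, the following minimal Python sketch parses one row and pairs each pictogram ID with its token and image URL. It rests on several assumptions that the card does not confirm: that the CSV files use the column names `clips`, `text`, `pictos`, and `tokens`, that `pictos` is serialized as a bracketed list literal, and that the image URL pattern generalizes from the single example given for ID 8476. The sample data, `parse_row`, `load_rows`, and `ARASAAC_URL` are hypothetical names introduced only for this illustration.

```python
import ast
import csv
import io

# Hypothetical in-memory sample mirroring the example row from the card.
# Assumption: the real CSV files use these four column names and store
# `pictos` as a bracketed list literal; check the actual files first.
SAMPLE_CSV = (
    "clips,text,pictos,tokens\n"
    "common_voice_fr_24683664.mp3,"
    "l'auteur est connu comme auteur de romans policiers,"
    '"[8476, 11258, 8456, 12313, 11258, 7074, 2450, 5547]",'
    "le écrivain connaître comme écrivain de livre agent_de_police_municipale\n"
)

# Assumption: URL pattern inferred from the one example URL in the card.
ARASAAC_URL = "https://static.arasaac.org/pictograms/{pid}/{pid}_2500.png"


def parse_row(row):
    """Convert one raw CSV row (dict of strings) into typed fields."""
    # literal_eval safely parses the "[1, 2, 3]" string into a Python list.
    picto_ids = [int(p) for p in ast.literal_eval(row["pictos"])]
    # Tokens are whitespace-separated; multiword keywords use underscores.
    tokens = row["tokens"].split()
    return {
        "clip": row["clips"],
        "text": row["text"],
        "picto_ids": picto_ids,
        "tokens": tokens,
        "picto_urls": [ARASAAC_URL.format(pid=p) for p in picto_ids],
    }


def load_rows(handle):
    """Read every row from an open CSV file-like object."""
    return [parse_row(r) for r in csv.DictReader(handle)]


rows = load_rows(io.StringIO(SAMPLE_CSV))
example = rows[0]

# Print each pictogram ID next to its keyword token.
for pid, token in zip(example["picto_ids"], example["tokens"]):
    print(pid, token)
```

To read a real split instead of the sample, pass an open file handle to `load_rows`, for example `open("train.csv", encoding="utf-8", newline="")`; the file name here is also an assumption.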

Provided by:

Propicto
