Propicto/propicto-orfeo

Name: Propicto/propicto-orfeo
Creator: Propicto
Published: 2025-01-20 10:01:09
License: (no description provided)

Hugging Face · Updated 2025-01-20 · Indexed 2025-11-01

Download link:

https://hf-mirror.com/datasets/Propicto/propicto-orfeo


Resource description:

---
license: cc-by-nc-sa-4.0
task_categories:
- translation
language:
- fr
tags:
- pictograms
- AAC
pretty_name: Propicto-orféo
---

# Propicto-orféo

## 📝 Dataset Description

Propicto-orféo is a French dataset of aligned speech IDs, transcriptions, and pictogram sequences, where each pictogram is given by the identifier of an ARASAAC pictogram. It was created from the CEFC-Orféo corpus. The dataset was presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/) at LREC-COLING 2024.

Propicto-orféo is split into training, validation, and test sets, distributed as three CSV files (train, valid, and test) with the following statistics:

| **Split** | **Number of utterances** |
|:---------:|:------------------------:|
| train | 231 374 |
| valid | 28 796 |
| test | 29 009 |

## ⚒️ Dataset Structure

Each file contains the following information:

```csv
clips : the unique identifier of the utterance, which corresponds to a unique audio clip file (in wav) in the Orféo dataset
text : the transcription of the audio clip
pictos : the sequence of ARASAAC pictogram IDs
tokens : the sequence of tokens, each of which is the keyword associated with the corresponding ARASAAC pictogram ID
```

## 💡 Dataset example

For the given sample:

```csv
clips : cefc-cfpb-1000-5-1186
text : tu essayes de mélanger les deux
pictos : [6625, 26144, 7074, 5515, 5367]
tokens : toi essayer de mélanger à_côté_de
```

- The `clips` value comes from the Orféo subcorpus [CFPB, 1000-5](https://orfeo.ortolang.fr/annis-sample/cfpb/CFPB-1000-5.html), with the sentence ID 1186.
- The `text` is the associated transcription; in English: "you try to mix the two".
- `pictos` is the sequence of pictogram IDs. Each pictogram image can be retrieved from its ID, for example 6625 = https://static.arasaac.org/pictograms/6625/6625_2500.png
- `tokens` are retrieved from a specific lexicon and can be used to train translation models.

![Example](example.png)

## ℹ️ Dataset Sources

- **Repository:** [CEFC-Orféo](https://www.ortolang.fr/market/corpora/cefc-orfeo)
- **Papers:**
  - [C. Benzitoun, J.-M. Debaisieux, H.-J. Deulofeu (2016). Le projet ORFÉO : un corpus d'études pour le français contemporain. Corpus n°15, p. 91-114](https://journals.openedition.org/corpus/2936)
  - [J.-M. Debaisieux & C. Benzitoun (2020). Orféo : un corpus et une plateforme pour l’étude du français contemporain. Langages n°219](https://shs.cairn.info/revue-langages-2020-3?lang=fr)

## 💻 Uses

Propicto-orféo is intended for training Speech-to-Pictograms and Text-to-Pictograms translation models. It can also be used to fine-tune large language models to translate into pictograms.

## ⚙️ Dataset Creation

The dataset was created by applying a specific formalism that converts French oral transcriptions into a corresponding sequence of pictograms. The formalism includes a set of grammatical rules that handle phenomena specific to the French language (negation, named entities, pronominal forms, plurals, ...), as well as a dictionary that associates each ARASAAC pictogram ID with a set of keywords (tokens). This formalism was presented at [LREC](https://aclanthology.org/2024.lrec-main.76/).

Source data: conversations / meetings / daily life situations (oral transcriptions)

## ⁉️ Limitations

The pictogram translation can be partially incorrect, because some words are translated into incorrect pictograms or are missing from the pictogram sequence.
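To make the record layout from the Dataset Structure and Dataset example sections concrete, here is a minimal Python sketch that parses one record and builds the image URL for each pictogram. It assumes that `pictos` is serialized as a bracketed list literal and `tokens` as a space-separated string, as they appear in the sample; the actual serialization in the CSV files may differ. The helper names `parse_record` and `picto_url` are hypothetical, and the URL pattern is generalized from the single example link given for pictogram 6625.

```python
import ast

# URL pattern generalized from the card's example link for pictogram 6625 (assumption).
ARASAAC_URL = "https://static.arasaac.org/pictograms/{id}/{id}_2500.png"


def parse_record(row):
    """Convert one raw row (a dict of strings) into typed fields.

    Assumes `pictos` looks like "[6625, 26144]" and `tokens` is space-separated.
    """
    pictos = [int(p) for p in ast.literal_eval(row["pictos"])]
    tokens = row["tokens"].split()
    return {
        "clips": row["clips"],
        "text": row["text"],
        "pictos": pictos,
        "tokens": tokens,
    }


def picto_url(picto_id):
    """Build the image URL for one ARASAAC pictogram ID."""
    return ARASAAC_URL.format(id=picto_id)


# The sample record shown in the dataset card.
sample = {
    "clips": "cefc-cfpb-1000-5-1186",
    "text": "tu essayes de mélanger les deux",
    "pictos": "[6625, 26144, 7074, 5515, 5367]",
    "tokens": "toi essayer de mélanger à_côté_de",
}

record = parse_record(sample)

# Pair each keyword token with the URL of its pictogram image.
for token, picto_id in zip(record["tokens"], record["pictos"]):
    print(token, picto_url(picto_id))
```

In this sample the token and pictogram sequences have the same length, so `zip` pairs each keyword with its pictogram; whether that one-to-one alignment holds for every row is not stated in the card and would be worth checking before relying on it.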
## 💡 Information

- **Curated by:** Cécile MACAIRE
- **Funded by:** [PROPICTO ANR-20-CE93-0005](https://anr.fr/Projet-ANR-20-CE93-0005)
- **Language(s) (NLP):** French
- **License:** CC-BY-NC-SA-4.0

## 📌 Citation

```bibtex
@inproceedings{macaire-etal-2024-multimodal,
    title = "A Multimodal {F}rench Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation",
    author = "Macaire, C{\'e}cile and Dion, Chlo{\'e} and Arrigo, Jordan and Lemaire, Claire and Esperan{\c{c}}a-Rodier, Emmanuelle and Lecouteux, Benjamin and Schwab, Didier",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    year = "2024",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.76",
    pages = "839--849",
}

@inproceedings{macaire24_interspeech,
    title = {Towards Speech-to-Pictograms Translation},
    author = {Cécile Macaire and Chloé Dion and Didier Schwab and Benjamin Lecouteux and Emmanuelle Esperança-Rodier},
    year = {2024},
    booktitle = {Interspeech 2024},
    pages = {857--861},
    doi = {10.21437/Interspeech.2024-490},
    issn = {2958-1796},
}
```

## 👩‍🏫 Dataset Card Authors

**Cécile MACAIRE, Chloé DION, Emmanuelle ESPÉRANÇA-RODIER, Benjamin LECOUTEUX, Didier SCHWAB**

Provider:

Propicto
