
projecte-aina/4catac

Source: Hugging Face. Updated 2024-03-07; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/projecte-aina/4catac
Resource description:
---
annotations_creators:
- expert-generated
language:
- ca
language_creators:
- expert-generated
license: cc-by-4.0
multilinguality:
- monolingual
pretty_name: 4catac
size_categories:
- n<1K
source_datasets: []
task_categories:
- text-to-speech
task_ids: []
---

# Dataset Card for 4catac

## Table of Contents

- [Dataset Card for 4catac](#dataset-card-for-4catac)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://projecteaina.cat/tech/](https://projecteaina.cat/tech/)
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:** langtech@bsc.es

### Dataset Summary

*4catac: examples of phonetic transcription in 4 Catalan accents* is a dataset of phonetic transcriptions in four Catalan accents: Balearic, Central, North-Western and Valencian. It consists of 160 sentences transcribed using [IPA](https://www.internationalphoneticassociation.org/content/full-ipa-chart), following the [recommendations of the Institut d'Estudis Catalans](https://publicacions.iec.cat/repository/pdf/00000041/00000087.pdf). The sentences are the same for the four accents but may have small morphological adaptations to make them more natural for each accent.

This dataset can be used for any purpose, whether academic or commercial, under the terms of the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license. Give appropriate credit, provide a link to the license, and indicate if changes were made.

### Supported Tasks and Leaderboards

This dataset can be used to evaluate phonetic transcription systems across four distinct Catalan accents: Balearic, Central, North-Western and Valencian.

### Languages

The dataset is in Catalan (ca-ES).

## Dataset Structure

### Data Instances

Four tsv files, one for each accent:

* Projecte BSC frases - Balear.tsv
* Projecte BSC frases - Central.tsv
* Projecte BSC frases - Nord-Occ.tsv
* Projecte BSC frases - Val.tsv

### Data Fields

The data fields are the same in all the files:

* `sentence` (str): sentence
* `transcription` (str): transcription

### Data Splits

There is only one split for each accent.

## Dataset Creation

### Curation Rationale

We created this dataset to thoroughly evaluate transcription systems across the diverse variants of Catalan. We expect that this dataset will contribute to the development of language models in Catalan, a low-resource language. Language technologies in Catalan often overlook some of its variants. With the publication of this dataset, we aim to address this bias.

### Source Data

We commissioned the creation of these sentences and their transcriptions from a team of experts at [CLiC (Centre de Llenguatge i Computació)](https://clic.ub.edu/en/que-es-clic).

#### Initial Data Collection and Normalization

We commissioned the creation of these sentences and their transcriptions from a team of experts at [CLiC (Centre de Llenguatge i Computació)](https://clic.ub.edu/en/que-es-clic).

#### Who are the source language producers?

The original sentences were intentionally written to showcase various phonetic phenomena across Catalan accents. The task was entrusted to [CLiC (Centre de Llenguatge i Computació)](https://clic.ub.edu/en/que-es-clic).

### Annotations

#### Annotation process

Each member of the annotation team proposed part of the sentences and transcribed them. Each transcription was reviewed by the other team members and discussed until a consensus was reached. The annotation was done in a Google Drive spreadsheet. The team also developed the specifications for the criteria used. These guidelines will be published soon on Zenodo.

#### Who are the annotators?

The annotation was entrusted to the [CLiC (Centre de Llenguatge i Computació)](https://clic.ub.edu/en/que-es-clic) team from the University of Barcelona. They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.

The annotation team was composed of:

* 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.
* 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.
* 1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and Linguistics, Ph.D. in Signal Theory and Communications.

### Personal and Sensitive Information

This dataset doesn't contain any personal or sensitive information.

## Considerations for Using the Data

### Social Impact of Dataset

We expect that this dataset will contribute to the development of language models in Catalan, a low-resource language. Language technologies in Catalan often overlook some of its variants. With the publication of this dataset, we aim to address this bias.

### Discussion of Biases

It is a very small dataset developed to evaluate phonetic transcription systems. We didn't identify any biases or risks in the dataset.

### Other Known Limitations

[N/A]

## Additional Information

### Dataset Curators

Copyright 2023 Language Technologies Unit (LangTech) at Barcelona Supercomputing Center.

This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

### Licensing Information

This dataset can be used for any purpose, whether academic or commercial, under the terms of the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license. Give appropriate credit, provide a link to the license, and indicate if changes were made.

### Citation Information

DOI: 10.57967/hf/1492

### Contributions

The drafting of the examples and their annotation, as well as the specification of the criteria used, was entrusted to [CLiC (Centre de Llenguatge i Computació)](https://clic.ub.edu/en/que-es-clic).
Provider:
projecte-aina
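Summary of Original Information

Dataset Card for 4catac

Dataset Description

Dataset Summary

4catac: examples of phonetic transcription in 4 Catalan accents is a dataset of phonetic transcriptions in four Catalan accents (Balearic, Central, North-Western and Valencian). It contains 160 sentences transcribed in IPA, following the recommendations of the Institut d'Estudis Catalans. The sentences are the same across the four accents, but may include small morphological adaptations so that they read more naturally in each accent.

The dataset can be used for any purpose, academic or commercial, under the CC BY 4.0 license. Give appropriate credit, provide a link to the license, and indicate if changes were made.

Supported Tasks and Leaderboards

The dataset can be used to evaluate phonetic transcription systems across four Catalan accents (Balearic, Central, North-Western and Valencian).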
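
The card does not say which metric to use for such an evaluation. One common choice, shown here only as a minimal sketch, is an edit-distance error rate between a system's output and the reference transcription. The function names `edit_distance` and `symbol_error_rate` are hypothetical, and treating each Unicode code point as one symbol is an assumption: IPA diacritics and tie bars may need a finer tokenization in practice.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # delete a reference symbol
                current[j - 1] + 1,          # insert a hypothesis symbol
                previous[j - 1] + (r != h),  # substitute, or match at cost 0
            ))
        previous = current
    return previous[-1]


def symbol_error_rate(references, hypotheses):
    """Total edit distance divided by total reference length.

    Assumption: one code point is one symbol, compared after NFC
    normalization so equivalent encodings of a character match.
    """
    total_edits = 0
    total_symbols = 0
    for ref, hyp in zip(references, hypotheses):
        ref = unicodedata.normalize("NFC", ref)
        hyp = unicodedata.normalize("NFC", hyp)
        total_edits += edit_distance(ref, hyp)
        total_symbols += len(ref)
    return total_edits / total_symbols if total_symbols else 0.0


# Placeholder strings, not dataset content: one substitution in four symbols.
print(symbol_error_rate(["abcd"], ["abed"]))  # 0.25
```

Computing the rate separately for each of the four accent files would show whether a system handles some accents worse than others.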

Languages

The dataset is in Catalan (ca-ES).

Dataset Structure

Data Instances

Four tsv files, one per accent:

  • Projecte BSC frases - Balear.tsv
  • Projecte BSC frases - Central.tsv
  • Projecte BSC frases - Nord-Occ.tsv
  • Projecte BSC frases - Val.tsv

Data Fields

The data fields are the same in all files:

  • sentence (str): sentence
  • transcription (str): transcription
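
Given the file names and fields listed above, reading one accent file could look like the sketch below. It assumes each file starts with a header row naming the `sentence` and `transcription` columns and is encoded as UTF-8; the card confirms neither. `ACCENT_FILES`, `read_pairs` and `load_accent` are hypothetical names, and the sample row is a placeholder rather than dataset content.

```python
import csv
import io

# File names as listed in the card; their local location is an assumption.
ACCENT_FILES = {
    "Balearic": "Projecte BSC frases - Balear.tsv",
    "Central": "Projecte BSC frases - Central.tsv",
    "North-Western": "Projecte BSC frases - Nord-Occ.tsv",
    "Valencian": "Projecte BSC frases - Val.tsv",
}


def read_pairs(handle):
    """Return (sentence, transcription) pairs from an open TSV handle.

    Assumes a header row with `sentence` and `transcription` columns.
    QUOTE_NONE keeps any quote characters in the text as literal characters.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["sentence"], row["transcription"]) for row in reader]


def load_accent(path):
    """Read one accent file from disk, assuming UTF-8 encoding."""
    with open(path, encoding="utf-8", newline="") as handle:
        return read_pairs(handle)


# Placeholder row, not real dataset content, to show the expected shape.
sample = io.StringIO(
    "sentence\ttranscription\nplaceholder sentence\tplaceholder ipa\n"
)
pairs = read_pairs(sample)
print(pairs)  # [('placeholder sentence', 'placeholder ipa')]
```

With the files downloaded into the working directory, `load_accent(ACCENT_FILES["Central"])` would return the pairs for the Central accent.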

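Data Splits

There is only one split for each accent.

Dataset Creation

Curation Rationale

We created this dataset to thoroughly evaluate transcription systems across the different variants of Catalan. We expect it to contribute to the development of language models in Catalan, a low-resource language. Language technologies in Catalan often overlook some of its variants. By publishing this dataset, we aim to address this bias.

Source Data

We commissioned the creation of these sentences and their transcriptions from a team of experts at CLiC (Centre de Llenguatge i Computació).

Initial Data Collection and Normalization

We commissioned the creation of these sentences and their transcriptions from a team of experts at CLiC (Centre de Llenguatge i Computació).

Who are the source language producers?

The original sentences were intentionally written to showcase various phonetic phenomena across Catalan accents. The task was entrusted to CLiC (Centre de Llenguatge i Computació).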

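Annotations

Annotation process

Each member of the annotation team proposed part of the sentences and transcribed them. Each transcription was reviewed by the other team members and discussed until a consensus was reached. The team annotated in a Google Drive spreadsheet and also wrote the specifications for the criteria used. These guidelines will be published soon on Zenodo.

Who are the annotators?

The annotation was entrusted to the CLiC (Centre de Llenguatge i Computació) team from the University of Barcelona. They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.

The annotation team was composed of:

  • 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.
  • 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.
  • 1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and Linguistics, Ph.D. in Signal Theory and Communications.

Personal and Sensitive Information

This dataset does not contain any personal or sensitive information.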

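Considerations for Using the Data

Social Impact of Dataset

We expect this dataset to contribute to the development of language models in Catalan, a low-resource language. Language technologies in Catalan often overlook some of its variants. By publishing this dataset, we aim to address this bias.

Discussion of Biases

This is a very small dataset developed to evaluate phonetic transcription systems. We did not identify any biases or risks in the dataset.

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Copyright 2023 Language Technologies Unit (LangTech) at Barcelona Supercomputing Center.

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

Licensing Information

The dataset can be used for any purpose, academic or commercial, under the CC BY 4.0 license. Give appropriate credit, provide a link to the license, and indicate if changes were made.

Citation Information

DOI: 10.57967/hf/1492

Contributions

The drafting of the examples and their annotation, as well as the specification of the criteria used, was entrusted to CLiC (Centre de Llenguatge i Computació).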
