
slone/myv-rus-2022-quality-annotated

Hugging Face · Updated 2024-10-30 · Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/slone/myv-rus-2022-quality-annotated
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: ru
    dtype: string
  - name: myv
    dtype: string
  - name: src
    dtype: string
  - name: meaning_score
    dtype: float64
  - name: fluency_score
    dtype: float64
  - name: is_good
    dtype: int64
  splits:
  - name: validation
    num_bytes: 538090
    num_examples: 1500
  - name: test
    num_bytes: 523051
    num_examples: 1500
  download_size: 556560
  dataset_size: 1061141
configs:
- config_name: default
  data_files:
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
license: cc-by-sa-4.0
language:
- ru
- myv
task_categories:
- sentence-similarity
- text-classification
- translation
size_categories:
- 1K<n<10K
---

# Dataset Card for Dataset Name

A small parallel Erzya-Russian dataset, manually annotated for quality: meaning preservation and fluency of the Erzya sentences.

## Dataset Details

### Dataset Description

- **Curated by:** David Dale, Árpád Váldazs
- **Language(s) (NLP):** Erzya, Russian
- **License:** CC-BY-SA-4.0

### Dataset Sources

- **Repository:** https://github.com/slone-nlp/myv-nmt/
- **Paper:** https://www2.statmt.org/wmt24/pdf/2024.wmt-1.49.pdf

## Uses

Evaluation of automatic metrics of machine translation into Erzya, or of Erzya parallel data quality.

## Dataset Structure

The dataset consists of the validation and test splits of [slone/myv_ru_2022](https://huggingface.co/datasets/slone/myv_ru_2022), and follows the same structure. This includes the following fields:

- `id`: numeric id of the sentence in the split, the same as in the original dataset.
- `ru`: a sentence in Russian
- `myv`: a corresponding sentence in Erzya
- `src`: a string identifier of the data source

There are three additional fields, new to this version of the dataset:

- `meaning_score`: a score of semantic similarity between the Russian and the Erzya sentences. Possible values are `0` (a huge difference in the meaning), `0.5` (difference in minor details), and `1` (equivalent meaning).
- `fluency_score`: a score of fluency of the Erzya sentence. Possible values are `0` (serious problems with fluency or grammaticality, or a wrong language), `0.5` (the sentence is acceptable but does not feel natural), and `1` (fluent).
- `is_good`: a flag that the translation pair is good; equals `1` if both scores above are `1`, and `0` otherwise.

The splits of the dataset contain 1500 sentence pairs each. In each split, about 60% of the pairs have "good" labels for both meaning and fluency.

## Dataset Creation

### Curation Rationale

There were two main motivations for creating this dataset:

1. Provide cleaner development and test sets for evaluating machine translation between Erzya and Russian.
2. Provide a set of annotations that could be used to validate automatic metrics of translation into Erzya or of parallel Russian-Erzya data quality.

### Source Data

Various parallel texts in Russian and Erzya, pre-aligned or automatically aligned by sentence. For more details, see the parent dataset [slone/myv_ru_2022](https://huggingface.co/datasets/slone/myv_ru_2022).

### Annotations

The labels in the dataset have been provided by a single annotator, a native speaker of Russian and a fluent Erzya speaker.

## Bias, Risks, and Limitations

The sentences in the dataset may inherit all the peculiarities of their corresponding sources. In particular, a large proportion of sentences from the "constitution" and "wiki" sources contain overly literal translations from Russian (which is reflected in their fluency scores). Some of the sentences are misaligned (due to automatic sentence splitting and alignment), which is normally reflected in their meaning scores. The accuracy and fluency labels were provided by a single annotator without additional validation, and may contain occasional errors.

## Dataset Card Contact

@cointegrated
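
## Example: working with the quality labels

The field descriptions above define `is_good` as `1` only when both `meaning_score` and `fluency_score` equal `1`. The following is a minimal sketch of how that rule could be checked and used to keep only the clean pairs. The helper names (`expected_is_good`, `inconsistent_ids`, `good_pairs`, `good_share`, `load_split`) and the placeholder rows are assumptions made for illustration and are not part of the dataset or its repository; `load_split` additionally assumes the third-party `datasets` package and network access.

```python
from typing import Any, Iterable, List, Mapping, Tuple

Row = Mapping[str, Any]


def expected_is_good(row: Row) -> int:
    """Recompute the flag from the card's rule: 1 only if both scores are 1."""
    return int(row["meaning_score"] == 1 and row["fluency_score"] == 1)


def inconsistent_ids(rows: Iterable[Row]) -> List[int]:
    """Ids of rows whose stored `is_good` disagrees with the recomputed flag."""
    return [r["id"] for r in rows if r["is_good"] != expected_is_good(r)]


def good_pairs(rows: Iterable[Row]) -> List[Tuple[str, str]]:
    """(Russian, Erzya) sentence pairs that are flagged as good."""
    return [(r["ru"], r["myv"]) for r in rows if r["is_good"] == 1]


def good_share(rows: Iterable[Row]) -> float:
    """Fraction of rows flagged as good; 0.0 for an empty input."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for r in rows if r["is_good"] == 1) / len(rows)


def load_split(split: str = "test"):
    """Load one split from the Hugging Face Hub (assumes `datasets` is installed)."""
    from datasets import load_dataset  # imported lazily; third-party dependency

    return load_dataset("slone/myv-rus-2022-quality-annotated", split=split)


# Made-up placeholder rows (NOT taken from the dataset), mirroring its schema.
sample_rows = [
    {"id": 0, "ru": "ru sentence A", "myv": "myv sentence A", "src": "demo",
     "meaning_score": 1.0, "fluency_score": 1.0, "is_good": 1},
    {"id": 1, "ru": "ru sentence B", "myv": "myv sentence B", "src": "demo",
     "meaning_score": 0.5, "fluency_score": 1.0, "is_good": 0},
    {"id": 2, "ru": "ru sentence C", "myv": "myv sentence C", "src": "demo",
     "meaning_score": 1.0, "fluency_score": 0.0, "is_good": 0},
]

print(inconsistent_ids(sample_rows))
print(good_pairs(sample_rows))
print(good_share(sample_rows))
```

To run the same checks on the real data, pass the result of `load_split("validation")` or `load_split("test")` to these helpers instead of `sample_rows`. Since the labels come from a single annotator, filtering on `is_good` is a convenient default, while thresholding `meaning_score` and `fluency_score` separately may suit tasks that tolerate minor fluency issues.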
Provider:
slone