Syntheresis/SAW-corpus

Name: Syntheresis/SAW-corpus
Creator: Syntheresis
Published: 2024-04-13 13:47:38
License: No description available

Source: Hugging Face. Updated 2024-04-13; indexed 2025-11-29.

Download link:

https://hf-mirror.com/datasets/Syntheresis/SAW-corpus

Description:

---
language_creators:
- found
language:
- hy
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 100M<n<1B
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
---

# Dataset Card for SAW Corpus

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

### Dataset Summary

The Selective Armenian Web (SAW) Corpus is a collection of Armenian language texts, selectively compiled from various online sources. It aims to support natural language processing tasks, offering a wide range of text types, including news articles, legal documents, and other web content.

### Supported Tasks and Leaderboards

- `language-modeling`
- `masked-language-modeling`

### Languages

The dataset is composed entirely in Armenian (hy), with all texts containing at least 50% Armenian characters.
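The 50% threshold mentioned under Languages could be checked along the following lines. This is a minimal sketch, not the curators' actual filter: the card does not say how the proportion is computed, so counting over non-whitespace characters, the Unicode ranges used, and the names `is_armenian`, `armenian_ratio`, and `meets_threshold` are all assumptions.

```python
# Assumption: "Armenian characters" means code points in the Armenian block
# (U+0530..U+058F) or the Armenian ligatures (U+FB13..U+FB17).
def is_armenian(ch: str) -> bool:
    cp = ord(ch)
    return 0x0530 <= cp <= 0x058F or 0xFB13 <= cp <= 0xFB17


def armenian_ratio(text: str) -> float:
    """Share of non-whitespace characters that are Armenian (assumed definition)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if is_armenian(c)) / len(chars)


def meets_threshold(text: str, threshold: float = 0.5) -> bool:
    """True if the text reaches the stated 50% Armenian-character share."""
    return armenian_ratio(text) >= threshold
```

Under this definition digits and Latin punctuation count against the ratio, so a text that mixes Armenian prose with many numbers scores lower than one with prose alone.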
## Dataset Structure

### Data Instances

A typical data instance in this dataset might look like the following:

    {
        "text": "Հայաստանում կատարվել է 2 164 083 պատվաստում\n\nՊատվաստումային գործընթացը շարունակվում է:Ապրիլի 24-ի դրությամբ կատարվել է՝",
        "link": "https://hy.armradio.am/2022/04/25/հայաստանում-կատարվել-է-2-164-083-պատվաստում/",
        "date": "2022-04-25",
        "tags": ["Կարևոր", "Հասարակություն"],
        "source": "hy.armradio.am"
    }

### Data Fields

- `text`: The main content of the article or text. Always includes the title.
- `url`: The URL where the text was sourced from.
- `date`: The publication date of the text.
- `tags`: A list of tags or categories associated with the text.
- `source`: The name of the website or platform where the text was sourced from.

### Data Splits

The dataset is divided into three splits: train, validation (val), and test. Below are the details for each split:

| Split | Samples | Words |
| ----- | ------: | ----------: |
| Train | 849,392 | 284,764,117 |
| Val | 47,226 | 16,638,182 |
| Test | 47,309 | 15,621,729 |

## Dataset Creation

### Curation Rationale

The SAW Corpus was curated with the intent to create a comprehensive resource for Armenian language processing. The rationale behind its creation was to compile a diverse and significant collection of Armenian texts from various online sources, suitable for training robust language models and other NLP tasks. The dataset aims to fill the gap in Armenian language resources and provide a valuable tool for both academic research and practical applications in NLP.

### Source Data

#### Initial Data Collection and Normalization

The texts for the SAW Corpus were collected from a wide range of Armenian online sources, including news websites, document archives, and other relevant web content. The collection process involved selectively sourcing texts that were representative of contemporary Armenian usage.

Normalization and cleaning processes were applied to ensure the quality and consistency of the dataset. These processes included:

- Removing extraneous formatting and correcting obvious errors.
- Standardizing punctuation marks such as commas, colons, and dashes.
- Harmonizing variations of specific Armenian characters (e.g., standardizing 'և' and 'եւ').
- Using Markdown syntax to format tables and ordered and unordered lists.

The focus was on maintaining the integrity and diversity of the original content while ensuring the texts were suitable for NLP tasks.

### Annotations

The dataset does not contain any additional annotations.

### Personal and Sensitive Information

The dataset consists of texts collected from publicly available sources. Because of the volume of data, no specific steps were taken to identify or remove personal or sensitive information from individual texts. Users should keep this in mind when using the dataset, particularly where privacy and data protection are concerns.

## Considerations for Using the Data

### Social Impact of Dataset

The dataset supports advancements in NLP for the Armenian language, which can aid applications ranging from language research to the development of linguistic technologies.

### Discussion of Biases

Because the dataset aggregates content from various online sources, it may carry the biases present in those sources. These can include skew in topics, styles, or viewpoints.

### Other Known Limitations

The dataset primarily includes Eastern Armenian texts and does not cover Western Armenian, which limits its linguistic diversity. Because it is a written corpus, the dataset is rich in formal and literary styles but may not adequately represent spoken dialects and colloquial forms of Armenian.

## Additional Information

### Dataset Curators

Curated by Mkrtich Minasyan.
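
The normalization steps listed under Initial Data Collection and Normalization could look roughly like the sketch below. It is illustrative only: the card does not state which form of 'և' / 'եւ' was chosen as canonical, which dash was kept, or how formatting was cleaned, so the direction of each mapping, the constants, and the function name `normalize` are assumptions.

```python
import re

# Assumption: the two-letter sequence U+0565 U+0582 ('եւ') is folded into the
# single ligature U+0587 ('և'). The card does not say which direction was used.
EV_SEQUENCE = "\u0565\u0582"
EV_LIGATURE = "\u0587"

# Assumption: hyphen, figure dash, en dash, and horizontal bar variants are
# mapped to the em dash (U+2014) as one canonical dash.
DASH_VARIANTS = "\u2010\u2011\u2012\u2013\u2015"
CANONICAL_DASH = "\u2014"
_DASH_TABLE = str.maketrans({c: CANONICAL_DASH for c in DASH_VARIANTS})


def normalize(text: str) -> str:
    """Apply a hypothetical subset of the card's normalization steps."""
    text = text.replace(EV_SEQUENCE, EV_LIGATURE)  # harmonize character variants
    text = text.translate(_DASH_TABLE)             # standardize dashes
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces and tabs
    return text.strip()
```

Newlines are left untouched so that the title and body separation seen in the sample instance survives normalization.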

### Licensing Information

This dataset is distributed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

### Citation Information

    @dataset{saw_corpus_2024,
      title = {Selective Armenian Web (SAW) Corpus},
      author = {Mkrtich Minasyan},
      year = {2024}
    }
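
As a usage aid, the record layout described under Data Fields can be checked with a small validator. This is a hedged sketch: the function name `check_record` and the specific checks are hypothetical, and because the sample instance uses the key `link` while the field list names `url`, the sketch accepts either rather than guessing which one the published files use.

```python
from datetime import date

# Fields named in the card that appear in both the sample and the field list.
REQUIRED_FIELDS = ("text", "date", "tags", "source")


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in record]

    # Assumption: either key may hold the source URL.
    if "url" not in record and "link" not in record:
        problems.append("missing field: url/link")

    if "text" in record and not isinstance(record["text"], str):
        problems.append("text is not a string")

    tags = record.get("tags")
    if "tags" in record and not (
        isinstance(tags, list) and all(isinstance(t, str) for t in tags)
    ):
        problems.append("tags is not a list of strings")

    if "date" in record:
        try:
            date.fromisoformat(record["date"])
        except (TypeError, ValueError):
            problems.append("date is not an ISO date string")

    return problems
```

Records could be obtained with `datasets.load_dataset("Syntheresis/SAW-corpus")` from the Hugging Face `datasets` library and passed to `check_record` one at a time. The split names exposed by the repository are not stated here beyond train, validation (val), and test, so inspect the returned object before indexing into it.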
