Random-Mary-Smith/port_data_random

Name: Random-Mary-Smith/port_data_random
Creator: Random-Mary-Smith
Published: 2023-11-29 16:15:35
License: No description available

Hugging Face, updated 2023-11-29, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/Random-Mary-Smith/port_data_random


Resource overview:

---
license: mit
dataset_info:
- config_name: law
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 123139395
    num_examples: 397405
  - name: validation
    num_bytes: 56663
    num_examples: 200
  - name: test
    num_bytes: 271938
    num_examples: 1000
  download_size: 163260550
  dataset_size: 123467996
- config_name: literature
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 3517766
    num_examples: 10315
  - name: validation
    num_bytes: 1484637
    num_examples: 4422
  - name: test
    num_bytes: 2125655
    num_examples: 6317
  download_size: 9807283
  dataset_size: 7128058
- config_name: news
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 641932369
    num_examples: 905849
  - name: validation
    num_bytes: 275126414
    num_examples: 388222
  - name: test
    num_bytes: 393251206
    num_examples: 554603
  download_size: 1816074839
  dataset_size: 1310309989
- config_name: politics
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 189377492
    num_examples: 5582
  - name: validation
    num_bytes: 72061567
    num_examples: 2393
  - name: test
    num_bytes: 103743651
    num_examples: 3419
  download_size: 479587978
  dataset_size: 365182710
- config_name: social_media
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 253714655
    num_examples: 2019173
  - name: validation
    num_bytes: 43449
    num_examples: 400
  - name: test
    num_bytes: 65642
    num_examples: 600
  download_size: 345744861
  dataset_size: 253823746
- config_name: web
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 59528182
    num_examples: 14480
  - name: validation
    num_bytes: 26157753
    num_examples: 6206
  - name: test
    num_bytes: 35875967
    num_examples: 8866
  download_size: 164136819
  dataset_size: 121561902
language:
- pt
pretty_name: Portuguese Language Identification
size_categories:
- 1M<n<10M
---

# Portuguese Varieties Identification

This repository contains the code for the paper "Enhancing Portuguese Varieties Identification with Domain-Agnostic Ensemble Approaches," submitted to EACL 2024. In this README, you can find more information about the corpus created to support the training of a model to identify the Portuguese variety of a given text. The corpus is composed of four million documents across six textual domains (law, literature, news, politics, social media, web).

In terms of models, we covered three types of techniques: a) a baseline model using N-Grams and Naive Bayes; b) a model using a pre-trained language model (BERT); c) Anomaly-based language identification using autoencoders. To mitigate the variability introduced by the different domains, we used an ensemble approach to combine the predictions of domain-specialized models trained in isolation.

The work developed in this repository is part of the initiative **anonymized for EACL**

### Quickstart

```
# In /benchmarks folder

# 1. Install the requirements
pip install -r requirements.txt

# 2. Run the benchmarking script
./run.sh
```

### Corpus

The developed corpus is a composition of pre-existing datasets initially created for other NLP tasks that provide permissive licenses. The first release of the corpus is available on [Huggingface](https://huggingface.co/datasets/Random-Mary-Smith/port_data_random).
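
As an illustration of how the released corpus might be consumed, the sketch below loads one domain configuration from the Hugging Face Hub. It assumes the `datasets` library is installed and that the repository is still reachable under the identifier given above; the helper names `load_domain` and `describe` are hypothetical and not part of the repository. The configuration names and the label order are taken from the dataset card metadata.

```python
"""Sketch: load one domain configuration of the corpus from the Hugging Face Hub."""

REPO_ID = "Random-Mary-Smith/port_data_random"  # repository named in this README
# Configuration names and label order as listed in the dataset card metadata.
DOMAINS = ("law", "literature", "news", "politics", "social_media", "web")
LABEL_NAMES = {0: "pt-PT", 1: "pt-BR"}


def load_domain(domain: str, split: str = "validation"):
    """Return one split of one domain (hypothetical helper).

    Requires `pip install datasets` and network access.
    """
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain {domain!r}; expected one of {DOMAINS}")
    # Imported lazily so the constants above stay usable without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, domain, split=split)


def describe(example: dict) -> str:
    """Format a record ({'text': str, 'label': int}) with its variety name."""
    return f"[{LABEL_NAMES[example['label']]}] {example['text'][:60]}"
```

With network access, `describe(load_domain("literature")[0])` would show the first validation record of the literature domain together with its variety label. Loading by configuration keeps the large `news` and `social_media` domains from being downloaded when only a small domain is needed.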
#### Data Sources

The corpus consists of the following datasets:

<table>
  <tr>
    <th>Domain</th>
    <th>Variety</th>
    <th>Dataset</th>
    <th>Original Task</th>
    <th># Docs</th>
    <th>License</th>
    <th>Silver Labeled</th>
  </tr>
  <tr>
    <td rowspan="5">Literature</td>
    <td rowspan="3">PT-PT</td>
    <td><a href="http://arquivopessoa.net/">Arquivo Pessoa</a></td>
    <td>-</td>
    <td>~4k</td>
    <td>CC</td>
    <td>✔</td>
  </tr>
  <tr>
    <td><a href="https://www.gutenberg.org/ebooks/bookshelf/99">Gutenberg Project</a></td>
    <td>-</td>
    <td>6</td>
    <td>CC</td>
    <td>✔</td>
  </tr>
  <tr>
    <td><a href="https://www.clul.ulisboa.pt/recurso/corpus-de-textos-literarios">LT-Corpus</a></td>
    <td>-</td>
    <td>56</td>
    <td>ELRA END USER</td>
    <td>✘</td>
  </tr>
  <tr>
    <td rowspan="2">PT-BR</td>
    <td><a href="https://www.kaggle.com/datasets/rtatman/brazilian-portuguese-literature-corpus">Brazilian Literature</a></td>
    <td>Author Identification</td>
    <td>81</td>
    <td>CC</td>
    <td>✘</td>
  </tr>
  <tr>
    <td>LT-Corpus</td>
    <td>-</td>
    <td>8</td>
    <td>ELRA END USER</td>
    <td>✘</td>
  </tr>
  <tr>
    <td rowspan="2">Politics</td>
    <td>PT-PT</td>
    <td><a href="http://www.statmt.org/europarl/">Koehn (2005) Europarl</a></td>
    <td>Machine Translation</td>
    <td>~10k</td>
    <td>CC</td>
    <td>✘</td>
  </tr>
  <tr>
    <td>PT-BR</td>
    <td>Brazilian Senate Speeches</td>
    <td>-</td>
    <td>~5k</td>
    <td>CC</td>
    <td>✔</td>
  </tr>
  <tr>
    <td rowspan="2">Journalistic</td>
    <td>PT-PT</td>
    <td><a href="https://www.linguateca.pt/CETEMPublico/">CETEM Público</a></td>
    <td>-</td>
    <td>1M</td>
    <td>CC</td>
    <td>✘</td>
  </tr>
  <tr>
    <td>PT-BR</td>
    <td><a href="https://www.linguateca.pt/CETEMFolha/">CETEM Folha</a></td>
    <td>-</td>
    <td>272k</td>
    <td>CC</td>
    <td>✘</td>
  </tr>
  <tr>
    <td rowspan="3">Social Media</td>
    <td>PT-PT</td>
    <td><a href="https://www.aclweb.org/anthology/2021.ranlp-1.37/">Ramalho (2021)</a></td>
    <td>Fake News Detection</td>
    <td>2M</td>
    <td>MIT</td>
    <td>✔</td>
  </tr>
  <tr>
    <td rowspan="2">PT-BR</td>
    <td><a href="https://www.aclweb.org/anthology/2022.lrec-1.322/">Vargas (2022)</a></td>
    <td>Hate Speech Detection</td>
    <td>5k</td>
    <td>CC-BY-NC-4.0</td>
    <td>✘</td>
  </tr>
  <tr>
    <td><a href="https://www.aclweb.org/anthology/2021.wlp-1.72/">Cunha (2021)</a></td>
    <td>Fake News Detection</td>
    <td>2k</td>
    <td>GPL-3.0 license</td>
    <td>✔</td>
  </tr>
  <tr>
    <td>Web</td>
    <td>BOTH</td>
    <td><a href="https://www.aclweb.org/anthology/2020.lrec-1.451/">Ortiz-Suarez (2020)</a></td>
    <td>-</td>
    <td>10k</td>
    <td>CC</td>
    <td>✔</td>
  </tr>
</table>

Table 1: Data Sources

##### Note: The dataset "Brazilian Senate Speeches" was created by the authors of this paper by crawling the Brazilian Senate website, and is available in the Huggingface repository.

#### Annotation Schema & Data Preprocessing Pipeline

We leveraged our knowledge of the Portuguese language to identify data sources that guaranteed mono-variety documents. However, this first release lacks any kind of supervision, so we cannot guarantee that all documents are mono-variety. In the future, we plan to release a second version of the corpus with a more robust annotation schema, combining automatic and manual annotation.

To improve the quality of the corpus, we applied a preprocessing pipeline to all documents. The pipeline consists of the following steps:

1. Remove all NaN values.
2. Remove all empty documents.
3. Remove all duplicated documents.
4. Apply the [clean_text](https://github.com/jfilter/clean-text) library to remove information that is not relevant for language identification from the documents.
5. Remove all documents whose length is more than two standard deviations from the mean length of the documents in the corpus.

The pipeline is illustrated in Figure 1.

<img src="assets/pipeline_lid.jpg" alt="Image Description">

Figure 1: Data Pre-Processing Pipeline

#### Class Distribution

The class distribution of the corpus is presented in Table 2. The corpus is highly imbalanced: the social media and journalistic domains account for the large majority of the documents.
In the future, we plan to release a second version of the corpus with a more balanced distribution across the six domains.

Depending on the imbalance of each textual domain, we used different strategies to perform the train-validation-test splits. For the heavily imbalanced domains, we ensured a minimum of 100 documents for validation and 400 for testing. In the other domains, we applied a stratified split.

<table>
  <tr>
    <th>Domain</th>
    <th># PT-PT</th>
    <th># PT-BR</th>
    <th>Stratified</th>
  </tr>
  <tr>
    <td>Politics</td>
    <td>6500</td>
    <td>4894</td>
    <td>✓</td>
  </tr>
  <tr>
    <td>Web</td>
    <td>7960</td>
    <td>21592</td>
    <td>✓</td>
  </tr>
  <tr>
    <td>Literature</td>
    <td>18282</td>
    <td>2772</td>
    <td>✓</td>
  </tr>
  <tr>
    <td>Law</td>
    <td>392839</td>
    <td>5766</td>
    <td>✕</td>
  </tr>
  <tr>
    <td>Journalistic</td>
    <td>1494494</td>
    <td>354180</td>
    <td>✓</td>
  </tr>
  <tr>
    <td>Social Media</td>
    <td>2013951</td>
    <td>6222</td>
    <td>✕</td>
  </tr>
</table>

Table 2: Class balance across the six textual domains in both varieties of Portuguese.

#### Future Releases & How to Contribute

We plan to release a second version of this corpus considering more textual domains and extending the scope to other Portuguese varieties. If you want to contribute to this corpus, please [contact us]().

### Models

Building on the compiled corpus, we explored three machine learning techniques with the goal of obtaining a reliable language identification model that can operate in a real-world scenario, independent of the textual domain. The three techniques are:

* A baseline model using N-Grams and Naive Bayes;
* A model using a pre-trained language model (BERT);
* Anomaly-based language identification using autoencoders.

To mitigate the impact of the variability introduced by the different domains, we used an ensemble approach to combine the predictions of domain-specialized models trained in isolation.

#### Baseline Model

The baseline model is a Naive Bayes classifier trained on the TF-IDF representation of the documents.
The model is trained using the [scikit-learn](https://scikit-learn.org/stable/) library. After performing a grid search to find the best hyperparameters, we obtained the following hyperparameters:

<table align="center">
  <tr>
    <th>Tokenizer</th>
    <th># Features</th>
    <th>max_df</th>
    <th>Lowercase</th>
    <th>Stop_words</th>
    <th>Token_pattern</th>
    <th>Ngram_range</th>
    <th>Analyzer Algorithm</th>
  </tr>
  <tr>
    <td>NLTK Portuguese</td>
    <td>40000</td>
    <td>1.0</td>
    <td>False</td>
    <td>NLTK Stopwords</td>
    <td>None</td>
    <td>(1, 2)</td>
    <td>word</td>
  </tr>
  <tr>
    <td>NLTK Portuguese</td>
    <td>30000</td>
    <td>1.0</td>
    <td>False</td>
    <td>NLTK Stopwords</td>
    <td>None</td>
    <td>(1, 5)</td>
    <td>char_wb</td>
  </tr>
</table>

Table 3: Hyperparameters of the baseline model.

The F1-scores obtained by this technique are presented in Figure 2. The architecture struggles to generalize outside the domains used for training, compromising the performance of the model in a real-world scenario.

<img src="assets/n_grams_isolated.jpg" alt="Image Description" style="width:70%;">

Figure 2: F1-Scores N-Grams based Model

#### Autoencoder Model

The autoencoder model takes an anomaly-detection approach to language identification. The model is composed of encoder-decoder feed-forward layers trained using BERTimbau embeddings as input. The results obtained by this technique are presented in Figure 3. This model presents intermediate results between the baseline model and the BERT model.

<img src="assets/autoencoder_isolated.jpg" alt="Image Description" style="width:70%;">

Figure 3: F1-Scores Autoencoder based Model

#### BERT Model

The BERT model is a version of [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) fine-tuned on the compiled corpus. The model is trained using the [Huggingface](https://huggingface.co/) library. The results obtained by this technique are presented in Figure 4. This model is capable of generalizing to unseen domains, making it a good candidate for a real-world scenario.

<img src="assets/bert_isolated.jpg" alt="Image Description" style="width:70%;">

Figure 4: F1-Scores BERT based Model
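
The five preprocessing steps listed under "Annotation Schema & Data Preprocessing Pipeline" can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: `basic_clean` is a hypothetical stand-in for the clean-text step, whitespace-only documents are treated as empty, and step 5 is read as dropping documents whose character length differs from the mean by more than two standard deviations in either direction.

```python
import re
from statistics import mean, pstdev


def basic_clean(text: str) -> str:
    """Hypothetical stand-in for step 4 (the README uses the clean-text library).

    Here it only collapses runs of whitespace.
    """
    return re.sub(r"\s+", " ", text).strip()


def preprocess(docs, max_std: float = 2.0):
    """Apply steps 1-5 in the README's order and return the kept documents."""
    # Step 1: drop missing values (None and float NaN are not strings).
    kept = [d for d in docs if isinstance(d, str)]
    # Step 2: drop empty documents (assumption: whitespace-only counts as empty).
    kept = [d for d in kept if d.strip()]
    # Step 3: drop exact duplicates, keeping the first occurrence and the order.
    kept = list(dict.fromkeys(kept))
    # Step 4: clean each document.
    kept = [basic_clean(d) for d in kept]
    # Step 5: drop length outliers relative to the mean document length.
    if len(kept) > 1:
        lengths = [len(d) for d in kept]
        mu, sigma = mean(lengths), pstdev(lengths)
        kept = [d for d in kept if abs(len(d) - mu) <= max_std * sigma]
    return kept
```

Note that with very few documents no length can lie two standard deviations from the mean, so step 5 only has an effect on reasonably sized collections.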

Provider:

Random-Mary-Smith

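Summary of original information

Dataset overview

Dataset information

Law domain (law)

Features:
- text: string
- label: class label with two classes: pt-PT and pt-BR
Splits:
- train: 123,139,395 bytes, 397,405 examples
- validation: 56,663 bytes, 200 examples
- test: 271,938 bytes, 1,000 examples
Download size: 163,260,550 bytes
Dataset size: 123,467,996 bytes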

Literature domain (literature)

Features:
- text: string
- label: class label with two classes: pt-PT and pt-BR
Splits:
- train: 3,517,766 bytes, 10,315 examples
- validation: 1,484,637 bytes, 4,422 examples
- test: 2,125,655 bytes, 6,317 examples
Download size: 9,807,283 bytes
Dataset size: 7,128,058 bytes

News domain (news)

Features:
- text: string
- label: class label with two classes: pt-PT and pt-BR
Splits:
- train: 641,932,369 bytes, 905,849 examples
- validation: 275,126,414 bytes, 388,222 examples
- test: 393,251,206 bytes, 554,603 examples
Download size: 1,816,074,839 bytes
Dataset size: 1,310,309,989 bytes

Politics domain (politics)

Features:
- text: string
- label: class label with two classes: pt-PT and pt-BR
Splits:
- train: 189,377,492 bytes, 5,582 examples
- validation: 72,061,567 bytes, 2,393 examples
- test: 103,743,651 bytes, 3,419 examples
Download size: 479,587,978 bytes
Dataset size: 365,182,710 bytes

Social media domain (social_media)

Features:
- text: string
- label: class label with two classes: pt-PT and pt-BR
Splits:
- train: 253,714,655 bytes, 2,019,173 examples
- validation: 43,449 bytes, 400 examples
- test: 65,642 bytes, 600 examples
Download size: 345,744,861 bytes
Dataset size: 253,823,746 bytes

Web domain (web)

Features:
- text: string
- label: class label with two classes: pt-PT and pt-BR
Splits:
- train: 59,528,182 bytes, 14,480 examples
- validation: 26,157,753 bytes, 6,206 examples
- test: 35,875,967 bytes, 8,866 examples
Download size: 164,136,819 bytes
Dataset size: 121,561,902 bytes
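
The per-split counts above can be cross-checked with a few lines of arithmetic. The sketch below copies the example counts from the dataset card metadata and sums them per domain and overall; the variable names are illustrative only.

```python
# Examples per split for each configuration, copied from the dataset card metadata.
SPLITS = {
    "law":          {"train": 397_405,   "validation": 200,     "test": 1_000},
    "literature":   {"train": 10_315,    "validation": 4_422,   "test": 6_317},
    "news":         {"train": 905_849,   "validation": 388_222, "test": 554_603},
    "politics":     {"train": 5_582,     "validation": 2_393,   "test": 3_419},
    "social_media": {"train": 2_019_173, "validation": 400,     "test": 600},
    "web":          {"train": 14_480,    "validation": 6_206,   "test": 8_866},
}

# Total number of examples per domain and for the whole corpus.
per_domain = {name: sum(splits.values()) for name, splits in SPLITS.items()}
total = sum(per_domain.values())

# Print domains from largest to smallest with their share of the corpus.
for name, count in sorted(per_domain.items(), key=lambda kv: -kv[1]):
    print(f"{name:<13}{count:>10,}  {count / total:6.1%}")
print(f"{'total':<13}{total:>10,}")
```

The total comes to 4,329,452 examples, which agrees with the README's "four million documents" and with the 1M < n < 10M size category. The `social_media` and `news` configurations together hold most of the corpus.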

Language

Portuguese (pt)

Dataset name

Portuguese Language Identification

Size category

1M < n < 10M
