Random-Mary-Smith/port_data_random
Source: Hugging Face. Updated 2023-11-29; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Random-Mary-Smith/port_data_random
Resource description:
---
license: mit
dataset_info:
- config_name: law
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 123139395
    num_examples: 397405
  - name: validation
    num_bytes: 56663
    num_examples: 200
  - name: test
    num_bytes: 271938
    num_examples: 1000
  download_size: 163260550
  dataset_size: 123467996
- config_name: literature
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 3517766
    num_examples: 10315
  - name: validation
    num_bytes: 1484637
    num_examples: 4422
  - name: test
    num_bytes: 2125655
    num_examples: 6317
  download_size: 9807283
  dataset_size: 7128058
- config_name: news
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 641932369
    num_examples: 905849
  - name: validation
    num_bytes: 275126414
    num_examples: 388222
  - name: test
    num_bytes: 393251206
    num_examples: 554603
  download_size: 1816074839
  dataset_size: 1310309989
- config_name: politics
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 189377492
    num_examples: 5582
  - name: validation
    num_bytes: 72061567
    num_examples: 2393
  - name: test
    num_bytes: 103743651
    num_examples: 3419
  download_size: 479587978
  dataset_size: 365182710
- config_name: social_media
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 253714655
    num_examples: 2019173
  - name: validation
    num_bytes: 43449
    num_examples: 400
  - name: test
    num_bytes: 65642
    num_examples: 600
  download_size: 345744861
  dataset_size: 253823746
- config_name: web
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 59528182
    num_examples: 14480
  - name: validation
    num_bytes: 26157753
    num_examples: 6206
  - name: test
    num_bytes: 35875967
    num_examples: 8866
  download_size: 164136819
  dataset_size: 121561902
language:
- pt
pretty_name: Portuguese Language Identification
size_categories:
- 1M<n<10M
---
# Portuguese Varieties Identification
This repository contains the code for the paper "Enhancing Portuguese Varieties Identification with Domain-Agnostic Ensemble Approaches," submitted to EACL 2024. In this README, you can find more information about the corpus created to support the training of a model to identify the Portuguese variety of a given text.
The corpus is composed of four million documents across six textual domains (law, literature, news, politics, social media, web). In terms of models, we covered three types of techniques:
a) a baseline model using N-Grams and Naive Bayes;
b) a model using a pre-trained language model (BERT);
c) anomaly-based language identification using autoencoders.

To mitigate the variability introduced by the different domains, we used an ensemble approach to combine the predictions of domain-specialized models trained in isolation.
The work developed in this repository is part of the initiative **anonymized for EACL**.
### Quickstart
```
# Run these commands from the /benchmarks folder
# 1. Install the requirements
pip install -r requirements.txt
# 2. Run the benchmarking script
./run.sh
```
### Corpus
The corpus combines pre-existing datasets that were originally created for other NLP tasks and are distributed under permissive licenses. The first release of the corpus is available on [Hugging Face](https://huggingface.co/datasets/Random-Mary-Smith/port_data_random).
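As a rough illustration of how the release could be consumed, the sketch below wraps `datasets.load_dataset` around the repository id, config names, and label ids listed in the dataset card metadata. The helper name `load_domain` and its default split are hypothetical, and the call assumes the `datasets` package is installed and that the repository is still reachable.

```python
# Repository id, config names and label ids as declared in the dataset card metadata.
REPO_ID = "Random-Mary-Smith/port_data_random"
CONFIGS = ("law", "literature", "news", "politics", "social_media", "web")
ID2LABEL = {0: "pt-PT", 1: "pt-BR"}


def load_domain(config: str, split: str = "validation"):
    """Load one split of one domain (hypothetical helper, not part of the repository).

    Requires `pip install datasets` and network access to the Hub.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the constants above can be used without the package.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

Under these assumptions, `load_domain("law")` would return the 200-document validation split of the law domain, and `ID2LABEL[example["label"]]` maps an integer label back to its variety name.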
#### Data Sources
The corpus consists of the following datasets:
<p align="center">
<table>
<tr>
<th>Domain</th>
<th>Variety</th>
<th>Dataset</th>
<th>Original Task</th>
<th># Docs</th>
<th>License</th>
<th>Silver Labeled</th>
</tr>
<tr>
<td rowspan="5">Literature</td>
<td rowspan="3">PT-PT</td>
<td><a href="http://arquivopessoa.net/">Arquivo Pessoa</a></td>
<td>-</td>
<td>~4k</td>
<td>CC</td>
<td>✔</td>
</tr>
<tr>
<td><a href="https://www.gutenberg.org/ebooks/bookshelf/99">Gutenberg Project</a></td>
<td>-</td>
<td>6</td>
<td>CC</td>
<td>✔</td>
</tr>
<tr>
<td><a href="https://www.clul.ulisboa.pt/recurso/corpus-de-textos-literarios">LT-Corpus</a></td>
<td>-</td>
<td>56</td>
<td>ELRA END USER</td>
<td>✘</td>
</tr>
<tr>
<td rowspan="2">PT-BR</td>
<td><a href="https://www.kaggle.com/datasets/rtatman/brazilian-portuguese-literature-corpus">Brazilian Literature</a></td>
<td>Author Identification</td>
<td>81</td>
<td>CC</td>
<td>✘</td>
</tr>
<tr>
<td>LT-Corpus</td>
<td>-</td>
<td>8</td>
<td>ELRA END USER</td>
<td>✘</td>
</tr>
<tr>
<td rowspan="2">Politics</td>
<td>PT-PT</td>
<td><a href="http://www.statmt.org/europarl/">Koehn (2005) Europarl</a></td>
<td>Machine Translation</td>
<td>~10k</td>
<td>CC</td>
<td>✘</td>
</tr>
<tr>
<td>PT-BR</td>
<td>Brazilian Senate Speeches</td>
<td>-</td>
<td>~5k</td>
<td>CC</td>
<td>✔</td>
</tr>
<tr>
<td rowspan="2">Journalistic</td>
<td>PT-PT</td>
<td><a href="https://www.linguateca.pt/CETEMPublico/">CETEM Público</a></td>
<td>-</td>
<td>1M</td>
<td>CC</td>
<td>✘</td>
</tr>
<tr>
<td>PT-BR</td>
<td><a href="https://www.linguateca.pt/CETEMFolha/">CETEM Folha</a></td>
<td>-</td>
<td>272k</td>
<td>CC</td>
<td>✘</td>
</tr>
<tr>
<td rowspan="3">Social Media</td>
<td>PT-PT</td>
<td><a href="https://www.aclweb.org/anthology/2021.ranlp-1.37/">Ramalho (2021)</a></td>
<td>Fake News Detection</td>
<td>2M</td>
<td>MIT</td>
<td>✔</td>
</tr>
<tr>
<td rowspan="2">PT-BR</td>
<td><a href="https://www.aclweb.org/anthology/2022.lrec-1.322/">Vargas (2022)</a></td>
<td>Hate Speech Detection</td>
<td>5k</td>
<td>CC-BY-NC-4.0</td>
<td>✘</td>
</tr>
<tr>
<td><a href="https://www.aclweb.org/anthology/2021.wlp-1.72/">Cunha (2021)</a></td>
<td>Fake News Detection</td>
<td>2k</td>
<td>GPL-3.0 license</td>
<td>✔</td>
</tr>
<tr>
<td>Web</td>
<td>BOTH</td>
<td><a href="https://www.aclweb.org/anthology/2020.lrec-1.451/">Ortiz-Suarez (2020)</a></td>
<td>-</td>
<td>10k</td>
<td>CC</td>
<td>✔</td>
</tr>
</table>
</p>
<p align="center">
<em>Table 1: Data Sources</em>
</p>
Note: The "Brazilian Senate Speeches" dataset was created by the authors of this paper by crawling the Brazilian Senate website, and it is available in the Hugging Face repository.
#### Annotation Schema & Data Preprocessing Pipeline
We leveraged our knowledge of the Portuguese language to identify data sources that are expected to contain only mono-variety documents. However, the labels in this first release were not manually verified, so we cannot guarantee that all documents are mono-variety. In the future, we plan to release a second version of the corpus with a more robust annotation schema, combining automatic and manual annotation.
To improve the quality of the corpus, we applied a preprocessing pipeline to all documents. The pipeline consists of the following steps:
1. Remove all NaN values.
2. Remove all empty documents.
3. Remove all duplicated documents.
4. Apply the [clean_text](https://github.com/jfilter/clean-text) library to remove non-relevant information for language identification from the documents.
5. Remove all documents whose length is more than two standard deviations away from the mean document length in the corpus.
The pipeline is illustrated in Figure 1.
<p align="center">
<img src="assets/pipeline_lid.jpg" alt="Image Description">
</p>
<p align="center">
<em>Figure 1: Data Pre-Processing Pipeline</em>
</p>
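The preprocessing steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not the repository's implementation: `simple_clean` is a hypothetical stand-in for the clean-text library, document length is measured in characters, and the population standard deviation is used, all of which are assumptions.

```python
import re
from statistics import mean, pstdev

URL_RE = re.compile(r"https?://\S+")


def simple_clean(text: str) -> str:
    """Hypothetical stand-in for clean-text: drop URLs and collapse whitespace."""
    return " ".join(URL_RE.sub(" ", text).split())


def preprocess(docs, max_sigma: float = 2.0):
    """Apply the five pipeline steps to a list of raw documents."""
    # 1. Remove NaN values (anything that is not a string, e.g. None or float('nan')).
    kept = [d for d in docs if isinstance(d, str)]
    # 2. Remove empty documents.
    kept = [d for d in kept if d.strip()]
    # 3. Remove duplicated documents, keeping the first occurrence.
    kept = list(dict.fromkeys(kept))
    # 4. Clean each document (extra assumption: drop documents emptied by cleaning).
    kept = [c for c in (simple_clean(d) for d in kept) if c]
    # 5. Remove documents whose length is more than max_sigma standard
    #    deviations away from the mean length.
    if len(kept) < 2:
        return kept
    lengths = [len(d) for d in kept]
    mu, sigma = mean(lengths), pstdev(lengths)
    return [d for d in kept if abs(len(d) - mu) <= max_sigma * sigma]
```

Running the length filter last means the mean and standard deviation are computed on cleaned text, so markup or URLs removed in step 4 do not inflate document lengths.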
#### Class Distribution
The class distribution of the corpus is presented in Table 2. The corpus is highly imbalanced: most documents come from the social media and journalistic domains, and within several domains one variety heavily outnumbers the other. In the future, we plan to release a second version of the corpus with a more balanced distribution across the six domains. Depending on the imbalance of the textual domain, we used different strategies to perform train-validation-test splits. For the heavily imbalanced domains, we ensured a minimum of 100 documents for validation and 400 for testing. In the other domains, we applied a stratified split.
<p align="center">
<table>
<tr>
<th>Domain</th>
<th># PT-PT</th>
<th># PT-BR</th>
<th>Stratified</th>
</tr>
<tr>
<td>Politics</td>
<td>6500</td>
<td>4894</td>
<td>✓</td>
</tr>
<tr>
<td>Web</td>
<td>7960</td>
<td>21592</td>
<td>✓</td>
</tr>
<tr>
<td>Literature</td>
<td>18282</td>
<td>2772</td>
<td>✓</td>
</tr>
<tr>
<td>Law</td>
<td>392839</td>
<td>5766</td>
<td>✕</td>
</tr>
<tr>
<td>Journalistic</td>
<td>1494494</td>
<td>354180</td>
<td>✓</td>
</tr>
<tr>
<td>Social Media</td>
<td>2013951</td>
<td>6222</td>
<td>✕</td>
</tr>
</table>
</p>
<p align="center">
<em>Table 2: Class Balance across the six textual domains in both varieties of Portuguese.</em>
</p>
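To make the stratified strategy concrete, here is a small standard-library sketch. The split sizes in the metadata for the stratified domains (for example literature: 10,315 / 4,422 / 6,317) work out to roughly 49% / 21% / 30%, so those fractions are used as defaults; the function name, the seed, and the rounding rule are assumptions rather than the authors' code.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.49, 0.21, 0.30), seed=0):
    """Return (train, validation, test) index lists that preserve label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        # Whatever remains goes to the test split, so no index is lost.
        test += indices[n_train + n_val:]
    return train, val, test
```

Splitting each class separately keeps the PT-PT to PT-BR ratio approximately the same in all three splits, which matters for domains such as literature where one variety dominates.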
#### Future Releases & How to Contribute
We plan to release a second version of this corpus considering more textual domains and extending the scope to other Portuguese varieties. If you want to contribute to this corpus, please [contact us]().
### Models
Using the compiled corpus, we explored three machine-learning techniques with the goal of obtaining a reliable language identification model that can operate in real-world scenarios regardless of the textual domain. The three techniques are:
* A baseline model using N-Grams and Naive Bayes;
* A model using a pre-trained language model (BERT);
* Anomaly-based language identification using autoencoders.
To mitigate the impact of the variability introduced by the different domains, we used an ensemble approach to combine the predictions of domain-specialized models trained in isolation.
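The text does not specify how the domain-specialized predictions are combined. One simple possibility, shown here purely as an assumption, is soft voting: average the probability of pt-BR reported by each domain model and threshold the mean. The function name, the input format, and the 0.5 threshold are all hypothetical.

```python
from statistics import fmean

# Label ids follow the dataset card metadata.
ID2LABEL = {0: "pt-PT", 1: "pt-BR"}


def ensemble_predict(p_br_by_domain, threshold: float = 0.5):
    """Average P(pt-BR) over domain-specialized models and threshold the mean.

    p_br_by_domain maps a domain name (e.g. "news") to the probability of
    pt-BR that the model trained on that domain assigns to the document.
    """
    if not p_br_by_domain:
        raise ValueError("need at least one domain model prediction")
    mean_p_br = fmean(p_br_by_domain.values())
    return ID2LABEL[int(mean_p_br >= threshold)], mean_p_br
```

Averaging lets a confident model outweigh several uncertain ones, whereas majority voting would treat every domain model equally.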
#### Baseline Model
The baseline model is a Naive Bayes classifier trained on the TF-IDF representation of the documents. The model is trained using the [scikit-learn](https://scikit-learn.org/stable/) library. A grid search over the hyperparameters yielded the configurations listed in Table 3:
<table align="center">
<tr>
<th>Tokenizer</th>
<th># Features</th>
<th>max_df</th>
<th>Lowercase</th>
<th>Stop_words</th>
<th>Token_pattern</th>
<th>Ngram_range</th>
<th>Analyzer Algorithm</th>
</tr>
<tr>
<td>NLTK Portuguese</td>
<td>40000</td>
<td>1.0</td>
<td>False</td>
<td>NLTK Stopwords</td>
<td>None</td>
<td>(1, 2)</td>
<td>word</td>
</tr>
<tr>
<td>NLTK Portuguese</td>
<td>30000</td>
<td>1.0</td>
<td>False</td>
<td>NLTK Stopwords</td>
<td>None</td>
<td>(1, 5)</td>
<td>char_wb</td>
</tr>
</table>
<p align="center">
<em>Table 3: Hyperparameters of the baseline model.</em>
</p>
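To illustrate the idea behind the baseline, the following is a dependency-free sketch of a multinomial Naive Bayes classifier over character n-grams of length 1 to 5 taken inside word boundaries, loosely mirroring the `char_wb` row of Table 3. It deliberately differs from the described setup: it uses raw n-gram counts instead of TF-IDF weights, does not cap the number of features, and does not use scikit-learn. The class name and the smoothing value are assumptions.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=5):
    """Character n-grams from space-padded words (in the spirit of char_wb)."""
    grams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(n_min, n_max + 1):
            grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams


class NGramNaiveBayes:
    """Multinomial Naive Bayes with add-alpha smoothing over char n-gram counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.doc_counts = Counter(labels)    # label -> number of documents
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        grams = [g for g in char_ngrams(text) if g in self.vocab]
        scores = {}
        for label, counter in self.counts.items():
            denom = sum(counter.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            for gram in grams:
                score += math.log((counter[gram] + self.alpha) / denom)
            scores[label] = score
        return max(scores, key=scores.get)
```

A classifier built this way would be trained with `NGramNaiveBayes().fit(texts, labels)` and queried with `predict(text)`. The column names in Table 3 (`max_df`, `ngram_range`, `analyzer`, and so on) match parameters of scikit-learn's `TfidfVectorizer`, which is the natural starting point for reproducing the actual baseline.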
The F1-scores obtained by this technique are presented in Figure 2. The model struggles to generalize outside the domains used for training, which compromises its performance in a real-world scenario.
<p align="center">
<img src="assets/n_grams_isolated.jpg" alt="Image Description" style="width:70%;">
</p>
<p align="center">
<em>Figure 2: F1-Scores N-Grams based Model</em>
</p>
#### Autoencoder Model
The autoencoder model takes an anomaly-detection approach to language identification. The model is composed of encoder-decoder feed-forward layers trained using BERTimbau embeddings as input. The results obtained by this technique are presented in Figure 3. This model presents intermediate results between the baseline model and the BERT model.
<p align="center">
<img src="assets/autoencoder_isolated.jpg" alt="Image Description" style="width:70%;">
</p>
<p align="center">
<em>Figure 3: F1-Scores Autoencoder based Model</em>
</p>
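The decision rule of the autoencoder approach is not detailed here. A common pattern in anomaly detection, assumed for this sketch, is to train the autoencoder on documents of a single variety and to flag a document as belonging to the other variety when its reconstruction error exceeds a threshold fitted on held-out in-variety data. The function names and the factor `k` are hypothetical, and the autoencoder itself is left abstract.

```python
from statistics import mean, stdev


def reconstruction_error(vector, reconstruction):
    """Mean squared error between an embedding and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(vector, reconstruction)) / len(vector)


def fit_threshold(in_variety_errors, k=2.0):
    """Threshold = mean + k * sample standard deviation of in-variety errors."""
    return mean(in_variety_errors) + k * stdev(in_variety_errors)


def is_anomaly(error, threshold):
    """True when the document is reconstructed worse than in-variety data allows."""
    return error > threshold
```

With an autoencoder trained only on pt-PT embeddings, for instance, `is_anomaly` returning `True` would be read as a pt-BR prediction under this assumption.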
#### BERT Model
The BERT model is a version of [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) fine-tuned on the compiled corpus. The model is trained using the [Hugging Face](https://huggingface.co/) library. The results obtained by this technique are presented in Figure 4. This model is capable of generalizing to unseen domains, making it a good candidate for a real-world scenario.
<p align="center">
<img src="assets/bert_isolated.jpg" alt="Image Description" style="width:70%;">
</p>
<p align="center">
<em>Figure 4: F1-Scores BERT based Model</em>
</p>
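As a starting point for reproducing this setup, the sketch below prepares BERTimbau for binary sequence classification with the `transformers` package. It only builds the tokenizer and the model with a two-label head; the training loop, the hyperparameters, and the helper names `build_classifier` and `encode` are not taken from the repository and should be treated as assumptions.

```python
MODEL_ID = "neuralmind/bert-base-portuguese-cased"
# Label ids follow the dataset card metadata.
ID2LABEL = {0: "pt-PT", 1: "pt-BR"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def build_classifier():
    """Load BERTimbau with a freshly initialized two-class head.

    Requires `pip install transformers torch` and network access to the Hub.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID,
        num_labels=len(ID2LABEL),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model


def encode(tokenizer, texts, max_length=512):
    """Tokenize a batch of documents, truncating to the model's input limit."""
    return tokenizer(
        texts,
        truncation=True,
        max_length=max_length,
        padding=True,
        return_tensors="pt",
    )
```

Truncation matters for this corpus because documents in domains such as politics are long on average and a BERT-base model accepts at most 512 tokens.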
Provider:
Random-Mary-Smith
Summary of original information
Dataset overview
Dataset information
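Law domain (law)
- Features:
  - text: string
  - label: class label with two classes, pt-PT and pt-BR
- Splits:
  - train: 123,139,395 bytes, 397,405 examples
  - validation: 56,663 bytes, 200 examples
  - test: 271,938 bytes, 1,000 examples
- Download size: 163,260,550 bytes
- Dataset size: 123,467,996 bytes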
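Literature domain (literature)
- Features:
  - text: string
  - label: class label with two classes, pt-PT and pt-BR
- Splits:
  - train: 3,517,766 bytes, 10,315 examples
  - validation: 1,484,637 bytes, 4,422 examples
  - test: 2,125,655 bytes, 6,317 examples
- Download size: 9,807,283 bytes
- Dataset size: 7,128,058 bytes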
News domain (news)
- Features:
  - text: string
  - label: class label with two classes, pt-PT and pt-BR
- Splits:
  - train: 641,932,369 bytes, 905,849 examples
  - validation: 275,126,414 bytes, 388,222 examples
  - test: 393,251,206 bytes, 554,603 examples
- Download size: 1,816,074,839 bytes
- Dataset size: 1,310,309,989 bytes
Politics domain (politics)
- Features:
  - text: string
  - label: class label with two classes, pt-PT and pt-BR
- Splits:
  - train: 189,377,492 bytes, 5,582 examples
  - validation: 72,061,567 bytes, 2,393 examples
  - test: 103,743,651 bytes, 3,419 examples
- Download size: 479,587,978 bytes
- Dataset size: 365,182,710 bytes
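Social media domain (social_media)
- Features:
  - text: string
  - label: class label with two classes, pt-PT and pt-BR
- Splits:
  - train: 253,714,655 bytes, 2,019,173 examples
  - validation: 43,449 bytes, 400 examples
  - test: 65,642 bytes, 600 examples
- Download size: 345,744,861 bytes
- Dataset size: 253,823,746 bytes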
Web domain (web)
- Features:
  - text: string
  - label: class label with two classes, pt-PT and pt-BR
- Splits:
  - train: 59,528,182 bytes, 14,480 examples
  - validation: 26,157,753 bytes, 6,206 examples
  - test: 35,875,967 bytes, 8,866 examples
- Download size: 164,136,819 bytes
- Dataset size: 121,561,902 bytes
Language
- Portuguese (pt)
Dataset name
- Portuguese Language Identification
Dataset size category
- 1M < n < 10M



