dost-asti/Embeddings
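Dataset Description
This dataset is part of the ITANONG project. It comprises a 1-billion-token Tagalog dataset and provides a range of pretrained embedding models. The models were trained on formal-text datasets drawn from well-known corpora; the details are given in our paper. The available embedding models are listed below:
| Embedding technique | Variant | Model file format | Embedding size |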
|---|---|---|---|
| Word2Vec | Skipgram | .bin | 20 |
| Word2Vec | Skipgram | .bin | 30 |
| Word2Vec | Skipgram | .bin | 50 |
| Word2Vec | Skipgram | .bin | 100 |
| Word2Vec | Skipgram | .bin | 200 |
| Word2Vec | Skipgram | .bin | 300 |
| Word2Vec | Skipgram | .txt | 20 |
| Word2Vec | Skipgram | .txt | 30 |
| Word2Vec | Skipgram | .txt | 50 |
| Word2Vec | Skipgram | .txt | 100 |
| Word2Vec | Skipgram | .txt | 200 |
| Word2Vec | Skipgram | .txt | 300 |
| Word2Vec | CBOW | .bin | 20 |
| Word2Vec | CBOW | .bin | 30 |
| Word2Vec | CBOW | .bin | 50 |
| Word2Vec | CBOW | .bin | 100 |
| Word2Vec | CBOW | .bin | 200 |
| Word2Vec | CBOW | .bin | 300 |
| Word2Vec | CBOW | .txt | 20 |
| Word2Vec | CBOW | .txt | 30 |
| Word2Vec | CBOW | .txt | 50 |
| Word2Vec | CBOW | .txt | 100 |
| Word2Vec | CBOW | .txt | 200 |
| Word2Vec | CBOW | .txt | 300 |
| FastText | Skipgram | .bin | 20 |
| FastText | Skipgram | .bin | 30 |
| FastText | Skipgram | .bin | 50 |
| FastText | Skipgram | .bin | 100 |
| FastText | Skipgram | .bin | 200 |
| FastText | Skipgram | .bin | 300 |
| FastText | Skipgram | .txt | 20 |
| FastText | Skipgram | .txt | 30 |
| FastText | Skipgram | .txt | 50 |
| FastText | Skipgram | .txt | 100 |
| FastText | Skipgram | .txt | 200 |
| FastText | Skipgram | .txt | 300 |
| FastText | CBOW | .bin | 20 |
| FastText | CBOW | .bin | 30 |
| FastText | CBOW | .bin | 50 |
| FastText | CBOW | .bin | 100 |
| FastText | CBOW | .bin | 200 |
| FastText | CBOW | .bin | 300 |
| FastText | CBOW | .txt | 20 |
| FastText | CBOW | .txt | 30 |
| FastText | CBOW | .txt | 50 |
| FastText | CBOW | .txt | 100 |
| FastText | CBOW | .txt | 200 |
| FastText | CBOW | .txt | 300 |
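
As an illustration of how one of the `.txt` models above might be read, here is a minimal sketch using only the Python standard library. It assumes the `.txt` files follow the common word2vec text layout (a header line with the vocabulary size and dimension, then one token and its vector per line); this card does not confirm that layout. The toy vectors and their values are invented for the example and are not taken from the dataset.

```python
import io
import math


def load_text_vectors(handle):
    """Parse word2vec-style text vectors from an open text handle.

    Assumed layout (not confirmed by this card): a header line
    "<vocab_size> <dim>", then one line per token holding the token
    followed by <dim> space-separated floats.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a real model file; the numbers are made up.
TOY_FILE = "3 2\naso 1.0 0.0\npusa 0.0 1.0\ntuta 1.0 0.0\n"

dim, vectors = load_text_vectors(io.StringIO(TOY_FILE))
similarity = cosine(vectors["aso"], vectors["tuta"])
```

For a downloaded file, the same function can be passed a handle from `open(path, encoding="utf-8")`, where `path` is wherever the model was saved. Libraries such as gensim offer `KeyedVectors.load_word2vec_format` for word2vec-format files, but whether the `.bin` files here use that format or a library-specific one is not stated in this card.
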
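Training Details
The models were trained on an Nvidia V100-32GB GPU on the DOST-ASTI Computing and Archiving Research Environment (COARE).
Training Data
The training dataset was compiled from both formal and informal sources, including 194,001 instances from formal channels. For more information on preprocessing and training parameters, see our paper.
Citation
Paper title: iTANONG-DS : A Collection of Benchmark Datasets for Downstream Natural Language Processing Tasks on Select Philippine Languages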
Bibtex:
@inproceedings{visperas-etal-2023-itanong,
    title = "i{TANONG}-{DS} : A Collection of Benchmark Datasets for Downstream Natural Language Processing Tasks on Select {P}hilippine Languages",
    author = "Visperas, Moses L. and Borjal, Christalline Joie and Adoptante, Aunhel John M and Abacial, Danielle Shine R. and Decano, Ma. Miciella and Peramo, Elmer C",
    editor = "Abbas, Mourad and Freihat, Abed Alhakim",
    booktitle = "Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023)",
    month = dec,
    year = "2023",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.icnlsp-1.34",
    pages = "316--323",
}



