
Luwidji/BalitaNLP

Source: Hugging Face. Updated 2026-04-13; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Luwidji/BalitaNLP
Resource description:
---
language:
- tl
- fil
pretty_name: Filipino multi-modal NLP dataset. Consists of 350k+ Filipino news articles and associated images
size_categories:
- 100K<n<1M
source_datasets:
- original
tags:
- news
task_categories:
- text-to-image
- image-to-text
- text-generation
- summarization
- text-classification
task_ids:
- news-articles-headline-generation
- news-articles-summarization
dataset_info:
- config_name: default
  features:
  - name: title
    dtype: string
  - name: body
    sequence: string
  - name: image
    dtype: image
  - name: website
    dtype: string
  - name: category_group
    dtype: string
  - name: category
    dtype: string
  - name: title_choice_first_paragraph
    dtype: string
  - name: title_choices
    sequence: string
  - name: title_choice_gold_idx
    dtype: int32
  - name: date
    dtype: string
  - name: author
    dtype: string
  - name: url
    dtype: string
  - name: img_url
    dtype: string
  splits:
  - name: train
    num_bytes: 39449948960.917
    num_examples: 281403
  - name: validation
    num_bytes: 5093856283.5
    num_examples: 35175
  - name: test
    num_bytes: 4923596011.806
    num_examples: 35177
  download_size: 39815624261
  dataset_size: 49467401256.223
- config_name: no-image
  features:
  - name: title
    dtype: string
  - name: body
    sequence: string
  - name: category_group
    dtype: string
  - name: category
    dtype: string
  - name: website
    dtype: string
  - name: title_choice_first_paragraph
    dtype: string
  - name: title_choices
    sequence: string
  - name: title_choice_gold_idx
    dtype: int32
  - name: date
    dtype: string
  - name: author
    dtype: string
  - name: url
    dtype: string
  - name: img_url
    dtype: string
  splits:
  - name: train
    num_bytes: 578462672
    num_examples: 281403
  - name: validation
    num_bytes: 74036069
    num_examples: 35175
  - name: test
    num_bytes: 73488921
    num_examples: 35177
  download_size: 427915786
  dataset_size: 725987662
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
- config_name: no-image
  data_files:
  - split: train
    path: no-image/train-*
  - split: validation
    path: no-image/validation-*
  - split: test
    path: no-image/test-*
---

A Filipino multi-modal language dataset for text+visual tasks. It consists of 351,755 Filipino news articles (with associated images) gathered from Filipino news outlets.

# Description

Total number of articles: **351,755**

The data is split 80-10-10 into training (281,403), validation (35,175), and test (35,177) sets.

Dataset field descriptions:

```
title - Article title
body - Article body text, separated into paragraphs
image - Article image
website - Name of the news outlet
category_group - Category grouped into 5 distinct classes: News, Sports, Entertainment, Crime, and Other
category - News category name given by the news outlet
date - Date published
author - Article author
url - URL of the article
img_url - URL of the article image
title_choice_first_paragraph - Opening paragraph of the article
title_choices - 4 possible titles, one of them being the true one
title_choice_gold_idx - Index of the true title among the choices
```

The title_choice_* fields are included to support the task of textual entailment, taking advantage of the "inverted pyramid" structure of news articles, in which the opening paragraph carries the most important information.

# Dataset Usage

There are two dataset configurations: **default** (includes images) and **no-image** (excludes images).

**Using** `datasets` **library**

**default**

```
from datasets import load_dataset

# Streaming is recommended because the configuration with images is large.
dset = load_dataset('LanceBunag/BalitaNLP', streaming=True)
```

**no-image**

```
from datasets import load_dataset

dset = load_dataset('LanceBunag/BalitaNLP', 'no-image')
```

# Citation

Published in [Buñag & Esquivel, 2023](https://storage.googleapis.com/public-kenricklancebunag/Transformer-based%20Conditional%20Language%20Models%20-%20IEOM%20Submission.pdf).

If you are using **BalitaNLP** in your work, please cite the following:

```
@inproceedings{bunagtransformer,
  author = {Bunag, Kenrick Lance T and Esquivel, Rosanna A},
  title = {Transformer-Based Conditional Language Models to Generate Filipino News Articles},
  year = {2023},
  publisher = {IEOM Society International},
  url = {https://ieomsociety.org/proceedings/2023manila/595.pdf},
  booktitle = {Proceedings of the International Conference on Industrial Engineering and Operations Management},
  pages = {2231--2237},
  numpages = {7},
  location = {Manila, Philippines},
}
```
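One way to read the title_choice_* fields described in the card is as a four-way multiple-choice task: given the opening paragraph, pick the true title. The sketch below scores a predictor against `title_choice_gold_idx`. The word-overlap baseline and the two toy records are assumptions made up for illustration; they are not taken from BalitaNLP and are not the authors' method.

```python
from typing import Callable, Iterable, Mapping, Sequence


def overlap_baseline(first_paragraph: str, choices: Sequence[str]) -> int:
    """Hypothetical baseline: pick the title sharing the most words with the paragraph."""
    paragraph_words = set(first_paragraph.lower().split())
    scores = [len(paragraph_words & set(c.lower().split())) for c in choices]
    return scores.index(max(scores))  # ties resolve to the earliest choice


def title_choice_accuracy(
    records: Iterable[Mapping],
    predict: Callable[[str, Sequence[str]], int],
) -> float:
    """Fraction of records where the predicted index equals title_choice_gold_idx."""
    correct = 0
    total = 0
    for rec in records:
        pred = predict(rec["title_choice_first_paragraph"], rec["title_choices"])
        correct += int(pred == rec["title_choice_gold_idx"])
        total += 1
    return correct / total if total else 0.0


# Toy records invented for this example; only the field names come from the card.
toy_records = [
    {
        "title_choice_first_paragraph": "Nanalo ang koponan sa finals ng liga kagabi.",
        "title_choices": [
            "Presyo ng bigas tumaas",
            "Koponan nanalo sa finals ng liga",
            "Bagyo papalapit sa Luzon",
            "Bagong pelikula ipapalabas",
        ],
        "title_choice_gold_idx": 1,
    },
    {
        "title_choice_first_paragraph": "Tumaas ang presyo ng bigas sa mga pamilihan ngayong linggo.",
        "title_choices": [
            "Presyo ng bigas tumaas ngayong linggo",
            "Koponan nanalo sa finals ng liga",
            "Bagyo papalapit sa Luzon",
            "Bagong pelikula ipapalabas",
        ],
        "title_choice_gold_idx": 0,
    },
]

accuracy = title_choice_accuracy(toy_records, overlap_baseline)
print(f"toy accuracy: {accuracy:.2f}")
```

Records yielded by either configuration should be usable in place of `toy_records`, assuming they expose the field names listed in the card; a learned model can be swapped in for `overlap_baseline` as long as it returns an index into `title_choices`.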
Provider:
Luwidji