
drelhaj/AraFinNews

Source: Hugging Face. Updated 2025-11-30; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/drelhaj/AraFinNews
Description:
---
language: ar
license: cc-by-4.0
tags:
- arabic
- finance
- summarization
- dataset
task_categories:
- summarization
size_categories:
- 100K+
---

# AraFinNews: The Arabic Financial News Dataset (212K)

## For the JSON file format please check our [AraFinNews GitHub repo](https://github.com/ArabicNLP-UK/AraFinNews)

**AraFinNews** is the largest openly available dataset of Arabic financial news, comprising **212,500 full-length articles** collected from [Argaam.com](https://www.argaam.com/), a leading financial news portal in the Arab world. The dataset provides structured, machine-readable text suitable for research in **financial NLP**, **abstractive summarisation**, **event extraction**, and **domain-specific language modelling**.

Comparable to the **CNN/DailyMail** dataset for English, AraFinNews offers an Arabic equivalent for headline-style abstractive summarisation. Each record pairs a full Arabic financial article with its professionally written headline, enabling high-quality training and evaluation of summarisation and financial text understanding systems.
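As a rough illustration of the article-headline pairing described above, the sketch below turns rows into source/target records for headline-style summarisation. It assumes the `article` and `title` columns listed in the Dataset Overview; the function name `to_summarisation_pairs`, the `source`/`target` keys, and the toy rows (English placeholders, not real dataset content) are hypothetical.

```python
import pandas as pd


def to_summarisation_pairs(df: pd.DataFrame) -> list:
    """Return one {"source": article, "target": headline} dict per usable row."""
    # Drop rows where either field is missing.
    cleaned = df.dropna(subset=["article", "title"])
    # Drop rows where either field is empty or whitespace only.
    has_text = (cleaned["article"].str.strip() != "") & (cleaned["title"].str.strip() != "")
    cleaned = cleaned[has_text]
    return [
        {"source": row.article, "target": row.title}
        for row in cleaned.itertuples(index=False)
    ]


# Toy stand-in for a few dataset rows (placeholder text only).
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "title": ["headline one", None, "headline three"],
        "article": ["body one", "body two", "   "],
    }
)

pairs = to_summarisation_pairs(toy)
print(pairs)
```

Only the first toy row survives here, because the second has no headline and the third has a blank article. Whether such rows occur in the real data is not stated in this card, so the filtering is a defensive assumption.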
---

## 📰 Dataset Overview

| Field     | Description                            |
|-----------|----------------------------------------|
| `id`      | Unique numeric identifier              |
| `title`   | Arabic headline (financial news title) |
| `date`    | Publication date (ISO format)          |
| `article` | Full article text                      |
| `url`     | Public link on *Argaam.com*            |

- **Total articles:** 212,500
- **Language:** Modern Standard Arabic
- **Domain:** Finance, markets, economics, corporate activity
- **Format:** CSV (UTF-8)
- **Licence:** CC BY-NC 4.0

---

## ⚙️ Updated Data Splits

In addition to the original ID lists, the repository now includes **fully populated split files**:

- `AraFinNews_train.csv`: 80% of the dataset
- `AraFinNews_validation.csv`: 10%
- `AraFinNews_test.csv`: 10%

These files contain **the complete article and headline rows**, making them directly compatible with the Hugging Face Dataset Viewer and eliminating the need to reconstruct splits manually.

The older ID-only files are still included for reference:

- `AraFinNews_train_ids.csv`
- `AraFinNews_val_ids.csv`
- `AraFinNews_test_ids.csv`

These can be used if a user prefers to work from the master file `AraFinNews.csv`.

---

## ⚙️ Intended Use

AraFinNews supports research in:

- Abstractive and extractive summarisation
- Financial event and entity extraction
- Sentiment and stance analysis in financial narratives
- Domain-specific pretraining and adaptation of Arabic LLMs
- Financial question answering and narrative analysis

The dataset is released strictly for **non-commercial research and educational use**.
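For users who prefer the ID-based route mentioned under Updated Data Splits, the following minimal sketch selects rows from a master table for each split and checks that no ID is assigned to two splits. It assumes that every table has an `id` column; the helper name `rebuild_splits` and the ten-row toy table are hypothetical stand-ins for `AraFinNews.csv` and the three `*_ids.csv` files.

```python
import pandas as pd


def rebuild_splits(master: pd.DataFrame, split_ids: dict) -> dict:
    """Select the master rows for each split and reject overlapping ID lists."""
    seen = set()
    splits = {}
    for name, ids in split_ids.items():
        id_set = set(ids)
        overlap = seen & id_set
        if overlap:
            raise ValueError(f"{len(overlap)} ids appear in more than one split")
        seen |= id_set
        # Keep only the rows whose id belongs to this split.
        splits[name] = master[master["id"].isin(list(id_set))].reset_index(drop=True)
    return splits


# Toy master table standing in for AraFinNews.csv (placeholder text only).
master = pd.DataFrame(
    {
        "id": list(range(1, 11)),
        "title": [f"headline {i}" for i in range(1, 11)],
        "article": [f"body {i}" for i in range(1, 11)],
    }
)

# Toy ID lists standing in for the *_ids.csv files (80/10/10).
split_ids = {
    "train": pd.Series(range(1, 9)),
    "validation": pd.Series([9]),
    "test": pd.Series([10]),
}

splits = rebuild_splits(master, split_ids)
sizes = {name: len(frame) for name, frame in splits.items()}
print(sizes)
```

With the real files, the `pd.Series` values would come from reading the `id` column of each ID file with `pd.read_csv`. The overlap check is an extra safeguard, not something this card says is required.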
---

## 📦 Access and Usage

### Load the full dataset

```python
import pandas as pd

# Read the master file and preview five random rows.
df = pd.read_csv("AraFinNews.csv")
df.sample(5)
```

### Load directly from the full splits

```python
import pandas as pd

# Each split file already contains the complete article and headline rows.
train = pd.read_csv("AraFinNews_train.csv")
val = pd.read_csv("AraFinNews_validation.csv")
test = pd.read_csv("AraFinNews_test.csv")
```

### (Optional) Load splits using ID files

```python
import pandas as pd

# Filter the master file down to the rows whose id is in the training ID list.
df = pd.read_csv("AraFinNews.csv")
train_ids = pd.read_csv("AraFinNews_train_ids.csv")["id"]
train_df = df[df["id"].isin(train_ids)]
```

---

## 📚 Citation

If you use this dataset, please cite:

> **El-Haj, M.** & **Rayson, P.** (2025). *AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs.* Proceedings of IEEE Big Data 2025.

This work is associated with the following preprint: **https://arxiv.org/abs/2511.01265**

---

## 🏛️ Repository Structure

Note: JSON files are on [AraFinNews GitHub](https://github.com/ArabicNLP-UK/AraFinNews)

```
AraFinNews/
│
├── AraFinNews.csv
├── train.csv
├── validation.csv
├── test.csv
│
├── AraFinNews_train_ids.csv
├── AraFinNews_val_ids.csv
├── AraFinNews_test_ids.csv
│
├── AraFinNews_json_files/
│   ├── 000001.json
│   ├── 000002.json
│   └── ...
│
└── README.md
```

---

**Contact:**
Mo El-Haj
VinUniversity, Hanoi / Lancaster University, UK
dr.melhaj@gmail.com
Provider:
drelhaj