duong2110/financial-summarization-vi
Source: Hugging Face · Updated 2026-03-26 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/duong2110/financial-summarization-vi
Description:
# Financial-Summarization-VI Dataset
## 1. Overview
**Financial-Summarization-VI** is a Vietnamese dataset for **abstractive text summarization in the financial domain**. It is intended to support research and development of Natural Language Processing (NLP) models that generate concise, informative summaries of financial texts.
This dataset is particularly valuable due to the scarcity of high-quality **Vietnamese financial NLP resources**, making it useful for both academic research and real-world applications such as fintech systems, financial news aggregation, and automated reporting.
---
## 2. Task Definition
The dataset is built for the task of:
* **Abstractive Summarization**: generating a summary that may not directly copy phrases from the original text but instead paraphrases and condenses the key information.
### Input:
A financial document (e.g., news article, report, or analysis).
### Output:
A concise summary capturing the main ideas of the input text.
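One common way to gauge how abstractive a summary is (how much it paraphrases rather than copies) is the share of summary n-grams that never appear in the source. The sketch below is illustrative only: the function names are hypothetical, and it assumes lowercased whitespace tokenization, which splits Vietnamese into syllables rather than words. A Vietnamese word segmenter could be substituted.

```python
def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(text, summary, n=2):
    """Fraction of distinct summary n-grams that do not occur in the source text.

    0.0 means every summary n-gram is copied from the source;
    1.0 means none of them are.
    """
    source = ngrams(text.lower().split(), n)
    target = ngrams(summary.lower().split(), n)
    if not target:
        # Summary is shorter than n tokens: nothing to measure.
        return 0.0
    return len(target - source) / len(target)
```

A value near 0 suggests a largely extractive summary, while higher values indicate more paraphrasing. The ratio says nothing about whether the paraphrase is faithful to the source.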
---
## 3. Dataset Structure
Each sample is a record with two fields, shown here as JSON:
```json
{
"text": "Full financial article content...",
"summary": "Concise summary of the article..."
}
```
### Fields:
* `text`: The original financial document.
* `summary`: The corresponding human-written or generated summary.
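A quick sanity check on individual records can catch missing or empty fields before training. The following is a minimal sketch, assuming both fields are plain strings as in the JSON example above; `validate_record` and `REQUIRED_FIELDS` are hypothetical helpers, not part of the dataset or of any library.

```python
REQUIRED_FIELDS = ("text", "summary")


def validate_record(record):
    """Return a list of problems found in one record; an empty list means none were found."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str):
            # Covers both a missing key (value is None) and a wrong type.
            problems.append(f"{field}: expected a string, got {type(value).__name__}")
        elif not value.strip():
            problems.append(f"{field}: empty")
    return problems
```

Returning a list of messages rather than raising on the first problem makes it easy to scan a whole split and count how many records fail each check.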
---
## 4. Data Characteristics
### Language
* Vietnamese
### Domain
* Finance (e.g., stock market, macroeconomics, corporate reports, investment analysis)
### Key Features
* Domain-specific terminology (e.g., interest rates, equities, financial indicators)
* Structured and informative summaries
* Suitable for training and evaluating summarization models
---
## 5. Dataset Construction
### Data Sources
The dataset is constructed from financial-related textual content, which may include:
* Financial news articles
* Market analysis reports
* Economic commentary
### Preprocessing Steps
Typical preprocessing may include:
* Text cleaning (removal of HTML tags, noise)
* Normalization (encoding, punctuation handling)
* Filtering irrelevant or low-quality samples
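The card does not publish its actual preprocessing pipeline, so the steps above can only be illustrated, not reproduced. The sketch below uses the Python standard library; the helper names and the word-count thresholds are assumptions chosen for the example. NFC normalization is included because Vietnamese text can arrive with diacritics in either composed or decomposed form.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(raw):
    """Strip HTML tags, decode entities, normalize Unicode, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)          # remove tags before decoding entities
    decoded = html.unescape(no_tags)        # e.g. "&amp;" becomes "&"
    normalized = unicodedata.normalize("NFC", decoded)
    return WHITESPACE_RE.sub(" ", normalized).strip()


def keep_sample(text, summary, min_text_words=20, min_summary_words=3):
    """Heuristic filter: drop very short pairs and summaries not shorter than their source."""
    text_len = len(text.split())
    summary_len = len(summary.split())
    return (
        text_len >= min_text_words
        and summary_len >= min_summary_words
        and summary_len < text_len
    )
```

Tags are removed before entities are decoded so that escaped markup in the article body (for example `&lt;b&gt;`) is kept as literal text instead of being stripped. A regular expression is only a rough tag remover; an HTML parser would be more robust on malformed markup.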
### Summary Creation
Summaries are produced in one of two ways:
* Written by humans (preferred for quality)
* Generated by a model (requires validation)
---
## 6. Use Cases
This dataset can be used for:
* Training summarization models (e.g., T5, BART, PEGASUS)
* Fine-tuning large language models (LLMs)
* Evaluating summarization performance in Vietnamese
* Financial text understanding and information extraction
* Building real-world applications such as:
  * Financial news summarizers
  * Investment assistants
  * Automated reporting systems
---
## 7. Evaluation Metrics
Common evaluation metrics for summarization include:
* **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**
  * ROUGE-1
  * ROUGE-2
  * ROUGE-L
* **BERTScore**
  * Measures semantic similarity between generated and reference summaries
These metrics help assess both lexical overlap and semantic quality.
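To make the lexical-overlap side concrete, here is a simplified pure-Python sketch of ROUGE-N and ROUGE-L F1 scores. It is not the official ROUGE implementation: it assumes lowercased whitespace tokens, a single reference, and no stemming, so its numbers will not match established packages exactly. For reported results, an established ROUGE package with a tokenizer suited to Vietnamese is the safer choice.

```python
from collections import Counter


def _f1(overlap, ref_total, cand_total):
    """Harmonic mean of precision and recall, or 0.0 when there is no overlap."""
    if overlap == 0 or ref_total == 0 or cand_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n=1):
    """F1 over clipped n-gram overlap between a reference and a candidate summary."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref_counts & cand_counts).values())
    return _f1(overlap, sum(ref_counts.values()), sum(cand_counts.values()))


def rouge_l(reference, candidate):
    """F1 based on the longest common subsequence (LCS) of tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    # Dynamic-programming LCS, keeping only the previous row of the table.
    prev = [0] * (len(cand) + 1)
    for r_tok in ref:
        curr = [0]
        for j, c_tok in enumerate(cand, start=1):
            if r_tok == c_tok:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return _f1(prev[-1], len(ref), len(cand))
```

For the reference `"a b c d"` and the candidate `"a c d e"`, three unigrams overlap and the LCS is `a c d`, so both `rouge_n(..., n=1)` and `rouge_l(...)` give 0.75, while only one bigram (`c d`) is shared, giving a ROUGE-2 F1 of 1/3.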
---
## 8. How to Load the Dataset
```python
from datasets import load_dataset

# Downloads from the Hugging Face Hub on first use, then reads from the local cache.
dataset = load_dataset("duong2110/financial-summarization-vi")
```
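The card does not say which splits the dataset provides, so it can help to inspect the loaded object before relying on a name such as `train`. The helpers below are a hypothetical sketch that works on any mapping from split name to an indexable sequence of records, which is the interface the object returned by `load_dataset` exposes when no split is requested.

```python
def describe_splits(dataset_dict):
    """Map each split name to its number of rows."""
    return {name: len(split) for name, split in dataset_dict.items()}


def first_record(dataset_dict):
    """Return (split_name, first_row) from the first non-empty split, or None."""
    for name, split in dataset_dict.items():
        if len(split) > 0:
            return name, split[0]
    return None


# Usage with the object loaded above (requires network access on first run):
#   print(describe_splits(dataset))
#   print(first_record(dataset))
```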
---
## 9. Citation
If you use this dataset in your research, please cite it appropriately:
```bibtex
@dataset{financial_summarization_vi,
title = {Financial Dataset for Vietnamese},
author = {DuongNT},
year = {2026},
url = {https://huggingface.co/datasets/duong2110/financial-summarization-vi}
}
```
---
## 10. Conclusion
Financial-Summarization-VI is a valuable contribution to the Vietnamese NLP ecosystem, particularly in the financial domain. It provides a foundation for developing and evaluating summarization models in a low-resource language setting, enabling further research and real-world applications.
---
Provider:
duong2110



