duong2110/financial-summarization-vi

Name: duong2110/financial-summarization-vi
Creator: duong2110
Published: 2026-03-26 15:35:53
License: 暂无描述

Hugging Face2026-03-26 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/duong2110/financial-summarization-vi

下载链接

链接失效反馈

官方服务：

资源简介：

# Financial-Summarization-VI Dataset ## 1. Overview **Financial-Summarization-VI** is a Vietnamese dataset designed for the task of **abstractive text summarization in the financial domain**. The dataset aims to support research and development of Natural Language Processing (NLP) models that can generate concise and informative summaries from financial texts. This dataset is particularly valuable due to the scarcity of high-quality **Vietnamese financial NLP resources**, making it useful for both academic research and real-world applications such as fintech systems, financial news aggregation, and automated reporting. --- ## 2. Task Definition The dataset is built for the task of: * **Abstractive Summarization**: generating a summary that may not directly copy phrases from the original text but instead paraphrases and condenses the key information. ### Input: A financial document (e.g., news article, report, or analysis). ### Output: A concise summary capturing the main ideas of the input text. --- ## 3. Dataset Structure Each sample in the dataset follows a standard Hugging Face format: ```json { "text": "Full financial article content...", "summary": "Concise summary of the article..." } ``` ### Fields: * `text`: The original financial document. * `summary`: The corresponding human-written or generated summary. --- ## 4. Data Characteristics ### Language * Vietnamese ### Domain * Finance (e.g., stock market, macroeconomics, corporate reports, investment analysis) ### Key Features * Domain-specific terminology (e.g., interest rates, equities, financial indicators) * Structured and informative summaries * Suitable for training and evaluating summarization models --- ## 5. Dataset Construction ### Data Sources The dataset is constructed from financial-related textual content, which may include: * Financial news articles * Market analysis reports * Economic commentary ### Preprocessing Steps Typical preprocessing may include: * Text cleaning (removal of HTML tags, noise) * Normalization (encoding, punctuation handling) * Filtering irrelevant or low-quality samples ### Summary Creation Summaries are generated using one of the following approaches: * Human-written summaries (preferred for quality) * Model-generated summaries (requires validation) --- ## 6. Use Cases This dataset can be used for: * Training summarization models (e.g., T5, BART, PEGASUS) * Fine-tuning large language models (LLMs) * Evaluating summarization performance in Vietnamese * Financial text understanding and information extraction * Building real-world applications such as: * Financial news summarizers * Investment assistants * Automated reporting systems --- ## 7. Evaluation Metrics Common evaluation metrics for summarization include: * **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** * ROUGE-1 * ROUGE-2 * ROUGE-L * **BERTScore** * Measures semantic similarity between generated and reference summaries These metrics help assess both lexical overlap and semantic quality. --- ## 8. How to Load the Dataset ```python from datasets import load_dataset dataset = load_dataset("duong2110/financial-summarization-vi") ``` --- ## 9. Citation If you use this dataset in your research, please cite it appropriately: ```bibtex @dataset{financial_summarization_vi, title = {Financial Dataset for Vietnamese}, author = {DuongNT}, year = {2026}, url = {https://huggingface.co/datasets/duong2110/financial-summarization-vi} } ``` --- ## 10. Conclusion Financial-Summarization-VI is a valuable contribution to the Vietnamese NLP ecosystem, particularly in the financial domain. It provides a foundation for developing and evaluating summarization models in a low-resource language setting, enabling further research and real-world applications. ---

提供机构：

duong2110

5,000+

优质数据集

54 个

任务类型

进入经典数据集