HausaSRS: A Hausa-English Parallel Corpus of Software Requirements Specifications
Source: Mendeley Data (indexed 2026-05-21)
Download link:
https://data.mendeley.com/datasets/zs2jb2hzzn
Description:
This study hypothesizes that software requirements written in English can be systematically transformed into a high-quality Hausa Requirements Engineering (RE) dataset through a controlled pipeline that combines document harvesting, domain filtering, glossary-guided translation, weak annotation, and expert validation. A related hypothesis is that, in a low-resource setting, combining automated corpus construction with human-in-the-loop review can produce data of sufficient quality for downstream NLP tasks such as translation, classification of functional and non-functional requirements (FR/NFR), and token-level entity extraction.
The data consists of a parallel, annotated corpus derived from approximately 350 Software Requirements Specification (SRS) documents collected across the health, education, and finance domains. The source documents were in PDF and DOCX formats. Text was extracted using pdfplumber and python-docx, then cleaned with rule-based preprocessing to remove non-semantic artifacts such as page numbers, repeated spacing, URLs, and boilerplate labels. The cleaned text was filtered to retain requirement-like content using requirements-engineering keywords and modal patterns. English segments were retained, technical terms were anchored through a custom SRS glossary and named-entity recognition, and the retained text was translated into Hausa. The Hausa outputs were normalized and weakly annotated with BIO tags and FR/NFR labels. Synthetic IEEE-style Hausa requirement templates were also introduced to strengthen corpus structure. After cleaning, deduplication, and removal of malformed rows, the dataset was stored as a cleaned silver corpus and partitioned into train, validation, and test subsets.
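The cleaning and filtering steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' actual pipeline: the regular expressions, the keyword and modal lists, and the function names (`clean_line`, `is_requirement_like`, `filter_requirements`) are all assumptions, since the description does not publish the concrete rules. The input is assumed to be lines of text already extracted from the PDF or DOCX files.

```python
import re

# Hypothetical patterns; the dataset description does not list the actual rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PAGE_NUMBER_RE = re.compile(
    r"^\s*(?:page\s+)?\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE
)
MODAL_RE = re.compile(r"\b(?:shall|must|should|will)\b", re.IGNORECASE)
KEYWORD_RE = re.compile(r"\b(?:system|user|requirement|interface)\b", re.IGNORECASE)


def clean_line(line: str) -> str:
    """Remove URLs and collapse repeated whitespace in one extracted line."""
    line = URL_RE.sub("", line)
    return re.sub(r"\s+", " ", line).strip()


def is_requirement_like(line: str) -> bool:
    """Keep lines that contain both a modal verb and an RE keyword."""
    return bool(MODAL_RE.search(line)) and bool(KEYWORD_RE.search(line))


def filter_requirements(lines: list[str]) -> list[str]:
    """Drop page numbers and duplicates, clean the rest, keep requirement-like text."""
    kept, seen = [], set()
    for raw in lines:
        if PAGE_NUMBER_RE.match(raw):
            continue
        line = clean_line(raw)
        key = line.lower()  # case-insensitive deduplication
        if line and key not in seen and is_requirement_like(line):
            seen.add(key)
            kept.append(line)
    return kept
```

Requiring both a modal verb and a keyword is a deliberately strict choice for this sketch; a real filter would likely need a broader, domain-tuned vocabulary to avoid discarding valid requirements.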
The results show that it is feasible to construct a domain-specific Hausa RE resource from heterogeneous SRS documents using a semi-automated workflow. They also show that a glossary-aware translation and annotation strategy can preserve important software engineering concepts such as actors, system entities, constraints, and quality attributes in Hausa. Notably, automated annotation alone is not sufficient for reliable low-resource RE data; expert correction by a Hausa-speaking NLP specialist was necessary to refine mistranslations, resolve ambiguous labels, and correct token boundaries. This confirms the importance of expert validation in producing a gold-standard corpus from an initially silver dataset.
The dataset is a structured representation of software requirements knowledge in Hausa, aligned with common RE tasks. The parallel English–Hausa component supports machine translation and cross-lingual modeling. The FR/NFR labels support requirement classification, while the BIO tags support sequence labeling and information extraction. Researchers can use the data for translation benchmarking, low-resource RE classification, domain-adaptive pretraining, or Hausa-specific entity extraction. More broadly, the data demonstrates a reproducible pathway for creating Requirements Engineering datasets in under-resourced languages.
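To illustrate how the BIO tags could be consumed for entity extraction, the sketch below groups tagged tokens into labeled spans. It is an assumption-laden example: the function name `decode_bio` and the tag names (such as `B-ENTITY`) are hypothetical, the published label set and column layout may differ, and English tokens are used here only to keep the example readable, whereas the corpus annotates the Hausa text.

```python
def decode_bio(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: list[tuple[str, str]] = []
    current_type = None
    current_tokens: list[str] = []

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current_type

        # Close the open span unless this token continues it.
        if current_type is not None and not continues:
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

        if prefix == "B" or (prefix == "I" and current_type is None):
            # A stray I- tag without a preceding B- also starts a new span.
            current_type, current_tokens = entity_type, [token]
        elif continues:
            current_tokens.append(token)

    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical tag names, for illustration only.
example_tokens = ["The", "system", "shall", "encrypt", "patient", "records"]
example_tags = ["O", "B-ENTITY", "O", "O", "B-ENTITY", "I-ENTITY"]
example_spans = decode_bio(example_tokens, example_tags)
```

Treating a stray `I-` tag as the start of a span is a lenient choice that suits weakly annotated (silver) data, where tag sequences are not guaranteed to be well formed.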
Created:
2026-04-30



