
ismailduru/Telco-DPR

Hugging Face · Updated 2026-04-16 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ismailduru/Telco-DPR
Resource description:
---
license: apache-2.0
configs:
- config_name: corpus
  data_files:
  - split: small
    path: corpus/corpus-00000-of-00001.parquet
  - split: extended
    path: corpus/extended-00000-of-00001.parquet
- config_name: queries
  data_files:
  - split: queries
    path: queries/queries-00000-of-00001.parquet
- config_name: relevant_docs
  data_files:
  - split: test
    path: relevant_docs/test-00000-of-00001.parquet
  - split: train
    path: relevant_docs/train-00000-of-00001.parquet
---

# Dataset Information

This paper proposes a Question-Answering (QA) system for the telecom domain using 3rd Generation Partnership Project (3GPP) technical documents. Alongside, a hybrid dataset, Telco-DPR, which consists of a curated 3GPP corpus in a hybrid format, combining text and tables, is presented. Additionally, the dataset includes a set of synthetic question/answer pairs designed to evaluate the retrieval performance of QA systems on this type of data. The retrieval models, including the sparse model, Best Matching 25 (BM25), as well as dense models, such as Dense Passage Retriever (DPR) and Dense Hierarchical Retrieval (DHR), are evaluated and compared using top-K accuracy and Mean Reciprocal Rank (MRR). The results show that DHR, a retriever model utilising hierarchical passage selection through fine-tuning at both the document and passage levels, outperforms traditional methods in retrieving relevant technical information, achieving a Top-10 accuracy of 86.2%. Additionally, the Retriever-Augmented Generation (RAG) technique, used in the proposed QA system, is evaluated to demonstrate the benefits of using the hybrid dataset and the DHR. The proposed QA system, using the developed RAG model and the Generative Pretrained Transformer (GPT)-4, achieves a 14% improvement in answer accuracy, when compared to a previous benchmark on the same dataset.
https://arxiv.org/abs/2410.19790

### Python Code to Load Dataset

### Git Clone Load Dataset

-> git clone https://huggingface.co/datasets/thainasaraiva/Telco-DPR

```python
from datasets import DatasetDict, concatenate_datasets, load_dataset

# Load the small and extended corpus splits from the cloned repository.
corpus_ds = load_dataset(
    'parquet',
    data_dir='./Telco-DPR/corpus',
    data_files={'corpus': 'corpus-00000-of-00001.parquet'},
)
corpus_extend_ds = load_dataset(
    'parquet',
    data_dir='./Telco-DPR/corpus',
    data_files={'extended': 'extended-00000-of-00001.parquet'},
)

# Merge both splits into a single 'corpus' split.
corpus_ds = DatasetDict({
    'corpus': concatenate_datasets([corpus_ds['corpus'], corpus_extend_ds['extended']])
})

# Load the synthetic queries.
queries_ds = load_dataset(
    'parquet',
    data_dir='./Telco-DPR/queries',
    data_files={'queries': 'queries-00000-of-00001.parquet'},
)

# Load the query-to-document relevance judgments (train and test splits).
relevant_docs_ds = load_dataset(
    'parquet',
    data_dir='./Telco-DPR/relevant_docs',
    data_files={
        'train': 'train-00000-of-00001.parquet',
        'test': 'test-00000-of-00001.parquet',
    },
)
```
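The description above reports retrieval quality as top-K accuracy and Mean Reciprocal Rank (MRR). As a minimal sketch of how these two metrics are commonly computed, the example below uses hypothetical query and document IDs and hand-written rankings; it does not read the Telco-DPR files, and the paper's exact evaluation protocol may differ. The function names `top_k_accuracy` and `mean_reciprocal_rank` are illustrative, not part of any library.

```python
from typing import Dict, List, Set


def top_k_accuracy(ranked: Dict[str, List[str]],
                   relevant: Dict[str, Set[str]],
                   k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not ranked:
        return 0.0
    hits = sum(
        1
        for qid, docs in ranked.items()
        if any(doc in relevant.get(qid, set()) for doc in docs[:k])
    )
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: Dict[str, List[str]],
                         relevant: Dict[str, Set[str]]) -> float:
    """Average of 1/rank of the first relevant document (0 if none is retrieved)."""
    if not ranked:
        return 0.0
    total = 0.0
    for qid, docs in ranked.items():
        for rank, doc in enumerate(docs, start=1):
            if doc in relevant.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(ranked)


# Hypothetical retriever output: query ID -> document IDs, best first.
ranked = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
}
# Hypothetical relevance judgments: query ID -> set of relevant document IDs.
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d0"}}

top1 = top_k_accuracy(ranked, relevant, k=1)
top2 = top_k_accuracy(ranked, relevant, k=2)
mrr = mean_reciprocal_rank(ranked, relevant)
print(f"top-1={top1:.3f} top-2={top2:.3f} MRR={mrr:.3f}")
```

In this toy data, `q3` has no relevant document among its retrieved results, so it contributes zero to both metrics.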
Provider:
ismailduru