
nareshmodina/TeleSpec-Data

Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/nareshmodina/TeleSpec-Data
Resource description:
---
license: cc-by-nc-4.0
language:
- en
tags:
- telecom
- LLM
- ETSI
- 3GPP
- standards
- continual-pretraining
task_categories:
- text-generation
size_categories:
- 10M<n<100M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*/*.parquet
- config_name: 3gpp-standard
  data_files:
  - split: train
    path: data/3gpp-standard/*.parquet
- config_name: etsi-standard
  data_files:
  - split: train
    path: data/etsi-standard/*.parquet
---

# TeleSpec-Data

## Dataset Summary

TeleSpec-Data is a dataset of telecommunications standards documents from two major bodies: ETSI and 3GPP. It is designed for continual pretraining of language models on telecom domain knowledge.

The dataset consists of two categories:

- **3gpp-standard**: 15,054 3GPP technical specifications and reports covering Release 8 through Release 19 (updated to April 2025), sourced from [TSpec-LLM](https://huggingface.co/datasets/rasoul-nikbakht/TSpec-LLM).
- **etsi-standard**: 23,248 ETSI documents spanning 15 working groups (TS, TR, EN, ES, GS, GR, EG, ETR, ETS, GTS, and others), covering years 2000–2024, extracted from the PDF corpus in [NetSpec-LLM](https://huggingface.co/datasets/rasoul-nikbakht/NetSpec-LLM).

**Total: 38,302 documents.**

## Dataset Structure

### Data Fields

- `id`: Unique identifier for each document.
- `category`: Category of the document, either `3gpp-standard` or `etsi-standard`.
- `content`: Full text of the document. Sections are separated by ` \n `.
- `metadata`: JSON string with document-level metadata. Fields vary by category.

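Because `metadata` is stored as a JSON string and `content` packs all sections into one string, a record usually needs a small decoding step before use. The following is a minimal sketch, not part of the dataset's tooling: `parse_record` and `SECTION_SEPARATOR` are hypothetical names, the record is a hand-built copy of the card's 3GPP example, and the separator is assumed to be a newline surrounded by single spaces.

```python
import json

# Assumption: sections are delimited by a newline with one space on each side,
# as the `content` field description suggests.
SECTION_SEPARATOR = " \n "

# Hand-built record mirroring the card's 3GPP example (not loaded from the Hub).
record = {
    "id": "3gpp_24229_Rel-17_hc0",
    "category": "3gpp-standard",
    "content": "3GPP TS 24.229 Release 17 Vhc0 \n 1 Scope \n The present document specifies...",
    "metadata": json.dumps({
        "doc_number": "24229",
        "series": "24",
        "release": "17",
        "version": "hc0",
        "filename": "24229-hc0.md",
        "series_dir": "24_series",
    }),
}


def parse_record(rec):
    """Split `content` into sections and decode the `metadata` JSON string."""
    sections = [
        part.strip()
        for part in rec["content"].split(SECTION_SEPARATOR)
        if part.strip()  # drop empty fragments
    ]
    metadata = json.loads(rec["metadata"])
    return sections, metadata


sections, metadata = parse_record(record)
print(len(sections), "sections; first:", sections[0])
print("release:", metadata["release"])
```

Since metadata fields vary by category, `metadata.get(...)` is a safer way to read keys when processing both subsets in one pass.
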
### Data Instances

An example from the 3GPP subset:

```
{
  "id": "3gpp_24229_Rel-17_hc0",
  "category": "3gpp-standard",
  "content": "3GPP TS 24.229 Release 17 Vhc0 \n 1 Scope \n The present document specifies...",
  "metadata": "{\"doc_number\": \"24229\", \"series\": \"24\", \"release\": \"17\", \"version\": \"hc0\", \"filename\": \"24229-hc0.md\", \"series_dir\": \"24_series\"}"
}
```

An example from the ETSI subset:

```
{
  "id": "etsi_28737",
  "category": "etsi-standard",
  "content": "ETSI TS 132 371 V7.3.1 (2008-07) \n Security Management concept and requirements \n 1 Scope \n The present document specifies...",
  "metadata": "{\"working_group\": \"TS\", \"year\": 2008, \"deliverable_type\": \"TS\", \"version\": \"7.3.1\", \"title\": \"Security Management concept and requirements\"}"
}
```

## Sample Code

```python
import json
from datasets import load_dataset

# Full dataset
ds = load_dataset("NareshModina/TeleSpec-Data")

# Single category
ds = load_dataset("NareshModina/TeleSpec-Data", name="etsi-standard")
ds = load_dataset("NareshModina/TeleSpec-Data", name="3gpp-standard")

sample = ds["train"][0]
print(f"ID: {sample['id']}\nCategory: {sample['category']}\nContent: {sample['content'][:200]}")
for key, value in json.loads(sample["metadata"]).items():
    print(f"{key}: {value}")
```

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{modina2025telespecdata,
  author    = {Naresh Modina},
  title     = {TeleSpec-Data: A Telecommunications Standards Dataset for Language Model Pretraining},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/NareshModina/TeleSpec-Data}
}
```

Please also cite the upstream sources:

```bibtex
@misc{nikbakht2024tspecllm,
  title         = {TSpec-LLM: An Open-source Dataset for LLM Understanding of 3GPP Specifications},
  author        = {Rasoul Nikbakht and Mohamed Benzaghta and Giovanni Geraci},
  year          = {2024},
  eprint        = {2406.01768},
  archivePrefix = {arXiv},
  primaryClass  = {cs.NI},
  url           = {https://arxiv.org/abs/2406.01768}
}
```

ETSI document corpus sourced from [rasoul-nikbakht/NetSpec-LLM](https://huggingface.co/datasets/rasoul-nikbakht/NetSpec-LLM) (CC BY-NC 4.0).
Provided by:
nareshmodina