TelecomX
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://data.mendeley.com/datasets/3t6rbtcms8
Resource description:
**Short description**
The task is to analyze subscribers' Internet traffic consumption using the volumes of transmitted and received traffic exported from communication equipment (switches). The subject-area data are provided as files of synthetic data. The full description, with diagrams, is in the attachment. Additional dataset options are available at https://github.com/d-yacenko/dataset
For entities such as Client, Physical, Company, Plan, Subscriber, and PSXAttrs, data are provided as a current
snapshot: one file per entity, holding its current data.
Data from switches are exported every 10 minutes. For example, with 10 switches, 24*6*10 = 1440 files are
exported per day. File names contain the switch name and export time. Alternatively, data could be streamed
from equipment via systems like Kafka.
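The export arithmetic above can be checked with a minimal sketch. The function names are hypothetical and not part of the dataset. The sketch assumes that every switch writes exactly one file per 10-minute slot and that no exports are missing.

```python
from datetime import datetime, timedelta

EXPORT_INTERVAL_MIN = 10  # from the description: one export every 10 minutes


def exports_per_day(n_switches: int) -> int:
    """Files per day, assuming one file per switch per 10-minute slot."""
    slots_per_day = 24 * 60 // EXPORT_INTERVAL_MIN  # 144 slots
    return slots_per_day * n_switches


def export_times(day: datetime) -> list[datetime]:
    """Start times of all 10-minute export slots on the given day."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    slots = 24 * 60 // EXPORT_INTERVAL_MIN
    return [start + timedelta(minutes=EXPORT_INTERVAL_MIN * i) for i in range(slots)]


print(exports_per_day(10))  # 24*6*10, the example from the description
```

Under the same assumption, 6 switches over 7 days would give `exports_per_day(6) * 7` switch files per dataset variant.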
This work presents three dataset variants — exports from 6 switches over 7 days for operators of varying
sizes:
• telecom10k - operator with 10,000 subscribers (51MB),
• telecom100k - operator with 100,000 subscribers (696MB),
• telecom1000k - operator with 1,000,000 subscribers (7.2GB).
The task is to analyze the presented data on Internet traffic consumption by subscribers, by examining the
data on the volume of transmitted and received traffic from the communication equipment (switches) exports.
During the analysis, each subscriber's current traffic consumption must be compared with that subscriber's
retrospective consumption; if atypical consumption is detected, the subscriber is presumed hacked. A data
showcase table (data mart) should be constructed for each hour of data from the switches, i.e., the number of
showcases should equal the length in hours of the period for which operational data were exported,
e.g., 24*7 = 168 for 7 days. The showcase should present the following data:
• Time,
• Client name,
• Client contract number,
• Contact data for communication with the client,
• Presumed hacking status (hacked/clear),
• Justification of the presumed hacking status (brief history of traffic consumption).
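The showcase fields listed above could be modeled as one record per subscriber per hour. This is a sketch only: the class name, field names, and types are assumptions, since the description lists the fields but not a schema. The helper enumerates the one-hour windows that each get their own showcase.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ShowcaseRow:
    """One row of an hourly data mart (field names are hypothetical)."""
    time: datetime        # start of the one-hour window
    client_name: str
    contract_number: str
    contact: str          # contact data for reaching the client
    status: str           # "hacked" or "clear"
    justification: str    # brief history of traffic consumption


def hourly_windows(start: datetime, hours: int) -> list[tuple[datetime, datetime]]:
    """Half-open [begin, end) windows, one per data mart."""
    return [
        (start + timedelta(hours=h), start + timedelta(hours=h + 1))
        for h in range(hours)
    ]


# 7 days of exports give 24*7 one-hour windows; the start date is arbitrary.
windows = hourly_windows(datetime(2024, 1, 1), 24 * 7)
print(len(windows))
```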
The methods required for this task include data cleaning, data loading, data mart calculation, etc. In addition
to data analysis, it is important to apply data governance practices to control data quality at all stages of the
analysis (data quality), determine data origins (data lineage), and describe the glossary of the subject area.
Expected Result. Based on the available data, a set of data marts with a calculation interval of 1 hour of
input data should be constructed, containing information on consumer traffic consumption and signs of suspected
hacking.
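The description does not say how atypical consumption should be detected. As one possible illustration, the sketch below flags an hour whose traffic exceeds the subscriber's historical mean by a margin. The rule itself, the function name, and the parameters `k`, `rel_floor`, and `min_history` are all assumptions, not the dataset author's method.

```python
from statistics import mean, pstdev


def presumed_status(history: list[float], current: float,
                    k: float = 3.0, rel_floor: float = 0.5,
                    min_history: int = 3) -> tuple[str, str]:
    """Return (status, justification) for one subscriber and one hour.

    history: traffic volumes of earlier hours; current: this hour's volume.
    Assumed rule: flag when current > mean + max(k * std, rel_floor * mean).
    """
    if len(history) < min_history:
        return "clear", "insufficient history"
    mu = mean(history)
    sigma = pstdev(history)
    # The relative floor avoids flagging tiny increases when history is flat.
    threshold = mu + max(k * sigma, rel_floor * mu)
    status = "hacked" if current > threshold else "clear"
    justification = (
        f"current={current:.0f}, mean={mu:.0f}, "
        f"std={sigma:.0f}, threshold={threshold:.0f}"
    )
    return status, justification


status, why = presumed_status([100, 110, 90, 105, 95], 1000)
print(status, why)
```

The justification string doubles as the "brief history of traffic consumption" field of the showcase. A real pipeline would likely also account for time-of-day and day-of-week patterns.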
Created:
2024-05-07



