KGCW 2024 Challenge @ ESWC 2024
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link: https://zenodo.org/record/10721874
Description:
Knowledge Graph Construction Workshop 2024: challenge
Knowledge graph construction from heterogeneous data has seen a lot of uptake in the last decade, from compliance to performance optimizations with respect to execution time. However, besides execution time, other metrics for comparing knowledge graph construction systems, e.g. CPU or memory usage, are usually not considered. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics such as execution time, CPU, memory usage, or a combination of these metrics.
Task description
The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state of the art of the existing tools and the baseline results provided by this challenge. This challenge is not limited to execution times to create the fastest pipeline, but also covers computing resources to achieve the most efficient pipeline.
We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics necessary for this challenge, such as execution time, CPU and memory usage, as CSV files. Moreover, information about the hardware used during the execution of the pipeline is available as well, to allow a fair comparison of different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool has already been tested with existing systems, relational databases (e.g. MySQL and PostgreSQL), and triplestores (e.g. Apache Jena Fuseki and OpenLink Virtuoso), which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool, or our tool imposes technical requirements you cannot solve, please contact us directly.
Track 1: Conformance
The set of new specifications for the RDF Mapping Language (RML) established by the W3C Community Group on Knowledge Graph Construction provides a set of test-cases for each module:
RML-Core
RML-IO
RML-CC
RML-FNML
RML-Star
These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track is in Beta status because these new specifications have not seen any implementation yet, thus it may contain bugs and issues. If you find problems with the mappings, output, etc. please report them to the corresponding repository of each module.
Note: validating the output of the RML Star module automatically through the provided tooling is currently not possible, see https://github.com/kg-construct/challenge-tool/issues/1.
Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.
Track 2: Performance
Part 1: Knowledge Graph Construction Parameters
These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline.
Data
Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).
Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).
Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).
Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).
Number of input files: scaling the number of datasets (1, 5, 10, 15).
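The data parameters above can be illustrated with a small generator. This is only a sketch of how such synthetic CSV input could be produced, not the generator used by the challenge: the column names (p1, p2, ...), the value pattern, and the choice to place empty and duplicate rows at the start of the file are all assumptions made for illustration.

```python
import csv
import io


def generate_rows(n_records, n_properties, duplicate_pct=0, empty_pct=0):
    """Yield a header row followed by n_records synthetic data rows.

    Hypothetical layout (not taken from the challenge data):
    - columns are named p1..pN,
    - the first empty_pct percent of rows contain only empty strings,
    - the next duplicate_pct percent of rows repeat one fixed value per column,
    - all remaining rows contain values that are unique per cell.
    """
    columns = range(1, n_properties + 1)
    yield [f"p{c}" for c in columns]
    n_empty = n_records * empty_pct // 100
    n_dup = n_records * duplicate_pct // 100
    for r in range(n_records):
        if r < n_empty:
            yield [""] * n_properties
        elif r < n_empty + n_dup:
            yield [f"dup_{c}" for c in columns]
        else:
            yield [f"v{r}_{c}" for c in columns]


def to_csv(rows):
    """Serialize an iterable of rows to CSV text, header first."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

For example, to_csv(generate_rows(100_000, 20, duplicate_pct=25)) would produce a file shaped like the 25 percent duplicate-values case, under the assumptions stated above.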
Mappings
Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).
Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).
Number and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M).
Part 2: GTFS-Madrid-Bench
The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid.
Scaling
GTFS-1 SQL
GTFS-10 SQL
GTFS-100 SQL
GTFS-1000 SQL
Heterogeneity
GTFS-100 XML + JSON
GTFS-100 CSV + XML
GTFS-100 CSV + JSON
GTFS-100 SQL + XML + JSON + CSV
Example pipeline
The ground truth dataset and baseline results are generated in different steps for each parameter:
The provided CSV files and SQL schema are loaded into a MySQL relational database.
Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in RDF, serialized as N-Triples.
The pipeline is executed 5 times, from which the median execution time of each step is calculated and reported. Each step with the median execution time is then reported in the baseline results with all its measured metrics. The knowledge graph construction timeout is set to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool. You can adapt the execution plans for this example pipeline to your own needs.
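The median selection described above can be sketched as follows. This is an illustration under assumptions, not the challenge tool's code: the dictionary layout and the field name execution_time are hypothetical. For each step, the sketch keeps the complete measurement of the run whose execution time is the median, so that its other metrics are reported together with it.

```python
def median_measurement(measurements):
    """Return the measurement whose execution time is the median.

    With an odd number of runs (5 in the example pipeline) this is the
    middle element after sorting; with an even number it is the lower
    of the two middle elements.
    """
    ordered = sorted(measurements, key=lambda m: m["execution_time"])
    return ordered[(len(ordered) - 1) // 2]


def baseline(runs):
    """Select, per step, the measurement with the median execution time.

    runs is a list with one dict per pipeline execution, mapping a step
    name to that step's measurement (hypothetical structure).
    """
    return {
        step: median_measurement([run[step] for run in runs])
        for step in runs[0]
    }
```

Note that different steps may take their reported values from different runs, since the median is determined per step.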
Each parameter has its own directory in the ground truth dataset with the following files:
Input dataset as CSV.
Mapping file as RML.
Execution plan for the pipeline in metadata.json.
Datasets
Knowledge Graph Construction Parameters
The dataset consists of:
Input dataset as CSV for each parameter.
Mapping file as RML for each parameter.
Baseline results for each parameter with the example pipeline.
Ground truth dataset for each parameter generated with the example pipeline.
Format
All input datasets are provided as CSV. Depending on the parameter that is being evaluated, the number of rows and columns may differ. The first row is always the header of the CSV.
GTFS-Madrid-Bench
The dataset consists of:
Input dataset as CSV with SQL schema for the scaling, and a combination of XML, CSV, and JSON for the heterogeneity.
Mapping file as RML for both scaling and heterogeneity.
SPARQL queries to retrieve the results.
Baseline results with the example pipeline.
Ground truth dataset generated with the example pipeline.
Format
CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.
Evaluation criteria
Submissions must evaluate the following metrics:
Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.
CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.
Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum calculated of the memory consumption during the execution of a step.
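The three metric definitions above can be expressed compactly. The following is a minimal sketch, assuming a step is observed as a series of samples taken between its begin and end; the Sample type and its field names are hypothetical and do not reflect the CSV layout produced by the challenge tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation taken while a step is running (hypothetical)."""
    wall_time: float  # wall-clock timestamp in seconds
    cpu_time: float   # cumulative CPU time in seconds
    memory: int       # memory in use at this moment, e.g. in kilobytes


def step_metrics(samples):
    """Compute the metrics of one step from its time-ordered samples."""
    first, last = samples[0], samples[-1]
    return {
        # Difference between the end and begin time of the step.
        "execution_time": last.wall_time - first.wall_time,
        # Difference between the end and begin CPU time of the step.
        "cpu_time": last.cpu_time - first.cpu_time,
        # Extremes of the memory consumption observed during the step.
        "memory_min": min(s.memory for s in samples),
        "memory_max": max(s.memory for s in samples),
    }
```

The first and last samples stand in for the begin and end of the step, so at least one sample is required.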
Expected output
Duplicate values
Scale | Number of Triples
0 percent | 2000000 triples
25 percent | 1500020 triples
50 percent | 1000020 triples
75 percent | 500020 triples
100 percent | 20 triples
Empty values
Scale | Number of Triples
0 percent | 2000000 triples
25 percent | 1500000 triples
50 percent | 1000000 triples
75 percent | 500000 triples
100 percent | 0 triples
Mappings
Scale | Number of Triples
1TM + 15POM | 1500000 triples
3TM + 5POM | 1500000 triples
5TM + 3POM | 1500000 triples
15TM + 1POM | 1500000 triples
Properties
Scale | Number of Triples
1M rows 1 column | 1000000 triples
1M rows 10 columns | 10000000 triples
1M rows 20 columns | 20000000 triples
1M rows 30 columns | 30000000 triples
Records
Scale | Number of Triples
10K rows 20 columns | 200000 triples
100K rows 20 columns | 2000000 triples
1M rows 20 columns | 20000000 triples
10M rows 20 columns | 200000000 triples
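The counts for records, properties, empty values, and duplicate values listed so far follow a simple pattern, which the sketch below reproduces. It rests on assumptions that are inferred from the numbers and not stated in the challenge description: every non-empty cell yields one triple, the empty-value and duplicate-value datasets have 100K rows and 20 columns (consistent with their 0 percent count of 2000000), and all duplicated cells of a column share a single value, so they collapse into one triple because an RDF graph is a set.

```python
def triples_plain(rows, columns):
    """One triple per cell (assumed)."""
    return rows * columns


def triples_with_empty(rows, columns, empty_pct):
    """Empty cells are assumed to produce no triple."""
    return rows * columns * (100 - empty_pct) // 100


def triples_with_duplicates(rows, columns, duplicate_pct):
    """Duplicated cells in a column are assumed to collapse into one triple."""
    distinct_per_column = rows * (100 - duplicate_pct) // 100
    if duplicate_pct > 0:
        distinct_per_column += 1  # the single shared duplicate value
    return columns * distinct_per_column
```

Under these assumptions, triples_with_duplicates(100_000, 20, 25) gives 1500020 and triples_with_empty(100_000, 20, 25) gives 1500000, matching the corresponding expected counts.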
Joins
1-1 joins
Scale | Number of Triples
0 percent | 0 triples
25 percent | 125000 triples
50 percent | 250000 triples
75 percent | 375000 triples
100 percent | 500000 triples
1-N joins
Scale | Number of Triples
1-10 0 percent | 0 triples
1-10 25 percent | 125000 triples
1-10 50 percent | 250000 triples
1-10 75 percent | 375000 triples
1-10 100 percent | 500000 triples
1-5 50 percent | 250000 triples
1-10 50 percent | 250000 triples
1-15 50 percent | 250005 triples
1-20 50 percent | 250000 triples
N-1 joins
Scale | Number of Triples
10-1 0 percent | 0 triples
10-1 25 percent | 125000 triples
10-1 50 percent | 250000 triples
10-1 75 percent | 375000 triples
10-1 100 percent | 500000 triples
5-1 50 percent | 250000 triples
10-1 50 percent | 250000 triples
15-1 50 percent | 250005 triples
20-1 50 percent | 250000 triples
N-M joins
Scale | Number of Triples
5-5 50 percent | 1374085 triples
10-5 50 percent | 1375185 triples
5-10 50 percent | 1375290 triples
5-5 25 percent | 718785 triples
5-5 50 percent | 1374085 triples
5-5 75 percent | 1968100 triples
5-5 100 percent | 2500000 triples
5-10 25 percent | 719310 triples
5-10 50 percent | 1375290 triples
5-10 75 percent | 1967660 triples
5-10 100 percent | 2500000 triples
10-5 25 percent | 719370 triples
10-5 50 percent | 1375185 triples
10-5 75 percent | 1968235 triples
10-5 100 percent | 2500000 triples
GTFS Madrid Bench
Generated Knowledge Graph
Scale | Number of Triples
1 | 395953 triples
10 | 3959530 triples
100 | 39595300 triples
1000 | 395953000 triples
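The expected GTFS triple counts grow linearly with the scale factor, which makes a quick sanity check of a generated N-Triples file possible. The sketch below is an approximation and not part of the challenge tooling: it treats each distinct non-empty, non-comment line as one triple, which can miscount when the same triple is serialized in different ways or when blank nodes are involved. The function names are chosen for illustration.

```python
GTFS_TRIPLES_AT_SCALE_1 = 395953  # expected count for scale 1


def expected_gtfs_triples(scale):
    """Expected number of triples, assuming linear growth with the scale."""
    return GTFS_TRIPLES_AT_SCALE_1 * scale


def count_distinct_triples(lines):
    """Count distinct N-Triples statements in an iterable of lines.

    Blank lines and comment lines (starting with '#') are ignored.
    Lines are compared as text after stripping surrounding whitespace.
    """
    triples = set()
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            triples.add(stripped)
    return len(triples)
```

An open file object can be passed directly as lines, so a large output file does not need to be read into a list first, although the set of distinct lines is still held in memory.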
Queries
Query | Scale 1 | Scale 10 | Scale 100 | Scale 1000
Q1 | 58540 results | 585400 results | No results available | No results available
Q2 | 636 results | 11998 results | 125565 results | 1261368 results
Q3 | 421 results | 4207 results | 42067 results | 420667 results
Q4 | 13 results | 130 results | 1300 results | 13000 results
Q5 | 35 results | 350 results | 3500 results | 35000 results
Q6 | 1 result | 1 result | 1 result | 1 result
Q7 | 68 results | 67 results | 67 results | 53 results
Q8 | 35460 results | 354600 results | No results available | No results available
Q9 | 130 results | 1300 results | 13000 results | 130000 results
Q10 | 1 result | 1 result | 1 result | 1 result
Q11 | 130 results | 260 results | 260 results | 260 results
Q12 | 13 results | 130 results | 1300 results | 13000 results
Q13 | 265 results | 2650 results | 26500 results | 265000 results
Q14 | 2234 results | 22340 results | 223400 results | No results available
Q15 | 592 results | 8684 results | 35502 results | 206628 results
Q16 | 390 results | 780 results | 260 results | 780 results
Q17 | 855 results | 8550 results | 85500 results | 855000 results
Q18 | 104 results | 1300 results | 13000 results | 130000 results
Created: 2024-06-11



