Cleaned LargeRDFBench dumps
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/5008279
Description:
Dumps for each of the LargeRDFBench datasets in two formats, plus checksum files:
A single N-Triples file (no prefixes, no unquoted numbers/booleans) compressed with zstd.
An HDT file with a sidecar index file (.hdt.index.v1-1) for faster querying.
.mark files, which are JSON files storing the SHA-256 hashes of the above files and of the input dump file from the original LargeRDFBench.
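Since the .mark files store SHA-256 hashes, a downloaded dump can be checked against them. The following is a minimal sketch of the hashing step only; the function name `sha256_of_file` is hypothetical, and the key layout inside the .mark JSON is not described here, so inspect a .mark file before comparing values.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that multi-gigabyte dumps do not
    have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can then be compared with the corresponding hash found in the .mark file for that dataset.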
The files in this dataset were generated using the "fix" subcommand of the freqel-driver command-line utility, at commit 47cea26 of said repository. Nearly all of the cleanup code, however, comes from rdfit version 1.0.6, which is available from Maven Central.
There are four reasons to use this dataset as a substitute for the original:
Flatter file structure: there is a single file per dataset
All data is in N-Triples (no RDF/XML or Turtle syntax in .nt-named files)
Valid IRIs and valid N-Triples syntax (no parser errors, at most warnings)
Provided .hdt files are directly queryable
Since the original dumps have syntax errors and invalid IRIs, there are multiple ways to handle such issues, and this dataset represents one set of choices for handling them. For example, the Virtuoso endpoints of the original (as of commit 49d1401) ingest and expose invalid IRIs and langtags without complaining. Thus, there are SPARQL queries for which the results obtained using this cleaned version and the original Virtuoso endpoint bundles will differ. As far as we know, this does not affect the LargeRDFBench SPARQL queries.
No triples were discarded in the cleaning process; rather, triples with invalid IRIs (as per RFC 3987) and invalid language tags are mapped to valid counterparts. Literals are mostly unaffected, except for one particular syntax violation in the Affymetrix dataset: non-escaped null characters (U+0000) in lexical forms were replaced with spaces (U+0020) to make HDT files possible.
The syntax fixes were made using the RIt.tolerant() functionality of the rdfit library, version 1.0.6. The list of transformations (beyond flattening the file structure and storing as N-Triples and HDT) was:
Percent-encode characters not allowed at their current position in the IRI by RFC 3987.
If percent-encoding is not allowed at that position by RFC 3987 (e.g., in the port rule), the character is erased.
Erase invalid character encodings (byte sequences so corrupted that they do not decode to the wrong character but are outright invalid UTF-8).
Replace '_' in language tags with '-' (e.g., en_US becomes en-US)
For NT/Turtle, \-escape occurrences of \r (0x0D) and \n (0x0A) inside single-quoted lexical forms.
For NT/Turtle, replace \ with \\ in any \x-escape where x is not in tbnrf"' (see ECHAR).
For NT/Turtle, identify UCHAR escape sequences that represent a UTF-8 encoding instead of a Unicode code point. Such sequences are composed only of byte-sized code points whose values, taken in order, form a valid UTF-8 sequence, and where at least one such byte has a value that is the code point of a control character. Given such conditions, the sequence of UCHARs is replaced by a single UCHAR for the character encoded in UTF-8. Example: \x00C3\x0085, which corresponds to Å in UTF-8, becomes \u00C5 since U+0085 is a control character.
For NT/Turtle, @PREFIX and @BASE are rewritten to @prefix and @base
For NT/Turtle, literals true and false with any variation in case (e.g., True) are replaced with the standard true and false.
For NT/Turtle, a lexical form followed by a datatype IRI with no space in between, or with a number of ^ characters other than 2, is rewritten to use ^^.
For NT/Turtle, replace invalid unquoted plain literals with plain string literals. For this, the code assumes the invalid unquoted literal has no spaces (i.e., whitespace is a separator and never part of the invalid literal). Examples of this fix in action: t:chromosome X becomes t:chromosome "X", 2e-3.4 becomes "2e-3.4" (the exponent must be an integer) and falseful becomes "falseful".
Strip leading whitespace, %20, %09, %0A and %0D, and strip underscores at any position, from IRI schemes. Affymetrix and Jamendo are affected.
For Turtle/NT/TriG, replace NULL characters (U+0000) in string literals with spaces (U+0020). Use case: only Affymetrix.
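Two of the transformations above can be illustrated with a short Python sketch. This is not the rdfit implementation (rdfit is a Java library); the function names are hypothetical and the logic makes simplifying assumptions: only \uXXXX escapes are considered, a run of escapes is checked as a whole rather than character by character, and an escaped backslash preceding a \u sequence is not handled.

```python
import re

# A run of two or more \u00XX escapes, i.e. UCHARs whose code point fits in a byte.
_BYTE_UCHAR_RUN = re.compile(r"(?:\\u00[0-9A-Fa-f]{2}){2,}")


def fix_langtag(tag):
    """Replace '_' with '-' in a language tag, e.g. en_US -> en-US."""
    return tag.replace("_", "-")


def _is_control(byte):
    # C0 controls (U+0000..U+001F), DEL (U+007F) and C1 controls (U+0080..U+009F).
    return byte <= 0x1F or 0x7F <= byte <= 0x9F


def _uchar(code_point):
    # Emit \uXXXX for the BMP and \UXXXXXXXX for higher code points.
    if code_point <= 0xFFFF:
        return "\\u%04X" % code_point
    return "\\U%08X" % code_point


def _repair_run(match):
    text = match.group(0)
    # Each escape is 6 characters long: backslash, 'u', then 4 hex digits.
    data = bytes(int(text[i + 2:i + 6], 16) for i in range(0, len(text), 6))
    try:
        decoded = data.decode("utf-8")
    except UnicodeDecodeError:
        return text  # not a valid UTF-8 sequence: leave the escapes untouched
    if not any(_is_control(b) for b in data):
        return text  # ambiguous: the escapes could be genuine Latin-1 characters
    return "".join(_uchar(ord(ch)) for ch in decoded)


def repair_utf8_uchars(lexical_form):
    """Collapse runs of byte-sized UCHARs that are really UTF-8 bytes."""
    return _BYTE_UCHAR_RUN.sub(_repair_run, lexical_form)
```

For instance, `fix_langtag("en_US")` yields `en-US`, and a run of byte-sized escapes for the bytes C3 85 is collapsed into the single escape for U+00C5, while a run such as C3 A9 is left alone because neither byte is the code point of a control character.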
Changelog
1.0.1: Re-generated LMDB.index.v1-1 to fix wrong results on queries with unbound subject, owl:sameAs predicate and bound object.
Created:
2022-04-18



