A hybrid approach to the small unannotated corpus-based language comparison and its application to the Old East Slavic charters - Supplementary material 5 (Corpus-based language distance measurement results)
Download link:
https://zenodo.org/record/14169774
Resource description:
General description
These are the results of experiments that apply a corpus-based language distance measurement package to Old East Slavic, modern East Slavic, and modern standard Slavic lects. There are 40 possible experiments for each data set, which differ in the use of:
topic antimodelling heuristic,
Soerensen-Dice coefficient-based normalisation,
hybridisation of the frequency-based metric for coinciding units with the combined frequency-based metric and string similarity measure for non-coinciding units,
hybridisation type,
exact type of string similarity measure used for combination,
alphabet entropy-based normalisation for vector-based string similarity measures.
In addition, the modern standard Slavic data set is measured four times, each time with a different share of its size (0.1, 0.3, 0.6 and 1).
For further information on each of the experiment parameters, refer to the documentation of the package.
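The count of 40 experiments per data set can be reproduced with a short enumeration. The sketch below is one reading of the parameter descriptions in this record, not code from the package: it assumes that experiments without hybridisation vary only in the Soerensen-Dice normalisation and use no auxiliary string similarity measure (represented here as None), and the function name experiment_grid is hypothetical.

```python
from itertools import product


def experiment_grid():
    """Enumerate parameter combinations for one data set (an assumed reading)."""
    # (auxiliary metric, alphabet normalisation): the alphabet entropy-based
    # normalisation applies only to the vector-based measures VDND and VWJDND.
    aux_settings = [("LDND", "NOT_USED"), ("WJWDND", "NOT_USED")]
    aux_settings += [(m, a) for m in ("VDND", "VWJDND") for a in ("TRUE", "FALSE")]

    # (hybridisation type, Soerensen-Dice normalisation): ARRAY leaves no
    # separate value to normalise, so the setting is NOT_USED there.
    hybrid_settings = [("JOINED", "TRUE"), ("JOINED", "FALSE"), ("ARRAY", "NOT_USED")]

    grid = []
    for topic_antimodelling in (1, 0):
        # Assumption: without hybridisation only the Soerensen-Dice
        # normalisation varies and no auxiliary metric is involved.
        for soerensen in ("TRUE", "FALSE"):
            grid.append((topic_antimodelling, "FALSE", "NOT_USED", soerensen, "NOT_USED", None))
        # With hybridisation: 3 hybridisation settings x 6 auxiliary settings.
        for (h_type, soerensen), (aux, alphabet) in product(hybrid_settings, aux_settings):
            grid.append((topic_antimodelling, "TRUE", h_type, soerensen, alphabet, aux))
    return grid


print(len(experiment_grid()))
```

Under these assumptions each topic antimodelling setting yields 2 + 3 x 6 = 20 combinations, which agrees with the 20-folder sequences in the data set.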
Data set structure
Executive summary
The data set consists of 240 folders that hold information on the experiments and one .csv file that aggregates the resulting values into a single table.
Folders with indices 1-20 and 121-140 contain experiments with 0.1 share of the modern standard Slavic dataset; the first sequence applies topic antimodelling heuristic, the second sequence does not employ it.
Folders with indices 21-40 and 141-160 contain experiments with 0.3 share of the modern standard Slavic dataset; the first sequence applies topic antimodelling heuristic, the second sequence does not employ it.
Folders with indices 41-60 and 161-180 contain experiments with 0.6 share of the modern standard Slavic dataset; the first sequence applies topic antimodelling heuristic, the second sequence does not employ it.
Folders with indices 61-80 and 181-200 contain experiments with the full share of the modern standard Slavic dataset; the first sequence applies topic antimodelling heuristic, the second sequence does not employ it.
Folders with indices 81-100 and 201-220 contain experiments with the full share of the modern East Slavic dataset; the first sequence applies topic antimodelling heuristic, the second sequence does not employ it.
Folders with indices 101-120 and 221-240 contain experiments with the full share of the Old East Slavic dataset; the first sequence applies topic antimodelling heuristic, the second sequence does not employ it.
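The folder numbering above follows a regular pattern, so a folder index can be mapped back to its data set, share and heuristic setting. The following is a minimal sketch of that mapping; the function name describe_folder and the returned dictionary keys are hypothetical, and the data set labels are borrowed from the Material column of the aggregated table.

```python
# Six blocks of 20 folders, repeated twice (with and without the heuristic).
BLOCKS = [
    ("Slavic standard", 0.1),
    ("Slavic standard", 0.3),
    ("Slavic standard", 0.6),
    ("Slavic standard", 1.0),
    ("Modern East Slavic", 1.0),
    ("Old East Slavic", 1.0),
]


def describe_folder(index):
    """Map a folder index (1-240) to data set, share and heuristic setting."""
    if not 1 <= index <= 240:
        raise ValueError("folder index must be between 1 and 240")
    # Folders 1-120 apply topic antimodelling; folders 121-240 do not.
    topic_antimodelling = index <= 120
    # Position of the 20-folder block within its half of the data set.
    block = ((index - 1) % 120) // 20
    material, share = BLOCKS[block]
    return {
        "material": material,
        "share": share,
        "topic_antimodelling": topic_antimodelling,
    }


print(describe_folder(145))
```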
.csv-file
The file is named aggregated_results.csv and lies in the root of the data set. The separator is a comma (,). It contains 13 columns and 241 rows. The first row is the header; the other 240 rows each describe one conducted experiment and give its resulting values, according to the columns. The columns are the following (in left-to-right order):
X. (int) - experiment ID; column is used as index.
Material (string) - the data set used for language distance measurement. The possible values are:
Slavic standard - Croatian, Slovenian, Slovak standard lects.
Modern East Slavic - Northern Russian lect Megra, Central Russian lect Belogornoje, and Northern Belarusian lect Zialionka.
Old East Slavic - Novgorod, Polack and Smolensk parts of the Old East Slavic continuum.
Gensim (int) - the binary numeric indicator (0 or 1) of whether the topic antimodelling heuristic is used, namely, the removal of words that the gensim Latent Dirichlet Allocation implementation (Rehurek & Sojka, 2010) identified as topic words. The heuristic is intended to remove tokens that are characteristic of the genre of the texts in the corpus, and thereby increase the share of tokens that are characteristic of the lects themselves.
Split (float) - the share of the data set used (from 0 to 1); it serves to check the influence of the data set size on the metric efficiency.
Hybridisation (string) - the indicator of whether the experiment hybridises two components: the frequency-based metric over the 3-shingles (character 3-grams) that coincide for the compared lect pair, and the combination of the frequency-based metric and a string similarity measure over the 3-shingles that do not coincide for the compared lect pair. The possible values are:
TRUE: the experiment utilises hybridisation
FALSE: the experiment does not utilise hybridisation.
Hybridisation_type (string) - the indicator of how the frequency metric between coinciding 3-shingles and the combined metric between non-coinciding 3-shingles undergo the hybridisation process. The values are:
JOINED - the approach is to multiply the means of the two.
ARRAY - the approach is to join all the values into a single list, and then to score the mean.
NOT_USED - the experiment does not employ hybridisation (Hybridisation is FALSE).
Soerensen_normalisation (string) - the indicator of whether the frequency-based metric value is normalised through division by the Soerensen-Dice coefficient (a measure of the number of coincidences between two lists) (Soerensen, 1948), in order to compensate for the skew between the coinciding and non-coinciding 3-shingles of the lects. The values are:
NOT_USED - Hybridisation_type is ARRAY, so there are no values to use Soerensen-Dice coefficient on.
TRUE - the frequency-based metric undergoes division by the Soerensen-Dice coefficient.
FALSE - the algorithm does not apply the normalisation by the Soerensen-Dice coefficient.
Alphabet_normalisation (string) - the indicator of whether the algorithm applies normalisation by the alphabet entropy (Shannon, 1948), a measure of how differently skewed the symbol distributions are in the texts of the given lects. The values are:
NOT_USED - the heuristic is not applicable; this occurs either when Hybridisation is FALSE, or when the next parameter, Auxiliary_metrics, is neither VDND nor VWJDND.
TRUE - the experiment employs the heuristic.
FALSE - the experiment does not employ the heuristic.
Auxiliary_metrics (string) - the string similarity measure, used for the combination with the frequency-based metric for non-coinciding 3-shingles between analysed lects. There are five possible values:
LDND (Levenshtein distance normalised between analysed 3-shingles) (Holman et al., 2008).
WJWDND (weighted Jaro-Winkler distance normalised between analysed 3-shingles) (Gueddah et al., 2015).
VDND (Euclidean distance between the sums of symbol vector values between 3-shingles).
VWJDND (VDND multiplied by scoring Jaro (Jaro, 1989) distance between analysed 3-shingles).
Outgroup.identification (string) - the indicator of whether the outgroup detected in the given experiment coincides with the lect that preliminary manual classification supposes to be the outgroup. There are two possible values:
CORRECT - the detected outgroup coincides with the supposed one.
INCORRECT - the detected outgroup does not coincide with the supposed one.
Outer.distance.split (float) - the length of the outgroup branch.
Inner.distance.split (float) - the distance between the split between the outgroup and the ingroup, and the split between the two ingroup lects.
Split.difference (float) - the quotient of Outer.distance.split divided by Inner.distance.split.
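The aggregated table can be inspected with the Python standard library alone. The sketch below assumes that the header row uses exactly the column names listed above; the helper names load_results, correct_share_by and split_ratio are hypothetical, and the sample rows are invented for illustration and are not values from the data set.

```python
import csv
import io
from collections import defaultdict


def load_results(handle):
    """Read the aggregated table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def correct_share_by(rows, column):
    """Share of CORRECT outgroup identifications for each value of `column`."""
    counts = defaultdict(lambda: [0, 0])  # value -> [correct, total]
    for row in rows:
        tally = counts[row[column]]
        tally[1] += 1
        if row["Outgroup.identification"] == "CORRECT":
            tally[0] += 1
    return {value: correct / total for value, (correct, total) in counts.items()}


def split_ratio(row):
    """Outer.distance.split divided by Inner.distance.split (None if the divisor is 0)."""
    inner = float(row["Inner.distance.split"])
    return float(row["Outer.distance.split"]) / inner if inner else None


# Hypothetical rows for illustration only; not values from the data set.
SAMPLE = """X.,Material,Gensim,Split,Hybridisation,Hybridisation_type,Soerensen_normalisation,Alphabet_normalisation,Auxiliary_metrics,Outgroup.identification,Outer.distance.split,Inner.distance.split,Split.difference
1,Slavic standard,1,0.1,TRUE,JOINED,TRUE,NOT_USED,LDND,CORRECT,0.4,0.1,4.0
2,Slavic standard,1,0.1,TRUE,ARRAY,NOT_USED,TRUE,VDND,INCORRECT,0.3,0.2,1.5
"""

rows = load_results(io.StringIO(SAMPLE))
print(correct_share_by(rows, "Hybridisation_type"))
print([split_ratio(row) for row in rows])
```

For the real file, the in-memory sample can be replaced with open("aggregated_results.csv", newline="") passed to load_results.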
Folders
Each folder contains 6 files, each named after the experiment setup used:
3 .csv-files that contain the unit-by-unit comparison between each pair of the analysed lects. Each .csv-file is semicolon-separated and has 4 columns, a header row, and rows that each describe one unit-to-unit comparison. The columns contain the following information (in left-to-right order):
[Name of the first compared lect] : unit (character 3-shingle, or just 3-shingle) of the [name of the first compared lect] that undergoes comparison with units of the [name of the second compared lect]; datatype: string.
[Name of the second compared lect] : unit (character 3-shingle, or just 3-shingle) of the [name of the second compared lect] that undergoes comparison with units of the [name of the first compared lect]; if units coincide, contains value id.; datatype: string.
[Experiment setup]: name of the metric, which combines the [experiment setup] (its parameters concatenated with -) and the exact metric component that compares the two units; datatype: string. The possible values are:
[experiment setup] - DistRank - the frequency-based metric that compares identical units
[experiment setup] - hybrid - the string similarity measure for non-identical units, combined with the frequency-based metric
Distance: value of the metric; datatype: float
.info-file that contains data on branch lengths along with the coincidence or non-coincidence of the detected outgroup with the manually defined one. The file is tab-separated plain text that always contains three values: coincidence (CORRECT) or non-coincidence (INCORRECT) of the yielded classification with the supposed one; outer distance split (the length of the outgroup branch; datatype: float); and inner distance split (the length of the ingroup branch before the split of its lects; datatype: float).
.newick-file that contains the result of an experiment, the phylogenetic tree built by the UPGMA classifier. One can read it with ape::read.tree (R) or Bio.Phylo.read (Python, Biopython).
.png -file that contains the phylogenetic tree visualisation.
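The pairwise .csv-files and the .info-file can be read along the following lines. This is a sketch under several assumptions: columns are addressed by position because the header names depend on the lects and the setup; the metric component is recognised by the DistRank or hybrid suffix; a decimal comma is tolerated in case the files use one; and the .info-file is taken to hold its three values on a single line in the stated order. The two aggregation branches only restate the JOINED and ARRAY descriptions given for Hybridisation_type and are not the package's own code. All function names and sample values are hypothetical.

```python
import csv
import io
from statistics import mean


def to_float(text):
    """Parse a float, tolerating a decimal comma (a defensive assumption)."""
    return float(text.strip().replace(",", "."))


def read_pairwise(handle):
    """Split Distance values into coinciding (DistRank) and non-coinciding (hybrid)."""
    reader = csv.reader(handle, delimiter=";")
    next(reader)  # skip the header row
    coinciding, non_coinciding = [], []
    for row in reader:
        if len(row) < 4:
            continue  # ignore blank or malformed lines
        metric, value = row[2].strip(), to_float(row[3])
        if metric.endswith("DistRank"):
            coinciding.append(value)
        elif metric.endswith("hybrid"):
            non_coinciding.append(value)
    return coinciding, non_coinciding


def hybridise(coinciding, non_coinciding, hybridisation_type):
    """Combine the two groups of values as described for Hybridisation_type."""
    if hybridisation_type == "JOINED":
        # Multiply the means of the two groups.
        return mean(coinciding) * mean(non_coinciding)
    if hybridisation_type == "ARRAY":
        # Join all values into a single list, then take the mean.
        return mean(coinciding + non_coinciding)
    raise ValueError("expected JOINED or ARRAY")


def read_info(text):
    """Parse 'verdict<TAB>outer<TAB>inner' (assumed single-line layout)."""
    verdict, outer, inner = text.strip().split("\t")[:3]
    return verdict, to_float(outer), to_float(inner)


# Hypothetical content for illustration only; not values from the data set.
SAMPLE = "lectA;lectB;setup;Distance\nabc;id.;setup - DistRank;0.2\nabc;abd;setup - hybrid;0.4\nxyz;id.;setup - DistRank;0.4\n"
same, different = read_pairwise(io.StringIO(SAMPLE))
print(hybridise(same, different, "JOINED"), hybridise(same, different, "ARRAY"))
print(read_info("CORRECT\t0.4\t0.1\n"))
```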
How-to
To analyse the results, download and unpack the archive, then refer to the companion R notebook.
Created:
2024-12-02



