
Supplementary Data to "Model interpretability enhances domain generalization in the case of textual complexity modeling"

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://figshare.com/articles/dataset/Supplementary_Data_to_Model_interpretability_enhances_domain_generalization_in_the_case_of_textual_complexity_modeling_/25676394
Description:
This data collection contains models, training features, log files, and the results table accompanying the paper “Interpretability Benefits Generalizability: The Case of Modeling Textual Complexity” by Frans van der Sluis and Egon L. van den Broek.

Master Table

The master table (master_table.csv) consolidates results for each model configuration assessed and includes the following key columns:

data_file: Identifies the feature set used as input for the model.
id: Unique identifier for the training run, linking to a corresponding log file in logs.tar.gz.
model: Type of model evaluated, either torch, glmnet (glm), vec2read, probabilistic language models (plm), or chatgpt3.5.
depth: Interaction depth categorized into level 1, 2, or 3.
input features: Total number of input features used.
_parameters_: Calculated number of model coefficients or parameters.

Performance Metrics

Metrics listed are evaluated during the outer loop of the training process.

Classification Metrics (_classif_): Includes accuracy (acc), precision, recall, f1 score, true negatives (tn), true positives (tp), false negatives (fn), and false positives (fp) for task 1 (text classification).
Generalization Metrics (_guardian_): Metrics for task 2 (predicting appraisals of processing difficulty) include rank order correlation (rho), Pearson’s correlation with the predicted class probability for the ‘complex’ class (rprob), and Pearson’s correlation with the logit-value of the prediction (rlogit).

Torch-Specific Columns

outer_epoch and inner_epoch: Specifies the number of training epochs run in the respective outer and inner loop.
total_epoch: Total number of epochs run before early stopping.
best_epoch: Epoch achieving the lowest validation loss, used for further processing.
Note that the model was re-trained on the full EnSiWiki dataset if not stopped early.
_net_: Details parameters (e.g., model depth and width) for torch models, with parameter ranges (e.g., learning rate and weight decay) listed in JSON format and tuned in the inner loop.
_tuned_: Specifies parameters set after tuning.

Intermediate tuning results

The repository includes intermediate and final results from the tuning process:

Inner loop results: Intermediate tuning results for FNN models are provided in pytorch_cv_inner_loop.csv.
Outer loop results: Final results from the outer loop are available for both FNN and GLM models. These results form the input to the master table. Files include:
  glmnet_cv_outer_loop.csv
  pytorch_cv_outer_loop.csv

Model Files

Three specific model files highlighted within the paper are stored in the files: glmnet_cv_277.RDS, glmnet_cv_281.RDS, and glmnet_cv_282.RDS. These files store the logistic regression models with feature interaction depth at levels 1, 3, and 2, respectively. The files are encoded in RDS format and can be read using the R function readRDS along with the glmnet and mlr3 libraries.

Feature Sets Files

Includes two sets of features derived from the training and target corpora: Engineered features and BERT embeddings. These feature sets are inputs for the computational models.
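As a rough illustration of how a feature file might be consumed, the sketch below splits each row into index columns and numeric feature columns using only the Python standard library. The index column names and label codes are taken from this description; the helper names, the assumption that every non-index column is a numeric feature, and the sample rows are hypothetical and not part of the dataset or its accompanying code.

```python
import csv
import gzip
import io

# Index columns and label codes as named in this description.
INDEX_COLUMNS = ["filename", "id", "wiki_id", "pair_id", "length", "lang", "label"]
LABEL_NAMES = {0: "Simple Wikipedia", 1: "English Wikipedia", 2: "The Guardian"}


def read_feature_rows(stream):
    """Return a list of (index, features) pairs from an open CSV text stream.

    Assumption: every column that is not an index column holds a numeric feature.
    Empty cells are skipped rather than imputed.
    """
    rows = []
    for row in csv.DictReader(stream):
        index = {name: row[name] for name in INDEX_COLUMNS if name in row}
        features = {
            name: float(value)
            for name, value in row.items()
            if name is not None
            and name not in INDEX_COLUMNS
            and value not in ("", None)
        }
        rows.append((index, features))
    return rows


def read_feature_file(path):
    """Read a gzip-compressed feature CSV (hypothetical helper, not from the repository)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as stream:
        return read_feature_rows(stream)


# Made-up sample rows with two embedding dimensions, for illustration only.
SAMPLE = (
    "filename,id,wiki_id,pair_id,length,lang,label,d_0,d_1\n"
    "wiki,1,100,7,250,simple,0,0.25,-1.5\n"
    "guardian,2,,,900,,2,0.5,\n"
)

sample_rows = read_feature_rows(io.StringIO(SAMPLE))
for index, features in sample_rows:
    source = LABEL_NAMES[int(float(index["label"]))]
    print(index["id"], source, features)
```

For the real files, `read_feature_file` would be called with the local path of a downloaded `.csv.gz` file; loading all 768 embedding columns into Python dictionaries is slow for large corpora, so a dataframe library may be preferable in practice.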
Note that the code expects data in 'parquet' format, but they are provided here in CSV format for easier access:

bert_base_uncased-features_embedding-layer-10_11.csv.gz for BERT features.
wundt_v16_8-model-ridge-na.csv.gz for engineered features.

Index Columns for Feature Files:

filename: Source of the corpus.
id: Unique identifier for the article.
wiki_id: Specific article identifier used in Wikipedia.
pair_id: Identifier linking article pairs, referring to the Pairs table.
length: Word count of the article.
lang: Language of the article, with options including ‘simple’, ‘english’, or unspecified (’’).
label: Classification of the article source, where 0 = Simple Wikipedia, 1 = English Wikipedia, 2 = The Guardian articles.

Feature Columns for BERT Embeddings:

d_0 to d_767: Represents one of the 768 embedding dimensions from the BERT model.

Feature Columns for Engineered Features:

lucene_characters_per_word_mean (Feature LenCha): Average word length in characters.
lucene_syllables_per_word_mean (Feature LenSyl): Average word length in syllables.
wordnet_log_cumulative_spread_*_posall_stonge_mean (Feature Sem): Semantic neighborhood density based on WordNet.
snowball_swe_w55_n*_mean (Feature EntWor): Word repetition for n-grams (n=1,2,3) measured within a 55-word sliding window.
esa_swe_w30_mean (Feature EntSem): Semantic entropy measured within a high-dimensional topic space via Explicit Semantic Analysis.
loc__cor2_f0_i0_log_mean_mean (Feature Dep): Mean logarithmic length of dependencies within a sentence.
cc_sent_geo_log_p_mean (Feature LogPr): Log probabilities of 5-gram word sequences per sentence, derived from the CommonCrawl corpus.
coref_fwin_fill_local_*_mean (Feature CohRef): Mean coherence measured by the number of coreferences with 1 to 5 preceding sentences.
esacoh_fwin_fill_local_*_mean (Feature CohSem): Mean coherence measured by semantic relatedness to 1 to 5 preceding sentences.
causal_word_span_ratio (Feature ConCau): Ratio of causal connectives to the number of words.
noncausal_word_span_ratio (Feature ConAlt): Ratio of non-causal connectives to the number of words.

Feature Interactions Rankings

The feature sets are accompanied by a list of the top-100,000 second- and third-order combinations of features, ranked by their F-score. These precomputed lists facilitate the inclusion of multiplicative feature interactions at depth levels 2 and 3 in the models. Note that the code expects data in 'parquet' format, but they are provided here in CSV format for easier access:

bert_base_uncased-features_embedding-layer-10_11-combis_depth2.csv.gz for second-order combinations of BERT features.
bert_base_uncased-features_embedding-layer-10_11-combis_depth3.csv.gz for third-order combinations of BERT features.
wundt_v16_8-model-ridge-na-combis_depth2.csv.gz for second-order combinations of engineered features.
wundt_v16_8-model-ridge-na-combis_depth3.csv.gz for third-order combinations of engineered features.

Columns for Combinations Files:

combi: A JSON-formatted list of feature indexes, mapping to columns in the feature data files.
f: The F-score indicating the discriminative power of the feature combination for distinguishing between Simple and English Wikipedia texts.
p: The p-value associated with the F-score.

Code

The code accompanying this work is available on Github, https://github.com/fsluis/textual-complexity, and archived at Zenodo, doi 10.5281/zenodo.14359835.

References

Van der Sluis, F., & van den Broek, E. L. (2025). Model interpretability enhances domain generalization in the case of textual complexity modeling. Patterns, 6, 101177. https://doi.org/10.1016/j.patter.2025.101177
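The combinations files described above could be turned into multiplicative interaction terms along the following lines. This is a minimal sketch, not the authors' implementation: it assumes that the indexes in combi are 0-based positions into the ordered list of feature columns (the description does not state the indexing convention), that interactions are plain products of the raw feature values, and the function names and sample values are hypothetical.

```python
import json
import math


def parse_combis(ranking_rows, top_k):
    """Return the top_k index lists, ordered by descending F-score.

    ranking_rows: iterable of dicts with string values under the keys
    'combi', 'f', and 'p', for example as produced by csv.DictReader.
    """
    ranked = sorted(ranking_rows, key=lambda row: float(row["f"]), reverse=True)
    return [json.loads(row["combi"]) for row in ranked[:top_k]]


def interaction_terms(feature_values, combis):
    """Multiply the feature values selected by each combination.

    Assumption: each index is a 0-based position in feature_values.
    """
    return [math.prod(feature_values[i] for i in combi) for combi in combis]


# Made-up ranking rows and feature values, for illustration only.
ranking_rows = [
    {"combi": "[0, 2]", "f": "120.5", "p": "1e-9"},
    {"combi": "[1, 2]", "f": "310.0", "p": "1e-12"},
    {"combi": "[0, 1]", "f": "15.2", "p": "0.001"},
]
article_features = [2.0, 3.0, 0.5]

top_combis = parse_combis(ranking_rows, top_k=2)
terms = interaction_terms(article_features, top_combis)
augmented = article_features + terms
print(top_combis, terms, augmented)
```

Sorting by F-score inside `parse_combis` is a precaution; if the files are already stored in rank order, taking the first rows would be equivalent. The same functions apply to depth-3 files, where each combi holds three indexes.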
Created:
2025-01-30