BiXT model weights (Perceiving Longer Sequences with Bi-Directional Cross-Attention Transformers)
Source: Figshare | Updated: 2025-03-10 | Indexed: 2026-04-28
Download link:
https://figshare.com/articles/dataset/_b_BiXT_model_weights_b_Perceiving_Longer_Sequences_with_Bi-Directional_Cross-Attention_Transformers_/28561820
Resource description:
This collection includes PyTorch weights of various BiXT models trained on the ImageNet dataset, as introduced in the paper:

Markus Hiller, Krista A. Ehinger, and Tom Drummond. "Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers." The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.

These weights support the following GitHub repository, which contains all code: https://github.com/mrkshllr/BiXT/

Weights for the following models are available:

ImageNet Classification Models

Default BiXT-Tiny models with 64 latents:
- BiXT-Ti/16: bixt_ti_l64_p16.zip
- BiXT-Ti/8: bixt_ti_l64_p16s8.zip
- BiXT-Ti/4: bixt_ti_l64_p16s4.zip

The above BiXT-Tiny models fine-tuned on larger 384x384 images:
- BiXT-Ti/16-ft384: bixt_ti_l64_p16_ft384.zip
- BiXT-Ti/8-ft384: bixt_ti_l64_p16s8_ft384.zip
- BiXT-Ti/4-ft384: bixt_ti_l64_p16s4_ft384.zip

Convolutional alternative, BiXT-Tiny with conv-tokeniser:
- BiXT-Ti/16 (conv): bixt_conv_ti_l64_p16.zip

Slightly larger models with an embedding dimension of 256 instead of 192 (default tiny):
- BiXT-d256/16: bixt_ed256_l64_p16.zip
- BiXT-d256/8: bixt_ed256LS_l64_p16s8.zip
- BiXT-d256/8-ft384: bixt_ed256LS_l64_p16s8_ft384.zip

Models for Dense Downstream Tasks

Note that for standard ImageNet training, we simply use a standard classification loss on the average-pooled latent embeddings. This means that for a 12-layer BiXT network, the refined patch tokens only receive a gradient until layer 11, which is why we employ only a one-sided cross-attention for the last layer (see the BiXT model file in the repository).

For simplicity and easy transfer to dense downstream tasks, we therefore create BiXT models with a depth of 13 and train these on ImageNet. Afterwards, the last one-sided cross-attention, which exclusively refines the latent vectors, is discarded, and the remaining (fully trained) 12-layer network is used for fine-tuning on downstream tasks.

Note: It is, of course, entirely possible to replace or extend our simple classification loss on the averaged latent vectors with other token-side losses (e.g. Masked Image Modelling) to provide a gradient signal for the token side and thereby directly train both the latent and the token refinement for all layers.

Dense (d13) BiXT-Tiny models with 64 latents:
- BiXT-Ti/16 (d13): bixt_ti_l64_d13_p16.zip
- BiXT-Ti/8 (d13): bixt_ti_l64_d13_p16s8.zip
- BiXT-Ti/4 (d13): bixt_ti_l64_d13_p16s4.zip

The above dense (d13) BiXT-Tiny models fine-tuned on larger 384x384 images:
- BiXT-Ti/16-ft384 (d13): bixt_ti_l64_d13_p16_ft384.zip
- BiXT-Ti/8-ft384 (d13): bixt_ti_l64_d13_p16s8_ft384.zip

Dense convolutional alternative, BiXT-Tiny (d13) with conv-tokeniser:
- BiXT-Ti/16 (conv, d13): bixt_conv_ti_l64_d13_p16.zip
- BiXT-Ti/8 (conv, d13): bixt_conv_ti_l64_d13_p8.zip

Note

If you need any further information or model files, please check out the code in the GitHub repository and/or reach out! Additional information, as well as details on the performance of each pre-trained model, is also provided in the repository.
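The d13 procedure described above (train 13 layers, then discard the last one before dense fine-tuning) could be sketched roughly as follows. This is only an illustration on a plain key-to-value mapping, not the repository's own conversion code: the function name `drop_last_layer` and the key prefix `layers.` are hypothetical, and the real checkpoints may name their parameters differently, so the prefix would need to be adapted after inspecting the actual keys.

```python
import re
from typing import Any, Dict, Mapping


def drop_last_layer(state_dict: Mapping[str, Any],
                    layer_prefix: str = "layers.") -> Dict[str, Any]:
    """Return a copy of `state_dict` without the highest-indexed layer.

    `layer_prefix` is a hypothetical naming scheme ("layers.<idx>.<param>");
    check the real checkpoint keys before relying on it.
    """
    pattern = re.compile("^" + re.escape(layer_prefix) + r"(\d+)\.")

    # Map every per-layer key to its layer index.
    layer_of = {}
    for key in state_dict:
        match = pattern.match(key)
        if match:
            layer_of[key] = int(match.group(1))
    if not layer_of:
        raise ValueError(f"no keys start with {layer_prefix!r}<index>.")

    last = max(layer_of.values())
    # Keep all non-layer keys and every layer except the last one.
    return {k: v for k, v in state_dict.items() if layer_of.get(k) != last}


# Toy stand-in for a 13-layer checkpoint (values would normally be tensors).
toy = {f"layers.{i}.attn.weight": i for i in range(13)}
toy["head.weight"] = "h"
trimmed = drop_last_layer(toy)
print(sorted(k for k in toy if k not in trimmed))
```

A state dict obtained from a PyTorch checkpoint is typically such a mapping from parameter names to tensors, so the same filtering idea would apply once the actual key layout is known.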
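The archive names in this listing appear to follow a regular pattern, which can be handy when selecting a checkpoint programmatically. The sketch below parses that pattern under stated assumptions: the regular expression is inferred only from the names shown here, not from any official specification; reading the `s<N>` suffix as a stride is an assumption; and the `LS` suffix on `ed256LS` is kept verbatim because its meaning is not given in this description. The names `BixtArchive` and `parse_archive_name` are made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern inferred from the archive names in this listing (not an official spec).
_NAME = re.compile(
    r"^bixt_(?P<conv>conv_)?(?P<variant>ti|ed\d+[A-Za-z]*)"
    r"_l(?P<latents>\d+)"
    r"(?:_d(?P<depth>\d+))?"
    r"_p(?P<patch>\d+)(?:s(?P<stride>\d+))?"
    r"(?:_ft(?P<ft>\d+))?"
    r"(?:\.zip)?$"
)


@dataclass(frozen=True)
class BixtArchive:
    variant: str                 # "ti", or e.g. "ed256LS" kept verbatim
    conv_tokeniser: bool         # True when the name contains "conv_"
    latents: int                 # number after "_l"
    depth: Optional[int]         # number after "_d"; None if absent
    patch: int                   # number after "_p"
    stride: Optional[int]        # "s<N>" suffix; stride reading is an assumption
    finetune_res: Optional[int]  # number after "_ft"; None if absent


def _opt_int(text: Optional[str]) -> Optional[int]:
    return int(text) if text is not None else None


def parse_archive_name(name: str) -> BixtArchive:
    m = _NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognised BiXT archive name: {name!r}")
    return BixtArchive(
        variant=m.group("variant"),
        conv_tokeniser=m.group("conv") is not None,
        latents=int(m.group("latents")),
        depth=_opt_int(m.group("depth")),
        patch=int(m.group("patch")),
        stride=_opt_int(m.group("stride")),
        finetune_res=_opt_int(m.group("ft")),
    )


print(parse_archive_name("bixt_ti_l64_d13_p16s8_ft384.zip"))
```

Keeping absent fields as `None` rather than guessing defaults avoids asserting anything the file name does not state, such as the depth of the non-d13 models.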
Created:
2025-03-10



