
BUT-FIT/FLiP-data

Source: Hugging Face. Updated 2026-04-23; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/BUT-FIT/FLiP-data
Resource description:
---
language:
- en
license: cc-by-4.0
multilinguality: monolingual
task_categories:
- sentence-similarity
- feature-extraction
tags:
- sonar
- speech-embeddings
- text-embeddings
- common-voice
- interpretability
arxiv: 2604.18109
pretty_name: FLiP-data
---

# FLiP-data

Preprocessed data for the [FLiP](https://github.com/BUTSpeechFIT/FLiP) project: **Factorized Linear Projection for Interpreting Multimodal Multilingual Sentence Embeddings**.

FLiP trains a factorized log-linear model to recover lexical content (keywords) from pretrained sentence embeddings via a single linear projection, with no fine-tuning of the encoder.

## Contents

SONAR embeddings and transcripts for **Mozilla Common Voice v15 English** (train / dev / test):

| File | Description |
|------|-------------|
| `*_speech_embs.npy` | SONAR speech embeddings (float32, shape `[N, 1024]`) |
| `*_text_embs.npy` | SONAR text embeddings (float32, shape `[N, 1024]`) |
| `*_sim_scores.npy` | Cosine similarity between paired speech and text embeddings |
| `*_transcript.txt` | Reference transcripts (one utterance per line) |
| `*_entities_gemini2.5_flash_lite.jsonl` | Named entities extracted with Gemini 2.5 Flash Lite |

Splits: `train` (~1M utterances), `dev` (~16k), `test` (~16k).

## Source data

Embeddings were computed from [Mozilla Common Voice v15](https://commonvoice.mozilla.org/) English using the [SONAR](https://github.com/facebookresearch/SONAR) encoder. Audio and transcripts from Common Voice are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
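
The `*_sim_scores.npy` files are described above as cosine similarities between paired speech and text embeddings. As a rough illustration of what such a score is, the sketch below computes a row-wise cosine similarity for two `[N, D]` arrays. It assumes that row `i` of the speech array and row `i` of the text array belong to the same utterance; the function name `paired_cosine` is hypothetical, and this is not necessarily the exact procedure used to produce the released scores.

```python
import numpy as np


def paired_cosine(speech: np.ndarray, text: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two [N, D] embedding arrays.

    Assumption: row i of `speech` and row i of `text` describe the same utterance.
    """
    if speech.shape != text.shape:
        raise ValueError(f"shape mismatch: {speech.shape} vs {text.shape}")
    eps = 1e-12  # guards against division by zero for all-zero rows
    # Scale every row to unit length.
    speech_unit = speech / np.maximum(np.linalg.norm(speech, axis=1, keepdims=True), eps)
    text_unit = text / np.maximum(np.linalg.norm(text, axis=1, keepdims=True), eps)
    # Dot product of matching rows only, not the full N x N similarity matrix.
    return np.sum(speech_unit * text_unit, axis=1)
```

Comparing the result against a downloaded `*_sim_scores.npy` file with `np.allclose` would be one way to check whether the row-alignment assumption holds for a given split.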

## Trained checkpoints

| HF repo | Training data | Embedding | Rank | Size |
|---------|--------------|-----------|-----:|-----:|
| [BUT-FIT/FLiP-en-sonar](https://huggingface.co/BUT-FIT/FLiP-en-sonar) → `mcv15/rank-512/` | MCV v15 EN | SONAR | 512 | 207 MB |
| [BUT-FIT/FLiP-en-sonar](https://huggingface.co/BUT-FIT/FLiP-en-sonar) → `mcv15/rank-1024/` | MCV v15 EN | SONAR | 1024 | 414 MB |

## Usage

See the [FLiP GitHub repo](https://github.com/BUTSpeechFIT/FLiP) for full installation instructions and training/evaluation scripts.

Quick start after downloading:

```python
import numpy as np

train_speech = np.load("cv_15/en/sonar_embeddings/train_speech_embs.npy")
train_text = np.load("cv_15/en/sonar_embeddings/train_text_embs.npy")
```

## Citation

```bibtex
@misc{kesiraju2026flip,
  title = {{FLiP}: Towards understanding and interpreting multimodal multilingual sentence embeddings},
  author = {Kesiraju, Santosh and Yusuf, Bolaji and Sedl{\'a}{\v{c}}ek, Simon and Plchot, Old{\v{r}}ich and Schwarz, Petr},
  year = {2026},
  eprint = {2604.18109},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {https://arxiv.org/abs/2604.18109},
}
```
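
As a supplement to the quick start in the Usage section, the sketch below loads one split with memory mapping and pairs it with its transcripts. At roughly one million rows of 1024 float32 values, a `train` embedding array occupies about 4 GB, so `mmap_mode="r"` avoids reading it all into RAM at once. The helper `load_split` is hypothetical. It assumes that the `*` in the file patterns is the split name (as in `train_speech_embs.npy`) and that the transcript file sits in the same directory as the embeddings; adjust the paths if the downloaded layout differs.

```python
from pathlib import Path

import numpy as np


def load_split(root, split):
    """Load speech/text embeddings (memory-mapped) and transcripts for one split.

    Assumptions: files are named "<split>_speech_embs.npy", "<split>_text_embs.npy"
    and "<split>_transcript.txt", and all live directly under `root`.
    """
    root = Path(root)
    # mmap_mode="r" maps the file read-only instead of loading it into memory.
    speech = np.load(root / f"{split}_speech_embs.npy", mmap_mode="r")
    text = np.load(root / f"{split}_text_embs.npy", mmap_mode="r")
    with open(root / f"{split}_transcript.txt", encoding="utf-8") as fh:
        transcripts = [line.rstrip("\n") for line in fh]
    # One transcript line per embedding row is expected.
    if not (len(speech) == len(text) == len(transcripts)):
        raise ValueError(
            f"row counts differ: {len(speech)} speech, {len(text)} text, "
            f"{len(transcripts)} transcripts"
        )
    return speech, text, transcripts
```

A call such as `load_split("cv_15/en/sonar_embeddings", "dev")` would then return the two arrays and a list of strings, provided the directory follows the assumed layout.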
Provided by:
BUT-FIT