
The data folder of the CSRef project

ModelScope community (魔搭社区). Updated 2026-04-17. Indexed 2025-08-30.
Download link:
https://modelscope.cn/datasets/lihongh/CSRef_data
Resource description:
# 🎤 Contrastive Semantic Alignment for Speech Referring Expression Comprehension ([CSRef](https://github.com/macrorise-lh/CSRef))

![CSRef Logo](https://img.shields.io/badge/CSRef-v1.0-blue) ![Python](https://img.shields.io/badge/python-3.9.23-green) ![PyTorch](https://img.shields.io/badge/PyTorch-2.8.0-red)

This repository contains the implementation of the approach described in the paper "CSRef: Contrastive Semantic Alignment for Speech Referring Expression Comprehension". 🚀

## 📋 Project Overview

### What is CSRef?

CSRef is a deep learning framework designed to comprehend referring expressions in speech and localize the corresponding objects in images. The framework employs a two-stage training approach:

1. **CSA Stage**: A pretraining stage that learns to align speech and text semantics through contrastive learning. It leverages the structured semantic space of text to guide the representation learning of raw speech.
2. **SREC Stage**: The main training stage that leverages the speech encoder from the CSA stage to perform referring expression comprehension by aligning speech with visual features.

### Key Features and Capabilities

- **Two-stage training approach**: First learns speech-text alignment, then applies it to speech-visual tasks
- **Multi-modal fusion**: Integrates speech and visual modalities effectively
- **Flexible architecture**: Supports various speech encoders and visual backbones

### Potential Applications and Use Cases

- **Human-computer interaction**: Enabling natural language control of computer vision systems
- **Robotic vision**: Allowing robots to understand and locate objects based on verbal descriptions

## 🛠️ Installation Instructions

### Prerequisites

- **Python**: 3.9.23 (tested with this version)
- **CUDA**: 12.6 or higher (for GPU support)
- **PyTorch**: 2.8 or higher
- **Operating System**: Linux (tested on Ubuntu 22.04)

### Step-by-Step Environment Setup

1. **Clone the repository**

   ```bash
   git clone https://github.com/macrorise-lh/CSRef.git
   cd CSRef
   ```

2. **Create a conda virtual environment**

   ```bash
   conda create -n csref python=3.9
   conda activate csref
   ```

3. **Install PyTorch**

   ```bash
   # For CUDA 12.6
   pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
   ```

4. **Install dependencies from requirements.txt**

   ```bash
   pip install -r requirements.txt
   ```

## 💾 Data Preparation

Before training, you need to download and prepare the required datasets:

### Speech Referring Expressions Annotations

We provide two methods to obtain the speech referring expressions annotations:

#### Method 1: Automatic Download from [Hugging Face](https://huggingface.co/collections/lihong-huang/speech-referring-expression-comprehension-srec-68a97ed74ea0b45b56dcc4f9)

The simplest way is to use the Hugging Face dataset integration. When you run training with the `_hf` configuration files, the datasets will be automatically downloaded:

```bash
# Example: This will automatically download RefCOCO speech dataset from Hugging Face
CUDA_VISIBLE_DEVICES=0 PORT=23451 bash tools/train_speech.sh configs/csref_refcoco_speech_hf.py 1
```

Available datasets with automatic download:

- `configs/csref_refcoco_speech_hf.py` - RefCOCO_speech dataset
- `configs/csref_refcoco+_speech_hf.py` - RefCOCO+_speech dataset
- `configs/csref_refcocog_speech_hf.py` - RefCOCOg_speech dataset
- `configs/csref_srefface_hf.py` - SRRefFace dataset
- `configs/csref_srefface+_hf.py` - SRRefFace+ dataset
- `configs/csref_sreffaceg_hf.py` - SRRefFaceG dataset

#### Method 2: Manual Download from [ModelScope](https://modelscope.cn/datasets/lihongh/CSRef_data)

Alternatively, you can manually download the complete dataset and pre-trained model weights:

```bash
# Download from ModelScope
# Follow the link: https://modelscope.cn/datasets/lihongh/CSRef_data
# Extract files to the appropriate directories in the data folder following the Project Structure
```
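After extracting the ModelScope files, a quick check along these lines can confirm that the folder matches the layout in the Project Structure section. This is only a minimal sketch: `check_csref_layout` is a hypothetical helper, not part of the CSRef repository, and the directory list is an assumption drawn from the Project Structure tree (it checks top-level data folders only, not individual files).

```bash
# Hypothetical helper (not part of the CSRef repository): reports which of the
# data directories named in the Project Structure tree are missing.
# Usage: check_csref_layout [DATA_ROOT]   (DATA_ROOT defaults to ./data)
check_csref_layout() {
    local root="${1:-data}"
    local missing=0
    local dir
    # Assumed directory list, copied from the Project Structure section.
    for dir in audios images/train2014 anns/general_object anns/face_centric weights; do
        if [ ! -d "$root/$dir" ]; then
            echo "missing: $root/$dir" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every expected directory exists.
    [ "$missing" -eq 0 ]
}
```

Run `check_csref_layout` from the repository root, or pass another data root such as `check_csref_layout /path/to/data`. It lists anything missing on stderr and returns a non-zero status in that case.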
**Advantages of Manual Download:**

- Complete offline access to all datasets
- Faster training startup (no download time)

**Data Organization:** 📁 After manual download, organize the files according to the directory structure shown in the [Project Structure](#-project-structure) section.

### 📦 Additional Required Datasets and Weights

1. **Download [LibriSpeech ASR dataset](https://www.openslr.org/12/) for CSA pre-training**

   ```bash
   # Create directory
   mkdir -p data/audios

   # Download and extract LibriSpeech
   cd data/audios

   # train sets - 960 hours
   wget https://www.openslr.org/resources/12/train-other-500.tar.gz
   wget https://www.openslr.org/resources/12/train-clean-360.tar.gz
   wget https://www.openslr.org/resources/12/train-clean-100.tar.gz

   # dev sets
   wget https://www.openslr.org/resources/12/dev-other.tar.gz
   wget https://www.openslr.org/resources/12/dev-clean.tar.gz

   tar -xvzf train-other-500.tar.gz
   tar -xvzf train-clean-360.tar.gz
   tar -xvzf train-clean-100.tar.gz
   tar -xvzf dev-other.tar.gz
   tar -xvzf dev-clean.tar.gz
   cd ../../
   ```

2. **Download [COCO images](https://cocodataset.org/#download)**

   ```bash
   # Create directory
   mkdir -p data/images

   # Download and extract COCO train2014 images
   cd data/images
   wget http://images.cocodataset.org/zips/train2014.zip
   unzip train2014.zip
   rm train2014.zip
   cd ../../
   ```

3. **Download pre-trained encoders**

   ```bash
   # Create directory
   mkdir -p data/weights

   # Download BERT and Wav2Vec2 models
   cd data/weights
   git lfs install
   git clone https://huggingface.co/facebook/wav2vec2-base
   git clone https://huggingface.co/google-bert/bert-base-uncased

   # Download CSA pretrained Speech Encoder
   # (still inside data/weights, so the file lands where the Project Structure expects it)
   wget https://modelscope.cn/datasets/lihongh/CSRef_data/resolve/master/data/weights/CSA_speech_encoder.pth
   cd ../../

   # Download pretrained visual backbone CSPDarkNet
   # following https://github.com/luogen1996/SimREC/blob/main/DATA_PRE_README.md#pretrained-weights
   # or https://modelscope.cn/datasets/lihongh/CSRef_data/resolve/master/data/weights/cspdarknet_coco.pth
   ```

## 🚀 Usage Examples

### 🏋️ Training

#### 🎯 CSA Stage Training

The CSA stage learns semantic alignment between speech and text modalities:

```bash
# Single GPU training
CUDA_VISIBLE_DEVICES=0 PORT=23450 bash tools/train_CSA.sh configs/csref_CSA_librispeech.py 1

# Multi-GPU training (4 GPUs)
CUDA_VISIBLE_DEVICES=1,2,3,4 PORT=23450 bash tools/train_CSA.sh configs/csref_CSA_librispeech.py 4
```

Key parameters:

- `CUDA_VISIBLE_DEVICES`: Specifies which GPUs to use
- `PORT`: Port number for distributed training
- `configs/csref_CSA_librispeech.py`: Configuration file for CSA stage
- Trailing number (`1` or `4`): Number of GPUs to use

#### 🔍 SREC Stage Training

The SREC stage uses the trained speech encoder to perform referring expression comprehension:

```bash
# Single GPU training on RefCOCO
CUDA_VISIBLE_DEVICES=0 PORT=23451 bash tools/train_speech.sh configs/csref_refcoco_speech.py 1
```

You can also train on other datasets by using different configuration files:

- `configs/csref_refcoco_speech.py` / `configs/csref_refcoco_speech_hf.py`: For RefCOCO_speech dataset
- `configs/csref_refcoco+_speech.py` / `configs/csref_refcoco+_speech_hf.py`: For RefCOCO+_speech dataset
- `configs/csref_refcocog_speech.py` / `configs/csref_refcocog_speech_hf.py`: For RefCOCOg_speech dataset
- `configs/csref_srefface.py` / `configs/csref_srefface_hf.py`: For SRRefFace dataset
- `configs/csref_srefface+.py` / `configs/csref_srefface+_hf.py`: For SRRefFace+ dataset
- `configs/csref_sreffaceg.py` / `configs/csref_sreffaceg_hf.py`: For SRRefFaceG dataset

**Note:** Use configuration files with `_hf` suffix for automatic Hugging Face dataset download, or without `_hf` suffix if you have manually downloaded and organized the data.

### 📊 Evaluation

```bash
# Evaluate SREC model
# Using automatically downloaded Hugging Face datasets
CUDA_VISIBLE_DEVICES=0 PORT=23451 bash tools/eval_speech.sh configs/csref_refcoco_speech_hf.py 1 data/weights/csref_refcoco_speech.pth

# Using manually downloaded datasets
CUDA_VISIBLE_DEVICES=0 PORT=23451 bash tools/eval_speech.sh configs/csref_refcoco_speech.py 1 data/weights/csref_refcoco_speech.pth
```

**Note:** Make sure to use the corresponding configuration file (`_hf` or non-`_hf`) that matches your data preparation method.

## 📂 Project Structure

The CSRef project is organized as follows:

```
CSRef/
├── configs/                        # Configuration files
│   ├── csref_*.py                  # Main configuration files for different datasets
│   └── common/                     # Common configuration modules
│       ├── dataset_*.py            # Dataset configurations
│       ├── optim.py                # Optimizer configurations
│       ├── train.py                # Training configurations
│       └── models/                 # Model configurations
├── csref/                          # Core library code
│   ├── config/                     # Configuration management
│   ├── datasets/                   # Dataset handling
│   ├── layers/                     # Neural network layers
│   ├── models/                     # Model definitions
│   │   ├── backbones/              # Visual backbones
│   │   ├── heads/                  # Detection heads
│   │   ├── losses/                 # Loss functions
│   │   ├── speech_encoders/        # Speech encoders
│   │   ├── text_encoder/           # Text encoders
│   │   └── utils/                  # Model utilities
│   ├── scheduler/                  # Learning rate schedulers
│   └── utils/                      # Utility functions
├── tools/                          # Training and evaluation scripts
│   ├── train_*.py                  # Training scripts
│   ├── train_*.sh                  # Training shell scripts
│   ├── eval_*.py                   # Evaluation scripts
│   └── eval_*.sh                   # Evaluation shell scripts
├── data/                           # Data directory (to be created by user)
│   ├── audios/                     # Audio files
│   │   ├── LibriSpeech/
│   │   ├── refcoco_speech/
│   │   ├── refcoco+_speech/
│   │   └── refcocog_speech/
│   ├── images/                     # Image files
│   │   └── train2014/              # COCO train2014 images
│   ├── anns/                       # Annotation files
│   │   ├── general_object/         # General object annotations (RefCOCO/RefCOCO+/RefCOCOg)
│   │   └── face_centric/           # Face-centric annotations (SRRefFace series)
│   ├── hf_cache/                   # Hugging Face dataset cache (auto-created)
│   └── weights/                    # Pre-trained model weights
│       ├── wav2vec2-base/          # Wav2Vec2 base model
│       ├── bert-base-uncased/      # BERT base uncased model
│       ├── CSA_speech_encoder.pth  # Pre-trained CSA speech encoder
│       └── csref_*.pth             # Trained CSRef model weights (if downloaded manually)
├── requirements.txt                # Python dependencies
├── README.md                       # This file
└── .gitignore                      # Git ignore rules
```

## License Information

This project is licensed under the Apache-2.0 License - see the [LICENSE](LICENSE) file for details.

## Acknowledgement

Thanks a lot for the nicely organized code from the following repos:

- [SimREC](https://github.com/luogen1996/SimREC)
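
As a companion to the training and evaluation commands, the `_hf` / non-`_hf` naming convention can be wrapped in a small helper so the right configuration file is picked consistently. This is a minimal sketch, assuming the configuration file names listed in the SREC Stage Training section; `csref_config` is a hypothetical function, not part of the CSRef repository.

```bash
# Hypothetical helper (not part of the CSRef repository): prints the config
# path for a dataset, following the file names listed in this README.
# Usage: csref_config DATASET [hf|local]   (defaults to local, i.e. manual download)
csref_config() {
    local dataset="$1"
    local mode="${2:-local}"
    local stem suffix
    case "$dataset" in
        refcoco|"refcoco+"|refcocog)    stem="csref_${dataset}_speech" ;;
        srefface|"srefface+"|sreffaceg) stem="csref_${dataset}" ;;
        *) echo "unknown dataset: $dataset" >&2; return 1 ;;
    esac
    case "$mode" in
        hf)    suffix="_hf" ;;  # automatic Hugging Face download
        local) suffix="" ;;     # manually downloaded and organized data
        *) echo "unknown mode: $mode (expected hf or local)" >&2; return 1 ;;
    esac
    echo "configs/${stem}${suffix}.py"
}
```

For example, `CUDA_VISIBLE_DEVICES=0 PORT=23451 bash tools/train_speech.sh "$(csref_config refcoco hf)" 1` expands to the RefCOCO training command shown under Method 1.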

Provider:
maas
Created:
2025-08-23
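Background overview
This dataset is the data folder of the CSRef project. It holds the data and pre-trained weights used by CSRef, a multimodal framework for speech referring expression comprehension. The framework is trained in two stages: it first aligns speech and text semantics, then applies the resulting speech encoder to speech-visual tasks. It supports a variety of speech encoders and visual backbones and targets scenarios such as human-computer interaction and robotic vision.
The summary above was collected and generated by 遇见数据集.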