Data folder of the CSRef project
Source: ModelScope community (魔搭社区) · Updated 2026-04-17 · Indexed 2025-08-30
Download link:
https://modelscope.cn/datasets/lihongh/CSRef_data
Resource description:
# 🎤 Contrastive Semantic Alignment for Speech Referring Expression Comprehension ([CSRef](https://github.com/macrorise-lh/CSRef))
  
This repository contains the implementation of the approach described in the paper "CSRef: Contrastive Semantic Alignment for Speech Referring Expression Comprehension". 🚀
## 📋 Project Overview
### What is CSRef?
CSRef is a deep learning framework designed to comprehend referring expressions in speech and localize the corresponding objects in images. The framework employs a two-stage training approach:
1. **CSA Stage**: A pretraining stage that learns to align speech and text semantics through contrastive learning. It leverages the structured semantic space of text to guide the representation learning of raw speech.
2. **SREC Stage**: The main training stage that leverages the speech encoder from the CSA stage to perform referring expression comprehension by aligning speech with visual features.
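The contrastive alignment idea behind the CSA stage can be illustrated with a generic symmetric InfoNCE loss. This is a minimal sketch only: the README does not state which loss CSRef uses, so the formulation, the `temperature` value, and every function name below are assumptions chosen for illustration. Plain Python lists stand in for batched embeddings, and the embeddings are assumed to be non-zero.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(speech_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair i of speech/text is positive, all other pairs are negatives."""
    speech = [l2_normalize(v) for v in speech_embs]
    text = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of speech i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in text]
        for s in speech
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    speech_to_text = cross_entropy_diagonal(logits)
    text_to_speech = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (speech_to_text + text_to_speech)


# Toy batch: matched pairs should give a lower loss than mismatched pairs.
speech_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(speech_batch, [[0.9, 0.1], [0.1, 0.9]])
shuffled = symmetric_contrastive_loss(speech_batch, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

In a real training loop the embeddings would come from the speech and text encoders, and the loss would be computed on tensors rather than lists.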
### Key Features and Capabilities
- **Two-stage training approach**: First learns speech-text alignment, then applies it to speech-visual tasks
- **Multi-modal fusion**: Integrates speech and visual modalities effectively
- **Flexible architecture**: Supports various speech encoders and visual backbones
### Potential Applications and Use Cases
- **Human-computer interaction**: Enabling natural language control of computer vision systems
- **Robotic vision**: Allowing robots to understand and locate objects based on verbal descriptions
## 🛠️ Installation Instructions
### Prerequisites
- **Python**: 3.9.23 (tested with this version)
- **CUDA**: 12.6 or higher (for GPU support)
- **PyTorch**: 2.8 or higher
- **Operating System**: Linux (tested on Ubuntu 22.04)
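Once an environment exists, the version requirements above can be checked programmatically. The helper below is a hypothetical sketch, not part of the CSRef repository: it compares dotted version strings numerically and ignores local suffixes such as `+cu126`.

```python
import re


def parse_version(text):
    """Return the leading numeric components, e.g. '2.8.0+cu126' -> (2, 8, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum`, comparing component by component."""
    have = parse_version(installed)
    need = parse_version(minimum)
    width = max(len(have), len(need))
    # Pad the shorter tuple with zeros so that '2.8' and '2.8.0' compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


print(meets_minimum("2.8.0+cu126", "2.8"))
print(meets_minimum("2.7.1", "2.8"))
```

With PyTorch installed, `torch.__version__` and `torch.version.cuda` can be passed as the `installed` argument; note that `torch.version.cuda` is `None` on CPU-only builds, which this sketch does not handle.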
### Step-by-Step Environment Setup
1. **Clone the repository**
```bash
git clone https://github.com/macrorise-lh/CSRef.git
cd CSRef
```
2. **Create a conda virtual environment**
```bash
conda create -n csref python=3.9
conda activate csref
```
3. **Install PyTorch**
```bash
# For CUDA 12.6
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
```
4. **Install dependencies from requirements.txt**
```bash
pip install -r requirements.txt
```
## 💾 Data Preparation
Before training, you need to download and prepare the required datasets:
### Speech Referring Expressions Annotations
We provide two methods to obtain the speech referring expressions annotations:
#### Method 1: Automatic Download from [Hugging Face](https://huggingface.co/collections/lihong-huang/speech-referring-expression-comprehension-srec-68a97ed74ea0b45b56dcc4f9)
The simplest way is to use the Hugging Face dataset integration. When you run training with the `_hf` configuration files, the datasets will be automatically downloaded:
```bash
# Example: This will automatically download RefCOCO speech dataset from Hugging Face
CUDA_VISIBLE_DEVICES=0 PORT=23451 bash tools/train_speech.sh configs/csref_refcoco_speech_hf.py 1
```
Available datasets with automatic download:
- `configs/csref_refcoco_speech_hf.py` - RefCOCO_speech dataset
- `configs/csref_refcoco+_speech_hf.py` - RefCOCO+_speech dataset
- `configs/csref_refcocog_speech_hf.py` - RefCOCOg_speech dataset
- `configs/csref_srefface_hf.py` - SRRefFace dataset
- `configs/csref_srefface+_hf.py` - SRRefFace+ dataset
- `configs/csref_sreffaceg_hf.py` - SRRefFaceG dataset
#### Method 2: Manual Download from [ModelScope](https://modelscope.cn/datasets/lihongh/CSRef_data)
Alternatively, you can manually download the complete dataset and pre-trained model weights:
```bash
# Download from ModelScope
# Follow the link: https://modelscope.cn/datasets/lihongh/CSRef_data
# Extract files to the appropriate directories in the data folder following the Project Structure
```
**Advantages of Manual Download:**
- Complete offline access to all datasets
- Faster training startup (no download time)
**Data Organization:** 📁 After manual download, organize the files according to the directory structure shown in the [Project Structure](#-project-structure) section.
### 📦 Additional Required Datasets and Weights
1. **Download [LibriSpeech ASR dataset](https://www.openslr.org/12/) for CSA pre-training**
```bash
# Create directory
mkdir -p data/audios
# Download and extract LibriSpeech
cd data/audios
# train sets - 960 hours
wget https://www.openslr.org/resources/12/train-other-500.tar.gz
wget https://www.openslr.org/resources/12/train-clean-360.tar.gz
wget https://www.openslr.org/resources/12/train-clean-100.tar.gz
# dev sets
wget https://www.openslr.org/resources/12/dev-other.tar.gz
wget https://www.openslr.org/resources/12/dev-clean.tar.gz
tar -xvzf train-other-500.tar.gz
tar -xvzf train-clean-360.tar.gz
tar -xvzf train-clean-100.tar.gz
tar -xvzf dev-other.tar.gz
tar -xvzf dev-clean.tar.gz
cd ../../
```
2. **Download [COCO images](https://cocodataset.org/#download)**
```bash
# Create directory
mkdir -p data/images
# Download and extract COCO train2014 images
cd data/images
wget http://images.cocodataset.org/zips/train2014.zip
unzip train2014.zip
rm train2014.zip
cd ../../
```
3. **Download pre-trained encoders**
```bash
# Create directory
mkdir -p data/weights
# Download BERT and Wav2Vec2 models
cd data/weights
git lfs install
git clone https://huggingface.co/facebook/wav2vec2-base
git clone https://huggingface.co/google-bert/bert-base-uncased
cd ../../
# Download the CSA pretrained speech encoder into data/weights
wget -P data/weights https://modelscope.cn/datasets/lihongh/CSRef_data/resolve/master/data/weights/CSA_speech_encoder.pth
# Download pretrained visual backbone CSPDarkNet
# following https://github.com/luogen1996/SimREC/blob/main/DATA_PRE_README.md#pretrained-weights
# or https://modelscope.cn/datasets/lihongh/CSRef_data/resolve/master/data/weights/cspdarknet_coco.pth
```
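After the downloads finish, a quick check that the expected files and folders exist can catch path mistakes before a long training run. The sketch below is an illustration, not a script shipped with CSRef: the path list is taken from the Project Structure section of this README, and the function name is hypothetical. The CSPDarkNet weight file is left out because the README does not state where it should be placed.

```python
from pathlib import Path

# Paths relative to the data root, as listed in the Project Structure section.
EXPECTED_PATHS = [
    "audios/LibriSpeech",
    "images/train2014",
    "weights/wav2vec2-base",
    "weights/bert-base-uncased",
    "weights/CSA_speech_encoder.pth",
]


def missing_data_paths(data_root="data", expected=EXPECTED_PATHS):
    """Return the expected entries that do not exist under `data_root`."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]


# Run from the repository root; an empty list means everything was found.
print(missing_data_paths("data"))
```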
## 🚀 Usage Examples
### 🏋️ Training
#### 🎯 CSA Stage Training
The CSA stage learns semantic alignment between speech and text modalities:
```bash
# Single GPU training
CUDA_VISIBLE_DEVICES=0 PORT=23450 bash tools/train_CSA.sh configs/csref_CSA_librispeech.py 1
# Multi-GPU training (4 GPUs)
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=23450 bash tools/train_CSA.sh configs/csref_CSA_librispeech.py 4
```
Key parameters:
- `CUDA_VISIBLE_DEVICES`: Specifies which GPUs to use
- `PORT`: Port number for distributed training
- `configs/csref_CSA_librispeech.py`: Configuration file for CSA stage
- `1` / `4` (last argument): Number of GPUs to use; it should match the number of device IDs listed in `CUDA_VISIBLE_DEVICES`
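The pieces of the launch command can be assembled programmatically so that the trailing GPU count cannot drift from the device list. The helper below is a hypothetical convenience, not part of the repository; it only builds the command string in the form shown above and does not run it.

```python
import shlex


def build_train_command(script, config, gpu_ids, port):
    """Compose a launch command whose GPU count is derived from the device list."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    devices = ",".join(str(gpu) for gpu in gpu_ids)
    return (
        f"CUDA_VISIBLE_DEVICES={devices} PORT={port} "
        f"bash {shlex.quote(script)} {shlex.quote(config)} {len(gpu_ids)}"
    )


command = build_train_command(
    "tools/train_CSA.sh", "configs/csref_CSA_librispeech.py", [0, 1, 2, 3], 23450
)
print(command)
```

The resulting string can be pasted into a shell or passed to a job scheduler.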
#### 🔍 SREC Stage Training
The SREC stage uses the trained speech encoder to perform referring expression comprehension:
```bash
# Single GPU training on RefCOCO_speech
CUDA_VISIBLE_DEVICES=0 PORT=23451 bash tools/train_speech.sh configs/csref_refcoco_speech.py 1
```
You can also train on other datasets by using different configuration files:
- `configs/csref_refcoco_speech.py` / `configs/csref_refcoco_speech_hf.py`: For RefCOCO_speech dataset
- `configs/csref_refcoco+_speech.py` / `configs/csref_refcoco+_speech_hf.py`: For RefCOCO+_speech dataset
- `configs/csref_refcocog_speech.py` / `configs/csref_refcocog_speech_hf.py`: For RefCOCOg_speech dataset
- `configs/csref_srefface.py` / `configs/csref_srefface_hf.py`: For SRRefFace dataset
- `configs/csref_srefface+.py` / `configs/csref_srefface+_hf.py`: For SRRefFace+ dataset
- `configs/csref_sreffaceg.py` / `configs/csref_sreffaceg_hf.py`: For SRRefFaceG dataset
**Note:** Use configuration files with `_hf` suffix for automatic Hugging Face dataset download, or without `_hf` suffix if you have manually downloaded and organized the data.
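The `_hf` naming rule can be captured in a small lookup, which helps when scripting runs over several datasets. This is a sketch under the assumption that the file names follow exactly the pattern listed above; the dictionary and function are hypothetical and not part of the CSRef code base.

```python
# Dataset name -> configuration file stem, copied from the list above.
DATASET_CONFIG_STEMS = {
    "RefCOCO_speech": "csref_refcoco_speech",
    "RefCOCO+_speech": "csref_refcoco+_speech",
    "RefCOCOg_speech": "csref_refcocog_speech",
    "SRRefFace": "csref_srefface",
    "SRRefFace+": "csref_srefface+",
    "SRRefFaceG": "csref_sreffaceg",
}


def config_path(dataset, use_hf):
    """Return the config path; `use_hf=True` selects the automatic-download variant."""
    if dataset not in DATASET_CONFIG_STEMS:
        raise KeyError(f"unknown dataset: {dataset!r}")
    suffix = "_hf" if use_hf else ""
    return f"configs/{DATASET_CONFIG_STEMS[dataset]}{suffix}.py"


print(config_path("RefCOCO+_speech", use_hf=True))
```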
### 📊 Evaluation
```bash
# Evaluate SREC model
# Using automatically downloaded Hugging Face datasets
CUDA_VISIBLE_DEVICES=0 PORT=23451 bash tools/eval_speech.sh configs/csref_refcoco_speech_hf.py 1 data/weights/csref_refcoco_speech.pth
# Using manually downloaded datasets
CUDA_VISIBLE_DEVICES=0 PORT=23451 bash tools/eval_speech.sh configs/csref_refcoco_speech.py 1 data/weights/csref_refcoco_speech.pth
```
**Note:** Make sure to use the corresponding configuration file (`_hf` or non-`_hf`) that matches your data preparation method.
## 📂 Project Structure
The CSRef project is organized as follows:
```
CSRef/
├── configs/                        # Configuration files
│   ├── csref_*.py                  # Main configuration files for different datasets
│   └── common/                     # Common configuration modules
│       ├── dataset_*.py            # Dataset configurations
│       ├── optim.py                # Optimizer configurations
│       ├── train.py                # Training configurations
│       └── models/                 # Model configurations
├── csref/                          # Core library code
│   ├── config/                     # Configuration management
│   ├── datasets/                   # Dataset handling
│   ├── layers/                     # Neural network layers
│   ├── models/                     # Model definitions
│   │   ├── backbones/              # Visual backbones
│   │   ├── heads/                  # Detection heads
│   │   ├── losses/                 # Loss functions
│   │   ├── speech_encoders/        # Speech encoders
│   │   ├── text_encoder/           # Text encoders
│   │   └── utils/                  # Model utilities
│   ├── scheduler/                  # Learning rate schedulers
│   └── utils/                      # Utility functions
├── tools/                          # Training and evaluation scripts
│   ├── train_*.py                  # Training scripts
│   ├── train_*.sh                  # Training shell scripts
│   ├── eval_*.py                   # Evaluation scripts
│   └── eval_*.sh                   # Evaluation shell scripts
├── data/                           # Data directory (to be created by user)
│   ├── audios/                     # Audio files
│   │   ├── LibriSpeech/
│   │   ├── refcoco_speech/
│   │   ├── refcoco+_speech/
│   │   └── refcocog_speech/
│   ├── images/                     # Image files
│   │   └── train2014/              # COCO train2014 images
│   ├── anns/                       # Annotation files
│   │   ├── general_object/         # General object annotations (RefCOCO/RefCOCO+/RefCOCOg)
│   │   └── face_centric/           # Face-centric annotations (SRRefFace series)
│   ├── hf_cache/                   # Hugging Face dataset cache (auto-created)
│   └── weights/                    # Pre-trained model weights
│       ├── wav2vec2-base/          # Wav2Vec2 base model
│       ├── bert-base-uncased/      # BERT base uncased model
│       ├── CSA_speech_encoder.pth  # Pre-trained CSA speech encoder
│       └── csref_*.pth             # Trained CSRef model weights (if downloaded manually)
├── requirements.txt                # Python dependencies
├── README.md                       # This file
└── .gitignore                      # Git ignore rules
```
## License Information
This project is licensed under the Apache-2.0 License - see the [LICENSE](LICENSE) file for details.
## Acknowledgement
Thanks a lot for the nicely organized code from the following repos:
- [SimREC](https://github.com/luogen1996/SimREC)
Provider:
maas
Created:
2025-08-23
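Summary
Background overview
This dataset is the data folder of the CSRef project, a multimodal framework for speech referring expression comprehension. The framework is trained in two stages: it first aligns speech with text semantics, then applies the result to speech-visual tasks. It supports multiple speech and visual encoders and targets scenarios such as human-computer interaction and robotic vision.
The summary above was collected and generated by the dataset listing site (遇见数据集).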



