Cong123779/AI2Text-Bilingual-ASR-Dataset

Name: Cong123779/AI2Text-Bilingual-ASR-Dataset
Creator: Cong123779
Published: 2026-02-23 20:51:34
License: CC-BY 4.0

Source: Hugging Face. Updated 2026-02-23; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Cong123779/AI2Text-Bilingual-ASR-Dataset

Resource description:

---
language:
- vi
- en
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
pretty_name: AI2Text Bilingual ASR Dataset (Vietnamese + English)
size_categories:
- 100K<n<1M
---

# AI2Text – Bilingual ASR Dataset

A large-scale bilingual (Vietnamese + English) speech dataset used to train the `Cong123779/AI2Text-Bilingual-ASR` model.

## Dataset Summary

| Split | Samples | Notes |
|-------|---------|-------|
| train | ~194,167 | 77% Vietnamese, 23% English |
| val | ~30,123 | held-out validation |

## Data Fields

Each `manifest.csv` has the following columns:

| Column | Description |
|--------|-------------|
| `id` | Unique sample identifier |
| `transcript` | Ground-truth text (prefixed with `<\|vi\|>` or `<\|en\|>`) |
| `audio_path` | Relative path to the `.wav` file |
| `duration` | Duration in seconds |
| `words_json` | JSON array of word-level timestamps |

## Audio Format

- Sample rate: **16,000 Hz**
- Channels: **Mono**
- Format: **WAV**

## Language Distribution

- **Vietnamese** (`<|vi|>` token): ~77%
- **English** (`<|en|>` token): ~23%

## Usage

```python
import pandas as pd
from datasets import load_dataset

# Load only the manifest CSVs (fast, no audio)
ds = load_dataset(
    "Cong123779/AI2Text-Bilingual-ASR-Dataset",
    data_files={"train": "train/manifest.csv", "val": "val/manifest.csv"},
)
print(ds)
```

## License

Creative Commons Attribution 4.0 (CC-BY 4.0)
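The manifest columns described in the dataset card can be normalized before training. The following is a minimal sketch, not the author's own preprocessing: it separates the `<|vi|>` / `<|en|>` prefix from the transcript and decodes `words_json`. The helper names `split_language_tag` and `parse_manifest_row` are hypothetical, the example row is invented, and the per-word schema inside `words_json` is not documented in the card, so the decoded JSON is passed through unchanged.

```python
import json
import re

# Language tokens that, per the dataset card, prefix every transcript.
_LANG_TAG = re.compile(r"^<\|(vi|en)\|>\s*")


def split_language_tag(transcript):
    """Return (language, text); language is None when no tag is present."""
    match = _LANG_TAG.match(transcript)
    if match is None:
        return None, transcript
    return match.group(1), transcript[match.end():]


def parse_manifest_row(row):
    """Normalize one manifest row given as a dict keyed by the documented columns."""
    language, text = split_language_tag(row["transcript"])
    return {
        "id": row["id"],
        "language": language,
        "text": text,
        "audio_path": row["audio_path"],
        "duration": float(row["duration"]),
        # Assumption: words_json is valid JSON. Its per-word keys are not
        # documented, so the decoded value is kept as-is.
        "words": json.loads(row["words_json"]),
    }


# Invented row for illustration only; the values and the "word" key
# are assumptions, not taken from the dataset.
example_row = {
    "id": "sample-0001",
    "transcript": "<|en|> hello world",
    "audio_path": "audio/sample-0001.wav",
    "duration": "1.25",
    "words_json": '[{"word": "hello"}, {"word": "world"}]',
}

parsed = parse_manifest_row(example_row)
print(parsed["language"], parsed["text"], parsed["duration"])
```

Keeping the language as a separate field makes it straightforward to check the stated 77% / 23% split or to filter one language, while the original tagged transcript remains available in the manifest for models that expect the token in the target text.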

Provider:

Cong123779
