
Cong123779/AI2Text-Bilingual-ASR-Dataset

Source: Hugging Face. Updated 2026-02-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Cong123779/AI2Text-Bilingual-ASR-Dataset
Resource description:
---
language:
- vi
- en
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
pretty_name: AI2Text Bilingual ASR Dataset (Vietnamese + English)
size_categories:
- 100K<n<1M
---

# AI2Text – Bilingual ASR Dataset

A large-scale bilingual (Vietnamese + English) speech dataset used to train the `Cong123779/AI2Text-Bilingual-ASR` model.

## Dataset Summary

| Split | Samples | Notes |
|-------|---------|-------|
| train | ~194,167 | 77% Vietnamese, 23% English |
| val | ~30,123 | held-out validation |

## Data Fields

Each `manifest.csv` has the following columns:

| Column | Description |
|--------|-------------|
| `id` | Unique sample identifier |
| `transcript` | Ground-truth text (prefixed with `<\|vi\|>` or `<\|en\|>`) |
| `audio_path` | Relative path to the `.wav` file |
| `duration` | Duration in seconds |
| `words_json` | JSON array of word-level timestamps |

## Audio Format

- Sample rate: **16,000 Hz**
- Channels: **Mono**
- Format: **WAV**

## Language Distribution

- **Vietnamese** (`<|vi|>` token): ~77%
- **English** (`<|en|>` token): ~23%

## Usage

```python
import pandas as pd
from datasets import load_dataset

# Load only the manifest CSVs (fast, no audio)
ds = load_dataset(
    "Cong123779/AI2Text-Bilingual-ASR-Dataset",
    data_files={"train": "train/manifest.csv", "val": "val/manifest.csv"},
)
print(ds)
```

## License

Creative Commons Attribution 4.0 (CC-BY 4.0)
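## Example: Reading a Manifest Row

The manifest columns described under Data Fields can be handled roughly as in the sketch below. It separates the `<|vi|>` / `<|en|>` prefix from the transcript text and decodes `words_json`. The helper names `split_language_token` and `parse_words` and the sample row are hypothetical, not part of the dataset. The layout of each `words_json` entry is not documented here, so the decoded value is returned unchanged.

```python
import json
import re

# Language tokens as listed in the Data Fields table.
_LANG_PREFIX = re.compile(r"^<\|(vi|en)\|>\s*")


def split_language_token(transcript):
    """Return (language, text); language is None when no known prefix is present."""
    match = _LANG_PREFIX.match(transcript)
    if match is None:
        return None, transcript
    return match.group(1), transcript[match.end():]


def parse_words(words_json):
    """Decode the words_json column; the per-entry layout is an unknown, so return it as-is."""
    return json.loads(words_json) if words_json else []


# Hypothetical row, for illustration only (not taken from the dataset).
row = {
    "id": "sample-0001",
    "transcript": "<|en|> hello world",
    "audio_path": "audio/sample-0001.wav",
    "duration": 1.2,
    "words_json": "[]",
}

lang, text = split_language_token(row["transcript"])
words = parse_words(row["words_json"])
print(lang, text, len(words))
```

Stripping the prefix before scoring or display keeps the language tag from being counted as a transcript token, while the returned language code can still be used to filter rows by language.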
Provider:
Cong123779