meetween/mumospee_v1_fix
Source: Hugging Face | Updated: 2026-03-10 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/meetween/mumospee_v1_fix
Description:
---
license: cc-by-4.0
---
## Dataset Statistics
> Generated from splits: `train`, `test`, `validation`
### Overview (All Splits Combined)
| Metric | Value |
|--------|-------|
| Total samples | 53,983,241 |
| Total audio duration | 121,957h 08m 34.5s (121,957.1 hours) |
| Average duration per sample | 8.13s |
| Avg transcript length | 16.5 words |
| Total parquet shards | 29 |
### Per-Split Overview
| Split | # Samples | Duration | Avg Duration | Avg Words | Shards |
|-------|----------:|---------:|-------------:|----------:|-------:|
| `train` | 53,319,102 | 120,878h 42m 30.2s (120,878.7h) | 8.16s | 16.5 | 27 |
| `test` | 341,118 | 547h 25m 08.3s (547.4h) | 5.78s | 10.3 | 1 |
| `validation` | 323,021 | 531h 00m 56.0s (531.0h) | 5.92s | 10.4 | 1 |
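
The combined figures in the overview can be cross-checked against the per-split rows. The following is a minimal sketch, not the script that generated these statistics: `parse_duration` is a hypothetical helper written for this example, and the sample counts and duration strings are copied from the table above.

```python
import re

# Sample counts and duration strings copied from the Per-Split Overview table.
SPLITS = {
    "train": (53_319_102, "120,878h 42m 30.2s"),
    "test": (341_118, "547h 25m 08.3s"),
    "validation": (323_021, "531h 00m 56.0s"),
}

# Hours and minutes are optional so that strings such as "0.00s" also parse.
_DURATION = re.compile(r"(?:([\d,]+)h\s+)?(?:(\d+)m\s+)?([\d.]+)s")


def parse_duration(text: str) -> float:
    """Convert a string such as '547h 25m 08.3s' to seconds."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = match.groups()
    hours_value = int(hours.replace(",", "")) if hours else 0
    minutes_value = int(minutes) if minutes else 0
    return hours_value * 3600 + minutes_value * 60 + float(seconds)


total_samples = sum(count for count, _ in SPLITS.values())
total_seconds = sum(parse_duration(text) for _, text in SPLITS.values())
average_seconds = total_seconds / total_samples

print(total_samples)                   # total number of samples
print(round(total_seconds / 3600, 1))  # total duration in hours
print(round(average_seconds, 2))       # average duration per sample, seconds
```

The three printed values correspond to the "Total samples", "Total audio duration", and "Average duration per sample" rows of the combined overview.
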
### Language Distribution
| Language | train samples | train % | test samples | test % | validation samples | validation % | Total samples | Total % | Total Duration | Total Dur % |
|-------|----------:|---------:|----------:|---------:|----------:|---------:|----------:|---------:|---------------:|------------:|
| `en` | 28,643,120 | 53.72% | 266,720 | 78.19% | 248,203 | 76.84% | 29,158,043 | 54.01% | 66,374h 54m 20.5s | 54.42% |
| `zh` | 19,969,319 | 37.45% | 0 | 0.00% | 0 | 0.00% | 19,969,319 | 36.99% | 49,922h 33m 08.9s | 40.93% |
| `ja` | 869,665 | 1.63% | 0 | 0.00% | 0 | 0.00% | 869,665 | 1.61% | 1,715h 27m 28.6s | 1.41% |
| `de` | 841,219 | 1.58% | 13,511 | 3.96% | 13,511 | 4.18% | 868,241 | 1.61% | 1,751h 41m 54.2s | 1.44% |
| `fr` | 777,904 | 1.46% | 14,760 | 4.33% | 14,760 | 4.57% | 807,424 | 1.50% | 1,607h 08m 32.1s | 1.32% |
| `es` | 165,080 | 0.31% | 13,221 | 3.88% | 13,221 | 4.09% | 191,522 | 0.35% | 141h 08m 41.8s | 0.12% |
| `it` | 123,812 | 0.23% | 8,183 | 2.40% | 8,940 | 2.77% | 140,935 | 0.26% | 56h 31m 19.1s | 0.05% |
| `cs` | 106,037 | 0.20% | 0 | 0.00% | 0 | 0.00% | 106,037 | 0.20% | 0.00s | 0.00% |
| `et` | 100,734 | 0.19% | 1,571 | 0.46% | 1,576 | 0.49% | 103,881 | 0.19% | 8h 57m 04.6s | 0.01% |
| `pl` | 102,193 | 0.19% | 0 | 0.00% | 0 | 0.00% | 102,193 | 0.19% | 0.00s | 0.00% |
| `sl` | 100,710 | 0.19% | 360 | 0.11% | 509 | 0.16% | 101,579 | 0.19% | 2h 35m 09.0s | 0.00% |
| `fi` | 100,236 | 0.19% | 0 | 0.00% | 0 | 0.00% | 100,236 | 0.19% | 0.00s | 0.00% |
| `sv` | 99,891 | 0.19% | 0 | 0.00% | 0 | 0.00% | 99,891 | 0.19% | 0.00s | 0.00% |
| `el` | 99,761 | 0.19% | 0 | 0.00% | 0 | 0.00% | 99,761 | 0.18% | 0.00s | 0.00% |
| `pt` | 99,487 | 0.19% | 0 | 0.00% | 0 | 0.00% | 99,487 | 0.18% | 0.00s | 0.00% |
| `ro` | 99,411 | 0.19% | 0 | 0.00% | 0 | 0.00% | 99,411 | 0.18% | 0.00s | 0.00% |
| `nl` | 99,400 | 0.19% | 0 | 0.00% | 0 | 0.00% | 99,400 | 0.18% | 0.00s | 0.00% |
| `hu` | 99,143 | 0.19% | 0 | 0.00% | 0 | 0.00% | 99,143 | 0.18% | 0.00s | 0.00% |
| `lt` | 99,078 | 0.19% | 0 | 0.00% | 0 | 0.00% | 99,078 | 0.18% | 0.00s | 0.00% |
| `da` | 98,868 | 0.19% | 0 | 0.00% | 0 | 0.00% | 98,868 | 0.18% | 0.00s | 0.00% |
| `hr` | 97,028 | 0.18% | 0 | 0.00% | 0 | 0.00% | 97,028 | 0.18% | 0.00s | 0.00% |
| `lv` | 92,504 | 0.17% | 1,629 | 0.48% | 1,125 | 0.35% | 95,258 | 0.18% | 4h 55m 19.3s | 0.00% |
| `mt` | 94,360 | 0.18% | 0 | 0.00% | 0 | 0.00% | 94,360 | 0.17% | 0.00s | 0.00% |
| `sk` | 92,345 | 0.17% | 0 | 0.00% | 0 | 0.00% | 92,345 | 0.17% | 0.00s | 0.00% |
| `ko` | 92,184 | 0.17% | 0 | 0.00% | 0 | 0.00% | 92,184 | 0.17% | 217h 09m 58.0s | 0.18% |
| `bg` | 89,209 | 0.17% | 0 | 0.00% | 0 | 0.00% | 89,209 | 0.17% | 0.00s | 0.00% |
| `ca` | 54,255 | 0.10% | 12,730 | 3.73% | 12,730 | 3.94% | 79,715 | 0.15% | 119h 48m 09.3s | 0.10% |
| `fa` | 4,348 | 0.01% | 3,445 | 1.01% | 3,445 | 1.07% | 11,238 | 0.02% | 14h 20m 32.8s | 0.01% |
| `ar` | 2,776 | 0.01% | 1,695 | 0.50% | 1,758 | 0.54% | 6,229 | 0.01% | 5h 35m 01.6s | 0.00% |
| `mn` | 2,018 | 0.00% | 1,759 | 0.52% | 1,761 | 0.55% | 5,538 | 0.01% | 8h 21m 37.1s | 0.01% |
| `id` | 1,243 | 0.00% | 844 | 0.25% | 792 | 0.25% | 2,879 | 0.01% | 2h 58m 58.8s | 0.00% |
| `cy` | 763 | 0.00% | 690 | 0.20% | 690 | 0.21% | 2,143 | 0.00% | 3h 01m 18.6s | 0.00% |
| `nn` | 426 | 0.00% | 0 | 0.00% | 0 | 0.00% | 426 | 0.00% | 0.00s | 0.00% |
| `la` | 289 | 0.00% | 0 | 0.00% | 0 | 0.00% | 289 | 0.00% | 0.00s | 0.00% |
| `ru` | 113 | 0.00% | 0 | 0.00% | 0 | 0.00% | 113 | 0.00% | 0.00s | 0.00% |
| `he` | 66 | 0.00% | 0 | 0.00% | 0 | 0.00% | 66 | 0.00% | 0.00s | 0.00% |
| `sq` | 40 | 0.00% | 0 | 0.00% | 0 | 0.00% | 40 | 0.00% | 0.00s | 0.00% |
| `tr` | 35 | 0.00% | 0 | 0.00% | 0 | 0.00% | 35 | 0.00% | 0.00s | 0.00% |
| `gl` | 15 | 0.00% | 0 | 0.00% | 0 | 0.00% | 15 | 0.00% | 0.00s | 0.00% |
| `uk` | 10 | 0.00% | 0 | 0.00% | 0 | 0.00% | 10 | 0.00% | 0.00s | 0.00% |
| `af` | 2 | 0.00% | 0 | 0.00% | 0 | 0.00% | 2 | 0.00% | 0.00s | 0.00% |
| `jw` | 1 | 0.00% | 0 | 0.00% | 0 | 0.00% | 1 | 0.00% | 0.00s | 0.00% |
| `ur` | 1 | 0.00% | 0 | 0.00% | 0 | 0.00% | 1 | 0.00% | 0.00s | 0.00% |
| `sr` | 1 | 0.00% | 0 | 0.00% | 0 | 0.00% | 1 | 0.00% | 0.00s | 0.00% |
| `hy` | 1 | 0.00% | 0 | 0.00% | 0 | 0.00% | 1 | 0.00% | 0.00s | 0.00% |
| `no` | 1 | 0.00% | 0 | 0.00% | 0 | 0.00% | 1 | 0.00% | 0.00s | 0.00% |
### Tag / Source Distribution
| Tag / Source | train samples | train % | test samples | test % | validation samples | validation % | Total samples | Total % | Total Duration | Total Dur % |
|-------|----------:|---------:|----------:|---------:|----------:|---------:|----------:|---------:|---------------:|------------:|
| `Emilia` | 40,237,834 | 75.47% | 0 | 0.00% | 0 | 0.00% | 40,237,834 | 74.54% | 101,585h 04m 02.8s | 83.30% |
| `GigaSpeech` | 5,053,116 | 9.48% | 0 | 0.00% | 0 | 0.00% | 5,053,116 | 9.36% | 6,297h 24m 07.6s | 5.16% |
| `CoVoST` | 3,591,777 | 6.74% | 290,706 | 85.22% | 288,492 | 89.31% | 4,170,975 | 7.73% | 6,519h 01m 42.7s | 5.35% |
| `MOSEL` | 2,300,046 | 4.31% | 0 | 0.00% | 0 | 0.00% | 2,300,046 | 4.26% | 0.00s | 0.00% |
| `PeopleSpeech` | 1,501,271 | 2.82% | 34,898 | 10.23% | 18,622 | 5.76% | 1,554,791 | 2.88% | 5,987h 42m 22.5s | 4.91% |
| `LibriTTS` | 353,817 | 0.66% | 9,955 | 2.92% | 10,340 | 3.20% | 374,112 | 0.69% | 585h 37m 48.6s | 0.48% |
| `Librispeech` | 281,241 | 0.53% | 5,559 | 1.63% | 5,567 | 1.72% | 292,367 | 0.54% | 982h 18m 30.3s | 0.81% |
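
`MOSEL` contributes 2,300,046 samples but a total duration of 0.00s, so those rows lower any average taken over all samples. As a rough sketch, the arithmetic below compares the average over every sample with the average over non-`MOSEL` samples only. It assumes the 0.00s figure means that no duration was recorded for `MOSEL` rows and that they are the only such rows; neither is confirmed by the card.

```python
# Figures copied from the Overview and Tag / Source Distribution tables.
TOTAL_SAMPLES = 53_983_241
TOTAL_SECONDS = 121_957 * 3600 + 8 * 60 + 34.5  # 121,957h 08m 34.5s
MOSEL_SAMPLES = 2_300_046  # listed with a total duration of 0.00s

# Assumption: only MOSEL rows lack a recorded duration.
timed_samples = TOTAL_SAMPLES - MOSEL_SAMPLES

avg_all = TOTAL_SECONDS / TOTAL_SAMPLES
avg_timed = TOTAL_SECONDS / timed_samples

print(f"{avg_all:.2f}s per sample over all samples")
print(f"{avg_timed:.2f}s per sample excluding MOSEL")
```
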
### License Distribution
| License | train samples | train % | test samples | test % | validation samples | validation % | Total samples | Total % |
|-------|----------:|---------:|----------:|---------:|----------:|---------:|----------:|---------:|
| `CC-BY-4.0` | 43,172,938 | 80.97% | 15,514 | 4.55% | 15,907 | 4.92% | 43,204,359 | 80.03% |
| `unknown` | 5,053,116 | 9.48% | 0 | 0.00% | 0 | 0.00% | 5,053,116 | 9.36% |
| `CC0` | 3,591,777 | 6.74% | 290,706 | 85.22% | 288,492 | 89.31% | 4,170,975 | 7.73% |
| `CC-BY;CC-BY-SA` | 1,501,271 | 2.82% | 34,898 | 10.23% | 18,622 | 5.76% | 1,554,791 | 2.88% |
### Example usage
```python
# pip install datasets
from datasets import load_dataset

# Repository ID as listed at the top of this page.
REPO_ID = "meetween/mumospee_v1_fix"

# ── Load all splits at once ───────────────────────────────────────────────────
dataset = load_dataset(REPO_ID)
print(dataset)
# DatasetDict({
# train: Dataset({features: [...], num_rows: ...})
# test: Dataset({features: [...], num_rows: ...})
# validation: Dataset({features: [...], num_rows: ...})
# })

# ── Load a specific split ─────────────────────────────────────────────────────
train_data = load_dataset(REPO_ID, split="train")
test_data = load_dataset(REPO_ID, split="test")
validation_data = load_dataset(REPO_ID, split="validation")
```
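
With roughly 54 million samples, downloading every shard before use may be impractical. The sketch below shows one possible alternative: the streaming mode of `datasets`, which iterates over rows without materialising the full split. `stream_split` and `filter_rows` are hypothetical helpers written for this example, and the column name `language` in the usage comment is an assumption, since the card does not list the dataset's features.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator

# Repository ID as listed at the top of this page.
REPO_ID = "meetween/mumospee_v1_fix"


def filter_rows(
    rows: Iterable[Dict[str, Any]], column: str, wanted: Any
) -> Iterator[Dict[str, Any]]:
    """Lazily yield the rows whose `column` value equals `wanted`."""
    for row in rows:
        if row.get(column) == wanted:
            yield row


def stream_split(split: str = "train") -> Iterable[Dict[str, Any]]:
    """Open one split in streaming mode (requires `pip install datasets`)."""
    from datasets import load_dataset  # imported lazily; needs network access

    return load_dataset(REPO_ID, split=split, streaming=True)


# Example usage (the column name "language" is an assumption):
#
#     german = filter_rows(stream_split("test"), "language", "de")
#     for row in islice(german, 3):
#         print(row)
```

Because both helpers are lazy, `islice` stops the iteration after the requested number of matches instead of scanning the whole split.
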
### Notes
- `train`: 0 rows with unparseable duration (excluded from duration stats)
- `test`: 0 rows with unparseable duration (excluded from duration stats)
- `validation`: 0 rows with unparseable duration (excluded from duration stats)
- Stats generated in 370.7s total
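
The notes above say that rows with an unparseable duration are left out of the duration statistics. The script that produced the tables is not shown, so the following is only a sketch of how such an exclusion could work. `to_seconds`, `summarise`, and `format_duration` are hypothetical names, and treating negative or non-finite values as unparseable is an assumption.

```python
import math
from typing import Iterable, Optional, Tuple


def to_seconds(value: object) -> Optional[float]:
    """Return a non-negative, finite duration in seconds, or None."""
    try:
        seconds = float(value)
    except (TypeError, ValueError):
        return None
    if not math.isfinite(seconds) or seconds < 0:
        return None  # assumption: treated the same as unparseable
    return seconds


def summarise(durations: Iterable[object]) -> Tuple[float, int, int]:
    """Return (total seconds, parsed rows, unparseable rows)."""
    total, parsed, skipped = 0.0, 0, 0
    for raw in durations:
        seconds = to_seconds(raw)
        if seconds is None:
            skipped += 1  # excluded from duration statistics
        else:
            total += seconds
            parsed += 1
    return total, parsed, skipped


def format_duration(seconds: float) -> str:
    """Format seconds in the hours-minutes-seconds style of the tables."""
    seconds = round(seconds, 1)  # avoid rendering 59.96s as "60.0s"
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{int(hours):,}h {int(minutes):02d}m {secs:04.1f}s"


total, parsed, skipped = summarise(["1.5", 2, None, "n/a"])
print(total, parsed, skipped)
print(format_duration(1_970_708.3))
```
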
Provider:
meetween



