VINAY-UMRETHE/Trivenika
Hugging Face. Updated 2026-03-06; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/VINAY-UMRETHE/Trivenika
---
configs:
- config_name: filtered
  data_files:
  - data/OmniBench/*.parquet
  - data/LinxyLatexOCR/*.parquet
  - data/UnslothLatexOCR/*.parquet
  - data/MathVision/*.parquet
  - data/WeMath/*.parquet
  - data/FoodCaptioned/*.parquet
  - data/PokemonBlipCaptioned/*.parquet
  - data/PokemonInfo/*.parquet
  - data/PokemonCards/*.parquet
  - data/Tiny_Stories/*.parquet
  - data/Tiny_Stories_Igbo/*.parquet
  - data/ToneBooks/*.parquet
  - data/EnglishDialects/*.parquet
  - data/Elise/*.parquet
  - data/Indian_Hindi/*.parquet
  - data/Indian_Marathi/*.parquet
  - data/SpectoGram_Captioned/*.parquet
  - data/Silvar_Med/*.parquet
  - data/MilitaryImages/*.parquet
  - data/BhojpuriASR/*.parquet
  - data/IndoAryanSinhalaASR/*.parquet
  - data/KashmiriArabicASR/*.parquet
  - data/AngikaDevanagariASR/*.parquet
  - data/ArabicAudio/*.parquet
  - data/MilitaryAircraftCaptioned/*.parquet
  - data/NonverbalTTS/*.parquet
  - data/CocoSmall/*.parquet
  - data/CelebrityFaces/*.parquet
  - data/CelebaCaptions/*.parquet
- config_name: OmniBench
  data_files: data/OmniBench/*.parquet
- config_name: LinxyLatexOCR
  data_files: data/LinxyLatexOCR/*.parquet
- config_name: UnslothLatexOCR
  data_files: data/UnslothLatexOCR/*.parquet
- config_name: MathVision
  data_files: data/MathVision/*.parquet
- config_name: WeMath
  data_files: data/WeMath/*.parquet
- config_name: FoodCaptioned
  data_files: data/FoodCaptioned/*.parquet
- config_name: PokemonBlipCaptioned
  data_files: data/PokemonBlipCaptioned/*.parquet
- config_name: PokemonInfo
  data_files: data/PokemonInfo/*.parquet
- config_name: PokemonCards
  data_files: data/PokemonCards/*.parquet
- config_name: Tiny_Stories
  data_files: data/Tiny_Stories/*.parquet
- config_name: Tiny_Stories_Igbo
  data_files: data/Tiny_Stories_Igbo/*.parquet
- config_name: ToneBooks
  data_files: data/ToneBooks/*.parquet
- config_name: EnglishDialects
  data_files: data/EnglishDialects/*.parquet
- config_name: Elise
  data_files: data/Elise/*.parquet
- config_name: Indian_Hindi
  data_files: data/Indian_Hindi/*.parquet
- config_name: Indian_Marathi
  data_files: data/Indian_Marathi/*.parquet
- config_name: SpectoGram_Captioned
  data_files: data/SpectoGram_Captioned/*.parquet
- config_name: Silvar_Med
  data_files: data/Silvar_Med/*.parquet
- config_name: MilitaryImages
  data_files: data/MilitaryImages/*.parquet
- config_name: BhojpuriASR
  data_files: data/BhojpuriASR/*.parquet
- config_name: IndoAryanSinhalaASR
  data_files: data/IndoAryanSinhalaASR/*.parquet
- config_name: KashmiriArabicASR
  data_files: data/KashmiriArabicASR/*.parquet
- config_name: AngikaDevanagariASR
  data_files: data/AngikaDevanagariASR/*.parquet
- config_name: ArabicAudio
  data_files: data/ArabicAudio/*.parquet
- config_name: MilitaryAircraftCaptioned
  data_files: data/MilitaryAircraftCaptioned/*.parquet
- config_name: NonverbalTTS
  data_files: data/NonverbalTTS/*.parquet
- config_name: CocoSmall
  data_files: data/CocoSmall/*.parquet
- config_name: CelebrityFaces
  data_files: data/CelebrityFaces/*.parquet
- config_name: CelebaCaptions
  data_files: data/CelebaCaptions/*.parquet
- config_name: NSFW1
  data_files: data/NSFW1/*.parquet
- config_name: NSFW2
  data_files: data/NSFW2/*.parquet
language:
- hi
- mr
- ks
- en
- ig
- ar
tags:
- trivenika
- text
- image
- audio
- classification
- ocr
- asr
- math
license: odc-by
---
# Trivenika: Stream of Audio-Visual & Language Data
<p align="center">
<img src="https://huggingface.co/datasets/VINAY-UMRETHE/Trivenika/resolve/main/assets/Trivenika.png" alt="Trivenika Logo" width="100%"/>
</p>
<p align="center">
<img src="https://img.shields.io/badge/Size-384K-green?style=for-the-badge">
<img src="https://img.shields.io/badge/Dataset-Trivenika-blue?style=for-the-badge">
</p>
A unified multimodal dataset combining **image**, **audio**, and **text** from diverse public sources.
---
## Quick Start
This dataset uses **configurations** (subsets) to manage its diverse data sources. You can load a single subset, or load the aggregate `filtered` configuration, which leaves out the NSFW portions.
### 1. Load the "filtered" Subset (Recommended)
This configuration automatically aggregates all 29 safe subsets, **excluding** NSFW content.
```python
from datasets import load_dataset
# Loads ~384k samples (Images + Audio + Text) skipping NSFW content
dataset = load_dataset("VINAY-UMRETHE/Trivenika", "filtered", split="train")
print(dataset[0])
```
### 2. Load a Specific Subset
If you only need a specific domain (e.g., Hindi Audio or Math Vision), load just that subset.
```python
from datasets import load_dataset

# Load only the Hindi ASR data
hindi_data = load_dataset("VINAY-UMRETHE/Trivenika", "Indian_Hindi", split="train")
# Load only the Math Vision data
math_data = load_dataset("VINAY-UMRETHE/Trivenika", "MathVision", split="train")
```
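When only a few rows are needed for inspection, streaming can avoid downloading a whole subset first. The following is a minimal sketch, assuming the `streaming=True` mode of `load_dataset` works for this repository (not verified here). `first_n` and `preview_subset` are hypothetical helper names and are not part of the dataset.

```python
from itertools import islice


def first_n(rows, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(rows, n))


def preview_subset(config_name, n=3):
    """Stream one subset and return its first n rows (needs network access)."""
    # Imported lazily so first_n stays usable without the datasets library.
    from datasets import load_dataset

    stream = load_dataset(
        "VINAY-UMRETHE/Trivenika", config_name, split="train", streaming=True
    )
    return first_n(stream, n)
```

For example, `preview_subset("MathVision")` would return up to three rows of that subset as dictionaries.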
---
## Overview
**Trivenika** is a curated multimodal dataset built by harmonizing **31 source datasets** across multiple domains including mathematics, OCR, celebrity recognition, food captioning, Pokémon analysis, and general image understanding.
### Primary Use Cases
* Fine-tuning **vision and audio projectors**
* Merging **modality-specific encoders** with base LLMs
* Training models for **Image + Audio → Text** capabilities
* **Tasks:** OCR, ASR, VQA, Math Reasoning, Safety Filtering, NSFW understanding.
### Key Statistics
| Modality | Count |
| --- | --- |
| Images | ~268K (268,026) |
| Audio Files | ~119K (118,788) |
| **Total Entries** | **383,816** |
---
### Available Subsets
| Config Name | Content Description |
| --- | --- |
| **`filtered`** | **(Virtual-Config)** Combines all safe subsets below. |
| `OmniBench` | General multimodal benchmark data |
| `LinxyLatexOCR` | LaTeX OCR images |
| `MathVision` | Visual mathematical problems |
| `Indian_Hindi` | Hindi speech recognition (ASR) |
| `FoodCaptioned` | Food images with descriptions |
| `PokemonCards` | Pokémon card scans and stats |
| `ToneBooks` | Audiobooks with tonal analysis |
| ... | *And More...* |
| `NSFW1` / `NSFW2` | **(Excluded from `filtered`)** |
### Schema
Every subset follows this schema:
* `id` (string): Unique identifier.
* `image` (image): PIL-decodable image object (or `None`).
* `audio` (audio): Audio data (or `None`).
* `text` (string): Text caption, transcription, or OCR output.
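
Because `image` and `audio` may each be `None`, code that consumes a row usually has to branch on which modalities are present. The following is a minimal sketch, assuming rows arrive as plain dictionaries keyed by the field names above. `present_modalities` is a hypothetical helper, and the sample row is hand-written rather than taken from the dataset.

```python
def present_modalities(row):
    """List which of the schema fields carry data in a single row."""
    modalities = []
    if row.get("image") is not None:
        modalities.append("image")
    if row.get("audio") is not None:
        modalities.append("audio")
    # Treat a missing or empty string as "no text".
    if row.get("text"):
        modalities.append("text")
    return modalities


# Hand-written example row (not real dataset content).
sample = {"id": "demo-0", "image": None, "audio": None, "text": "x^2 + y^2 = z^2"}
print(present_modalities(sample))  # prints ['text']
```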
---
## Source Datasets & Provenance
We aggregate and restructure data from trusted public repositories. All individual licenses apply.
| # | Dataset | Purpose | Size | Link |
|---|--------|--------|------|------|
| 1 | `theneuralmaze/celebrity_faces` | Celebrity face images | 3,000 | [Link](https://huggingface.co/datasets/theneuralmaze/celebrity_faces) |
| 2 | `irodkin/celeba_with_llava_captions` | CelebA with LLaVA-generated captions | 36,646 | [Link](https://huggingface.co/datasets/irodkin/celeba_with_llava_captions) |
| 3 | `DRDELATV/SHORT_NSFW` | Short NSFW image-text pairs | 188 | [Link](https://huggingface.co/datasets/DRDELATV/SHORT_NSFW) |
| 4 | `DRDELATV/NSFW_LP` | NSFW labeled prompts/images | 124 | [Link](https://huggingface.co/datasets/DRDELATV/NSFW_LP) |
| 5 | `RIW/small-coco-wm_50` | Filtered COCO subset | 23,716 | [Link](https://huggingface.co/datasets/RIW/small-coco-wm_50) |
| 6 | `linxy/LaTeX_OCR` | Synthetic LaTeX equations + images | 94,236 | [Link](https://huggingface.co/datasets/linxy/LaTeX_OCR) |
| 7 | `unsloth/LaTeX_OCR` | High-quality LaTeX OCR data | 68,686 | [Link](https://huggingface.co/datasets/unsloth/LaTeX_OCR) |
| 8 | `MathLLMs/MathVision` | Mathematical visual problems | 3,344 | [Link](https://huggingface.co/datasets/MathLLMs/MathVision) |
| 9 | `We-Math/We-Math` | General math problem dataset | 1,740 | [Link](https://huggingface.co/datasets/We-Math/We-Math) |
| 10 | `SPRINGLab/IndicTTS-Hindi` | ASR | 11,825 | [Link](https://huggingface.co/datasets/SPRINGLab/IndicTTS-Hindi) |
| 11 | `SPRINGLab/IndicTTS_Marathi` | ASR | 10,939 | [Link](https://huggingface.co/datasets/SPRINGLab/IndicTTS_Marathi) |
| 12 | `MrDragonFox/Elise` | ASR | 1,195 | [Link](https://huggingface.co/datasets/MrDragonFox/Elise) |
| 13 | `Vikhrmodels/ToneBooks` | ASR / description | 45,989 | [Link](https://huggingface.co/datasets/Vikhrmodels/ToneBooks) |
| 14 | `vucinatim/spectrogram-captions` | Spectrogram captioning | 1,000 | [Link](https://huggingface.co/datasets/vucinatim/spectrogram-captions) |
| 15 | `Hanhpt23/Silvar-Med` | Visual medical analysis | 856 | [Link](https://huggingface.co/datasets/Hanhpt23/Silvar-Med) |
| 16 | `facebook/omnilingual-asr-corpus` | ASR | 3,477 | [Link](https://huggingface.co/datasets/facebook/omnilingual-asr-corpus) |
| 17 | `mehul7/captioned_military_aircraft` | Military aircraft captioning | 4,865 | [Link](https://huggingface.co/datasets/mehul7/captioned_military_aircraft) |
| 18 | `SinclairSchneider/military_images` | Military personnel images | 1,502 | [Link](https://huggingface.co/datasets/SinclairSchneider/military_images) |
| ... | *(Additional sources include Pokémon, food captioning, etc.)* | | | |
---
## Subjects & Tasks Covered
| Subject | Task Type |
|--------|-----------|
| **Celebrity Recognition** | Face Classification |
| **Image Captioning (Celeb)** | Vision-to-Text |
| **NSFW Detection** | Classification, Understanding |
| **General Image Understanding** | Captioning, Object Detection |
| **LaTeX OCR** | Formula Recognition, OCR |
| **Mathematical Reasoning** | Visual Math Problems |
| **Math SFT Data** | Step-by-step Math Solutions |
| **Pokémon** | Captioning, Identification, Classification |
| **Food** | Image Captioning & Identification |
| **Speech Recognition & Generation** | Audio Captioning |
> ✅ All datasets are publicly accessible.
---
## Ethical Considerations & Warnings
**Contains Potentially Sensitive Content**
- Includes **NSFW material** (`NSFW1`, `NSFW2`)
- Not suitable for child-safe applications without filtering
- Use `filtered` to exclude NSFW content
- Apply strict content moderation pipelines in production
- Comply with local regulations regarding adult content and facial recognition
**Tip**: The `filtered` configuration **excludes** NSFW samples and should be used for safety-conscious applications.
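
To build a custom mix instead of using `filtered`, the configuration names can be screened before loading. The sketch below assumes the NSFW subsets keep the `NSFW` name prefix used in this card. `safe_config_names` is a hypothetical helper, and screening by name is not a substitute for a content moderation pipeline.

```python
def safe_config_names(config_names):
    """Drop NSFW subsets and the aggregate 'filtered' config from a name list."""
    return [
        name
        for name in config_names
        # Assumption: every NSFW subset is named with an "NSFW" prefix.
        if not name.upper().startswith("NSFW") and name != "filtered"
    ]


names = ["filtered", "OmniBench", "MathVision", "NSFW1", "NSFW2"]
print(safe_config_names(names))  # prints ['OmniBench', 'MathVision']
```

In practice the input list could come from `datasets.get_dataset_config_names("VINAY-UMRETHE/Trivenika")`, which requires network access.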
---
## Tips for Model Training
This dataset is intended for **fine-tuning multimodal projectors** in architectures such as LLaVA, Gemma-3n, and LFM2.
### Recommended Encoder Pairings
| Model | Vision Encoder | Audio Encoder |
|------|----------------|---------------|
| **(Any Text-Generation Model)** | `timm/mobilenetv5_300m.gemma3n` | `n0mad-0/gemma3n-usm-rip` USM |
---
## License Summary
| Component | License |
| --- | --- |
| Original Public Datasets | Varies (MIT, Apache 2.0, CC-BY, etc.) |
**Note:** By using this dataset, you agree to comply with the licenses of the original source datasets found in the provenance table.
**Not licensed for commercial redistribution** without verifying compliance with each component’s licensing terms.
The dataset as a whole is released under the [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/) license.
---
## Citation
If you use this dataset in your research, please cite:
```bibtex
@misc{vinayumrethetrivenika2026,
author = {Vinay Umrethe},
title = {Trivenika: Stream of Audio-Visual \& Language Data},
year = {2026},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/datasets/VINAY-UMRETHE/Trivenika}}
}
```
Provider:
VINAY-UMRETHE



