NoeFlandre/herero-dataset

Name: NoeFlandre/herero-dataset
Creator: NoeFlandre
Published: 2026-03-26 08:09:55
License: No description available

Source: Hugging Face. Updated: 2026-03-26. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/NoeFlandre/herero-dataset

Resource description:

---
license: other
license_name: multiple-licenses
pretty_name: Herero Language Dataset
language:
- hz
task_categories:
- text-classification
---

# Herero Language Dataset

[![GitHub Repository](https://img.shields.io/badge/Code-GitHub-blue)](https://github.com/NoeFlandre/herero-dataset)
[![HuggingFace Dataset](https://img.shields.io/badge/Dataset-HuggingFace-blue)](https://huggingface.co/datasets/NoeFlandre/herero-dataset)

A **multi-license collection** of [Herero (Otjiherero)](https://en.wikipedia.org/wiki/Herero_language) text for NLP research. This is **not** a derivative work: sources are aggregated with original licenses preserved.

> **📦 Data:** This page (download Parquet files here)
> **💻 Code:** [GitHub Repository](https://github.com/NoeFlandre/herero-dataset) (scrapers, scripts, and documentation)

**⚠️ WARNING: This repository is a redistribution of materials from multiple upstream sources under different licenses. Each example remains subject to its original source license. Users are responsible for checking per-row licensing before reuse.**

## Dataset Summary

| Metric | Value |
|--------|-------|
| Documents | 1,951 |
| Words | 870,941 |
| Sources | 7 |
| Language | Herero (hz) |

## Sources

| Source | Docs | Words | License |
|--------|-----:|------:|---------|
| FineWeb-2 | 1,272 | 672,378 | ODC-By 1.0 |
| Omnilingual ASR | 437 | 60,485 | CC BY 4.0 |
| GlotCC-V1 | 20 | 75,982 | CC BY 4.0 |
| Herero Bible (1849) | 81 | 26,807 | Public Domain |
| Storybooks Namibia | 46 | 13,246 | CC BY 4.0 |
| FinePDFs | 25 | 17,800 | ODC-By 1.0 |
| Wikipedia Incubator | 70 | 4,243 | CC BY-SA 3.0 |

## Data Splits

| Split | Documents | % |
|-------|-----------|---|
| train | 1,761 | 90% |
| validation | 95 | 5% |
| test | 95 | 5% |

## Schema

Each row includes provenance columns:

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique document ID |
| `text` | string | Herero text (NFC normalized) |
| `source` | string | Data source name |
| `original_license` | string | Source license (check this!) |
| `original_url` | string | Link to original source |
| `changes_made` | string | Transformations applied |
| `license_type` | string | License category |
| `url` | string | Source URL |
| `word_count` | int | Word count |
| `char_count` | int | Character count |

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("NoeFlandre/herero-dataset")

# Check per-row license BEFORE use
for example in dataset["train"]:
    print(f"License: {example['original_license']} - {example['source']}")
```

## Collection Classification

This dataset is a **collection**, not an adaptation:

- Text is aggregated from sources without material modification
- Original text is preserved (only formatting normalized)
- No translation, rewriting, or merging across sources
- Source boundaries maintained via `source` column

## Quality Assurance

- **Deduplication**: Within-source SHA-256 content hashing
- **Normalization**: Unicode NFC normalization (no text rewriting)
- **Filtering**: Min-length 50 chars
- **Reproducibility**: Fixed random seed (42)

## Known Limitations

- ~871K words is moderate for low-resource language research
- FineWeb-2 dominates (~77% of content)
- Historical text uses 19th-century orthography
- Speech transcripts may contain disfluencies

## Ethical Considerations

- No personal data included
- All content publicly available at time of scraping
- Attribution preserved via `original_url` column
- Users must respect individual source licenses

## Licensing Summary

| License Type | Sources | Requirements |
|-------------|---------|--------------|
| Public Domain | 1 | None |
| CC BY | 3 | Attribution |
| CC BY-SA | 1 | Attribution + ShareAlike |
| ODC-By | 2 | Attribution + Keep open |

**You must check the `original_license` column for each example.**

## Files

- `data/train.parquet` - Training split
- `data/validation.parquet` - Validation split
- `data/test.parquet` - Test split
- `sources.csv` - Full source manifest

## Disclaimer

This documentation is provided for informational purposes only and does not constitute legal advice. Uncertain licensing cases were reviewed conservatively. For legal questions, consult the original license texts or seek professional counsel.
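
The per-row license check described above can be turned into an allow-list filter. The following is a minimal sketch, not part of the original repository: `ALLOWED_LICENSES`, `is_allowed`, `license_counts`, and the sample rows are hypothetical, and the license strings are assumed to match the License column of the Sources table (the actual values stored in `original_license` should be inspected first).

```python
from collections import Counter

# Assumption: licenses a hypothetical downstream project has decided to accept.
# The strings mirror the Sources table; real `original_license` values may differ.
ALLOWED_LICENSES = {"CC BY 4.0", "Public Domain"}


def is_allowed(example, allowed=ALLOWED_LICENSES):
    """Return True when the row's original license is in the allow-list."""
    return example["original_license"] in allowed


def license_counts(rows):
    """Tally rows per original license so unexpected values are easy to spot."""
    return Counter(row["original_license"] for row in rows)


# Hypothetical rows using the documented column names (not real dataset content).
sample_rows = [
    {"id": "a", "source": "FineWeb-2", "original_license": "ODC-By 1.0"},
    {"id": "b", "source": "Storybooks Namibia", "original_license": "CC BY 4.0"},
    {"id": "c", "source": "Herero Bible (1849)", "original_license": "Public Domain"},
]

kept = [row for row in sample_rows if is_allowed(row)]
counts = license_counts(sample_rows)
print([row["id"] for row in kept])
```

With the dataset loaded through `load_dataset`, the same predicate could likely be passed to `dataset["train"].filter(is_allowed)`, since `Dataset.filter` accepts a function that takes an example and returns a boolean.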

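The Quality Assurance steps listed above (NFC normalization, a 50-character minimum, and within-source SHA-256 deduplication) could be sketched roughly as follows. This is an illustration under assumptions, not the repository's actual pipeline: the function name `clean_documents`, the order of the steps, applying the length filter to the normalized text, and the whitespace-based word count are all hypothetical choices.

```python
import hashlib
import unicodedata

MIN_CHARS = 50  # minimum length stated in the Quality Assurance section


def clean_documents(docs):
    """Normalize to NFC, drop short texts, and deduplicate within each source.

    `docs` is an iterable of dicts with `source` and `text` keys.
    Assumption: normalization happens before length filtering and hashing.
    """
    seen = set()  # (source, SHA-256 hex digest) pairs already kept
    cleaned = []
    for doc in docs:
        text = unicodedata.normalize("NFC", doc["text"])
        if len(text) < MIN_CHARS:
            continue  # too short after normalization
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = (doc["source"], digest)
        if key in seen:
            continue  # exact duplicate within the same source
        seen.add(key)
        cleaned.append(
            {
                **doc,
                "text": text,
                "char_count": len(text),
                # Assumption: words are counted by splitting on whitespace.
                "word_count": len(text.split()),
            }
        )
    return cleaned
```

Because the deduplication key includes the source name, identical text appearing in two different sources is kept in both, which is consistent with the "within-source" wording and with source boundaries being preserved.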
Provider:

NoeFlandre
