
HiTZ/ConceptualCaptions_eu

Source: Hugging Face. Updated 2026-03-02; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/HiTZ/ConceptualCaptions_eu
Description:
# CC3M-eu (Basque Translation)

## 📚 Overview

**CC3M-eu** is a Basque version of the **Conceptual Captions 3M** dataset. It consists of approximately 3.3 million image-description pairs where the original English captions have been translated into Basque using the **mt-hitz-en/eu** specialized translation pipeline.

**Important:** This is **not the official dataset**. It is an independent community translation effort designed to facilitate the training of Vision-Language Models (VLMs) and CLIP-style encoders for the Basque language.

## ✍️ Authors & Acknowledgements

- **Original dataset:** *Conceptual Captions*, © Google LLC
- **Basque translation & curation:** Lukas Arana / HiTZ, 2025
- **Translation Engine:** Neural Machine Translation via `mt-hitz-en/eu` (HiTZ Center).

If you use this Basque split, please cite both the original Conceptual Captions paper and this translation work.

## 📁 Dataset Schema

The schema has been adapted to include both the source and the translated content:

1. **id**: Unique identifier for each sample.
2. **url**: The original source URL for the image.
3. **caption_en**: The original English description.
4. **caption_eu**: The generated Basque translation.

## 🔧 How We Built It

1. **Extraction**: English captions were pulled from the official CC3M TSV files.
2. **Translation**: Each caption was translated using the `mt-hitz-en/eu` model, which is specifically optimized for English-to-Basque scientific and general domain text.
3. **Cleanup**: Applied basic post-processing to handle HTML entities and formatting artifacts.

*Note: No images were hosted or modified; only the textual metadata is provided.*

## 🚦 Limitations & Ethical Considerations

- **Non-official**: This version has not been audited by Google; semantic drift may occur during translation.
- **Link Stability**: Like the original CC3M, many image URLs may be dead or lead to 404 errors.
- **Biases**: The dataset may inherit or amplify social biases present in the original English data or the NMT model.

## 💻 Quick Start

```python
from datasets import load_dataset

# Load the Basque CC3M dataset
ds = load_dataset("lukasArana/CC3M-eu", split="train")

# Accessing a sample
print(ds[0]["caption_eu"])
```
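The card describes the cleanup step only as "basic post-processing to handle HTML entities and formatting artifacts" and does not publish the code. The following is a minimal sketch of what such a step might look like, not the authors' actual implementation. The function names `clean_caption` and `make_record` are hypothetical, and the choice to collapse whitespace is an assumption; only the four field names come from the schema above.

```python
import html
import re

# Matches any run of whitespace, including non-breaking spaces left by "&nbsp;".
_WHITESPACE = re.compile(r"\s+")


def clean_caption(text: str) -> str:
    """Hypothetical cleanup: decode HTML entities and normalize whitespace."""
    # Turn entities such as "&amp;" or "&quot;" back into plain characters.
    decoded = html.unescape(text)
    # Collapse whitespace runs into single spaces and trim both ends.
    return _WHITESPACE.sub(" ", decoded).strip()


def make_record(sample_id: int, url: str, caption_en: str, caption_eu: str) -> dict:
    """Build one row using the four fields listed in the dataset schema."""
    return {
        "id": sample_id,
        "url": url,
        "caption_en": clean_caption(caption_en),
        "caption_eu": clean_caption(caption_eu),
    }


if __name__ == "__main__":
    # Illustrative values only; this row is not taken from the dataset.
    row = make_record(
        0,
        "https://example.com/image.jpg",
        "a dog &amp; a cat  on the beach",
        "txakur bat &amp; katu bat&nbsp;hondartzan ",
    )
    print(row["caption_eu"])
```

Applying the same cleanup to both caption columns keeps the English and Basque text comparable, which can help when spot-checking translations for semantic drift.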
Provider:
HiTZ