pulipakav-1/translated-babylm-telugu

Name: pulipakav-1/translated-babylm-telugu
Creator: pulipakav-1
Published: 2026-04-29 07:07:59
License: No description available

Hugging Face · Updated 2026-04-29 · Indexed 2026-05-03

Download link:

https://hf-mirror.com/datasets/pulipakav-1/translated-babylm-telugu

Description:

This dataset is a Telugu translation of the English BabyLM 2026 corpus, produced using IndicTrans2, a state-of-the-art neural machine translation model developed by AI4Bharat for Indic languages. The dataset is intended for training and evaluating language models on Telugu, following the BabyLM challenge setup. It includes three splits: train (from BabyLM-2026-Strict), val (from BabyLM-dev), and test (from BabyLM-Test), covering various sources such as British National Corpus spoken language, child-directed speech (CHILDES database), Project Gutenberg literary texts, movie and TV subtitles, Simple English Wikipedia, and telephone conversation transcripts.
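As a minimal sketch of how the three splits might be loaded and checked, the helper below uses the Hugging Face `datasets` library. The split names `train`, `val`, and `test` are taken from the description above and are an assumption about how the repository is actually configured. The helper name `split_sizes` and its `loader` parameter are hypothetical, introduced only so the logic can be exercised without network access.

```python
from typing import Callable, Dict, Mapping, Optional, Sequence

REPO_ID = "pulipakav-1/translated-babylm-telugu"
# Split names as listed in the description; the names actually used on
# the Hub are an assumption and may differ.
EXPECTED_SPLITS = ("train", "val", "test")


def split_sizes(
    loader: Optional[Callable[[str], Mapping[str, Sequence]]] = None,
) -> Dict[str, int]:
    """Load the dataset and return the number of rows in each expected split."""
    if loader is None:
        # Imported lazily so the helper can be tested with a stand-in loader.
        from datasets import load_dataset
        loader = load_dataset

    dataset = loader(REPO_ID)

    # Fail early and visibly if the assumed split names are not present.
    missing = [name for name in EXPECTED_SPLITS if name not in dataset]
    if missing:
        raise KeyError(f"missing splits: {missing}; found: {list(dataset)}")

    return {name: len(dataset[name]) for name in EXPECTED_SPLITS}
```

Calling `split_sizes()` with no arguments downloads the dataset from the Hub. If the repository uses different split names, the `KeyError` message lists the names that were found.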

Provider:

pulipakav-1
