soketlabs/bhasha-wiki

Name: soketlabs/bhasha-wiki
Creator: soketlabs
Published: 2024-04-16 14:55:39
License: cc-by-sa-3.0

Source: Hugging Face. Updated 2024-04-16; indexed 2024-04-19.

Download link:

https://hf-mirror.com/datasets/soketlabs/bhasha-wiki

Description:

---
language:
- bn
- en
- gu
- hi
- kn
- ta
- ur
license:
- cc-by-sa-3.0
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
configs:
- config_name: 20231101.bn
  data_files:
  - split: train
    path: ben_Beng/train-*
- config_name: 20231101.en
  data_files:
  - split: train
    path: eng_Latn/train-*
- config_name: 20231101.gu
  data_files:
  - split: train
    path: guj_Gujr/train-*
- config_name: 20231101.hi
  data_files:
  - split: train
    path: hin_Deva/train-*
- config_name: 20231101.kn
  data_files:
  - split: train
    path: kan_Knda/train-*
- config_name: 20231101.ta
  data_files:
  - split: train
    path: tam_Taml/train-*
- config_name: 20231101.ur
  data_files:
  - split: train
    path: urd_Arab/train-*
dataset_info:
- config_name: 20231101.bn
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 18741174694
    num_examples: 6345497
  download_size: 17781537439
  dataset_size: 17781537439
- config_name: 20231101.en
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 19009297439
    num_examples: 6345497
  download_size: 11344307656
  dataset_size: 11344307656
- config_name: 20231101.gu
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 18453210446
    num_examples: 6345497
  download_size: 17858529783
  dataset_size: 17858529783
- config_name: 20231101.hi
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 18622892252
    num_examples: 6345497
  download_size: 17364613184
  dataset_size: 17364613184
- config_name: 20231101.kn
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 19679016421
    num_examples: 6345497
  download_size: 18764722116
  dataset_size: 18764722116
- config_name: 20231101.ta
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 21457803696
    num_examples: 6345497
  download_size: 19416722401
  dataset_size: 19416722401
- config_name: 20231101.ur
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 17921351051
    num_examples: 6345497
  download_size: 14665386082
  dataset_size: 14665386082
---

# Dataset Card for Bhasha-Wiki

Translated Wikipedia articles

## Dataset Details

Dataset is being updated

### Dataset Description

We have translated 6.4 million English Wikipedia articles into 6 Indic languages. The translations were done using the IndicTrans2 model.

- **Curated by:** [Soket AI labs](https://soket.ai/)
- **Language(s) (NLP):** Hindi, Bengali, Gujarati, Tamil, Kannada, Urdu
- **License:** cc-by-sa-3.0

## Uses

For pretraining or fine-tuning Indic language models

## Dataset Structure

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

Wikipedia articles

#### Data Collection and Processing

[More Information Needed]

#### Who are the source data producers?

[More Information Needed]

### Annotations [optional]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

#### Personal and Sensitive Information

[More Information Needed]

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]

### Licensing Information

### Citation Information

```
@ONLINE{bhasha-wiki,
    author = "Soket Labs Technology and Research Private Limited",
    title = "Bhasha-Wiki",
    url = "https://soket.ai"
}
```

Provider:

soketlabs

Summary of the original information

Dataset overview

Language support

Supported languages: Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Tamil (ta), and Urdu (ur).

License information

The dataset is released under the cc-by-sa-3.0 license.

Size category

The dataset's size category is 1M<n<10M.

Task categories

Supported task categories: text generation (text-generation) and fill-mask (fill-mask).
Task IDs: language modeling (language-modeling) and masked language modeling (masked-language-modeling).

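Configuration information

The dataset contains seven configurations, one per language, each pointing to its own data files.
Each configuration has a config name of the form 20231101.<language code> (for example 20231101.hi) and a single train split whose path has the form <language>_<script>/train-* (for example hin_Deva/train-*).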
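The mapping from config names to data paths can be written out directly from the card's YAML header. The sketch below is illustrative only: `CONFIG_PATHS`, `config_for`, and `stream_train_split` are hypothetical helper names, and the loader assumes the Hugging Face `datasets` package is installed and that the repository is reachable over the network.

```python
# Config names and data-file patterns, copied from the dataset card's YAML header.
CONFIG_PATHS = {
    "20231101.bn": "ben_Beng/train-*",
    "20231101.en": "eng_Latn/train-*",
    "20231101.gu": "guj_Gujr/train-*",
    "20231101.hi": "hin_Deva/train-*",
    "20231101.kn": "kan_Knda/train-*",
    "20231101.ta": "tam_Taml/train-*",
    "20231101.ur": "urd_Arab/train-*",
}


def config_for(lang: str, snapshot: str = "20231101") -> str:
    """Return the config name for a two-letter language code such as 'hi'."""
    name = f"{snapshot}.{lang}"
    if name not in CONFIG_PATHS:
        raise ValueError(f"unknown config: {name}")
    return name


def stream_train_split(lang: str):
    """Stream the train split of one language config (hypothetical helper).

    Assumes the `datasets` package is available; it is imported lazily so the
    helpers above can be used without it. Streaming avoids downloading the
    full multi-gigabyte config before reading the first record.
    """
    from datasets import load_dataset

    return load_dataset(
        "soketlabs/bhasha-wiki",
        config_for(lang),
        split="train",
        streaming=True,
    )
```

For example, `config_for("ta")` returns `"20231101.ta"`, and `next(iter(stream_train_split("ta")))` would fetch the first Tamil record when network access is available.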

Dataset information

Every language configuration has the same features: id (string), url (string), title (string), text (string), sents (int32), chars (int32), words (int32), and tokens (int32).
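The feature list can be expressed as a small schema check. This is a minimal sketch, not part of the dataset's tooling: `SCHEMA` and `validate_record` are hypothetical names, and the sample record used in the usage note is made up for illustration.

```python
# Feature names and types as listed in the card; int32 fields map to Python int.
SCHEMA = {
    "id": str,
    "url": str,
    "title": str,
    "text": str,
    "sents": int,
    "chars": int,
    "words": int,
    "tokens": int,
}

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if it conforms)."""
    problems = []
    for name, expected in SCHEMA.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
        elif expected is int and not INT32_MIN <= value <= INT32_MAX:
            problems.append(f"{name} out of int32 range: {value}")
    return problems
```

A conforming record such as `{"id": "1", "url": "https://example.org", "title": "t", "text": "x", "sents": 1, "chars": 1, "words": 1, "tokens": 1}` yields an empty list; dropping `tokens` yields `["missing field: tokens"]`.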
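All configurations have the same number of examples, but their sizes in bytes differ:
- bn: train split num_bytes 18741174694, 6345497 examples; download size and dataset size are both 17781537439.
- en: train split num_bytes 19009297439, 6345497 examples; download size and dataset size are both 11344307656.
- gu: train split num_bytes 18453210446, 6345497 examples; download size and dataset size are both 17858529783.
- hi: train split num_bytes 18622892252, 6345497 examples; download size and dataset size are both 17364613184.
- kn: train split num_bytes 19679016421, 6345497 examples; download size and dataset size are both 18764722116.
- ta: train split num_bytes 21457803696, 6345497 examples; download size and dataset size are both 19416722401.
- ur: train split num_bytes 17921351051, 6345497 examples; download size and dataset size are both 14665386082.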
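The raw byte counts are easier to compare once converted to GiB and summed. The sketch below only does arithmetic on the figures listed above; `SPLITS`, `gib`, and the totals are illustrative names, and the card does not explain why num_bytes differs from dataset_size, so no interpretation is assumed here.

```python
# (num_bytes, download_size) per language, copied from the figures above.
SPLITS = {
    "bn": (18741174694, 17781537439),
    "en": (19009297439, 11344307656),
    "gu": (18453210446, 17858529783),
    "hi": (18622892252, 17364613184),
    "kn": (19679016421, 18764722116),
    "ta": (21457803696, 19416722401),
    "ur": (17921351051, 14665386082),
}
NUM_EXAMPLES = 6345497  # identical for every config


def gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


total_download = sum(download for _, download in SPLITS.values())
total_examples = NUM_EXAMPLES * len(SPLITS)

for lang, (num_bytes, download) in SPLITS.items():
    print(f"{lang}: num_bytes={gib(num_bytes):.2f} GiB, download={gib(download):.2f} GiB")
print(f"total download: {total_download} bytes ({gib(total_download):.2f} GiB)")
print(f"total examples: {total_examples}")
```

Summed over the seven configs, the download size is 117195818661 bytes (about 109.15 GiB) and the example count is 44418479, so fetching every language at once needs well over 100 GiB of disk space.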
