soketlabs/bhasha-wiki
Source: Hugging Face. Updated 2024-04-16; indexed 2024-04-19.
Download link:
https://hf-mirror.com/datasets/soketlabs/bhasha-wiki
Resource description:
---
language:
- bn
- en
- gu
- hi
- kn
- ta
- ur
license:
- cc-by-sa-3.0
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
configs:
- config_name: 20231101.bn
  data_files:
  - split: train
    path: ben_Beng/train-*
- config_name: 20231101.en
  data_files:
  - split: train
    path: eng_Latn/train-*
- config_name: 20231101.gu
  data_files:
  - split: train
    path: guj_Gujr/train-*
- config_name: 20231101.hi
  data_files:
  - split: train
    path: hin_Deva/train-*
- config_name: 20231101.kn
  data_files:
  - split: train
    path: kan_Knda/train-*
- config_name: 20231101.ta
  data_files:
  - split: train
    path: tam_Taml/train-*
- config_name: 20231101.ur
  data_files:
  - split: train
    path: urd_Arab/train-*
dataset_info:
- config_name: 20231101.bn
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 18741174694
    num_examples: 6345497
  download_size: 17781537439
  dataset_size: 17781537439
- config_name: 20231101.en
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 19009297439
    num_examples: 6345497
  download_size: 11344307656
  dataset_size: 11344307656
- config_name: 20231101.gu
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 18453210446
    num_examples: 6345497
  download_size: 17858529783
  dataset_size: 17858529783
- config_name: 20231101.hi
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 18622892252
    num_examples: 6345497
  download_size: 17364613184
  dataset_size: 17364613184
- config_name: 20231101.kn
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 19679016421
    num_examples: 6345497
  download_size: 18764722116
  dataset_size: 18764722116
- config_name: 20231101.ta
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 21457803696
    num_examples: 6345497
  download_size: 19416722401
  dataset_size: 19416722401
- config_name: 20231101.ur
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 17921351051
    num_examples: 6345497
  download_size: 14665386082
  dataset_size: 14665386082
---
# Dataset Card for Bhasha-Wiki
<!-- Provide a quick summary of the dataset. -->
Translated Wikipedia articles.
## Dataset Details
This dataset is still being updated.
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
We translated 6.4 million English Wikipedia articles into six Indic languages. The translations were produced with the IndicTrans2 model.
- **Curated by:** [Soket AI labs](https://soket.ai/)
- **Language(s) (NLP):** Hindi, Bengali, Gujarati, Tamil, Kannada, Urdu
- **License:** cc-by-sa-3.0
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
Pretraining or fine-tuning Indic language models.
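As a starting point for either use, the sketch below shows one way to stream a single language configuration with the Hugging Face `datasets` library. The repository id and the `20231101.<lang>` config names come from the card metadata; the helper names `config_name` and `stream_articles` are hypothetical, and streaming is an assumption chosen here to avoid downloading a full multi-gigabyte split.

```python
from typing import Iterator

# Language codes taken from the `configs` list in the card metadata.
LANGUAGES = ("bn", "en", "gu", "hi", "kn", "ta", "ur")
SNAPSHOT = "20231101"  # prefix shared by every config name in the card


def config_name(lang: str) -> str:
    """Build a config name such as '20231101.hi' from a language code."""
    if lang not in LANGUAGES:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"{SNAPSHOT}.{lang}"


def stream_articles(lang: str, limit: int = 3) -> Iterator[dict]:
    """Yield the first `limit` records of one language in streaming mode."""
    # Imported lazily so config_name() works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(
        "soketlabs/bhasha-wiki",
        config_name(lang),
        split="train",
        streaming=True,  # iterate over shards instead of downloading them all
    )
    yield from ds.take(limit)
```

Iterating over `stream_articles("hi")` should yield dictionaries with the fields listed in the metadata (`id`, `url`, `title`, `text`, `sents`, `chars`, `words`, `tokens`), assuming the hosted files match the card.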
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
[More Information Needed]
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
[More Information Needed]
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
Wikipedia articles
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
[More Information Needed]
#### Who are the source data producers?
<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
[More Information Needed]
### Annotations [optional]
<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->
#### Annotation process
<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->
[More Information Needed]
#### Who are the annotators?
<!-- This section describes the people or systems who created the annotations. -->
[More Information Needed]
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
[More Information Needed]
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Dataset Card Authors [optional]
[More Information Needed]
## Dataset Card Contact
[More Information Needed]
### Licensing Information
### Citation Information
```
@ONLINE{bhasha-wiki,
author = "Soket Labs Technology and Research Private Limited",
title = "Bhasha-Wiki",
url = "https://soket.ai"
}
```
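Provider:
soketlabs
Summary of original information
Dataset overview
Supported languages
- Supported languages: Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Tamil (ta), Urdu (ur).
License information
- The dataset is released under cc-by-sa-3.0.
Size category
- The dataset size category is 1M<n<10M.
Task categories
- Supported tasks: text generation (text-generation) and fill-mask (fill-mask).
- Task IDs: language modeling (language-modeling) and masked language modeling (masked-language-modeling).
Configuration information
- The dataset contains several configurations, one per language, each with its own data file path.
- Each configuration specifies a config name and its data files (a single train split whose path follows the pattern `<language>_<script>/train-*`, for example `ben_Beng/train-*`).
Dataset information
- In every language configuration the features are: id (string), url (string), title (string), text (string), sents (int32), chars (int32), words (int32), tokens (int32).
- Every configuration has 6345497 training examples; the byte sizes differ per configuration, as follows:
- bn: train num_bytes 18741174694, 6345497 examples, download size and dataset size both 17781537439.
- en: train num_bytes 19009297439, 6345497 examples, download size and dataset size both 11344307656.
- gu: train num_bytes 18453210446, 6345497 examples, download size and dataset size both 17858529783.
- hi: train num_bytes 18622892252, 6345497 examples, download size and dataset size both 17364613184.
- kn: train num_bytes 19679016421, 6345497 examples, download size and dataset size both 18764722116.
- ta: train num_bytes 21457803696, 6345497 examples, download size and dataset size both 19416722401.
- ur: train num_bytes 17921351051, 6345497 examples, download size and dataset size both 14665386082.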
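To put the per-configuration byte counts in perspective, the following minimal sketch converts the `num_bytes` values listed above to GiB and totals them. The figures are copied from the card metadata; the names `NUM_BYTES` and `to_gib` are hypothetical, and the script only does arithmetic on the reported numbers rather than measuring the hosted files.

```python
# Train-split num_bytes per configuration, copied from the card metadata.
NUM_BYTES = {
    "bn": 18741174694,
    "en": 19009297439,
    "gu": 18453210446,
    "hi": 18622892252,
    "kn": 19679016421,
    "ta": 21457803696,
    "ur": 17921351051,
}


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


total_bytes = sum(NUM_BYTES.values())
largest = max(NUM_BYTES, key=NUM_BYTES.get)  # language with the most bytes

# Print configurations from largest to smallest, then the overall total.
for lang, n in sorted(NUM_BYTES.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {to_gib(n):.2f} GiB")
print(f"total: {to_gib(total_bytes):.2f} GiB across {len(NUM_BYTES)} configs")
```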



