
soketlabs/bhasha-wiki

Source: Hugging Face. Updated 2024-04-16. Indexed 2024-04-19.
Download link:
https://hf-mirror.com/datasets/soketlabs/bhasha-wiki
Resource description:
---
language:
- bn
- en
- gu
- hi
- kn
- ta
- ur
license:
- cc-by-sa-3.0
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
configs:
- config_name: 20231101.bn
  data_files:
  - split: train
    path: ben_Beng/train-*
- config_name: 20231101.en
  data_files:
  - split: train
    path: eng_Latn/train-*
- config_name: 20231101.gu
  data_files:
  - split: train
    path: guj_Gujr/train-*
- config_name: 20231101.hi
  data_files:
  - split: train
    path: hin_Deva/train-*
- config_name: 20231101.kn
  data_files:
  - split: train
    path: kan_Knda/train-*
- config_name: 20231101.ta
  data_files:
  - split: train
    path: tam_Taml/train-*
- config_name: 20231101.ur
  data_files:
  - split: train
    path: urd_Arab/train-*
dataset_info:
- config_name: 20231101.bn
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 18741174694
    num_examples: 6345497
  download_size: 17781537439
  dataset_size: 17781537439
- config_name: 20231101.en
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 19009297439
    num_examples: 6345497
  download_size: 11344307656
  dataset_size: 11344307656
- config_name: 20231101.gu
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 18453210446
    num_examples: 6345497
  download_size: 17858529783
  dataset_size: 17858529783
- config_name: 20231101.hi
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 18622892252
    num_examples: 6345497
  download_size: 17364613184
  dataset_size: 17364613184
- config_name: 20231101.kn
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 19679016421
    num_examples: 6345497
  download_size: 18764722116
  dataset_size: 18764722116
- config_name: 20231101.ta
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 21457803696
    num_examples: 6345497
  download_size: 19416722401
  dataset_size: 19416722401
- config_name: 20231101.ur
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: sents
    dtype: int32
  - name: chars
    dtype: int32
  - name: words
    dtype: int32
  - name: tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 17921351051
    num_examples: 6345497
  download_size: 14665386082
  dataset_size: 14665386082
---

# Dataset Card for Bhasha-Wiki

<!-- Provide a quick summary of the dataset. -->

Translated Wikipedia articles

## Dataset Details

The dataset is being updated.

### Dataset Description

<!-- Provide a longer summary of what this dataset is. -->

We have translated 6.4 million English Wikipedia articles into 6 Indic languages. The translations were done using the IndicTrans2 model.

- **Curated by:** [Soket AI labs](https://soket.ai/)
- **Language(s) (NLP):** Hindi, Bengali, Gujarati, Tamil, Kannada, Urdu
- **License:** cc-by-sa-3.0

## Uses

<!-- Address questions around how the dataset is intended to be used. -->

For pretraining or fine-tuning Indic language models

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

[More Information Needed]

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

[More Information Needed]

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

Wikipedia articles

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

[More Information Needed]

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

[More Information Needed]

### Annotations [optional]

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

[More Information Needed]

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

[More Information Needed]

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users should be made aware of the risks, biases and limitations of the dataset. More information is needed for further recommendations.

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]

### Licensing Information

### Citation Information

```
@ONLINE{bhasha-wiki,
  author = "Soket Labs Technology and Research Private Limited",
  title = "Bhasha-Wiki",
  url = "https://soket.ai"
}
```
Provided by:
soketlabs
Summary of original information

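Dataset overview

Languages

  • Supported languages: Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Tamil (ta), Urdu (ur).

License

  • The dataset is released under cc-by-sa-3.0.

Size category

  • The dataset falls in the size category 1M<n<10M.

Tasks

  • Supported task categories: text-generation and fill-mask.
  • Task IDs: language-modeling and masked-language-modeling.

Configurations

  • The dataset has seven configurations, one per language, named 20231101.<language code> (for example, 20231101.hi).
  • Each configuration has a single train split whose data files live under a per-language directory, for example hin_Deva/train-* for Hindi.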
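The configuration naming described above can be sketched in Python. The mapping below is copied from the `configs` entries in the card metadata; the helper names (`config_name`, `data_glob`, `stream_config`) are hypothetical and introduced only for illustration. The last helper assumes the Hugging Face `datasets` package is installed and that the repository is reachable; it streams rather than downloads, since each configuration is many gigabytes.

```python
# Language code -> data directory, as listed under `configs` in the card metadata.
LANG_DIRS = {
    "bn": "ben_Beng",
    "en": "eng_Latn",
    "gu": "guj_Gujr",
    "hi": "hin_Deva",
    "kn": "kan_Knda",
    "ta": "tam_Taml",
    "ur": "urd_Arab",
}
CONFIG_PREFIX = "20231101"  # prefix shared by all config names in the metadata


def config_name(lang: str) -> str:
    """Return the config name for a language code, e.g. 'hi' -> '20231101.hi'."""
    if lang not in LANG_DIRS:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"{CONFIG_PREFIX}.{lang}"


def data_glob(lang: str) -> str:
    """Return the train-split file pattern for a language code."""
    if lang not in LANG_DIRS:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"{LANG_DIRS[lang]}/train-*"


def stream_config(lang: str):
    """Stream one configuration without downloading it in full.

    Assumption: `pip install datasets` has been run and the Hub is reachable.
    """
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    return load_dataset(
        "soketlabs/bhasha-wiki",
        config_name(lang),
        split="train",
        streaming=True,
    )
```

A typical use would be `next(iter(stream_config("hi")))` to inspect a single record before committing to a full download.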

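Dataset information

  • Every language configuration has the same features: id (string), url (string), title (string), text (string), sents (int32), chars (int32), words (int32), tokens (int32).
  • Every configuration has the same number of examples, but the byte sizes differ. In the metadata, the download size and the dataset size are reported as the same value for each configuration:
    • bn: train split 18741174694 bytes, 6345497 examples, download size and dataset size both 17781537439 bytes.
    • en: train split 19009297439 bytes, 6345497 examples, download size and dataset size both 11344307656 bytes.
    • gu: train split 18453210446 bytes, 6345497 examples, download size and dataset size both 17858529783 bytes.
    • hi: train split 18622892252 bytes, 6345497 examples, download size and dataset size both 17364613184 bytes.
    • kn: train split 19679016421 bytes, 6345497 examples, download size and dataset size both 18764722116 bytes.
    • ta: train split 21457803696 bytes, 6345497 examples, download size and dataset size both 19416722401 bytes.
    • ur: train split 17921351051 bytes, 6345497 examples, download size and dataset size both 14665386082 bytes.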
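To get a feel for the storage these figures imply, the per-configuration numbers listed above can be totalled with a few lines of Python. This is only a sketch: the values are copied from the card metadata, while the names `SPLIT_STATS`, `to_gib`, and `summarize` are hypothetical.

```python
# (num_bytes, download_size) per language code, copied from the card metadata.
SPLIT_STATS = {
    "bn": (18741174694, 17781537439),
    "en": (19009297439, 11344307656),
    "gu": (18453210446, 17858529783),
    "hi": (18622892252, 17364613184),
    "kn": (19679016421, 18764722116),
    "ta": (21457803696, 19416722401),
    "ur": (17921351051, 14665386082),
}
NUM_EXAMPLES = 6345497  # identical for every configuration


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


def summarize() -> dict:
    """Aggregate totals across all seven configurations."""
    return {
        "total_num_bytes": sum(nb for nb, _ in SPLIT_STATS.values()),
        "total_download_size": sum(dl for _, dl in SPLIT_STATS.values()),
        "total_examples": NUM_EXAMPLES * len(SPLIT_STATS),
        "largest_config": max(SPLIT_STATS, key=lambda k: SPLIT_STATS[k][0]),
    }


if __name__ == "__main__":
    summary = summarize()
    print(f"train bytes:   {to_gib(summary['total_num_bytes']):.1f} GiB")
    print(f"download size: {to_gib(summary['total_download_size']):.1f} GiB")
    print(f"examples:      {summary['total_examples']}")
    print(f"largest:       {summary['largest_config']}")
```

Totals like these are mainly useful for deciding whether to download a configuration in full or to stream it.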