BhabhaAI/indic-instruct-data-v0.1-filtered
Source: Hugging Face. Updated: 2024-03-04. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/BhabhaAI/indic-instruct-data-v0.1-filtered
Resource description:
---
language:
- en
- hi
multilinguality:
- multilingual
size_categories:
- 5K<n<400K
language_bcp47:
- en-US
- hi-IN
configs:
- config_name: anudesh
  data_files:
  - split: en
    path: anudesh/en*
  - split: hi
    path: anudesh/hi*
- config_name: dolly
  data_files:
  - split: en
    path: dolly/en*
  - split: hi
    path: dolly/hi*
- config_name: flan_v2
  data_files:
  - split: en
    path: flan_v2/en*
  - split: hi
    path: flan_v2/hi*
- config_name: hh-rlhf
  data_files:
  - split: en
    path: hh-rlhf/en*
  - split: hi
    path: hh-rlhf/hi*
- config_name: nmt-seed
  data_files:
  - split: hi
    path: nmt-seed/hi*
- config_name: wikihow
  data_files:
  - split: en
    path: wikihow/en*
  - split: hi
    path: wikihow/hi*
- config_name: oasst1
  data_files:
  - split: en
    path: oasst1/en*
  - split: hi
    path: oasst1/hi*
- config_name: lm_sys
  data_files:
  - split: en
    path: lm_sys/en*
  - split: hi
    path: lm_sys/hi*
---
This is a filtered version of [indic-instruct-data-v0.1](https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1).
**UPDATE: 4 March 2024** - This dataset has been further filtered to create [indic-instruct-data-v0.2-filtered](https://huggingface.co/datasets/BhabhaAI/indic-instruct-data-v0.2-filtered).
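The configs and splits declared in the metadata can be loaded one at a time. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable; the helper name `load_config` and the `CONFIG_SPLITS` table are illustrative and not part of the dataset itself.

```python
REPO_ID = "BhabhaAI/indic-instruct-data-v0.1-filtered"

# Config name -> available splits, copied from the card's metadata.
# Note that nmt-seed only declares a Hindi split.
CONFIG_SPLITS = {
    "anudesh": ("en", "hi"),
    "dolly": ("en", "hi"),
    "flan_v2": ("en", "hi"),
    "hh-rlhf": ("en", "hi"),
    "nmt-seed": ("hi",),
    "wikihow": ("en", "hi"),
    "oasst1": ("en", "hi"),
    "lm_sys": ("en", "hi"),
}


def load_config(config: str, split: str):
    """Hypothetical helper: validate the config/split pair, then load it."""
    if config not in CONFIG_SPLITS:
        raise ValueError(f"Unknown config {config!r}; expected one of {sorted(CONFIG_SPLITS)}")
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(f"Config {config!r} has no split {split!r}; available: {CONFIG_SPLITS[config]}")
    # Imported lazily so the validation above works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

For example, `load_config("dolly", "hi")` would request the Hindi split of the `dolly` config, while `load_config("nmt-seed", "en")` raises a `ValueError` before any download is attempted.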
## Filtering Approach
1. Drop examples containing any of ["search the web", "www.", ".py", ".com", "spanish", "french", "japanese", "given two strings, check whether one string is a rotation of another", "openai", "xml", "arrange the words", "__", "noinput", "idiom", "alphabetic", "alliteration", "translat", "paraphrase", "code", "def ", "http", "https", "index.html", "html", "python", "```", "identify the language", "word count", "number of words", "count the number", "identify the language", "spelling", "word count", " x ", " y ", "'x'", "'y'", "language"]
2. Compare the word and character ratios between the English text and its Hindi translation to catch duplicated words in the translation. This drops rows containing repetition of characters or words. Example: ल्लोलोलोलोलोलोलोल or न्याय की दृष्टि से न्याय की दृष्टि से न्याय की दृष्टि से न्याय की दृष्टि से
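The keyword filter in step 1 could be sketched roughly as follows. This is only an illustration: the card does not say whether matching is case-insensitive or which fields are checked, so the lowercasing, the field names `instruction` and `output`, and the shortened `BLOCKED_SUBSTRINGS` list are all assumptions.

```python
# Assumed subset of the substring list from step 1, shortened for readability.
BLOCKED_SUBSTRINGS = ["search the web", "www.", ".com", "translat", "python", "http", " x "]


def contains_blocked(text: str, blocked=BLOCKED_SUBSTRINGS) -> bool:
    """Return True if any blocked substring occurs in the lowercased text."""
    lowered = text.lower()
    return any(term in lowered for term in blocked)


def keep_example(example: dict) -> bool:
    """Keep an example only if none of its string fields contains a blocked substring."""
    return not any(contains_blocked(v) for v in example.values() if isinstance(v, str))


# Hypothetical rows; the real column names may differ per config.
rows = [
    {"instruction": "Translate this sentence into French.", "output": "..."},
    {"instruction": "Name three primary colors.", "output": "Red, yellow, blue."},
]
kept = [row for row in rows if keep_example(row)]
```

Because the match is on substrings, a stem such as "translat" removes "translate", "translation", and "translating" in one rule, at the cost of occasionally dropping unrelated text that happens to contain the stem.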
The Anudesh and oasst1 datasets have been kept as is because they have no English counterparts to filter against.
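The ratio and repetition check in step 2 might look something like the sketch below. The card gives no thresholds or implementation details, so `max_ratio`, `min_repeats`, the 2 to 30 character chunk size, and the regex-based detector are all assumed values chosen so that the two examples above are caught.

```python
import re
from typing import Tuple


def length_ratios(english: str, hindi: str) -> Tuple[float, float]:
    """Return (word ratio, character ratio) of the Hindi text relative to the English text."""
    en_words = max(len(english.split()), 1)
    en_chars = max(len(english), 1)
    return len(hindi.split()) / en_words, len(hindi) / en_chars


def has_repetition(text: str, min_repeats: int = 3) -> bool:
    """Detect a 2-30 character chunk repeated back to back at least min_repeats times."""
    pattern = r"(.{2,30}?)\1{%d,}" % (min_repeats - 1)
    return re.search(pattern, text, flags=re.DOTALL) is not None


def keep_pair(english: str, hindi: str, max_ratio: float = 3.0) -> bool:
    """Keep a pair unless the translation is disproportionately long or visibly repetitive."""
    word_ratio, char_ratio = length_ratios(english, hindi)
    if word_ratio > max_ratio or char_ratio > max_ratio:
        return False
    return not has_repetition(hindi)
```

A translation that loops on one phrase tends to fail both checks at once: its word count balloons relative to the English source, and the same chunk appears several times in a row.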
Provider:
BhabhaAI
Summary of original information
Dataset overview
Language support
- Supported languages: English (en), Hindi (hi)
- Multilinguality: multilingual
Data size
- Size: 5K<n<400K
Language tags
- BCP 47 tags: en-US, hi-IN
Configuration details
- config_name: anudesh
  - split: en, path: anudesh/en*
  - split: hi, path: anudesh/hi*
- config_name: dolly
  - split: en, path: dolly/en*
  - split: hi, path: dolly/hi*
- config_name: flan_v2
  - split: en, path: flan_v2/en*
  - split: hi, path: flan_v2/hi*
- config_name: hh-rlhf
  - split: en, path: hh-rlhf/en*
  - split: hi, path: hh-rlhf/hi*
- config_name: nmt-seed
  - split: hi, path: nmt-seed/hi*
- config_name: wikihow
  - split: en, path: wikihow/en*
  - split: hi, path: wikihow/hi*
- config_name: oasst1
  - split: en, path: oasst1/en*
  - split: hi, path: oasst1/hi*
- config_name: lm_sys
  - split: en, path: lm_sys/en*
  - split: hi, path: lm_sys/hi*



