acwater/NLP
Source: Hugging Face | Updated: 2026-04-16 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/acwater/NLP
Resource description:
# bilingual_belle_sft
## Dataset Summary
This dataset contains bilingual Chinese-English supervised fine-tuning data converted into a Belle-style schema:
```json
{
"id": "unique_string_id",
"conversations": [
{"from": "human", "value": "..."},
{"from": "assistant", "value": "..."}
]
}
```
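As an illustration of the schema above, the following is a minimal sketch of a structural check for one record. The function name `is_belle_record` and the constant `ALLOWED_ROLES` are hypothetical and are not part of the dataset or its pipeline.

```python
from typing import Any

# Roles that appear in the schema shown above.
ALLOWED_ROLES = ("human", "assistant")


def is_belle_record(record: Any) -> bool:
    """Return True if `record` has a string id and a non-empty list of turns."""
    if not isinstance(record, dict):
        return False
    if not isinstance(record.get("id"), str):
        return False
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return False
    for turn in turns:
        if not isinstance(turn, dict):
            return False
        if turn.get("from") not in ALLOWED_ROLES:
            return False
        if not isinstance(turn.get("value"), str):
            return False
    return True
```

This only checks the shape of a record. It does not check turn order or content quality.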
The corpus was built by downloading, cleaning, converting, and merging the following public datasets:
- `shareAI/ShareGPT-Chinese-English-90k`
- `tatsu-lab/alpaca`
- `shibing624/alpaca-zh`
- `BAAI/COIG`
## Files
- `train.jsonl`: final merged Belle-format JSONL
- `stats.json`: merged corpus statistics
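Since `train.jsonl` stores one JSON object per line, it can be streamed without loading the whole file into memory. The helper below is a sketch using only the standard library; the name `iter_jsonl` is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `next(iter_jsonl("train.jsonl"))` would return the first record, assuming the file has been downloaded to the working directory.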
## Filtering And Cleaning
The pipeline applies the following rules after conversion into Belle-style format:
- keep only samples with `20 < token_count < 1024`
- token count is computed with the `cl100k_base` tokenizer
- remove empty or malformed samples
- remove samples with empty assistant replies
- remove obviously garbled or meaningless content
- validate alternating `human` and `assistant` turns
- deduplicate within each dataset
- merge all cleaned datasets
- deduplicate globally after merge
- reindex final ids as `bilingual_belle_00000001`, `bilingual_belle_00000002`, ...
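The rules above can be sketched roughly as follows. This is not the pipeline's actual code, and it rests on several assumptions: token count is taken over all turn values joined by newlines, a conversation starts with `human` and ends with `assistant`, and duplicates are exact matches on the `conversations` list. The garbled-content rule is omitted because the card does not define it. All function names are hypothetical, and the tokenizer is passed in as `count_tokens` so the sketch runs without extra packages.

```python
import hashlib
import json
from typing import Callable, Iterable, List

# With the tiktoken package installed, a cl100k_base counter could be built as:
#   enc = tiktoken.get_encoding("cl100k_base")
#   count_tokens = lambda text: len(enc.encode(text))


def has_alternating_turns(turns: list) -> bool:
    """True if turns go human, assistant, ... and no assistant reply is empty."""
    if not turns or len(turns) % 2 != 0:
        return False
    for index, turn in enumerate(turns):
        expected = "human" if index % 2 == 0 else "assistant"
        if not isinstance(turn, dict) or turn.get("from") != expected:
            return False
        value = turn.get("value")
        if not isinstance(value, str):
            return False
        if expected == "assistant" and not value.strip():
            return False
    return True


def keep(record: dict, count_tokens: Callable[[str], int]) -> bool:
    """Apply the structural checks and the 20 < token_count < 1024 rule."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or not has_alternating_turns(turns):
        return False
    text = "\n".join(turn["value"] for turn in turns)  # assumption: joined text
    return 20 < count_tokens(text) < 1024


def dedupe(records: Iterable[dict]) -> List[dict]:
    """Drop records whose conversations exactly match an earlier record."""
    seen = set()
    unique = []
    for record in records:
        payload = json.dumps(record["conversations"], ensure_ascii=False, sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def clean_and_merge(datasets: Iterable[Iterable[dict]],
                    count_tokens: Callable[[str], int]) -> List[dict]:
    """Filter and dedupe each dataset, merge, dedupe globally, then reindex."""
    merged = []
    for dataset in datasets:
        merged.extend(dedupe(r for r in dataset if keep(r, count_tokens)))
    final = dedupe(merged)
    return [
        {"id": f"bilingual_belle_{number:08d}", "conversations": r["conversations"]}
        for number, r in enumerate(final, start=1)
    ]
```

Deduplicating once per dataset and again after the merge mirrors the order of the rules listed above; the global pass catches samples that appear in more than one source.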
## Language Mix
Merged counts:
- `output_count`: 224691
- `zh_count`: 60203
- `en_count`: 118974
- `mixed_count`: 45514
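The three language counts add up to `output_count`, so each sample appears to fall into exactly one language bucket. A small sanity check, assuming `stats.json` exposes these four keys at the top level (the function name is hypothetical):

```python
def language_counts_are_consistent(stats: dict) -> bool:
    """Check that the zh, en, and mixed counts sum to the total output count."""
    parts = stats["zh_count"] + stats["en_count"] + stats["mixed_count"]
    return parts == stats["output_count"]


# Values copied from the merged counts listed above.
stats = {
    "output_count": 224691,
    "zh_count": 60203,
    "en_count": 118974,
    "mixed_count": 45514,
}
print(language_counts_are_consistent(stats))  # True
```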
## Source Schema
Instruction-style records are normalized into a single-turn conversation. Chat-style records are normalized into alternating `human` and `assistant` turns. Unsupported roles such as `system`, `tool`, and `function` are removed during cleaning.
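The two normalization paths might look like the sketch below. The `instruction`, `input`, and `output` field names follow the Alpaca format; joining instruction and input with a newline is an assumption, as are the chat message layout (`role` and `content` keys) and the `ROLE_MAP` aliases. The function names are hypothetical, and ids are left out because the pipeline assigns them at the final reindexing step.

```python
# Assumed aliases for the two supported roles; anything else is dropped.
ROLE_MAP = {
    "human": "human",
    "user": "human",
    "assistant": "assistant",
    "gpt": "assistant",
}


def instruction_to_belle(record: dict) -> dict:
    """Turn one instruction-style record into a single-turn conversation."""
    prompt = record["instruction"].strip()
    extra = (record.get("input") or "").strip()
    if extra:
        prompt = f"{prompt}\n{extra}"  # assumption: newline-joined prompt
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "assistant", "value": record["output"].strip()},
        ]
    }


def chat_to_belle(messages: list) -> dict:
    """Keep only human/assistant messages, renaming roles to Belle names."""
    turns = [
        {"from": ROLE_MAP[message["role"]], "value": message["content"]}
        for message in messages
        if message.get("role") in ROLE_MAP
    ]
    return {"conversations": turns}
```

Dropping a `system`, `tool`, or `function` message can leave two turns from the same speaker next to each other, which is why turn alternation still needs to be validated after this step.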
## Intended Use
This dataset is intended for bilingual Chinese-English (zh-en) supervised fine-tuning (SFT) in the Belle-style conversational format.
## License
Please review the licenses and terms of the original source datasets before redistribution or commercial use.
Provider:
acwater



