
acwater/NLP

Source: Hugging Face. Updated 2026-04-16. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/acwater/NLP
Resource description:
# bilingual_belle_sft

## Dataset Summary

This dataset contains bilingual Chinese-English supervised fine-tuning data converted into a Belle-style schema:

```json
{
  "id": "unique_string_id",
  "conversations": [
    {"from": "human", "value": "..."},
    {"from": "assistant", "value": "..."}
  ]
}
```

The corpus was built by downloading, cleaning, converting, and merging the following public datasets:

- `shareAI/ShareGPT-Chinese-English-90k`
- `tatsu-lab/alpaca`
- `shibing624/alpaca-zh`
- `BAAI/COIG`

## Files

- `train.jsonl`: final merged Belle-format JSONL
- `stats.json`: merged corpus statistics

## Filtering And Cleaning

The pipeline applies the following rules after conversion into Belle-style format:

- keep only samples with `20 < token_count < 1024`
- token count is computed with the `cl100k_base` tokenizer
- remove empty or malformed samples
- remove samples with empty assistant replies
- remove obvious garbled or meaningless content
- validate alternating `human` and `assistant` turns
- deduplicate within each dataset
- merge all cleaned datasets
- deduplicate globally after merge
- reindex final ids as `bilingual_belle_00000001`, `bilingual_belle_00000002`, ...

## Language Mix

Merged counts:

- `output_count`: 224691
- `zh_count`: 60203
- `en_count`: 118974
- `mixed_count`: 45514

## Source Schema

Instruction-style records are normalized into a single-turn conversation. Chat-style records are normalized into alternating `human` and `assistant` turns. Unsupported roles such as `system`, `tool`, and `function` are removed during cleaning.

## Intended Use

This dataset is intended for bilingual zh-en SFT training in Belle-style conversational format.

## License

Please review the licenses and terms of the original source datasets before redistribution or commercial use.
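The cleaning rules listed under Filtering And Cleaning could be sketched roughly as follows. This is a minimal illustration, not the dataset author's actual pipeline. All function names (`from_instruction`, `drop_unsupported_roles`, `is_valid`, `clean`, `count_tokens_whitespace`) are hypothetical. The sketch also assumes that a conversation must start with `human` and end with `assistant`, that the token count is summed over turn values, and that duplicates are detected by exact match on the turns. None of these details is stated in the description above. The whitespace counter is only a dependency-free stand-in; the description says the real count uses the `cl100k_base` tokenizer, and the garbled-content filter is omitted because its criteria are not given.

```python
import json
from typing import Callable, Iterable, Iterator

SUPPORTED_ROLES = ("human", "assistant")


def from_instruction(instruction: str, input_text: str, output: str) -> list[dict]:
    """Normalize an instruction-style record into a single-turn conversation."""
    prompt = instruction if not input_text else f"{instruction}\n{input_text}"
    return [
        {"from": "human", "value": prompt},
        {"from": "assistant", "value": output},
    ]


def drop_unsupported_roles(turns: list[dict]) -> list[dict]:
    """Remove turns such as `system`, `tool`, and `function`; keep only from/value."""
    return [
        {"from": t["from"], "value": t.get("value")}
        for t in turns
        if t.get("from") in SUPPORTED_ROLES
    ]


def is_valid(turns: list[dict]) -> bool:
    """Assumed rule: non-empty, strictly alternating human/assistant, no empty values."""
    if not turns or len(turns) % 2 != 0:
        return False
    for i, turn in enumerate(turns):
        value = turn.get("value")
        if turn.get("from") != SUPPORTED_ROLES[i % 2]:
            return False
        if not isinstance(value, str) or not value.strip():
            return False
    return True


def count_tokens_whitespace(text: str) -> int:
    """Stand-in counter. With tiktoken installed, one could instead use
    len(tiktoken.get_encoding("cl100k_base").encode(text))."""
    return len(text.split())


def clean(
    records: Iterable[list[dict]],
    count_tokens: Callable[[str], int],
    min_tokens: int = 20,
    max_tokens: int = 1024,
    prefix: str = "bilingual_belle_",
) -> Iterator[dict]:
    """Filter, deduplicate, and reindex conversations into the Belle-style schema."""
    seen: set[str] = set()
    kept = 0
    for turns in records:
        turns = drop_unsupported_roles(turns)
        if not is_valid(turns):
            continue
        # Assumption: the sample's token count is the sum over all turn values.
        total = sum(count_tokens(t["value"]) for t in turns)
        if not (min_tokens < total < max_tokens):
            continue
        # Assumption: duplicates are exact matches on the serialized turns.
        key = json.dumps(turns, ensure_ascii=False, sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        kept += 1
        yield {"id": f"{prefix}{kept:08d}", "conversations": turns}
```

Running `clean` once per source dataset and then once more over the concatenated results would correspond to the per-dataset and global deduplication steps, with ids assigned only in the final pass. Each yielded dict can be written to `train.jsonl` with `json.dumps(record, ensure_ascii=False)`.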
Provider:
acwater