acwater/NLP

Name: acwater/NLP
Creator: acwater
Published: 2026-04-16 22:43:22
License: No description available

Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/acwater/NLP

Description:

# bilingual_belle_sft

## Dataset Summary

This dataset contains bilingual Chinese-English supervised fine-tuning data converted into a Belle-style schema:

```json
{
  "id": "unique_string_id",
  "conversations": [
    {"from": "human", "value": "..."},
    {"from": "assistant", "value": "..."}
  ]
}
```

The corpus was built by downloading, cleaning, converting, and merging the following public datasets:

- `shareAI/ShareGPT-Chinese-English-90k`
- `tatsu-lab/alpaca`
- `shibing624/alpaca-zh`
- `BAAI/COIG`

## Files

- `train.jsonl`: final merged Belle-format JSONL
- `stats.json`: merged corpus statistics

## Filtering And Cleaning

The pipeline applies the following rules after conversion into Belle-style format:

- keep only samples with `20 < token_count < 1024`
- token count is computed with the `cl100k_base` tokenizer
- remove empty or malformed samples
- remove samples with empty assistant replies
- remove obvious garbled or meaningless content
- validate alternating `human` and `assistant` turns
- deduplicate within each dataset
- merge all cleaned datasets
- deduplicate globally after merge
- reindex final ids as `bilingual_belle_00000001`, `bilingual_belle_00000002`, ...

## Language Mix

Merged counts:

- `output_count`: 224691
- `zh_count`: 60203
- `en_count`: 118974
- `mixed_count`: 45514

## Source Schema

Instruction-style records are normalized into a single-turn conversation. Chat-style records are normalized into alternating `human` and `assistant` turns. Unsupported roles such as `system`, `tool`, and `function` are removed during cleaning.

## Intended Use

This dataset is intended for bilingual zh-en SFT training in Belle-style conversational format.

## License

Please review the licenses and terms of the original source datasets before redistribution or commercial use.
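
The conversion and cleaning rules described in this card can be illustrated with a small Python sketch. This is not the pipeline that produced the dataset, only one plausible reading of the rules. The function names (`alpaca_to_belle`, `is_valid`, `clean`) are hypothetical, as are the way `instruction` and `input` are joined, the choice to count tokens over all turn values joined by newlines, the requirement that a conversation ends on an `assistant` turn, and the SHA-256 based duplicate check. The token counter is passed in as a callable so the sketch runs with the standard library alone; to follow the card, it could be built from the `cl100k_base` encoding in `tiktoken`.

```python
import hashlib
import json
from typing import Callable, Iterable, Iterator

ROLES = ("human", "assistant")  # expected order of speakers within a conversation


def alpaca_to_belle(record: dict) -> dict:
    """Normalize an instruction-style record into a single-turn conversation.

    Assumption: the optional `input` field is appended to the instruction.
    """
    prompt = (record.get("instruction") or "").strip()
    extra = (record.get("input") or "").strip()
    if extra:
        prompt = f"{prompt}\n{extra}"
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "assistant", "value": (record.get("output") or "").strip()},
        ]
    }


def is_valid(sample: dict) -> bool:
    """Accept only non-empty, strictly alternating human/assistant turns.

    Assumption: a conversation starts with `human` and ends with `assistant`.
    """
    turns = sample.get("conversations")
    if not isinstance(turns, list) or len(turns) < 2 or len(turns) % 2 != 0:
        return False
    for index, turn in enumerate(turns):
        if not isinstance(turn, dict) or turn.get("from") != ROLES[index % 2]:
            return False
        value = turn.get("value")
        if not isinstance(value, str) or not value.strip():
            return False
    return True


def clean(
    samples: Iterable[dict],
    count_tokens: Callable[[str], int],
    min_tokens: int = 20,
    max_tokens: int = 1024,
) -> Iterator[dict]:
    """Filter by validity and token count, drop exact duplicates, reindex ids."""
    seen = set()
    kept = 0
    for sample in samples:
        if not is_valid(sample):
            continue
        turns = sample["conversations"]
        # Assumption: the token count covers every turn, joined by newlines.
        text = "\n".join(turn["value"] for turn in turns)
        if not (min_tokens < count_tokens(text) < max_tokens):
            continue
        # Hypothetical duplicate key: hash of the serialized conversation.
        key = hashlib.sha256(
            json.dumps(turns, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept += 1
        yield {"id": f"bilingual_belle_{kept:08d}", "conversations": turns}
```

With `tiktoken` installed, a matching counter could be written as `enc = tiktoken.get_encoding("cl100k_base")` followed by `count_tokens = lambda text: len(enc.encode(text))`. Running `clean` once per source dataset and then once more over the concatenated results would mirror the per-dataset and global deduplication steps, assuming exact-match deduplication is what the pipeline used.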

Provider:

acwater
