OpenDCAI/dataflow-demo-Text
Source: Hugging Face | Updated 2025-12-29 | Indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/OpenDCAI/dataflow-demo-Text
Resource description:
This repository contains three independent datasets, each demonstrating a different pipeline of the DataFlow project. The first demonstrates the pretraining filtering pipeline: it filters invalid pages, advertisements, pornography, and irrelevant content out of Common Crawl web page data, then extracts the meaningful information into structured question-answer pairs. The second is a multi-turn conversation synthesis dataset of 15,240 samples, each a six-turn conversation synthesized with the GPT-4o API via the ConsistentChatGenerator operator. The third is an SFT synthesis dataset of 14,799 instruction-response samples, synthesized with the GPT-4o API via the CondorGenerator, CondorRefiner, and AlpagasusFilter operators.
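As a rough illustration of the filtering stage described above, the following is a minimal sketch of heuristic page filtering. It is not the DataFlow implementation: the `Page` type, the `AD_MARKERS` phrases, the `MIN_WORDS` threshold, and all function names are assumptions made up for this example, and a real pipeline would likely rely on trained classifiers rather than keyword checks.

```python
from dataclasses import dataclass

# Hypothetical advertising phrases; assumed for illustration only.
AD_MARKERS = ("buy now", "click here", "limited offer")
# Assumed minimum word count below which a page counts as invalid.
MIN_WORDS = 20


@dataclass
class Page:
    """A crawled web page reduced to its URL and extracted text."""
    url: str
    text: str


def is_invalid(page: Page) -> bool:
    """Treat pages with too little text as invalid."""
    return len(page.text.split()) < MIN_WORDS


def looks_like_ad(page: Page) -> bool:
    """Flag pages containing any of the assumed advertising phrases."""
    lowered = page.text.lower()
    return any(marker in lowered for marker in AD_MARKERS)


def filter_pages(pages: list[Page]) -> list[Page]:
    """Keep only the pages that pass every heuristic check."""
    return [p for p in pages if not is_invalid(p) and not looks_like_ad(p)]


if __name__ == "__main__":
    sample = [
        Page("https://example.com/a", "word " * 30),
        Page("https://example.com/b", "Buy now! " * 30),
        Page("https://example.com/c", "too short"),
    ]
    for page in filter_pages(sample):
        print(page.url)
```

Each check is a separate predicate so that further filters (for example, for adult or off-topic content) could be added without changing `filter_pages` beyond one extra condition.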
Provider:
OpenDCAI



