agentlans/finewebedu-refinement
Source: Hugging Face. Updated 2025-04-07, indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/agentlans/finewebedu-refinement
Description:
The finewebedu-refinement dataset contains simplified versions of text excerpts from the HuggingFaceFW/fineweb-edu dataset. The simplifications aim to use plain language, remove unnecessary words, use the active voice, and split long sentences. The dataset has 9996 passages, each provided as both the original text and its simplified version. It has known limitations: the simplified text may skip details in long passages, drop references and formatting, oversimplify jargon, contain garbled words introduced during language model generation, and inherit biases from the original dataset. It also has issues with code and math typesetting.
Provider:
agentlans



