hrkhosravi/high-quality-english-sentences
Hugging Face · Updated 2025-12-18 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/hrkhosravi/high-quality-english-sentences
Description:
This dataset contains a collection of high-quality English sentences sourced from C4 and FineWeb (*not* FineWeb-Edu). The sentences have been filtered and processed to ensure quality and uniqueness. "High-quality" here means that they are legible English and not spam, although they may still contain spelling and grammar errors. The dataset was created to provide diverse, high-quality English sentences for various NLP tasks. Data processing consists of initial sentence filtering (quality score > 0.5, length >= 20 characters), additional filtering (removal of sentences that do not start with a capital letter or that have unmatched parentheses), deduplication (exact match), and a train-test split (90% train, 10% test). The final dataset contains 1,705,221 sentences, with 1,534,699 in the train set and 170,522 in the test set.
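The processing steps described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's actual pipeline: the description does not say how the quality score is computed, so the sketch assumes each sentence arrives already paired with a score. The function names (`has_balanced_parentheses`, `keep_sentence`, `build_splits`), the restriction to round parentheses, the shuffle before splitting, and the `seed` parameter are all assumptions made for the example.

```python
import random


def has_balanced_parentheses(sentence: str) -> bool:
    """Return True if every '(' has a matching ')' in the right order.

    Assumption: only round parentheses are checked.
    """
    depth = 0
    for ch in sentence:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def keep_sentence(sentence: str, quality_score: float) -> bool:
    """Apply the thresholds stated in the description."""
    return (
        quality_score > 0.5              # quality score > 0.5
        and len(sentence) >= 20          # length >= 20 characters
        and sentence[:1].isupper()       # starts with a capital letter
        and has_balanced_parentheses(sentence)
    )


def build_splits(scored_sentences, test_fraction=0.1, seed=0):
    """Filter, deduplicate by exact match, then split into train and test.

    `scored_sentences` is an iterable of (sentence, quality_score) pairs.
    Shuffling with a fixed seed is an assumption, not stated in the source.
    """
    seen = set()
    kept = []
    for sentence, score in scored_sentences:
        if keep_sentence(sentence, score) and sentence not in seen:
            seen.add(sentence)
            kept.append(sentence)

    random.Random(seed).shuffle(kept)
    n_test = int(len(kept) * test_fraction)  # floor of the test share
    return kept[n_test:], kept[:n_test]      # (train, test)
```

The reported sizes are consistent with a split of this kind: 10% of 1,705,221 rounded down is 170,522, and the remaining 1,534,699 sentences form the train set.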
Provider:
hrkhosravi



