agentlans/finewebedu-sentences
Source: Hugging Face. Updated: 2024-07-09. Indexed: 2024-07-22.
Download link:
https://hf-mirror.com/datasets/agentlans/finewebedu-sentences
Description:
Fineweb-edu Sentences is a dataset of sentences collected from the web. The text was split into individual sentences using the spaCy package, then deduplicated and filtered for complete sentences in a semi-automated process. The dataset contains about 700,000 English sentences, each at most 512 tokens long as measured by the BERT tokenizer. The `source` field gives the source URL of each sentence. The dataset is licensed under the Open Data Commons Attribution License.
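The deduplication and length-filtering steps described above can be illustrated with a minimal sketch. This is not the dataset author's actual pipeline: it assumes the text has already been split into sentences (for example by spaCy), and the names `dedupe_and_filter` and `whitespace_token_count` are hypothetical. The whitespace counter is only a stand-in for the BERT tokenizer, which would normally produce more tokens per sentence.

```python
from typing import Callable, Iterable, List

MAX_TOKENS = 512  # length limit stated in the dataset description


def whitespace_token_count(sentence: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words.

    A real pipeline would count BERT tokens instead.
    """
    return len(sentence.split())


def dedupe_and_filter(
    sentences: Iterable[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Drop empty strings, exact duplicates, and sentences over the token limit.

    The first occurrence of each sentence is kept, and input order is preserved.
    """
    seen = set()
    kept = []
    for sentence in sentences:
        text = sentence.strip()
        if not text or text in seen:
            continue  # skip blanks and sentences already seen
        seen.add(text)
        if count_tokens(text) <= max_tokens:
            kept.append(text)
    return kept
```

To approximate the stated limit more closely, a function that counts BERT tokens could be passed as `count_tokens`. The semi-automated filtering for complete sentences is not described in enough detail to sketch here.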
Provided by:
agentlans



