Loay & Safa Dataset

Mendeley Data, indexed 2026-04-18
Download link:
https://data.mendeley.com/datasets/ggh75fd25f
Description:
The goal of this project is to make a large body of textual data available in electronic form for testing the performance of systems such as retrieval systems, search engines, and plagiarism checkers. To that end, the data were collected from papers, books, and articles; file content may recur across files, and file sizes range from 1 KB to 20,374 KB. The “Raw Dataset” totals 4.64 GB. The dataset also provides a “Modified Dataset”, produced by lexical analysis that parses each input file and extracts only words consisting entirely of English alphabet characters. Because the files were taken from many sources, each letter of each word is checked in order to handle cases such as we ’re → we are, don’t → do not, bi-cycle → bicycle, B.S. → BS, and up/down → up down. Despite our best efforts to clean this dataset, it still contains a very small percentage of non-English words and non-words. No automatic spelling correction was performed. After filtering to keep only non-empty files, the resulting data total 4.27 GB. Note: the raw dataset consists of text files that may contain English alphabet characters, other symbols, non-printable characters, and numbers. Last update: November 2016.
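The cleaning rules listed in the description can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `clean_text`, the `IRREGULAR` table, and the exact regular expressions are assumptions chosen only to reproduce the five examples given above (we ’re, don’t, bi-cycle, B.S., up/down).

```python
import re
import string

# Hypothetical table for contractions that the generic "n't" rule would mangle.
IRREGULAR = {"won't": "will not", "can't": "cannot", "shan't": "shall not"}


def clean_text(text: str) -> list[str]:
    """Return the words of `text` that contain only English letters."""
    # Treat the typographic apostrophe like the ASCII one.
    text = text.replace("\u2019", "'")

    # Irregular contractions first, then the generic rules.
    for short, full in IRREGULAR.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text,
                      flags=re.IGNORECASE)
    text = re.sub(r"n't\b", " not", text)        # don't  -> do not
    text = re.sub(r"\s*'re\b", " are", text)     # we 're -> we are

    # Dotted abbreviations: B.S. -> BS
    text = re.sub(r"\b(?:[A-Za-z]\.){2,}",
                  lambda m: m.group(0).replace(".", ""), text)

    # Hyphen inside a word: bi-cycle -> bicycle
    text = re.sub(r"(?<=[A-Za-z])-(?=[A-Za-z])", "", text)

    # Slash between words: up/down -> up down
    text = text.replace("/", " ")

    words = []
    for token in text.split():
        token = token.strip(string.punctuation)
        # Keep only tokens made entirely of English letters.
        if re.fullmatch(r"[A-Za-z]+", token):
            words.append(token)
    return words
```

A file whose cleaned word list is empty would be dropped, mirroring the non-empty-file filter mentioned in the description. No spelling correction is attempted, so non-words that consist only of letters pass through unchanged.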
Created:
2016-12-09