Loay & Safa Dataset
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/ggh75fd25f
Description:
The goal of this project is to make a large body of textual data available in electronic form for testing the performance of systems such as retrieval systems, search engines, and plagiarism checkers. To that end, we collected data from papers, books, and articles. File content may recur across files, and file sizes range from 1 KB to 20,374 KB. The “Raw Dataset” totals 4.64 GB.
The dataset also provides a “Modified Dataset”, produced by a lexical analysis pass that reads each input file and extracts only the words made up of English alphabet characters. Because the files were taken from many sources, each word is checked letter by letter to handle cases such as: we ’re → we are, don’t → do not, bi-cycle → bicycle, B.S. → BS, and up/down → up down. Despite our best efforts to clean this dataset, it still contains a very small percentage of non-English words and non-words. No automatic spelling correction was performed. After filtering so that only non-empty files are kept, the resulting data is 4.27 GB.
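The cleaning rules listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' actual tool: the function name `normalize`, the regular expressions, and the order of the steps are all hypothetical, and only the five example cases from the description are covered.

```python
import re


def normalize(text):
    """Return the English-alphabet words of `text` (hypothetical sketch).

    Covers only the example cases named in the dataset description;
    the real lexical analyzer may apply different or additional rules.
    """
    # Treat the typographic apostrophe like a plain one.
    text = text.replace("\u2019", "'")
    # Re-attach a detached contraction suffix: "we 're" -> "we're".
    text = re.sub(r"\s+'", "'", text)
    # Expand the two contractions used as examples (assumed rules;
    # irregular forms such as "can't" or "won't" are not handled).
    text = re.sub(r"n't\b", " not", text)   # don't -> do not
    text = re.sub(r"'re\b", " are", text)   # we're -> we are
    # Join hyphenated words: bi-cycle -> bicycle.
    text = re.sub(r"(?<=[A-Za-z])-(?=[A-Za-z])", "", text)
    # Collapse dotted abbreviations: B.S. -> BS.
    text = re.sub(r"\b(?:[A-Za-z]\.){2,}",
                  lambda m: m.group(0).replace(".", ""), text)
    # Split slash-separated alternatives: up/down -> up down.
    text = text.replace("/", " ")
    # Keep only runs of English letters; digits and symbols are dropped.
    return re.findall(r"[A-Za-z]+", text)


if __name__ == "__main__":
    print(normalize("We \u2019re sure they don\u2019t ride a bi-cycle up/down, B.S. 2016!"))
```

In a pipeline like the one described, a file whose word list comes back empty would be skipped, which corresponds to the non-empty filter. Whether the original tool changes letter case is not stated, so the sketch leaves case untouched.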
Note: each text file in the raw dataset may contain English alphabet characters, other symbols, non-printable characters, and numbers.
Last update: November 2016.
Created:
2016-12-09



