marianna13/PDF_extraction_sample
Source: Hugging Face · Updated 2023-10-23 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/marianna13/PDF_extraction_sample
Description:
# Some stats for 10 random WAT files from Common Crawl (CC) (see [GitHub](https://github.com/marianna13/PDF_extraction) for more info)
## Stats for the links
|Number of PDF links|
|------------------|
| 131379|
|Number of working PDF links from 10k sample|
|-------------------------|
| 3904|
|sum(num_words)|
|--------------|
| 384953|
|sum(num_tokens)|
|---------------|
| 715422|
| avg(num_words)|
|-----------------|
|6999.145454545454|
| avg(num_tokens)|
|------------------|
|13007.672727272728|
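The sums and averages above also imply how many documents the averages were taken over, and roughly how many tokens each word costs. The following is a minimal sketch of that arithmetic using only the figures in the tables; the variable names are illustrative, reading "10k sample" as exactly 10,000 links is an assumption, and so is extrapolating the sample's success rate to all links (it holds only if the sample is representative).

```python
# Figures copied from the tables above.
total_pdf_links = 131_379
sample_size = 10_000  # assumption: "10k sample" means exactly 10,000 links
working_in_sample = 3_904
sum_words, avg_words = 384_953, 6999.145454545454
sum_tokens, avg_tokens = 715_422, 13007.672727272728

# Share of sampled links that resolved to a working PDF.
working_rate = working_in_sample / sample_size

# Assumption: the sample is representative of all collected links.
estimated_working_links = round(total_pdf_links * working_rate)

# sum / avg recovers the number of documents behind each average.
docs_from_words = round(sum_words / avg_words)
docs_from_tokens = round(sum_tokens / avg_tokens)

# Average tokenizer cost per word across the measured documents.
tokens_per_word = sum_tokens / sum_words

print(f"working rate:            {working_rate:.2%}")
print(f"estimated working links: {estimated_working_links}")
print(f"documents (from words):  {docs_from_words}")
print(f"documents (from tokens): {docs_from_tokens}")
print(f"tokens per word:         {tokens_per_word:.2f}")
```

Both averages work out to the same document count, which suggests the word and token statistics were computed over the same set of extracted documents.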
## Stats for extracted data (for 100 random URLs)
1 process:
| total_processing_time | No error | FSTimeoutError | FileDataError cannot open broken document | Empty doc | ValueError Protocol not known: "http | TypeError _request() got an unexpected keyword argument 'target_options' | FileNotFoundError |
|------------------------:|-----------:|------------------:|--------------------------------------------:|------------:|---------------------------------------:|---------------------------------------------------------------------------:|--------------------:|
| 147.385 | 54 | 17 | 11 | 8 | 2 | 1 | 7 |
5 processes:
| total_processing_time | No error | FSTimeoutError | FileDataError cannot open broken document | Empty doc | ValueError Protocol not known: "http | FileNotFoundError | TypeError _request() got an unexpected keyword argument 'target_options' |
|------------------------:|-----------:|------------------:|--------------------------------------------:|------------:|---------------------------------------:|--------------------:|---------------------------------------------------------------------------:|
| 28.9343 | 53 | 17 | 12 | 8 | 2 | 7 | 1 |
10 processes:
| total_processing_time | No error | FSTimeoutError | Empty doc | FileDataError cannot open broken document | FileNotFoundError | TypeError _request() got an unexpected keyword argument 'target_options' | ValueError Protocol not known: "http |
|------------------------:|-----------:|------------------:|------------:|--------------------------------------------:|--------------------:|---------------------------------------------------------------------------:|---------------------------------------:|
| 14.9258 | 55 | 17 | 8 | 12 | 5 | 1 | 2 |
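To compare the three runs, the timings can be turned into speedup and parallel efficiency relative to the single-process run. This is a rough sketch based only on the numbers in the tables above; the `runs` structure is illustrative, and the source does not state the time unit, so only ratios are computed.

```python
# processes -> (total_processing_time, outcome counts in table column order)
runs = {
    1: (147.385, [54, 17, 11, 8, 2, 1, 7]),
    5: (28.9343, [53, 17, 12, 8, 2, 7, 1]),
    10: (14.9258, [55, 17, 8, 12, 5, 1, 2]),
}

baseline_time = runs[1][0]
summary = {}

for n_processes, (elapsed, counts) in runs.items():
    # Each run covers the same 100 URLs, so the outcome counts should sum to 100.
    assert sum(counts) == 100, f"row for {n_processes} processes does not sum to 100"

    speedup = baseline_time / elapsed   # how many times faster than 1 process
    efficiency = speedup / n_processes  # 1.0 would be perfect linear scaling
    summary[n_processes] = (speedup, efficiency)

    print(f"{n_processes:>2} processes: speedup {speedup:.2f}x, "
          f"efficiency {efficiency:.2f}")
```

On these figures the speedup is about 5.1x with 5 processes and about 9.9x with 10, so scaling is close to linear up to 10 processes. The per-category counts differ slightly between runs (for example, 54, 53, and 55 successes), which is plausibly due to network variability rather than the process count, though the source does not say.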
Provider:
marianna13
Summary of the original information
Dataset statistics
PDF link statistics
| Statistic | Value |
|---|---|
| Total number of PDF links | 131379 |
| Working PDF links in the 10k sample | 3904 |
| Total words | 384953 |
| Total tokens | 715422 |
| Average words | 6999.145454545454 |
| Average tokens | 13007.672727272728 |
Extracted data statistics (100 random URLs)
1 process
| Total processing time | No error | FSTimeoutError | FileDataError | Empty doc | ValueError | TypeError | FileNotFoundError |
|---|---|---|---|---|---|---|---|
| 147.385 | 54 | 17 | 11 | 8 | 2 | 1 | 7 |
5 processes
| Total processing time | No error | FSTimeoutError | FileDataError | Empty doc | ValueError | FileNotFoundError | TypeError |
|---|---|---|---|---|---|---|---|
| 28.9343 | 53 | 17 | 12 | 8 | 2 | 7 | 1 |
10 processes
| Total processing time | No error | FSTimeoutError | Empty doc | FileDataError | FileNotFoundError | TypeError | ValueError |
|---|---|---|---|---|---|---|---|
| 14.9258 | 55 | 17 | 8 | 12 | 5 | 1 | 2 |



