pixparse/idl-wds
Source: Hugging Face. Updated 2024-03-29; indexed 2024-04-19.
Download link:
https://hf-mirror.com/datasets/pixparse/idl-wds
Resource description:
---
license: other
license_name: idl-train
license_link: LICENSE
task_categories:
- image-to-text
size_categories:
- 10M<n<100M
---
# Dataset Card for Industry Documents Library (IDL)
## Dataset Description
- **Point of Contact from curators:** [Kate Tasker, UCSF](mailto:kate.tasker@ucsf.edu)
- **Point of Contact Hugging Face:** [Pablo Montalvo](mailto:pablo@huggingface.co)
### Dataset Summary
Industry Documents Library (IDL) is a document dataset filtered from the [UCSF documents library](https://www.industrydocuments.ucsf.edu/), with 19 million pages kept as valid samples.
Each document consists of a PDF, a TIFF image rendering the same contents, a JSON file containing extensive Textract OCR annotations from the [idl_data](https://github.com/furkanbiten/idl_data) project, and a `.ocr` file with the original, older OCR annotation. Each PDF contains between 1 and 3000 pages.
<center>
<img src="https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/doc_images/idl_page_example.png" alt="An addendum from an internal legal document" width="600" height="300">
<p><em>An example page of one pdf document from the Industry Documents Library. </em></p>
</center>
This instance of IDL is in [webdataset](https://github.com/webdataset/webdataset/commits/main) .tar format.
### Usage with `chug`
Check out [chug](https://github.com/huggingface/chug), our optimized library for sharded dataset loading!
```python
import chug

task_cfg = chug.DataTaskDocReadCfg(page_sampling='all')
data_cfg = chug.DataCfg(
    source='pixparse/idl-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
data_loader = chug.create_loader(
    data_cfg,
    task_cfg,
)
sample = next(iter(data_loader))
```
### Usage with `datasets`
This dataset can also be used with the webdataset library or with current releases of Hugging Face `datasets`.
The example below uses the `streaming` parameter. That said, we recommend downloading the dataset to save bandwidth.
```python
from datasets import load_dataset

dataset = load_dataset('pixparse/idl-wds', streaming=True)
print(next(iter(dataset['train'])).keys())
# dict_keys(['__key__', '__url__', 'json', 'ocr', 'pdf', 'tif'])
```
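The keys printed above follow the webdataset convention: files in a `.tar` shard that share a basename are grouped into one sample, and each file extension becomes a field. As a rough illustration only, the sketch below builds a tiny in-memory tar with made-up file names and contents (`doc1`, `doc2` are hypothetical, not real sample keys) and groups its members the same way using the standard library. The helper names `group_tar_members` and `make_demo_shard` are assumptions for this example and are not part of webdataset or `datasets`.

```python
import io
import tarfile
from collections import defaultdict


def group_tar_members(tar_bytes):
    """Group the files of a webdataset-style tar by shared basename (the sample key)."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split at the first dot: "doc1.json" -> key "doc1", extension "json".
            key, _, ext = member.name.partition(".")
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


def make_demo_shard():
    """Build a tiny in-memory tar with made-up contents, standing in for a real shard."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in ["doc1.json", "doc1.pdf", "doc2.json", "doc2.pdf"]:
            payload = name.encode()  # placeholder bytes, not real document data
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


samples = group_tar_members(make_demo_shard())
print(sorted(samples))          # ['doc1', 'doc2']
print(sorted(samples["doc1"]))  # ['json', 'pdf']
```

In practice the webdataset library or `datasets` handles this grouping; the sketch is only meant to show why each sample exposes one field per file extension.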
For faster downloads, you can use the `huggingface_hub` library directly. Make sure `hf_transfer` is installed before downloading, and check that you have enough local disk space.
```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import HfApi, logging
#logging.set_verbosity_debug()
hf = HfApi()
hf.snapshot_download("pixparse/idl-wds", repo_type="dataset", local_dir_use_symlinks=False)
```
Further, a metadata file `_pdfa-english-train-info-minimal.json` contains the list of samples per shard, with the same basename and a `.json` or `.pdf` extension, as well as the count of files per shard.
#### Words and lines document metadata
Initially, we obtained the raw data from the IDL API and combined it with the `idl_data` annotation. This information is then reshaped into lines organized in reading order, under the key `lines`. We keep the non-reshaped word and bounding box information under the `word` key, should users want to apply their own heuristic.
We obtain an approximate reading order by looking at the frequency peaks of the leftmost word x-coordinate. A frequency peak means that a high number of lines start from the same x-coordinate. We then keep track of the x-coordinate of each column identified this way. If no peaks are found, the document is assumed to be readable in plain, single-column order.
The code used to detect columns is shown below.
```python
import numpy as np
import scipy.ndimage
import scipy.signal


def get_columnar_separators(page, min_prominence=0.3, num_bins=10, kernel_width=1):
    """
    Identifies the x-coordinates that best separate columns by analyzing the derivative of a histogram
    of the 'left' values (xmin) of bounding boxes.

    Args:
        page (dict): Page data with 'bbox' containing bounding boxes of words.
        min_prominence (float): The required prominence of peaks in the histogram.
        num_bins (int): Number of bins to use for the histogram.
        kernel_width (int): The width of the Gaussian kernel used for smoothing the histogram.

    Returns:
        separators (list): The x-coordinates that separate the columns, if any.
    """
    try:
        left_values = [b[0] for b in page['bbox']]
        hist, bin_edges = np.histogram(left_values, bins=num_bins)
        hist = scipy.ndimage.gaussian_filter1d(hist, kernel_width)
        min_val = min(hist)
        # Pad both ends with the minimum so that peaks in the first or last bin can be detected
        hist = np.insert(hist, [0, len(hist)], min_val)
        bin_width = bin_edges[1] - bin_edges[0]
        bin_edges = np.insert(bin_edges, [0, len(bin_edges)], [bin_edges[0] - bin_width, bin_edges[-1] + bin_width])
        peaks, _ = scipy.signal.find_peaks(hist, prominence=min_prominence * np.max(hist))
        derivatives = np.diff(hist)
        separators = []
        if len(peaks) > 1:
            # This finds the index of the maximum derivative value between peaks
            # which indicates peaks after trough --> column
            for i in range(len(peaks) - 1):
                peak_left = peaks[i]
                peak_right = peaks[i + 1]
                max_deriv_index = np.argmax(derivatives[peak_left:peak_right]) + peak_left
                separator_x = bin_edges[max_deriv_index + 1]
                separators.append(separator_x)
    except Exception:
        # On any error, fall back to no separators (plain reading order)
        separators = []
    return separators
```
That way, columnar documents can be better separated. This is a basic heuristic, but it should improve the overall readability of the documents.
<div style="text-align: center;">
<img src="https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/doc_images/bounding_boxes_straight.png" alt="Numbered bounding boxes on a document" style="width: 600px; height: 800px; object-fit: cover; display: inline-block;">
<img src="https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/doc_images/arrows_plot_straight.png" alt="A simple representation of reading order" style="width: 600px; height: 800px; object-fit: cover; display: inline-block;">
</div>
<p style="text-align: center;"><em>Standard reading order for a single-column document. On the left, bounding boxes are ordered, and on the right a rendition of the corresponding reading order is given.</em></p>
<div style="text-align: center;">
<img src="https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/doc_images/bounding_boxes.png" alt="Numbered bounding boxes on a document" style="width: 600px; height: 800px; object-fit: cover; display: inline-block;">
<img src="https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/doc_images/arrows_plot.png" alt="A simple representation of reading order" style="width: 600px; height: 800px; object-fit: cover; display: inline-block;">
</div>
<p style="text-align: center;"><em>Heuristic-driven columnar reading order for a two-column document. On the left, bounding boxes are ordered, and on the right a rendition of the corresponding reading order is given. Some inaccuracies remain but the overall reading order is preserved.</em></p>
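The card does not show how the detected separators are applied, so the following is only a minimal sketch of one plausible way to do it: assign each bounding box to a column based on its left edge, then read column by column, top to bottom. The function name `order_lines_by_column` and the toy coordinates are assumptions made for this example, not part of the dataset tooling.

```python
import bisect


def order_lines_by_column(bboxes, separators):
    """Return indices of `bboxes` in an approximate columnar reading order.

    bboxes: list of [left, top, width, height] with page-relative coordinates.
    separators: x-coordinates that split the page into columns (may be empty).
    """
    cuts = sorted(separators)

    def sort_key(index):
        left, top = bboxes[index][0], bboxes[index][1]
        # Column number = how many separators lie at or to the left of this box.
        column = bisect.bisect_right(cuts, left)
        return (column, top, left)

    return sorted(range(len(bboxes)), key=sort_key)


# Toy two-column page (made-up coordinates).
bboxes = [
    [0.55, 0.10, 0.30, 0.02],  # right column, first line
    [0.10, 0.20, 0.30, 0.02],  # left column, second line
    [0.10, 0.10, 0.30, 0.02],  # left column, first line
    [0.55, 0.20, 0.30, 0.02],  # right column, second line
]
print(order_lines_by_column(bboxes, separators=[0.5]))  # [2, 1, 0, 3]
print(order_lines_by_column(bboxes, separators=[]))     # [2, 0, 1, 3]
```

With an empty separator list, every box falls in the same column and the order reduces to plain top-to-bottom, left-to-right reading.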
We also store per-shard statistics: the number of pages and the number of valid samples. A valid sample is one that can be encoded and then decoded, a check we ran on every sample.
### Data, metadata and statistics
The metadata for each document is formatted as follows. Each `pdf` is paired with a `json` file with the structure below. Entries have been shortened for readability.
```json
{
  "pages": [
    {
      "text": [
        "COVIDIEN",
        "Mallinckrodt",
        "Addendum",
        "This Addendum to the Consulting Agreement (the \"Agreement\") of July 28, 2010 (\"Effective Date\") by",
        "and between David Brushwod, R.Ph., J.D., with an address at P.O. Box 100496, Gainesville, FL 32610-"
      ],
      "bbox": [
        [0.185964, 0.058857, 0.092199, 0.011457],
        [0.186465, 0.079529, 0.087209, 0.009247],
        [0.459241, 0.117854, 0.080015, 0.011332],
        [0.117109, 0.13346, 0.751004, 0.014365],
        [0.117527, 0.150306, 0.750509, 0.012954]
      ],
      "poly": [
        [
          {"X": 0.185964, "Y": 0.058857}, {"X": 0.278163, "Y": 0.058857}, {"X": 0.278163, "Y": 0.070315}, {"X": 0.185964, "Y": 0.070315}
        ],
        [
          {"X": 0.186465, "Y": 0.079529}, {"X": 0.273673, "Y": 0.079529}, {"X": 0.273673, "Y": 0.088777}, {"X": 0.186465, "Y": 0.088777}
        ],
        [
          {"X": 0.459241, "Y": 0.117854}, {"X": 0.539256, "Y": 0.117854}, {"X": 0.539256, "Y": 0.129186}, {"X": 0.459241, "Y": 0.129186}
        ],
        [
          {"X": 0.117109, "Y": 0.13346}, {"X": 0.868113, "Y": 0.13346}, {"X": 0.868113, "Y": 0.147825}, {"X": 0.117109, "Y": 0.147825}
        ],
        [
          {"X": 0.117527, "Y": 0.150306}, {"X": 0.868036, "Y": 0.150306}, {"X": 0.868036, "Y": 0.163261}, {"X": 0.117527, "Y": 0.163261}
        ]
      ],
      "score": [
        0.9939, 0.5704, 0.9961, 0.9898, 0.9935
      ]
    }
  ]
}
```
The top-level key, `pages`, is a list of every page in the document. The above example shows only one page. `text` is a list of the lines on the page; the bounding box of each line is stored at the same position in the next entry, `bbox`. `bbox` contains the bounding box coordinates in `left, top, width, height` format, with coordinates relative to the page size. `poly` is the corresponding polygon.
`score` is the confidence score for each line, as returned by Textract.
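To make the `left, top, width, height` convention concrete, here is a small sketch that turns one `bbox` from the example above into its four corners (in the same order as `poly`) and into pixel coordinates. The helper names and the 1000 x 2000 pixel page size are assumptions chosen for illustration; the real page size has to come from the rendered image.

```python
def bbox_to_corners(bbox):
    """Convert [left, top, width, height] to corners in the order used by `poly`:
    top-left, top-right, bottom-right, bottom-left."""
    left, top, width, height = bbox
    right, bottom = left + width, top + height
    return [(left, top), (right, top), (right, bottom), (left, bottom)]


def bbox_to_pixels(bbox, page_width, page_height):
    """Scale a page-relative bbox to integer pixel (x_min, y_min, x_max, y_max)."""
    left, top, width, height = bbox
    return (
        round(left * page_width),
        round(top * page_height),
        round((left + width) * page_width),
        round((top + height) * page_height),
    )


# First bbox of the example page ("COVIDIEN").
bbox = [0.185964, 0.058857, 0.092199, 0.011457]
corners = bbox_to_corners(bbox)
print(corners)

# Hypothetical page size of 1000 x 2000 pixels.
print(bbox_to_pixels(bbox, page_width=1000, page_height=2000))  # (186, 118, 278, 141)
```

The bottom-right corner computed this way agrees with the third point of the first `poly` entry up to rounding in the last digit.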
### Data Splits
#### Train
* `idl-train-*.tar`
* Downloaded on 2023/12/16
* 3000 shards, 3144726 samples, 19174595 pages
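As a quick sanity check on these figures, the averages they imply can be computed directly. The snippet uses only the three numbers listed above.

```python
shards, samples, pages = 3000, 3144726, 19174595

samples_per_shard = samples / shards
pages_per_sample = pages / samples
pages_per_shard = pages / shards

print(f"{samples_per_shard:.1f} samples per shard")  # about 1048.2
print(f"{pages_per_sample:.2f} pages per sample")    # about 6.10
print(f"{pages_per_shard:.1f} pages per shard")      # about 6391.5
```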
## Additional Information
### Dataset Curators
Pablo Montalvo, Ross Wightman
### Licensing Information
While the Industry Documents Library is a public archive of documents and audiovisual materials, companies or individuals hold the rights to the information they created, meaning material cannot be “substantially” reproduced in books or other media without the copyright holder’s permission.
The use of copyrighted material, including reproduction, is governed by United States copyright law (Title 17, United States Code). The law may permit the “fair use” of a copyrighted work, including the making of a photocopy, “for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship or research.” 17 U.S.C. § 107.
The Industry Documents Library makes its collections available under court-approved agreements with the rightsholders or under the fair use doctrine, depending on the collection.
According to the US Copyright Office, when determining whether a particular use comes under “fair use” you must consider the following:
* the purpose and character of the use, including whether it is of commercial nature or for nonprofit educational purposes;
* the nature of the copyrighted work itself;
* how much of the work you are using in relation to the copyrighted work as a whole (1 page of a 1000 page work or 1 print advertisement vs. an entire 30 second advertisement);
* the effect of the use upon the potential market for or value of the copyrighted work. (For additional information see the US Copyright Office Fair Use Index).
Each user of this website is responsible for ensuring compliance with applicable copyright laws. Persons obtaining, or later using, a copy of copyrighted material in excess of “fair use” may become liable for copyright infringement. By accessing this website, the user agrees to hold harmless the University of California, its affiliates and their directors, officers, employees and agents from all claims and expenses, including attorneys’ fees, arising out of the use of this website by the user.
For more in-depth information on copyright and fair use, visit the [Stanford University Libraries’ Copyright and Fair Use website](https://fairuse.stanford.edu/).
If you hold copyright to a document or documents in our collections and have concerns about our inclusion of this material, please see the IDL Take-Down Policy or contact us with any questions.
In this dataset, the Industry Documents Library API reports the following permission counts per file. All files are now public: none are currently "confidential" or "privileged", although some formerly were.
```json
{
  "public/no restrictions": 3005133,
  "public/formerly confidential": 264978,
  "public/formerly privileged": 30063,
  "public/formerly privileged/formerly confidential": 669,
  "public/formerly confidential/formerly privileged": 397
}
```
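For readers who want proportions rather than raw counts, the short sketch below sums the permission counts listed above and prints each category's share. The variable names are illustrative; the numbers are copied from the table as given.

```python
permission_counts = {
    "public/no restrictions": 3005133,
    "public/formerly confidential": 264978,
    "public/formerly privileged": 30063,
    "public/formerly privileged/formerly confidential": 669,
    "public/formerly confidential/formerly privileged": 397,
}

total = sum(permission_counts.values())
print(total)  # 3301240

# Share of each permission category, largest first.
for label, count in sorted(permission_counts.items(), key=lambda item: -item[1]):
    print(f"{label}: {count / total:.2%}")
```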
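Provider:
pixparse
Summary of the original information
Dataset overview
Name: Industry Documents Library (IDL)
Source: a dataset filtered from the UCSF documents library.
Sample count: 19 million pages kept as valid samples.
File formats:
- PDF file: each document includes one PDF, with between 1 and 3000 pages.
- TIFF image: each document has a TIFF image rendering the same contents.
- JSON file: contains Textract OCR annotations from the idl_data project.
- .ocr file: contains the original, older OCR annotations.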