
idl-wds

ModelScope community · Updated 2026-04-28 · Indexed 2024-05-15
Download link:
https://modelscope.cn/datasets/AI-ModelScope/idl-wds
Resource description:
---
license: other
license_name: idl-train
license_link: LICENSE
task_categories:
- image-to-text
size_categories:
- 10M<n<100M
---

# Dataset Card for Industry Documents Library (IDL)

## Dataset Description

- **Point of Contact from curators:** [Kate Tasker, UCSF](mailto:kate.tasker@ucsf.edu)
- **Point of Contact Hugging Face:** [Pablo Montalvo](mailto:pablo@huggingface.co)

### Dataset Summary

Industry Documents Library (IDL) is a document dataset filtered from the [UCSF documents library](https://www.industrydocuments.ucsf.edu/), with 19 million pages kept as valid samples.
Each document exists as a collection of a pdf, a tiff image with the same contents rendered, a json file containing extensive Textract OCR annotations from the [idl_data](https://github.com/furkanbiten/idl_data) project, and a .ocr file with the original, older OCR annotation. A pdf may contain anywhere from 1 to 3000 pages.

<center>
    <img src="https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/doc_images/idl_page_example.png" alt="An addendum from an internal legal document" width="600" height="300">
    <p><em>An example page of one pdf document from the Industry Documents Library. </em></p>
</center>

This instance of IDL is in [webdataset](https://github.com/webdataset/webdataset/commits/main) .tar format.

### Usage with `chug`

Check out [chug](https://github.com/huggingface/chug), our optimized library for sharded dataset loading!

```python
import chug

task_cfg = chug.DataTaskDocReadCfg(page_sampling='all')
data_cfg = chug.DataCfg(
    source='pixparse/idl-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
data_loader = chug.create_loader(
    data_cfg,
    task_cfg,
)
sample = next(iter(data_loader))
```

### Usage with datasets

This dataset can also be used with the webdataset library or current releases of Hugging Face `datasets`. Here is an example using the `streaming` parameter, although we recommend downloading the dataset to save bandwidth.
```python
from datasets import load_dataset

dataset = load_dataset('pixparse/idl-wds', streaming=True)
print(next(iter(dataset['train'])).keys())
# >> dict_keys(['__key__', '__url__', 'json', 'ocr', 'pdf', 'tif'])
```

For faster download, you can directly use the `huggingface_hub` library. Make sure `hf_transfer` is installed prior to downloading, and check that you have enough space locally.

```python
import os
# Set before importing huggingface_hub so that the setting is picked up.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import HfApi, logging

#logging.set_verbosity_debug()
hf = HfApi()
hf.snapshot_download("pixparse/idl-wds", repo_type="dataset", local_dir_use_symlinks=False)
```

Further, a metadata file `_pdfa-english-train-info-minimal.json` contains the list of samples per shard, with the same basename and a `.json` or `.pdf` extension, as well as the count of files per shard.

#### Words and lines document metadata

Initially, we obtained the raw data from the IDL API and combined it with the `idl_data` annotation. This information is then reshaped into lines organized in reading order, under the key `lines`. We keep the non-reshaped word and bounding box information under the `word` key, should users want to apply their own heuristic.

We obtain an approximate reading order by looking at the frequency peaks of the leftmost word x-coordinate. A frequency peak means that a high number of lines start from the same point. We then keep track of the x-coordinate of each column identified this way. If no peaks are found, the document is assumed to be readable in plain format.
The code used to detect columns is shown below.

```python
import numpy as np
import scipy.ndimage
import scipy.signal


def get_columnar_separators(page, min_prominence=0.3, num_bins=10, kernel_width=1):
    """
    Identifies the x-coordinates that best separate columns by analyzing the derivative of a histogram
    of the 'left' values (xmin) of bounding boxes.

    Args:
        page (dict): Page data with 'bbox' containing bounding boxes of words.
        min_prominence (float): The required prominence of peaks in the histogram.
        num_bins (int): Number of bins to use for the histogram.
        kernel_width (int): The width of the Gaussian kernel used for smoothing the histogram.

    Returns:
        separators (list): The x-coordinates that separate the columns, if any.
    """
    try:
        # Histogram of the left edge of every bounding box, smoothed with a Gaussian kernel.
        left_values = [b[0] for b in page['bbox']]
        hist, bin_edges = np.histogram(left_values, bins=num_bins)
        hist = scipy.ndimage.gaussian_filter1d(hist, kernel_width)
        # Pad both ends with the minimum so that peaks in the first and last bins can be detected.
        min_val = min(hist)
        hist = np.insert(hist, [0, len(hist)], min_val)
        bin_width = bin_edges[1] - bin_edges[0]
        bin_edges = np.insert(bin_edges, [0, len(bin_edges)], [bin_edges[0] - bin_width, bin_edges[-1] + bin_width])

        peaks, _ = scipy.signal.find_peaks(hist, prominence=min_prominence * np.max(hist))
        derivatives = np.diff(hist)

        separators = []
        if len(peaks) > 1:
            # This finds the index of the maximum derivative value between peaks
            # which indicates peaks after trough --> column
            for i in range(len(peaks)-1):
                peak_left = peaks[i]
                peak_right = peaks[i+1]
                max_deriv_index = np.argmax(derivatives[peak_left:peak_right]) + peak_left
                separator_x = bin_edges[max_deriv_index + 1]
                separators.append(separator_x)
    except Exception as e:
        # Any failure (for example an empty page) is treated as "no columns found".
        separators = []
    return separators
```

That way, columnar documents can be better separated. This is a basic heuristic, but it should improve the overall readability of the documents.

<div style="text-align: center;">
    <img src="https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/doc_images/bounding_boxes_straight.png" alt="Numbered bounding boxes on a document" style="width: 600px; height: 800px; object-fit: cover; display: inline-block;">
    <img src="https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/doc_images/arrows_plot_straight.png" alt="A simple representation of reading order" style="width: 600px; height: 800px; object-fit: cover; display: inline-block;">
</div>
<p style="text-align: center;"><em>Standard reading order for a single-column document. On the left, bounding boxes are ordered, and on the right a rendition of the corresponding reading order is given.</em></p>

<div style="text-align: center;">
    <img src="https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/doc_images/bounding_boxes.png" alt="Numbered bounding boxes on a document" style="width: 600px; height: 800px; object-fit: cover; display: inline-block;">
    <img src="https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/doc_images/arrows_plot.png" alt="A simple representation of reading order" style="width: 600px; height: 800px; object-fit: cover; display: inline-block;">
</div>
<p style="text-align: center;"><em>Heuristic-driven columnar reading order for a two-column document. On the left, bounding boxes are ordered, and on the right a rendition of the corresponding reading order is given. Some inaccuracies remain but the overall reading order is preserved.</em></p>

For each pdf document, we store statistics on the number of pages per shard and the number of valid samples per shard. A valid sample is a sample that can be encoded then decoded, which we checked for each sample.

### Data, metadata and statistics

<center>
    <img src="https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/doc_images/idl_page_example.png" alt="An addendum from an internal legal document" width="600" height="300">
    <p><em>An example page of one pdf document from the Industry Documents Library. </em></p>
</center>

The metadata for each document has been formatted in this way. Each `pdf` is paired with a `json` file with the following structure. Entries have been shortened for readability.

```json
{
  "pages": [
    {
      "text": [
        "COVIDIEN",
        "Mallinckrodt",
        "Addendum",
        "This Addendum to the Consulting Agreement (the \"Agreement\") of July 28, 2010 (\"Effective Date\") by",
        "and between David Brushwod, R.Ph., J.D., with an address at P.O. Box 100496, Gainesville, FL 32610-"
      ],
      "bbox": [
        [0.185964, 0.058857, 0.092199, 0.011457],
        [0.186465, 0.079529, 0.087209, 0.009247],
        [0.459241, 0.117854, 0.080015, 0.011332],
        [0.117109, 0.13346, 0.751004, 0.014365],
        [0.117527, 0.150306, 0.750509, 0.012954]
      ],
      "poly": [
        [
          {"X": 0.185964, "Y": 0.058857}, {"X": 0.278163, "Y": 0.058857}, {"X": 0.278163, "Y": 0.070315}, {"X": 0.185964, "Y": 0.070315}
        ],
        [
          {"X": 0.186465, "Y": 0.079529}, {"X": 0.273673, "Y": 0.079529}, {"X": 0.273673, "Y": 0.088777}, {"X": 0.186465, "Y": 0.088777}
        ],
        [
          {"X": 0.459241, "Y": 0.117854}, {"X": 0.539256, "Y": 0.117854}, {"X": 0.539256, "Y": 0.129186}, {"X": 0.459241, "Y": 0.129186}
        ],
        [
          {"X": 0.117109, "Y": 0.13346}, {"X": 0.868113, "Y": 0.13346}, {"X": 0.868113, "Y": 0.147825}, {"X": 0.117109, "Y": 0.147825}
        ],
        [
          {"X": 0.117527, "Y": 0.150306}, {"X": 0.868036, "Y": 0.150306}, {"X": 0.868036, "Y": 0.163261}, {"X": 0.117527, "Y": 0.163261}
        ]
      ],
      "score": [
        0.9939, 0.5704, 0.9961, 0.9898, 0.9935
      ]
    }
  ]
}
```

The top-level key, `pages`, is a list of every page in the document. The above example shows only one page.
`text` is a list of lines in the document, with their individual associated bounding box in the next entry. `bbox` contains the bounding box coordinates in `left, top, width, height` format, with coordinates relative to the page size. `poly` is the corresponding polygon.
`score` is the confidence score for each line obtained with Textract.

### Data Splits

#### Train

* `idl-train-*.tar`
* Downloaded on 2023/12/16
* 3000 shards, 3144726 samples, 19174595 pages

## Additional Information

### Dataset Curators

Pablo Montalvo, Ross Wightman

### Licensing Information

While the Industry Documents Library is a public archive of documents and audiovisual materials, companies or individuals hold the rights to the information they created, meaning material cannot be “substantially” reproduced in books or other media without the copyright holder’s permission.

The use of copyrighted material, including reproduction, is governed by United States copyright law (Title 17, United States Code). The law may permit the “fair use” of a copyrighted work, including the making of a photocopy, “for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship or research.” 17 U.S.C. § 107.

The Industry Documents Library makes its collections available under court-approved agreements with the rightsholders or under the fair use doctrine, depending on the collection.

According to the US Copyright Office, when determining whether a particular use comes under “fair use” you must consider the following:

* the purpose and character of the use, including whether it is of commercial nature or for nonprofit educational purposes;
* the nature of the copyrighted work itself;
* how much of the work you are using in relation to the copyrighted work as a whole (1 page of a 1000 page work or 1 print advertisement vs. an entire 30 second advertisement);
* the effect of the use upon the potential market for or value of the copyrighted work.

(For additional information see the US Copyright Office Fair Use Index.)

Each user of this website is responsible for ensuring compliance with applicable copyright laws. Persons obtaining, or later using, a copy of copyrighted material in excess of “fair use” may become liable for copyright infringement. By accessing this website, the user agrees to hold harmless the University of California, its affiliates and their directors, officers, employees and agents from all claims and expenses, including attorneys’ fees, arising out of the use of this website by the user.

For more in-depth information on copyright and fair use, visit the [Stanford University Libraries’ Copyright and Fair Use website](https://fairuse.stanford.edu/).

If you hold copyright to a document or documents in our collections and have concerns about our inclusion of this material, please see the IDL Take-Down Policy or contact us with any questions.

In the dataset, the Industry Documents Library API reports the following permission counts per file. All files are now public: none are currently "confidential" or "privileged", although some formerly were.

```json
{
  "public/no restrictions": 3005133,
  "public/formerly confidential": 264978,
  "public/formerly privileged": 30063,
  "public/formerly privileged/formerly confidential": 669,
  "public/formerly confidential/formerly privileged": 397
}
```
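
To put the permission counts in proportion, here is a minimal sketch that computes each category's share of the total. It is not part of the original card: the counts are copied from the listing above, and the names `permission_counts` and `permission_shares` are illustrative only.

```python
# Permission counts copied from the dataset card (one entry per permission label).
permission_counts = {
    "public/no restrictions": 3005133,
    "public/formerly confidential": 264978,
    "public/formerly privileged": 30063,
    "public/formerly privileged/formerly confidential": 669,
    "public/formerly confidential/formerly privileged": 397,
}


def permission_shares(counts):
    """Return each label's share of the total file count, as a fraction in [0, 1]."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


total_files = sum(permission_counts.values())
shares = permission_shares(permission_counts)

print(f"total files: {total_files}")
# Print the labels from most to least frequent.
for label, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {share:.2%}")
```

Every label starts with `public`, which is consistent with the statement that no file is currently restricted; the "formerly" labels only record a past status.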

Provider:
maas
Created:
2024-04-09