
hunterbown/bell-labs-technical-archive

Source: Hugging Face. Updated 2026-04-05. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/hunterbown/bell-labs-technical-archive
Resource description:
---
language:
- en
pretty_name: Bell Labs Documents and Stuff
license: other
multilinguality: monolingual
language_creators:
- expert-generated
annotations_creators:
- machine-generated
task_categories:
- text-generation
size_categories:
- 1K<n<10K
tags:
- datasets
- bell-labs
- telecommunications
- history-of-technology
- archive.org
- patents
- ocr
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.jsonl
  - split: validation
    path: data/validation.jsonl
  - split: test
    path: data/test.jsonl
---

# Bell Labs Documents and Stuff

This is a conservative public-release subset of the internal BELLA continued-pretraining corpus. It keeps the Bell-system technical material that survived a stricter final pass for public dataset hosting and removes records that still looked risky, off-scope, or too low-signal for a Hugging Face corpus listing.

## What is in the release

| Split | Documents |
|---|---:|
| `train` | 1220 |
| `validation` | 29 |
| `test` | 42 |

The release contains **1291 documents** out of **1530** internally curated pretraining documents.

Document types:

- `journal_issue`: 875
- `technical_report`: 370
- `patent`: 23
- `book`: 12
- `manual`: 11

Source families:

- `archive_org`: 1265 documents
- `google_patents`: 23 documents
- `other`: 3 documents

Selected extraction backends among kept documents, when recoverable from the current local catalog:

- `unknown`: 1242
- `pdftotext`: 48
- `archive_text`: 1

Character counts after cleanup:

- min: `2514`
- median: `114491`
- max: `2734561`

## Where the data came from

Most documents were pulled from public Archive.org item pages that host Bell System Technical Journal issues, Bell Laboratories Record issues, Bell System Practices, Bell System / Western Electric technical manuals, and Bell-system-adjacent engineering reports. Patent records in this subset were pulled from public Google Patents pages.

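
The source-family totals listed under "What is in the release" can in principle be recomputed from the split files. The following is a minimal sketch, assuming each line of a split file is one JSON object that carries the `source_family` field named in the card. The helper name `tally_field`, the `text` key in the sample rows, and the `"missing"` bucket are illustrative choices, not part of the release.

```python
import json
from collections import Counter
from typing import Iterable


def tally_field(lines: Iterable[str], field: str) -> Counter:
    """Count the values of one field across JSONL lines, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        # Rows that lack the field are grouped under "missing" rather than dropped.
        counts[row.get(field, "missing")] += 1
    return counts


# Tiny in-memory sample standing in for a split file such as data/train.jsonl.
# The "text" key is a placeholder; only "source_family" is named in the card.
sample = [
    '{"source_family": "archive_org", "text": "..."}',
    '{"source_family": "archive_org", "text": "..."}',
    '{"source_family": "google_patents", "text": "..."}',
]
sample_counts = tally_field(sample, "source_family")
print(sample_counts.most_common())
```

With the release downloaded, the same function could be run over each of `data/train.jsonl`, `data/validation.jsonl`, and `data/test.jsonl` (opened with `encoding="utf-8"`) and the three counters added together for comparison against the totals stated in the card.
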
Every released row includes per-document provenance fields such as `source_url`, `archive_ref`, and `source_family` so downstream users can trace each text file back to the public item page that it came from. A `selected_extraction_backend` field is also present, but many older rows remain `unknown` because parts of the internal corpus were built before backend tracking was recorded uniformly in SQLite.

## How the corpus was built

1. Source records were discovered and imported into the local `bella.db` catalog.
2. For Archive.org-backed items, the pipeline downloaded the preferred PDF and any usable Archive.org text derivative.
3. Text extraction ran quality-first rather than single-backend-first:
   - `pdftotext` first when a PDF existed
   - Archive.org DjVu text as an alternate derivative when available
   - optional `Qianfan OCR` fallback only when `pdftotext` looked weak or mixed
4. Page-level heuristics removed obvious junk such as library stamps, scan boilerplate, table-of-contents pages, index pages, references, HTTP/header dumps, OCR markup artifacts, and pages with too little body text.
5. Document-level cleanup trimmed leading frontmatter such as Google/JSTOR boilerplate, issue mastheads, and leading OCR noise.
6. This public release applied one more conservative pass to exclude records with explicit restriction language, trade-secret notices, off-scope government/legal material, table-of-contents or index-only records, and very short bodies.

## Final public-release exclusions

The final pass removed **239 documents**.

Exclusion counts by reason:

- `index_or_ordering_title`: 69
- `short_body_under_min_chars`: 55
- `government_archive_title`: 39
- `restricted_reproduction_notice`: 35
- `personal_noncommercial_notice`: 22
- `trade_secret_notice`: 20
- `post_1990_non_patent`: 10
- `table_of_contents_title`: 9
- `consumer_magazine_title`: 8
- `all_rights_reserved_notice`: 5
- `offscope_misc_title`: 3
- `oral_history_title`: 2
- `legal_case_title`: 1

The full exclusion log is in [meta/excluded_records.jsonl](meta/excluded_records.jsonl).

## OCR and extraction notes

This is not a hand-transcribed corpus. It is a cleaned OCR/text-extraction corpus. Some documents were born-digital or extracted cleanly with `pdftotext`; others depend on OCR or Archive.org text derivatives. The text is usable for corpus work, but it is not guaranteed to be page-faithful, typo-free, or complete.

Important limitations:

- OCR noise still exists in places, especially in older scans and diagram-heavy technical material.
- The corpus is Bell-focused, not a complete Bell Labs bibliography.
- The release process is conservative, but it is not legal advice.
- Some metadata fields were inferred or normalized during ingestion and cleanup.

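
The quality-first extraction order described under "How the corpus was built" can be sketched as a small selection function. This is an illustration under stated assumptions, not the pipeline's actual code: the `looks_weak` heuristic, its thresholds, the `"ocr"` label, and the function names are hypothetical, and the OCR step is passed in as a plain callable rather than modelling any real OCR API. The card does not say how the Archive.org derivative is ranked against the OCR fallback, so this sketch tries the derivative first, and it assumes that the `archive_text` backend label corresponds to that derivative.

```python
from typing import Callable, Optional, Tuple


def looks_weak(text: Optional[str], min_chars: int = 500, min_ratio: float = 0.6) -> bool:
    """Hypothetical quality check: too short, or too few letters and spaces."""
    if not text or len(text) < min_chars:
        return True
    readable = sum(ch.isalpha() or ch.isspace() for ch in text)
    return readable / len(text) < min_ratio


def select_text(
    pdf_text: Optional[str],
    derivative_text: Optional[str],
    ocr: Optional[Callable[[], str]] = None,
) -> Tuple[str, str]:
    """Return (backend_label, text), preferring the cheapest acceptable source."""
    # 1. Prefer pdftotext output when it exists and passes the quality check.
    if pdf_text is not None and not looks_weak(pdf_text):
        return "pdftotext", pdf_text
    # 2. Otherwise try the Archive.org text derivative.
    if derivative_text is not None and not looks_weak(derivative_text):
        return "archive_text", derivative_text
    # 3. OCR is optional and only runs when the cheaper options looked weak.
    if ocr is not None:
        return "ocr", ocr()
    # 4. Nothing passed the check: keep the longest candidate, if there is one.
    candidates = [("pdftotext", pdf_text or ""), ("archive_text", derivative_text or "")]
    label, text = max(candidates, key=lambda item: len(item[1]))
    return (label, text) if text else ("unknown", "")


# Synthetic inputs: one readable passage and one run of OCR-like debris.
clean = "Transmission loss in loaded cable circuits. " * 20
noisy = "~|_. ;: 0 1 ^^ " * 40

print(select_text(clean, None)[0])
print(select_text(noisy, clean)[0])
print(select_text(noisy, None, ocr=lambda: "recognized text")[0])
```

Passing the OCR step as a callable keeps the expensive path lazy: it only runs when both cheaper sources fail the check, which mirrors the "fallback only when `pdftotext` looked weak or mixed" wording in the card.
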
## Intended use

This subset is appropriate for:

- continued pretraining or domain adaptation experiments
- retrieval, search, and corpus analysis over Bell-system technical writing
- historical telecom and computing research where OCR noise is acceptable

This subset is not appropriate for:

- licensing-sensitive redistribution without your own review of the source items
- claims of perfect OCR fidelity
- high-stakes factual applications without source verification

## Files

- `data/train.jsonl`, `data/validation.jsonl`, `data/test.jsonl`: the Hugging Face-ready data splits
- `meta/release_manifest.json`: build summary, counts, and checksums
- `meta/excluded_records.jsonl`: records removed by the public-release filter
- `CHECKSUMS.sha256`: file hashes for the whole release directory

## Method provenance

This package was generated from:

- internal curated input: `data/release/bella_v1/pretrain.jsonl`
- build script: `scripts/build_hf_corpus_release.py`
- local SQLite catalog: `data/bella.db`
- release directory: `data/release/bell_labs_documents_and_stuff`

## Citation

If you use the corpus, cite the dataset repo plus the original source repositories named in each row's provenance fields.
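
To follow the citation guidance, the distinct source pages behind a set of rows can be gathered from the provenance fields. The following is a minimal sketch, assuming each row is a JSON object carrying `source_family` and `source_url` as the card describes. The helper name `sources_by_family`, the `"unknown"` bucket, and the `example.org` URLs are placeholders for illustration only.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def sources_by_family(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group distinct source_url values by source_family, sorted for stable output."""
    grouped = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        url = row.get("source_url")
        if url:
            # Rows without a family are bucketed as "unknown" (an illustrative choice).
            grouped[row.get("source_family", "unknown")].add(url)
    return {family: sorted(urls) for family, urls in sorted(grouped.items())}


# Placeholder rows; real rows would come from the split files.
rows = [
    '{"source_family": "archive_org", "source_url": "https://example.org/item-b"}',
    '{"source_family": "archive_org", "source_url": "https://example.org/item-a"}',
    '{"source_family": "archive_org", "source_url": "https://example.org/item-a"}',
    '{"source_family": "google_patents", "source_url": "https://example.org/patent-1"}',
]
citations = sources_by_family(rows)
for family, urls in citations.items():
    print(family, len(urls))
```

Using a set per family removes repeated URLs, so a source page that contributed several rows is listed once.
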
Provider:
hunterbown