hunterbown/bell-labs-technical-archive
Source: Hugging Face. Updated 2026-04-05. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/hunterbown/bell-labs-technical-archive
Resource description:
---
language:
- en
pretty_name: Bell Labs Documents and Stuff
license: other
multilinguality: monolingual
language_creators:
- expert-generated
annotations_creators:
- machine-generated
task_categories:
- text-generation
size_categories:
- 1K<n<10K
tags:
- datasets
- bell-labs
- telecommunications
- history-of-technology
- archive.org
- patents
- ocr
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.jsonl
  - split: validation
    path: data/validation.jsonl
  - split: test
    path: data/test.jsonl
---
# Bell Labs Documents and Stuff
This is a conservative public-release subset of the internal BELLA continued-pretraining corpus. It keeps the Bell-system technical material that survived a stricter final pass for public dataset hosting and removes records that still looked risky, off-scope, or too low-signal for a Hugging Face corpus listing.
## What is in the release
| Split | Documents |
|---|---:|
| `train` | 1220 |
| `validation` | 29 |
| `test` | 42 |
The release contains **1291 documents** out of **1530** internally curated pretraining documents.
Document types:
- `journal_issue`: 875
- `technical_report`: 370
- `patent`: 23
- `book`: 12
- `manual`: 11
Source families:
- `archive_org`: 1265 documents
- `google_patents`: 23 documents
- `other`: 3 documents
Selected extraction backends among kept documents, when recoverable from the current local catalog:
- `unknown`: 1242
- `pdftotext`: 48
- `archive_text`: 1
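As a quick consistency check, the breakdowns above can be summed against the stated release size. The snippet below is a minimal sketch that copies the counts from this card by hand; the variable names are illustrative and it does not read the release files.

```python
# Counts copied by hand from this card; nothing here reads the dataset itself.
RELEASED = 1291
CURATED = 1530

breakdowns = {
    "splits": {"train": 1220, "validation": 29, "test": 42},
    "document_types": {
        "journal_issue": 875,
        "technical_report": 370,
        "patent": 23,
        "book": 12,
        "manual": 11,
    },
    "source_families": {"archive_org": 1265, "google_patents": 23, "other": 3},
    "extraction_backends": {"unknown": 1242, "pdftotext": 48, "archive_text": 1},
}

# Every breakdown should partition the same set of released documents.
for name, counts in breakdowns.items():
    total = sum(counts.values())
    assert total == RELEASED, f"{name} sums to {total}, expected {RELEASED}"

kept_fraction = RELEASED / CURATED
removed = CURATED - RELEASED
print(f"kept {RELEASED} of {CURATED} ({kept_fraction:.1%}), removed {removed}")
```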
Character counts after cleanup:
- min: `2514`
- median: `114491`
- max: `2734561`
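The character statistics above could be recomputed from a split file along the following lines. This is a sketch only: it assumes each JSONL row stores its body under a `text` key, which this card does not confirm, so the field name is a parameter.

```python
import json
import statistics
from typing import Dict, Iterable


def char_stats(lines: Iterable[str], text_field: str = "text") -> Dict[str, float]:
    """Return min, median, and max body length over JSONL lines.

    `text_field` is an assumed column name; adjust it to the real schema.
    """
    lengths = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        row = json.loads(line)
        lengths.append(len(row[text_field]))
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


# Tiny in-memory sample standing in for a split file.
sample = [json.dumps({"text": "a" * n}) for n in (3, 10, 5)]
stats = char_stats(sample)
print(stats)
```

For a real split, pass an open file handle instead of the sample list, for example `char_stats(open("data/train.jsonl", encoding="utf-8"))`.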
## Where the data came from
Most documents were pulled from public Archive.org item pages that host Bell System Technical Journal issues, Bell Laboratories Record issues, Bell System Practices, Bell System / Western Electric technical manuals, and Bell-system-adjacent engineering reports. Patent records in this subset were pulled from public Google Patents pages.
Every released row includes per-document provenance fields such as `source_url`, `archive_ref`, and `source_family` so downstream users can trace each document back to the public item page it came from. A `selected_extraction_backend` field is also present, but many older rows remain `unknown` because parts of the internal corpus were built before backend tracking was recorded uniformly in the SQLite catalog.
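A minimal sketch of how the provenance fields might be summarized is shown below. The field names `source_family` and `selected_extraction_backend` come from this card; the sample rows and their URLs are placeholders made up for illustration, not real records.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def summarize_provenance(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count rows per source family and per recorded extraction backend."""
    families: Counter = Counter()
    backends: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        # Rows lacking a field are counted under "missing" rather than dropped.
        families[row.get("source_family", "missing")] += 1
        backends[row.get("selected_extraction_backend", "missing")] += 1
    return families, backends


# Placeholder rows; the URLs are not real item pages.
sample = [
    json.dumps({"source_url": "https://example.org/item-1",
                "source_family": "archive_org",
                "selected_extraction_backend": "unknown"}),
    json.dumps({"source_url": "https://example.org/item-2",
                "source_family": "archive_org",
                "selected_extraction_backend": "pdftotext"}),
    json.dumps({"source_url": "https://example.org/item-3",
                "source_family": "google_patents",
                "selected_extraction_backend": "unknown"}),
]
families, backends = summarize_provenance(sample)
print(families.most_common())
print(backends.most_common())
```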
## How the corpus was built
1. Source records were discovered and imported into the local `bella.db` catalog.
2. For Archive.org-backed items, the pipeline downloaded the preferred PDF and any usable Archive.org text derivative.
3. Text extraction ran quality-first rather than single-backend-first:
   - `pdftotext` first when a PDF existed
   - Archive.org DjVu text as an alternate derivative when available
   - optional `Qianfan OCR` fallback only when `pdftotext` looked weak or mixed
4. Page-level heuristics removed obvious junk such as library stamps, scan boilerplate, table-of-contents pages, index pages, references, HTTP/header dumps, OCR markup artifacts, and pages with too little body text.
5. Document-level cleanup trimmed leading frontmatter such as Google/JSTOR boilerplate, issue mastheads, and leading OCR noise.
6. This public release applied one more conservative pass to exclude records with explicit restriction language, trade-secret notices, off-scope government/legal material, table-of-contents or index-only records, and very short bodies.
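The quality-first selection in step 3 can be illustrated with a small sketch. It is not the code in `scripts/build_hf_corpus_release.py`: the `text_quality` score, the `weak_below` threshold, and the `"ocr"` label are all hypothetical stand-ins for whatever the real pipeline uses to decide that `pdftotext` output "looked weak or mixed".

```python
from typing import Callable, Dict, Optional, Tuple


def text_quality(text: str) -> float:
    """Hypothetical score: share of characters that are letters or whitespace."""
    if not text:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in text)
    return good / len(text)


def select_backend(
    pdf_text: Optional[str],
    archive_text: Optional[str],
    ocr: Optional[Callable[[], str]] = None,
    weak_below: float = 0.7,  # assumed threshold, not taken from the source
) -> Optional[Tuple[str, str]]:
    """Pick the best-scoring candidate, running OCR only as a fallback."""
    candidates: Dict[str, str] = {}
    if pdf_text:
        candidates["pdftotext"] = pdf_text
    if archive_text:
        candidates["archive_text"] = archive_text
    # OCR is the expensive path: call it only when pdftotext output exists
    # but scores below the threshold.
    if ocr is not None and pdf_text and text_quality(pdf_text) < weak_below:
        candidates["ocr"] = ocr()
    if not candidates:
        return None
    # max() keeps the first best entry, so pdftotext wins ties.
    best = max(candidates, key=lambda name: text_quality(candidates[name]))
    return best, candidates[best]


clean = "The relay closes the circuit."
noisy = "|||| 0x3f ~~ ## 12 %%"
print(select_backend(clean, noisy))
print(select_backend(noisy, clean))
```

Scoring every available candidate and keeping the best one is what distinguishes this from a single-backend-first design, where the first backend that returns any text would win.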
## Final public-release exclusions
The final pass removed **239 documents**. Exclusion counts by reason (the per-reason counts sum to 278, so some records presumably matched more than one reason):
- `index_or_ordering_title`: 69
- `short_body_under_min_chars`: 55
- `government_archive_title`: 39
- `restricted_reproduction_notice`: 35
- `personal_noncommercial_notice`: 22
- `trade_secret_notice`: 20
- `post_1990_non_patent`: 10
- `table_of_contents_title`: 9
- `consumer_magazine_title`: 8
- `all_rights_reserved_notice`: 5
- `offscope_misc_title`: 3
- `oral_history_title`: 2
- `legal_case_title`: 1
The full exclusion log is in [meta/excluded_records.jsonl](meta/excluded_records.jsonl).
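A sketch of tallying the exclusion log follows. It assumes each record in `meta/excluded_records.jsonl` carries a `reasons` field holding either a list of reason codes or a single string; that field name is hypothetical, so check the actual log before relying on it. Records and reasons are counted separately because the two totals differ whenever a record carries several reasons.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def tally_exclusions(
    lines: Iterable[str], reasons_field: str = "reasons"
) -> Tuple[int, Counter]:
    """Return (number of records, per-reason counts) for an exclusion log.

    `reasons_field` is an assumed key name, not confirmed by the dataset card.
    """
    record_count = 0
    per_reason: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        record_count += 1
        found = record.get(reasons_field, [])
        if isinstance(found, str):
            found = [found]  # accept a single reason stored as a string
        per_reason.update(found)
    return record_count, per_reason


# Made-up sample: the first record matches two reasons at once.
sample = [
    json.dumps({"reasons": ["trade_secret_notice", "restricted_reproduction_notice"]}),
    json.dumps({"reasons": "short_body_under_min_chars"}),
]
records, reason_counts = tally_exclusions(sample)
print(records, sum(reason_counts.values()))
```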
## OCR and extraction notes
This is not a hand-transcribed corpus. It is a cleaned OCR/text-extraction corpus. Some documents were born-digital or extracted cleanly with `pdftotext`; others depend on OCR or Archive.org text derivatives. The text is usable for corpus work, but it is not guaranteed to be page-faithful, typo-free, or complete.
Important limitations:
- OCR noise still exists in places, especially in older scans and diagram-heavy technical material.
- The corpus is Bell-focused, not a complete Bell Labs bibliography.
- The release process is conservative, but it is not legal advice.
- Some metadata fields were inferred or normalized during ingestion and cleanup.
## Intended use
This subset is appropriate for:
- continued pretraining or domain adaptation experiments
- retrieval, search, and corpus analysis over Bell-system technical writing
- historical telecom and computing research where OCR noise is acceptable
This subset is not appropriate for:
- licensing-sensitive redistribution without your own review of the source items
- claims of perfect OCR fidelity
- high-stakes factual applications without source verification
## Files
- `data/train.jsonl`, `data/validation.jsonl`, `data/test.jsonl`: the Hugging Face-ready data splits
- `meta/release_manifest.json`: build summary, counts, and checksums
- `meta/excluded_records.jsonl`: records removed by the public-release filter
- `CHECKSUMS.sha256`: file hashes for the whole release directory
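After downloading, the release files can be checked against `CHECKSUMS.sha256`. The sketch below assumes the file follows the common `sha256sum` layout (one `<hex digest>  <relative path>` entry per line); this card does not state the layout, so treat that as an assumption.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large JSONL splits do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(release_dir: Path, manifest_name: str = "CHECKSUMS.sha256") -> List[str]:
    """Return the relative paths whose hash is wrong or whose file is missing.

    Assumes sha256sum-style lines: "<hex digest>  <relative path>".
    """
    failures = []
    manifest = (release_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        # sha256sum marks binary mode with a leading "*" before the path.
        name = name.strip().lstrip("*")
        target = release_dir / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

Calling `verify_checksums(Path("path/to/release"))` returns an empty list when every listed file matches.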
## Method provenance
This package was generated from:
- internal curated input: `data/release/bella_v1/pretrain.jsonl`
- build script: `scripts/build_hf_corpus_release.py`
- local SQLite catalog: `data/bella.db`
- release directory: `data/release/bell_labs_documents_and_stuff`
## Citation
If you use the corpus, cite the dataset repo plus the original source repositories named in each row's provenance fields.
Provider:
hunterbown



