Navanjana/ARCHIVE-TEXT-URLS
Source: Hugging Face · Updated 2025-11-26 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/Navanjana/ARCHIVE-TEXT-URLS
Resource description:
---
language:
- en
license: apache-2.0
size_categories:
- 10M<n<100M
task_categories:
- text-generation
- question-answering
pretty_name: Internet Archive English Text URLs
tags:
- internet-archive
- books
- documents
- historical-texts
- ocr
---
# Internet Archive English Text URLs Dataset
<div align="center">
<img src="https://i0.wp.com/macmanx.com/wp-content/uploads/2016/01/internetarchive-1.png" alt="Internet Archive Logo" width="300"/>
</div>
## Dataset Description
This dataset contains **11,151,637** direct download URLs to OCR-processed text files from the Internet Archive's digital library. All entries are English-language texts spanning books, documents, historical records, and various other written materials.
### Dataset Summary
- **Total Rows:** 11,151,637
- **Language:** English
- **Source:** [Internet Archive](https://archive.org)
- **Format:** CSV with metadata and direct text file URLs
- **Text Format:** DjVu TXT (OCR-processed)
### Supported Tasks
- Large-scale text corpus creation
- Historical document analysis
- Language model training
- Digital humanities research
- Text mining and information retrieval
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `identifier_access` | string | Direct URL to the item's page on Archive.org |
| `text_file_url` | string | Direct download URL to the OCR text file (`.txt`) |
| `title` | string | Title of the document/book |
| `description` | string | Description or summary of the content |
### Data Example
```python
{
    "identifier_access": "https://archive.org/details/0-11_02_local_municipalities_ec102_blue_crane_route_afs_2010-11_unaudited_pdf",
    "text_file_url": "https://archive.org/download/0-11_02_local_municipalities_ec102_blue_crane_route_afs_2010-11_unaudited_pdf/0-11_02_local_municipalities_ec102_blue_crane_route_afs_2010-11_unaudited_pdf_djvu.txt",
    "title": "EC102 Blue Crane Route AFS 2010-11 Unaudited",
    "description": "/Documents/05. Annual Financial Statements/2010-11/02. Local Municipalities/EC102 Blue Crane Route/EC102 Blue Crane Route AFS 2010-11 Unaudited.pdf"
}
```
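The example row suggests that `text_file_url` can be derived from the identifier at the end of `identifier_access`. The following is a minimal sketch for checking that relationship on a row; the helper names `extract_identifier` and `row_is_consistent` are hypothetical, and the check assumes identifiers appear unencoded in both URLs, which may not hold for every row.

```python
from urllib.parse import urlparse


def extract_identifier(details_url: str) -> str:
    """Return the last path segment of an archive.org details URL."""
    path = urlparse(details_url).path  # e.g. "/details/<identifier>"
    return path.rstrip("/").split("/")[-1]


def row_is_consistent(row: dict) -> bool:
    """Check that text_file_url has the path /download/<id>/<id>_djvu.txt."""
    identifier = extract_identifier(row["identifier_access"])
    expected_path = f"/download/{identifier}/{identifier}_djvu.txt"
    return urlparse(row["text_file_url"]).path == expected_path
```

A check like this could be run on a streamed sample before starting a large download, to spot rows whose URLs do not follow the expected pattern.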
## Usage
### Loading the Dataset
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("Navanjana/ARCHIVE-TEXT-URLS")
# Load a specific split if available
dataset = load_dataset("Navanjana/ARCHIVE-TEXT-URLS", split="train")
# Stream the dataset (recommended for large datasets)
dataset = load_dataset("Navanjana/ARCHIVE-TEXT-URLS", streaming=True)
```
### Downloading Text Files
Here's a complete example to download and process the text files:
```python
from datasets import load_dataset
import requests
import os
from tqdm import tqdm
import time


def download_text_file(url, output_dir="downloaded_texts", max_retries=3):
    """
    Download a text file from the given URL with retry logic.

    Args:
        url: Direct download URL to the text file
        output_dir: Directory to save downloaded files
        max_retries: Number of retry attempts for failed downloads

    Returns:
        tuple: (success: bool, file_path: str or None, error: str or None)
    """
    os.makedirs(output_dir, exist_ok=True)

    # Extract filename from URL
    filename = url.split("/")[-1]
    file_path = os.path.join(output_dir, filename)

    # Skip if already downloaded
    if os.path.exists(file_path):
        return True, file_path, None

    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30, stream=True)
            if response.status_code == 200:
                # Write to a temporary name first so an interrupted download
                # is not mistaken for a finished file on the next run
                tmp_path = file_path + ".part"
                with open(tmp_path, 'wb') as f:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
                os.replace(tmp_path, file_path)
                return True, file_path, None
            elif response.status_code == 404:
                return False, None, "File not found (404)"
            else:
                return False, None, f"HTTP {response.status_code}"
        except requests.exceptions.RequestException as e:
            # Only network-level errors are retried
            if attempt == max_retries - 1:
                return False, None, str(e)
            time.sleep(2 ** attempt)  # Exponential backoff

    return False, None, "Max retries exceeded"


# Load dataset
print("Loading dataset...")
dataset = load_dataset("Navanjana/ARCHIVE-TEXT-URLS", split="train", streaming=True)

# Download sample of texts
successful_downloads = 0
failed_downloads = 0
sample_size = 100  # Adjust as needed

for item in tqdm(dataset.take(sample_size), total=sample_size):
    success, file_path, error = download_text_file(item['text_file_url'])

    if success:
        successful_downloads += 1
        # Read and process the text
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            text_content = f.read()
        # Process your text here
        print(f"Downloaded: {item['title'][:50]}... ({len(text_content)} chars)")
    else:
        failed_downloads += 1
        print(f"Failed to download: {item['title'][:50]}... - Error: {error}")

    # Be respectful to Archive.org servers
    time.sleep(0.5)

print("\nDownload Summary:")
print(f"Successful: {successful_downloads}")
print(f"Failed: {failed_downloads}")
```
### Batch Processing with Parallel Downloads
```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from datasets import load_dataset
import requests
from tqdm import tqdm


def download_single_text(item):
    """Download a single text file and return the content."""
    try:
        response = requests.get(item['text_file_url'], timeout=30)
        if response.status_code == 200:
            return {
                'identifier': item['identifier_access'],
                'title': item['title'],
                'text': response.text,
                'success': True
            }
    except requests.exceptions.RequestException:
        # Network errors are reported the same way as a non-200 response
        pass
    return {
        'identifier': item['identifier_access'],
        'title': item['title'],
        'text': None,
        'success': False
    }


# Load dataset
dataset = load_dataset("Navanjana/ARCHIVE-TEXT-URLS", split="train", streaming=True)

# Parallel download with thread pool
results = []
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = []
    for item in dataset.take(100):
        futures.append(executor.submit(download_single_text, item))

    for future in tqdm(as_completed(futures), total=len(futures)):
        result = future.result()
        results.append(result)
        if result['success']:
            print(f"✓ {result['title'][:50]}")

# Filter successful downloads
successful_texts = [r for r in results if r['success']]
print(f"\nSuccessfully downloaded: {len(successful_texts)}/{len(results)}")
```
## Important Notes
### Link Availability
⚠️ **Please note:** Not all URLs in this dataset are guaranteed to be accessible. Some links may return 404 errors or be temporarily unavailable due to:
- Files being removed or relocated on Archive.org
- Temporary server issues
- Items being taken down for copyright or other reasons
- OCR text files not being generated for certain items
**Expected Success Rate:** Approximately 85-95% of links should be functional, but this may vary over time.
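Because the success rate may drift over time, it can be worth measuring it on your own sample. The sketch below assumes results shaped like the `(success, file_path, error)` tuples returned by `download_text_file` in the Usage section; the function name `summarize_results` is hypothetical.

```python
from collections import Counter


def summarize_results(results):
    """Return (success_rate, outcome_counts) for (success, path, error) tuples."""
    outcomes = Counter(
        "ok" if success else (error or "unknown")
        for success, _path, error in results
    )
    total = sum(outcomes.values())
    rate = outcomes["ok"] / total if total else 0.0
    return rate, outcomes
```

Grouping failures by error string makes it easier to tell missing files (404) apart from temporary server issues, which may succeed on a later attempt.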
### Best Practices
1. **Implement retry logic** for failed downloads
2. **Add delays** between requests to respect Archive.org's servers (0.5-1 second recommended)
3. **Handle errors gracefully** and log failed URLs
4. **Check file size** before processing (some files may be very large)
5. **Respect Archive.org's terms of service** and robots.txt
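Practices 3 and 4 could be sketched as follows. This is an illustration under assumptions rather than a tested recipe: the helper names, the 50 MB default, and the CSV log layout are all hypothetical, and the size check only works when the server reports a `Content-Length` header.

```python
import csv


def declared_size(headers):
    """Return Content-Length as an int, or None if it is absent or invalid."""
    for key, value in headers.items():
        if key.lower() == "content-length":
            try:
                return int(value)
            except (TypeError, ValueError):
                return None
    return None


def small_enough(headers, max_bytes=50_000_000):
    """Accept files whose declared size is unknown or within max_bytes."""
    size = declared_size(headers)
    return size is None or size <= max_bytes


def log_failure(log_path, url, error):
    """Append one (url, error) row to a CSV log of failed downloads."""
    with open(log_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([url, error])
```

With `requests`, the headers could come from `requests.head(url, allow_redirects=True, timeout=30).headers`; `allow_redirects=True` is passed explicitly because `requests.head` does not follow redirects by default.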
### Rate Limiting
When downloading files in bulk:
- Use reasonable delays between requests (500ms - 1s)
- Consider using the `internetarchive` Python library, Archive.org's official client, for better integration
- Monitor your download rate and adjust if you receive rate limit errors
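One way to enforce a fixed delay between requests is a small throttle object that is called before each download. The class below is a minimal sketch; `Throttle` is a hypothetical name, and the `clock` and `sleep` parameters exist only so the behavior can be tested without real waiting. It is not thread-safe, so it suits the sequential loop rather than the thread-pool example.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough that calls are min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` at the top of each loop iteration sleeps only for the part of the interval not already spent downloading, unlike an unconditional `time.sleep(0.5)`.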
## Source Code
The dataset was created using the `internetarchive` Python library. The scraping code is documented and can be used to create similar datasets for other languages or media types.
### Query Used
```python
QUERY = 'mediatype:"texts" AND language:"English"'
```
## Dataset Creation
### Collection Process
1. Queried Internet Archive's search API for English text items
2. Extracted metadata and constructed direct download URLs
3. Processed items in batches with progress tracking
4. Wrote output to chunked CSV files (100,000 rows each)
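The chunked output step might look roughly like the sketch below. It is not the original scraping code: the function name `write_chunks`, the file naming scheme, and the column order are assumptions, with only the four field names and the 100,000-row chunk size taken from this card.

```python
import csv
import os
from itertools import islice

FIELDS = ["identifier_access", "text_file_url", "title", "description"]


def write_chunks(rows, output_dir, chunk_size=100_000, prefix="chunk"):
    """Write an iterable of dict rows to numbered CSV files; return the paths."""
    os.makedirs(output_dir, exist_ok=True)
    iterator = iter(rows)
    paths = []
    while True:
        # Pull at most chunk_size rows without loading the whole input
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        path = os.path.join(output_dir, f"{prefix}_{len(paths):05d}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(chunk)
        paths.append(path)
    return paths
```

Reading the input through `islice` keeps memory use bounded by one chunk, which matters when the source is a search result with millions of items.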
### Data Processing
- Text cleaning for CSV safety
- Standardized URL construction (`{identifier}_djvu.txt`)
- Metadata preservation (title, description)
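The processing steps above can be illustrated with a short sketch. The URL pattern follows the `{identifier}_djvu.txt` convention and the data example in this card; the cleaning rule (collapsing whitespace, joining list values) is an assumption about what "CSV safety" involves, and the names `clean_for_csv` and `build_row` are hypothetical.

```python
import re


def clean_for_csv(value) -> str:
    """Collapse newlines, tabs, and repeated whitespace into single spaces."""
    if value is None:
        return ""
    if isinstance(value, (list, tuple)):
        # Assumption: a metadata field may hold several values
        value = " ".join(str(v) for v in value)
    return re.sub(r"\s+", " ", str(value)).strip()


def build_row(identifier, title=None, description=None) -> dict:
    """Build one dataset row from an Archive.org item identifier."""
    return {
        "identifier_access": f"https://archive.org/details/{identifier}",
        "text_file_url": (
            f"https://archive.org/download/{identifier}/{identifier}_djvu.txt"
        ),
        "title": clean_for_csv(title),
        "description": clean_for_csv(description),
    }
```

Since the URL is constructed rather than looked up, it can point to a file that was never generated, which is one reason some links return 404.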
## Limitations
- OCR quality varies by source document
- Some historical texts may have significant OCR errors
- Text files are raw OCR output without post-processing
- Not all Archive.org items have DjVu TXT files available
- Metadata may be incomplete or inconsistent across items
## Future Plans
- [ ] Add datasets for other languages
- [ ] Include additional metadata fields (publication date, author, etc.)
- [ ] Create pre-processed, cleaned text versions
- [ ] Add file size information
- [ ] Implement automatic link validation
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{navanjana_archive_text_urls_2025,
author = {Navanjana},
title = {Internet Archive English Text URLs Dataset},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/Navanjana/ARCHIVE-TEXT-URLS}
}
```
Also consider citing the Internet Archive:
```bibtex
@misc{internet_archive,
author = {{Internet Archive}},
title = {Internet Archive},
year = {1996},
url = {https://archive.org}
}
```
## License
This dataset (the URL list and its metadata) is released under **CC0 1.0 Universal (Public Domain)**; note that the metadata header of this card declares `apache-2.0` instead. The content accessible through the URLs may carry various other licenses, and users are responsible for respecting the copyright and licensing terms of individual texts.
## Acknowledgments
- **Internet Archive** for preserving and digitizing millions of texts
- The `internetarchive` Python library maintainers
- The open-source community
## Contact
For questions, issues, or suggestions, please open an issue on the Hugging Face dataset page.
---
**Dataset Version:** 1.0
**Last Updated:** 2025
**Status:** Active
Provided by:
Navanjana



