
Navanjana/ARCHIVE-TEXT-URLS

Source: Hugging Face. Updated 2025-11-26. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Navanjana/ARCHIVE-TEXT-URLS
Resource description:
---
language:
- en
license: apache-2.0
size_categories:
- 10M<n<100M
task_categories:
- text-generation
- question-answering
pretty_name: Internet Archive English Text URLs
tags:
- internet-archive
- books
- documents
- historical-texts
- ocr
---

# Internet Archive English Text URLs Dataset

<div align="center">
<img src="https://i0.wp.com/macmanx.com/wp-content/uploads/2016/01/internetarchive-1.png" alt="Internet Archive Logo" width="300"/>
</div>

## Dataset Description

This dataset contains **11,151,637** direct download URLs to OCR-processed text files from the Internet Archive's digital library. All entries are English-language texts spanning books, documents, historical records, and various other written materials.

### Dataset Summary

- **Total Rows:** 11,151,637
- **Language:** English
- **Source:** [Internet Archive](https://archive.org)
- **Format:** CSV with metadata and direct text file URLs
- **Text Format:** DjVu TXT (OCR-processed)

### Supported Tasks

- Large-scale text corpus creation
- Historical document analysis
- Language model training
- Digital humanities research
- Text mining and information retrieval

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `identifier_access` | string | Direct URL to the item's page on Archive.org |
| `text_file_url` | string | Direct download URL to the OCR text file (`.txt`) |
| `title` | string | Title of the document/book |
| `description` | string | Description or summary of the content |

### Data Example

```python
{
    "identifier_access": "https://archive.org/details/0-11_02_local_municipalities_ec102_blue_crane_route_afs_2010-11_unaudited_pdf",
    "text_file_url": "https://archive.org/download/0-11_02_local_municipalities_ec102_blue_crane_route_afs_2010-11_unaudited_pdf/0-11_02_local_municipalities_ec102_blue_crane_route_afs_2010-11_unaudited_pdf_djvu.txt",
    "title": "EC102 Blue Crane Route AFS 2010-11 Unaudited",
    "description": "/Documents/05. Annual Financial Statements/2010-11/02. Local Municipalities/EC102 Blue Crane Route/EC102 Blue Crane Route AFS 2010-11 Unaudited.pdf"
}
```

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("Navanjana/ARCHIVE-TEXT-URLS")

# Load a specific split if available
dataset = load_dataset("Navanjana/ARCHIVE-TEXT-URLS", split="train")

# Stream the dataset (recommended for large datasets)
dataset = load_dataset("Navanjana/ARCHIVE-TEXT-URLS", streaming=True)
```

### Downloading Text Files

Here is a complete example that downloads and processes the text files:

```python
from datasets import load_dataset
import requests
import os
from tqdm import tqdm
import time


def download_text_file(url, output_dir="downloaded_texts", max_retries=3):
    """
    Download a text file from the given URL with retry logic.

    Args:
        url: Direct download URL to the text file
        output_dir: Directory to save downloaded files
        max_retries: Number of retry attempts for failed downloads

    Returns:
        tuple: (success: bool, file_path: str or None, error: str or None)
    """
    os.makedirs(output_dir, exist_ok=True)

    # Extract filename from URL
    filename = url.split("/")[-1]
    file_path = os.path.join(output_dir, filename)

    # Skip if already downloaded
    if os.path.exists(file_path):
        return True, file_path, None

    # Write to a temporary name first so that an interrupted download
    # is not mistaken for a finished file on the next run
    partial_path = file_path + ".part"

    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30, stream=True)
            if response.status_code == 200:
                with open(partial_path, 'wb') as f:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
                os.replace(partial_path, file_path)
                return True, file_path, None
            elif response.status_code == 404:
                return False, None, "File not found (404)"
            else:
                return False, None, f"HTTP {response.status_code}"
        except requests.exceptions.RequestException as e:
            # Only network-level errors are retried
            if attempt == max_retries - 1:
                return False, None, str(e)
            time.sleep(2 ** attempt)  # Exponential backoff

    return False, None, "Max retries exceeded"


# Load dataset
print("Loading dataset...")
dataset = load_dataset("Navanjana/ARCHIVE-TEXT-URLS", split="train", streaming=True)

# Download sample of texts
successful_downloads = 0
failed_downloads = 0
sample_size = 100  # Adjust as needed

for item in tqdm(dataset.take(sample_size), total=sample_size):
    success, file_path, error = download_text_file(item['text_file_url'])

    if success:
        successful_downloads += 1
        # Read and process the text
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            text_content = f.read()
        # Process your text here
        print(f"Downloaded: {item['title'][:50]}... ({len(text_content)} chars)")
    else:
        failed_downloads += 1
        print(f"Failed to download: {item['title'][:50]}... - Error: {error}")

    # Be respectful to Archive.org servers
    time.sleep(0.5)

print("\nDownload Summary:")
print(f"Successful: {successful_downloads}")
print(f"Failed: {failed_downloads}")
```

### Batch Processing with Parallel Downloads

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from datasets import load_dataset
import requests
from tqdm import tqdm


def download_single_text(item):
    """Download a single text file and return the content."""
    try:
        response = requests.get(item['text_file_url'], timeout=30)
        if response.status_code == 200:
            return {
                'identifier': item['identifier_access'],
                'title': item['title'],
                'text': response.text,
                'success': True
            }
    except requests.exceptions.RequestException:
        # Fall through to the failure result below
        pass

    return {
        'identifier': item['identifier_access'],
        'title': item['title'],
        'text': None,
        'success': False
    }


# Load dataset
dataset = load_dataset("Navanjana/ARCHIVE-TEXT-URLS", split="train", streaming=True)

# Parallel download with thread pool
results = []
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = []
    for item in dataset.take(100):
        futures.append(executor.submit(download_single_text, item))

    for future in tqdm(as_completed(futures), total=len(futures)):
        result = future.result()
        results.append(result)
        if result['success']:
            print(f"✓ {result['title'][:50]}")

# Filter successful downloads
successful_texts = [r for r in results if r['success']]
print(f"\nSuccessfully downloaded: {len(successful_texts)}/{len(results)}")
```

## Important Notes

### Link Availability

⚠️ **Please note:** Not all URLs in this dataset are guaranteed to be accessible. Some links may return 404 errors or be temporarily unavailable due to:

- Files being removed or relocated on Archive.org
- Temporary server issues
- Items being taken down for copyright or other reasons
- OCR text files not being generated for certain items

**Expected Success Rate:** Approximately 85-95% of links should be functional, but this may vary over time.

### Best Practices

1. **Implement retry logic** for failed downloads
2. **Add delays** between requests to respect Archive.org's servers (0.5-1 second recommended)
3. **Handle errors gracefully** and log failed URLs
4. **Check file size** before processing (some files may be very large)
5. **Respect Archive.org's terms of service** and robots.txt

### Rate Limiting

When downloading files in bulk:

- Use reasonable delays between requests (500ms - 1s)
- Consider using Archive.org's official Python library for better integration
- Monitor your download rate and adjust if you receive rate limit errors

## Source Code

The dataset was created using the `internetarchive` Python library. The scraping code is documented and can be used to create similar datasets for other languages or media types.

### Query Used

```python
QUERY = 'mediatype:"texts" AND language:"English"'
```

## Dataset Creation

### Collection Process

1. Queried Internet Archive's search API for English text items
2. Extracted metadata and constructed direct download URLs
3. Processed items in batches with progress tracking
4. Wrote output to chunked CSV files (100,000 rows each)

### Data Processing

- Text cleaning for CSV safety
- Standardized URL construction (`{identifier}_djvu.txt`)
- Metadata preservation (title, description)

## Limitations

- OCR quality varies by source document
- Some historical texts may have significant OCR errors
- Text files are raw OCR output without post-processing
- Not all Archive.org items have DjVu TXT files available
- Metadata may be incomplete or inconsistent across items

## Future Plans

- [ ] Add datasets for other languages
- [ ] Include additional metadata fields (publication date, author, etc.)
- [ ] Create pre-processed, cleaned text versions
- [ ] Add file size information
- [ ] Implement automatic link validation

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{navanjana_archive_text_urls_2025,
  author = {Navanjana},
  title = {Internet Archive English Text URLs Dataset},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/Navanjana/ARCHIVE-TEXT-URLS}
}
```

Also consider citing the Internet Archive:

```bibtex
@misc{internet_archive,
  author = {{Internet Archive}},
  title = {Internet Archive},
  year = {1996},
  url = {https://archive.org}
}
```

## License

This dataset is released under **CC0 1.0 Universal (Public Domain)**. However, the actual content accessible through the URLs may have various licenses. Users are responsible for respecting the copyright and licensing terms of individual texts.

## Acknowledgments

- **Internet Archive** for preserving and digitizing millions of texts
- The `internetarchive` Python library maintainers
- The open-source community

## Contact

For questions, issues, or suggestions, please open an issue on the Hugging Face dataset page.

---

**Dataset Version:** 1.0
**Last Updated:** 2025
**Status:** Active
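
The URL convention mentioned under Data Processing (`{identifier}_djvu.txt`) can be illustrated with a small sketch. It is inferred from the single record shown in the Data Example and is not the dataset author's actual code; the helper names `identifier_from_details_url` and `djvu_text_url` are hypothetical. As the Limitations section notes, a URL built this way is not guaranteed to resolve for every item.

```python
from urllib.parse import urlparse

ARCHIVE_BASE = "https://archive.org"


def identifier_from_details_url(details_url: str) -> str:
    """Extract the item identifier from an archive.org/details/<identifier> URL."""
    parts = [p for p in urlparse(details_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "details":
        raise ValueError(f"not an archive.org details URL: {details_url!r}")
    return parts[1]


def djvu_text_url(identifier: str) -> str:
    """Build the OCR text URL using the {identifier}_djvu.txt pattern (assumed from the example record)."""
    return f"{ARCHIVE_BASE}/download/{identifier}/{identifier}_djvu.txt"


# Rebuild text_file_url from identifier_access for the example record
details = (
    "https://archive.org/details/"
    "0-11_02_local_municipalities_ec102_blue_crane_route_afs_2010-11_unaudited_pdf"
)
print(djvu_text_url(identifier_from_details_url(details)))
```

A helper like this can be handy for checking that a row's `text_file_url` is consistent with its `identifier_access` before attempting a download.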
Provider:
Navanjana