
iscc/iscc-book-covers

Hugging Face | Updated 2026-02-13 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/iscc/iscc-book-covers
Description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- en
license: other
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- cogsci13/Amazon-Reviews-2023-Books-Meta
task_categories:
- image-feature-extraction
- zero-shot-image-classification
tags:
- iscc
- content-identification
- similarity-search
- deduplication
- image
- iso-24138
- amazon
- books
- book-covers
pretty_name: "ISCC Codes for Amazon Book Covers"
dataset_info:
  features:
  - name: image_url
    dtype: string
  - name: iscc
    dtype: string
  - name: iscc_meta
    dtype: string
  - name: iscc_semantic
    dtype: string
  - name: iscc_content
    dtype: string
  - name: iscc_data
    dtype: string
  - name: iscc_instance
    dtype: string
  - name: source_row_id
    dtype: string
  - name: title
    dtype: string
  - name: isbn
    dtype: string
  - name: publisher
    dtype: string
---

# ISCC Codes for Amazon Book Covers

Amazon book covers enriched with full 256-bit ISCC (International Standard Content Code) identifiers for cover image identification, similarity search, and deduplication research. The `image_url` field links to cover images on Amazon CDN for preview.

## What is ISCC?

The **International Standard Content Code** ([ISO 24138:2024](https://www.iso.org/standard/77899.html)) is a content-derived identifier for digital media assets.
Unlike traditional identifiers that are assigned arbitrarily, ISCC codes are generated algorithmically from the content itself, enabling:

- **Content Identification**: Identify content regardless of format or location
- **Similarity Search**: Find visually or semantically similar images
- **Deduplication**: Detect exact and near-duplicate content
- **Provenance Tracking**: Link derived works to their sources

## ISCC Units

Each record contains five 256-bit ISCC-UNITs that capture different aspects of the content:

| Unit | Field | Description |
|------|-------|-------------|
| **Meta-Code** | `iscc_meta` | Similarity based on embedded metadata (filename, title) |
| **Semantic-Code** | `iscc_semantic` | AI-based visual semantic similarity (what the image depicts) |
| **Content-Code** | `iscc_content` | Perceptual image similarity (visual appearance) |
| **Data-Code** | `iscc_data` | Raw binary data similarity (file structure) |
| **Instance-Code** | `iscc_instance` | Cryptographic hash for exact matching (like SHA-256) |

The `iscc` field contains the composite ISCC-CODE combining all units.

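
Because the similarity-oriented units are fixed-length bit strings, two units of the same type can be compared by counting the bits in which they differ (Hamming distance). The following is a minimal sketch of that idea, assuming raw 32-byte digests rather than real ISCC strings, which carry a header and text encoding that this sketch does not decode. The helper names `hamming_distance` and `similarity` and the digest values are hypothetical and chosen for illustration only.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("digests must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def similarity(distance: int, n_bits: int = 256) -> float:
    """Map a Hamming distance to a 0..1 similarity score."""
    return 1 - distance / n_bits


# Made-up 256-bit (32-byte) digests, NOT real ISCC units
digest_a = bytes(32)                      # all bits zero
digest_b = bytes([0xFF] * 4) + bytes(28)  # first 32 bits flipped
digest_c = bytes([0xFF] * 32)             # every bit flipped

d_ab = hamming_distance(digest_a, digest_b)
d_ac = hamming_distance(digest_a, digest_c)
print(d_ab, similarity(d_ab))  # 32 0.875
print(d_ac, similarity(d_ac))  # 256 0.0
```

For actual ISCC strings from this dataset, a library that understands the encoding (such as `iscc_core`) should be used instead of this sketch. The Instance-Code is described as a cryptographic hash for exact matching, so a bitwise distance is presumably not meaningful for it.
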

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `image_url` | string | Cover image URL on Amazon CDN |
| `iscc` | string | Full composite ISCC-CODE |
| `iscc_meta` | string | 256-bit Meta-Code |
| `iscc_semantic` | string | 256-bit Semantic-Code |
| `iscc_content` | string | 256-bit Content-Code |
| `iscc_data` | string | 256-bit Data-Code |
| `iscc_instance` | string | 256-bit Instance-Code |
| `source_row_id` | string | Original row identifier (`parent_asin`) in source dataset |
| `title` | string | Book title |
| `isbn` | string | ISBN-13 (preferred) or ISBN-10 |
| `publisher` | string | Publisher name and edition info |

### Data Splits

| Split | Samples |
|-------|---------|
| train | 3,079,720 |

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

ds = load_dataset("iscc/iscc-book-covers")
```

### Viewing a Sample

```python
sample = ds["train"][0]
print(f"ISCC: {sample['iscc']}")
print(f"Title: {sample['title']}")
```

### Similarity Search Example

```python
import iscc_core as ic

# Get two ISCC codes to compare
code1 = ds["train"][0]["iscc_content"]
code2 = ds["train"][1]["iscc_content"]

# Calculate hamming distance (0 = identical, 256 = maximally different)
distance = ic.iscc_distance(code1, code2)
print(f"Hamming distance: {distance}")

# Convert to similarity percentage
similarity = 1 - (distance / 256)
print(f"Similarity: {similarity:.1%}")
```

### Finding Near-Duplicates

```python
import iscc_core as ic

# Threshold for near-duplicates (adjust based on use case)
THRESHOLD = 32  # ~87.5% similarity

reference = ds["train"][0]["iscc_content"]
for i, row in enumerate(ds["train"]):
    distance = ic.iscc_distance(reference, row["iscc_content"])
    if distance <= THRESHOLD and i > 0:
        print(f"Near-duplicate found: row {i}, distance={distance}")
```

### Semantic Similarity Search

```python
import iscc_core as ic

# Find semantically similar images (same subject/concept)
reference = ds["train"][0]["iscc_semantic"]
similar = []
for i, row in enumerate(ds["train"]):
    distance = ic.iscc_distance(reference, row["iscc_semantic"])
    if distance <= 64:  # ~75% semantic similarity
        similar.append((i, distance))

# Sort by similarity
for idx, dist in sorted(similar, key=lambda x: x[1])[:5]:
    print(f"Row {idx} (distance={dist})")
```

## Source Data

This dataset was derived from [cogsci13/Amazon-Reviews-2023-Books-Meta](https://huggingface.co/datasets/cogsci13/Amazon-Reviews-2023-Books-Meta).

### Source Data

Book metadata from the Amazon Reviews 2023 dataset by McAuley Lab. Cover images hosted by Amazon CDN. This derivative dataset contains ISCC codes and references to the original images, not the images themselves.

**License**: Research use only. Refer to original dataset terms.

### Processing

ISCC codes were generated using:

- [iscc-sdk](https://github.com/iscc/iscc-sdk) - High-level ISCC generation
- [iscc-sci](https://github.com/iscc/iscc-sci) - Semantic image codes (experimental)

All processing was performed on original resolution images downloaded from Amazon CDN.

## Considerations

### Intended Use

- Content identification and matching research
- Image similarity search algorithm development
- Deduplication system benchmarking
- Visual-semantic retrieval experiments
- ISCC-based indexing research

### Limitations

- Semantic codes are generated using experimental AI models and may not capture all semantic nuances
- ISCC codes are sensitive to significant image modifications (heavy cropping, overlays, filters)
- Image URLs point to Amazon CDN and may become unavailable over time

### Privacy

This dataset contains ISCC codes and image URL references derived from the source dataset. Refer to the original dataset documentation for privacy considerations.

## Citation

If you use this dataset, please cite both this dataset and the original source:

**This Dataset:**

```bibtex
@dataset{iscc_book_covers,
  title = {{ISCC Codes for Amazon Book Covers}},
  author = {{ISCC Foundation}},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/iscc/iscc-book-covers}
}
```

**Amazon Reviews 2023:**

```bibtex
@article{hou2024bridging,
  title = {{Bridging Language and Items for Retrieval and Recommendation}},
  author = {Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal = {arXiv preprint arXiv:2403.03952},
  year = {2024}
}
```

**ISCC Standard:**

```bibtex
@misc{iso24138,
  title = {{ISO 24138:2024 Information and documentation -- International Standard Content Code (ISCC)}},
  author = {{International Organization for Standardization}},
  year = {2024},
  url = {https://www.iso.org/standard/77899.html}
}
```

## Additional Resources

- [ISCC Foundation](https://iscc.io/) - Standards organization
- [ISCC Documentation](https://sdk.iscc.codes/) - Technical documentation
- [ISO 24138:2024](https://www.iso.org/standard/77899.html) - Official standard
- [iscc-sdk](https://github.com/iscc/iscc-sdk) - Python SDK for ISCC generation
- [Amazon Reviews 2023](https://amazon-reviews-2023.github.io/)

## Contact

- **Dataset Issues**: [iscc-datasets GitHub](https://github.com/iscc/iscc-datasets/issues)
- **ISCC Questions**: [ISCC Foundation](https://iscc.io/)
Provider:
iscc