
ferMorales/GACETA_UNAM_GTE_Qwen2

Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ferMorales/GACETA_UNAM_GTE_Qwen2
Resource description:
---
pretty_name: Gaceta UNAM Embeddings — GTE-Qwen2
language:
- es
license: other
task_categories:
- feature-extraction
- text-retrieval
tags:
- embeddings
- rag
- semantic-search
- ocr
- spanish
- unam
- gte-qwen2
---

# Gaceta UNAM Embeddings — GTE-Qwen2

Dataset of semantic embeddings for text fragments from Gaceta UNAM. Generated with the **Alibaba-NLP/gte-Qwen2** model.

## Summary

- **Embedding model:** Alibaba-NLP/gte-Qwen2
- **Records (chunks):** 207,888
- **Unique documents (source_file):** 5,628
- **Embedding dimension:** 1,536
- **Time coverage (issue_date):** 1954-08-23 to 2026-02-09
- **Invalid/empty issue_date:** 623 records

## Schema (columns)

| Column | Type | Description |
|---|---|---|
| `id` | string | Unique chunk identifier (UUID). |
| `vector` | fixed_size_list\<float32\>[1536] | Normalized semantic vector from GTE-Qwen2. |
| `payload` | string (JSON) | Full metadata and text for the chunk (see below). |

### Payload JSON fields

| Field | Type | Description |
|---|---|---|
| `doc_id` | string | Document/issue identifier. |
| `chunk_id` | string | Unique chunk identifier. |
| `chunk_index` | int | Chunk position within the document. |
| `corpus` | string | Corpus segment (e.g., gum10, gum80). |
| `decade` | string | Time grouping derived from the document. |
| `issue_date` | string | Issue date (YYYY-MM-DD; may be empty or invalid in some cases). |
| `page_start` | int/null | Chunk start page (null in this version). |
| `page_end` | int/null | Chunk end page (null in this version). |
| `source_pdf` | string | Source PDF path. |
| `source_file` | string | Source JSON file path. |
| `char_count` | int | Character count of the chunk text. |
| `total_chunks` | int | Total chunks in the parent document. |
| `section_hierarchy` | list\<string\> | Section headings for the chunk. |
| `metadata` | object | Source PDF metadata (title, date, page_count, extraction method, etc.). |
| `text` | string | Chunk text used for embedding generation. |

## Corpus distribution

| Corpus | Chunks |
|---|---|
| gum10 | 60,285 |
| gum90 | 48,962 |
| gum80 | 48,687 |
| gum00 | 27,988 |
| gum70 | 13,989 |
| gum60 | 4,695 |
| gum50 | 3,282 |

## Recommended use

This parquet is intended for:

1. **Semantic search** by vector similarity (`vector`).
2. **Historical RAG** with source traceability (`source_pdf`, `source_file`).
3. **Metadata filtering** (`issue_date`, `corpus`, `decade`).

## Notes

- The text comes from OCR/historical document processing; minor noise may remain.
- Due to scraper issues, all data for the year 2004 was lost.
- `page_start` and `page_end` are null across all records in this version.
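
The recommended uses above (vector similarity plus metadata filtering) can be combined in a few lines. The following is a minimal sketch, not the author's tooling: the helper names `parse_issue_date` and `search` are hypothetical, the toy records use 3-dimensional vectors in place of the real 1,536-dimensional ones, and loading the parquet itself is left out because the card does not state a file name. It assumes each row exposes `id`, `vector`, and a JSON-encoded `payload` as described in the schema, and it relies on the vectors being normalized, so a dot product stands in for cosine similarity.

```python
import json
from datetime import date


def parse_issue_date(value):
    """Return a date for a valid YYYY-MM-DD string, otherwise None.

    The card reports 623 records with an empty or invalid issue_date,
    so callers should expect None.
    """
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def search(records, query_vector, top_k=3, corpus=None, since=None):
    """Rank records by dot product with query_vector after metadata filtering.

    records: iterable of dicts with "id", "vector", and "payload" (JSON string).
    corpus:  optional corpus segment to keep (e.g. "gum80").
    since:   optional date; records with an earlier or unparseable
             issue_date are dropped when it is set.
    """
    hits = []
    for rec in records:
        payload = json.loads(rec["payload"])
        if corpus is not None and payload.get("corpus") != corpus:
            continue
        if since is not None:
            issued = parse_issue_date(payload.get("issue_date"))
            if issued is None or issued < since:
                continue
        # Vectors are described as normalized, so dot product ~ cosine similarity.
        score = sum(q * v for q, v in zip(query_vector, rec["vector"]))
        hits.append((score, rec["id"], payload))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:top_k]


# Toy rows standing in for real dataset rows (hypothetical values).
toy_records = [
    {"id": "a", "vector": [1.0, 0.0, 0.0],
     "payload": json.dumps({"corpus": "gum80", "issue_date": "1985-03-11",
                            "source_pdf": "toy/a.pdf", "text": "texto de ejemplo a"})},
    {"id": "b", "vector": [0.6, 0.8, 0.0],
     "payload": json.dumps({"corpus": "gum80", "issue_date": "",
                            "source_pdf": "toy/b.pdf", "text": "texto de ejemplo b"})},
    {"id": "c", "vector": [0.8, 0.6, 0.0],
     "payload": json.dumps({"corpus": "gum10", "issue_date": "2015-06-01",
                            "source_pdf": "toy/c.pdf", "text": "texto de ejemplo c"})},
    {"id": "d", "vector": [0.0, 1.0, 0.0],
     "payload": json.dumps({"corpus": "gum80", "issue_date": "1989-10-02",
                            "source_pdf": "toy/d.pdf", "text": "texto de ejemplo d"})},
]

if __name__ == "__main__":
    query = [1.0, 0.0, 0.0]
    for score, chunk_id, payload in search(toy_records, query, corpus="gum80",
                                           since=date(1980, 1, 1)):
        print(chunk_id, round(score, 3), payload["source_pdf"])
```

For the full 207,888 rows, a pure-Python loop like this would be slow; a vectorized matrix product or a vector index would be the more practical choice, with the same filtering logic applied to the decoded payloads. Real queries also need to be embedded with the same model that produced the stored vectors.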
Provider:
ferMorales