
Bluefir/hetus-time-use

Source: Hugging Face. Updated 2026-04-07. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Bluefir/hetus-time-use
Description:
---
license: cc-by-4.0
pretty_name: HETUS Harmonized Time Use (Eurostat TSV + ESMS metadata)
task_categories:
- tabular-classification
- text-classification
language:
- en
dataset_info:
- config_name: files
  features:
  - name: wave
    dtype: string
  - name: dataset_id
    dtype: string
  - name: source_file
    dtype: string
  - name: metadata_file
    dtype: string
  - name: rows_wide
    dtype: string
  - name: rows_long
    dtype: string
  - name: rows_expected_long
    dtype: string
  - name: lossless_validated
    dtype: string
  - name: dimensions
    dtype: string
  - name: time_periods_raw
    dtype: string
  - name: time_periods
    dtype: string
  splits:
  - name: train
    num_bytes: 8186
    num_examples: 39
  download_size: 7294
  dataset_size: 8186
- config_name: metadata
  features:
  - name: wave
    dtype: string
  - name: metadata_file
    dtype: string
  - name: prepared_at
    dtype: string
  - name: metadata_set_id
    dtype: string
  - name: attribute_id
    dtype: string
  - name: attribute_path
    dtype: string
  - name: value_html
    dtype: string
  - name: value_text
    dtype: string
  splits:
  - name: train
    num_bytes: 94561
    num_examples: 150
  download_size: 34933
  dataset_size: 94561
- config_name: observations
  features:
  - name: dataset_id
    dtype: string
  - name: wave
    dtype: string
  - name: source_file
    dtype: string
  - name: metadata_file
    dtype: string
  - name: source_row_index
    dtype: string
  - name: source_key_raw
    dtype: string
  - name: dimension_order_raw
    dtype: string
  - name: dimension_order
    dtype: string
  - name: time_period_raw
    dtype: string
  - name: time_period
    dtype: string
  - name: time_period_year
    dtype: string
  - name: observation_cell_raw
    dtype: string
  - name: observation_raw
    dtype: string
  - name: observation
    dtype: string
  - name: observation_value
    dtype: string
  - name: duration_minutes
    dtype: string
  - name: status_flag
    dtype: string
  - name: is_missing
    dtype: string
  - name: acl00
    dtype: string
  - name: acl18
    dtype: string
  - name: age
    dtype: string
  - name: daysweek
    dtype: string
  - name: freq
    dtype: string
  - name: geo
    dtype: string
  - name: hhcomp
    dtype: string
  - name: hhstatus
    dtype: string
  - name: isced11
    dtype: string
  - name: isced97
    dtype: string
  - name: month
    dtype: string
  - name: sex
    dtype: string
  - name: startime
    dtype: string
  - name: tra_mode
    dtype: string
  - name: unit
    dtype: string
  - name: wstatus
    dtype: string
  splits:
  - name: train
    num_bytes: 636505250
    num_examples: 1680915
  download_size: 25062037
  dataset_size: 636505250
configs:
- config_name: files
  data_files:
  - split: train
    path: files/train-*
- config_name: metadata
  data_files:
  - split: train
    path: metadata/train-*
- config_name: observations
  data_files:
  - split: train
    path: observations/train-*
tags:
- sociology
- time-use
- europe
- EU
size_categories:
- 1M<n<10M
---

# Dataset Card: HETUS Time Use (Converted from Eurostat TSV)

## Dataset Summary

This dataset contains normalized tabular observations converted from Eurostat HETUS TSV files, plus flattened SDMX ESMS metadata. The conversion is performed by `db-hf-normalization.py` and preserves raw TSV tokens to prevent information loss during normalization.

## Source Data

- Original source: Eurostat HETUS database exports (TSV + SDMX XML metadata)
- Local source folder: `hetus/`
- Waves currently included: `2000_2010`, `2020`

## Conversion Method

The converter:

1. Reads TSV files as tab-separated raw text (no NA auto-coercion).
2. Expands packed dimension keys from the first column.
3. Melts year/time columns into long format.
4. Preserves exact raw source tokens for auditability:
   - `source_key_raw`
   - `time_period_raw`
   - `observation_cell_raw`
5. Creates normalized fields for analysis:
   - `time_period`, `time_period_year`
   - `observation`, `observation_value`, `duration_minutes`, `status_flag`
6. Validates strict lossless mapping per file (1:1 source cell preservation).

## Splits

### `observations`

Long-format observations.
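The conversion steps listed under Conversion Method (reading raw text, expanding the packed key, melting time columns, keeping raw tokens) can be sketched roughly as below. This is not the code of `db-hf-normalization.py`, which is not shown in this card. The helper names `parse_cell` and `melt_wide_tsv`, the sample data, the header shape `dim1,dim2\TIME_PERIOD`, the `hh:mm` duration tokens, the space-separated status flag, and the bare `:` missing marker are all assumptions made for illustration.

```python
import csv
import io
import re

HHMM = re.compile(r"^(\d+):(\d{2})$")  # assumed shape of hh:mm duration tokens


def parse_cell(cell_raw):
    """Split a raw cell into value, flag, and derived fields (assumed token shapes)."""
    tokens = cell_raw.split()
    value = tokens[0] if tokens else ""
    flag = " ".join(tokens[1:])
    # Assumption: a bare ":" (or an empty cell) marks a missing observation.
    missing = value in ("", ":")
    minutes = None
    match = HHMM.match(value)
    if match:
        minutes = int(match.group(1)) * 60 + int(match.group(2))
    return {
        "observation": "" if missing else value,
        "status_flag": flag,
        "duration_minutes": minutes,
        "is_missing": missing,
    }


def melt_wide_tsv(text):
    """Yield one long-format record per (source row, time column) cell."""
    # QUOTE_NONE keeps every cell as raw text, with no quoting or NA coercion.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # Assumption: the first header cell looks like "dim1,dim2\TIME_PERIOD".
    dims = header[0].split("\\")[0].split(",")
    time_columns = header[1:]
    for row_index, row in enumerate(reader):
        key_raw = row[0]
        dim_values = dict(zip(dims, (v.strip() for v in key_raw.split(","))))
        for time_raw, cell_raw in zip(time_columns, row[1:]):
            record = {
                "source_row_index": row_index,
                "source_key_raw": key_raw,          # raw packed key, untouched
                "time_period_raw": time_raw,        # raw header token, untouched
                "observation_cell_raw": cell_raw,   # raw cell, untouched
                "time_period": time_raw.strip(),
            }
            record.update(dim_values)
            record.update(parse_cell(cell_raw))
            yield record


# Hypothetical two-row sample, not taken from the real HETUS files.
sample = (
    "sex,geo\\TIME_PERIOD\t2000 \t2010 \n"
    "F,BE\t2:17 \t: \n"
    "M,BE\t1:05 b\t0:58 \n"
)
rows = list(melt_wide_tsv(sample))
```

Each wide row produces one record per time column, so the two sample rows become four records. The raw fields keep trailing whitespace exactly as read, while the normalized fields are derived next to them.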
Core provenance fields:

- `dataset_id`
- `wave`
- `source_file`
- `metadata_file`
- `source_row_index`
- `source_key_raw`
- `time_period_raw`
- `observation_cell_raw`

Normalized fields:

- `time_period`
- `time_period_year`
- `observation_raw`
- `observation`
- `observation_value`
- `duration_minutes`
- `status_flag`
- `is_missing`

Dimension fields (vary by source table, unioned into one schema):

- Examples: `freq`, `unit`, `sex`, `age`, `geo`, `acl00`, `acl18`, ...

### `files`

Per-source-file conversion and validation report:

- `rows_wide`
- `rows_long`
- `rows_expected_long`
- `lossless_validated`
- dimensions/time-period summaries

### `metadata`

Flattened SDMX ESMS metadata attributes with:

- attribute id/path
- HTML value (`value_html`)
- plain-text value (`value_text`)

## Data Loss / Integrity Statement

For each TSV file, conversion enforces:

- Exact row cardinality: `rows_long == rows_wide * number_of_time_period_columns`
- Exact cell-level preservation of raw source tuples: `(source_row_index, source_key_raw, time_period_raw, observation_cell_raw)`

If any mismatch is detected, conversion fails for that file.

## Known Caveats

- The first TSV column stores packed dimensions separated by commas. The converter keeps both:
  - raw packed key (`source_key_raw`)
  - expanded normalized dimensions
- Whitespace around values may be normalized in analytical columns, but the raw fields remain preserved.

## Intended Uses

- Comparative time-use analysis across countries, years, demographic strata, and activity codes.
- Reproducible ETL pipelines where raw-source lineage is required.

## Out-of-Scope Uses

- Any individual-level inference (dataset is aggregate tabular statistics).

## Citation

Please cite Eurostat HETUS and include repository/version information for this converted dataset.

## Dataset Version

- Converter script: `db-hf-normalization.py`
- Conversion date: {{YYYY-MM-DD}}
- Repository commit: {{GIT_COMMIT_SHA}}
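The two rules in the Data Loss / Integrity Statement (row cardinality and cell-level tuple preservation) could be checked along the following lines. This is a minimal sketch, not the converter's actual validation code. The function name `check_lossless`, the in-memory list layout, and the sample rows are hypothetical.

```python
from collections import Counter


def check_lossless(wide_rows, time_columns, long_rows):
    """Return True when long_rows preserves every wide cell exactly once."""
    # Cardinality rule from the card: rows_long == rows_wide * number of time columns.
    if len(long_rows) != len(wide_rows) * len(time_columns):
        return False
    # Tuples expected from the wide table: (row index, packed key, time token, cell).
    expected = Counter(
        (index, row[0], time_raw, cell_raw)
        for index, row in enumerate(wide_rows)
        for time_raw, cell_raw in zip(time_columns, row[1:])
    )
    # Tuples actually present in the long table, using the card's field names.
    actual = Counter(
        (
            r["source_row_index"],
            r["source_key_raw"],
            r["time_period_raw"],
            r["observation_cell_raw"],
        )
        for r in long_rows
    )
    return expected == actual


# Hypothetical sample: two wide rows, two time columns, raw whitespace kept.
time_columns = ["2000 ", "2010 "]
wide_rows = [["F,BE", "2:17 ", ": "], ["M,BE", "1:05 b", "0:58 "]]
long_rows = [
    {
        "source_row_index": i,
        "source_key_raw": row[0],
        "time_period_raw": t,
        "observation_cell_raw": c,
    }
    for i, row in enumerate(wide_rows)
    for t, c in zip(time_columns, row[1:])
]

# A copy where one raw cell lost its trailing space should fail the check.
tampered = [dict(r) for r in long_rows]
tampered[0]["observation_cell_raw"] = "2:17"

ok = check_lossless(wide_rows, time_columns, long_rows)
broken = check_lossless(wide_rows, time_columns, tampered)
```

Comparing multisets of tuples catches duplicated or altered cells that a row count alone would miss. In the published `observations` config every column has dtype `string`, so `source_row_index` would need casting to `int` (or the expected tuples built with `str(index)`) before running a comparison like this on the released data.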
Provider:
Bluefir