
AIStudioDelta/Eurovoc_2025_by_language

Hugging Face · Updated 2025-11-29 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/AIStudioDelta/Eurovoc_2025_by_language
Resource description:
---
license: eupl-1.2
task_categories:
- text-generation
language:
- bg
- cs
- da
- de
- el
- en
- et
- fi
- fr
- ga
- hr
- hu
- is
- it
- lv
- lt
- mk
- mt
- nl
- 'no'
- pl
- pt
- ro
- sk
- sl
- es
- sr
- sv
- tr
pretty_name: Eurovoc 2025 by language
---

# 🇪🇺 🏷️ EuroVoc dataset (by language)

This is the [EuropeanParliament/Eurovoc_2025](https://huggingface.co/datasets/EuropeanParliament/Eurovoc_2025) dataset, split by language rather than by period.

The original is organised into periods (`1996-03` through `2025-11`), each of which mixes documents in different languages. For ease of training, this dataset provides one config per language instead, with the documents from all periods combined.

## Dataset structure

The card metadata (`dataset_info` and `configs`) lists 29 configs, one per language. Each config has a single `train` split stored under `<config>/train-*`, and all configs share the same features:

| Feature | Type |
|---|---|
| `title` | string |
| `date` | timestamp[s] |
| `eurovoc_concepts` | list of string |
| `eurovoc_concepts_ids` | list of string |
| `url` | string |
| `lang` | string |
| `formats` | list of string |
| `text` | string |
| `subset` | string (the period of the original dataset the document came from) |

Sizes are in bytes. For every config, `dataset_size` equals the `num_bytes` of the `train` split.

| Config | `num_examples` | `num_bytes` / `dataset_size` | `download_size` |
|---|---:|---:|---:|
| `bul` | 190367 | 6898206932 | 2371513792 |
| `ces` | 212129 | 4588315774 | 1925541050 |
| `dan` | 295701 | 5383280436 | 2194966358 |
| `deu` | 308464 | 6153224483 | 2541800545 |
| `ell` | 296951 | 9823795253 | 3468274174 |
| `eng` | 377608 | 12458032118 | 5435494205 |
| `est` | 211756 | 4334038948 | 1805498864 |
| `fin` | 288769 | 5608940400 | 2339141410 |
| `fra` | 311113 | 6452281556 | 2641887672 |
| `gle` | 45341 | 1256818166 | 472088828 |
| `hrv` | 134022 | 3125213385 | 1273260934 |
| `hun` | 212288 | 4976083975 | 2011001418 |
| `isl` | 177 | 5563023 | 2730625 |
| `ita` | 302970 | 5806900890 | 2384345164 |
| `lav` | 211727 | 4538030179 | 1875739508 |
| `lit` | 211895 | 4539835551 | 1890675662 |
| `mkd` | 119 | 10551797 | 4135205 |
| `mlt` | 194929 | 4611405914 | 1850688551 |
| `nld` | 298149 | 5736073166 | 2314354830 |
| `nor` | 194 | 5441215 | 2597322 |
| `pol` | 213997 | 4842534965 | 2004013837 |
| `por` | 296786 | 5795692364 | 2349457669 |
| `ron` | 191099 | 4637245880 | 1840068383 |
| `slk` | 212385 | 4602939833 | 1951251981 |
| `slv` | 212002 | 4313046380 | 1829427152 |
| `spa` | 299217 | 5937017189 | 2406747494 |
| `srp` | 152 | 21942336 | 9071054 |
| `swe` | 288845 | 5458706751 | 2214795939 |
| `tur` | 255 | 23930479 | 10826806 |

## License

This dataset is redistributed under the original [European Union Public License 1.2](https://eupl.eu/1.2/en/). When using this version of the dataset, please give appropriate credit to the original dataset.

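Because every language is its own config, a single language can be loaded without touching the others. The following is a minimal sketch, not part of the original card: it assumes the repository is publicly readable and that the `datasets` library is installed. The helper names `ISO1_TO_CONFIG`, `resolve_config` and `load_language` are hypothetical. The code mapping pairs the two-letter codes from the card's `language` metadata with the three-letter config names.

```python
# Hypothetical helper (not from the dataset card): maps the two-letter codes in
# the card's `language` list to the three-letter config names of this dataset.
ISO1_TO_CONFIG = {
    "bg": "bul", "cs": "ces", "da": "dan", "de": "deu", "el": "ell",
    "en": "eng", "et": "est", "fi": "fin", "fr": "fra", "ga": "gle",
    "hr": "hrv", "hu": "hun", "is": "isl", "it": "ita", "lv": "lav",
    "lt": "lit", "mk": "mkd", "mt": "mlt", "nl": "nld", "no": "nor",
    "pl": "pol", "pt": "por", "ro": "ron", "sk": "slk", "sl": "slv",
    "es": "spa", "sr": "srp", "sv": "swe", "tr": "tur",
}
CONFIGS = set(ISO1_TO_CONFIG.values())


def resolve_config(code):
    """Return the config name for a two- or three-letter language code."""
    code = code.lower()
    if code in CONFIGS:
        return code
    if code in ISO1_TO_CONFIG:
        return ISO1_TO_CONFIG[code]
    raise ValueError(f"no config for language code {code!r}")


def load_language(code, streaming=True):
    """Load the `train` split of one language config.

    Streaming is the default here (an assumption about typical use), since
    the larger configs are several gigabytes to download.
    """
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "AIStudioDelta/Eurovoc_2025_by_language",
        resolve_config(code),
        split="train",
        streaming=streaming,
    )
```

For example, `next(iter(load_language("en")))` would yield the first English record as a dictionary with the fields listed in the card metadata, provided the network and repository are reachable.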
## Code

The dataset was created with the following script:

```python
from datasets import concatenate_datasets, get_dataset_config_names, load_dataset
from tqdm import tqdm


def main(dataset_name, languages, output_dataset, private=True):
    # Each config of the original dataset is one period (e.g. '1996-03').
    periods = sorted(get_dataset_config_names(dataset_name))
    for language in languages:
        print(language)
        subsets = []
        for period in tqdm(periods):
            dataset = load_dataset(dataset_name, period, split='train')
            # Keep only the documents written in the current language.
            dataset = dataset.filter(lambda lang: lang == language, input_columns=['lang'])
            if len(dataset) > 0:
                # Record the source period in a new 'subset' column.
                dataset = dataset.add_column('subset', [period] * len(dataset))
                subsets.append(dataset)
        # Merge all periods and upload them as one config named after the language.
        lang_dataset = concatenate_datasets(subsets)
        lang_dataset.push_to_hub(output_dataset, config_name=language, private=private)


if __name__ == '__main__':
    main(dataset_name='EuropeanParliament/Eurovoc_2025',
         languages=['bul', 'ces', 'dan', 'deu', 'ell', 'eng', 'est', 'fin', 'fra', 'gle',
                    'hrv', 'hun', 'isl', 'ita', 'lav', 'lit', 'mkd', 'mlt', 'nld', 'nor',
                    'pol', 'por', 'ron', 'slk', 'slv', 'spa', 'srp', 'swe', 'tur'],
         output_dataset='AIStudioDelta/Eurovoc_2025_by_language',
         private=True)
```
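
The regrouping performed by the script can be illustrated without any downloads. The sketch below is an assumption-laden toy version in plain Python: it uses lists of dictionaries in place of `datasets.Dataset` objects, the function name `regroup_by_language` and the sample records are made up for illustration, and it makes a single pass over all periods rather than one pass per language as the script does. The result has the same shape: one collection per language, ordered by period, with each record tagged by its source period in `subset`.

```python
from collections import defaultdict


def regroup_by_language(periods):
    """Regroup records from per-period lists into per-language lists.

    `periods` maps a period name (e.g. '1996-03') to a list of record
    dictionaries, each with at least a 'lang' key.
    """
    by_language = defaultdict(list)
    # Visit periods in sorted order, as the script does.
    for period in sorted(periods):
        for record in periods[period]:
            # Copy the record and remember which period it came from.
            by_language[record["lang"]].append({**record, "subset": period})
    return dict(by_language)


# Made-up sample data: two periods with mixed languages.
toy_periods = {
    "2025-11": [{"title": "C", "lang": "eng"}],
    "1996-03": [{"title": "A", "lang": "eng"}, {"title": "B", "lang": "fra"}],
}
result = regroup_by_language(toy_periods)
print(sorted(result))        # language keys found in the toy data
print(len(result["eng"]))    # number of English records across both periods
```

Sorting the period names as strings works here because they follow the `YYYY-MM` pattern, so lexicographic order matches chronological order.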
Provider:
AIStudioDelta