AIStudioDelta/Eurovoc_2025_by_language
Source: Hugging Face · Updated: 2025-11-29 · Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/AIStudioDelta/Eurovoc_2025_by_language
Resource description:
---
dataset_info:
- config_name: bul
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 6898206932
num_examples: 190367
download_size: 2371513792
dataset_size: 6898206932
- config_name: ces
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 4588315774
num_examples: 212129
download_size: 1925541050
dataset_size: 4588315774
- config_name: dan
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 5383280436
num_examples: 295701
download_size: 2194966358
dataset_size: 5383280436
- config_name: deu
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 6153224483
num_examples: 308464
download_size: 2541800545
dataset_size: 6153224483
- config_name: ell
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 9823795253
num_examples: 296951
download_size: 3468274174
dataset_size: 9823795253
- config_name: eng
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 12458032118
num_examples: 377608
download_size: 5435494205
dataset_size: 12458032118
- config_name: est
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 4334038948
num_examples: 211756
download_size: 1805498864
dataset_size: 4334038948
- config_name: fin
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 5608940400
num_examples: 288769
download_size: 2339141410
dataset_size: 5608940400
- config_name: fra
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 6452281556
num_examples: 311113
download_size: 2641887672
dataset_size: 6452281556
- config_name: gle
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 1256818166
num_examples: 45341
download_size: 472088828
dataset_size: 1256818166
- config_name: hrv
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 3125213385
num_examples: 134022
download_size: 1273260934
dataset_size: 3125213385
- config_name: hun
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 4976083975
num_examples: 212288
download_size: 2011001418
dataset_size: 4976083975
- config_name: isl
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 5563023
num_examples: 177
download_size: 2730625
dataset_size: 5563023
- config_name: ita
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 5806900890
num_examples: 302970
download_size: 2384345164
dataset_size: 5806900890
- config_name: lav
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 4538030179
num_examples: 211727
download_size: 1875739508
dataset_size: 4538030179
- config_name: lit
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 4539835551
num_examples: 211895
download_size: 1890675662
dataset_size: 4539835551
- config_name: mkd
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 10551797
num_examples: 119
download_size: 4135205
dataset_size: 10551797
- config_name: mlt
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 4611405914
num_examples: 194929
download_size: 1850688551
dataset_size: 4611405914
- config_name: nld
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 5736073166
num_examples: 298149
download_size: 2314354830
dataset_size: 5736073166
- config_name: nor
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 5441215
num_examples: 194
download_size: 2597322
dataset_size: 5441215
- config_name: pol
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 4842534965
num_examples: 213997
download_size: 2004013837
dataset_size: 4842534965
- config_name: por
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 5795692364
num_examples: 296786
download_size: 2349457669
dataset_size: 5795692364
- config_name: ron
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 4637245880
num_examples: 191099
download_size: 1840068383
dataset_size: 4637245880
- config_name: slk
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 4602939833
num_examples: 212385
download_size: 1951251981
dataset_size: 4602939833
- config_name: slv
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 4313046380
num_examples: 212002
download_size: 1829427152
dataset_size: 4313046380
- config_name: spa
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 5937017189
num_examples: 299217
download_size: 2406747494
dataset_size: 5937017189
- config_name: srp
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 21942336
num_examples: 152
download_size: 9071054
dataset_size: 21942336
- config_name: swe
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 5458706751
num_examples: 288845
download_size: 2214795939
dataset_size: 5458706751
- config_name: tur
features:
- name: title
dtype: string
- name: date
dtype: timestamp[s]
- name: eurovoc_concepts
list: string
- name: eurovoc_concepts_ids
list: string
- name: url
dtype: string
- name: lang
dtype: string
- name: formats
list: string
- name: text
dtype: string
- name: subset
dtype: string
splits:
- name: train
num_bytes: 23930479
num_examples: 255
download_size: 10826806
dataset_size: 23930479
configs:
- config_name: bul
data_files:
- split: train
path: bul/train-*
- config_name: ces
data_files:
- split: train
path: ces/train-*
- config_name: dan
data_files:
- split: train
path: dan/train-*
- config_name: deu
data_files:
- split: train
path: deu/train-*
- config_name: ell
data_files:
- split: train
path: ell/train-*
- config_name: eng
data_files:
- split: train
path: eng/train-*
- config_name: est
data_files:
- split: train
path: est/train-*
- config_name: fin
data_files:
- split: train
path: fin/train-*
- config_name: fra
data_files:
- split: train
path: fra/train-*
- config_name: gle
data_files:
- split: train
path: gle/train-*
- config_name: hrv
data_files:
- split: train
path: hrv/train-*
- config_name: hun
data_files:
- split: train
path: hun/train-*
- config_name: isl
data_files:
- split: train
path: isl/train-*
- config_name: ita
data_files:
- split: train
path: ita/train-*
- config_name: lav
data_files:
- split: train
path: lav/train-*
- config_name: lit
data_files:
- split: train
path: lit/train-*
- config_name: mkd
data_files:
- split: train
path: mkd/train-*
- config_name: mlt
data_files:
- split: train
path: mlt/train-*
- config_name: nld
data_files:
- split: train
path: nld/train-*
- config_name: nor
data_files:
- split: train
path: nor/train-*
- config_name: pol
data_files:
- split: train
path: pol/train-*
- config_name: por
data_files:
- split: train
path: por/train-*
- config_name: ron
data_files:
- split: train
path: ron/train-*
- config_name: slk
data_files:
- split: train
path: slk/train-*
- config_name: slv
data_files:
- split: train
path: slv/train-*
- config_name: spa
data_files:
- split: train
path: spa/train-*
- config_name: srp
data_files:
- split: train
path: srp/train-*
- config_name: swe
data_files:
- split: train
path: swe/train-*
- config_name: tur
data_files:
- split: train
path: tur/train-*
license: eupl-1.2
task_categories:
- text-generation
language:
- bg
- cs
- da
- de
- el
- en
- et
- fi
- fr
- ga
- hr
- hu
- is
- it
- lv
- lt
- mk
- mt
- nl
- 'no'
- pl
- pt
- ro
- sk
- sl
- es
- sr
- sv
- tr
pretty_name: Eurovoc 2025 by language
---
# 🇪🇺 🏷️ EuroVoc dataset (by language)
This is the [EuropeanParliament/Eurovoc_2025](https://huggingface.co/datasets/EuropeanParliament/Eurovoc_2025) dataset, split by language rather than by period.
The original is split into periods (`1996-03` through `2025-11`), each mixing documents in different languages.
For ease of training, this dataset splits the data by language instead, combining documents from all periods. Each language is a separate config named by its three-letter code (for example `eng` or `fra`), and a `subset` column records the period each document came from.
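As a usage illustration, a single language can be loaded by passing its config name to `load_dataset`. The sketch below is a minimal example and not part of the original card: the helper name `load_language` and the choice of streaming mode are assumptions, while the repository ID and config names are taken from the card metadata.

```python
# Minimal sketch (not from the original card): load one language config.
# `load_language` is a hypothetical helper; the repository ID and config
# names are the ones listed in the card metadata.
REPO_ID = "AIStudioDelta/Eurovoc_2025_by_language"

CONFIG_NAMES = (
    "bul", "ces", "dan", "deu", "ell", "eng", "est", "fin", "fra", "gle",
    "hrv", "hun", "isl", "ita", "lav", "lit", "mkd", "mlt", "nld", "nor",
    "pol", "por", "ron", "slk", "slv", "spa", "srp", "swe", "tur",
)


def load_language(config_name, streaming=True):
    """Return the train split of one language config.

    With streaming=True (an assumption made here to avoid downloading a
    multi-gigabyte config up front) the result is an iterable dataset.
    """
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown config {config_name!r}")
    # Imported lazily so the check above works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)
```

For instance, `next(iter(load_language("isl")))` should yield one record with the fields listed in the metadata (`title`, `date`, `eurovoc_concepts`, `eurovoc_concepts_ids`, `url`, `lang`, `formats`, `text`, `subset`). The `isl` config holds only 177 documents, which makes it convenient for a quick check.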
## License
This dataset is redistributed under the original [European Union Public License 1.2](https://eupl.eu/1.2/en/). When using this version of the dataset, please give appropriate credit to the original dataset.
## Code
The dataset was created with the following script:
```python
from datasets import concatenate_datasets, get_dataset_config_names, load_dataset
from tqdm import tqdm


def main(dataset_name, languages, output_dataset, private=True):
    # Each config of the source dataset is one period, e.g. '1996-03'.
    periods = sorted(get_dataset_config_names(dataset_name))
    for language in languages:
        print(language)
        subsets = []
        for period in tqdm(periods):
            dataset = load_dataset(dataset_name, period, split='train')
            # Keep only the rows whose `lang` column matches the current language.
            dataset = dataset.filter(lambda lang: lang == language, input_columns=['lang'])
            if len(dataset) > 0:
                # Record the source period so it survives the merge.
                dataset = dataset.add_column('subset', [period] * len(dataset))
                subsets.append(dataset)
        # Merge all periods for this language and upload them as one config.
        lang_dataset = concatenate_datasets(subsets)
        lang_dataset.push_to_hub(output_dataset, config_name=language, private=private)


if __name__ == '__main__':
    main(dataset_name='EuropeanParliament/Eurovoc_2025',
         languages=['bul', 'ces', 'dan', 'deu', 'ell', 'eng', 'est', 'fin', 'fra', 'gle',
                    'hrv', 'hun', 'isl', 'ita', 'lav', 'lit', 'mkd', 'mlt', 'nld', 'nor',
                    'pol', 'por', 'ron', 'slk', 'slv', 'spa', 'srp', 'swe', 'tur'],
         output_dataset='AIStudioDelta/Eurovoc_2025_by_language',
         private=True)
```
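Because the script stores each document's source period in the `subset` column, the original per-period grouping can be recovered from a language config. The following is a minimal sketch under that assumption; `count_by_period` and `count_by_year` are hypothetical helper names, and the rows are toy records rather than real data. It accepts any iterable of dict-like rows, which is how a loaded `datasets` split yields its records.

```python
from collections import Counter


def count_by_period(rows):
    """Count documents per original period, read from the `subset` column."""
    return Counter(row["subset"] for row in rows)


def count_by_year(rows):
    """Aggregate counts per year, assuming period names look like 'YYYY-MM'."""
    return Counter(row["subset"].split("-")[0] for row in rows)


# Toy rows for illustration only; real rows also carry title, text, etc.
toy_rows = [
    {"lang": "eng", "subset": "1996-03"},
    {"lang": "eng", "subset": "1996-03"},
    {"lang": "eng", "subset": "2025-11"},
]

per_period = count_by_period(toy_rows)
per_year = count_by_year(toy_rows)
print(per_period["1996-03"], per_period["2025-11"])  # 2 1
print(sorted(per_year.items()))  # [('1996', 2), ('2025', 1)]
```

Both helpers make a single pass over the rows, so they can also be applied to a streamed split without loading a whole config into memory.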
Provider:
AIStudioDelta



