CATIE-AQ/frenchKEYWORDS
Hugging Face, updated 2025-12-01, indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/CATIE-AQ/frenchKEYWORDS
Resource description:
---
language:
- fr
license: cc-by-4.0
size_categories:
- 10K<n<100K
task_categories:
- token-classification
- text-generation
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: text
    dtype: string
  - name: key_words
    dtype: string
  - name: dataset
    dtype: string
  splits:
  - name: train
    num_bytes: 26762234
    num_examples: 12005
  - name: validation
    num_bytes: 3630782
    num_examples: 1488
  - name: test
    num_bytes: 7310223
    num_examples: 2981
  download_size: 20586631
  dataset_size: 37703239
---
# Dataset information
**A concatenation of open-source French keyword extraction datasets.**
There are **16,474** rows in total: 12,005 for training, 1,488 for validation and 2,981 for testing.
# Usage
```
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub and load all three splits.
dataset = load_dataset("CATIE-AQ/frenchKEYWORDS")
print(dataset)
```
```
DatasetDict({
    train: Dataset({
        features: ['text', 'key_words', 'dataset'],
        num_rows: 12005
    })
    validation: Dataset({
        features: ['text', 'key_words', 'dataset'],
        num_rows: 1488
    })
    test: Dataset({
        features: ['text', 'key_words', 'dataset'],
        num_rows: 2981
    })
})
```
# Dataset
## Details of rows
| Original dataset | Splits | Note |
| ----------- | ----------- | ----------- |
| [taln-ls2n/wikinews-fr-100](https://huggingface.co/datasets/taln-ls2n/wikinews-fr-100)| 100 train | Only the present (P) and reordered (R) keywords from the original dataset are kept. |
| [taln-ls2n/termith-eval](https://huggingface.co/datasets/taln-ls2n/termith-eval)| 399 train | Only the present (P) and reordered (R) keywords from the original dataset are kept. |
| [taln-ls2n/taln-archives](https://huggingface.co/datasets/taln-ls2n/taln-archives)| 1,207 train | Only the present (P) and reordered (R) keywords from the original dataset are kept. |
| [papyrus](https://github.com/smolPixel/French-keyphrase-generation/tree/main/data/papyrus_f)| 10,299 train / 1,488 validation / 2,981 test | Only the French split of papyrus is kept. |
## Removing duplicate data and leaks
Concatenating the four datasets listed above introduces neither duplicate rows nor leaks between splits.
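One way to check a claim like this is to compare normalized texts within and across splits. The following is only a minimal sketch, not the procedure used to build the dataset: the helper names `normalize`, `find_overlap` and `find_duplicates`, the normalization rules, and the toy strings are all assumptions made for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def find_overlap(reference_texts, candidate_texts):
    """Return the normalized texts that occur in both collections (potential leaks)."""
    seen = {normalize(t) for t in reference_texts}
    return {normalize(t) for t in candidate_texts} & seen


def find_duplicates(texts):
    """Return the normalized texts that occur more than once in one collection."""
    seen, duplicates = set(), set()
    for t in texts:
        key = normalize(t)
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


# Toy data (hypothetical, not taken from the dataset).
train = ["Un premier document.", "Un  second document."]
test = ["un second document.", "Un troisième document."]

leaks = find_overlap(train, test)
duplicates = find_duplicates(train)
print(leaks)
print(duplicates)
```

On the real data, the same functions could be called with `dataset["train"]["text"]` and `dataset["test"]["text"]`. Exact matching after normalization only catches verbatim repeats; near-duplicates would need a fuzzier comparison.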
## Columns
```
# Convert the train split to a pandas DataFrame and preview the first rows.
dataset_train = dataset['train'].to_pandas()
dataset_train.head()
text key_words dataset
0 Lancement du projet Wikidata Un nouveau projet... projet, wikidata, projet de base de donnée, so... wikinews
1 Nigeria : crash d'un avion avec 153 personnes ... nigeria, crash, avion, lagos, mcdonnel douglas... wikinews
2 Jean-Marie Le Pen traite sa fille de « petite-... jean-marie le pen, fille, petite-bourgeoise, b... wikinews
3 France : lancement des Journées de l'archéolog... france, lancement, journées de l'archéologie, ... wikinews
4 Mali : les islamistes poursuivent leurs destru... mali, islamistes, destructions, tombouctou, ma... wikinews
```
- the `text` column contains the document text from which the keywords are drawn
- the `key_words` column contains the keywords as a single comma-separated string
- the `dataset` column identifies the row's source dataset, which lets you filter rows by origin
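The column descriptions above can be illustrated with a short sketch that turns a `key_words` string into a list and keeps only the rows from one source. It assumes, based on the preview, that keywords are separated by commas; a keyword that itself contains a comma would be split incorrectly. The helpers `split_keywords` and `rows_from` and the toy rows are hypothetical.

```python
def split_keywords(key_words: str) -> list[str]:
    """Split a comma-separated keyword string into a list of stripped keywords."""
    return [kw.strip() for kw in key_words.split(",") if kw.strip()]


def rows_from(rows, source: str):
    """Keep only the rows whose `dataset` column equals `source`."""
    return [row for row in rows if row["dataset"] == source]


# Toy rows (hypothetical): the first is shortened from the preview above,
# the second is invented for illustration.
rows = [
    {"text": "Lancement du projet Wikidata", "key_words": "projet, wikidata", "dataset": "wikinews"},
    {"text": "Un résumé de thèse", "key_words": "thèse, résumé", "dataset": "papyrus"},
]

wikinews_rows = rows_from(rows, "wikinews")
keywords = split_keywords(wikinews_rows[0]["key_words"])
print(keywords)  # ['projet', 'wikidata']
```

With the loaded `datasets` object, the equivalent filter would be something like `dataset["train"].filter(lambda row: row["dataset"] == "wikinews")`.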
## Splits
- `train` corresponds to the concatenation of `wikinews` + `termith` + `taln` + `papyrus`
- `validation` corresponds to `papyrus`
- `test` corresponds to `papyrus`
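As a sanity check on this composition, the per-source counts from the "Details of rows" table can be summed per split. This is a small arithmetic sketch; the dictionary layout and variable names are illustrative, and the source labels are the ones used in the list above.

```python
# Per-source row counts, copied from the "Details of rows" table.
contributions = {
    "wikinews": {"train": 100},
    "termith": {"train": 399},
    "taln": {"train": 1207},
    "papyrus": {"train": 10299, "validation": 1488, "test": 2981},
}

# Accumulate the number of rows each source contributes to each split.
split_sizes = {}
for source, splits in contributions.items():
    for split, n_rows in splits.items():
        split_sizes[split] = split_sizes.get(split, 0) + n_rows

total = sum(split_sizes.values())
print(split_sizes)  # {'train': 12005, 'validation': 1488, 'test': 2981}
print(total)  # 16474
```

The resulting train size of 12,005 matches the `num_examples` value declared for the train split in the card metadata.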
# Citations
### taln-ls2n/wikinews-fr-100
```
@inproceedings{bougouin-etal-2013-topicrank,
title = "{T}opic{R}ank: Graph-Based Topic Ranking for Keyphrase Extraction",
author = "Bougouin, Adrien and Boudin, Florian and Daille, B{\'e}atrice",
editor = "Mitkov, Ruslan and Park, Jong C.",
booktitle = "Proceedings of the Sixth International Joint Conference on Natural Language Processing",
month = oct,
year = "2013",
address = "Nagoya, Japan",
publisher = "Asian Federation of Natural Language Processing",
url = "https://aclanthology.org/I13-1062",
pages = "543--551"}
```
### taln-ls2n/termith-eval
```
@inproceedings{bougouin-etal-2016-termith,
title = "{T}erm{ITH}-Eval: a {F}rench Standard-Based Resource for Keyphrase Extraction Evaluation",
author = "Bougouin, Adrien and Barreaux, Sabine and Romary, Laurent and Boudin, Florian and Daille, B{\'e}atrice",
editor = "Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Goggi, Sara and Grobelnik, Marko and Maegaard, Bente and Mariani, Joseph and Mazo, Helene and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
month = may,
year = "2016",
address = "Portoro{\v{z}}, Slovenia",
publisher = "European Language Resources Association (ELRA)",
url = "https://aclanthology.org/L16-1304",
pages = "1924--1927"}
```
### taln-ls2n/taln-archives
```
@inproceedings{boudin-2013-taln,
title = "{TALN} Archives : a digital archive of {F}rench research articles in Natural Language Processing ({TALN} Archives : une archive num{\'e}rique francophone des articles de recherche en Traitement Automatique de la Langue) [in {F}rench]",
author = "Boudin, Florian",
editor = "Morin, Emmanuel and Est{\`e}ve, Yannick",
booktitle = "Proceedings of TALN 2013 (Volume 2: Short Papers)",
month = jun,
year = "2013",
address = "Les Sables d{'}Olonne, France",
publisher = "ATALA",
url = "https://aclanthology.org/F13-2001",
pages = "507--514"}
```
### papyrus
```
@inproceedings{NEURIPS2022_f8870955,
author = {Piedboeuf, Fr\'{e}d\'{e}ric and Langlais, Philippe},
booktitle = {Advances in Neural Information Processing Systems},
editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
pages = {38046--38059},
publisher = {Curran Associates, Inc.},
title = {A new dataset for multilingual keyphrase generation},
url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/f88709551258331f9ab31b33c71021a4-Paper-Datasets_and_Benchmarks.pdf},
volume = {35},
year = {2022}}
```
### FrenchKEYWORDS
```
@misc{FrenchKEYWORDS_2025,
author = { {BOURDOIS, Loïck} },
organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
title = { frenchKEYWORDS (Revision fda5770) },
year = 2025,
url = { https://huggingface.co/datasets/CATIE-AQ/frenchKEYWORDS },
doi = { 10.57967/hf/7133 },
publisher = { Hugging Face }
}
```
# License
[cc-by-4.0](https://creativecommons.org/licenses/by/4.0/deed.en)
Provider:
CATIE-AQ



