CATIE-AQ/frenchKEYWORDS

Name: CATIE-AQ/frenchKEYWORDS
Creator: CATIE-AQ
Published: 2025-12-01 11:05:21
License: cc-by-4.0

Source: Hugging Face. Updated 2025-12-01. Indexed 2026-01-03.

Download link:

https://hf-mirror.com/datasets/CATIE-AQ/frenchKEYWORDS


Dataset card:

---
language:
- fr
license: cc-by-4.0
size_categories:
- 10K<n<100K
task_categories:
- token-classification
- text-generation
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: text
    dtype: string
  - name: key_words
    dtype: string
  - name: dataset
    dtype: string
  splits:
  - name: train
    num_bytes: 26762234
    num_examples: 12005
  - name: validation
    num_bytes: 3630782
    num_examples: 1488
  - name: test
    num_bytes: 7310223
    num_examples: 2981
  download_size: 20586631
  dataset_size: 37703239
---

# Dataset information

**A concatenation of open-source French keyword extraction datasets.**
There are **16,474** rows in total: 12,005 for training, 1,488 for validation and 2,981 for testing.

# Usage

```
from datasets import load_dataset
dataset = load_dataset("CATIE-AQ/frenchKEYWORDS")
```

```
DatasetDict({
    train: Dataset({
        features: ['text', 'key_words', 'dataset'],
        num_rows: 12005
    })
    validation: Dataset({
        features: ['text', 'key_words', 'dataset'],
        num_rows: 1488
    })
    test: Dataset({
        features: ['text', 'key_words', 'dataset'],
        num_rows: 2981
    })
})
```

# Dataset

## Details of rows

| Original dataset | Splits | Note |
| ----------- | ----------- | ----------- |
| [taln-ls2n/wikinews-fr-100](https://huggingface.co/datasets/taln-ls2n/wikinews-fr-100) | 100 train | We keep only the present (P) and reordered (R) keywords from the original dataset. |
| [taln-ls2n/termith-eval](https://huggingface.co/datasets/taln-ls2n/termith-eval) | 399 train | We keep only the present (P) and reordered (R) keywords from the original dataset. |
| [taln-ls2n/taln-archives](https://huggingface.co/datasets/taln-ls2n/taln-archives) | 1,207 train | We keep only the present (P) and reordered (R) keywords from the original dataset. |
| [papyrus](https://github.com/smolPixel/French-keyphrase-generation/tree/main/data/papyrus_f) | 10,299 train / 1,488 validation / 2,981 test | We keep only the French split of papyrus. |

## Removing duplicate data and leaks

Concatenating the four datasets listed above creates no duplicates and no leaks between splits.

## Columns

```
dataset_train = dataset['train'].to_pandas()
dataset_train.head()

   text                                                key_words                                           dataset
0  Lancement du projet Wikidata Un nouveau projet...  projet, wikidata, projet de base de donnée, so...  wikinews
1  Nigeria : crash d'un avion avec 153 personnes ...  nigeria, crash, avion, lagos, mcdonnel douglas...  wikinews
2  Jean-Marie Le Pen traite sa fille de « petite-...  jean-marie le pen, fille, petite-bourgeoise, b...  wikinews
3  France : lancement des Journées de l'archéolog...  france, lancement, journées de l'archéologie, ...  wikinews
4  Mali : les islamistes poursuivent leurs destru...  mali, islamistes, destructions, tombouctou, ma...  wikinews
```

- the `text` column contains the source text
- the `key_words` column contains the keywords
- the `dataset` column identifies the row's original dataset (useful if you wish to filter on it)

## Split

- `train` corresponds to the concatenation of `wikinews` + `termith` + `taln` + `papyrus`
- `validation` corresponds to `papyrus`
- `test` corresponds to `papyrus`

# Citations

### taln-ls2n/wikinews-fr-100

```
@inproceedings{bougouin-etal-2013-topicrank,
    title = "{T}opic{R}ank: Graph-Based Topic Ranking for Keyphrase Extraction",
    author = "Bougouin, Adrien and Boudin, Florian and Daille, B{\'e}atrice",
    editor = "Mitkov, Ruslan and Park, Jong C.",
    booktitle = "Proceedings of the Sixth International Joint Conference on Natural Language Processing",
    month = oct,
    year = "2013",
    address = "Nagoya, Japan",
    publisher = "Asian Federation of Natural Language Processing",
    url = "https://aclanthology.org/I13-1062",
    pages = "543--551"}
```

### taln-ls2n/termith-eval

```
@inproceedings{bougouin-etal-2016-termith,
    title = "{T}erm{ITH}-Eval: a {F}rench Standard-Based Resource for Keyphrase Extraction Evaluation",
    author = "Bougouin, Adrien and Barreaux, Sabine and Romary, Laurent and Boudin, Florian and Daille, B{\'e}atrice",
    editor = "Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Goggi, Sara and Grobelnik, Marko and Maegaard, Bente and Mariani, Joseph and Mazo, Helene and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
    month = may,
    year = "2016",
    address = "Portoro{\v{z}}, Slovenia",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L16-1304",
    pages = "1924--1927"}
```

### taln-ls2n/taln-archives

```
@inproceedings{boudin-2013-taln,
    title = "{TALN} Archives : a digital archive of {F}rench research articles in Natural Language Processing ({TALN} Archives : une archive num{\'e}rique francophone des articles de recherche en Traitement Automatique de la Langue) [in {F}rench]",
    author = "Boudin, Florian",
    editor = "Morin, Emmanuel and Est{\`e}ve, Yannick",
    booktitle = "Proceedings of TALN 2013 (Volume 2: Short Papers)",
    month = jun,
    year = "2013",
    address = "Les Sables d{'}Olonne, France",
    publisher = "ATALA",
    url = "https://aclanthology.org/F13-2001",
    pages = "507--514"}
```

### papyrus

```
@inproceedings{NEURIPS2022_f8870955,
    author = {Piedboeuf, Fr\'{e}d\'{e}ric and Langlais, Philippe},
    booktitle = {Advances in Neural Information Processing Systems},
    editor = {S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh},
    pages = {38046--38059},
    publisher = {Curran Associates, Inc.},
    title = {A new dataset for multilingual keyphrase generation},
    url = {https://proceedings.neurips.cc/paper_files/paper/2022/file/f88709551258331f9ab31b33c71021a4-Paper-Datasets_and_Benchmarks.pdf},
    volume = {35},
    year = {2022}}
```

### FrenchKEYWORDS

```
@misc{FrenchKEYWORDS_2025,
    author = { {BOURDOIS, Loïck} },
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
    title = { frenchKEYWORDS (Revision fda5770) },
    year = 2025,
    url = { https://huggingface.co/datasets/CATIE-AQ/frenchKEYWORDS },
    doi = { 10.57967/hf/7133 },
    publisher = { Hugging Face } }
```

# License

[cc-by-4.0](https://creativecommons.org/licenses/by/4.0/deed.en)
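
# Example: working with the columns

The `key_words` column is a single string and the `dataset` column names each row's source, so a common first step is to split the keywords into a list and filter or count rows by source. The sketch below is illustrative only. It assumes the keywords are separated by commas, as the `head()` preview suggests. The helpers `split_keywords` and `count_by_source` are hypothetical names, and the toy rows are invented so the snippet runs without downloading anything.

```python
from collections import Counter


def split_keywords(key_words: str) -> list[str]:
    """Split a comma-separated `key_words` string into clean keywords.

    Assumption: keywords are comma-separated; empty pieces are dropped.
    """
    return [kw.strip() for kw in key_words.split(",") if kw.strip()]


def count_by_source(rows: list[dict]) -> Counter:
    """Count rows per value of the `dataset` column."""
    return Counter(row["dataset"] for row in rows)


# Toy rows invented for illustration; they only mimic the three columns
# (`text`, `key_words`, `dataset`) and are NOT taken from the real data.
rows = [
    {"text": "Exemple de texte un", "key_words": "exemple, texte", "dataset": "wikinews"},
    {"text": "Exemple de texte deux", "key_words": "exemple , deux, ", "dataset": "papyrus"},
    {"text": "Exemple de texte trois", "key_words": "trois", "dataset": "papyrus"},
]

# Keep only the rows that come from one source, then parse their keywords.
papyrus_rows = [row for row in rows if row["dataset"] == "papyrus"]
keyword_lists = [split_keywords(row["key_words"]) for row in papyrus_rows]

print(count_by_source(rows))
print(keyword_lists)
```

With the real data, the same logic should carry over to the `datasets` API, for example `dataset["train"].filter(lambda row: row["dataset"] == "papyrus")` to select one source and `.map(lambda row: {"keywords": split_keywords(row["key_words"])})` to add a parsed column. The new column name `keywords` is likewise an assumption, not part of the dataset.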
