
CATIE-AQ/frenchPARAPHRASE

Source: Hugging Face · Updated 2025-12-01 · Indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/CATIE-AQ/frenchPARAPHRASE
Resource description:
---
language:
- fr
license: cc-by-4.0
size_categories:
- 100K<n<1M
task_categories:
- text-generation
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: sentence
    dtype: string
  - name: paraphrase
    dtype: string
  - name: dataset
    dtype: string
  splits:
  - name: train
    num_bytes: 32824568
    num_examples: 251753
  - name: validation
    num_bytes: 320653
    num_examples: 1857
  - name: test
    num_bytes: 239521
    num_examples: 903
  download_size: 12172298
  dataset_size: 33384742
---

# Dataset information

**A concatenation of the open-source paraphrase datasets available in French.**
There are **254,513** rows in total: 251,753 for training, 1,857 for validation and 903 for testing.

# Usage

```
from datasets import load_dataset
dataset = load_dataset("CATIE-AQ/frenchPARAPHRASE")
```

# Dataset

## Details of rows

| Original dataset | Splits | Note |
| ----------- | ----------- | ----------- |
| [Helsinki-NLP/tatoeba_mt](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt) | 2,117 train / 999 validation | We keep only the French split (`fra-fra`) and do not use the test set |
| [tapaco](https://huggingface.co/tapaco) | 206,912 train | We keep only the French split (`fr`) |
| [paws-x](https://huggingface.co/datasets/paws-x) | 21,000 train / 858 validation / 903 test | We keep only the French split (`fr`) |
| [paraphrasing_french_5000](https://hf.co/ismailiismail/paraphrasing_french_5000) | 4,972 train | |
| [paraphrasing_french](https://hf.co/ismailiismail/paraphrasing_french) | 2,075 train | |
| [paragraphss_paraphrasing](https://hf.co/ismailiismail/paragraphss_paraphrasing) | 1,000 train | |
| [multi_paraphrasing_french](https://hf.co/ismailiismail/multi_paraphrasing_french) | 14,936 train | 997 sextuplets forming 14,936 sentence pairs |

## Removing duplicate data and leaks

Summing the row counts of the datasets listed above gives the following result:

```
DatasetDict({
    train: Dataset({
        features: ['sentence', 'paraphrase', 'dataset'],
        num_rows: 253012
    })
    validation: Dataset({
        features: ['sentence', 'paraphrase', 'dataset'],
        num_rows: 1857
    })
    test: Dataset({
        features: ['sentence', 'paraphrase', 'dataset'],
        num_rows: 903
    })
})
```

However, an item in the training split of dataset A may be absent from A's test split yet present in B's test split, which creates a leak once A and B are concatenated. The same reasoning applies to duplicate rows, so both must be removed.
After our clean-up, we obtain the following numbers:

```
DatasetDict({
    train: Dataset({
        features: ['sentence', 'paraphrase', 'dataset'],
        num_rows: 251753
    })
    validation: Dataset({
        features: ['sentence', 'paraphrase', 'dataset'],
        num_rows: 1857
    })
    test: Dataset({
        features: ['sentence', 'paraphrase', 'dataset'],
        num_rows: 903
    })
})
```

## Columns

```
dataset_train = dataset['train'].to_pandas()
dataset_train.head()

   sentence                                            paraphrase                                          dataset
0  La saison NBA 1975 - 76 était la 30e saison de...  La saison 1975-1976 de la National Basketball ...  paws_x
1  Lorsque des débits comparables peuvent être ma...  Les résultats sont élevés lorsque des débits c...  paws_x
2  C'est le siège du district de Zerendi dans la ...  C'est le siège du district de Zerendi dans la ...  paws_x
3  William Henry Henry Harman est né le 17 févrie...  William Henry Harman est né à Waynesboro, en V...  paws_x
4  Avec un nombre discret de probabilités Formule...  Étant donné un ensemble discret de probabilité...  paws_x
```

- the `sentence` column contains the source text
- the `paraphrase` column contains its paraphrase
- the `dataset` column identifies the row's original dataset (useful if you wish to filter by source)

## Split

- `train` corresponds to the concatenation of `paws_x` + `tatoeba` + `tapaco` + `paraphrasing_french_5000` + `paraphrasing_french` + `paragraphss_paraphrasing` + `multi_paraphrasing_french`
- `validation` corresponds to the concatenation of `paws_x` + `tatoeba`
- `test` corresponds to `paws_x`

# Citations

### Helsinki-NLP/tatoeba_mt

```
@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182"}
```

### Tapaco

```
@dataset{scherrer_yves_2020_3707949,
    author = {Scherrer, Yves},
    title = {{TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages}},
    month = mar,
    year = 2020,
    publisher = {Zenodo},
    version = {1.0},
    doi = {10.5281/zenodo.3707949},
    url = {https://doi.org/10.5281/zenodo.3707949}}
```

### Paws_x

```
@InProceedings{pawsx2019emnlp,
    title = {{PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification}},
    author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
    booktitle = {Proc. of EMNLP},
    year = {2019}}
```

### Paraphrasing_french_5000

```
Dataset by Naouadir Ismail (2023)
Hugging Face repository: https://huggingface.co/datasets/ismailiismail/paraphrasing_french_5000
```

### Paraphrasing_french

```
Dataset by Naouadir Ismail (2023)
Hugging Face repository: https://huggingface.co/datasets/ismailiismail/paraphrasing_french
```

### Paragraphss_paraphrasing

```
Dataset by Naouadir Ismail (2023)
Hugging Face repository: https://huggingface.co/datasets/ismailiismail/paragraphss_paraphrasing
```

### Multi_paraphrasing_french

```
Dataset by Naouadir Ismail (2023)
Hugging Face repository: https://huggingface.co/datasets/ismailiismail/multi_paraphrasing_french
```

### FrenchPARAPHRASE

```
@misc{FrenchPARAPHRASE_2025,
    author = { {BOURDOIS, Loïck} },
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
    year = 2025,
    url = { https://huggingface.co/datasets/CATIE-AQ/frenchPARAPHRASE },
    doi = { 10.57967/hf/7132 },
    publisher = { Hugging Face }
}
```

# License

[cc-by-4.0](https://creativecommons.org/licenses/by/4.0/deed.en)
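
# Appendix: checking the row counts

The figures quoted under "Details of rows" and "Removing duplicate data and leaks" can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers stated in this card; the dictionary names are illustrative and not part of the dataset.

```
# Per-source row counts, copied from the "Details of rows" table.
train_sources = {
    "tatoeba_mt": 2_117,
    "tapaco": 206_912,
    "paws-x": 21_000,
    "paraphrasing_french_5000": 4_972,
    "paraphrasing_french": 2_075,
    "paragraphss_paraphrasing": 1_000,
    "multi_paraphrasing_french": 14_936,
}
validation_sources = {"tatoeba_mt": 999, "paws-x": 858}
test_sources = {"paws-x": 903}

train_before = sum(train_sources.values())      # rows before clean-up
validation = sum(validation_sources.values())
test = sum(test_sources.values())

train_after = 251_753                           # train rows after clean-up, as reported
removed = train_before - train_after            # duplicates and leaks dropped from train
total = train_after + validation + test

print(train_before, validation, test)  # 253012 1857 903
print(removed, total)                  # 1259 254513
```

The difference shows that the clean-up removed 1,259 rows, all from the training split, since the validation and test counts are unchanged.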
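
# Appendix: sketch of duplicate and leak removal

The card does not publish the clean-up code, so the following is only a minimal sketch of the idea described under "Removing duplicate data and leaks". It rests on several assumptions that are not from the source: a pair is identified by its two stripped strings regardless of order, held-out splits take priority over `train`, and the helper names `pair_key` and `clean_splits` are hypothetical.

```
def pair_key(row):
    """Order-insensitive key: (a, b) and (b, a) are treated as the same pair (assumption)."""
    return frozenset((row["sentence"].strip(), row["paraphrase"].strip()))


def dedupe(rows, blocked):
    """Keep the first occurrence of each pair, skipping any pair whose key is in `blocked`."""
    seen = set(blocked)
    kept = []
    for row in rows:
        key = pair_key(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def clean_splits(train, validation, test):
    """Deduplicate each split, then drop validation/train pairs found in a held-out split."""
    test_clean = dedupe(test, set())
    test_keys = {pair_key(r) for r in test_clean}
    val_clean = dedupe(validation, test_keys)
    val_keys = {pair_key(r) for r in val_clean}
    train_clean = dedupe(train, test_keys | val_keys)
    return train_clean, val_clean, test_clean


# Toy rows (invented for illustration, not taken from the dataset).
train = [
    {"sentence": "Bonjour.", "paraphrase": "Salut.", "dataset": "tapaco"},
    {"sentence": "Salut.", "paraphrase": "Bonjour.", "dataset": "tatoeba"},           # reversed duplicate
    {"sentence": "Il pleut.", "paraphrase": "La pluie tombe.", "dataset": "tapaco"},  # also in test: leak
    {"sentence": "Merci.", "paraphrase": "Merci beaucoup.", "dataset": "tapaco"},
]
validation = [{"sentence": "Au revoir.", "paraphrase": "À bientôt.", "dataset": "tatoeba"}]
test = [{"sentence": "Il pleut.", "paraphrase": "La pluie tombe.", "dataset": "paws_x"}]

train_clean, val_clean, test_clean = clean_splits(train, validation, test)
print(len(train_clean), len(val_clean), len(test_clean))  # 2 1 1
```

Cleaning `test` first and `train` last means a conflicting row is always removed from the larger training split, which is consistent with the card's numbers, where only the `train` count changes after clean-up.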
Provider:
CATIE-AQ