lyon-nlp/clustering-hal-s2s

Name: lyon-nlp/clustering-hal-s2s
Creator: lyon-nlp
Published: 2024-06-06 08:20:05
License: Apache 2.0

Hugging Face · Updated 2024-06-06 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/lyon-nlp/clustering-hal-s2s

Resource description:

---
license: apache-2.0
task_categories:
- text-classification
language:
- fr
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: test
    path: test.jsonl
- config_name: raw
  data_files:
  - split: test
    path: test.jsonl
- config_name: mteb_eval
  data_files:
  - split: test
    path: mteb_eval.jsonl
---

## Clustering HAL

This dataset was created by scraping the HAL platform. Over 80,000 articles were scraped, keeping each article's id, title and category. It was originally used for the French version of [MTEB](https://github.com/embeddings-benchmark/mteb), but it can also be used for other clustering or classification tasks, or to evaluate the general knowledge of a model.

⚠️ This dataset contains 2 subsets. *IT IS STRONGLY ADVISED TO USE THE CLEANED UP ``mteb_eval`` SUBSET*:

- ***"raw"*** subset: contains the data as originally scraped, without any cleaning. Most titles are in French, but some are in other languages (English, Italian, ...).
- ***"mteb_eval"*** subset: the subset used for the MTEB evaluation. It is a cleaned-up version of the raw dataset. Notably, samples were removed if:
  - their "domain" was a minor class (fewer than 500 samples available)
  - their "title" had two words or fewer
  - the language was not French

### Usage

To use this dataset, you can run the following code:

```py
from datasets import load_dataset

# Load the cleaned subset used for the MTEB evaluation
dataset = load_dataset("lyon-nlp/clustering-hal-s2s", name="mteb_eval", split="test")
```

### Citation

If you use this dataset in your work, please consider citing:

```
@misc{ciancone2024extending,
      title={Extending the Massive Text Embedding Benchmark to French},
      author={Mathieu Ciancone and Imene Kerboua and Marion Schaeffer and Wissam Siblini},
      year={2024},
      eprint={2405.20468},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Provider:

lyon-nlp

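Summary of the original information

Clustering HAL dataset overview

Basic information

License: Apache 2.0
Task category: text classification
Language: French
Dataset size: 10K<n<100K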

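Configuration

Default configuration (default):
- Data files:
  - Split: test
  - Path: test.jsonl
Raw configuration (raw):
- Data files:
  - Split: test
  - Path: test.jsonl
MTEB evaluation configuration (mteb_eval):
- Data files:
  - Split: test
  - Path: mteb_eval.jsonl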

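Dataset description

Source: data scraped from the HAL platform, covering the ID, title and category of over 80,000 articles.
Original use: built for the French version of the MTEB evaluation; it can also be used for other clustering or classification tasks, or to evaluate a model's general knowledge.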
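Because the dataset pairs each title with a category, a clustering of the titles can be scored against those categories. The sketch below computes V-measure, a standard label-based clustering score, in plain Python. It is an illustration only: the source does not say which metric the benchmark reports, and the function name `v_measure` and the toy labels are assumptions.

```python
import math
from collections import Counter


def _entropy(counts, n):
    """Shannon entropy (natural log) of a partition given its group sizes."""
    return -sum((c / n) * math.log(c / n) for c in counts if c)


def v_measure(labels_true, labels_pred):
    """Harmonic mean of homogeneity and completeness (hypothetical helper)."""
    n = len(labels_true)
    if n == 0:
        return 1.0
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    h_class = _entropy(class_counts.values(), n)
    h_cluster = _entropy(cluster_counts.values(), n)
    # Conditional entropies H(class | cluster) and H(cluster | class)
    h_class_given_cluster = -sum(
        (n_ck / n) * math.log(n_ck / cluster_counts[k])
        for (_, k), n_ck in joint.items()
    )
    h_cluster_given_class = -sum(
        (n_ck / n) * math.log(n_ck / class_counts[c])
        for (c, _), n_ck in joint.items()
    )

    homogeneity = 1.0 if h_class == 0 else 1.0 - h_class_given_cluster / h_class
    completeness = 1.0 if h_cluster == 0 else 1.0 - h_cluster_given_class / h_cluster
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Toy example: cluster ids are arbitrary, only the grouping matters.
true_domains = ["info", "info", "shs", "shs"]
perfect = v_measure(true_domains, [1, 1, 0, 0])
collapsed = v_measure(true_domains, [0, 0, 0, 0])
```

With real data, `labels_true` would be the category column and `labels_pred` the cluster assignments produced from title embeddings.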

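Dataset subsets

Raw subset (raw):
- Contains the data as originally scraped, without any cleaning. Most titles are in French, but some are in other languages (English, Italian, etc.).
MTEB evaluation subset (mteb_eval):
- The cleaned-up version used for the MTEB evaluation. Notably, a sample was removed if it met any of the following conditions:
  - its "domain" was a minor class (fewer than 500 samples available)
  - its "title" had two words or fewer
  - its language was not French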
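The three removal rules above can be sketched as a small filter. This is not the authors' cleaning script: the field names `title` and `domain`, the order in which the rules are applied, the whitespace-based word count, and the `is_french` callback (a stand-in for whatever language detector was actually used) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Iterable


def clean_samples(
    samples: Iterable[dict],
    is_french: Callable[[str], bool],
    min_class_size: int = 500,
    max_short_title_words: int = 2,
) -> list:
    """Apply the three removal rules; the rule order is an assumption."""
    # Pass 1: drop titles of two words or fewer and non-French titles.
    kept = [
        s for s in samples
        if len(s["title"].split()) > max_short_title_words and is_french(s["title"])
    ]
    # Pass 2: drop domains with fewer than min_class_size remaining samples.
    counts = Counter(s["domain"] for s in kept)
    return [s for s in kept if counts[s["domain"]] >= min_class_size]


def toy_is_french(title: str) -> bool:
    """Toy detector for the demo only; a real pipeline needs a proper one."""
    hints = {"de", "des", "la"}
    return any(word in hints for word in title.lower().split())


rows = [
    {"title": "Étude des réseaux de neurones profonds", "domain": "info"},
    {"title": "Analyse des graphes dynamiques", "domain": "info"},
    {"title": "Deep learning", "domain": "info"},               # two words
    {"title": "A survey of graph methods", "domain": "info"},   # not French
    {"title": "Histoire de la Rome antique", "domain": "shs"},  # minor class
]

# min_class_size is lowered from 500 to 2 so the toy data exercises the rule.
cleaned = clean_samples(rows, toy_is_french, min_class_size=2)
```

Counting class sizes after the title and language filters is one possible reading; counting them on the raw data first could give a different result.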

Usage

```python
from datasets import load_dataset

# Load the MTEB evaluation subset
dataset = load_dataset("lyon-nlp/clustering-hal-s2s", name="mteb_eval", split="test")
```
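Once loaded, a common first step is to group titles by their label to see how the classes are distributed. The sketch below assumes each row is a dict with `title` and `domain` keys; those column names are inferred from the cleaning rules and should be checked against the actual features. `group_titles_by_domain` and `load_mteb_eval` are hypothetical helper names.

```python
from collections import defaultdict


def group_titles_by_domain(rows, title_key="title", domain_key="domain"):
    """Map each domain label to the list of titles carrying it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[domain_key]].append(row[title_key])
    return dict(groups)


def load_mteb_eval():
    """Load the cleaned subset; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the demo runs offline

    return load_dataset("lyon-nlp/clustering-hal-s2s", name="mteb_eval", split="test")


# Offline demo on in-memory rows shaped like the assumed schema.
sample_rows = [
    {"title": "Étude des réseaux de neurones profonds", "domain": "info"},
    {"title": "Histoire de la Rome antique", "domain": "shs"},
    {"title": "Analyse des graphes dynamiques", "domain": "info"},
]
groups = group_titles_by_domain(sample_rows)
sizes = {domain: len(titles) for domain, titles in groups.items()}
```

With the real data, `group_titles_by_domain(load_mteb_eval())` would do the same over the full split, since iterating a `datasets` split yields one dict per row.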

Citation

```
@misc{ciancone2024extending,
      title={Extending the Massive Text Embedding Benchmark to French},
      author={Mathieu Ciancone and Imene Kerboua and Marion Schaeffer and Wissam Siblini},
      year={2024},
      eprint={2405.20468},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
