lyon-nlp/clustering-hal-s2s
Source: Hugging Face. Updated 2024-06-06, indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/lyon-nlp/clustering-hal-s2s
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
language:
- fr
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: test
    path: test.jsonl
- config_name: raw
  data_files:
  - split: test
    path: test.jsonl
- config_name: mteb_eval
  data_files:
  - split: test
    path: mteb_eval.jsonl
---
## Clustering HAL
This dataset was created by scraping data from the HAL platform.
Over 80,000 articles were scraped, keeping their id, title and category.
It was originally used for the French version of [MTEB](https://github.com/embeddings-benchmark/mteb), but it can also be used for various clustering or classification tasks, or even to evaluate the general knowledge of a model.
⚠️ This dataset contains 2 subsets. *IT IS STRONGLY ADVISED TO USE THE CLEANED UP ``mteb_eval`` SUBSET*:
- ***"raw"*** subset: contains the data as originally scraped, without any cleaning. The titles are mostly in French, but some are in other languages (English, Italian, ...).
- ***"mteb_eval"*** subset: the subset used for the MTEB evaluation. It is a cleaned-up version of the raw dataset. Notably, samples were removed if:
  - their "domain" was in a minor class (fewer than 500 samples available)
  - their "title" had two words or fewer
  - the language was not French
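The three removal rules above can be sketched as a small filter. This is only an illustration under stated assumptions, not the authors' actual cleaning script: the function name `clean_samples`, the `is_french` predicate, the `lang` field in the toy data, and the order in which the rules are applied are all hypothetical. The card does not say how language was detected or whether class sizes were counted before or after the other filters.

```python
from collections import Counter


def clean_samples(samples, is_french, min_class_size=500, min_words=3):
    """Hypothetical re-creation of the cleaning rules listed in the card.

    samples: list of dicts with "title" and "domain" keys.
    is_french: caller-supplied predicate (assumption; the card does not
        describe the language detection that was actually used).
    """
    # Rules 2 and 3: drop titles of two words or fewer and non-French samples.
    kept = [
        s for s in samples
        if len(s["title"].split()) >= min_words and is_french(s)
    ]
    # Rule 1: drop samples whose domain has fewer than min_class_size members.
    # Counting after the other filters is an assumption of this sketch.
    counts = Counter(s["domain"] for s in kept)
    return [s for s in kept if counts[s["domain"]] >= min_class_size]


# Toy data with a made-up "lang" field, and a tiny class threshold for the demo.
toy = [
    {"title": "Étude des réseaux de neurones", "domain": "info", "lang": "fr"},
    {"title": "Analyse des données textuelles", "domain": "info", "lang": "fr"},
    {"title": "Deep learning", "domain": "info", "lang": "en"},
    {"title": "A survey of graph methods", "domain": "info", "lang": "en"},
    {"title": "Histoire de la ville de Lyon", "domain": "shs", "lang": "fr"},
]
cleaned = clean_samples(toy, is_french=lambda s: s["lang"] == "fr", min_class_size=2)
print([s["title"] for s in cleaned])
```

With the real data the threshold would be the card's value of 500; the toy call lowers it to 2 so the effect is visible on five rows.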
### Usage
To use this dataset, run the following code:
```py
from datasets import load_dataset
dataset = load_dataset("lyon-nlp/clustering-hal-s2s", name="mteb_eval", split="test") # for MTEB eval subset
```
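Once loaded, each row can be treated like a dictionary, so a quick check of the class balance is straightforward. The sketch below assumes the rows expose `title` and `domain` keys, as the card's wording suggests, and uses a small in-memory list as a stand-in so that it runs without downloading anything. The helper name `domain_distribution` is hypothetical.

```python
from collections import Counter


def domain_distribution(rows):
    """Return (domain, count) pairs, most frequent domain first."""
    return Counter(row["domain"] for row in rows).most_common()


# Stand-in rows; with the real data, pass the object returned by load_dataset.
rows = [
    {"title": "Histoire de la ville de Lyon", "domain": "shs"},
    {"title": "Étude des réseaux de neurones", "domain": "info"},
    {"title": "Sociologie des pratiques culturelles", "domain": "shs"},
]
distribution = domain_distribution(rows)
print(distribution)
```

The resulting counts give the number of clusters and their sizes, which is useful context before running a clustering or classification experiment.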
### Citation
If you use this dataset in your work, please consider citing:
```
@misc{ciancone2024extending,
title={Extending the Massive Text Embedding Benchmark to French},
author={Mathieu Ciancone and Imene Kerboua and Marion Schaeffer and Wissam Siblini},
year={2024},
eprint={2405.20468},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Provider:
lyon-nlp



