
PITTI/MicRou_chunked

Source: Hugging Face. Updated: 2024-03-12. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/PITTI/MicRou_chunked
Resource description:
---
license: cc-by-4.0
language:
- fr
tags:
- legal
- finance
pretty_name: MicRou_chunked
size_categories:
- 1K<n<10K
---

# MicRou

## Introduction

The documents that constitute the dataset were gathered for a RAG project in memory of [Michel Rouger](https://www.pitti.io/articles/michel-rouger): the documents were part of his personal archives and include his own work as well as work produced by other authors during projects he ran.

## Datasets

The [MicRou repository](https://github.com/pappitti/MicRou) includes 2 datasets in French:

1. microu

   This dataset includes approximately 850 documents in French (books, articles, minutes of debates) produced between 1998 and 2020. It covers justice and law, finance and economics, management, healthcare, education, sports, history and geopolitics. Overall it represents between 1.5 million and 2 million tokens, depending on the tokenizer you use.

   In many cases, the documents stem from a larger source that was broken down because its parts could be considered independently (e.g. different chapters of a book or different articles of a newsletter). It is nonetheless possible to recombine the entire source: within a "dossier", you can group by date and, within each group, order by index. Documents that do not come from a larger source have an index of 0 by default.

2. microu-chunked

   As part of the RAG project, we used an embeddings model, [Solon](https://huggingface.co/OrdalieTech/Solon-embeddings-large-0.1), with a context window of 512 tokens, so we had to split the MicRou dataset into chunks. This is the resulting dataset. The [MicRou repository](https://github.com/pappitti/MicRou) details the chunking strategy and includes the scripts used for chunking.

## License (CC-BY-NC-SA-4.0)

The dataset is currently under a restrictive license. We plan to convert it to an open license once we have finalized the review of the rights holders. Some documents may be excluded following the review, but we also plan to add others over time.
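
The recombination rule described in the card (within a "dossier", group by date, then order by index) could be sketched roughly as follows. This is only an illustration: the field names `dossier`, `date`, `index` and `text`, and the sample records, are assumptions inferred from the description and have not been checked against the actual dataset schema.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical records: field names and values are assumed for illustration,
# not taken from the real MicRou files.
records = [
    {"dossier": "A", "date": "2005-03-01", "index": 2, "text": "Chapter 2"},
    {"dossier": "A", "date": "2005-03-01", "index": 1, "text": "Chapter 1"},
    {"dossier": "A", "date": "2010-06-15", "index": 0, "text": "Standalone article"},
    {"dossier": "B", "date": "2005-03-01", "index": 0, "text": "Another standalone"},
]


def recombine(rows):
    """Group rows by (dossier, date), then join each group's texts in index order."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["dossier"], row["date"])].append(row)
    return {
        key: "\n\n".join(r["text"] for r in sorted(parts, key=itemgetter("index")))
        for key, parts in groups.items()
    }


sources = recombine(records)
```

Documents with an index of 0 end up as single-element groups, which matches the card's note that standalone documents default to index 0.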
Provider:
PITTI
Summary of original information

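MicRou dataset overview

Introduction

The MicRou dataset consists of documents from the personal archives of Michel Rouger. They include his own work as well as work produced by other authors during projects he ran. The documents were gathered for a RAG project in memory of Michel Rouger.

Dataset contents

The MicRou repository contains two datasets in French:

  1. microu

    • Contains approximately 850 documents in French, produced between 1998 and 2020.
    • Document types include books, articles and minutes of debates.
    • Topics cover justice and law, finance and economics, management, healthcare, education, sports, history and geopolitics.
    • The total size is between 1.5 million and 2 million tokens, depending on the tokenizer used.
    • Many documents come from a larger source, such as the chapters of a book or the articles of a newsletter, and the full source can be recombined.
  2. microu-chunked

    • The microu dataset was split into chunks to fit the 512-token context window of Solon, the embeddings model used in the RAG project.
    • The chunking strategy and the related scripts are detailed in the MicRou repository.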
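
To make the 512-token constraint concrete, here is a minimal sketch of fixed-size chunking. It is not the strategy used in the MicRou repository, which documents its own approach. The function `chunk_tokens`, its `overlap` parameter, and the whitespace split standing in for the embedding model's tokenizer are all assumptions made for illustration.

```python
def chunk_tokens(tokens, max_len=512, overlap=0):
    """Split a token list into consecutive chunks of at most max_len tokens.

    Hypothetical helper: `overlap` tokens are repeated between neighbouring chunks.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last chunk already reaches the end of the document
        start += step
    return chunks


# Whitespace splitting is only a stand-in for a real tokenizer (assumption).
words = ("mot " * 1200).split()
chunks = chunk_tokens(words)
```

With a real tokenizer the token counts would differ from word counts, so the limit should be checked against the tokenizer of the embeddings model actually in use.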

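License

The dataset is currently under a restrictive license (CC-BY-NC-SA-4.0). The plan is to convert it to an open license once the review of the rights holders is complete. Some documents may be excluded following the review, and others are planned to be added over time.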
