
muellerzr/RAG-accelerate

Hugging Face, updated 2024-01-04, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/muellerzr/RAG-accelerate
Description:

---
license: apache-2.0
language:
- en
---

## Preparing the dataset

### NOTICE:

All code is owned by Hugging Face and uses the Apache 2.0 Licence. While I clean and strip the dataset for processing, do note that this dataset is under the same scrutiny as the original Apache 2.0 License.

## Clone Repo

The data source used is the [accelerate](https://github.com/huggingface/accelerate) repository. I'm using the latest version, v0.25.0:

```bash
git clone https://github.com/huggingface/accelerate
cd accelerate
git checkout v0.25.0
cd ..
mkdir docs src
mv accelerate/src/accelerate/* src
mv accelerate/docs/* docs
cd src
rm __init__.py commands/__init__.py test_utils/__init__.py utils/__init__.py
```

### Cleaning the dataset

Using regex search in VS Code, remove everything that matches the following patterns:

```regex
# Copyright(.*\n)+# limitations under the license.
```

```regex
<!--Copyright(.*\n)+-->
```

In the source, replace:

```regex
"""
```

with:

```regex
"""
```

Then remove all import statements (as we only care about the content).

Strip all blank lines and whitespace-only lines:

```regex
^(?:[\t ]*(?:\r?\n|\r))+
```

**WARNING**: It is known that this will separate out the `_inner()` in the source code and use it as a separate function, losing the context. I'm trying the dataset out with this issue for now.
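The cleaning steps above are described as manual search-and-replace in the editor. As a rough sketch, they could also be scripted in Python. The two license patterns and the blank-line pattern are copied from the card. The names `clean_source` and `IMPORT_LINE` are hypothetical, and the import pattern is an assumption, since the card does not say how imports were removed. `re.IGNORECASE` is also an assumption, added because VS Code search ignores case by default and the upstream header spells "License" with a capital L.

```python
import re

# Patterns copied from the card. IGNORECASE is an assumption (see lead-in).
PY_LICENSE = re.compile(
    r"# Copyright(.*\n)+# limitations under the license\.", re.IGNORECASE
)
DOC_LICENSE = re.compile(r"<!--Copyright(.*\n)+-->", re.IGNORECASE)
BLANK_LINES = re.compile(r"^(?:[\t ]*(?:\r?\n|\r))+", re.MULTILINE)

# Hypothetical pattern: only handles single-line `import x` and
# `from x import y` statements, not parenthesized multi-line imports.
IMPORT_LINE = re.compile(
    r"^[\t ]*(?:import[\t ]+\S.*|from[\t ]+\S+[\t ]+import[\t ]+.*)$",
    re.MULTILINE,
)


def clean_source(text: str) -> str:
    """Apply the card's cleaning steps to the contents of one file."""
    text = PY_LICENSE.sub("", text)    # drop the Python license header
    text = DOC_LICENSE.sub("", text)   # drop the docs license comment
    text = IMPORT_LINE.sub("", text)   # drop single-line imports
    return BLANK_LINES.sub("", text)   # drop blank / whitespace-only lines


sample = (
    "# Copyright 2023 The HuggingFace Team. All rights reserved.\n"
    "# See the License for the specific language governing permissions and\n"
    "# limitations under the License.\n"
    "\n"
    "import os\n"
    "from typing import Optional\n"
    "\n"
    "def f():\n"
    "    return os.getcwd()\n"
)
cleaned = clean_source(sample)
print(cleaned)
```

Because `(.*\n)+` is greedy, each license pattern runs to the last possible closing line in the file, so this sketch assumes one license header per file.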
Provider:
muellerzr
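Summary of the original information

Dataset preparation

Dataset ownership and license

  • All code in the dataset is owned by Hugging Face and uses the Apache 2.0 license.
  • The processed dataset remains under the same scrutiny as the original Apache 2.0 license.

Data source

  • The data source is the accelerate repository, at the latest version, v0.25.0.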

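Dataset cleaning steps

  1. Clone the repository and prepare the directories:

     ```bash
     git clone https://github.com/huggingface/accelerate
     cd accelerate
     git checkout v0.25.0
     cd ..
     mkdir docs src
     mv accelerate/src/accelerate/* src
     mv accelerate/docs/* docs
     cd src
     rm __init__.py commands/__init__.py test_utils/__init__.py utils/__init__.py
     ```

  2. Clean the data with regular expressions

    • Remove the copyright notices:

      ```regex
      # Copyright(.*\n)+# limitations under the license.
      ```

      ```regex
      <!--Copyright(.*\n)+-->
      ```

    • Remove blank lines from the source code:

      ```regex
      ^(?:[\t ]*(?:\r?\n|\r))+
      ```

    • Remove all import statements (only the content matters).

  3. Caveat

    • The cleaning process separates out the _inner() function, which loses its context. For now, the dataset is being tried out with this known issue.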
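
The `_inner()` caveat concerns functions defined inside other functions. As a hedged illustration only, and not the author's method, the standard `ast` module can list such nested definitions so that affected functions are visible before any splitting. The helper name `find_nested_functions` is hypothetical.

```python
import ast


def find_nested_functions(source: str) -> list[tuple[str, str]]:
    """Return (outer, inner) name pairs for functions defined inside functions."""
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    pairs = []
    for outer in ast.walk(ast.parse(source)):
        if not isinstance(outer, func_types):
            continue
        # ast.walk yields the node itself first, so skip it explicitly.
        for child in ast.walk(outer):
            if child is not outer and isinstance(child, func_types):
                pairs.append((outer.name, child.name))
    return pairs


sample = '''
def decorator(func):
    def _inner(*args, **kwargs):
        return func(*args, **kwargs)
    return _inner
'''
nested = find_nested_functions(sample)
print(nested)  # [('decorator', '_inner')]
```

Each pair records which enclosing function a nested definition belongs to, which is the context the caveat says gets lost.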