muellerzr/RAG-accelerate
Source: Hugging Face. Updated 2024-01-04; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/muellerzr/RAG-accelerate
Resource description:
---
license: apache-2.0
language:
- en
---
## Preparing the dataset
### NOTICE:
All code is owned by Hugging Face and uses the Apache 2.0 License. While I clean and strip the dataset for processing, note that this dataset remains subject to the same terms as the original Apache 2.0 License.
## Clone Repo
The data source is the [accelerate](https://github.com/huggingface/accelerate) repository. I'm using v0.25.0, the latest version at the time of writing.
```bash
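# Clone the repository and pin it to the v0.25.0 tag
git clone https://github.com/huggingface/accelerate
cd accelerate
git checkout v0.25.0
cd ..
# Move only the library source and the documentation out of the clone
mkdir docs src
mv accelerate/src/accelerate/* src
mv accelerate/docs/* docs
# Remove the package __init__.py files
cd src
rm __init__.py commands/__init__.py test_utils/__init__.py utils/__init__.py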
```
### Cleaning the dataset
Using regex find-and-replace in VS Code, replace each match of the following patterns with an empty string to remove the copyright headers:
```regex
# Copyright(.*\n)+# limitations under the license.
```
```regex
<!--Copyright(.*\n)+-->
```
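For a scripted run outside the editor, the two header patterns above could be applied with Python's `re` module. This is a minimal sketch, not the method used to build the dataset: the function name `strip_license_headers` is hypothetical, the trailing `.` of the first pattern is escaped, and `re.IGNORECASE` is an assumption added because Apache headers are commonly written "under the License." with a capital L.

```python
import re

# Same patterns as above. IGNORECASE is an assumption (see the note above).
PY_HEADER = re.compile(
    r"# Copyright(.*\n)+# limitations under the license\.", re.IGNORECASE
)
DOC_HEADER = re.compile(r"<!--Copyright(.*\n)+-->")


def strip_license_headers(text: str) -> str:
    """Remove '# Copyright ...' and '<!--Copyright ... -->' header blocks."""
    text = PY_HEADER.sub("", text)
    return DOC_HEADER.sub("", text)
```

Because `(.*\n)+` is greedy, each pattern runs to the last matching terminator in the file, so the output is worth spot-checking on files that contain more than one such line.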
In the source, replace:
```regex
"""
```
with:
```regex
"""
```
Then remove all import statements (as we only care about the content).
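The README does not say how the imports are removed. One way to script this step, assuming every file is valid Python that `ast.parse` accepts, is to drop the lines spanned by each `import` and `from ... import` statement. The function name `remove_imports` is hypothetical, and this is only a sketch under those assumptions.

```python
import ast


def remove_imports(source: str) -> str:
    """Drop every line that belongs to an import statement, at any nesting level."""
    tree = ast.parse(source)
    drop = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # end_lineno covers multi-line, parenthesized imports (Python 3.8+).
            drop.update(range(node.lineno, node.end_lineno + 1))
    lines = source.splitlines(keepends=True)
    return "".join(
        line for number, line in enumerate(lines, start=1) if number not in drop
    )
```

Whole lines are removed, so a line such as `import os; x = 1` would lose the assignment too, and a block that contained only imports is left empty. That is acceptable here only because the output is treated as text content rather than runnable code.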
Strip all blank lines, including lines that contain only spaces or tabs:
```regex
^(?:[\t ]*(?:\r?\n|\r))+
```
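The same pattern can be applied in Python as long as `re.MULTILINE` is set, so that `^` matches at the start of every line rather than only at the start of the text. A minimal sketch, with the hypothetical name `strip_blank_lines`:

```python
import re

# MULTILINE makes ^ match at the start of each line, not just the whole string.
BLANK_LINES = re.compile(r"^(?:[\t ]*(?:\r?\n|\r))+", re.MULTILINE)


def strip_blank_lines(text: str) -> str:
    """Delete runs of empty or whitespace-only lines."""
    return BLANK_LINES.sub("", text)
```

Lines that hold real content keep their own indentation and line ending; only lines made up entirely of spaces, tabs, and a line break are removed.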
**WARNING**: This is known to separate out the `_inner()` function in the source code and treat it as a separate function, losing its context. I am trying the dataset out with this issue for now.
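The README does not say how the separation happens, so the following is only a sketch for finding the places it could affect, not a fix. It lists every function that is defined somewhere inside another function, assuming the file is valid Python; the name `nested_functions` is hypothetical.

```python
import ast

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def nested_functions(source: str) -> list[tuple[str, str]]:
    """Return (outer, inner) name pairs for functions defined inside functions."""
    pairs = []
    for outer in ast.walk(ast.parse(source)):
        if not isinstance(outer, FUNC_TYPES):
            continue
        for child in ast.walk(outer):
            # ast.walk yields the node itself first, so skip it.
            if child is not outer and isinstance(child, FUNC_TYPES):
                pairs.append((outer.name, child.name))
    return pairs
```

Run over the files in `src`, the pairs show which outer function each inner function such as `_inner()` belongs to, which is the context the warning says is lost.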
Provider:
muellerzr
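## Summary of original information
### Preparing the dataset
#### Dataset ownership and license
- The code in the dataset is owned by Hugging Face and uses the Apache 2.0 License.
- After processing, the dataset remains subject to the same terms as the original Apache 2.0 License.
#### Data source
- The data comes from the accelerate repository at v0.25.0, described as the latest version.
#### Dataset cleaning steps
- Clone the repository and prepare the directories:
```bash
git clone https://github.com/huggingface/accelerate
cd accelerate
git checkout v0.25.0
cd ..
mkdir docs src
mv accelerate/src/accelerate/* src
mv accelerate/docs/* docs
cd src
rm __init__.py commands/__init__.py test_utils/__init__.py utils/__init__.py
```
- Clean the data with regular expressions:
  - Remove copyright notices: `# Copyright(.*\n)+# limitations under the license.` and `<!--Copyright(.*\n)+-->`
  - Strip blank lines from the source: `^(?:[\t ]*(?:\r?\n|\r))+`
- Remove all import statements (only the content matters).
#### Notes
- Cleaning separates out the `_inner()` function, which loses its context; the dataset is being tried out with this issue for now.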