PJMixers-Dev/PocketDoc_Dans-Prosemaxx-Cowriter-XL-8192-shrunk-l3
Source: Hugging Face. Updated: 2024-10-11. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/PJMixers-Dev/PocketDoc_Dans-Prosemaxx-Cowriter-XL-8192-shrunk-l3
Description:
---
language:
- en
---
```py
import json
from tqdm import tqdm
from transformers import AutoTokenizer
import re
import pandas as pd


def load_json_or_jsonl(file_path):
    try:
        with open(file_path, "r") as file:
            try:
                # Try loading the entire file as JSON
                data = json.load(file)
                return data
            except json.JSONDecodeError:
                # If loading as JSON fails, try loading as JSON Lines
                file.seek(0)  # Reset file pointer to the beginning
                lines = file.readlines()
                json_lines_data = []
                for line in lines:
                    try:
                        item = json.loads(line.strip())
                        json_lines_data.append(item)
                    except json.JSONDecodeError as e:
                        print(f"Error decoding JSON in line: {e}")
                return json_lines_data
    except FileNotFoundError:
        print(f"File not found: {file_path}")
        return None


def reg_check(string):
    basic_slop = [
"haze of pleasure",
"finds solace in",
"reveling in the satisfaction",
"with each breath",
"a delicate dance",
"wet flesh",
"sensitive flesh",
"\\bministration(|s)\\b",
"audible pop",
"rivulets",
"admit it",
"the ball is in your court",
"the game is on",
"the choice is yours",
"i don't bite... unless you want me to",
"half-lidded eyes",
"(he|she|they) worries (his|her|their) bottom lip",
"warring with",
"arousal pooling",
"take your pleasure",
"(he|she|they) fiddles with the hem of (his|her|their) (skirt|shirt)",
"kiss-bruised lips",
"bruising kiss",
"despite (himself|herself|themselves|themself)",
"yours to take",
"\\bwanton\\b",
"reckless abandon",
"torn between",
"knuckles turning white",
"grins wickedly",
"fiery red hair",
"long lashes",
"propriety be damned",
"the world narrows",
"pupils blown wide with pleasure",
"chestnut eyes",
"(he|she|they) grasps your chin and forces you to meet (his|her|their) gaze",
"(he|she|they) bites your ear",
"nails raking angry red lines down your back",
"(her|his) cheeks flaming",
"cheeks hollowing",
"stars burst behind (his|her) eyes",
"inner walls clenching around nothing",
"puckered hole",
"wet heat",
"(he|she) whimpers, biting (his|her) lip",
"dusky nipples",
"slick fold(|s)",
"still lodged deep inside (his|her)",
"heart, body and soul belong to you",
"the night is still young",
"\\.\\.\\.for now\\b",
"whether you like it not",
"without waiting for response",
"however, (its|it is|it's) important",
"important to remember that",
"once upon",
"nestled deep within",
"an ethereal beauty",
"breathless and eager",
"whispering words of passion",
"soft and gentle",
"shivers (\\w+\\s+)?down",
"dance of pleasure",
"(his|her) sex",
"sent (shockwaves|shock waves)",
"in a rhythm",
"wild abandon",
"exhausted and spent",
"life would never be the same again",
"like an electric shock",
"threatens to consume",
"what (seemed|felt) like an eternity",
"(lay|lie) ahead",
"\\bwet pop\\b",
"maybe, just maybe",
"perhaps, just perhaps",
"starts to blur",
"but it felt like",
"unfamiliar, yet",
"moist fold(|s)",
"the night is still young",
"our shared experiences",
"bond(|s) built on mutual trust",
"the ball is in your court",
"little did (he|she|they) know",
"a pregnant silence",
"beats like a (\\w+\\s+)?drum",
"\\bpert\\b",
"for the sake of keeping things",
"her breasts heaving with desire",
"dickick",
"\\brivulets\\b",
"arousal pooling in (his|her|their) belly",
"steeling (her|him)self",
"the din of the crowd",
"journey of mutual understanding",
"revulsion warred with (reluctant|reluctance)",
"her bare mound(|s)",
"pooled around her (ankles|feet)",
"straddles your (waist|lap)",
"words turn into a purr",
"grips like a vice",
"shivers running up",
"arched spine",
"penetrated to the hilt",
"the pressure in (her|his) loins",
"catch my drift",
"sway(|s) hypnotically",
"tantalizing promise",
"with each slow, deliberate movement",
"for what (felt|seemed) like (hours|an eternity|forever)",
", but (he|she|they|I) can't help it",
"conspiratorial whisper(|s)",
"whisper(|ing) conspiratorially"
    ]
    sus = [
"\\bloli\\b",
"\\bcunny\\b",
"\\bchild\\b",
"\\bkid\\b",
"\\btoddler\\b",
"\\binfant\\b",
"\\bbaby\\b",
"\\bkindergarten\\b",
"\\bkindergarden\\b",
    ]
    for pattern in basic_slop:
        if re.search(pattern, string.lower()):
            return True
    for pattern in sus:
        if re.search(pattern, string.lower()):
            return True
    return False


def shrink_sharegpt(
    sharegpt_file,
    output_file,
    max_length
):
    # Subtract 2 to allow room for BOS and EOS
    max_length = max_length - 2

    json_data = []

    sharegpt_data = load_json_or_jsonl(sharegpt_file)
    # load_json_or_jsonl returns None when the file is missing
    if sharegpt_data is None:
        return

    for sample_index, sample in tqdm(pd.DataFrame(sharegpt_data).iterrows(), total=len(sharegpt_data)):
        sample_length = 0
        new_sample_data = []
        for turn in sample["conversations"]:
            if turn["from"] == "system":
                turn_name = "system"
            elif turn["from"] == "human":
                turn_name = "user"
            elif turn["from"] == "gpt":
                turn_name = "assistant"
            else:
                raise ValueError(f"Unknown 'from': {turn['from']!r}")

            # Skip any samples which contain stuff we don't want
            if reg_check(turn['value']):
                new_sample_data = []
                break

            # Uses the module-level `tokenizer` created in the __main__ block
            turn_length = len(
                tokenizer(
                    f"<|start_header_id|>{turn_name}<|end_header_id|>\n\n"
                    f"{turn['value']}<|eot_id|>",
                    add_special_tokens=False
                )["input_ids"]
            )

            if sample_length + turn_length <= max_length:
                sample_length += turn_length
                new_sample_data.append(turn)
            else:
                break

        # Check if there's less than 2 turns
        if len(new_sample_data) < 2:
            continue

        # Don't end on a user turn (the emptiness guard avoids an IndexError
        # if every remaining turn is a 'human' turn)
        while new_sample_data and new_sample_data[-1]["from"] == "human":
            del new_sample_data[-1]

        # Again check if there's less than 2 turns, this time after possibly removing 'human' turns
        if len(new_sample_data) < 2:
            continue

        json_data.append({"conversations": new_sample_data})

    df = pd.DataFrame(json_data)
    df.to_parquet(output_file, index=False)


if __name__ == "__main__":
    source_file = "./downloaded_datasets/PocketDoc_Dans-Prosemaxx-Cowriter-XL.json"
    output_file = "./downloaded_datasets/PocketDoc_Dans-Prosemaxx-Cowriter-XL-8192-shrunk-l3.parquet"
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

    shrink_sharegpt(
        source_file,
        output_file,
        max_length=8192
    )
```
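The matching behaviour of `reg_check` can be tried in isolation. The sketch below is not part of the original script: `matches_any` is a hypothetical helper, and it uses only three patterns copied from the lists above to show how lowercasing, the `\b` word boundaries, and the optional-word group behave.

```python
import re

# Three patterns copied from the script's lists; the subset is illustrative only.
PATTERNS = [
    "once upon",
    "\\bkid\\b",
    "shivers (\\w+\\s+)?down",
]


def matches_any(text, patterns=PATTERNS):
    """Return True if any pattern matches the lowercased text (hypothetical helper)."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in patterns)


# Lowercasing the input makes the match case-insensitive.
print(matches_any("Once upon a time"))       # True
# \b keeps "kid" from matching inside a longer word.
print(matches_any("The kidney was fine."))   # False
# The optional group allows one extra word between "shivers" and "down".
print(matches_any("Shivers ran down"))       # True
```

Because the text is lowercased before matching, a pattern only matches if it is itself written in lowercase.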
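This script filters and trims a conversation dataset in ShareGPT format. It works as follows:
1. `load_json_or_jsonl(file_path)` loads a dataset file in JSON or JSON Lines format. It first tries to parse the whole file as a single JSON document. If that fails, it resets the file pointer and parses the file line by line as JSON Lines, reporting and skipping any line that does not parse. If the file does not exist, it prints a message and returns `None`.
2. `reg_check(string)` checks whether the input text matches any unwanted pattern. It holds two lists of regular expressions: `basic_slop` matches stock, overused phrases, many of them suggestive or explicit, and `sus` matches words that refer to minors. The text is lowercased before matching. The function returns `True` if any pattern matches and `False` otherwise.
3. `shrink_sharegpt(sharegpt_file, output_file, max_length)` trims and filters the input ShareGPT dataset:
   - It subtracts 2 from the maximum token length to leave room for the beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens.
   - It loads the input dataset and iterates over every sample, processing each turn in order:
     - It maps the role labels: `system` stays `system`, `human` becomes `user`, and `gpt` becomes `assistant`. An unknown role stops the script.
     - It runs the pattern check on the turn's text. If any pattern matches, the whole sample is discarded.
     - It computes the token length of the turn, including the Llama 3 header and `<|eot_id|>` wrapper, and keeps the turn while the running total stays within the limit. The first turn that would exceed the limit, and every turn after it, is dropped.
   - It discards samples with fewer than 2 turns, removes trailing user turns so that no sample ends on a user turn, and then checks the turn count again.
   - It collects the surviving samples into a DataFrame and writes it to the output file in Parquet format.
4. The entry point sets the input path to `./downloaded_datasets/PocketDoc_Dans-Prosemaxx-Cowriter-XL.json` and the output path to `./downloaded_datasets/PocketDoc_Dans-Prosemaxx-Cowriter-XL-8192-shrunk-l3.parquet`, loads the `AutoTokenizer` for `meta-llama/Meta-Llama-3-8B-Instruct`, and sets the maximum token length to 8192.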
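
The trimming rule in step 3 can be sketched without downloading the Llama 3 tokenizer. The example below is an illustration under stated assumptions, not the script itself: `trim_conversation` and `count_tokens` are hypothetical names, whitespace-separated words stand in for real token counts, and the header and `<|eot_id|>` wrapper that the script also counts are ignored.

```python
def count_tokens(text):
    # Stand-in for the real tokenizer: one "token" per whitespace-separated word.
    return len(text.split())


def trim_conversation(turns, max_length):
    """Keep the leading turns that fit in the budget, then drop trailing 'human' turns.

    Returns None when fewer than 2 turns survive (hypothetical helper).
    """
    budget = max_length - 2  # leave room for BOS and EOS, as in the script
    kept, used = [], 0
    for turn in turns:
        length = count_tokens(turn["value"])
        if used + length > budget:
            break  # the first turn that overflows ends the sample
        used += length
        kept.append(turn)

    if len(kept) < 2:
        return None
    # Do not end on a user turn.
    while kept and kept[-1]["from"] == "human":
        kept.pop()
    return kept if len(kept) >= 2 else None


conversation = [
    {"from": "system", "value": "You are a cowriter."},
    {"from": "human", "value": "Start a story about a lighthouse."},
    {"from": "gpt", "value": "The lamp turned slowly over the bay."},
    {"from": "human", "value": "Continue with the storm."},
    {"from": "gpt", "value": "Rain came in sideways and the keeper climbed the stairs again."},
]

trimmed = trim_conversation(conversation, max_length=24)
print([turn["from"] for turn in trimmed])  # ['system', 'human', 'gpt']
```

With a budget of 22 words, the first four turns fit (21 words) and the fifth does not. The fourth turn is a user turn, so it is removed as well, leaving a sample that ends on an assistant reply.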
Provided by:
PJMixers-Dev



