GenQA-academic-filtered-sharegpt

Name: GenQA-academic-filtered-sharegpt
Creator: maas
Published: 2025-12-05 11:47:30
License: 暂无描述

魔搭社区2025-12-05 更新2025-12-06 收录

下载链接：

https://modelscope.cn/datasets/Weyaxi/GenQA-academic-filtered-sharegpt

下载链接

链接失效反馈

官方服务：

资源简介：

# 🔄 GenQA-academic-filtered This dataset is a filtered version of the **"academic"** split of [tomg-group-umd/GenQA](https://hf.co/datasets/tomg-group-umd/GenQA). # 🔍 Filtering proccess: 1. Remove the columns **"idx"** and **"category"** from the dataset. 2. Filter the dataset to keep only the examples where the **"template"** field is equal to **"topic"**. 3. Further filter the dataset to retain only the examples where the **"prompt"** field starts with **"Write a complex question from the domain of"**. 4. Use a regular expression to extract the domain from the **"prompt"** field. The domain is the part of the text that follows **"Write a complex question from the domain of"** and ends with a period. After that replace the original **"prompt"** field with the extracted domain. 5. Rename the **"prompt"** and **"text"** columns to **"extracted_topic"** and **"conversations"**. 6. Remove the **"template"** column from the dataset. # 🖥️ Code to filter ```python from datasets import load_dataset import re dataset = load_dataset("tomg-group-umd/GenQA", split="academic", num_proc=2) dataset = dataset.remove_columns(["idx", "category"]) dataset = dataset.filter(lambda x: x["template"] == "topic", num_proc=8) txt = "Write a complex question from the domain of" dataset = dataset.filter(lambda example: example["prompt"].startswith(txt), num_proc=8) pattern = r'Write a complex question from the domain of (.*?)\.' def extract_domain(example): match = re.search(pattern, example['prompt']) if match: domain = match.group(1) example['prompt'] = domain return example dataset_topic_list = dataset.map(extract_domain, num_proc=8) dataset_topic_list = dataset_topic_list.rename_column("prompt", "extracted_topic") dataset_topic_list = dataset_topic_list.rename_column("text", "conversations") dataset_topic_list = dataset_topic_list.remove_columns(["template"]) dataset_topic_list.push_to_hub("Weyaxi/GenQA-academic-filtered") ```

# 🔄 过滤型GenQA学术数据集本数据集为[tomg-group-umd/GenQA](https://hf.co/datasets/tomg-group-umd/GenQA)数据集的**「学术」**子集的过滤版本。 # 🔍 过滤流程： 1. 移除数据集中的**「idx」**与**「category」**列。 2. 对数据集进行过滤，仅保留**「template」**字段值为**「topic」**的样本。 3. 进一步过滤数据集，仅保留**「prompt」**字段以`Write a complex question from the domain of`开头的样本。 4. 利用正则表达式从**「prompt」**字段中提取领域信息：领域为`Write a complex question from the domain of`之后、以句号结尾的文本片段。随后将**「prompt」**字段替换为提取得到的领域信息。 5. 将**「prompt」**与**「text」**列分别重命名为**「extracted_topic」**与**「conversations」**。 6. 移除数据集中的**「template」**列。 # 🖥️ 过滤代码 python from datasets import load_dataset import re dataset = load_dataset("tomg-group-umd/GenQA", split="academic", num_proc=2) dataset = dataset.remove_columns(["idx", "category"]) dataset = dataset.filter(lambda x: x["template"] == "topic", num_proc=8) txt = "Write a complex question from the domain of" dataset = dataset.filter(lambda example: example["prompt"].startswith(txt), num_proc=8) pattern = r'Write a complex question from the domain of (.*?)\.' def extract_domain(example): match = re.search(pattern, example['prompt']) if match: domain = match.group(1) example['prompt'] = domain return example dataset_topic_list = dataset.map(extract_domain, num_proc=8) dataset_topic_list = dataset_topic_list.rename_column("prompt", "extracted_topic") dataset_topic_list = dataset_topic_list.rename_column("text", "conversations") dataset_topic_list = dataset_topic_list.remove_columns(["template"]) dataset_topic_list.push_to_hub("Weyaxi/GenQA-academic-filtered")

提供机构：

maas

创建时间：

2025-08-29

5,000+

优质数据集

54 个

任务类型

进入经典数据集