GenQA-academic-filtered-sharegpt

Source: ModelScope community. Updated 2025-12-05, indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/Weyaxi/GenQA-academic-filtered-sharegpt
Description:
# 🔄 GenQA-academic-filtered

This dataset is a filtered version of the **"academic"** split of [tomg-group-umd/GenQA](https://hf.co/datasets/tomg-group-umd/GenQA).

# 🔍 Filtering process:

1. Remove the columns **"idx"** and **"category"** from the dataset.
2. Filter the dataset to keep only the examples where the **"template"** field is equal to **"topic"**.
3. Further filter the dataset to retain only the examples where the **"prompt"** field starts with **"Write a complex question from the domain of"**.
4. Use a regular expression to extract the domain from the **"prompt"** field. The domain is the text that follows **"Write a complex question from the domain of"** and ends at the first period. Then replace the original **"prompt"** value with the extracted domain.
5. Rename the **"prompt"** and **"text"** columns to **"extracted_topic"** and **"conversations"**, respectively.
6. Remove the **"template"** column from the dataset.

# 🖥️ Code to filter

```python
import re

from datasets import load_dataset

# Load the "academic" split of the source dataset.
dataset = load_dataset("tomg-group-umd/GenQA", split="academic", num_proc=2)

# Step 1: drop the columns that are not needed.
dataset = dataset.remove_columns(["idx", "category"])

# Steps 2 and 3: keep "topic" examples whose prompt starts with the expected prefix.
dataset = dataset.filter(lambda x: x["template"] == "topic", num_proc=8)
txt = "Write a complex question from the domain of"
dataset = dataset.filter(lambda example: example["prompt"].startswith(txt), num_proc=8)

# Step 4: replace each prompt with the domain it names (text up to the first period).
pattern = r'Write a complex question from the domain of (.*?)\.'

def extract_domain(example):
    match = re.search(pattern, example['prompt'])
    if match:
        domain = match.group(1)
        example['prompt'] = domain
    return example

dataset_topic_list = dataset.map(extract_domain, num_proc=8)

# Steps 5 and 6: rename the columns and drop "template".
dataset_topic_list = dataset_topic_list.rename_column("prompt", "extracted_topic")
dataset_topic_list = dataset_topic_list.rename_column("text", "conversations")
dataset_topic_list = dataset_topic_list.remove_columns(["template"])

dataset_topic_list.push_to_hub("Weyaxi/GenQA-academic-filtered")
```
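To see what steps 3 and 4 do to a single prompt without downloading the dataset, the prefix check and the regular expression can be tried on plain strings. The following is a minimal sketch, not part of the original pipeline: the helper name `extract_topic` and the three sample prompts are made up for illustration and are not taken from GenQA.

```python
import re

PREFIX = "Write a complex question from the domain of"
# Same pattern as in the filtering code: a lazy match up to the first period.
PATTERN = re.compile(r"Write a complex question from the domain of (.*?)\.")

def extract_topic(prompt):
    """Return the domain named in the prompt, or None if the prompt does not qualify."""
    if not prompt.startswith(PREFIX):
        return None
    match = PATTERN.search(prompt)
    return match.group(1) if match else None

# Hypothetical prompts, invented for this example only.
samples = [
    "Write a complex question from the domain of Organic Chemistry. Then answer it.",
    "Write a complex question from the domain of U.S. History. Then answer it.",
    "Summarize the following passage.",
]
results = [extract_topic(p) for p in samples]
for prompt, topic in zip(samples, results):
    print(repr(topic), "<-", prompt)
```

Because the capture group is lazy and stops at the first period, a domain that itself contains a period (the second sample) would be cut short. Whether such prompts occur in the source data is not stated here, so it may be worth spot-checking the `extracted_topic` column for very short values.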

Provider:
maas
Created:
2025-08-29