GenQA-academic-filtered-sharegpt
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/Weyaxi/GenQA-academic-filtered-sharegpt
Description:
# 🔄 GenQA-academic-filtered
This dataset is a filtered version of the **"academic"** split of [tomg-group-umd/GenQA](https://hf.co/datasets/tomg-group-umd/GenQA).
# 🔍 Filtering process:
1. Remove the columns **"idx"** and **"category"** from the dataset.
2. Filter the dataset to keep only the examples where the **"template"** field is equal to **"topic"**.
3. Further filter the dataset to retain only the examples where the **"prompt"** field starts with **"Write a complex question from the domain of"**.
4. Use a regular expression to extract the domain from the **"prompt"** field. The domain is the text that follows **"Write a complex question from the domain of"** and runs up to the first period. Then replace the original **"prompt"** value with the extracted domain; prompts that do not match the pattern are left unchanged.
5. Rename the **"prompt"** and **"text"** columns to **"extracted_topic"** and **"conversations"**, respectively.
6. Remove the **"template"** column from the dataset.
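Step 4 is the only step whose behavior depends on the prompt text, so it may help to see the regular expression applied to plain strings first. The sketch below uses only the standard library; the two sample prompts are hypothetical and are not taken from GenQA. It illustrates that the non-greedy group stops at the first period, so a domain that itself contains a period would be cut short.

```python
import re

# Same pattern as in the filtering steps above.
PATTERN = r"Write a complex question from the domain of (.*?)\."


def extract_domain(prompt: str) -> str:
    """Return the captured domain, or the prompt unchanged if nothing matches."""
    match = re.search(PATTERN, prompt)
    return match.group(1) if match else prompt


# Hypothetical prompts, written only for illustration.
samples = [
    "Write a complex question from the domain of Organic Chemistry. Then answer it.",
    "Write a complex question from the domain of U.S. History.",
]

extracted = [extract_domain(p) for p in samples]
print(extracted)  # ['Organic Chemistry', 'U']
```

The second result shows the truncation: the capture ends at the period after "U". Whether such prompts actually occur in the source data is not stated here, so it is worth spot-checking unusually short values in **"extracted_topic"**.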
# 🖥️ Code to filter
```python
import re

from datasets import load_dataset

# Load the "academic" split of GenQA.
dataset = load_dataset("tomg-group-umd/GenQA", split="academic", num_proc=2)

# Step 1: drop the columns that are not needed.
dataset = dataset.remove_columns(["idx", "category"])

# Step 2: keep only examples whose template is "topic".
dataset = dataset.filter(lambda x: x["template"] == "topic", num_proc=8)

# Step 3: keep only prompts that start with the expected prefix.
txt = "Write a complex question from the domain of"
dataset = dataset.filter(lambda example: example["prompt"].startswith(txt), num_proc=8)

# Step 4: replace each prompt with the domain captured up to the first period.
pattern = r'Write a complex question from the domain of (.*?)\.'


def extract_domain(example):
    match = re.search(pattern, example['prompt'])
    if match:
        domain = match.group(1)
        example['prompt'] = domain
    return example


dataset_topic_list = dataset.map(extract_domain, num_proc=8)

# Steps 5 and 6: rename the columns, drop "template", then upload the result.
dataset_topic_list = dataset_topic_list.rename_column("prompt", "extracted_topic")
dataset_topic_list = dataset_topic_list.rename_column("text", "conversations")
dataset_topic_list = dataset_topic_list.remove_columns(["template"])
dataset_topic_list.push_to_hub("Weyaxi/GenQA-academic-filtered")
```
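To check the resulting schema without downloading the dataset, the same six steps can be dry-run on plain dictionaries. This is a minimal sketch under stated assumptions, not the author's code: the rows, the `"subtopic"` template value, and the placeholder `"text"` strings are all hypothetical, and the real **"text"** field may have a different structure.

```python
import re

PREFIX = "Write a complex question from the domain of"
PATTERN = re.compile(PREFIX + r" (.*?)\.")

# Hypothetical rows that only mimic the column names used by the filter script.
rows = [
    {"idx": 0, "category": "a", "template": "topic",
     "prompt": PREFIX + " Linguistics.", "text": "<conversation placeholder>"},
    {"idx": 1, "category": "a", "template": "subtopic",  # dropped in step 2
     "prompt": PREFIX + " Biology.", "text": "<conversation placeholder>"},
    {"idx": 2, "category": "a", "template": "topic",     # dropped in step 3
     "prompt": "Ask something about geology.", "text": "<conversation placeholder>"},
]


def transform(row: dict) -> dict:
    """Steps 1 and 4-6: extract the topic and keep only the two renamed columns."""
    match = PATTERN.search(row["prompt"])
    topic = match.group(1) if match else row["prompt"]
    return {"extracted_topic": topic, "conversations": row["text"]}


# Steps 2 and 3: filter on the template value and on the prompt prefix.
filtered = [
    transform(row)
    for row in rows
    if row["template"] == "topic" and row["prompt"].startswith(PREFIX)
]

print(filtered)
```

Only the first row survives both filters, and it carries exactly the two columns **"extracted_topic"** and **"conversations"**.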
Provider:
maas
Created:
2025-08-29



