sidovic/LearningQ-qg

Name: sidovic/LearningQ-qg
Creator: sidovic
Published: 2023-08-31 14:23:06
License: No description available

Hugging Face · Updated 2023-08-31 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/sidovic/LearningQ-qg


Resource description:

---
license: unknown
task_categories:
- text-generation
language:
- en
tags:
- question generation
pretty_name: LeaningQ-qg
size_categories:
- 100K<n<1M
train-eval-index:
- config: plain_text
  task: question-generation
  task_id: extractive_question_generation
  splits:
    train_split: train
    eval_split: validation
    test_split: test
  col_mapping:
    context: context
    questionsrc: question source
    question: question
  metrics:
  - type: squad
    name: SQuAD
dataset_info:
  features:
  - name: context
    dtype: string
  - name: questionsrc
    dtype: string
  - name: question
    dtype: string
  config_name: plain_text
  splits:
  - name: train
    num_examples: 188660
  - name: validation
    num_examples: 20630
  - name: test
    num_examples: 18227
---

# Dataset Card for LearningQ-qg

## Dataset Description

- **Repository:** [GitHub](https://github.com/AngusGLChen/LearningQ#readme)
- **Paper:** [LearningQ: A Large-scale Dataset for Educational Question Generation](https://ojs.aaai.org/index.php/ICWSM/article/view/14987/14837)
- **Point of Contact:** s.lamri@univ-bouira.dz

### Dataset Summary

LearningQ is a challenging educational question generation dataset containing over 230K document-question pairs, created by Guanliang Chen, Jie Yang, Claudia Hauff and Geert-Jan Houben. It includes 7K instructor-designed questions that assess the knowledge concepts being taught and 223K learner-generated questions that seek an in-depth understanding of those concepts.

This new version was collected and corrected by [Sidali Lamri](https://dz.linkedin.com/in/sidali-lamri), who fixed more than 50,000 errors spanning more than 1,500 error types.

### Use the dataset

```python
from datasets import load_dataset

# Download (or load from the local cache) all three splits.
lq_dataset = load_dataset("sidovic/LearningQ-qg")
# Inspect one training example.
print(lq_dataset["train"][1])
# Number of examples in the train, validation and test splits.
print(len(lq_dataset["train"]), len(lq_dataset["validation"]), len(lq_dataset["test"]))
```

### Supported Tasks and Leaderboards

Question generation

### Languages

English

## Dataset Structure

### Data Instances

An example looks as follows.

```
{
  "context": "This is a test context.",
  "questionsrc": "test context",
  "question": "Is this a test?"
}
```

### Data Fields

The data fields are the same among all splits.

- `context`: a `string` feature.
- `questionsrc`: a `string` feature.
- `question`: a `string` feature.

### Data Splits

| name      | train  | validation | test  |
|-----------|-------:|-----------:|------:|
| LearningQ | 188660 |      20630 | 18227 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```
{
  author = {Sidali Lamri},
  title = {new LearningQ version for Question generation in transformers},
  year = {2023}
}

@paper{ICWSM18LearningQ,
  author = {Guanliang Chen, Jie Yang, Claudia Hauff and Geert-Jan Houben},
  title = {LearningQ: A Large-scale Dataset for Educational Question Generation},
  conference = {International AAAI Conference on Web and Social Media},
  year = {2018}
}
```

### Contributions

[More Information Needed]
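The `train-eval-index` metadata of the card above lists a metric of type `squad`, but the card does not say how it is applied to generated questions. As a rough illustration only, the sketch below implements the usual SQuAD-style exact match and token-overlap F1 in plain Python, under the assumption that a generated question is compared against the reference `question` string. The function names are hypothetical and are not part of the dataset or of any library.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_text(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True when both strings are identical after normalization."""
    return normalize_text(prediction) == normalize_text(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_text(prediction).split()
    ref_tokens = normalize_text(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Assumed usage: score a generated question against the reference question.
score = token_f1("Is this a test?", "Is this an exam?")
print(round(score, 3))
```

Token overlap rewards surface similarity, so two differently worded but equally valid questions can still score low; whether this metric suits a given question generation experiment is a separate judgment call.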


Provided by:

sidovic

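Summary of original information

Dataset overview

Dataset description

Name: LearningQ-qg
Language: English
Task category: Text generation
Tag: Question generation

Dataset details

License: Unknown
Size: 100K<n<1M
Configuration name: plain_text
Train/validation/test splits:
- Train: 188,660 examples
- Validation: 20,630 examples
- Test: 18,227 examples
Features:
- context: string
- questionsrc: string
- question: string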
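
The split sizes listed above can be turned into proportions with a few lines of arithmetic. The sketch below is only a convenience calculation; the variable and function names are illustrative and do not come from the dataset.

```python
# Split sizes as listed in the dataset details above.
SPLIT_SIZES = {"train": 188660, "validation": 20630, "test": 18227}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_examples = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

print(f"total: {total_examples}")
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

The three splits sum to 227,517 examples, roughly an 83/9/8 division. That total is slightly below the "over 230K" figure quoted for LearningQ; the source does not explain the difference.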

Using the dataset

Loading the dataset (Python):

    from datasets import load_dataset
    lq_dataset = load_dataset("sidovic/LearningQ-qg")
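
When called without a `split` argument, `load_dataset` returns a dictionary-like object keyed by split name. The sketch below wraps the loading call and adds a small helper that counts rows per split. `load_learningq`, `summarize_splits` and the toy rows are hypothetical names and data introduced here for illustration, and the helper is exercised on a plain dictionary so it runs without network access.

```python
def load_learningq(repo_id="sidovic/LearningQ-qg"):
    """Load every split from the Hub (needs `datasets` installed and network access)."""
    # Imported lazily so summarize_splits can be used without the package.
    from datasets import load_dataset
    return load_dataset(repo_id)


def summarize_splits(dataset):
    """Return {split name: number of rows} for any mapping of sized splits."""
    return {name: len(rows) for name, rows in dataset.items()}


# Offline stand-in with the same shape as the loaded object (made-up rows).
toy_dataset = {
    "train": [{"context": "c1", "questionsrc": "s1", "question": "q1"}] * 3,
    "validation": [{"context": "c2", "questionsrc": "s2", "question": "q2"}],
}
print(summarize_splits(toy_dataset))
```

With network access, `summarize_splits(load_learningq())` should report the train, validation and test sizes given in the dataset details.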

Dataset structure

Data instance (JSON): { "context": "string", "questionsrc": "string", "question": "string" }
Data fields:
- context: string
- questionsrc: string
- question: string
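
Given the three string fields above, one plausible way to prepare a record for a text-to-text question generation model is to check the fields and then pair the context with the question. This is a minimal sketch, not the curator's preprocessing: the helper names and the `"generate question: "` prefix are assumptions made for illustration, and the sample record mirrors the example shown in the dataset card.

```python
REQUIRED_FIELDS = ("context", "questionsrc", "question")


def validate_record(record):
    """Raise ValueError unless all three documented fields are present as strings."""
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"field {field!r} is missing or not a string")
    return record


def to_seq2seq_pair(record, prefix="generate question: "):
    """Turn a record into (source, target) text; the prefix is a hypothetical choice."""
    validate_record(record)
    return prefix + record["context"].strip(), record["question"].strip()


sample = {
    "context": "This is a test context.",
    "questionsrc": "test context",
    "question": "Is this a test?",
}
source, target = to_seq2seq_pair(sample)
print(source)
print(target)
```

The `questionsrc` field is validated but left out of the model input here, because the source does not describe what it contains beyond its type.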

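Dataset creation

Dataset summary:
- Contains over 230K document-question pairs
- Includes 7K instructor-designed questions and 223K learner-generated questions
- Collected and corrected by Sidali Lamri, who fixed more than 50,000 errors spanning more than 1,500 error types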

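Collected summary

Dataset introduction

Background and challenges

Background overview

LearningQ-qg is a large-scale educational question generation dataset containing over 230K document-question pairs. It supports text generation tasks in English and is divided into training, validation, and test sets, which makes it suitable for research on and applications of question generation in education.

The above content was collected and summarized by 遇见数据集.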
