
SLPL/syntran-fa

Source: Hugging Face | Updated: 2024-10-25 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/SLPL/syntran-fa
Resource description:
---
language:
- fa
license: mit
multilinguality:
- monolingual
size_categories:
- 30k<n<50k
task_categories:
- question-answering
- text2text-generation
- text-generation
task_ids: []
pretty_name: SynTranFa
tags:
- conditional-text-generation
- conversational-question-answering
---

# SynTran-fa

A syntactically transformed version of Farsi QA datasets, built to produce fluent responses from questions and short answers.

You can load the dataset with the code below:

```python
import datasets

# Load the only split currently published ("train") from the Hugging Face Hub.
data = datasets.load_dataset('SLPL/syntran-fa', split="train")
```

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Sharif-SLPL](https://github.com/Sharif-SLPL)
- **Repository:** [SynTran-fa](https://github.com/agp-internship/syntran-fa)
- **Point of Contact:** [Sadra Sabouri](mailto:sabouri.sadra@gmail.com)
- **Paper:** [SynTran-fa: Generating Comprehensive Answers for Farsi QA Pairs via Syntactic Transformation](https://www.preprints.org/manuscript/202410.1684/v1)

### Dataset Summary

Generating fluent responses has always been challenging for the question-answering task, especially in low-resource languages like Farsi. In recent years there have been several efforts to increase the size of Farsi datasets.
SynTran-fa is a question-answering dataset that collects the short answers of earlier Farsi QA datasets and proposes a complete, fluent answer for each (question, short_answer) pair. It contains nearly 50,000 question-answer entries. The source datasets are listed in the [Source Data section](#source-data).

The main idea comes from [Fluent Response Generation for Conversational Question Answering](https://aclanthology.org/2020.acl-main.19.pdf), which uses a "parser + syntactic rules" module to produce different fluent answers from a question and a short answer. In this project, we used [stanza](https://stanfordnlp.github.io/stanza/) as the parser: it parses the question, and a response is generated from that parse using the short answer (sentences without verbs, up to ~4 words).

This project could be continued by generating different permutations of the sentence's parts (and thus providing more than one sentence per answer), or by training a seq2seq model that does what our rule-based system does (by defining a new text-to-text task).

### Supported Tasks and Leaderboards

This dataset can be used for the question-answering task, especially when the goal is to generate fluent responses. You can train a seq2seq model on it to generate fluent responses, as done in [Fluent Response Generation for Conversational Question Answering](https://aclanthology.org/2020.acl-main.19.pdf).
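As a rough sketch of how a row could be framed as a text-to-text example for such a seq2seq model, the snippet below joins the question and the short answer into a source string and uses the fluent answer as the target. The helper name `to_text2text` and the `" | "` separator are assumptions made for illustration; they are not part of the dataset or of the authors' training setup.

```python
from typing import Dict, Tuple


def to_text2text(row: Dict[str, object], sep: str = " | ") -> Tuple[str, str]:
    """Build a (source, target) pair from one dataset row.

    The separator is an arbitrary choice for this sketch, not a dataset convention.
    """
    source = f"{row['question']}{sep}{row['short_answer']}"
    target = str(row["fluent_answer"])
    return source, target


# Example row copied from the dataset card.
example_row = {
    "id": 0,
    "question": "باشگاه هاکی ساوتهمپتون چه نام دارد؟",
    "short_answer": "باشگاه هاکی ساوتهمپتون",
    "fluent_answer": "باشگاه هاکی ساوتهمپتون باشگاه هاکی ساوتهمپتون نام دارد.",
    "bert_loss": 1.110097069682014,
}

source, target = to_text2text(example_row)
print(source)
print(target)
```

The same function can be mapped over every row of the `train` split before tokenization; which tokenizer and model to use is left open here.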
### Languages

+ Persian (fa)

## Dataset Structure

Each row of the dataset looks like the following:

```json
{
  "id": 0,
  "question": "باشگاه هاکی ساوتهمپتون چه نام دارد؟",
  "short_answer": "باشگاه هاکی ساوتهمپتون",
  "fluent_answer": "باشگاه هاکی ساوتهمپتون باشگاه هاکی ساوتهمپتون نام دارد.",
  "bert_loss": 1.110097069682014
}
```

+ `id`: the entry ID in the dataset
+ `question`: the question
+ `short_answer`: the short answer corresponding to the `question` (the primary answer)
+ `fluent_answer`: the fluent (long) answer generated from both the `question` and the `short_answer` (the secondary answer)
+ `bert_loss`: the loss that [pars-bert](https://huggingface.co/HooshvareLab/bert-base-parsbert-uncased) gives when the `fluent_answer` is fed to it. The higher the loss, the more likely the sentence is to be disfluent.

Note: the dataset is sorted in ascending order of `bert_loss`, so the first sentences are more likely to be fluent.

### Data Splits

Currently, the dataset provides only the `train` split. A `test` split is planned.

## Dataset Creation

### Source Data

The source datasets that we used are as follows:

+ [PersianQA](https://github.com/sajjjadayobi/PersianQA)
+ [PersianQuAD](https://ieeexplore.ieee.org/document/9729745)

#### Initial Data Collection and Normalization

We extracted all short-answer entries (sentences without verbs, up to ~4 words) from the open-source Farsi QA datasets and applied rules based on the question's parse tree to produce long (fluent) answers.

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

The dataset is entirely a subset of known open-source datasets, so all information in it is already publicly available on the internet. We do not take responsibility for any of that information.
## Additional Information

### Dataset Curators

The dataset was compiled entirely during the Asr Gooyesh Pardaz company's summer internship, under the supervision of Soroush Gooran and Prof. Hossein Sameti and the mentorship of Sadra Sabouri. This project was Farhan Farsi's first internship project.

### Licensing Information

MIT

### Citation Information

```bibtex
@article{farsi2024syntran,
  title={SynTran-fa: Generating Comprehensive Answers for Farsi QA Pairs via Syntactic Transformation},
  author={Farsi, Farhan and Sabouri, Sadra and Kashfipour, Kian and Gooran, Soroush and Sameti, Hossein and Asgari, Ehsaneddin},
  year={2024},
  doi={10.20944/preprints202410.1684.v1},
  publisher={Preprints}
}
```

### Contributions

Thanks to [@farhaaaaa](https://github.com/farhaaaaa) and [@sadrasabouri](https://github.com/sadrasabouri) for adding this dataset.
Provider:
SLPL
Summary of original information

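Dataset Overview

Dataset Name

  • Name: SynTran-fa
  • Alias: SynTranFa

Language and License

  • Language: Persian (fa)
  • License: MIT

Multilinguality

  • Type: monolingual

Size Category

  • Range: 30,000 < n < 50,000

Task Categories

  • Tasks
    • Question answering
    • Text-to-text generation
    • Text generation

Tags

  • Tags
    • Conditional text generation
    • Conversational question answering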

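Dataset Description

Dataset Summary

  • Purpose: generate fluent answers, particularly for low-resource languages such as Persian.
  • Content: nearly 50,000 question-answer entries; each entry contains a question, a short answer, and a fluent answer generated from the two.
  • Method: stanza is used as the parser, and fluent answers are generated from the question's parse tree and a set of syntactic rules.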
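To make the idea of a syntactic transformation concrete, here is a deliberately simplified toy rule. It is not the authors' rule set and does not use a parser: it only replaces one interrogative word with the short answer and swaps the question mark for a period, which happens to reproduce the example row shown in the dataset card. The function name `toy_fluent_answer` and the single-word `wh_word` parameter are assumptions for illustration.

```python
from typing import Optional


def toy_fluent_answer(question: str, short_answer: str, wh_word: str = "چه") -> Optional[str]:
    """Toy rule: substitute the interrogative word with the short answer.

    Returns None when the interrogative word is not found. The real pipeline
    relies on a parse tree and several rules; this sketch does not.
    """
    # Drop the trailing Persian or Latin question mark, then split on whitespace.
    tokens = question.strip().rstrip("؟?").split()
    if wh_word not in tokens:
        return None
    replaced = [short_answer if tok == wh_word else tok for tok in tokens]
    return " ".join(replaced) + "."


question = "باشگاه هاکی ساوتهمپتون چه نام دارد؟"
short_answer = "باشگاه هاکی ساوتهمپتون"
result = toy_fluent_answer(question, short_answer)
print(result)
```

A parse-tree-based approach is needed in practice because the position and form of the interrogative phrase vary from question to question, which a single string substitution cannot handle.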

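Supported Tasks and Leaderboards

  • Application: can be used to train sequence-to-sequence models that generate fluent answers.

Dataset Structure

  • Data format: JSON
  • Fields
    • id: entry ID in the dataset
    • question: the question
    • short_answer: the short answer
    • fluent_answer: the fluent answer
    • bert_loss: the loss that the pars-bert model assigns to the fluent answer, used as a fluency estimate
  • Ordering: the dataset is sorted by bert_loss in ascending order; the lower the loss, the more likely the sentence is fluent.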
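Because rows are sorted by ascending `bert_loss`, one plausible way to keep only the more fluent examples is to take rows from the start until the loss exceeds a cutoff. The sketch below works on any iterable of row dictionaries; the helper name `fluent_prefix`, the cutoff of 2.0, and the sample rows are hypothetical values chosen only to illustrate the idea.

```python
from itertools import takewhile
from typing import Dict, Iterable, List


def fluent_prefix(rows: Iterable[Dict], max_loss: float) -> List[Dict]:
    """Keep the leading rows whose bert_loss is at most max_loss.

    Relies on the rows already being sorted by ascending bert_loss.
    """
    return list(takewhile(lambda row: row["bert_loss"] <= max_loss, rows))


# Made-up rows for illustration only; real rows also carry the text fields.
rows = [
    {"id": 0, "bert_loss": 1.11},
    {"id": 1, "bert_loss": 1.80},
    {"id": 2, "bert_loss": 2.65},
]

kept = fluent_prefix(rows, max_loss=2.0)
print([row["id"] for row in kept])  # prints [0, 1]
```

A suitable cutoff would have to be chosen by inspecting the data; the card does not recommend one.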

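Data Splits

  • Currently available: the train split only
  • Planned: a test split

Dataset Creation

Source Data

  • Sources
    • PersianQA
    • PersianQuAD

Initial Data Collection and Normalization

  • Method: extract all short answers (sentences without verbs, up to about 4 words) from all open-source Persian QA datasets, then apply rules based on the question's parse tree to generate fluent answers.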
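The short-answer criterion described above (no verb, up to about 4 words) could be checked roughly as follows once an answer has been POS-tagged. This sketch takes pre-tagged `(word, upos)` pairs so that it runs without a tagger; in practice the tags could come from a parser such as stanza, which exposes a `upos` attribute on each word. Treating `AUX` as verbal, the hard limit of 4, and the hand-assigned tags in the examples are assumptions, not the authors' exact rule.

```python
from typing import List, Tuple

# Universal POS tags treated as "verbal" in this sketch (an assumption).
VERBAL_UPOS = {"VERB", "AUX"}


def is_short_answer(tagged: List[Tuple[str, str]], max_words: int = 4) -> bool:
    """Return True for a non-empty, verbless answer of at most max_words words."""
    has_verb = any(upos in VERBAL_UPOS for _, upos in tagged)
    return 0 < len(tagged) <= max_words and not has_verb


# Tags below are assigned by hand for illustration, not produced by a tagger.
noun_phrase = [("باشگاه", "NOUN"), ("هاکی", "NOUN"), ("ساوتهمپتون", "PROPN")]
with_verb = [("او", "PRON"), ("رفت", "VERB")]

print(is_short_answer(noun_phrase))
print(is_short_answer(with_verb))
```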

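Personal and Sensitive Information

  • Statement: the dataset is drawn entirely from known open-source datasets, so all of its information is already publicly available on the internet.

Additional Information

Dataset Curators

  • Organization: summer internship program at the Asr Gooyesh Pardaz company
  • Supervision: Soroush Gooran, Prof. Hossein Sameti
  • Mentorship: Sadra Sabouri
  • Project: Farhan Farsi's first internship project

Licensing Information

  • License: MIT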
