
mesolitica/synthetic-jawi-conversation

Source: Hugging Face. Updated 2024-06-12. Indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/mesolitica/synthetic-jawi-conversation
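Overview:
The Synthetic Jawi Conversation dataset is converted from a Malaysian general question-answering dataset, with the data transliterated into Jawi script. It is stored in JSONL format, contains 509,375 rows, and is 13 GB in size. The sample data shows a conversation's context, the user question, and the assistant's answer, with the question and answer each given in both Latin-script Malay and Jawi.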

Provided by:
mesolitica
Summary of original information

Synthetic Jawi Conversation dataset overview

Basic information

  • Task category: text generation
  • Language: Malay (ms)

Dataset source

Data file

  • File name: synthetic-jawi-conversation.jsonl
  • Rows: 509,375
  • Size: 13 GB

Sample data

  • Roles: context, user, assistant
  • Content: conversation text in both Latin-script Malay and Jawi script
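
As a rough illustration of the record layout, the sketch below parses one JSONL line and groups its messages by role. It assumes each line is a JSON list of messages, that the context message carries a `content` field, and that user and assistant messages carry `content_ms` and `content_jawi`; these names come from the example record in this listing. The helper name `group_by_role` is hypothetical, and the record is a shortened excerpt rather than a full row.

```python
import json
from collections import defaultdict

# Shortened excerpt in the shape of the example record in this listing.
line = json.dumps([
    {"role": "context", "content": "During this critical time, schools were closed."},
    {"role": "user",
     "content_ms": "Bagaimanakah wabak itu",
     "content_jawi": "باڬايماناكه وابق ايت"},
    {"role": "assistant",
     "content_ms": "Pandemik COVID-19",
     "content_jawi": "ڤندميق چوۏيد-١٩"},
], ensure_ascii=False)

def group_by_role(jsonl_line):
    """Parse one JSONL line and collect its messages under each role.

    Lists are used so that multi-turn conversations keep every turn.
    """
    grouped = defaultdict(list)
    for message in json.loads(jsonl_line):
        grouped[message["role"]].append(message)
    return dict(grouped)

record = group_by_role(line)
print(record["context"][0]["content"])
print(record["user"][0]["content_ms"], "->", record["user"][0]["content_jawi"])
```

Grouping into lists rather than a single message per role is a deliberate choice here: a conversation with several user turns would otherwise keep only the last one.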

Example

json [ {"role": "context", "content": "din Venue: DK7, Block J, 3rd Floor, Faculty of Medicine 14 21.06.2023 9.00am – 12.00pm (SDL): to turn in Reflection write up (online, synchronous) Lecturer: Prof. Dr. Fung Shin Yee / Dr. Muhamad Alfakri B. Mat Noh 4 Links to the online platforms (Zoom, GoogleMeet, Microsoft Teams and others) will be shared via the class rep a few days before the class. All students registered for this course must join the class Whatsapp group for notifications. Slides and pre-recorded videos will be uploaded on Spectrum for your reference. Online synchronous: Be present to join the online class at the designated time as stated in the timetable. Attendance will be taken up to 30 mins after class has begun. Class material will be shared on Spectrum and the students can access the information in their own time. Self-directed learning: Students use this time to prepare for presentations or assignments. Lecturers: Name Affiliation Adams, Peter J University of Auckland, New Zealand Anwar Norazit Biomedical Science Chang Li Yen Medical Microbiology Noran Naqiah Binti Mohd Hairi Social and Preventative Medicine Fung Shin Yee Molecular Medicine Haryanti Azura bt Mohammad Wali Animal Experimental Unit, FOM Leo Bey Fen Central Unit of Advanced Research Imaging, CENTUARI Mahmood Danee Social and Preventative Medicine Mohd Shafiq Azanan Paediatrics Department Muhamad Alfakri B. Mat Noh Anatomy Ng Kwan Hoong Biomedical Imaging Nur Azah Hamzaid Head, Research Training Unit, ADeC Phan Chia Wei Clinical Investigation Centre Rozaida Poh Biomedical Science Tan Ai Huey Research Management Unit, FOM Thamil Selvee Ramasamy Molecular Medicine Zanaria Saupi Udin UM Main Library For any enquiries: Prof. Dr. Fung Shin Yee - Programme Coordinator 03-7967 5745 (syfung@ummc.edu.my) Dr. Muhamad Alfakri B. 
Mat Noh - Programme Coordinator 03-7967 4731 (alfakri@um.edu.my) Muhd Shahrul Harris, The Dean’s Office, Faculty of Medicine 03-7967 7503 (muhdharris@um.edu.my) Updated 04.01.2023 mailto:syfung@ummc.edu.my mailto:alfakri@um.edu.my mailto:muhdharris@um.edu.my Yanti Idaya, AMK., et al. (Eds.): ICOLIS 2021, Kuala Lumpur: DLIS, FASS, 2021 24 Digital Repository for English Language Teaching and Learning Resources: A Panacea in the Time of the Pandemic Grace Hui Lin Tan and Yanti Idaya Aspura M.K. Department of Library and Information Science, Faculty of Arts and Social Sciences, Universiti Malaya, Kuala Lumpur, MALAYSIA e-mail: tangracehl@gmail.com; yanti@um.edu.my ABSTRACT The Coronavirus disease 2019 (COVID-19) pandemic poses a huge challenge to the education system in Malaysia. During this critical time, schools were closed and teachers were required to conduct lessons via the online platform. Thus, the use of digital teaching resources increased exponentially as these resources made lessons more effective and also more interesting. As these resources had to be systematically stored for easy retrieval, a digital repository prototype was developed specially to support English Language instruction at SMK Canossian Convent (SMKCC), Kluang, Johor. The repository facilitates the systematic storage and easy retrieval of the image, text and video collections of the English Language teaching and learning resources using the DSpace system. This was implemented using the ADDIE model which consists of five phases, namely analysis, design, development, implementation and evaluation. A total of 75 teaching and learning resources were uploaded into the system, accompanied by selected Dublin Core metadata fields which include Title, Author, Keyword, Issue Date, Publisher, Description, Uniform Resource Identifier (URI), Type and Language. A survey to evaluate the level of satisfaction was then carried out among users of this digital repository prototype. 
The results showed that 100% of the respondents found the digital repository useful, easy to use, easy to learn, and were satisfied in using it to browse, search and retrieve any particular item stored in the collection. This project has, indeed, delivered a good prototype of the digital repository service to the targeted users. Suggestions from users were noted for project improvement and future development of the digital repository. Keywords: Digital repositories; DSpace; ADDIE model; Dublin Core metadata; English Language instruction INTRODUCTION On 11 March 2020, the World Health Organisation (WHO) declared the Coronavirus disease 2019 (COVID-19) outbreak a global pandemic (Ghebreyesus 2020). Due to a sharp rise in COVID-19 cases in the country, the Malaysian Prime Minister, Tan Sri Mahiaddin Yassin officially implemented the ‘Movement Control Order’ (MCO) on 18 March 2020 (Sukumaran 2020). During this pandemic, schools in Malaysia were closed indefinitely as a way to protect students from COVID-19 infections and to prevent the disease from Yanti Idaya, AMK., et al. (Eds.): ICOLIS 2021, Kuala Lumpur: DLIS, FASS, 2021 25 spreading in the community (Soo 2020). With the closure of all schools, teachers had to use available technologies and conduct teaching via the online platform (Razak 2020). They had to be creative and innovative in overcoming the limitations of online teaching and in ensuring that e-learning was effective (Othman et al. 2020; Pokhrel and Chhetri 2021). Prior to the COVID-19 pandemic, the English Language teachers of SMK Canossian Convent (SMKCC), Kluang conducted face-to-face lessons and occasionally used digital resources available on several platforms, websites or portals such as the Johor English Digital Learning Resources (JEDLR), British Council, Teach-This.com, All Things Grammar and YouTube. These resources had to be carefully selected for lessons to be effective and interesting so as to achieve their learning objectives. 
During the MCO period, the use of these available digital resources has become even more popular among teachers conducting online teaching as it helps overcome the problem of having limited time to prepare individual online learning content (Huang et al. 2020). Every English Language teacher has their collection of resources stored in electronic folders in the computer or on external hard disk for future use. The English Language teachers faced various technical problems trying to locate the resources that were stored earlier. Firstly, these resources were not localised; they were scattered, and not well- organised. Secondly, no standard metadata was used when storing these resources. Information about the resources was either not stated or was too brief, that is, only the folder name or the file title was available. Thirdly, item retrieval was time-consuming as the teachers had to select the folder, and then browse through the files according to the level and title. When these titles were renamed, the time taken for the retrieval of the items was much longer. Fourthly, redundancy due to multiple storages occurred when teachers saved the resources which already existed in their computers under a different file name. Lastly, the problem of accessibility arose when teachers, while working in school, wanted to access the image, text and video collection of resources stored earlier but did not bring their computer or external hard disk to school. It was apparent that there was a need for a simple yet centralised and integrated system where good and effective teaching and learning resources could be pooled together, systematically stored in a specific location for easy retrieval, and shared among the teachers. Discussions with their English Language Department Head led to the decision to establish a digital repository as a database to store and manage the digital collection of teaching and learning resources. 
The resources to be included in this repository were Examination Questions, Fun Learning Activities and Writing Genres. Therefore, this research project aims to develop a digital repository prototype using DSpace for the image, text and video collections of teaching and learning resources of the English Language Department of SMKCC Kluang. The development of the digital repository prototype in this research project is to enhance item retrieval of the English Language teaching and learning resources. Yanti Idaya, AMK., et al. (Eds.): ICOLIS 2021, Kuala Lumpur: DLIS, FASS, 2021 2}, {"role": "user", "content_ms": "Bagaimanakah wabak itu membawa kepada pembangunan prototaip repositori digital untuk pengajaran Bahasa Inggeris di SMK Canossian Convent (SMKCC), Kluang, Johor?", "content_jawi": "باڬايماناكه وابق ايت ممباوا كڤد ڤمباڠونن ڤروتوتايڤ ريڤوسيتوري ديڬيتل اونتوق ڤڠاجرن بهاس ايڠڬريس د ايس عيم كي چانوسسين چونۏينت ﴿ايس عيم كيچچ﴾، قلواڠ، جوهر؟"}, {"role": "assistant", "content_ms": "Pandemik COVID-19 membawa kepada pembangunan prototaip repositori digital untuk pengajaran Bahasa Inggeris di SMK Canossian Convent (SMKCC), Kluang, Johor kerana keperluan untuk menutup sekolah selama-lamanya sebagai satu cara untuk melindungi pelajar daripada jangkitan COVID-19 dan mencegah penyakit itu daripada terus merebak dalam masyarakat. Dengan penutupan semua sekolah, guru terpaksa menjalankan pengajaran melalui platform dalam talian, yang memerlukan mereka kreatif dan inovatif dalam mengatasi batasan pengajaran dalam talian dan memastikan e-pembelajaran berkesan.

Sebelum wabak, guru Bahasa Inggeris SMKCC kadangkala menggunakan sumber digital yang terdapat di beberapa platform, laman web atau portal untuk pengajaran bersemuka. Walau bagaimanapun, dalam tempoh PKP, penggunaan sumber digital yang ada ini menjadi lebih popular dalam kalangan guru yang menjalankan pengajaran dalam talian. Setiap guru Bahasa Inggeris mempunyai koleksi sumber mereka yang disimpan dalam folder elektronik dalam komputer atau pada cakera keras luaran untuk kegunaan masa hadapan. Walau bagaimanapun, mereka menghadapi pelbagai masalah teknikal semasa cuba mencari sumber yang disimpan lebih awal, seperti sumber tidak disetempat, tidak tersusun dengan baik, tiada metadata standard digunakan semasa menyimpan sumber ini, pengambilan item memakan masa, lebihan disebabkan pelbagai storan, dan masalah kebolehaksesan.

Untuk menangani isu ini, Ketua Jabatan Bahasa Inggeris dan guru Bahasa Inggeris memutuskan untuk mewujudkan repositori digital sebagai pangkalan data untuk menyimpan dan mengurus koleksi digital sumber pengajaran dan pembelajaran. Sumber yang akan dimasukkan ke dalam repositori ini ialah Soalan Peperiksaan, Aktiviti Pembelajaran Seronok dan Genre Penulisan. Oleh itu, projek penyelidikan ini bertujuan untuk membangunkan prototaip repositori digital menggunakan DSpace untuk koleksi imej, teks dan video sumber pengajaran dan pembelajaran Jabatan Bahasa Inggeris SMKCC Kluang. Pembangunan prototaip repositori digital dalam projek penyelidikan ini adalah untuk meningkatkan perolehan semula item sumber pengajaran dan pembelajaran Bahasa Inggeris, memudahkan guru mengakses dan menggunakan sumber untuk pengajaran dalam talian.", "content_jawi": "ڤندميق چوۏيد-١٩ ممباوا كڤد ڤمباڠونن ڤروتوتايڤ ريڤوسيتوري ديڬيتل اونتوق ڤڠاجرن بهاس ايڠڬريس د ايس عيم كي چانوسسين چونۏينت ﴿ايس عيم كيچچ﴾، قلواڠ، جوهر كران كڤرلوان اونتوق منوتوڤ سكوله سلاما-لاماڽ سباڬاي ساتو چارا اونتوق مليندوڠي ڤلاجر درڤد جڠكيتن چوۏيد-١٩ دان منچڬه ڤڽاكيت ايت درڤد تروس مربق دالم مشاركت. دڠن ڤنوتوڤن سموا سكوله، ڬورو ترڤقسا منجالنكن ڤڠاجرن ملالوءي ڤلاتفورم دالم تالين، يڠ ممرلوكن مريك كرياتيف دان اينوۏاتيف دالم مڠاتسي باتسن ڤڠاجرن دالم تالين دان ممستيكن اي-ڤمبلاجرن بركسن.

سبلوم وابق، ڬورو بهاس ايڠڬريس ايس عيم كيچچ كادڠكالا مڠڬوناكن سومبر ديڬيتل يڠ ترداڤت د ببراڤ ڤلاتفورم، لامن ويب اتاو ڤورتل اونتوق ڤڠاجرن برسموكا. والاو باڬايماناڤون، دالم تيمڤوه ڤقڤ، ڤڠڬوناان سومبر ديڬيتل يڠ اد اين منجادي لبيه ڤوڤولر دالم كالڠن ڬورو يڠ منجالنكن ڤڠاجرن دالم تالين. ستياڤ ڬورو بهاس ايڠڬريس ممڤوڽاءي كوليكسي سومبر مريك يڠ دسيمڤن دالم فولدر ايليكترونيك دالم كومڤوتر اتاو ڤد چاكرا كراس لواران اونتوق كڬونأن ماس هادڤن. والاو باڬايماناڤون، مريك مڠهادڤي ڤلباڬاي مسئله

Collected summary
Dataset introduction
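Construction
The dataset is derived from the mesolitica/mixtral-malaysian-general-qa corpus. A conversion pipeline transliterates its Malay conversation content into Jawi script. The build relies on a public notebook toolchain, with data cleaning and format normalization done in a GitHub repository, and produces a JSONL file of 509,375 records totalling 13 GB. Each conversation keeps the original multi-turn structure, covering the user questions and the assistant's answers in both Malay and Jawi, so that the meaning stays intact and the two scripts stay aligned.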
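Usage
The dataset is designed for text generation and can be used directly to train or fine-tune Transformer-based language models. Users can load the JSONL file with the Hugging Face Datasets library, feed the conversation sequence as input, and use the role field to tell context, user, and assistant content apart. The recommended setup treats the Malay text as the input prompt and the Jawi text as the target output, which supports cross-script translation or conversation generation models. The dataset also fits standard data-splitting and batching workflows, so it is suitable for training and evaluation on GPUs.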
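
The Malay-in, Jawi-out pairing described above could be sketched roughly as follows. This is a minimal sketch under assumptions, not the provider's loader: it uses only the Python standard library rather than the Hugging Face Datasets library, it assumes each line is a JSON list of messages with `content_ms` and `content_jawi` on user and assistant turns (as in the example record), and `iter_pairs` is a hypothetical helper name. The in-memory sample stands in for `synthetic-jawi-conversation.jsonl`.

```python
import io
import json

def iter_pairs(lines):
    """Yield (malay_prompt, jawi_target) for each message that has both scripts.

    Messages without both fields (for example the context message, which
    only has `content` in the listing's example) are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for message in json.loads(line):
            malay = message.get("content_ms")
            jawi = message.get("content_jawi")
            if malay and jawi:
                yield malay, jawi

# Tiny stand-in for the real file, using an excerpt of the example record.
sample = json.dumps([
    {"role": "context", "content": "During this critical time, schools were closed."},
    {"role": "user",
     "content_ms": "Bagaimanakah wabak itu",
     "content_jawi": "باڬايماناكه وابق ايت"},
    {"role": "assistant",
     "content_ms": "Pandemik COVID-19",
     "content_jawi": "ڤندميق چوۏيد-١٩"},
], ensure_ascii=False) + "\n"

for prompt, target in iter_pairs(io.StringIO(sample)):
    print(prompt, "->", target)
```

For the real file, the same generator can be fed an open file handle, for example `open("synthetic-jawi-conversation.jsonl", encoding="utf-8")`. Reading line by line keeps memory use flat, which matters for a 13 GB file.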
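Background and challenges
Background overview
The dataset was created by the Malaysian research organization mesolitica (this listing records its last Hugging Face update as 2024-06-12) and is derived from the organization's existing mixtral-malaysian-general-qa dataset. Its core research problem is converting modern Malay conversations into Jawi text, to fill the corpus gap for Jawi, a low-resource script in Malay natural language processing. Jawi is the traditional writing system for Malay and matters in religious, historical, and cultural contexts, but digitized resources are scarce, which holds back text generation and language modeling for it. With more than 500,000 conversation records and 13 GB of data, the dataset offers a benchmark for multi-script Malay processing, cross-script transfer learning, and low-resource language modeling, and it supports work on Southeast Asian language technology and cultural heritage digitization.
Current challenges
The core challenges for this dataset are: 1) Jawi is a low-resource script whose spelling conventions, character variants, and mapping to Latin-script Malay are complex. Existing models tend to produce spelling errors or semantic drift when generating it, so alignment and generalization in cross-script transfer learning remain open problems. 2) During construction, the original general QA data had to go through automatic or semi-automatic conversion, and the conversion accuracy is bounded by the quality of the Jawi transliteration tool. Contextual coherence is also hard to guarantee for long texts (such as the academic content in the example), so some conversations lose information or break logically after conversion. In addition, although the dataset is large, its domain coverage leans toward academic and educational content, and its diversity and balance could be improved.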
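
One simple way to spot-check the conversion-quality concern above is to measure how much of a `content_jawi` string is actually written in Arabic-script letters. The sketch below is an illustrative heuristic, not part of the dataset's tooling: the function name `arabic_script_ratio` is hypothetical, and the choice of the Unicode Arabic (U+0600 to U+06FF) and Arabic Supplement (U+0750 to U+077F) blocks is an assumption about where Jawi letters fall. A low ratio may simply reflect legitimate Latin-script names or acronyms, so any cut-off would need tuning on real rows.

```python
def arabic_script_ratio(text):
    """Return the fraction of alphabetic characters in Arabic-script blocks.

    Digits, punctuation, and whitespace are ignored. Returns 0.0 when the
    text contains no alphabetic characters at all.
    """
    def is_arabic(ch):
        code = ord(ch)
        return 0x0600 <= code <= 0x06FF or 0x0750 <= code <= 0x077F

    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_arabic(ch) for ch in letters) / len(letters)

# Fully transliterated excerpt from the example record: every letter is Arabic-script.
print(arabic_script_ratio("باڬايماناكه وابق ايت"))
# Latin residue left inside a Jawi field lowers the ratio.
print(arabic_script_ratio("ايت DSpace"))
```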
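Common scenarios
Classic use cases
The dataset's main purpose is to provide a large synthetic corpus for conversation generation in Jawi-script Malay. It is derived from a general QA dataset and, after conversion, retains academic, educational, and everyday conversational contexts, which makes it well suited to training and evaluating text generation models on the Jawi writing system. Researchers can use it to study semantic preservation and grammatical coherence in low-resource script settings, and to advance Malay natural language processing where a historical script meets modern technology.
Academic problems addressed
The dataset eases the scarcity of digital Jawi resources and gives researchers a standardized benchmark for studying how transliteration affects language model performance. It addresses the small size and narrow domain of traditional Jawi corpora, so models can be evaluated systematically on cross-script (Latin alphabet and Jawi) semantic alignment, information retrieval, and conversation understanding. It also supports fairness research in multi-script natural language processing and provides data for digitizing historical documents in Southeast Asian languages.
Practical applications
In practice, the dataset can be used to build Jawi-capable customer service assistants, educational tutoring tools, and cultural content generation systems. For example, in religious education or historical document interpretation in Malaysia, a model could take a user's question written in Jawi and generate a reply suited to a modern context, supporting teaching and information sharing. It could also be embedded in translation tools or multilingual chatbots to allow interaction between the Latin alphabet and Jawi, improving the digital experience of users of this script.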
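Recent research
Latest research directions
The dataset focuses on converting general Malay QA data into Jawi text, with the aim of advancing research on low-resource scripts in conversation generation. Demand for multilingual, multi-script natural language processing is growing in Southeast Asia, and the Malay-speaking community in particular values bringing its historical script into modern digital systems. Against that background, the dataset is a resource for building Jawi-capable chatbots, educational tools, and cultural heritage digitization projects. Its large set of synthetic conversations (about 500,000) fills a gap in Jawi conversation data and opens paths for research on cross-script transfer learning, fine-tuning of low-resource language models, and semantic alignment, which in turn helps preserve Malay written heritage and makes regional language technology more inclusive.
The content above was collected and summarized by 遇见数据集.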