
MaLLaM-2.5-Small-Manglish-QA

Hugging Face, updated 2024-11-15, indexed 2024-12-12
Download link:
https://huggingface.co/datasets/mesolitica/MaLLaM-2.5-Small-Manglish-QA
Overview:
The MaLLaM 2.5 Synthetic Singlish QA dataset is designed for generating synthetic question-answer (QA) pairs in Singaporean Colloquial English (Singlish). It includes examples in English, Malay, and Manglish (Malaysian Colloquial English), and its content covers politics, science, mathematics, and code. The sample entries in the 'Politics' section show the dataset's format: QA pairs in each language alongside the synthetic Singlish variants. The dataset is intended for generating synthetic QA data, with a specific focus on this distinctive Singaporean colloquial variety.
Provider:
Mesolitica
Created:
2024-11-15
Original information summary

MaLLaM 2.5 Synthetic Singlish QA

Languages

  • Malay (ms)
  • English (en)

Dataset description

This dataset contains synthetic Singaporean Colloquial English (Singlish) question-answer data.

Data files

Politics

  • File name: mallam-small-2.5-manglish-politics.jsonl
  • Rows: 66025
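
A minimal sketch of reading the file listed above, assuming it has already been downloaded to a local path and that each line holds one JSON value, as the `.jsonl` extension suggests. The helper names `iter_jsonl` and `count_rows` are hypothetical and only serve the illustration.

```python
import json
from pathlib import Path
from typing import Any, Iterator, Union


def iter_jsonl(path: Union[str, Path]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def count_rows(path: Union[str, Path]) -> int:
    """Count the records in a JSONL file without loading it all into memory."""
    return sum(1 for _ in iter_jsonl(path))


# Hypothetical usage; the local path is an assumption:
# print(count_rows("mallam-small-2.5-manglish-politics.jsonl"))
```

Streaming line by line keeps memory use flat, which matters for a file with tens of thousands of long records.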

Sample data

json [ { "role": "user", "content": "Why did Malaysias former anti-corruption agency, the Malaysian Anti-Corruption Commission (MACC), face allegations of bias and political interference during Najib Razaks tenure?", "content_ms": "Mengapa bekas agensi antirasuah Malaysia, Suruhanjaya Pencegahan Rasuah Malaysia (SPRM), menghadapi dakwaan berat sebelah dan campur tangan politik semasa Najib Razak memegang jawatan?", "content_manglish": "Wah, why ah? Dat old MACC kena complain left right center during Najib time. People say dem biased lah, kena political interference one. Sibeh jialat sia! MACC supposed to catch all the corrupt fellas, but during Najib time, everyone say dem cannot do proper job. Like dat how? MACC supposed to be big shot anti-corruption agency, but end up kena accuse of being political puppet. Alamak, so many problem one!" }, { "role": "assistant", "content": "During Najib Razaks tenure as Prime Minister of Malaysia (2009-2018), the Malaysian Anti-Corruption Commission (MACC) faced allegations of bias and political interference due to several high-profile cases and controversies. While I cannot access real-time data, I can provide information based on publicly available reports and records. Here are some reasons why MACC faced such allegations during that period:

  1. Delayed investigation into 1MDB scandal: The 1Malaysia Development Berhad (1MDB) scandal, which involved alleged money laundering and embezzlement of billions of dollars, became a focal point of the allegations against MACC. Critics argued that MACC had not taken decisive action against those implicated in the scandal, including Najib Razak himself, despite evidence presented by whistleblowers and international media.
  2. Sodomy II case: The MACC was accused of acting on behalf of the ruling government in the highly politicized Sodomy II case against opposition leader Anwar Ibrahim. Anwar was convicted in 2015 on charges of sodomizing a former aide, and critics claimed that the MACC had failed to conduct a thorough investigation into the allegations of political interference and witness tampering in the case.
  3. Appointment process: The appointment process of the MACC chief was criticized during Najibs tenure, with claims that the process was politically influenced. Critics argued that the government had handpicked MACC chiefs who were more likely to be loyal to the ruling party, rather than acting independently to uphold the law.
  4. Whistleblower protection: MACC was also accused of not doing enough to protect whistleblowers, including those who came forward with evidence related to the 1MDB scandal. Critics claimed that whistleblowers faced harassment, intimidation, and even physical threats as a result of their actions, and that MACC had not provided adequate support or protection for these individuals.
  5. Lack of transparency: The MACCs operations and decision-making processes were criticized for being opaque. Critics argued that this lack of transparency left the agency vulnerable to political interference and made it difficult for the public to trust its impartiality.

These allegations of bias and political interference undermined public confidence in MACCs ability to carry out its mandate to fight corruption effectively and independently. After the 2018 general election, which saw the ousting of Najib Razaks government, the new administration initiated a series of reforms to strengthen MACCs independence and enhance its investigative capabilities.", "content_ms": "Semasa Najib Razak memegang jawatan Perdana Menteri Malaysia (2009-2018), Suruhanjaya Pencegahan Rasuah Malaysia (SPRM) menghadapi dakwaan berat sebelah dan campur tangan politik disebabkan beberapa kes berprofil tinggi dan kontroversi. Walaupun saya tidak dapat mengakses data masa nyata, saya boleh memberikan maklumat berdasarkan laporan dan rekod yang tersedia secara terbuka. Berikut adalah beberapa sebab mengapa SPRM menghadapi dakwaan sedemikian dalam tempoh itu:

  1. Siasatan tertunda terhadap skandal 1MDB: Skandal 1Malaysia Development Berhad (1MDB), yang melibatkan dakwaan pengubahan wang haram dan penyelewengan berbilion dolar, menjadi titik tumpuan dakwaan terhadap SPRM. Pengkritik berhujah bahawa SPRM tidak mengambil tindakan tegas terhadap mereka yang terlibat dalam skandal itu, termasuk Najib Razak sendiri, walaupun bukti dikemukakan oleh pemberi maklumat dan media antarabangsa.
  2. Kes Sodomy II: SPRM dituduh bertindak bagi pihak kerajaan yang memerintah dalam kes Sodomy II yang sangat dipolitikkan terhadap pemimpin pembangkang Anwar Ibrahim. Anwar disabitkan pada 2015 atas tuduhan meliwat bekas pembantu, dan pengkritik mendakwa bahawa SPRM telah gagal menjalankan siasatan menyeluruh terhadap dakwaan campur tangan politik dan gangguan saksi dalam kes itu.
  3. Proses pelantikan: Proses pelantikan ketua SPRM dikritik semasa tempoh Najib, dengan dakwaan bahawa proses itu dipengaruhi oleh politik. Pengkritik berhujah bahawa kerajaan telah memilih ketua SPRM yang lebih berkemungkinan setia kepada parti pemerintah, daripada bertindak secara bebas untuk menegakkan undang-undang.
  4. Perlindungan pemberi maklumat: SPRM juga dituduh tidak melakukan cukup untuk melindungi pemberi maklumat, termasuk mereka yang tampil dengan bukti berkaitan skandal 1MDB. Pengkritik mendakwa pemberi maklumat menghadapi gangguan, intimidasi, dan juga ancaman fizikal akibat tindakan mereka, dan SPRM tidak memberikan sokongan atau perlindungan yang mencukupi untuk individu ini.
  5. Kekurangan ketelusan: Operasi dan proses membuat keputusan SPRM dikritik kerana bersikap legap. Pengkritik berhujah bahawa kekurangan ketelusan ini menyebabkan agensi itu terdedah kepada campur tangan politik dan menyukarkan orang ramai untuk mempercayai ketidaksepihakannya.

Dakwaan berat sebelah dan campur tangan politik ini menjejaskan keyakinan orang ramai terhadap keupayaan SPRM untuk melaksanakan mandatnya untuk memerangi rasuah dengan berkesan dan bebas. Selepas pilihan raya umum 2018, yang menyaksikan penyingkiran kerajaan Najib Razak, pentadbiran baharu memulakan beberapa siri pembaharuan untuk mengukuhkan kemerdekaan SPRM dan meningkatkan keupayaan penyiasatannya.", "content_manglish": "Wah, during dat Najib Razak time as Malaysia PM (2009-2018), MACC kena big time lah. People say they biased and kena political interference, especially for big big cases. I cannot check real-time data, but I tell you what I know based on what people say. Heres why MACC kena so much problem during dat time:

  1. 1MDB scandal: Wah lau eh, dis one big scandal lah. Billions of dollars gone missing, people say money laundering and embezzlement. MACC supposed to investigate, but people say they slow slow only, especially when Najib kena involved. Whistleblowers and foreign media all say something, but MACC like blur like sotong.

  2. Sodomy II case: MACC kena accused of helping government in dis case against Anwar Ibrahim. Anwar kena convicted in 2015 for sodomy, but people say MACC not investigate properly. They say MACC kena pressure from government, so cannot do proper job.

  3. How they choose boss: People complain how MACC choose their big boss. Say government pick people who like them, not people who really want to fight corruption. Like dat how to be fair?

  4. No protect whistleblowers: MACC also kena say they dont protect people who tell them about corruption. People who give info about 1MDB scandal kena threatened and bullied. MACC supposed to help, but they like siao only.

  5. Everything so secretive: MACC very secretive one. People say they dont tell everything, so how to trust them? Like dat how to fight corruption properly?

All these things make people think MACC not fair and not independent. After 2018 election, new government try to make MACC better. They want MACC to be more strong and can fight corruption properly." } ]
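
Judging from the sample above, each record appears to be a list of messages carrying `role`, `content`, `content_ms`, and `content_manglish` fields. The sketch below, under that assumption, reduces a record to a plain role/content conversation in one language. The function name `select_variant` and the toy record are made up for illustration and are not taken from the dataset.

```python
from typing import Dict, List

Message = Dict[str, str]


def select_variant(messages: List[Message], field: str = "content_manglish") -> List[Message]:
    """Build a role/content conversation from one language field.

    Falls back to the English `content` field when `field` is missing or empty.
    """
    conversation = []
    for message in messages:
        text = message.get(field) or message.get("content", "")
        conversation.append({"role": message["role"], "content": text})
    return conversation


# Toy record with placeholder text (not real dataset content).
row = [
    {"role": "user", "content": "Why?", "content_ms": "Mengapa?", "content_manglish": "Why ah?"},
    {"role": "assistant", "content": "Because.", "content_ms": "Kerana."},
]
malay_conversation = select_variant(row, "content_ms")
```

The fallback to `content` is a design choice for this sketch; a stricter pipeline might drop records that lack the requested field instead.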

Science

  • File name: not provided
  • Rows: not provided

Mathematics

  • File name: not provided
  • Rows: not provided

Code

  • File name: not provided
  • Rows: not provided
Collected summary
Dataset introduction
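Construction method
The MaLLaM-2.5-Small-Manglish-QA dataset was built through synthetic generation and covers question-answer pairs in several domains, including Malaysian politics, science, and mathematics. Data generation relied on the Mesolitica platform to ensure diversity and breadth. Each QA pair is presented in three forms: standard English, Malay, and Manglish (Malaysian colloquial English). This multilingual format provides rich material for research on cross-lingual understanding and generation.
Features
The dataset is distinguished by its multilingual nature and broad domain coverage. Each QA pair appears in standard English, Malay, and Manglish, which gives cross-lingual model research a distinctive source of data. The dataset also spans politics, science, mathematics, and other domains, so its content is varied and complex. Because Manglish is an informal language variety, the dataset is a valuable resource for studying language variation and dialects.
Usage
The MaLLaM-2.5-Small-Manglish-QA dataset suits a range of natural language processing tasks, such as cross-lingual question answering, machine translation, and language generation. Researchers can analyze the QA pairs across language forms to explore the performance of cross-lingual models. The dataset can also be used to train and evaluate multilingual dialogue systems, particularly their handling of informal varieties such as Manglish. Its JSONL format is easy to load and process and is suitable for large-scale machine learning experiments.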
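
For the machine translation use mentioned above, one possible preparation step is to pair the English text of each message with its Manglish counterpart. This is a sketch under the assumption that records are lists of messages with `content` and `content_manglish` fields, as in the Politics sample; `translation_pairs` is a hypothetical helper and the example record is placeholder text.

```python
from typing import Dict, Iterable, List, Tuple

Message = Dict[str, str]


def translation_pairs(
    rows: Iterable[List[Message]],
    source: str = "content",
    target: str = "content_manglish",
) -> List[Tuple[str, str]]:
    """Collect (source, target) text pairs from every message that has both fields."""
    pairs = []
    for messages in rows:
        for message in messages:
            src, tgt = message.get(source), message.get(target)
            if src and tgt:  # skip messages missing either side
                pairs.append((src, tgt))
    return pairs


# Placeholder record, not real dataset content.
rows = [[
    {"role": "user", "content": "Why?", "content_manglish": "Why ah?"},
    {"role": "assistant", "content": "Because."},
]]
pairs = translation_pairs(rows)
```

Swapping `source` or `target` for `content_ms` would give English-Malay or Malay-Manglish pairs in the same way.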
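Background and challenges
Background overview
The MaLLaM-2.5-Small-Manglish-QA dataset was created by Mesolitica to advance natural language processing research in Malaysia's multilingual environment by generating synthetic Manglish QA pairs. Manglish is a mixed language that blends Malay, English, and local dialects, and it is widely used in everyday communication in Malaysia. The dataset's construction can be traced back to 2023. Its core research question is how to process and understand a non-standard language form such as Manglish effectively, in support of developing and optimizing QA systems. The dataset covers politics, science, mathematics, and other domains, and is an important resource for studying language models in mixed-language settings.
Current challenges
The MaLLaM-2.5-Small-Manglish-QA dataset faced several challenges during construction. First, Manglish is a non-standard language whose grammar and vocabulary are highly flexible, which makes it hard to model accurately with conventional language models. Second, the dataset relies on synthetic generation, so the language of the QA pairs may be unnatural or biased. Third, semantic alignment across the mixed-language versions adds to the complexity of data processing. On the application side, how to use the dataset to improve QA system performance in Manglish settings still requires further exploration and optimization.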
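
As one way to approach the alignment concern above, the sketch below flags pairs whose word counts differ sharply between two language versions. This is a crude heuristic chosen for illustration, not a method described by the dataset authors; the thresholds `low` and `high` and the function names are assumptions, and a length mismatch does not by itself prove a semantic mismatch.

```python
from typing import List, Sequence, Tuple


def length_ratio(source: str, target: str) -> float:
    """Return target word count divided by source word count."""
    src_words = len(source.split())
    tgt_words = len(target.split())
    if src_words == 0:
        # An empty source with a non-empty target is treated as maximally off.
        return float("inf") if tgt_words else 1.0
    return tgt_words / src_words


def flag_misaligned(
    pairs: Sequence[Tuple[str, str]], low: float = 0.5, high: float = 2.0
) -> List[int]:
    """Return indices of pairs whose length ratio falls outside [low, high]."""
    return [
        index
        for index, (source, target) in enumerate(pairs)
        if not low <= length_ratio(source, target) <= high
    ]


# Placeholder pairs: ratios are 0.75, 3.0, and 0.25 respectively.
toy_pairs = [("a b c d", "a b c"), ("a b", "a b c d e f"), ("a b c d", "a")]
flagged = flag_misaligned(toy_pairs)
```

Flagged pairs are candidates for manual review rather than automatic removal, since colloquial rewrites can legitimately be longer or shorter than the source.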
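Common scenarios
Typical use cases
In natural language processing, the MaLLaM-2.5-Small-Manglish-QA dataset is widely used to develop and optimize multilingual question answering systems. By generating synthetic Manglish (Malaysian English) QA pairs, it gives researchers a rich language resource, and it is particularly valuable for multilingual questions in domains such as Malaysian politics, science, and mathematics.
Derived and related work
Building on the MaLLaM-2.5-Small-Manglish-QA dataset, researchers have developed a range of multilingual QA models and language understanding tools. For example, some work focuses on improving translation models between Manglish and standard Malay, while other work explores information retrieval and knowledge graph construction in multilingual settings. These efforts advance multilingual processing techniques and offer new ideas and methods for research on low-resource languages.
Recent research on the dataset
Latest research directions
In recent years, research using the MaLLaM-2.5-Small-Manglish-QA dataset has focused mainly on developing and optimizing multilingual question answering systems. The dataset combines Malay, English, and Manglish (a Malaysian English dialect), giving researchers a rich language resource that is especially useful for QA tasks involving mixed languages and dialects. Frontier work examines how to use the dataset to improve cross-lingual model performance, particularly in low-resource language settings. In addition, as topical events in Malaysian politics and science continue to emerge, the dataset has been widely applied to knowledge QA systems in those domains, further advancing multilingual information retrieval and knowledge graph construction. These studies improve the accuracy and robustness of QA systems and support the wider adoption of multilingual natural language processing.
The content above was collected and summarized by 遇见数据集.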