
guymorlan/levanti

Source: Hugging Face. Updated 2024-07-14; indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/guymorlan/levanti
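Resource description:
The Levanti dataset contains 500K sentences of Levantine Arabic (Palestinian, Syrian, and Lebanese dialects, plus Egyptian), translated into English and Hebrew and augmented with diacritics, Hebrew transliteration, and English transliteration. It consists of 42K real sentences and 466K high-quality synthetic sentences; the synthetic sentences were generated from diverse dictionary entries and appropriate examples to increase the semantic and lexical diversity of the corpus. Diacritics, Hebrew transliteration, and English transliteration were added to 113K Palestinian Arabic sentences. The dataset's columns are: dialect, Arabic sentence, Hebrew translation, English translation, whether the sentence is synthetic, diacritized Arabic sentence, Hebrew transliteration, and English transliteration.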

Levanti is a dataset of 500K sentences in Levantine colloquial Arabic (Palestinian, Syrian, and Lebanese, plus Egyptian), translated to English and Hebrew and augmented with diacritics, Hebrew transliteration, and English transliteration. Levanti is composed of a core of 42K real sentences that were collected, manually translated, and validated, and 466K high-quality synthetic sentences generated with Claude Sonnet 3.5 from diverse dictionary entries and appropriate examples to increase the semantic and lexical diversity of the corpus. Diacritics, Hebrew transliteration, and English transliteration are added to 113K sentences in Palestinian Arabic. Diacritics were generated with Claude Sonnet 3.5 via appropriate prompting and examples, then corrected with an extensive set of manually crafted heuristics. Claude was likewise used as a reasoning engine for applying several of the heuristics. Transliteration is the output of the transliteration model (see below). All English translations were generated with gpt-4o based on the Arabic text and the Hebrew translation.
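As a rough illustration of how the columns described above might be used, the following Python sketch separates real from synthetic sentences, counts rows per dialect, and selects rows that carry diacritics. The column names (`dialect`, `synthesized`, `diacritized`), the `train` split, and the placeholder rows are assumptions made for this example, not taken from the dataset card; check the dataset viewer for the actual schema. `load_levanti` uses the Hugging Face `datasets` library and is defined but not called here, since it needs network access.

```python
from collections import Counter

# ASSUMPTION: these column names are hypothetical, inferred from the column
# description. Verify them against the real dataset before use.
DIALECT = "dialect"
SYNTHETIC = "synthesized"
DIACRITIZED = "diacritized"


def load_levanti(split="train"):
    """Load the corpus from the Hugging Face Hub.

    Requires `pip install datasets` and network access. The split name
    "train" is an assumption.
    """
    from datasets import load_dataset
    return load_dataset("guymorlan/levanti", split=split)


def real_sentences(rows):
    """Keep only rows that are not flagged as synthetic."""
    return [r for r in rows if not r[SYNTHETIC]]


def count_by_dialect(rows):
    """Count rows per dialect label."""
    return Counter(r[DIALECT] for r in rows)


def with_diacritics(rows):
    """Keep only rows that have a non-empty diacritized sentence."""
    return [r for r in rows if r.get(DIACRITIZED)]


# Placeholder rows standing in for dataset records (not real data).
sample = [
    {DIALECT: "palestinian", "arabic": "<arabic 1>", SYNTHETIC: False,
     DIACRITIZED: "<diacritized 1>"},
    {DIALECT: "syrian", "arabic": "<arabic 2>", SYNTHETIC: True,
     DIACRITIZED: None},
    {DIALECT: "palestinian", "arabic": "<arabic 3>", SYNTHETIC: True,
     DIACRITIZED: None},
]

print(len(real_sentences(sample)), "real sentence(s)")
print(dict(count_by_dialect(sample)))
print(len(with_diacritics(sample)), "row(s) with diacritics")
```

The helpers only iterate over rows and read dictionary keys, so they should work equally on the list above and on the object returned by `load_levanti()`, provided the column names match.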
Provider:
guymorlan