
d3b4g/maldivian-latin-script-corpus

Source: Hugging Face. Updated 2026-03-29; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/d3b4g/maldivian-latin-script-corpus
Description:
---
language:
- dv
- en
tags:
- dhivehi
- maldives
- romanized
- latin-script
- code-switching
- social-media
- informal
- comments
- low-resource
size_categories:
- 100K<n<1M
license: cc-by-4.0
task_categories:
- text-classification
pretty_name: Maldivian Latin-Script Corpus
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: comment_id
    dtype: int64
  - name: post_id
    dtype: int64
  - name: date
    dtype: string
  - name: author_name
    dtype: string
  - name: text
    dtype: string
  - name: parent_id
    dtype: int64
  - name: char_count
    dtype: int64
  - name: thaana_ratio
    dtype: float64
  - name: language
    dtype: string
  - name: source
    dtype: string
  - name: token_approx
    dtype: int64
  splits:
  - name: train
    num_bytes: 68012312
    num_examples: 317199
  download_size: 27580917
  dataset_size: 68012312
---

# Maldivian Latin-Script Corpus

A growing collection of Latin-script text from Maldivian online communities, containing Romanized Dhivehi, English, and code-mixed writing. Data is collected from multiple sources and tagged by origin.

## Why this dataset is unique

Maldivians commonly write Dhivehi phonetically in Latin script rather than switching to the Thaana keyboard. This produces text like:

> `"varah reethi vaahaka eh"` → ވަރަށް ރީތި ވާހަކައެއް *(very nice story)*
>
> `"maa salhi, next part avahah up kohdhyba"` → *(so good, please upload next part quickly)*

This **Romanized Dhivehi** uses the same characters as English, so a script-based detector cannot tell the two apart, yet it is a distinct and widely used informal writing system among younger Maldivians.

## Sources

| Source | Type | Records | Added |
|---|---|---|---|
| [esfiya.com](https://esfiya.com) | Fiction story comments | ~314,000 | 2026-03 |

*More sources will be added over time: news site comments, YouTube comments, social media.
Each record carries a `source` field for filtering by origin.*

## Dataset fields

| Field | Description |
|---|---|
| `comment_id` | Unique comment ID from the source platform |
| `post_id` | ID of the article or story this comment belongs to |
| `date` | Comment timestamp (UTC) |
| `author_name` | Username of commenter (`null` if anonymous) |
| `text` | Comment text in Latin script |
| `parent_id` | ID of parent comment if reply, else `0` |
| `char_count` | Character count of text |
| `thaana_ratio` | Ratio of Thaana characters; `0.0` for all current records |
| `language` | Script label; see the language note below |
| `token_approx` | Approximate token count |
| `source` | Origin platform (e.g. `esfiya.com`, `vaguthu.mv`) |

## Statistics (v1.0, March 2026)

| Metric | Value |
|---|---|
| Total records | ~314,000 |
| Sources | 1 (esfiya.com) |
| Date range | October 2012 to March 2026 |
| Thaana ratio | 0.0 (all Latin script) |
| Threaded replies | ~38% of records |
| Avg comment length | ~60 characters |

## Language note

Records in this dataset are written in Latin script and fall into three categories:

- **Romanized Dhivehi**: Dhivehi language written phonetically in Latin letters. Very common among younger Maldivians. Examples: `"varah reethi"`, `"maa salhi"`, `"next part plx"`, `"haadha lahey update vaaleh"`
- **English**: actual English words. Examples: `"really nice story"`, `"keep it up"`, `"waiting for next part"`
- **Mixed**: sentences combining both. Example: `"varah reethi story, really enjoyed it"`

**No automatic classifier can reliably separate Romanized Dhivehi from English** without a model trained specifically on this writing system. The `language` field is set to `latin` for all records. This dataset is itself the primary resource for building such a classifier.

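
To illustrate why a script-based check cannot separate Romanized Dhivehi from English, here is a minimal sketch of a Thaana-ratio computation. The function name `thaana_ratio` mirrors the dataset field, but the card does not document how that field was actually computed. In particular, the choice to exclude whitespace from the denominator is an assumption made for this example. The only external fact relied on is that Thaana occupies the Unicode block U+0780 to U+07BF.

```python
# Thaana occupies the Unicode block U+0780..U+07BF.
THAANA_START, THAANA_END = 0x0780, 0x07BF


def thaana_ratio(text: str) -> float:
    """Share of non-whitespace characters that fall in the Thaana block.

    Assumption: whitespace is excluded from the denominator. The dataset
    card does not say how its own `thaana_ratio` field was computed.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    thaana = sum(1 for c in chars if THAANA_START <= ord(c) <= THAANA_END)
    return thaana / len(chars)


# Example strings taken from the card itself.
samples = {
    "romanized_dhivehi": "varah reethi vaahaka eh",
    "english": "really nice story",
    "thaana": "ވަރަށް ރީތި ވާހަކައެއް",
}

for label, text in samples.items():
    print(f"{label}: {thaana_ratio(text):.2f}")
```

Both Latin-script samples score `0.0`, so the ratio separates Thaana-script text from everything else but says nothing about which Latin-script comments are Dhivehi.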

## Use cases

- **Romanized Dhivehi detection**: train a classifier to identify Romanized Dhivehi vs English in Latin-script text
- **Code-switching research**: study how Maldivians switch between Romanized Dhivehi and English mid-sentence
- **Sentiment analysis**: reader reactions with naturally implied sentiment (fiction comments carry strong emotional signals)
- **Informal language modeling**: the most colloquial, everyday Maldivian writing available in any dataset
- **Social NLP**: threaded reply structure enables conversation and dialogue modeling

## Data collection and cleaning

**v1.0: esfiya.com (March 2026)**

- Filtered to Latin-script only (`thaana_ratio = 0.0`)
- Removed noise comments (fewer than 3 unique characters)
- Removed exact duplicates (same text + same post)
- Anonymous comments retained with `author_name = null`

## Changelog

| Version | Date | Description |
|---|---|---|
| 1.0 | 2026-03 | Initial release: esfiya.com comments |

## Citation

```bibtex
@dataset{maldivian_latin_script_corpus_2026,
  title  = {Maldivian Latin-Script Corpus},
  author = {d3b4g},
  year   = {2026},
  url    = {https://huggingface.co/datasets/d3b4g/maldivian-latin-script-corpus},
  note   = {A growing collection of Latin-script text from Maldivian online communities}
}
```

## License

CC-BY-4.0. Content is user-submitted from the respective source platforms.
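
The cleaning steps listed under "Data collection and cleaning" could be sketched roughly as follows. This is not the author's pipeline. The helper names `is_latin_only`, `is_noise`, and `clean` are hypothetical, and both ignoring whitespace when counting unique characters and keeping the first copy of a duplicate are assumptions the card does not confirm.

```python
from typing import Iterable


def is_latin_only(text: str) -> bool:
    """True when the text contains no Thaana characters (U+0780..U+07BF)."""
    return not any(0x0780 <= ord(c) <= 0x07BF for c in text)


def is_noise(text: str, min_unique: int = 3) -> bool:
    """True when the text has fewer than `min_unique` distinct characters.

    Assumption: whitespace does not count toward the distinct characters.
    """
    return len({c for c in text if not c.isspace()}) < min_unique


def clean(records: Iterable[dict]) -> list[dict]:
    """Apply the three filters in order, keeping the first copy of a duplicate."""
    seen = set()  # (post_id, text) pairs already kept
    kept = []
    for rec in records:
        text = rec["text"]
        if not is_latin_only(text) or is_noise(text):
            continue
        key = (rec["post_id"], text)
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept


# Toy records for illustration only; they are not taken from the dataset.
rows = [
    {"post_id": 1, "text": "varah reethi", "author_name": None},
    {"post_id": 1, "text": "varah reethi", "author_name": "reader"},  # same text, same post
    {"post_id": 2, "text": "varah reethi", "author_name": None},  # same text, other post
    {"post_id": 1, "text": "aaaa", "author_name": None},  # one distinct character
    {"post_id": 1, "text": "ވަރަށް ރީތި", "author_name": None},  # Thaana script
]
cleaned = clean(rows)
print([(r["post_id"], r["text"]) for r in cleaned])
```

Duplicates are keyed on `(post_id, text)`, so the same comment text on a different post survives, and anonymous records (`author_name` of `None`) pass through untouched.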

Provider:
d3b4g