smcproject/ml-wiki-sentences
Hugging Face · updated 2026-03-13 · indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/smcproject/ml-wiki-sentences
Resource description:
---
license: cc-by-sa-3.0
language:
- ml
size_categories:
- 1M<n<10M
---
# ml-wiki-sentences
Malayalam Wikipedia sentences extracted from article text, segmented into individual sentences.
## Dataset Description
This dataset contains sentence-segmented text from Malayalam (ml) Wikipedia articles. Each row represents a single sentence extracted from Wikipedia articles, with metadata linking it back to the source article.
### Data Source
- **Source**: Malayalam Wikipedia (ml.wikipedia.org)
- **Dump Date**: March 2025
- **Original Dump**: Wikimedia Enterprise HTML dumps (20250320)
- **Articles Processed**: 88,832 articles
- **Total Sentences**: 2,250,219 sentences
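As a quick sanity check on the figures above, the following sketch derives the average number of sentences per article and confirms that the sentence count falls in the `1M<n<10M` size category declared in the card metadata. The constant names are illustrative only; the two counts are taken from the list above.

```python
# Counts as stated in the Data Source list.
ARTICLES = 88_832
SENTENCES = 2_250_219

# Average sentences per article (roughly 25).
avg_sentences_per_article = SENTENCES / ARTICLES

# The card metadata declares size_categories: 1M<n<10M.
in_declared_bucket = 1_000_000 < SENTENCES < 10_000_000

print(f"{avg_sentences_per_article:.2f} sentences per article on average")
print(f"within 1M<n<10M: {in_declared_bucket}")
```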
### Language
- **Language**: Malayalam (ml)
- **Script**: Malayalam (ISO 15924: Mlym)
## Dataset Schema
| Column | Type | Description |
|--------|------|-------------|
| `id` | int64 | Wikipedia article ID |
| `url` | string | Full URL to the Wikipedia article |
| `name` | string | Article title |
| `sentence` | string | Individual sentence text |
| `sentence_index` | int32 | Position of this sentence within the article (0-based) |
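Because each row carries the article `id` and a 0-based `sentence_index`, an article's sentences can be put back in order by grouping on `id` and sorting on `sentence_index`. The sketch below illustrates this on in-memory rows; the function name `group_sentences` and the sample rows are hypothetical and are not real dataset content.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def group_sentences(rows: Iterable[Mapping]) -> dict[int, list[str]]:
    """Group rows by article `id` and order each group by `sentence_index`."""
    by_article: dict[int, list[tuple[int, str]]] = defaultdict(list)
    for row in rows:
        by_article[row["id"]].append((row["sentence_index"], row["sentence"]))
    return {
        article_id: [text for _, text in sorted(pairs, key=lambda p: p[0])]
        for article_id, pairs in by_article.items()
    }


# Hypothetical rows shaped like the schema above (placeholder values).
sample_rows = [
    {"id": 7, "url": "https://ml.wikipedia.org/wiki/Example",
     "name": "Example", "sentence": "Second.", "sentence_index": 1},
    {"id": 9, "url": "https://ml.wikipedia.org/wiki/Other",
     "name": "Other", "sentence": "Only.", "sentence_index": 0},
    {"id": 7, "url": "https://ml.wikipedia.org/wiki/Example",
     "name": "Example", "sentence": "First.", "sentence_index": 0},
]

grouped = group_sentences(sample_rows)
print(grouped[7])  # sentences of article 7 in their original order
```

Rows loaded with the Hugging Face `datasets` library, for example via `load_dataset("smcproject/ml-wiki-sentences", split="train")`, could be passed to the same function; the `train` split name is an assumption and is not stated in this card.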
## Dataset Creation
1. Downloaded enterprise HTML dumps from Wikimedia
2. Extracted article HTML and converted to Parquet format
3. Used tree-sitter-html (Rust) to extract plain text from HTML
4. Used sentencex (Rust) for sentence segmentation
Source code: https://github.com/santhoshtr/wikisentences
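The last two steps of the pipeline above can be approximated in a few lines of Python. This is only a sketch of the idea, not the project's implementation: it swaps the standard library's `html.parser` in for tree-sitter-html and a naive punctuation-based regex in for sentencex, and all class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "li", "br", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            # Keep text from adjacent block elements from running together.
            self.parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace runs left behind by removed markup.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


def naive_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace (naive stand-in)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def article_to_rows(article_id: int, url: str, name: str, html: str) -> list[dict]:
    """Produce rows in the shape of the dataset schema for one article."""
    sentences = naive_split(html_to_text(html))
    return [
        {"id": article_id, "url": url, "name": name,
         "sentence": s, "sentence_index": i}
        for i, s in enumerate(sentences)
    ]
```

A punctuation-only splitter like this mishandles abbreviations, numbers, and quoted text, which is the kind of problem a dedicated segmenter such as sentencex is meant to address.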
## Licensing
**Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)**
This dataset is derived from Wikipedia content, which is available under CC BY-SA 3.0. The dataset is distributed under the same terms.
See: https://creativecommons.org/licenses/by-sa/3.0/
## Citation
If you use this dataset in your research or projects, please cite:
```
ml-wiki-sentences Dataset (2025). Malayalam Wikipedia sentences.
Available at: https://huggingface.co/datasets/smcproject/ml-wiki-sentences
```
## Contact
For questions or issues, please open an issue on the repository.
---
*Dataset automatically generated from Malayalam Wikipedia dumps.*
Provider:
smcproject



