smcproject/ml-wiki-sentences

Name: smcproject/ml-wiki-sentences
Creator: smcproject
Published: 2026-03-13 10:37:53
License: CC BY-SA 3.0

Source: Hugging Face. Updated 2026-03-13; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/smcproject/ml-wiki-sentences


Description:

---
license: cc-by-sa-3.0
language:
- ml
size_categories:
- 1M<n<10M
---

# ml-wiki-sentences

[![Dataset on Hugging Face](https://img.shields.io/badge/Dataset-Hugging%20Face-blue)](https://huggingface.co/datasets/smcproject/ml-wiki-sentences)

Malayalam Wikipedia sentences extracted from article text, segmented into individual sentences.

## Dataset Description

This dataset contains sentence-segmented text from Malayalam (ml) Wikipedia articles. Each row represents a single sentence extracted from Wikipedia articles, with metadata linking it back to the source article.

### Data Source

- **Source**: Malayalam Wikipedia (ml.wikipedia.org)
- **Dump Date**: March 2025
- **Original Dump**: Wikimedia Enterprise HTML dumps (20250320)
- **Articles Processed**: 88,832 articles
- **Total Sentences**: 2,250,219 sentences

### Language

- **Language**: Malayalam (ml)
- **Script**: Malayalam (ml)

## Dataset Schema

| Column | Type | Description |
|--------|------|-------------|
| `id` | int64 | Wikipedia article ID |
| `url` | string | Full URL to the Wikipedia article |
| `name` | string | Article title |
| `sentence` | string | Individual sentence text |
| `sentence_index` | int32 | Position of this sentence within the article (0-based) |

## Dataset Creation

1. Downloaded enterprise HTML dumps from Wikimedia
2. Extracted article HTML and converted to Parquet format
3. Used tree-sitter-html (Rust) to extract plain text from HTML
4. Used sentencex (Rust) for sentence segmentation

Source code: https://github.com/santhoshtr/wikisentences

## Licensing

**Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)**

This dataset is derived from Wikipedia content, which is available under CC BY-SA 3.0. The source text is licensed under the same terms.

See: https://creativecommons.org/licenses/by-sa/3.0/

## Citation

If you use this dataset in your research or projects, please cite:

```
ml-wiki-sentences Dataset (2025). Malayalam Wikipedia sentences.
Available at: https://huggingface.co/datasets/smcproject/ml-wiki-sentences
```

## Contact

For questions or issues, please open an issue on the repository.

---

*Dataset automatically generated from Malayalam Wikipedia dumps.*
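
## Example: Rebuilding Article Text

Because each row holds one sentence plus its `id` and 0-based `sentence_index`, article-level text can in principle be recovered by grouping rows on `id` and ordering them by `sentence_index`. The following is a minimal sketch of that idea, not part of the dataset's tooling. The function name `rebuild_articles`, the `SentenceRow` type, and the sample rows are all hypothetical, and joining sentences with a single space is an assumption about how the original text was spaced.

```python
from collections import defaultdict
from typing import Iterable, TypedDict


class SentenceRow(TypedDict):
    """One row, mirroring the columns in the Dataset Schema table."""
    id: int
    url: str
    name: str
    sentence: str
    sentence_index: int


def rebuild_articles(rows: Iterable[SentenceRow]) -> dict[int, str]:
    """Group rows by article id and join sentences in sentence_index order.

    Hypothetical helper: joining with a single space is an assumption.
    """
    grouped: dict[int, list[tuple[int, str]]] = defaultdict(list)
    for row in rows:
        grouped[row["id"]].append((row["sentence_index"], row["sentence"]))
    return {
        article_id: " ".join(
            text for _, text in sorted(pairs, key=lambda pair: pair[0])
        )
        for article_id, pairs in grouped.items()
    }


# Hypothetical sample rows with placeholder values (not real dataset content).
sample_rows: list[SentenceRow] = [
    {"id": 1, "url": "https://ml.wikipedia.org/wiki/A", "name": "A",
     "sentence": "Second.", "sentence_index": 1},
    {"id": 2, "url": "https://ml.wikipedia.org/wiki/B", "name": "B",
     "sentence": "Only.", "sentence_index": 0},
    {"id": 1, "url": "https://ml.wikipedia.org/wiki/A", "name": "A",
     "sentence": "First.", "sentence_index": 0},
]

articles = rebuild_articles(sample_rows)
print(articles[1])
```

The helper accepts any iterable of dict-like rows with these five columns, so rows read from the Parquet files by whichever loader you prefer should work, provided the column names match the schema table.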


Provider:

smcproject
