YanFren/Hebrew_wikipedia
Source: Hugging Face. Updated: 2024-11-25. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/YanFren/Hebrew_wikipedia
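Description:
This dataset is a JSON collection of the Hebrew Wikipedia. It was built by crawling Hebrew Wikipedia pages, collecting links with a breadth-first search strategy, and then cleaning and organizing the data. The dataset file is in JSONL format and contains the page ID, page name, URL, summary, and paragraphs.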
This dataset is a collection of JSON records covering most of the Hebrew Wikipedia. The data was collected by crawling Hebrew Wikipedia with a breadth-first search strategy to gather all redirect links. Each page's links were collected as tuples, after which the page content, including paragraphs and page summaries, was scraped and cleaned. The dataset file is in JSONL format and contains a unique identifier, the page name, the page URL, the uncleaned page summary (which may include LaTeX or other languages), and the cleaned paragraphs (containing only Hebrew and English characters).
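The description above mentions two mechanics: a breadth-first traversal over link tuples, and a JSONL file with one record per page. The following is a minimal Python sketch of both, not the dataset author's actual code. The field names (`id`, `name`, `url`, `summary`, `paragraphs`), the function names, and the toy page names are assumptions made for illustration; check the real file for its actual keys.

```python
import json
from collections import defaultdict, deque


def bfs_pages(start, link_tuples):
    """Return page names in breadth-first order, starting from `start`.

    `link_tuples` is a list of (source_page, target_page) pairs.
    """
    graph = defaultdict(list)
    for source, target in link_tuples:
        graph[source].append(target)

    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for neighbor in graph[page]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


def parse_jsonl(text):
    """Parse JSONL text (one JSON object per non-empty line) into dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Toy link graph standing in for crawled pages (hypothetical names).
links = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "A")]
order = bfs_pages("A", links)

# Build a small JSONL sample using the assumed field names.
sample = "\n".join(
    json.dumps(
        {
            "id": i,
            "name": name,
            "url": "https://he.wikipedia.org/wiki/" + name,
            "summary": "uncleaned summary of " + name,
            "paragraphs": ["cleaned paragraph of " + name],
        }
    )
    for i, name in enumerate(order)
)
records = parse_jsonl(sample)
```

Reading the real file works the same way: open it with `encoding="utf-8"` and pass its contents to `parse_jsonl`, or iterate over the file line by line to avoid loading everything into memory.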
Provider:
YanFren



