Threatthriver/Hindi-story-news

Name: Threatthriver/Hindi-story-news
Creator: Threatthriver
Published: 2024-06-22 11:44:38
License: no description available

Source: Hugging Face. Updated 2024-06-22; indexed 2024-06-25.

Download link:

https://hf-mirror.com/datasets/Threatthriver/Hindi-story-news

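Resource summary:

This dataset contains Hindi text scraped from multiple websites and stored as JSON. It is intended for natural language processing tasks such as language modeling, text generation, and sentiment analysis. The dataset holds at least 30 MB of text, drawn from news articles, literary works, and other web pages. Each JSON entry contains the page URL, the page title, and a list of paragraphs.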

Provided by:

Threatthriver

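Summary of original information

Hindi Web Content Dataset

Overview

This dataset contains Hindi text scraped from a variety of websites. The data was collected with a domain-restricted crawler that extracts text only from specified domains. It includes content from news articles, literary works, and other web pages. The scraped text is stored in JSON format and is intended for natural language processing (NLP) tasks such as language modeling, text generation, and sentiment analysis.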

License

MIT

Task categories

Question answering
Text classification
Text generation

Dataset details

Size: at least 30 MB of text data
Language: Hindi
Format: JSON
Source domains:

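Dataset structure

The dataset is stored in a single JSON file named scraped_data.json. Each entry in the file corresponds to one web page and contains the following fields:

url: the URL of the web page.
title: the title of the web page.
paragraphs: a list of paragraphs extracted from the web page.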

Example entry

    {
      "url": "https://example.com/article",
      "title": "Example Article Title",
      "paragraphs": [
        "This is the first paragraph of the article.",
        "This is the second paragraph of the article.",
        // More paragraphs...
      ]
    }
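Before processing a file like this, it can help to check that each entry really has the three fields described above. The following is a minimal sketch, not part of the dataset itself: the helper name validate_entry and the REQUIRED_FIELDS table are hypothetical, and it assumes that url and title are strings and that paragraphs is a list of strings.

```python
from typing import Any

# Assumed field types, based on the field descriptions above.
REQUIRED_FIELDS = {"url": str, "title": str, "paragraphs": list}


def validate_entry(entry: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")

    # Each item in the paragraphs list is expected to be a string.
    paragraphs = entry.get("paragraphs")
    if isinstance(paragraphs, list):
        for i, paragraph in enumerate(paragraphs):
            if not isinstance(paragraph, str):
                problems.append(f"paragraphs[{i}] is not a string")
    return problems


example = {
    "url": "https://example.com/article",
    "title": "Example Article Title",
    "paragraphs": [
        "This is the first paragraph of the article.",
        "This is the second paragraph of the article.",
    ],
}
print(validate_entry(example))        # []
print(validate_entry({"url": "x"}))   # reports the two missing fields
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a scraped entry at once.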

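How to use

Loading the dataset

You can load the JSON file with a standard Python library such as json, or with a data processing library such as pandas.

    import json

    # Read the whole file; UTF-8 is needed for the Hindi text.
    with open("scraped_data.json", "r", encoding="utf-8") as file:
        data = json.load(file)

    # Collect every paragraph from every entry into one flat list.
    all_paragraphs = []
    for entry in data:
        url = entry["url"]
        title = entry["title"]
        paragraphs = entry["paragraphs"]
        for paragraph in paragraphs:
            all_paragraphs.append(paragraph)

After this loop, all_paragraphs holds every paragraph from every entry as one flat list.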
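For language modeling or text generation, one document per web page is often more convenient than a flat list of paragraphs. The sketch below shows one possible way to build that, assuming the top level of the file is a list of entries as in the loop above. The helpers load_entries and entry_to_document are hypothetical names, and the tiny Hindi sample written to a temporary file only stands in for the real scraped_data.json.

```python
import json
import tempfile
from pathlib import Path


def load_entries(path):
    """Load the list of entries from a UTF-8 encoded JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def entry_to_document(entry, separator="\n\n"):
    """Join the title and the non-empty paragraphs of one entry into one string."""
    parts = [entry.get("title", "")] + list(entry.get("paragraphs", []))
    return separator.join(p.strip() for p in parts if p and p.strip())


# Stand-in data with the same shape as the dataset entries (not real dataset content).
sample = [
    {
        "url": "https://example.com/article",
        "title": "उदाहरण शीर्षक",
        "paragraphs": ["पहला अनुच्छेद।", "  ", "दूसरा अनुच्छेद।"],
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scraped_data.json"
    # ensure_ascii=False keeps the Devanagari text readable in the file.
    path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")
    documents = [entry_to_document(e) for e in load_entries(path)]

print(len(documents))  # 1
```

Whitespace-only paragraphs are dropped here, which is a design choice for this sketch rather than something the dataset prescribes.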

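Loading the dataset with Hugging Face

You can also load this dataset directly with the Hugging Face datasets library. The dataset identifier is Threatthriver/Hindi-story-news. Example code:

    from datasets import load_dataset

    ds = load_dataset("Threatthriver/Hindi-story-news")

When called without a split argument, load_dataset returns a datasets.DatasetDict keyed by split name. Index it by split name, or pass split=... to load_dataset, to get a datasets.Dataset object that you can then use for various NLP tasks.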

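License

Data scraped from the source websites is subject to their respective terms and copyright policies. Users of this dataset must ensure that their use complies with those terms and respects the intellectual property rights of the content owners.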

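Acknowledgements

We thank the content creators and website owners for their efforts in providing valuable information in Hindi. Their contribution to NLP research and applications for regional languages is invaluable.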

Contact

If you have any questions or concerns about this dataset, please feel free to contact us.
