recursal/Devopedia

Hugging Face, updated 2024-06-10, indexed 2024-06-29
Download link:
https://hf-mirror.com/datasets/recursal/Devopedia
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- crowdsourced
license:
- cc-by-sa-3.0
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
source_datasets:
- original
language:
- en
configs:
- config_name: default
  data_files:
  - split: articles
    path: data/dev_files.jsonl
  - split: index
    path: data/dev_index.json
pretty_name: Devopedia
---

# Dataset Card for Devopedia

![](HusbandoDevopedia.png "Kaito is a passionate developer and one of the core contributors to Devopedia. He has a background in software engineering and a deep commitment to making technology accessible to everyone. Inspired by the rapid pace of technological change and the challenges it poses to developers, Kaito dedicates his time to creating clear, concise, and unbiased articles on Devopedia. Kaito's approachable demeanor and modern style make him a relatable and inspiring figure for developers at all stages of their journey.")

*~~Waifu~~ Husbando to catch your attention.*

### Dataset Description

*Devopedia* is a **~1.15M**-token (llama-2-7b-chat-tokenizer) / **~999.32K**-token (RWKV Tokenizer) scrape of [Devopedia](https://devopedia.org/). It serves as a training resource for large language models and other NLP tasks. This card details the dataset's origin, content, and limitations.

- **Curated by:** KaraKaraWitch
- **Funded by:** Recursal.ai (I work there lol)
- **Shared by:** KaraKaraWitch
- **Language(s) (NLP):** English
- **License:** cc-by-sa-4.0

Devopedia was created under time constraints for the release of [EagleX v1](https://huggingface.co/recursal/EagleX_1-7T_HF), and may contain biases in selection.

### Supported Tasks and Leaderboards

Primarily used for language modeling.

### Languages

While the dataset is focused on English, keep in mind that other languages are present as well.

### Processing and Filtering

We first scraped Devopedia for a list of articles and wrote them to a compiled file, `dev_index.json`, before scraping each individual article for its page contents. The article contents are then selected by section. Each section is converted to Markdown, including the appropriate title.

No filtering was done over the dataset.

### Data Instances

Refer to the following sample:

```json
{
  "text": "# Hypothesis Testing and Types of Errors\n\n## Summary\n\n\nSuppose we want to study income of a population. We study a sample from the population and draw conclusions. The sample should represent the population for our study to be a reliable one.\n\n**Null hypothesis** \\((H\\_0)\\) is that sample represents population. Hypothesis testing provides us with framework to conclude if we have sufficient evidence to either accept or reject null hypothesis. \n\nPopulation characteristics are either assumed or drawn from third-party sources or judgements by subject matter experts. Population data and sample data are characterised by moments of its distribution (mean, variance, skewness and kurtosis). We test null hypothesis for equality of moments where population characteristic is available and conclude if sample represents population.\n\nFor example, given only mean income of population, <TRUNCATED...>"
}
```

### Data Keys

Each JSON line is a dictionary with a `text` str.

## Recursal's Vision

> To make AI accessible to everyone, regardless of language, or economical status

This is the collective goal of the `RWKV Open Source foundation` and `Recursal AI`, the commercial entity that backs it.

We believe that AI should not be controlled by a select few organizations, and that it should be made accessible regardless of whether you are rich or poor, or a native speaker of English.

### About RWKV

RWKV is an open source, non-profit group under the Linux Foundation, focused on developing the RWKV AI architecture in accordance with our vision.

The RWKV architecture scales efficiently and economically. As an RNN & Transformer hybrid, it provides performance similar to leading transformer models while retaining the compute and energy efficiency of an RNN-based architecture.

You can find out more about the project and the latest models at the following:

- [https://blog.rwkv.com](https://blog.rwkv.com)
- [https://wiki.rwkv.com](https://wiki.rwkv.com)

### About Recursal AI

Recursal AI is the commercial entity built to provide support for RWKV model development and users, while providing commercial services via its public cloud or private-cloud / on-premise offerings.

As part of our vision, our commitment is to ensure open source development of, and access to, the best foundational AI models and datasets.

The datasets and models provided here are part of that commitment.

You can find out more about Recursal AI here:

- [https://recursal.ai](https://recursal.ai)
- [https://blog.recursal.ai](https://blog.recursal.ai)

### Dataset Curators

KaraKaraWitch. (I typically hang out in the PygmalionAI Discord, sometimes EleutherAI. If something is wrong, ping `@karakarawitch` on Discord.)

I'd be happy if you could spread the word and recommend this dataset.

### Licensing Information

Devopedia lists their content as under CC-BY-SA.

Recursal Waifus [Husbandos] (the banner image) are licensed under CC-BY-SA. They do not represent the related websites in any official capacity unless otherwise announced by the website. You may use them as a banner image. However, you must always link back to the dataset.

### Citation Information

```
@misc{Devopedia,
  title = {Devopedia},
  author = {KaraKaraWitch, recursal.ai},
  year = {2024},
  howpublished = {\url{https://huggingface.co/datasets/recursal/Devopedia}},
}
```
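The card's Data Keys section says each JSON line is a dictionary with a `text` string, and its `configs` block maps the `articles` split to `data/dev_files.jsonl`. As a minimal sketch of what reading such a file could look like with only the Python standard library, the example below parses a small stand-in file. The helper names `iter_article_texts` and `write_demo_file`, and the two demo records, are assumptions made for illustration and are not part of the dataset.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_article_texts(path: Path) -> Iterator[str]:
    """Yield the `text` field of each non-empty JSON line in a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            yield record["text"]


def write_demo_file(directory: Path) -> Path:
    """Write a two-record stand-in for data/dev_files.jsonl (made-up content)."""
    demo = directory / "dev_files.jsonl"
    rows = [
        {"text": "# First Article\n\n## Summary\n\nBody one."},
        {"text": "# Second Article\n\n## Summary\n\nBody two."},
    ]
    demo.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    return demo


with tempfile.TemporaryDirectory() as tmp:
    demo_path = write_demo_file(Path(tmp))
    texts = list(iter_article_texts(demo_path))

print(len(texts))                  # number of records read
print(texts[0].splitlines()[0])    # first line of the first record (its title)
```

Reading line by line keeps memory use flat, which matters little at this dataset's size but carries over to larger JSONL dumps. With a downloaded copy of the real file, pointing `iter_article_texts` at its local path should be the only change needed, assuming the file matches the card's description.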
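Provider:
recursal
Summary of original information

Dataset overview

Basic information

  • Dataset name: Devopedia
  • Size: ~1.15M tokens (llama-2-7b-chat-tokenizer) / ~999.32K tokens (RWKV Tokenizer)
  • Language: English
  • License: cc-by-sa-4.0
  • Creator: KaraKaraWitch
  • Funding: Recursal.ai

Dataset description

Devopedia is a dataset scraped from the Devopedia website, intended mainly for training large language models and for other natural language processing tasks. It contains the text of the articles, with each section converted to Markdown, and no filtering was applied.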
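The scraper itself is not included in the dataset card, so the following is only a hypothetical sketch of the "each section converted to Markdown, with its title" step. It assumes the article title becomes a `#` heading and each section heading a `##` heading, which is the shape visible in the sample record. The names `Section` and `sections_to_markdown` are invented for illustration, and the sketch does not try to reproduce the exact blank-line pattern of the real data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    heading: str  # e.g. "Summary"
    body: str     # section content, assumed to be Markdown already


def sections_to_markdown(title: str, sections: List[Section]) -> str:
    """Join an article title and its sections into one Markdown string."""
    parts = [f"# {title}"]
    for section in sections:
        parts.append(f"## {section.heading}")
        parts.append(section.body.strip())
    # Separate every heading and body with one blank line.
    return "\n\n".join(parts)


article = sections_to_markdown(
    "Hypothesis Testing and Types of Errors",
    [Section("Summary", "Suppose we want to study income of a population.")],
)
print(article)
```

The resulting string is what would be stored under the `text` key of one record.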

Supported tasks

Primarily used for language modeling.

Data instances

Each instance in the dataset is a JSON object with a text field. For example:

{ "text": "# Hypothesis Testing and Types of Errors\n\n## Summary\n\n\nSuppose we want to study income of a population. We study a sample from the population and draw conclusions. The sample should represent the population for our study to be a reliable one.\n\n**Null hypothesis** \\((H\\_0)\\) is that sample represents population. Hypothesis testing provides us with framework to conclude if we have sufficient evidence to either accept or reject null hypothesis. \n\nPopulation characteristics are either assumed or drawn from third-party sources or judgements by subject matter experts. Population data and sample data are characterised by moments of its distribution (mean, variance, skewness and kurtosis). We test null hypothesis for equality of moments where population characteristic is available and conclude if sample represents population.\n\nFor example, given only mean income of population, <TRUNCATED...>" }
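Since each record is a single Markdown string with a `#` title and `##` section headings, a consumer may want to split it back into sections. The sketch below shows one possible way, using a shortened copy of the sample record. The function name `split_sections` is an assumption, and the approach is deliberately naive: it would misread heading-like lines inside code fences, and a repeated heading would overwrite the earlier section.

```python
import json
from typing import Dict, Tuple


def split_sections(text: str) -> Tuple[str, Dict[str, str]]:
    """Return (title, {section heading: body}) for one Markdown record."""
    title = ""
    sections: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            # A level-2 heading starts a new section.
            current = line[3:].strip()
            sections[current] = ""
        elif line.startswith("# ") and not title:
            # The first level-1 heading is taken as the article title.
            title = line[2:].strip()
        elif current is not None:
            sections[current] += line + "\n"
    return title, {name: body.strip() for name, body in sections.items()}


# One JSONL line, shortened from the sample record shown above.
raw_line = json.dumps({
    "text": "# Hypothesis Testing and Types of Errors\n\n## Summary\n\n\n"
            "Suppose we want to study income of a population."
})
record = json.loads(raw_line)
title, sections = split_sections(record["text"])
print(title)
print(sections["Summary"])
```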

Citation information

@misc{Devopedia,
  title = {Devopedia},
  author = {KaraKaraWitch, recursal.ai},
  year = {2024},
  howpublished = {\url{https://huggingface.co/datasets/recursal/Devopedia}},
}
