kwassl-ai/noah-text-10BT

Name: kwassl-ai/noah-text-10BT
Creator: kwassl-ai
Published: 2026-03-03 09:30:30
License: No description available

Hugging Face · Updated 2026-03-03 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/kwassl-ai/noah-text-10BT

Resource description:

---
license: odc-by
task_categories:
- text-generation
language:
- en
pretty_name: NoAH Text 10BT
---

# Dataset Card for NoAH Text

First 10B Tokens of [NoAH Text](https://huggingface.co/datasets/kwassl-ai/noah-text)

## Dataset Details

### Dataset Description

NoAH Text is a curated dataset of English spoken and written text, designed for training language models without attribution requirements. It consists of public domain works and content released under permissive licenses: **CC0**, **MIT-0**, and **The Unlicense**. By aggregating and filtering the Common Corpus, NoAH provides a clean, legally unambiguous resource for research and development in natural language processing. This dataset is a sample of 10B tokens from the full NoAH Text dataset.

- **Language(s) (NLP):** English
- **License:** [Open Data Commons Attribution License](https://opendatacommons.org/licenses/by/)

## Uses

## Dataset Structure

Dataset Fields:

- `identifier`: unique text identifier. In many cases, this is also the link to the original resource.
- `collection`: name of one of the XX sub-collections curated for Common Corpus.
- `open type`: one of the six leading collection groupings:
- `license`: sharing rights for the content: either uncopyrighted (public domain, US federal public domain, CC0 on Wikidata) or under various free licenses (Creative Commons, MIT, French Licence ouverte, etc.)
- `date`: date of creation of the resource, where known. Due to the significance of public domain and other cultural heritage content, more than half of Common Corpus predates the 21st century.
- `title`: title of the resource when known, or alternatively the filename.
- `creator`: institution publishing, collecting, or curating the resource.
- `language`: automatically identified language.
- `word_count`: number of space-delimited words.
- `token_count`: number of tokens as calculated by the official Pleias tokenizer.
- `text`: full text, without formatting.

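To make the field descriptions concrete, here is a minimal Python sketch that works on a few in-memory records. Only the field names (`identifier`, `text`, `word_count`, `token_count`) come from the list above; the record values, the helper names `count_words` and `take_token_budget`, and the idea of cutting a prefix by token budget are illustrative assumptions, not a description of how the 10B-token sample was actually built.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical records: only the field names come from the dataset card;
# the values are made up for illustration.
SAMPLE_RECORDS = [
    {"identifier": "example-1", "text": "A short public domain text.",
     "word_count": 5, "token_count": 7},
    {"identifier": "example-2", "text": "Another small sample.",
     "word_count": 3, "token_count": 4},
    {"identifier": "example-3", "text": "One more record for the demo.",
     "word_count": 6, "token_count": 6},
]


def count_words(text: str) -> int:
    """Count whitespace-delimited words, mirroring the `word_count` description."""
    return len(text.split())


def take_token_budget(records: Iterable[Dict], budget: int) -> Iterator[Dict]:
    """Yield records in order until adding the next one would exceed `budget` tokens.

    Assumption: this is just one simple way to take a prefix of a corpus by
    token count; the card does not say how the sample was cut.
    """
    total = 0
    for record in records:
        if total + record["token_count"] > budget:
            break
        total += record["token_count"]
        yield record


if __name__ == "__main__":
    for record in take_token_budget(SAMPLE_RECORDS, budget=11):
        print(record["identifier"], count_words(record["text"]), record["token_count"])
```

Note that `token_count` depends on the tokenizer named in the card, so it cannot be recomputed from `text` without that tokenizer, whereas `word_count` can be checked with a plain whitespace split.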
## Dataset Creation

### Curation Rationale

With NoAH, we aim to address the growing need for high-quality, legally unambiguous text corpora in the NLP community. Many existing datasets are encumbered by complex licensing terms, creating barriers for researchers and developers who require clear, attribution-free resources. By focusing exclusively on public domain works and permissive licenses (CC0, MIT-0, and The Unlicense), NoAH eliminates legal uncertainty and simplifies compliance, enabling unrestricted use in training language models.

### Source Data

Contains information from [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus) by PleIAs.

#### Personal and Sensitive Information

Beyond filtering by license and language type, we have not processed the data any further. For this reason, any personally identifiable information (PII) present in the original dataset may still be present in this dataset.

## Bias, Risks, and Limitations

Beyond filtering by license and language type, we have not processed the data any further. For this reason, any bias present in the original dataset may still be present in this dataset.
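The license and language filtering mentioned above can be pictured with a short Python sketch. This is not the authors' pipeline: the exact label strings in `ATTRIBUTION_FREE_LICENSES`, the `"English"` language value, and the helper names are all assumptions made for illustration, and the real `license` and `language` values in Common Corpus may be spelled differently.

```python
from typing import Dict, List

# Assumption: these label strings are illustrative; the actual `license`
# values in the source data may be spelled differently.
ATTRIBUTION_FREE_LICENSES = {"Public Domain", "CC0", "MIT-0", "The Unlicense"}
TARGET_LANGUAGE = "English"  # assumption about how the `language` field is encoded

# Hypothetical candidate records with made-up values.
CANDIDATES = [
    {"identifier": "a", "license": "CC0", "language": "English"},
    {"identifier": "b", "license": "CC-By", "language": "English"},
    {"identifier": "c", "license": "Public Domain", "language": "French"},
    {"identifier": "d", "license": "MIT-0", "language": "English"},
]


def is_attribution_free(record: Dict) -> bool:
    """Keep a record only if its license needs no attribution and it is in the target language."""
    return (
        record.get("license") in ATTRIBUTION_FREE_LICENSES
        and record.get("language") == TARGET_LANGUAGE
    )


def filter_records(records: List[Dict]) -> List[Dict]:
    """Return the records that pass the license and language check, in order."""
    return [record for record in records if is_attribution_free(record)]


if __name__ == "__main__":
    print([record["identifier"] for record in filter_records(CANDIDATES)])
```

Because a filter like this looks only at metadata, it leaves the `text` field untouched, which is consistent with the note that PII and bias from the source data may remain.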

Provided by:

kwassl-ai