kwassl-ai/noah-text-10BT
Hugging Face | Updated 2026-03-03 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/kwassl-ai/noah-text-10BT
Resource description:
---
license: odc-by
task_categories:
- text-generation
language:
- en
pretty_name: NoAH Text 10BT
---
# Dataset Card for NoAH Text
The first 10B tokens of [NoAH Text](https://huggingface.co/datasets/kwassl-ai/noah-text).
## Dataset Details
### Dataset Description
NoAH Text is a curated dataset of English spoken and written text, designed for training language models without attribution requirements.
It consists of public domain works and content released under permissive licenses: **CC0**, **MIT-0**, and **The Unlicense**. By aggregating and filtering the Common Corpus,
NoAH provides a clean, legally unambiguous resource for research and development in natural language processing.
This dataset is a 10B-token sample of the full NoAH Text dataset.
- **Language(s) (NLP):** English
- **License:** [Open Data Commons Attribution License](https://opendatacommons.org/licenses/by/)
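The card does not say how the 10B-token subset was cut from the full dataset. As a rough illustration only, the sketch below takes records in order until a token budget is reached, using the `token_count` field described under Dataset Structure. The function name `take_token_budget` is hypothetical, and including the record that crosses the budget is an assumption.

```python
from typing import Any, Iterable, Iterator, Mapping


def take_token_budget(
    records: Iterable[Mapping[str, Any]], budget: int
) -> Iterator[Mapping[str, Any]]:
    """Yield records in order until their cumulative `token_count` reaches `budget`.

    Assumption: the record that crosses the budget is still yielded, so the
    total may slightly exceed `budget`.
    """
    used = 0
    for rec in records:
        if used >= budget:
            break
        # Treat a missing or null token_count as zero tokens.
        used += int(rec.get("token_count") or 0)
        yield rec
```

Because the function accepts any iterable of dict-like records, it can be tried on a handful of in-memory records before being pointed at a larger stream.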
## Uses
## Dataset Structure
Dataset Fields:
- `identifier`: unique text identifier. In many cases, this is also the link to the original resources.
- `collection`: name of one of the XX sub-collections curated for Common Corpus.
- `open type`: one of the six leading collection groupings.
- `license`: sharing rights for the content, either uncopyrighted (public domain, US federal public domain, CC0 on Wikidata) or released under various free licenses (Creative Commons, MIT, French Licence ouverte, etc.).
- `date`: date of creation of the resource where known. Due to the significance of public domain and other cultural heritage content, more than half of Common Corpus predates the 21st century.
- `title`: title of the resource when known; otherwise the filename.
- `creator`: institution publishing/collecting/curating the resource.
- `language`: automatically identified language.
- `word_count`: number of space-delimited words.
- `token_count`: number of tokens as calculated by the official Pleias tokenizer.
- `text`: full text, without formatting.
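As a minimal sketch of how the fields above might be used together, the following totals `word_count` and `token_count` over a collection of records represented as plain dictionaries. The helper names `count_words` and `summarize` are hypothetical, and splitting on any whitespace is only an approximation of "space-delimited"; the exact rule used upstream is not stated in this card.

```python
from typing import Any, Dict, Iterable, Mapping


def count_words(text: str) -> int:
    """Approximate `word_count` as the number of whitespace-delimited chunks."""
    return len(text.split())


def summarize(records: Iterable[Mapping[str, Any]]) -> Dict[str, float]:
    """Total documents, words, and tokens across records.

    Falls back to counting words from `text` when `word_count` is missing.
    """
    docs = words = tokens = 0
    for rec in records:
        docs += 1
        wc = rec.get("word_count")
        if wc is None:
            wc = count_words(rec.get("text") or "")
        words += int(wc)
        tokens += int(rec.get("token_count") or 0)
    return {
        "documents": docs,
        "words": words,
        "tokens": tokens,
        # Guard against division by zero for empty inputs.
        "tokens_per_word": tokens / words if words else 0.0,
    }
```

The tokens-per-word ratio gives a quick sanity check on a sample: values far from what is typical for English text may point to non-prose content.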
## Dataset Creation
### Curation Rationale
With NoAH, we aim to begin addressing the growing need for high-quality, legally unambiguous text corpora in the NLP community. Many existing datasets are encumbered
by complex licensing terms, creating barriers for researchers and developers who require clear, attribution-free resources. By focusing exclusively on public domain works
and permissive licenses (CC0, MIT-0, and The Unlicense), NoAH eliminates legal uncertainty and simplifies compliance, enabling unrestricted use in training language models.
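The license-and-language filtering described here could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the label strings in `ALLOWED_LICENSES`, the `"English"` language value, and the helper names are all hypothetical, and the actual values stored in the `license` and `language` fields may be spelled differently.

```python
from typing import Any, Iterable, Iterator, Mapping, Set

# Hypothetical normalized labels; check the real `license` values before use.
ALLOWED_LICENSES: Set[str] = {"public domain", "cc0", "mit-0", "the unlicense"}


def normalize_label(label: str) -> str:
    """Lowercase a label and collapse surrounding or repeated whitespace."""
    return " ".join(label.lower().split())


def filter_records(
    records: Iterable[Mapping[str, Any]],
    allowed_licenses: Set[str] = ALLOWED_LICENSES,
    language: str = "English",  # assumed value of the `language` field
) -> Iterator[Mapping[str, Any]]:
    """Keep records whose license is in the allow-list and whose language matches."""
    for rec in records:
        lic = rec.get("license")
        lang = rec.get("language")
        if not isinstance(lic, str) or not isinstance(lang, str):
            continue  # skip records with missing metadata
        if normalize_label(lic) in allowed_licenses and lang == language:
            yield rec
```

An allow-list is used rather than a block-list so that unknown or unexpected license labels are excluded by default.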
### Source Data
Contains information from [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus) by PleIAs.
#### Personal and Sensitive Information
Beyond filtering by license and language type, we have not processed the data any further. For this reason, any personally identifiable information (PII) present in the
original dataset may still be present in this dataset.
## Bias, Risks, and Limitations
Beyond filtering by license and language type, we have not processed the data any further. For this reason, any bias present in the original dataset may still be present
in this dataset.
Provided by:
kwassl-ai



