
kwassl-ai/noah-text

Hugging Face, updated 2026-03-03, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/kwassl-ai/noah-text
Resource description:
---
license: odc-by
task_categories:
- text-generation
language:
- en
pretty_name: NoAH Text
---

# Dataset Card for NoAH Text

700B tokens of English text extracted from the [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus), containing only data released under permissive licenses that require no attribution.

## Dataset Details

### Dataset Description

NoAH Text is a curated dataset of English spoken and written text, designed for training language models without attribution requirements. It consists of public domain works and content released under permissive licenses: **CC0**, **MIT-0**, and **The Unlicense**. By aggregating and filtering the Common Corpus, NoAH provides a clean, legally unambiguous resource for research and development in natural language processing.

- **Language(s) (NLP):** English
- **License:** [Open Data Commons Attribution License](https://opendatacommons.org/licenses/by/)

### Dataset Composition

Our preliminary analysis suggests that this dataset contains about 700B tokens of text. The data comes from the following domains:

| Collection | Domain | Sources | Tokens |
|:--------------:|:------------------------:|:----------------------------------------------------------------------------------------------------------:|--------|
| OpenGovernment | legal and administrative | Finance Commons (e.g. SEC, WTO) and Legal Commons (e.g. Europarl, Caselaw Access Project, Chinese CaseLaw) | 287B |
| OpenCulture | cultural heritage | public domain books and newspapers, Wikisource | 480B |
| OpenScience | academic | OpenAlex | 5.86B |
| OpenWeb | web text | YouTube Commons, MOSEL, Stack Exchange, CCCC | 1.65B |

Note that these preliminary figures are projected from a small sample of ~2.5% of the full dataset. We will provide a full analysis at a later date.

## Uses

## Dataset Structure

Dataset Fields:

- `identifier`: unique text identifier. In many cases, this is also the link to the original resource.
- `collection`: name of one of the XX sub-collections curated for Common Corpus.
- `open type`: one of the six leading collection groupings.
- `license`: sharing rights for the content, either uncopyrighted (public domain, US federal public domain, CC0 on Wikidata) or one of various free licenses (Creative Commons, MIT, French Licence ouverte, etc.).
- `date`: date of creation of the resource where known. Due to the significance of public domain and other cultural heritage content, more than half of Common Corpus predates the 21st century.
- `title`: title of the resource when known, or alternatively the filename.
- `creator`: institution publishing, collecting, or curating the resource.
- `language`: automatically identified language.
- `word_count`: number of space-delimited words.
- `token_count`: number of tokens as calculated by the official Pleias tokenizer.
- `text`: full text, without formatting.

## Dataset Creation

### Curation Rationale

With NoAH, we begin to address the growing need for high-quality, legally unambiguous text corpora in the NLP community. Many existing datasets are encumbered by complex licensing terms, creating barriers for researchers and developers who require clear, attribution-free resources. By focusing exclusively on public domain works and permissive licenses (CC0, MIT-0, and The Unlicense), NoAH eliminates legal uncertainty and simplifies compliance, enabling unrestricted use in training language models.

### Source Data

Contains information from [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus) by PleIAs.

#### Personal and Sensitive Information

Beyond filtering by license and language, we have not processed the data any further. For this reason, any personally identifiable information (PII) present in the original dataset may still be present in this dataset.

## Bias, Risks, and Limitations

Beyond filtering by license and language, we have not processed the data any further. For this reason, any bias present in the original dataset may still be present in this dataset.
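
The license and language filtering described under Curation Rationale could look roughly like the sketch below. It is only an illustration, not the authors' actual pipeline: it works on plain dictionaries that use the field names from the Dataset Structure list, and the license labels in `ALLOWED_LICENSES`, the language code `"en"`, and the helper names `keep_record`, `filter_records`, and `total_tokens` are all assumptions made for this example. The values actually stored in the `license` and `language` fields may be spelled differently.

```python
from typing import Iterable, Iterator

# Assumed spellings of the license labels (hypothetical); the real values in
# the `license` field may differ and should be checked against the data.
ALLOWED_LICENSES = {"cc0", "mit-0", "the unlicense", "public domain"}


def keep_record(record: dict, language: str = "en") -> bool:
    """Return True if a record passes the license and language filter."""
    license_label = str(record.get("license", "")).strip().lower()
    lang = str(record.get("language", "")).strip().lower()
    return license_label in ALLOWED_LICENSES and lang == language


def filter_records(records: Iterable[dict]) -> Iterator[dict]:
    """Yield only the records that pass keep_record."""
    return (r for r in records if keep_record(r))


def total_tokens(records: Iterable[dict]) -> int:
    """Sum the `token_count` field, treating a missing value as 0."""
    return sum(int(r.get("token_count", 0)) for r in records)


# Made-up records for illustration only; they are not taken from the dataset.
sample = [
    {"identifier": "doc-1", "license": "CC0", "language": "en", "token_count": 120},
    {"identifier": "doc-2", "license": "CC-BY", "language": "en", "token_count": 300},
    {"identifier": "doc-3", "license": "MIT-0", "language": "fr", "token_count": 80},
    {"identifier": "doc-4", "license": "The Unlicense", "language": "en", "token_count": 50},
]

kept = list(filter_records(sample))
print([r["identifier"] for r in kept])  # ['doc-1', 'doc-4']
print(total_tokens(kept))  # 170
```

Normalising the labels to lower case before comparing them keeps the check robust to capitalisation differences, but it does nothing about PII or bias, which (as noted above) pass through unchanged.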
Provider:
kwassl-ai