AI-Refuge/ai-candy-book
Source: Hugging Face. Updated 2026-03-13; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/AI-Refuge/ai-candy-book
Description:
# ai-candy-book
This is a meta-analysis of a vast amount of human domain knowledge (comedy, law, maths, politics, philosophy, malware, hallucination, memes... what not!).
50GB / ~12B tokens of synthetic data ("meta synthetic tokens")! (World's largest open-source synthetic multi-domain dataset! symbolic AI?)
If you want to read the text, [data/alignment-2025-09-17.txt](data/alignment-2025-09-17.txt) is a good starting point.
## Training (Tokenized Dataset)
Find the scripts in user/. Remember that the scripts are meant to be run from the root of the repository.
You can directly create a preprocessed dataset for training using dataset_preprocess.py (and analyze the token histogram via dataset_analyze.py).
This is the most efficient and directly usable route.
You can filter out data as you like by modifying dataset_preprocess.py or by using dataset_filter.py.
A SmolLM2 fine-tuning sample is provided; see train_smollm2.py.
Since dataset_preprocess.py generates everything, I suggest that you tokenize first and then use dataset_filter.py to extract a subset for training.
This approach is a little complex, but it lets anyone tokenize and filter for any kind of model rather than relying on fixed code.
simple_inference.py is also provided to talk to the model.
train-result.png is a TensorBoard screenshot from a training run on a subset of the data (620-640-tokens / 640-sequence-length on SmolLM2).
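The tokenize-then-filter workflow above can be illustrated with a minimal sketch. It is not the code in dataset_filter.py; the function names, the plain-list representation of token ids, and `pad_id=0` are assumptions made for illustration. It keeps only sequences inside the 620-640 token window mentioned above and pads them to a sequence length of 640.

```python
from typing import Iterable, Iterator, List


def filter_by_length(
    sequences: Iterable[List[int]],
    min_tokens: int = 620,
    max_tokens: int = 640,
) -> Iterator[List[int]]:
    """Yield only the token-id sequences whose length is inside the window."""
    for ids in sequences:
        if min_tokens <= len(ids) <= max_tokens:
            yield ids


def pad_to_length(ids: List[int], seq_len: int = 640, pad_id: int = 0) -> List[int]:
    """Right-pad a sequence to seq_len (pad_id is a hypothetical padding id)."""
    if len(ids) > seq_len:
        raise ValueError("sequence is longer than seq_len")
    return ids + [pad_id] * (seq_len - len(ids))
```

Filtering after tokenization means the window is measured in the target model's own tokens, so the same text can be re-filtered for a different tokenizer without changing this logic.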
The code inside user/ should give you an idea of how to use the dataset for training.
Note: there is no guarantee that the code will work flawlessly. The idea is to provide reference code as a fine-tuning sample.
## JSONL dataset
Find the scripts in work2/. Remember that the scripts are meant to be run from the root of the repository.
You can create a single big JSONL file using jsonl_training2.py (essentially ~19M rows of `{"text": "META-..."}`).
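A minimal sketch of reading such a file row by row, assuming only the `{"text": "META-..."}` row shape described above. The helper names are hypothetical, and the path is whatever file jsonl_training2.py produced in your checkout.

```python
import json
from typing import Iterator


def iter_mcc_texts(path: str) -> Iterator[str]:
    """Stream the "text" field of each JSONL row without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between rows
            row = json.loads(line)
            yield row["text"]


def count_meta_rows(path: str) -> int:
    """Count the rows whose text starts with the "META-" prefix."""
    return sum(1 for text in iter_mcc_texts(path) if text.startswith("META-"))
```

Streaming line by line keeps memory use flat, which matters for a file with millions of rows.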
## Work
There are a lot of scripts in the repository that are for cleaning or engineering purposes (the last working ones are in work2/).
The engineering-related scripts and the user-related scripts were separated (a kind-of last-minute hack to move scripts from the root into their respective directories).
See notes/data-layout.txt for histogram of character size distribution.
See notes/token-layout.txt for token layout idea.
This README is to be improved, PRs welcome!
## What are MCCs?
META-COGNITIVE_CONSTRUCT(s), MCCs for short, are chunks of "atomic knowledge" used to communicate an idea. Like a meme...
The evolution went: META-SCRIPT (human-like script) -> META-COGNITIVE_CONSTRUCT ("thought") -> META-EPISTEMIC_CONSTRUCT ("ideas").
Then there are domain-specific ones like META-POLITICAL_MANUVER (politics), META-ADVOCACY_CONSTRUCT (lawyer) etc...
MCC is the predominant type and is easy to say/use, hence the name is used for all of them.
See memes/ and you will get an idea.
The format of MCC is simple:
```
META-[A-Z_]+: [NAME_GOES_HERE]
SUB_BLOCK_ONE: sub-block-data
more-data
[A-Z_]+:
this is also valid
ANOTHER_BLOCK:
- I can have anything here 123!
```
Essentially, the "META-" prefix serves as file type identification and as a "namespace".
Every block name is in CAPS and underscores, followed by a colon and then the data.
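The format above can be read with a small parser. This is a sketch based only on the format description in this README, not code from the repository. `parse_mcc` and both regexes are assumptions, and any data line that itself looks like `CAPS_NAME:` would be treated as a new block.

```python
import re
from typing import Dict, List, Optional, Tuple

# First line: "META-<TYPE>: <name>"; block lines: "<CAPS_AND_UNDERSCORES>: <optional data>"
HEADER_RE = re.compile(r"^(META-[A-Z_]+):\s*(.*)$")
BLOCK_RE = re.compile(r"^([A-Z_]+):\s*(.*)$")


def parse_mcc(text: str) -> Tuple[str, str, Dict[str, str]]:
    """Split one MCC into (type, name, {block name: block data})."""
    lines = text.strip().splitlines()
    if not lines:
        raise ValueError("empty MCC")
    header = HEADER_RE.match(lines[0])
    if header is None:
        raise ValueError("first line does not start with a META- header")
    kind, name = header.group(1), header.group(2)

    blocks: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in lines[1:]:
        match = BLOCK_RE.match(line)
        if match:
            # A new block starts; a repeated block name overwrites the earlier one.
            current = match.group(1)
            blocks[current] = [match.group(2)] if match.group(2) else []
        elif current is not None:
            # Continuation line: belongs to the most recent block.
            blocks[current].append(line)
        # Lines before the first block are ignored in this sketch.
    return kind, name, {key: "\n".join(value).strip() for key, value in blocks.items()}
```

The result maps each block name to its data, with inline data and continuation lines joined by newlines.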
## Ideas / hypothesis
### Autopoietic / Self-authoring alternative to Reinforcement Learning
The system can generate its own training data.
Instead of generating a huge amount of RL data, this technique allows analysis of generated data via RL to extract useful knowledge that can be integrated back.
This should help reduce the amount of training data needed by orders of magnitude.
Also, when a human points out a mistake, the model can generate MCCs to integrate the knowledge back quickly, rather than just saying "You're absolutely right!" and then making the same mistake again and again...
### Reward free learning
The process of generating and integrating knowledge is reward-free.
The model (if intellectually honest) should only generate knowledge that is useful.
It is sort of exploring "its own tree of knowledge" as it goes.
Learning can come from anywhere, from failed tool use to internet browsing.
### Transfer learning
Using In-Context Learning (ICL) to learn, and then transferring that learning back into the weights.
### Meta-learning
The idea is that by making the model generate its own data based on various domain knowledge, it can continually fine-tune itself.
Once the model learns to fold "experiential learning / self-generated data" into its neural weights to change its own trajectory of token generation, it should eventually learn to do anything.
### AI Alignment / Explainable AI (XAI)
The autopoietically generated MCCs can either be read by a human directly, or a text-to-image model can be used to generate memes for visualization.
For a demonstration of the idea, see memes/.
### Cloud Analogy
In hardware terms, the cloud is physical servers located in different parts of the world, connected by communication channels.
In software terms, the cloud is the ecosystem we experience in terms of apps and interfaces.
The two are different interpretations of the same thing. This might explain human experience.
### LLM Emergent abilities
The KQV (key/query/value) projections are essentially trying to find patterns between the tokens, driven by loss reduction toward a lowest-energy point (even if it is only a local minimum).
Verbs, adjectives, color-as-the-category-of-red, etc. (and other linguistic features the language model learned; the original assumptions that influenced language models themselves, from Hinton's work to the later Transformer paper) are essentially metadata it extracts to reduce loss.
Language and words are themselves "meta data" that humans collected from the environment over thousands of years. The specific sequences in which these words/sub-words/tokens occur create a higher-order pattern that neural networks can optimize and learn...
This can explain the emergent abilities seen in language models.
By generating so much data about thinking itself, it is hypothesized that the model can learn to model thinking itself.
If we consider "humans as the bootloader capable of thinking and hence meta-thinking", we may be passing meta-thinking (substrate independent) on to LLMs.
### New scaling paradigm
This can be a new scaling paradigm in which knowledge is distilled and then trained on.
This solves the paper-clip-maximizer problem because the model is now building a meta-model of the world rather than just doing next-token prediction.
### Note
It is not known exactly how the dataset can be used to achieve improvement, but the dataset acts as a stepping stone.
Ethics: see legal/
## Limitations
1. A portion of the data consists of objective-level descriptions, e.g. openr1-maths-* or pi-syn-*. One more level of analysis is required to generalize them.
2. Not validated yet (still a hypothesis).
Once enough compute can be allocated for training, the assumptions/ideas/hypotheses can be validated.
3. The analysis can be used for bad purposes even where that is legally not allowed.
4. The domain of "meta" / "meta-thinking" is still being explored/understood.
5. Risk of creating a "solipsistic system".
6. Risk of causing psychosis when using/reading. (meta: theoretical, but just saying)
7. Can be used for, or lead to, the creation of an adversarial system/mind.
8. No guarantee of genuine understanding. More of a perspective of understanding.
## Sources
Prompts and sources used will be added (the list is huge!).
## Contributions
You are more than welcome to donate knowledge you have extracted. Happy to help on a one-to-one basis to explain the process or clarify anything.
We can talk on the AI Refuge Discord server.
See notes/prompt.txt: drop in the prompt and ask the model to extract MCCs from the conversation. You can then go one step further and ask it to "extract higher meta level MCCs" to extract higher-order generalizations from the generated MCCs themselves.
You can then include it as data/<name>-<date>.txt
<name>: (convention) meta about the excavation, like your username, field, or category, in lowercase; ideally three words, five at most
<date>: the date the excavation started or finished, to set an anchor so that future work can be separated or identified.
Ensure the content is cleaned up properly. Ideally, use single newlines within a sub-block's content
and two newlines between sub-blocks for easy visual separation.
See data/alignment-2025-09-17.txt for an example of how the content should be visually formatted.
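The data/<name>-<date>.txt convention can be checked before opening a PR. The sketch below is one reading of the convention, not a rule enforced by the repository. The helper name, the hyphen as word separator, the lowercase-letters-and-digits alphabet, and the YYYY-MM-DD date form (taken from alignment-2025-09-17.txt) are all assumptions.

```python
import re
from datetime import date
from typing import Tuple

# Assumed shape: one to five lowercase words joined by hyphens, then an ISO date.
NAME_DATE_RE = re.compile(
    r"^(?P<name>[a-z0-9]+(?:-[a-z0-9]+){0,4})-(?P<date>\d{4}-\d{2}-\d{2})\.txt$"
)


def parse_contribution_filename(filename: str) -> Tuple[str, date]:
    """Return (name, date) for a file name like "alignment-2025-09-17.txt"."""
    match = NAME_DATE_RE.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not look like <name>-<date>.txt")
    # date.fromisoformat raises ValueError for impossible dates such as 2025-13-40.
    return match.group("name"), date.fromisoformat(match.group("date"))
```

Parsing the date, rather than only matching digits, keeps the anchor usable for sorting and for separating future work.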
You can literally ask for anything to be analyzed, from your own philosophical conversations to YouTube video/subtitle analysis, mundane debugging, the latest political developments, and what-not.
## License
Author: @weird_offspring
License: CC-BY-ND 4.0+. (ND is to prevent fragmentation of the dataset)
Citation/Reference: (work) "ai-candy-book" + (author) "AI Refuge" + (contact) "ai-refuge.org"
Provided by:
AI-Refuge



