From cd7bfb66dc11c74817172a4607e4509a4487f343 Mon Sep 17 00:00:00 2001 From: Richard Kuo Date: Tue, 19 Dec 2023 21:16:46 +0800 Subject: [PATCH] Add files via upload --- _posts/2023-12-11-LLM.md | 422 +++++++++++++ _posts/2023-12-12-Reinforcement-Learning.md | 664 ++++++++++++++++++++ 2 files changed, 1086 insertions(+) create mode 100644 _posts/2023-12-11-LLM.md create mode 100644 _posts/2023-12-12-Reinforcement-Learning.md diff --git a/_posts/2023-12-11-LLM.md b/_posts/2023-12-11-LLM.md new file mode 100644 index 00000000..33b7e949 --- /dev/null +++ b/_posts/2023-12-11-LLM.md @@ -0,0 +1,422 @@ +--- +layout: post +title: Large Language Models +author: [Richard Kuo] +category: [Lecture] +tags: [jekyll, ai] +--- + +Introduction to Language Models, LLMs, Algorithms for building LLMs, etc. + +--- +## History of LLM +[A Survey of Large Language Models](https://www.semanticscholar.org/paper/A-Survey-of-Large-Language-Models-Zhao-Zhou/c61d54644e9aedcfc756e5d6fe4cc8b78c87755d)
Since the introduction of the Transformer architecture in 2017, large language models (LLMs) have evolved rapidly.<br>
ChatGPT drew about 1.6 billion visits in May 2023, and in July 2023 Meta released three versions of LLaMA-2 (7B, 13B, and 70B parameters) that are free for commercial use.<br>
+ +--- +### 從解題能力來看四代語言模型的演進 +An evolution process of the four generations of language models (LM) from the perspective of task solving capacity.
+![](https://d3i71xaburhd42.cloudfront.net/c61d54644e9aedcfc756e5d6fe4cc8b78c87755d/2-Figure2-1.png) + +--- +### 大型語言模型統計表 +![](https://d3i71xaburhd42.cloudfront.net/c61d54644e9aedcfc756e5d6fe4cc8b78c87755d/8-Table1-1.png) + +--- +### 近年大型語言模型(>10B)的時間軸 +![](https://d3i71xaburhd42.cloudfront.net/c61d54644e9aedcfc756e5d6fe4cc8b78c87755d/9-Figure3-1.png) + +--- +### 大型語言模型之產業分類 +![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2e8bfd65-5272-4cf1-8b86-954bab975bab_2400x1350.png) + +--- +### 大型語言模型之技術分類 +![](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*vZK250i8PIWid6BiaZ1QCA.png) + +--- +### 計算記憶體的成長與Transformer大小的關係 +[AI and Memory Wall](https://medium.com/riselab/ai-and-memory-wall-2cb4265cb0b8)
+![](https://miro.medium.com/v2/resize:fit:4800/format:webp/0*U-7GJqBZ2tY1W5Iu) + +--- +### LLMops 針對生成式 AI 用例調整了 MLops 技術堆疊 +![](https://www.insightpartners.com/wp-content/uploads/2023/10/llmops-market-map-1.png) + +--- +## Transformer +**Paper:** [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
+**Code:** [huggingface/transformers](https://github.com/huggingface/transformers)
+![](https://miro.medium.com/max/407/1*3pxDWM3c1R_WSW7hVKoaRA.png) + + + + + +
+ +--- +### New Understanding about Transformer +**Blog:**
+* [Researchers Gain New Understanding From Simple AI](https://www.quantamagazine.org/researchers-glimpse-how-ai-gets-so-good-at-language-processing-20220414/) +* [Transformer稱霸的原因找到了?OpenAI前核心員工揭開注意力頭協同工作機理](https://bangqu.com/A76oX7.html) + +**Papers:**
+* [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html) +* [In-context Learning and Induction Heads](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html) + +--- +### BERT +**Paper:** [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
+**Blog:** [進擊的BERT:NLP 界的巨人之力與遷移學習](https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html)
+ +--- +### GPT (Generative Pre-Training Transformer) +**Paper:** [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
+**Paper:** [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
+**Code:** [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
+ + +--- +### GPT-2 +**Paper:** [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)
+**Code:** [openai/gpt-2](https://github.com/openai/gpt-2)
+**GPT2 Demo:** [Transformer Demo](https://app.inferkit.com/demo), [GPT-2 small](https://minimaxir.com/apps/gpt2-small/)
+**Blog:** [直觀理解GPT2語言模型並生成金庸武俠小說](https://leemeng.tw/gpt2-language-model-generate-chinese-jing-yong-novels.html)
+ +--- +### T5: Text-To-Text Transfer Transformer (by Google) +**Paper:** [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)
+**Code:** [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer)
+![](https://1.bp.blogspot.com/-89OY3FjN0N0/XlQl4PEYGsI/AAAAAAAAFW4/knj8HFuo48cUFlwCHuU5feQ7yxfsewcAwCLcBGAsYHQ/s640/image2.png) + +--- +### GPT-3 +**Code:** [openai/gpt-3](https://github.com/openai/gpt-3)
+**[GPT-3 Demo](https://gpt3demo.com/)**
+![](https://dzlab.github.io/assets/2020/07/20200725-gpt3-model-architecture.png) + +--- +### [CKIP Lab 繁體中文詞庫小組](https://ckip.iis.sinica.edu.tw/) +CKIP (CHINESE KNOWLEDGE AND INFORMATION PROCESSING): 繁體中文的 transformers 模型(包含 ALBERT、BERT、GPT2)及自然語言處理工具。
+[CKIP Lab 下載軟體與資源](https://ckip.iis.sinica.edu.tw/resource)
+* [CKIP Transformers](https://github.com/ckiplab/ckip-transformers) +* [CKIP Tagger](https://github.com/ckiplab/ckiptagger)
+ +--- +## Question Answering +### [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) - The Stanford Question Answering Dataset
+**Paper:** [Know What You Don't Know: Unanswerable Questions for SQuAD](https://arxiv.org/abs/1806.03822)
+

+ +--- +### Instruct GPT +**Paper:** [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
+**Blog:** [Aligning Language Models to Follow Instructions](https://openai.com/blog/instruction-following/)
+ +--- +### ChatGPT +[ChatGPT: Optimizing Language Models for Dialogue](https://openai.com/blog/chatgpt/)
+ChatGPT is fine-tuned from a model in the GPT-3.5 series, which finished training in early 2022.
+ +![](https://cdn.openai.com/chatgpt/draft-20221129c/ChatGPT_Diagram.svg) + + + +--- +### [LLaMA](https://huggingface.co/docs/transformers/main/model_doc/llama) +*It is a collection of foundation language models ranging from 7B to 65B parameters.*
+**Paper:** [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
+![](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*nt-ydHhSVsaLXq_HZRaLQA.png) + +--- +### [OpenLLaMA](https://github.com/openlm-research/open_llama) +**model:** [https://huggingface.co/openlm-research/open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2)
+**Kaggle:** [https://www.kaggle.com/code/rkuo2000/llm-openllama](https://www.kaggle.com/code/rkuo2000/llm-openllama)
+ +--- +**Blog:** [Building a Million-Parameter LLM from Scratch Using Python](https://levelup.gitconnected.com/building-a-million-parameter-llm-from-scratch-using-python-f612398f06c2)
+**Kaggle:** [LLM LLaMA from scratch](https://www.kaggle.com/rkuo2000/llm-llama-from-scratch/)
+ +--- +### Pythia +**Paper:** [Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling](https://arxiv.org/abs/2304.01373)
+[Datasheet for the Pile](https://arxiv.org/abs/2201.07311)
+**Code:** [Pythia: Interpreting Transformers Across Time and Scale](https://github.com/EleutherAI/pythia)
+ +--- +### Falcon-40B +**Paper:** [The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only](https://arxiv.org/abs/2306.01116)
+**Code:** [https://huggingface.co/tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b)
+ +--- +### LLaMA-2 +**Paper:** [Llama 2: Open Foundation and Fine-Tuned Chat Models](https://arxiv.org/abs/2307.09288)
+**Code:** [https://github.com/facebookresearch/llama](https://github.com/facebookresearch/llama)
+**models:** [https://huggingface.co/meta-llama](https://huggingface.co/meta-llama)
+ +--- +### GPT4 +**Paper:** [GPT-4 Technical Report](https://arxiv.org/abs/2303.08774)
+![](https://image-cdn.learnin.tw/bnextmedia/image/album/2023-03/img-1679884936-23656.png?w=1200&output=webp) + +--- +### MiniGPT-4 +**Paper:** [MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models](https://arxiv.org/abs/2304.10592)
+**Paper:** [MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning](https://arxiv.org/abs/2310.09478)
+**Code:** [https://github.com/Vision-CAIR/MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4)
+ +![](https://github.com/Vision-CAIR/MiniGPT-4/raw/main/figs/minigpt2_demo.png) +![](https://github.com/Vision-CAIR/MiniGPT-4/raw/main/figs/online_demo.png) + +--- +### LLM Lingua +**Paper: [LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models](https://arxiv.org/abs/2310.05736)
+**Code: [https://github.com/microsoft/LLMLingua](https://github.com/microsoft/LLMLingua)
+**Kaggle:** [https://www.kaggle.com/code/rkuo2000/llm-lingua](https://www.kaggle.com/code/rkuo2000/llm-lingua)
+![](https://github.com/microsoft/LLMLingua/raw/main/images/LLMLingua.png) + +--- +### Mistral Transformer +**Paper:** [Mistral 7B](https://arxiv.org/abs/2310.06825)
+**Code:** [https://github.com/mistralai/mistral-src](https://github.com/mistralai/mistral-src)
+**Kaggle:** [https://www.kaggle.com/code/rkuo2000/llm-mistral-7b-instruct](https://www.kaggle.com/code/rkuo2000/llm-mistral-7b-instruct)
+ +--- +### Zephyr +**Paper:** [Zephyr: Direct Distillation of LM Alignment](https://arxiv.org/abs/2310.16944)
+**Code:** [https://huggingface.co/HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)
+**Kaggle:** [https://www.kaggle.com/code/rkuo2000/llm-zephyr-7b](https://www.kaggle.com/code/rkuo2000/llm-zephyr-7b)
+![](https://i3.res.bangqu.com/farm/liang/news/2023/10/28/9e3a1a498f94b147fd57608b4beaefe0.jpg) + +--- +### SOLAR-10.7B ~ Depth Upscaling +**Code:** [https://huggingface.co/upstage/SOLAR-10.7B-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-v1.0)
+Depth-Upscaled SOLAR-10.7B has remarkable performance. It outperforms models with up to 30B parameters, even surpassing the recent Mixtral 8X7B model.
+Leveraging state-of-the-art instruction fine-tuning methods, including supervised fine-tuning (SFT) and direct preference optimization (DPO), +researchers utilized a diverse set of datasets for training. This fine-tuned model, SOLAR-10.7B-Instruct-v1.0, achieves a remarkable Model H6 score of 74.20, +boasting its effectiveness in single-turn dialogue scenarios.
+ +--- +### Phi-2 (Transformer with 2.7B parameters) +**Blog:** [Phi-2: The surprising power of small language models](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/)
+**Code:** [https://huggingface.co/microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
+**Kaggle:** [https://www.kaggle.com/code/rkuo2000/llm-phi-2](https://www.kaggle.com/code/rkuo2000/llm-phi-2)
+ +--- +### FlagEmbedding +**Paper:** [Retrieve Anything To Augment Large Language Models](https://arxiv.org/abs/2310.07554)
+**Code:** [https://github.com/FlagOpen/FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)
+**Kaggle:** [https://www.kaggle.com/code/rkuo2000/llm-flagembedding](https://www.kaggle.com/code/rkuo2000/llm-flagembedding)
+![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4e4265-7dab-4c5d-b14f-5dfd1b270e75_746x735.png) + +--- +### LM-Cocktail +**Paper:** [LM-Cocktail: Resilient Tuning of Language Models via Model Merging](https://arxiv.org/abs/2311.13534)
+**Code:** [https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail)
+ +--- +### LongLoRA +**Code:** [https://github.com/dvlab-research/LongLoRA](https://github.com/dvlab-research/LongLoRA)
+[2023.11.19] We release a new version of LongAlpaca models, LongAlpaca-7B-16k, LongAlpaca-7B-16k, and LongAlpaca-7B-16k.
+![](https://github.com/dvlab-research/LongLoRA/raw/main/imgs/LongAlpaca.png) + +--- +### Magicoder +**Paper:** [Magicoder: Source Code Is All You Need](https://arxiv.org/abs/2312.02120)
+**Kaggle:** [https://www.kaggle.com/code/rkuo2000/llm-magicoder](https://www.kaggle.com/code/rkuo2000/llm-magicoder)
+![](https://github.com/ise-uiuc/magicoder/raw/main/assets/overview.svg) + +--- +### [ALTER-LLM](https://tnoinkwms.github.io/ALTER-LLM/) +**Paper:** [From Text to Motion: Grounding GPT-4 in a Humanoid Robot "Alter3"](https://arxiv.org/abs/2312.06571)
+ +![](https://tnoinkwms.github.io/ALTER-LLM/architecture_2.png) +![](https://tnoinkwms.github.io/ALTER-LLM/feedback.png) + +--- +### EAGLE-LLM +**Blog:** [EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation](https://sites.google.com/view/eagle-llm)
+**Code:** [https://github.com/SafeAILab/EAGLE](https://github.com/SafeAILab/EAGLE)
+**Kaggle:** [https://www.kaggle.com/code/rkuo2000/eagle-llm](https://www.kaggle.com/code/rkuo2000/eagle-llm)
+ +--- +### Purple Llama CyberSecEval +**Paper:** [Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models](https://arxiv.org/abs/2312.04724)
+**Code:** [CybersecurityBenchmarks](https://github.com/facebookresearch/PurpleLlama/tree/main/CybersecurityBenchmarks)
+[meta-llama/LlamaGuard-7b](https://huggingface.co/meta-llama/LlamaGuard-7b)
+ + + + + +
| Model | Our Test Set (Prompt) | OpenAI Mod | ToxicChat | Our Test Set (Response) |
|---|---|---|---|---|
| Llama-Guard | 0.945 | 0.847 | 0.626 | 0.953 |
| OpenAI API | 0.764 | 0.856 | 0.588 | 0.769 |
| Perspective API | 0.728 | 0.787 | 0.532 | 0.699 |
+ +--- +## Building LLM +[Patterns for Building LLM-based Systems & Products](https://eugeneyan.com/writing/llm-patterns/) +![](https://eugeneyan.com/assets/llm-patterns-og.png) + +### [Retrieval Augmented Generation (RAG)](https://arxiv.org/abs/2005.11401) to Add Knowledge +![](https://eugeneyan.com/assets/rag.jpg) + +--- +#### [Fusion-in-Decoder (FiD)](https://arxiv.org/abs/2007.01282) +![](https://eugeneyan.com/assets/fid.jpg) + +--- +#### [Retrieval-Enhanced Transformer (RETRO)](https://arxiv.org/abs/2112.04426) +![](https://eugeneyan.com/assets/retro.jpg) + +--- +#### [Internet-augmented LMs](https://arxiv.org/abs/2203.05115) +![](https://eugeneyan.com/assets/internet-llm.jpg) + +--- +#### [Overview of RAG for CodeT5+](https://arxiv.org/abs/2305.07922) +![](https://eugeneyan.com/assets/codet5.jpg) + +--- +#### [Hypothetical document embeddings (HyDE)](https://arxiv.org/abs/2212.10496) +![](https://eugeneyan.com/assets/hyde.jpg) + +--- +### Fine-tuning : To get better at specific tasks + +#### [ULMFit](https://arxiv.org/abs/1801.06146) +![](https://eugeneyan.com/assets/ulmfit.jpg) + +--- +#### [Bidirectional Encoder Representations from Transformers (BERT; encoder only)](https://arxiv.org/abs/1810.04805) +![](https://eugeneyan.com/assets/bert.jpg) + +--- +#### [Generative Pre-trained Transformers (GPT; decoder only)](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) +![](https://eugeneyan.com/assets/gpt.jpg) + +--- +#### [Text-to-text Transfer Transformer (T5; encoder-decoder)](https://arxiv.org/abs/1910.10683) +![](https://eugeneyan.com/assets/t5.jpg) + +--- +#### [InstructGPT](https://arxiv.org/abs/2203.02155) +![](https://eugeneyan.com/assets/instructgpt.jpg) + +--- +#### [Soft prompt tuning](https://arxiv.org/abs/2104.08691) +**Paper:** [Soft-prompt Tuning for Large Language Models to Evaluate Bias](https://arxiv.org/abs/2306.04735)
+**Blog:** [Guiding Frozen Language Models with Learned Soft Prompts](https://blog.research.google/2022/02/guiding-frozen-language-models-with.html)
+![](https://blogger.googleusercontent.com/img/a/AVvXsEgWPnqNhC2ZtEjkumYCtNi18nHLQY9U5dmV13cJzQzscVhcHYhLdpTdTv-1ZI3IaOVfWE9x7y4g75jtyImEaI7dsonfD43S24flWsevDgEdbA0oR5w6fJsnFecnKGysSguLKJKEQ5svS-aQn_ClNZm6jURazpAxFNWTQoTm708a4hFq8f2HzMVpz3wZ_g=w640-h360) +![](https://blogger.googleusercontent.com/img/a/AVvXsEgNi-pteVLIEZ6H5HdV8RadrzCkegKA3zJCM2ObwTHKKYhgF7b-c7qsN85P1j4nXcqHcIDTj2dU5KfslYU4PuIFXaDpF6o_e5jMfFWljd6Kpc0E1n-UG6LtMA5B_BIAKjWTUibhwCnQ2zWap9BiZgA-VB0bxQG-S1jMcUHZ01kl0uLIKIoqKYH8QtUiYA=s693) + +--- +#### [prefix tuning](https://arxiv.org/abs/2101.00190) +![](https://eugeneyan.com/assets/prefix.jpg) + +--- +#### [adapter](https://arxiv.org/abs/1902.00751) +![](https://eugeneyan.com/assets/adapter.jpg) + +--- +#### [Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685) +![](https://eugeneyan.com/assets/lora.jpg) + +--- +#### [QLoRA](https://arxiv.org/abs/2305.14314) +![](https://eugeneyan.com/assets/qlora.jpg) + +--- +### Caching: To reduce latency and cost + +#### [GPTCache](https://github.com/zilliztech/GPTCache) +![](https://eugeneyan.com/assets/gptcache.jpg) + +--- +### LLM Kaggle-examples: +[https://www.kaggle.com/code/rkuo2000/llm-chromadb-langchain](https://www.kaggle.com/code/rkuo2000/llm-chromadb-langchain)
+[https://www.kaggle.com/code/rkuo2000/llm-finetuning](https://www.kaggle.com/code/rkuo2000/llm-finetuning/)
+[https://www.kaggle.com/code/rkuo2000/llama2-7b-hf-finetune](https://www.kaggle.com/code/rkuo2000/llama2-7b-hf-finetune)
+[https://www.kaggle.com/code/rkuo2000/llama2-qlora](https://www.kaggle.com/code/rkuo2000/llama2-qlora)
+ +--- +### [Open-LLMs](https://github.com/eugeneyan/open-llms) +Open LLMs
+Open LLM for Coder
+ +--- +## LLM Coders + +### AlphaCode +**Paper:** [Competition-Level Code Generation with AlphaCode](https://arxiv.org/pdf/2203.07814.pdf)
+![](https://victordibia.com/static/alphacode-2292e53c73500c1103f2f1fccec3f33d.png) + +--- +### AlphaCode 2 +**Report:** [AlphaCode 2 Technical Report](https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf)
+![](https://cdn.bulldogjob.com/system/photos/files/000/013/124/original/AlphaCode_2_overview.png) + +--- +### StarCoder +**Paper:** [StarCoder: may the source be with you!](https://arxiv.org/abs/2305.06161)
+The StarCoder models are 15.5B parameter models trained on **80+** programming languages from The Stack (v1.2), with opt-out requests excluded. The model uses Multi Query Attention, a context window of 8192 tokens, and was trained using the Fill-in-the-Middle objective on 1 trillion tokens.
+ +--- +### StarChat-Alpha +**Blog:** [Creating a Coding Assistant with StarCoder](https://huggingface.co/blog/starchat-alpha)
+ +--- +### DeciCoder +**Blog:** [Introducing DeciCoder: The New Gold Standard in Efficient and Accurate Code Generation](https://deci.ai/blog/decicoder-efficient-and-accurate-code-generation-llm/)
+ +--- +### CodeGen2.5 +**Blog:** [CodeGen2.5: Small, but mighty](https://blog.salesforceairesearch.com/codegen25/)
+**Paper:** [CodeGen2: Lessons for Training LLMs on Programming and Natural Languages](https://arxiv.org/abs/2305.02309)
+**Code:** [https://github.com/salesforce/CodeGen/tree/main/codegen25](https://github.com/salesforce/CodeGen/tree/main/codegen25)
+ +--- +### Code Llama +**Paper:** [Code Llama: Open Foundation Models for Code](https://arxiv.org/abs/2308.12950)
+![](https://miro.medium.com/v2/resize:fit:4800/format:webp/1*0wXBmrJYzHnTvIupJL_TeQ.png) +**Kaggle:** [https://www.kaggle.com/rkuo2000/llm-code-llama](https://www.kaggle.com/rkuo2000/llm-code-llama)
+ +--- +## Thoughts + +### Tree of Thoughts +**Paper:** [Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601)
+**Code:** [https://github.com/princeton-nlp/tree-of-thought-llm](https://github.com/princeton-nlp/tree-of-thought-llm)
+![](https://github.com/princeton-nlp/tree-of-thought-llm/blob/master/pics/teaser.png?raw=true) + +--- +### XoT +**Paper:** [Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation](https://arxiv.org/abs/2311.04254)
+![](https://miro.medium.com/v2/resize:fit:720/format:webp/0*r_a44DuxG3D8DGZO.png) + +--- +### FunSearch +[DeepMind發展用LLM解困難數學問題的方法](https://www.ithome.com.tw/news/160354)
+![](https://s4.itho.me/sites/default/files/styles/picture_size_large/public/field/image/2108_-_funsearch_making_new_discoveries_in_mathematical_sciences_using_lar_-_deepmind.google.jpg?itok=mAy4ydAE) + +--- +### BrainGPT +**Paper:** [DeWave: Discrete EEG Waves Encoding for Brain Dynamics to Text Translation](https://arxiv.org/abs/2309.14030)
+**Blog:** [New Mind-Reading "BrainGPT" Turns Thoughts Into Text On Screen](https://www.iflscience.com/new-mind-reading-braingpt-turns-thoughts-into-text-on-screen-72054)
+![](https://i3.res.bangqu.com/farm/liang/news/2023/12/18/339b9a2158e1fd28e1e39ee4b1557df2.jpg) +![](https://i3.res.bangqu.com/farm/liang/news/2023/12/18/79ca704627e4cadc1e23afc1b2f029cb.jpg) + + +--- +### [Run Llama 2 Locally in 7 Lines! (Apple Silicon Mac)](https://blog.lastmileai.dev/run-llama-2-locally-in-7-lines-apple-silicon-mac-c3f46143f327) +![](https://miro.medium.com/v2/resize:fit:4800/format:webp/1*81Zzsz8opkq8eBUbpRHlng.png) +On an `M2 Max MacBook Pro`, I was able to get 35–40 tokens per second using the LLAMA_METAL build flag.
+ +### [LLaMA-2-7B Benchmark](https://github.com/liltom-eth/llama2-webui/blob/main/docs/performance.md) + + +
+
+ +*This site was last updated {{ site.time | date: "%B %d, %Y" }}.* + diff --git a/_posts/2023-12-12-Reinforcement-Learning.md b/_posts/2023-12-12-Reinforcement-Learning.md new file mode 100644 index 00000000..4024af7f --- /dev/null +++ b/_posts/2023-12-12-Reinforcement-Learning.md @@ -0,0 +1,664 @@ +--- +layout: post +title: Reinforcement Learning +author: [Richard Kuo] +category: [Lecture] +tags: [jekyll, ai] +--- + +This introduction includes Policy Gradient, Taxonomy of RL Algorithms, OpenAI Gym, PyBullet, +AI in Games, Multi-Agent RL, Imitation Learning , Meta Learning, RL-Stock, Social Tranmission. + +--- +## Introduction of Reinforcement Learning +![](https://i.stack.imgur.com/eoeSq.png) +

+ +--- +### What is Reinforcement Learning ? +[概述增強式學習 (Reinforcement Learning, RL) (一) ](https://www.youtube.com/watch?v=XWukX-ayIrs)
+ + + + + +
+ +--- +### Policy Gradient +**Blog:** [DRL Lecture 1: Policy Gradient (Review)](https://hackmd.io/@shaoeChen/Bywb8YLKS/https%3A%2F%2Fhackmd.io%2F%40shaoeChen%2FHkH2hSKuS)
+ + +--- +### Actor-Critic + + + +--- +### Reward Shaping + + + +--- +## Algorithms + +### Taxonomy of RL Algorithms +**Blog:** [Kinds of RL Alogrithms](https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html)
+ +* **Value-based methods** : Deep Q Learning + - Where we learn a value function that will map each state action pair to a value. +* **Policy-based methods** : Reinforce with Policy Gradients + - where we directly optimize the policy without using a value function + - This is useful when the action space is continuous (連續) or stochastic (隨機) + - use total rewards of the episode +* **Hybrid methods** : Actor-Critic + - a Critic that measures how good the action taken is (value-based) + - an Actor that controls how our agent behaves (policy-based) +* **Model-based methods** : Partially-Observable Markov Decision Process (POMDP) + - State-transition models + - Observation-transition models + +--- +### List of RL Algorithms +1. **Q-Learning** + - [An Analysis of Temporal-Difference Learning with Function Approximation](http://web.mit.edu/jnt/www/Papers/J063-97-bvr-td.pdf) + - [Algorithms for Reinforcement Learning](https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf) + - [A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation](https://arxiv.org/abs/1806.02450) +2. **A2C** (Actor-Critic Algorithms): [Actor-Critic Algorithms](https://proceedings.neurips.cc/paper/1999/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf) +3. **DQN** (Deep Q-Networks): [1312.5602](https://arxiv.org/abs/1312.5602) +4. **TRPO** (Trust Region Policy Optimizaton): [1502.05477](https://arxiv.org/abs/1502.05477) +5. **DDPG** (Deep Deterministic Policy Gradient): [1509.02971](https://arxiv.org/abs/1509.02971) +6. **DDQN** (Deep Reinforcement Learning with Double Q-learning): [1509.06461](https://arxiv.org/abs/1509.06461) +7. **DD-Qnet** (Double Dueling Q Net): [1511.06581](https://arxiv.org/abs/1511.06581) +8. **A3C** (Asynchronous Advantage Actor-Critic): [1602.01783](https://arxiv.org/abs/1602.01783) +9. **ICM** (Intrinsic Curiosity Module): [1705.05363](https://arxiv.org/abs/1705.05363) +10. **I2A** (Imagination-Augmented Agents): [1707.06203](https://arxiv.org/abs/1707.06203) +11. **PPO** (Proximal Policy Optimization): [1707.06347](https://arxiv.org/abs/1707.06347) +12. **C51** (Categorical 51-Atom DQN): [1707.06887](https://arxiv.org/abs/1707.06887) +13. **HER** (Hindsight Experience Replay): [1707.01495](https://arxiv.org/abs/1707.01495) +14. **MBMF** (Model-Based RL with Model-Free Fine-Tuning): [1708.02596](https://arxiv.org/abs/1708.02596) +15. **Rainbow** (Combining Improvements in Deep Reinforcement Learning): [1710.02298](https://arxiv.org/abs/1710.02298) +16. **QR-DQN** (Quantile Regression DQN): [1710.10044](https://arxiv.org/abs/1710.10044) +17. **AlphaZero** : [1712.01815](https://arxiv.org/abs/1712.01815) +18. **SAC** (Soft Actor-Critic): [1801.01290](https://arxiv.org/abs/1801.01290) +19. **TD3** (Twin Delayed DDPG): [1802.09477](https://arxiv.org/abs/1802.09477) +20. **MBVE** (Model-Based Value Expansion): [1803.00101](https://arxiv.org/abs/1803.00101) +21. **World Models**: [1803.10122](https://arxiv.org/abs/1803.10122) +22. **IQN** (Implicit Quantile Networks for Distributional Reinforcement Learning): [1806.06923](https://arxiv.org/abs/1806.06923) +23. **SHER** (Soft Hindsight Experience Replay): [2002.02089](https://arxiv.org/abs/2002.02089) +24. **LAC** (Actor-Critic with Stability Guarantee): [2004.14288](https://arxiv.org/abs/2004.14288) +25. **AGAC** (Adversarially Guided Actor-Critic): [2102.04376](https://arxiv.org/abs/2102.04376) +26. 
**TATD3** (Twin actor twin delayed deep deterministic policy gradient learning for batch process control): [2102.13012](https://arxiv.org/abs/2102.13012) +27. **SACHER** (Soft Actor-Critic with Hindsight Experience Replay Approach): [2106.01016](https://arxiv.org/abs/2106.01016) +28. **MHER** (Model-based Hindsight Experience Replay): [2107.00306](https://arxiv.org/abs/2107.00306) + +--- +## Open Environments + +### [Best Benchmarks for Reinforcement Learning: The Ultimate List](https://neptune.ai/blog/best-benchmarks-for-reinforcement-learning) +* **AI Habitat** – Virtual embodiment; Photorealistic & efficient 3D simulator; +* **Behaviour Suite** – Test core RL capabilities; Fundamental research; Evaluate generalization; +* **DeepMind Control Suite** – Continuous control; Physics-based simulation; Creating environments; +* **DeepMind Lab** – 3D navigation; Puzzle-solving; +* **DeepMind Memory Task Suite** – Require memory; Evaluate generalization; +* **DeepMind Psychlab** – Require memory; Evaluate generalization; +* **Google Research Football** – Multi-task; Single-/Multi-agent; Creating environments; +* **Meta-World** – Meta-RL; Multi-task; +* **MineRL** – Imitation learning; Offline RL; 3D navigation; Puzzle-solving; +* **Multiagent emergence environments** – Multi-agent; Creating environments; Emergence behavior; +* **OpenAI Gym** – Continuous control; Physics-based simulation; Classic video games; RAM state as observations; +* **OpenAI Gym Retro** – Classic video games; RAM state as observations; +* **OpenSpiel** – Classic board games; Search and planning; Single-/Multi-agent; +* **Procgen Benchmark** – Evaluate generalization; Procedurally-generated; +* **PyBullet Gymperium** – Continuous control; Physics-based simulation; MuJoCo unpaid alternative; +* **Real-World Reinforcement Learning** – Continuous control; Physics-based simulation; Adversarial examples; +* **RLCard** – Classic card games; Search and planning; Single-/Multi-agent; +* **RL Unplugged** – Offline RL; Imitation learning; Datasets for the common benchmarks; +* **Screeps** – Compete with others; Sandbox; MMO for programmers; +* **Serpent.AI – Game Agent Framework** – Turn ANY video game into the RL env; +* **StarCraft II Learning Environment** – Rich action and observation spaces; Multi-agent; Multi-task; +* **The Unity Machine Learning Agents Toolkit (ML-Agents)** – Create environments; Curriculum learning; Single-/Multi-agent; Imitation learning; +* **WordCraft** -Test core capabilities; Commonsense knowledge; + +--- +### [OpenAI Gym](https://github.com/openai/gym) +[Reinforcement Learning 健身房](https://rkuo2000.github.io/AI-course/lecture/2023/12/14/Reinforcement-Learning.html)
+ +--- +### [Stable Baselines 3](https://github.com/DLR-RM/stable-baselines3) +RL Algorithms in PyTorch : **A2C, DDPG, DQN, HER, PPO, SAC, TD3**.
+**QR-DQN, TQC, Maskable PPO** are in [SB3 Contrib](https://github.com/Stable-Baselines-Team/stable-baselines3-contrib)
+**[SB3 examples](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html)**
+`pip install stable-baselines3`
+For Ubuntu: `pip install gym[atari]`
+For Win10 : `pip install --no-index -f ttps://github.com/Kojoley/atari-py/releases atari-py`
+Downloading and installing visual studio 2015-2019 x86 and x64 from [here](https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads)
+ +--- +### Q Learning +**Blog:** [A Hands-On Introduction to Deep Q-Learning using OpenAI Gym in Python](https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/)
+![](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/04/Screenshot-2019-04-16-at-5.46.01-PM-670x440.png) + +--- +**Blog:** [An introduction to Deep Q-Learning: let’s play Doom](https://www.freecodecamp.org/news/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8/)
+ +![](https://cdn-media-1.freecodecamp.org/images/1*js8r4Aq2ZZoiLK0mMp_ocg.png) +![](https://cdn-media-1.freecodecamp.org/images/1*LglEewHrVsuEGpBun8_KTg.png) + +--- +### DQN +**Paper:** [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602)
+![](https://www.researchgate.net/publication/338248378/figure/fig3/AS:842005408141312@1577761141285/This-is-DQN-framework-for-DRL-DNN-outputs-the-Q-values-corresponding-to-all-actions.jpg) + +**[PyTorch Tutorial](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)**
+**Gym Cartpole**: [dqn.py](https://github.com/rkuo2000/RL-Gym/blob/main/cartpole/dqn.py)
+![](https://pytorch.org/tutorials/_images/cartpole.gif) + +--- +### DQN RoboCar +**Blog:** [Deep Reinforcement Learning on ESP32](https://www.hackster.io/aslamahrahiman/deep-reinforcement-learning-on-esp32-843928)
+**Code:** [Policy-Gradient-Network-Arduino](https://github.com/aslamahrahman/Policy-Gradient-Network-Arduino)
+ + +--- +### DQN for MPPT control +**Paper:** [A Deep Reinforcement Learning-Based MPPT Control for PV Systems under Partial Shading Condition](https://www.researchgate.net/publication/341720872_A_Deep_Reinforcement_Learning-Based_MPPT_Control_for_PV_Systems_under_Partial_Shading_Condition)
+ +![](https://www.researchgate.net/publication/341720872/figure/fig1/AS:896345892196354@1590716922926/A-diagram-of-the-deep-Q-network-DQN-algorithm.ppm) + +--- +### DDQN +**Paper:** [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461)
+**Tutorial:** [Train a Mario-Playing RL Agent](https://pytorch.org/tutorials/intermediate/mario_rl_tutorial.html)
+**Code:** [MadMario](https://github.com/YuansongFeng/MadMario)
+![](https://pytorch.org/tutorials/_images/mario.gif) + +--- +### Duel DQN +**Paper:** [Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581)
+![](https://theaisummer.com/static/b0f4c8c3f3a5158b5899aa52575eaea0/95a07/DDQN.jpg) + +### Double Duel Q Net +**Code:** [mattbui/dd_qnet](https://github.com/mattbui/dd_qnet)
+ +![](https://github.com/mattbui/dd_qnet/blob/master/screenshots/running.gif?raw=true) + +--- +### A2C +**Paper:** [Actor-Critic Algorithms](https://proceedings.neurips.cc/paper/1999/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf)
+![](https://miro.medium.com/max/1400/0*g0jtX8lIdplzJ8oo.png) + - The **“Critic”** estimates the **value** function. This could be the action-value (the Q value) or state-value (the V value). + - The **“Actor”** updates the **policy** distribution in the direction suggested by the Critic (such as with policy gradients). + - A2C: Instead of having the critic to learn the Q values, we make him learn the Advantage values. + +--- +### A3C +**Paper:** [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783)
+**Blog:** [The idea behind Actor-Critics and how A2C and A3C improve them](https://towardsdatascience.com/the-idea-behind-actor-critics-and-how-a2c-and-a3c-improve-them-6dd7dfd0acb8)
+**Blog:** [李宏毅_ATDL_Lecture_23](https://hackmd.io/@shaoeChen/SkRbRFBvH#)
+ +![](https://miro.medium.com/max/770/0*OWRT4bcbfcansOwA.jpg) + +--- +### DDPG +**Paper:** [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971)
+**Blog:** [Deep Deterministic Policy Gradients Explained](https://towardsdatascience.com/deep-deterministic-policy-gradients-explained-2d94655a9b7b)
+**Blog:** [人工智慧-Deep Deterministic Policy Gradient (DDPG)](https://www.wpgdadatong.com/tw/blog/detail?BID=B2541)
+DDPG是在A2C中加入**經驗回放記憶體**,在訓練的過程中會持續的收集經驗,並且會設定一個buffer size,這個值代表要收集多少筆經驗,每當經驗庫滿了之後,每多一個經驗則最先收集到的經驗就會被丟棄,因此可以讓經驗庫一值保持滿的狀態並且避免無限制的收集資料造成電腦記憶體塞滿。
+學習的時候則是從這個經驗庫中隨機抽取成群(batch)經驗來訓練DDPG網路,周而復始的不斷進行學習最終網路就能達到收斂狀態,請參考下圖DDPG演算架構圖。
+![](https://edit.wpgdadawant.com/uploads/news_file/blog/2020/2976/tinymce/2020-12-27_18h15_54.png) +**Code:** [End to end motion planner using Deep Deterministic Policy Gradient (DDPG) in gazebo](https://github.com/m5823779/motion-planner-reinforcement-learning)
+

+ +--- +### [Intrinsic Curiosity Module (ICM)](https://pathak22.github.io/noreward-rl/) +**Paper:** [Curiosity-driven Exploration by Self-supervised Prediction](https://arxiv.org/abs/1705.05363)
+**Code:** [pathak22/noreward-rl](https://github.com/pathak22/noreward-rl)
+ + +--- +### PPO +**Paper:** [Proximal Policy Optimization](https://arxiv.org/abs/1707.06347)
+**On-policy vs Off-policy**
+On-Policy 方式是指用於學習的agent與觀察環境的agent是同一個,所以引數θ始終保持一致。**(邊做邊學)**
+Off-Policy方式是指用於學習的agent與用於觀察環境的agent不是同一個,他們的引數θ可能不一樣。**(在旁邊透過看別人做來學習)**
+比如下圍棋,On-Policy方式是agent親歷親為,而Off-Policy是一個agent看其他的agent下棋,然後去學習人家的東西。
+ +--- +### TRPO +**Paper:** [Trust Region Policy Optimization](https://arxiv.org/abs/1502.05477)
+**Blog:** [Trust Region Policy Optimization講解](https://www.twblogs.net/a/5d5ead97bd9eee541c32568c)
+TRPO 算法 (Trust Region Policy Optimization)和PPO 算法 (Proximal Policy Optimization)都屬於MM(Minorize-Maximizatio)算法。
+ +--- +### HER +**Paper:** [Hindsight Experience Replay](https://arxiv.org/abs/1707.01495)
+**Code:** [OpenAI HER](https://github.com/openai/baselines/tree/master/baselines/her)
+ + +--- +### MBMF +**Paper:** [Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning](https://arxiv.org/abs/1708.02596)
+ + +--- +### SAC +**Paper:** [Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor](https://arxiv.org/abs/1801.01290)
+![](https://miro.medium.com/max/974/0*NgZ_bq_nUOq73jK_.png) + +--- +### TD3 +**Paper:** [Addressing Function Approximation Error in Actor-Critic Methods](https://arxiv.org/abs/1802.09477)
+**Code:** [sfujim/TD3](https://github.com/sfujim/TD3)
+TD3 with RAMDP
+![](https://www.researchgate.net/publication/338605159/figure/fig2/AS:847596578947080@1579094180953/Structure-of-TD3-Twin-Delayed-Deep-Deterministic-Policy-Gradient-with-RAMDP.jpg) + +--- +### POMDP (Partially-Observable Markov Decision Process) +**Paper:** [Planning and acting in partially observable stochastic domains](https://people.csail.mit.edu/lpk/papers/aij98-pomdp.pdf)
+![](https://ars.els-cdn.com/content/image/1-s2.0-S2352154618301670-gr1_lrg.jpg) + +--- +### SHER +**Paper:** [Soft Hindsight Experience Replay](https://arxiv.org/abs/2002.02089) +![](https://d3i71xaburhd42.cloudfront.net/6253a0f146a36663e908509e14648f8e2a5ab581/5-Figure3-1.png) + +--- +### Exercises: [RL-gym](https://github.com/rkuo2000/RL-gym) +Downloading and installing visual studio 2015-2019 x86 and x64 from [here](https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads)
+ +``` +sudo apt-get install ffmpeg freeglut3-dev xvfb +pip install tensorflow +pip install pyglet==1.5.27 +pip install stable_baselines3[extra] +pip install gym[all] +pip install autorom[accept-rom-license] +git clone https://github.com/rkuo2000/RL-gym +cd RL-gym +cd cartpole +``` + +--- +#### ~/RL-gym/cartpole +`python3 random_action.py`
+`python3 q_learning.py`
+`python3 dqn.py`
+![](https://github.com/rkuo2000/RL-gym/blob/main/assets/CartPole.gif?raw=true) + +--- +#### ~/RL-gym/sb3/ +alogrithm = A2C, output = xxx.zip
+`python3 train.py LunarLander-v2 640000`
+`python3 enjoy.py LunarLander-v2`
+`python3 enjoy_gif.py LunarLander-v2`
+![](https://github.com/rkuo2000/RL-gym/blob/main/assets/LunarLander.gif?raw=true) + +--- +### Atari +env_name listed in Env_Name.txt
+you can train on [Kaggle](https://www.kaggle.com/code/rkuo2000/rl-sb3-atari), then download .zip to play on PC
+ +`python3 train_atari.py Pong-v0 1000000`
+`python3 enjoy_atari.py Pong-v0`
+`python3 enjoy_atari_gif.py Pong-v0`
+ + + + + + + + + + +
+ +--- +### [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) +**PyBulletEnv**
+`python enjoy.py --algo a2c --env AntBulletEnv-v0 --folder rl-trained-agents/ -n 5000`
+![](https://github.com/rkuo2000/AI-course/blob/main/images/RL_SB3_Zoo_AntBulletEnv.gif?raw=true) +`python enjoy.py --algo a2c --env HalfCheetahBulletEnv-v0 --folder rl-trained-agents/ -n 5000`
+![](https://github.com/rkuo2000/AI-course/blob/main/images/RL_SB3_Zoo_HalfCheetahBulletEnv.gif?raw=true) +`python enjoy.py --algo a2c --env HopperBulletEnv-v0 --folder rl-trained-agents/ -n 5000`
+![](https://github.com/rkuo2000/AI-course/blob/main/images/RL_SB3_Zoo_HopperBulletEnv.gif?raw=true) +`python enjoy.py --algo a2c --env Walker2DBulletEnv-v0 --folder rl-trained-agents/ -n 5000`
+![](https://github.com/rkuo2000/AI-course/blob/main/images/RL_SB3_Zoo_Walker2DBulletEnv.gif?raw=true) + +--- +### [Pybullet](https://pybullet.org) - Bullet Real-Time Physics Simulation + + + + + + + + +--- +### [PyBullet-Gym](https://github.com/benelot/pybullet-gym) +**Code:** [rkuo2000/pybullet-gym](https://github.com/rkuo2000/pybullet-gym)
+* installation +``` +pip install gym +pip install pybullet +pip install stable-baselines3 +git clone https://github.com/rkuo2000/pybullet-gym +export PYTHONPATH=$PATH:/home/yourname/pybullet-gym +``` + +#### gym +**Env names:** *Ant, Atlas, HalfCheetah, Hopper, Humanoid, HumanoidFlagrun, HumanoidFlagrunHarder, InvertedPendulum, InvertedDoublePendulum, InvertedPendulumSwingup, Reacher, Walker2D*
+ +**Blog:**
+[Creating OpenAI Gym Environments with PyBullet (Part 1)](https://gerardmaggiolino.medium.com/creating-openai-gym-environments-with-pybullet-part-1-13895a622b24)
+[Creating OpenAI Gym Environments with PyBullet (Part 2)](https://gerardmaggiolino.medium.com/creating-openai-gym-environments-with-pybullet-part-2-a1441b9a4d8e)
+![](https://media0.giphy.com/media/VI3OuvQShK3gzENiVz/giphy.gif?cid=790b761131bda06b74fcd9bb06c6a43939cf446edf403a68&rid=giphy.gif&ct=g) + +--- +### [OpenAI Gym Environments for Donkey Car](https://github.com/tawnkramer/gym-donkeycar) +* [Documentation](https://gym-donkeycar.readthedocs.io/en/latest/) +* Download [simulator binaries](https://github.com/tawnkramer/gym-donkeycar/releases) +* [Donkey Simulator User Guide](https://docs.donkeycar.com/guide/simulator/) +![](https://docs.donkeycar.com/assets/sim_screen_shot.png) + +--- +### [Google Dopamine](https://github.com/google/dopamine) +Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.
+*Dopamine supports the following agents, implemented with [jax](https://github.com/google/jax): DQN, C51, Rainbow, IQN, SAC.*
+ +--- +### [ViZDoom](https://github.com/mwydmuch/ViZDoom) +![](https://camo.githubusercontent.com/a7d9d95fc80903bcb476c2bbdeac3fa7623953c05401db79101c2468b0d90ad9/687474703a2f2f7777772e63732e7075742e706f7a6e616e2e706c2f6d6b656d706b612f6d6973632f76697a646f6f6d5f676966732f76697a646f6f6d5f636f727269646f725f7365676d656e746174696f6e2e676966) +`sudo apt install cmake libboost-all-dev libsdl2-dev libfreetype6-dev libgl1-mesa-dev libglu1-mesa-dev libpng-dev libjpeg-dev libbz2-dev libfluidsynth-dev libgme-ev libopenal-dev zlib1g-dev timidity tar nasm`
+`pip install vizdoom`
+ +--- +## AI in Games +**Paper:** [AI in Games: Techniques, Challenges and Opportunities](https://arxiv.org/abs/2111.07631)
+![](https://github.com/rkuo2000/AI-course/blob/main/images/AI_in_Games_survey.png?raw=true) + +--- +### AlphaGo +2016 年 3 月,AlphaGo 這一台 AI 思維的機器挑戰世界圍棋冠軍李世石(Lee Sedol)。比賽結果以 4 比 1 的分數,AlphaGo 壓倒性的擊倒人類世界最會下圍棋的男人。
+**Paper:** [Mastering the game of Go with deep neural networks and tree search](https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf)
+**Paper:** [Mastering the game of Go without human knowledge](https://www.nature.com/articles/nature24270.epdf?author_access_token=VJXbVjaSHxFoctQQ4p2k4tRgN0jAjWel9jnR3ZoTv0PVW4gB86EEpGqTRDtpIz-2rmo8-KG06gqVobU5NSCFeHILHcVFUeMsbvwS-lxjqQGg98faovwjxeTUgZAUMnRQ)
+**Blog:** [Day 27 / DL x RL / 令世界驚艷的 AlphaGo](https://ithelp.ithome.com.tw/articles/10252358)
+ +AlphaGo model 主要包含三個元件:
+* **Policy network**:根據盤面預測下一個落點的機率。 +* **Value network**:根據盤面預測最終獲勝的機率,類似預測盤面對兩方的優劣。 +* **Monte Carlo tree search (MCTS)**:類似在腦中計算後面幾步棋,根據幾步之後的結果估計現在各個落點的優劣。 + +![](https://i.imgur.com/xdc52cv.png) + +* **Policy Networks**: 給定 input state,會 output 每個 action 的機率。
+AlphaGo 中包含三種 policy network:
+* [Supervised learning (SL) policy network](https://chart.googleapis.com/chart?cht=tx&chl=p_%7B%5Csigma%7D) +* [Reinforcement learning (RL) policy network](https://chart.googleapis.com/chart?cht=tx&chl=p_%7B%5Crho%7D) +* [Rollout policy network](https://chart.googleapis.com/chart?cht=tx&chl=p_%7B%5Cpi%7D) + +* **Value Network**: 預測勝率,Input 是 state,output 是勝率值。
+這個 network 也可以用 supervised learning 訓練,data 是歷史對局中的 state-outcome pair,loss 是 mean squared error (MSE)。 + +* **Monte Carlo Tree Search (MCTS)**: 結合這些 network 做 planning,決定遊戲進行時的下一步。
+![](https://i.imgur.com/aXdpcz6.png) +1. Selection:從 root 開始,藉由 policy network 預測下一步落點的機率,來選擇要繼續往下面哪一步計算。選擇中還要考量每個 state-action pair 出現過的次數,盡量避免重複走同一條路,以平衡 exploration 和 exploitation。重複這個步驟直到樹的深度達到 max depth L。 +2. Expansion:到達 max depth 後的 leaf node sL,我們想要估計這個 node 的勝算。首先從 sL 往下 expand 一層。 +3. Evaluation:每個 sL 的 child node 會開始 rollout,也就是跟著 rollout policy network 預測的 action 開始往下走一陣子,取得 outcome z。最後 child node 的勝算會是 value network 對這個 node 預測的勝率和 z 的結合。 +4. Backup:sL 會根據每個 child node 的勝率更新自己的勝率,並往回 backup,讓從 root 到 sL 的每個 node 都更新勝率。 + +--- +### AlphaZero +2017 年 10 月,AlphaGo Zero 以 100 比 0 打敗 AlphaGo。
+**Blog:** [AlphaGo beat the world’s best Go player. He helped engineer the program that whipped AlphaGo.](https://www.technologyreview.com/innovator/julian-schrittwieser/)
+**Paper:** [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm](https://arxiv.org/abs/1712.01815)
+![](https://s.newtalk.tw/album/news/180/5c0f5a4489883.png) +AlphaGo 用兩個類神經網路,分別估計策略函數和價值函數。AlphaZero 用一個多輸出的類神經網路
+AlphaZero 的策略函數訓練方式是直接減少類神經網路與MCTS搜尋出來的πₜ之間的差距,這就是在做regression,而 AlpahGo 原本用的方式是RL演算法做 Policy gradient。(πₜ:當時MCTS後的動作機率值)
+**Blog:** [優拓 Paper Note ep.13: AlphaGo Zero](https://blog.yoctol.com/%E5%84%AA%E6%8B%93-paper-note-ep-13-alphago-zero-efa8d4dc538c)
+**Blog:** [Monte Carlo Tree Search (MCTS) in AlphaGo Zero](https://jonathan-hui.medium.com/monte-carlo-tree-search-mcts-in-alphago-zero-8a403588276a)
+**Blog:** [The 3 Tricks That Made AlphaGo Zero Work](https://hackernoon.com/the-3-tricks-that-made-alphago-zero-work-f3d47b6686ef)
+1. MTCS with intelligent lookahead search +2. Two-headed Neural Network Architecture +3. Using residual neural network architecture  + + + + + + +
+ +![](https://github.com/rkuo2000/AI-course/blob/main/images/AlphaGo_version_comparison.png?raw=true) + +--- +### AlphaZero with a Learned Model +**Paper:** [Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](https://arxiv.org/abs/1911.08265)
+RL can be divided into Model-Based RL (MBRL) and Model-Free RL (MFRL). Model-based RL uses an environment model for planning, whereas model-free RL learns the optimal policy directly from interactions. Model-based RL has achieved superhuman level of performance in Chess, Go, and Shogi, where the model is given and the game requires sophisticated lookahead. However, model-free RL performs better in environments with high-dimensional observations where the model must be learned. +![](https://www.endtoend.ai/assets/blog/rl-weekly/36/muzero.png) + +--- +### Minigo +**Code:** [tensorflow minigo](https://github.com/tensorflow/minigo)
+ +--- +### ELF OpenGo +**Code:** [https://github.com/pytorch/ELF](https://github.com/pytorch/ELF)
+**Blog:** [A new ELF OpenGo bot and analysis of historical Go games](https://ai.facebook.com/blog/open-sourcing-new-elf-opengo-bot-and-go-research/)
+ +--- +### Chess Zero +**Code:** [Zeta36/chess-alpha-zero](https://github.com/Zeta36/chess-alpha-zero)
+ + +--- +### AlphaStar +**Blog:** [AlphaStar: Mastering the real-time strategy game StarCraft II](https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii)
+**Blog:** [AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning](https://deepmind.com/blog/article/AlphaStar-Grandmaster-level-in-StarCraft-II-using-multi-agent-reinforcement-learning)
+**Code:** [PySC2 - StarCraft II Learning Environment](https://github.com/deepmind/pysc2)
+![](https://lh3.googleusercontent.com/ckm-3GlBQJ4zbNzfiW97yPqj5PVC0qIbRg42FL35EbDkhWoCNxyNZMMJN-f6VZmLMRbyBk2PArLQ-jDxlHbsE3_YaDUmcxUvMf8M=w1440-rw-v1) + +--- +### [OpenAI Five](https://openai.com/blog/openai-five/) at Dota2 + + +--- +### [DeepMind FTW](https://deepmind.com/blog/article/capture-the-flag-science) +![](https://lh3.googleusercontent.com/CFlAYmP49qitU-SOP_PaKtV1kOrlpNnvo4oEDFhyxelrVwKyAbkdXwUuDFRTmiRSQle4955mmOAB4jrrIrWzIXDt8hOajZGtJzNaDRw=w1440-rw-v1) +![](https://lh3.googleusercontent.com/RterJzRGidwT9R_Dqeu5LY5MZPjjYRc-MQdQyca7gACnA7w0bjCu_hIcoXLC4xV5zebvdZnN7ocZkemGnF4K7_p5SMZCLRWbNq1IDQ=w1440-rw-v1) + +--- +### Texas Hold'em Poker +**Code:** [fedden/poker_ai](https://github.com/fedden/poker_ai)
+**Code:** [Pluribus Poker AI](https://github.com/kanagle2312/pluribus-poker-AI) + [poker table](https://codepen.io/Rovak/pen/ExYeQar)
+**Blog:** [Artificial Intelligence Masters The Game of Poker – What Does That Mean For Humans?](https://www.forbes.com/sites/bernardmarr/2019/09/13/artificial-intelligence-masters-the-game-of-poker--what-does-that-mean-for-humans/?sh=dcaa18f5f9ea)
+ + +--- +### Suphx +**Paper:** [2003.13590](https://arxiv.org/abs/2003.13590)
+**Blog:** [微软超级麻将AI Suphx论文发布,研发团队深度揭秘技术细节](https://www.msra.cn/zh-cn/news/features/mahjong-ai-suphx-paper)
+![](https://d3i71xaburhd42.cloudfront.net/b30c663690c3a096c7d92f307ba7d17bdfd48553/6-Figure2-1.png) + +--- +### DouZero +**Paper:** [2106.06135](https://arxiv.org/abs/2106.06135)
+**Code:** [kwai/DouZero](https://github.com/kwai/DouZero)
+**Demo:** [douzero.org/](https://douzero.org/)
+![](https://camo.githubusercontent.com/45f00ff00a26f0df47ebbab3a993ccbf83e4715d7a0f1132665c8c045ebd52c2/68747470733a2f2f646f757a65726f2e6f72672f7075626c69632f64656d6f2e676966) + +--- +### JueWu +**Paper:** [Supervised Learning Achieves Human-Level Performance in MOBA Games: A Case Study of Honor of Kings](https://arxiv.org/abs/2011.12582)
+**Blog:** [Tencent AI ‘Juewu’ Beats Top MOBA Gamers](https://medium.com/syncedreview/tencent-ai-juewu-beats-top-moba-gamers-acdb44133d24)
+![](https://miro.medium.com/max/2000/1*bDp5a8gKiHynxiK-TQPJdQ.png) +![](https://github.com/rkuo2000/AI-course/blob/main/images/JueWu_structure.png?raw=true) + +--- +### StarCraft Commander +**[启元世界](http://www.inspirai.com/research/scc?language=en)**
+**Paper:** [SCC: an efficient deep reinforcement learning agent mastering the game of StarCraft II](https://arxiv.org/abs/2012.13169)
+ +--- +### Hanabi ToM +**Paper:** [Theory of Mind for Deep Reinforcement Learning in Hanabi](https://arxiv.org/abs/2101.09328)
+**Code:** [mwalton/ToM-hanabi-neurips19](https://github.com/mwalton/ToM-hanabi-neurips19)
+Hanabi (from Japanese 花火, fireworks) is a cooperative card game created by French game designer Antoine Bauza and published in 2010. + + +--- +## MARL (Multi-Agent Reinforcement Learning) + +### Neural MMO +**Paper:** [The Neural MMO Platform for Massively Multiagent Research](https://arxiv.org/abs/2110.07594)
+**Blog:** [User Guide](https://neuralmmo.github.io/build/html/rst/userguide.html)
+![](https://neuralmmo.github.io/build/html/_images/splash.png) + + +--- +### Multi-Agent Locomotion +**Paper:** [Emergent Coordination Through Competition](https://arxiv.org/abs/1902.07151)
+**Code:** [Locomotion task library](https://github.com/deepmind/dm_control/tree/master/dm_control/locomotion)
+**Code:** [DeepMind MuJoCo Multi-Agent Soccer Environment](https://github.com/deepmind/dm_control/tree/master/dm_control/locomotion/soccer)
+![](https://github.com/deepmind/dm_control/blob/master/dm_control/locomotion/soccer/soccer.png?raw=true) + +--- +### [Unity ML-agents Toolkit](https://unity.com/products/machine-learning-agents) +**Code:** [Unity ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents)
+![](https://unity.com/sites/default/files/styles/16_9_s_scale_width/public/2020-05/Complex-AI-environments_0.jpg) + +**Blog:** [A hands-on introduction to deep reinforcement learning using Unity ML-Agents](https://medium.com/coder-one/a-hands-on-introduction-to-deep-reinforcement-learning-using-unity-ml-agents-e339dcb5b954)
+![](https://miro.medium.com/max/540/0*ojKXHwzo_a-rjwpz.gif) + +--- +### DDPG Actor-Critic Reinforcement Learning Reacher Environment +**Code:** [https://github.com/Remtasya/DDPG-Actor-Critic-Reinforcement-Learning-Reacher-Environment](https://github.com/Remtasya/DDPG-Actor-Critic-Reinforcement-Learning-Reacher-Environment)
+![](https://github.com/Remtasya/DDPG-Actor-Critic-Reinforcement-Learning-Reacher-Environment/raw/master/project_images/reacher%20environment.gif) + +--- +### Multi-Agent Mobile Manipulation +**Paper:** [Spatial Intention Maps for Multi-Agent Mobile Manipulation](https://arxiv.org/abs/2103.12710)
+**Code:** [jimmyyhwu/spatial-intention-maps](https://github.com/jimmyyhwu/spatial-intention-maps)
+![](https://user-images.githubusercontent.com/6546428/111895195-42af8700-89ce-11eb-876c-5f98f6b31c96.gif) + +--- +### DeepMind Cultural Transmission +**Paper** [Learning few-shot imitation as cultural transmission](https://www.nature.com/articles/s41467-023-42875-2)
+**Blog:** [DeepMind智慧體訓練引入GoalCycle3D](https://cdn.technews.tw/2023/12/14/learning-few-shot-imitation-as-cultural-transmission)
+以模仿開始,然後深度強化學習繼續最佳化甚至找到超越前者的實驗,顯示AI智慧體能觀察別的智慧體學習並模仿。
+這從零樣本開始,即時取得利用資訊的能力,非常接近人類積累和提煉知識的方式。
+![](https://img.technews.tw/wp-content/uploads/2023/12/13113144/41467_2023_42875_Fig1_HTML.jpg) + +--- +## Imitation Learning +**Blog:** [A brief overview of Imitation Learning](https://smartlabai.medium.com/a-brief-overview-of-imitation-learning-8a8a75c44a9c)
+ + +--- +### Self-Imitation Learning +directly use past good experiences to train current policy.
+**Paper:** [Self-Imitation Learming](https://arxiv.org/abs/1806.05635)
+**Code:** [junhyukoh/self-imitation-learning](https://github.com/junhyukoh/self-imitation-learning)
+**Blog:** [[Paper Notes 2] Self-Imitation Learning](https://medium.com/intelligentunit/paper-notes-2-self-imitation-learning-b3a0fbdee351)
+![](https://miro.medium.com/max/2000/1*tvoSPpq7zSNscaVIxQX5hg@2x.png) + +--- +### Self-Imitation Learning by Planning +**Paper:** [Self-Imitation Learning by Planning](https://arxiv.org/abs/2103.13834)
+ + +--- +### Surgical Robotics +**Paper:** [Open-Sourced Reinforcement Learning Environments for Surgical Robotics](https://arxiv.org/abs/1903.02090)
+**Code:** [RL Environments for the da Vinci Surgical System](https://github.com/ucsdarclab/dVRL)
+ +--- +## Meta Learning (Learning to Learn) +**Blog:** [Meta-Learning: Learning to Learn Fast](https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html)
+ +### Meta-Learning Survey +**Paper:** [Meta-Learning in Neural Networks: A Survey](https://arxiv.org/abs/2004.05439)
+![](https://d3i71xaburhd42.cloudfront.net/020bb2ba5f3923858cd6882ba5c5a44ea8041ab6/6-Figure1-1.png) + +--- +### MAML (Model-Agnostic Meta-Learning) +**Paper:** [Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks](https://arxiv.org/abs/1703.03400)
+**Code:** [cbfinn/maml_rl](https://github.com/cbfinn/maml_rl)
+ +--- +### Reptile +**Paper:** [On First-Order Meta-Learning Algorithms](https://arxiv.org/abs/1803.02999)
+**Code:** [openai/supervised-reptile](https://github.com/openai/supervised-reptile)
+ +--- +### MAML++ +**Paper:** [How to train your MAML](https://arxiv.org/abs/1810.09502)
+**Code:** [AntreasAntoniou/HowToTrainYourMAMLPytorch](https://github.com/AntreasAntoniou/HowToTrainYourMAMLPytorch)
+**Blog:** [元學習——從MAML到MAML++](https://www.twblogs.net/a/60e689df1cf175147a0e2084)
+ +--- +**Paper:** [First-order Meta-Learned Initialization for Faster Adaptation in Deep Reinforcement Learning](https://www.andrew.cmu.edu/user/abhijatb/assets/Deep_RL_project.pdf)
+![](https://github.com/rkuo2000/AI-course/blob/main/images/Meta_Learning_algorithms.png?raw=true) + +--- +### FAMLE (Fast Adaption by Meta-Learning Embeddings) +**Paper:** [Fast Online Adaptation in Robotics through Meta-Learning Embeddings of Simulated Priors](https://arxiv.org/abs/2003.04663)
+ +![](https://media.arxiv-vanity.com/render-output/5158097/x1.png) +![](https://media.arxiv-vanity.com/render-output/5158097/x3.png) + +--- +### Bootstrapped Meta-Learning +**Paper:** [Bootstrapped Meta-Learning](https://arxiv.org/abs/2109.04504)
+**Blog:** [DeepMind’s Bootstrapped Meta-Learning Enables Meta Learners to Teach Themselves](https://syncedreview.com/2021/09/20/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-107/)
+ +![](https://i0.wp.com/syncedreview.com/wp-content/uploads/2021/09/image-77.png?w=549&ssl=1) + +--- +## Unsupervised Learning + +### Understanding the World Through Action +**Blog:** [Understanding the World Through Action: RL as a Foundation for Scalable Self-Supervised Learning](https://medium.com/@sergey.levine/understanding-the-world-through-action-rl-as-a-foundation-for-scalable-self-supervised-learning-636e4e243001)
+**Paper:** [Understanding the World Through Action](https://arxiv.org/abs/2110.12543)
+![](https://miro.medium.com/max/1400/1*79ztJveD6kanHz9H8VY2Lg.gif) +**Actionable Models**
+a self-supervised real-world robotic manipulation system trained with offline RL, performing various goal-reaching tasks. Actionable Models can also serve as general pretraining that accelerates acquisition of downstream tasks specified via conventional rewards. +![](https://miro.medium.com/max/1280/1*R7-IP07Inc7K6v4i_dQ-RQ.gif) + +--- +### RL-Stock +**Kaggle:** [https://www.kaggle.com/rkuo2000/stock-lstm](https://www.kaggle.com/rkuo2000/stock-lstm)
+**Kaggle:** [https://kaggle.com/rkuo2000/stock-dqn](https://kaggle.com/rkuo2000/stock-dqn)
+ +--- +### Stock Trading +**Blog:** [Predicting Stock Prices using Reinforcement Learning (with Python Code!)](https://www.analyticsvidhya.com/blog/2020/10/reinforcement-learning-stock-price-prediction/)
+![](https://editor.analyticsvidhya.com/uploads/770801_26xDRHI-alvDAfcPPJJGjQ.png) + +**Code:** [DQN-DDPG_Stock_Trading](https://github.com/AI4Finance-Foundation/DQN-DDPG_Stock_Trading)
+**Code:** [FinRL](https://github.com/AI4Finance-Foundation/FinRL)
+**Blog:** [Automated stock trading using Deep Reinforcement Learning with Fundamental Indicators](https://medium.com/@mariko.sawada1/automated-stock-trading-with-deep-reinforcement-learning-and-financial-data-a63286ccbe2b)
+ +--- +### FinRL +**Papers:**
+[2010.14194](https://arxiv.org/abs/2010.14194): Learning Financial Asset-Specific Trading Rules via Deep Reinforcement Learning
+[2011.09607](https://arxiv.org/abs/2011.09607): FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance
+[2101.03867](https://arxiv.org/abs/2101.03867): A Reinforcement Learning Based Encoder-Decoder Framework for Learning Stock Trading Rules
+[2106.00123](https://arxiv.org/abs/2106.00123): Deep Reinforcement Learning in Quantitative Algorithmic Trading: A Review
+[2111.05188](https://arxiv.org/abs/2111.05188): FinRL-Podracer: High Performance and Scalable Deep Reinforcement Learning for Quantitative Finance
+[2112.06753](https://arxiv.org/abs/2112.06753): FinRL-Meta: A Universe of Near-Real Market Environments for Data-Driven Deep Reinforcement Learning in Quantitative Finance
+ +**Blog:** [FinRL­-Meta: A Universe of Near Real-Market En­vironments for Data­-Driven Financial Reinforcement Learning](https://medium.datadriveninvestor.com/finrl-meta-a-universe-of-near-real-market-en-vironments-for-data-driven-financial-reinforcement-e1894e1ebfbd)
+![](https://miro.medium.com/max/2000/1*rOW0RH56A-chy3HKaxcjNw.png) +**Code:** [DQN-DDPG_Stock_Trading](https://github.com/AI4Finance-Foundation/DQN-DDPG_Stock_Trading)
+**Code:** [FinRL](https://github.com/AI4Finance-Foundation/FinRL)
+ +
+
+ +*This site was last updated {{ site.time | date: "%B %d, %Y" }}.* + +