- TensorFlow
- Hugging Face: blog📡, learn
- Kaggle
- BigML
- SentenceTransformers
- Lightning AI’s Deep Learning Fundamentals
- PyTorch
- Machine Learning for Software Engineering: a curated list of papers, PhD theses, datasets, and tools
Optimisation
: Homepage, Blog, YouTube📡⇈
: YouTube📡
: Homepage, Ahead of AI📡↑
AI Coffee Break with Letitia ( ): YouTube📡, Substack
Articles and videos
- SDS 771: Gradient Boosting: XGBoost, LightGBM and CatBoost, with Kirill Eremenko↑ (⧉) by and (April 2nd, 2024) ► A good basic presentation of some algorithms based on decision trees: XGBoost, LightGBM, and CatBoost.
- SDS 831: PyTorch Lightning Lit-Serve and Lightning Studios, with Dr. Luca Antiga (⧉) by and (October 29th, 2024) ► The products of Lightning AI (PyTorch Lightning, Lightning Studios, LitServe, and Lightning Thunder) and some thoughts about small language models.
-
Neural networks
- Polyworld: Using Evolution to Design Artificial Intelligence by (November 8th, 2007) ► Using artificial life to optimise neural networks.
- Evolving Neural Networks to Play 2048 by (May 12th, 2014) ► The title says it all.
- Is Dr. Calvin in the Room? by (March 16th, 2017) ► Some ideas and some dubious simple back-of-the-envelope calculations about neural networks.
- How to generate text: using different decoding methods for language generation with Transformers by (March 1st, 2020) ► A presentation of the different methods to control the text generated by a model: greedy search, beam search, sampling, top-K sampling, and top-p (nucleus) sampling.
- The Neural Network, A Visual Introduction by (August 23rd, 2020) ► The title says it all.
- Gradients are Not All You Need (Machine Learning Research Paper Explained) by (November 16th, 2021) ► A paper ("Gradients are Not All You Need") showing that gradient backpropagation does not work properly for some chaotic systems.
- Machine Learning 1: Tour d'horizon et le cas MuZero (feat Dalle2, PaLM) - Passe-science #47 by (June 4th, 2022) ► The latest results of the best AIs and how MuZero is trained.
- ↪Machine Learning 2: Architecture et Alphastar (Transformer, attention) - Passe-science #48 by (June 11th, 2022) ► A description of the architecture of AlphaStar and of transformers.
- TensorFlow in 100 Seconds by (August 3rd, 2022) ► A very short example of using TensorFlow.
- The spelled-out intro to neural networks and backpropagation: building micrograd↑ by (August 17th, 2022) ► A good basic introduction to backpropagation with the code details.
- Understanding Encoder And Decoder LLMs by (June 17th, 2023) ► The title says it all.
- Create a Large Language Model from Scratch with Python – Tutorial↓ by (August 25th, 2023) ► This lengthy tutorial is not worth watching. Many parts lack preparation, some explanations are confusing, and much time is spent on simple concepts while complex ones are skipped… We get some understanding of how to implement an LLM, but this could easily be accomplished in a one-hour video.
- LLM Training: RLHF and Its Alternatives by (September 10th, 2023) ► As said in the title, a clear description of RLHF and its alternatives.
- What is LoRA? Low-Rank Adaptation for finetuning LLMs EXPLAINED by (September 18th, 2023) ► A presentation of LoRA.
- Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation) — Things I Learned From Hundreds of Experiments by (November 19th, 2023) ► Some experiments with LoRA.
- What is Q-Learning (back to basics) by (November 25th, 2023) ► The title says it all (a minimal tabular Q-learning sketch appears after this list).
- Mixture of Experts Explained by , , , , , and (December 11th, 2023) ► A technical history of the MoE models.
- Apple is doing the UNTHINKABLE!!! by (January 6th, 2024) ► Some information about Apple’s MLX Framework.
- LLaMA Pro: Progressive LLaMA with Block Expansion (Paper Explained) by (January 7th, 2024) ► A commentary on a paper ("LLaMA Pro: Progressive LLaMA with Block Expansion") which contains dubious claims about how the researchers improved LLaMA by duplicating some blocks.
- A Guide to Deeplearning4j (January 8th, 2024) ► A short presentation of Deeplearning4j.
- Sampling for Text Generation↑ by (January 16th, 2024) ► A clear overview of the methods used to sample or constrain the output of a generative AI (see the decoding sketch after this list).
- AlphaGeometry: Solving olympiad geometry without human demonstrations (Paper Explained) by (January 22nd, 2024) ► A summary of the "Solving olympiad geometry without human demonstrations" paper describing AlphaGeometry.
- LLMs itself CAN create BETTER LLMs by (January 23rd, 2024) ► A quick presentation of "Self-Rewarding Language Models": a dubious claim that a model can be improved by having it reward and train a new iteration of itself.
- Model Merging, Mixtures of Experts, and Towards Smaller LLMs by (February 3rd, 2024) ► Weight Averaged Reward Models, Tuning Language Models by Proxy, Mixtral of Experts, and TinyLlama.
- Sparse LLMs at inference: 6x faster transformers! | DEJAVU paper explained by (February 3rd, 2024) ► A presentation of "Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time", where some self-attention heads and some MLP neurons are selected by running a simpler neural network.
- Lumiere: A Space-Time Diffusion Model for Video Generation (Paper Explained) by (February 4th, 2024) ► An opinionated presentation of "Lumiere: A Space-Time Diffusion Model for Video Generation".
- "MORE AGENTS" Is All You Need by (February 12th, 2024) ► "More Agents Is All You Need" analyses the gain of generating answers from several LLMs and using a voting mechanism to define the final answer.
- How Quickly Do Large Language Models Learn Unexpected Skills? — A new study suggests that so-called emergent abilities actually develop gradually and predictably, depending on how you measure them. by (February 13th, 2024) ► The debate about emergent capabilities appearing abruptly or continuously is still going on…
- Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch↑ by (February 18th, 2024) ► A clear description of LoRA and DoRA (see the LoRA sketch after this list).
- How Selective Forgetting Can Help AI Learn Better — Erasing key information during training results in machine learning models that can learn new languages faster and more easily.↓ by (February 28th, 2024) ► Only a little, and too basic, information about "forgetting" language models, which are easier to adapt to new languages.
- A LoRA Successor, Small Finetuned LLMs Vs Generalist LLMs, and Transparent LLM Research by (March 3rd, 2024) ► Can small fine-tuned models perform better on some tasks than large models, DoRA, OLMo is a real open-source model, Gemma…
- Tips for LLM Pretraining and Evaluating Reward Models — Discussing AI Research Papers in March 2024 by (March 31st, 2024) ► An analysis of continuous pretraining and a benchmark for evaluating reward models.
- How do mixture-of-experts layers affect transformer models? — This new LLM technique has started improving the results of models without additional training. by (April 4th, 2024) ► A short description of the Mixture of Experts architecture; I guess that if you know enough to understand this, you already know about MoE.
- Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping (Searchformer) by (April 6th, 2024) ► "Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping": using a transformer model to mimic an A* search algorithm; results are better when the model also has to reproduce the search traces.
- Why Recurrent Neural Networks are cursed | LM2 by (April 8th, 2024) ► A presentation of Recurrent Neural Networks.
- Flow Matching for Generative Modeling (Paper Explained) by (April 8th, 2024) ► "Flow Matching for Generative Modeling": a mathematical description of Flow Matching, a mechanism to train Continuous Normalizing Flows.
- How did the Attention Mechanism start an AI frenzy? | LM3 by (April 15th, 2024) ► How the attention mechanism was implemented for RNNs.
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention by (April 25th, 2024) ► A paper ("Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention") describing how to support infinite context length; the reviewer has some doubts about its real effectiveness.
- How AI 'Understands' Images (CLIP) - Computerphile by (April 25th, 2024) ► A basic presentation of CLIP.
- TransformerFAM: Feedback attention is working memory by (April 28th, 2024) ► Yet another paper ("TransformerFAM: Feedback attention is working memory") proposing infinite context by using additional tokens as a short term memory.
- ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained) by (May 1st, 2024) ► A paper ("ORPO: Monolithic Preference Optimization without Reference Model") combining supervised fine-tuning and preference alignment.
- Shapley Values Explained | Interpretability for AI models, even LLMs! by (May 6th, 2024) ► A presentation of Shapley values, a method to explain how much each input impacts the model’s output, and an example with Llama 2 and the SHAP library.
- Has Generative AI Already Peaked? - Computerphile by (May 9th, 2024) ► A basic presentation of a paper ("No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance") claiming that multimodal models require exponentially more data to achieve linear improvements.
- GaLore EXPLAINED: Memory-Efficient LLM Training by Gradient Low-Rank Projection by (May 27th, 2024) ► Yet another pre-training/fine-tuning algorithm: "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection".
- LLM Research Insights: Instruction Masking and New LoRA Finetuning Experiments — Discussing the Latest Model Releases and AI Research in May 2024 by (June 2nd, 2024) ► Three papers: not masking instructions when calculating the loss for instruction finetuning performs better than masking, but only if the answer is short and the number of training examples is small; LoRA learns less and forgets less than full finetuning; MoRA is yet another finetuning algorithm.
- Machine Learning and Logistic Regression↓ by (July 19th, 2024) ► A bad description of logistic regression; the linear part is not explained.
- A New Type of Neural Network Is More Interpretable — Kolmogorov-Arnold Networks could point physicists to new hypotheses↓ by (August 5th, 2024) ► There is little valuable information about KAN networks in this article.
- Reinforcement Learning from Human Feedback (RLHF) Explained by (August 7th, 2024) ► The title says it all: a short presentation of RLHF.
- New LLM Pre-training and Post-training Paradigms — A Look at How Modern LLMs Are Trained by (August 17th, 2024) ► An overview and comparison of the pre- and post-training of Qwen 2, Apple Foundation Model, Gemma 2, and Llama 3.1.
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution – Paper Explained by (August 20th, 2024) ► "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution" describes a diffusion model algorithm for text that gives correct results; previously, such models generated garbage text.
- What is Mixture of Experts? by (August 28th, 2024) ► A clear introduction to the MoE architecture and its advantages and challenges (see the routing sketch after this list).
- Transformer LLMs are Turing Complete after all !? by and (September 5th, 2024) ► LLMs with chain-of-thought are equivalent to probabilistic Turing machines.
- Building A GPT-Style LLM Classifier From Scratch — Finetuning a GPT Model for Spam Classification by (September 21st, 2024) ► This article details how to fine-tune an LLM to use it as a classifier: replace the output layer with a classification head, then post-train while freezing all layers except the last transformer block and the output layer.
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Paper) by (October 5th, 2024) ► A review of "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters", a paper comparing the cost efficiency of best-of-n vs. beam search vs. lookahead search, and the cost efficiency of training vs. inference.
- Graph Language Models EXPLAINED in 5 Minutes! [Author explanation 🔴 at ACL 2024] by and (October 6th, 2024) ► proposes a method to build a Graph Language Model from an LLM.
- Text Classification: AI Techniques and Real-World Applications by (October 15th, 2024) ► A good but limited 101 presentation for managers: the basics of text classification.
- Understanding Multimodal LLMs — An introduction to the main techniques and latest models↑ by (November 3rd, 2024) ► A good presentation of the architectures of models able to take both text and images as input. The two main options are Unified Embedding Decoder and Cross-modality Attention. describes ten recent such models.
- Large Language Models explained briefly by (November 20th, 2024) ► A short, basic, and clear presentation of how LLMs work.
- TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (Paper Explained) by (November 23rd, 2024) ► A rather negative review of "TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters".
- Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty by (December 5th, 2024) ► A prototype of an image-generation UI where the user is prompted to clarify some aspects of what they want, or can edit an interpretable belief graph of the model.
- REPA Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You ... by (December 8th, 2024) ► Improving image generation diffusion models by combining them with vision transformer models.
- Safety Alignment Should be Made More Than Just a Few Tokens Deep (Paper Explained) by (December 10th, 2024) ► Some comments on "Safety Alignment Should be Made More Than Just a Few Tokens Deep", a paper demonstrating that safety fine-tuning mostly impacts the first generated tokens, so you can get the model to generate unsafe content by controlling the first tokens, using a DAN prompt, fine-tuning it…
- Are LLMs capable of non-verbal reasoning? — Processing in the "latent space" could help AI with tricky logical questions. by (December 12th, 2024) ► The subtitle says it all.
- The Dark Matter of AI [Mechanistic Interpretability]↑ by (December 23rd, 2024) ► A presentation of Mechanistic Interpretability and its use of Sparse Autoencoders.
- Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained) by (December 24th, 2024) ► This description of "Byte Latent Transformer: Patches Scale Better Than Tokens" is difficult to understand.
- Noteworthy AI Research Papers of 2024 (Part One) — Six influential AI papers from January to June↑ by (December 31st, 2024) ► has chosen one interesting paper for each month of 2024. He gives some information extracted from each one.
- ↪Noteworthy AI Research Papers of 2024 (Part Two) — Six influential AI papers from July to December↑ by (January 15th, 2025) ► The second half of the year.
- Training large language models to reason in a continuous latent space – COCONUT Paper explained by (January 26th, 2025) ► An overview of the Chain of Thoughts mechanism presented in "Training Large Language Models to Reason in a Continuous Latent Space".
- [GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models by (January 26th, 2025) ► A description of TRPO, PPO, and GRPO. The most interesting part is the conclusion that RL is only able to reshape the base model probabilities, not to add new capabilities to it.
- Diffusion Models for AI Image Generation by (January 30th, 2025) ► A clear, basic, and classical presentation of diffusion models.
- ChatGPT is made from 100 million of these [The Perceptron] by (February 1st, 2025) ► The history of gradient back-propagation.
- What is Semi-Supervised Learning? by (February 3rd, 2025) ► The usual IBM presentation for managers, this time with a little information about semi-supervised learning.
- Speculative Decoding and Efficient LLM Inference with Chris Lott by and (February 3rd, 2025) ► The hardware constraints when running an LLM and some optimisations: KV caching, quantisation, pruning, speculative decoding…
- Understanding Reasoning LLMs — Methods and Strategies for Building and Refining Reasoning Models by (February 5th, 2025) ► A good overview of the current methods to build reasoning models: inference-time scaling, RL, SFT, and distillation.
- Elisa Fromont - Les modèles de diffusion by (February 6th, 2025) ► A mathematical description of the diffusion models.
- New AI text diffusion models break speed barriers by pulling words from noise — New diffusion models borrow technique from AI image synthesis for 10x speed boost. by (February 27th, 2025) ► Inception Labs has released Mercury Coder, a LLaDA (Large Language Diffusion with mAsking).
- How DeepSeek Rewrote the Transformer [MLA] by (March 5th, 2025) ► A presentation of DeepSeek’s Multi-Head Latent Attention.
- The State of LLM Reasoning Model Inference — Part 1: Inference-Time Compute Scaling Methods by (March 8th, 2025) ► summarises several articles among the current flurry of papers related to inference-time scaling.
- On the Biology of a Large Language Model (Part 1) by (April 5th, 2025) ► A presentation of two articles of Anthropic on interpretability: "Circuit Tracing: Revealing Computational Graphs in Language Models" and "On the Biology of a Large Language Model — We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.".
- ↪Exploring the “Biology” of LLMs with Circuit Tracing with Emmanuel Ameisen↑ by and (April 14th, 2025) ► An interview with one of the authors of the previous papers.
- Introducing HELMET: Holistically Evaluating Long-context Language Models by , , , , , , and (April 16th, 2025) ► A presentation of the benchmark and the results of running it on 59 models.
- Direct Preference Optimization: A Technical Deep Dive by , , and (April 17th, 2025) ► A basic description of DPO (see the DPO-loss sketch after this list).
- 4-Bit Training for Billion-Parameter LLMs? Yes, Really. by (April 18th, 2025) ► How to train an FP8 or FP4 model.
- The State of Reinforcement Learning for LLM Reasoning — Understanding GRPO and New Insights from Reasoning Model Papers by (April 19th, 2025) ► A description of RLHF, PPO, GRPO, and RLVR.
- Les 4 étapes pour entrainer un LLM by (April 25th, 2025) ► A basic description of the building of a LLM.
- ↪Les 4 étapes pour entrainer un LLM by (April 25th, 2025) ► Some additional information.
- The Strange Physics That Gave Birth to AI — Modern thinking machines owe their existence to insights from the physics of complex materials. by (April 30th, 2025) ► How had the idea to use models of spin glasses to build neural networks with memory.
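To make the tabular Q-learning mentioned in the "What is Q-Learning" entry concrete, here is a minimal sketch on a hypothetical five-state corridor; the environment, constants, and variable names are illustrative and not taken from the video.

```python
import random

# States 0..4, actions 0 (left) / 1 (right), reward 1 for reaching state 4.
N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount, exploration

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Hypothetical environment: move left/right, episode ends at state 4."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

for _ in range(500):                       # episodes
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection (ties broken at random)
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            best = max(q[(state, a)] for a in ACTIONS)
            action = random.choice([a for a in ACTIONS if q[(state, a)] == best])
        nxt, reward, done = step(state, action)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        target = reward + (0.0 if done else GAMMA * max(q[(nxt, a)] for a in ACTIONS))
        q[(state, action)] += ALPHA * (target - q[(state, action)])
        state = nxt

# greedy policy learned per state (1 = move right toward the reward)
print({s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES)})
```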
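The decoding methods covered by the text-generation and sampling entries above (greedy, top-k, and top-p/nucleus sampling) can be sketched as follows; this is a simplified stand-alone PyTorch version, not the code from those articles.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int | None = None, top_p: float | None = None) -> int:
    """Sample one token id from a vector of logits (shape: [vocab_size])."""
    logits = logits / temperature
    if top_k is not None:
        # keep only the k most likely tokens
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        # nucleus sampling: keep the smallest set of tokens whose probability mass >= top_p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        cut = cum_probs > top_p
        cut[1:] = cut[:-1].clone()   # shift so the token crossing the threshold is kept
        cut[0] = False               # always keep the most likely token
        logits[sorted_idx[cut]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# greedy decoding is simply torch.argmax(logits).item()
logits = torch.randn(50_000)             # stand-in for a model's output
print(sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9))
```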
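For the LoRA entries above, a minimal sketch of the idea of low-rank adaptation: the pretrained weight matrix is frozen and a small trainable update B·A is added. It is a simplified illustration, not the implementation used in the referenced articles (DoRA's weight decomposition is not shown).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the pretrained weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0 => no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))              # only A and B receive gradients
print(out.shape, sum(p.numel() for p in layer.parameters() if p.requires_grad))
```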
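For the Mixture-of-Experts entries above, a minimal sketch of sparse top-k routing: a router scores the experts, each token is sent to its k best experts, and the expert outputs are combined with the renormalised router weights. The layer sizes and loop-based dispatch are illustrative only; production implementations batch the dispatch and add load-balancing losses.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """A sparse mixture-of-experts layer with top-k routing per token."""
    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [tokens, d_model]
        gate_logits = self.router(x)                       # [tokens, n_experts]
        weights, idx = torch.topk(gate_logits, self.k, dim=-1)
        weights = torch.softmax(weights, dim=-1)           # renormalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(10, 64)).shape)                      # torch.Size([10, 64])
```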
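For the Direct Preference Optimization entry above, a minimal sketch of the DPO loss: it increases the implicit-reward margin between the chosen and the rejected answer relative to a frozen reference model. The per-answer log-probabilities are assumed to be precomputed, and the numbers in the example are toy values.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss for a batch of preference pairs.

    Each argument is the summed log-probability of the chosen / rejected answer
    under the trained policy or the frozen reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # maximise the margin between the implicit rewards of chosen and rejected answers
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# toy numbers just to show the call
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                torch.tensor([-13.0]), torch.tensor([-14.8]))
print(loss.item())
```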
-
Tokenisers
- Let's build the GPT Tokenizer⇈ by (February 20th, 2024) ► A very good description of tokenisation (a minimal byte-pair-merge sketch follows this list).
- So many tokens, so little time: Introducing a faster, more flexible byte-pair tokenizer — We released a new open source byte-pair tokenizer that is faster and more flexible than popular alternatives. by and (December 12th, 2024) ► The explanation of the algorithm is not so clear.
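As a complement to the two tokeniser entries above, here is a minimal sketch of byte-pair-encoding training: start from raw bytes and repeatedly merge the most frequent adjacent pair into a new symbol. It is a toy illustration, not the GPT or GitHub tokenizer code.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int = 50):
    """Minimal byte-pair-encoding training loop."""
    seq = list(text.encode("utf-8"))          # start from raw bytes, as GPT tokenisers do
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                          # nothing worth merging any more
            break
        new_id = 256 + len(merges)             # ids 0-255 are the raw bytes
        merges.append((pair, new_id))
        # replace every occurrence of the pair with the new id
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq

merges, encoded = train_bpe("low lower lowest newer newest " * 10)
print(len(merges), len(encoded))
```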
-
Generative Adversarial Networks
- Generative Adversarial Networks (GANs) - Computerphile by (October 25th, 2017) ► Generative Adversarial Networks and using them to generate images (a minimal training-loop sketch follows this list).
- Zebras, Horses & CycleGAN - Computerphile by (August 1st, 2019) ► A description of CycleGAN, two GANs working in opposite directions.
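To accompany the two videos above, a minimal sketch of a GAN training loop on toy 2-D data: the discriminator is trained to separate real from generated samples, and the generator is trained to fool it. The architectures and data are purely illustrative.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # noise -> sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 2) * 0.5 + 2.0            # stand-in "real" distribution
    fake = G(torch.randn(64, 8))

    # 1) train the discriminator to tell real from fake
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) train the generator to fool the discriminator
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(d_loss.item(), g_loss.item())
```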
-
Transformers
- AI Language Models & Transformers - Computerphile by (June 26th, 2019) ► The usage and implementation of language models, and the new attention-based ones: transformers.
- Transformers, explained: Understand the model behind GPT, BERT, and T5 by (August 19th, 2021) ► A basic presentation of transformers.
- Scaling Transformer to 1M tokens and beyond with RMT (Paper Explained) by (April 27th, 2023) ► An explanation of the "Scaling Transformer to 1M tokens and beyond with RMT" paper: an RNN where the base building block is a transformer.
- Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs by (January 14th, 2024) ► A clear and detailed description of self-attention implementation (see the attention sketch after this list).
- Transformers explained | The architecture behind LLMs by (January 21st, 2024) ► Yet another explanation of the transformer architecture. This one is correct.
- SDS 759: Full Encoder-Decoder Transformers Fully Explained, with Kirill Eremenko (⧉) by and (February 20th, 2024) ► Yet another presentation of transformers; this one is so-so.
- À quoi ressemble ChatGPT ? 🌶️↓ by (October 29th, 2024) ► This explanation of transformers is much too fast to be understandable.
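As a complement to the entries above, a minimal sketch of single-head scaled dot-product self-attention with an optional causal mask; multi-head attention, per-head projections, and batching are left out for brevity.

```python
import math
import torch

def self_attention(x: torch.Tensor, w_q, w_k, w_v, causal: bool = True) -> torch.Tensor:
    """Single-head self-attention over a sequence x of shape [seq_len, d_model].
    w_q, w_k, w_v are [d_model, d_head] projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(k.shape[-1])            # [seq_len, seq_len]
    if causal:
        # each position may only attend to itself and earlier positions
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v              # weighted sum of values

seq_len, d_model, d_head = 6, 16, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)             # torch.Size([6, 8])
```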
-
MAMBA
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained) by (December 24th, 2023) ► The paper presenting the Mamba architecture: "Mamba: Linear-Time Sequence Modeling with Selective State Spaces".
- This CRAZY Paper on Mamba has got some REAL Juice!!! by (February 7th, 2024) ► A shallow presentation of the "Repeat After Me: Transformers are Better than State Space Models at Copying" paper.
- MAMBA and State Space Models explained | SSM explained by (February 17th, 2024) ► Another presentation of MAMBA.
- The FIRST Production-grade Mamba-based LLM!!! by (March 31st, 2024) ► A presentation of ai21labs/Jamba-v0.1, a hybrid Mamba/Transformer architecture.
- Attention!!! JAMBA Instruct - Mamba LLM's new Baby!!! by (May 3rd, 2024) ► Jamba-Instruct, a chatbot based on Jamba.
-
xLSTM
- xLSTM: Extended Long Short-Term Memory by (June 2nd, 2024) ► A presentation of "xLSTM: Extended Long Short-Term Memory", a study of LSTM variants scaled to billions of parameters.
- xLSTM Explained in Detail!!! by and (July 1st, 2024) ► , an author of the previous paper, is presenting xLSTM.
-
Ternary models
- Floating Points are no more, Changes everything for LLMs!!!↓ by (February 28th, 2024) ► "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" describes an LLM using only -1, 0, and 1 as weights.
- Scalable MatMul-free Language Modeling (Paper Explained) by (July 8th, 2024) ► presents and comments "Scalable MatMul-free Language Modeling", a paper proposing to replace matrix multiplication with ternary weights.
- Understanding 1.58-bit Large Language Models by (September 7th, 2024) ► A good presentation of the state of the art of ternary models (see the quantisation sketch after this list).
- Microsoft’s “1‑bit” AI model runs on a CPU only, while matching larger systems — Future AI might not need supercomputers thanks to models like BitNet b1.58 2B4T. by (April 18th, 2025) ► Some basic information about Microsoft’s BitNet b1.58 2B4T.
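To illustrate the entries above, a minimal sketch of rounding a weight matrix to the ternary values {-1, 0, +1} with an absolute-mean scale, in the spirit of the 1.58-bit papers; the exact quantisation recipe varies between papers, and this is a simplified per-tensor version.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Round a weight matrix to {-1, 0, +1} with a single scale factor."""
    scale = w.abs().mean().clamp(min=eps)      # absmean scale
    w_q = (w / scale).round().clamp(-1, 1)     # ternary weights
    return w_q, scale                          # approximate reconstruction: w ≈ scale * w_q

w = torch.randn(4, 4)
w_q, scale = ternary_quantize(w)
print(w_q)                                     # entries are only -1, 0, or 1
print((w - scale * w_q).abs().mean())          # quantisation error
```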
-
PyTorch
- Débuter avec PyTorch by (March 18th, 2021) ► A short introduction to PyTorch with a small example.
- Building a Single Layer Neural Network in PyTorch by (April 8th, 2023) ► A complete and very simple example (a comparable minimal sketch follows this list).
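In the same spirit as the two tutorials above, a minimal PyTorch sketch: a single linear layer fitted by gradient descent on toy data (the data and hyperparameters are illustrative).

```python
import torch
import torch.nn as nn

# Fit y = 3x + 1 (plus noise) with a single linear layer.
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 3 * x + 1 + 0.1 * torch.randn_like(x)

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                   # compute gradients
    optimizer.step()                  # update weight and bias

print(model.weight.item(), model.bias.item())   # close to 3 and 1
```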
-
Get SH*T Done with PyTorch
- Getting Started with PyTorch by (February 6th, 2020) ► A short introduction to PyTorch.
- Build Your First Neural Network with PyTorch by (February 21st, 2020) ► A simple neural network.
-
3blue1brown’s Deep learning
- But what is a neural network? | Deep learning chapter 1 by (October 5th, 2017) ► A basic introduction to the structure of neural networks and how such a structure could work.
- Gradient descent, how neural networks learn | DL2 by (October 16th, 2017) ► Training a neural network amounts to minimising a cost function, and gradient descent is used to perform this minimisation (see the gradient-descent sketch after this list).
- Backpropagation, intuitively | DL3 by (November 3rd, 2017) ► Getting a feeling of how backpropagation works.
- Backpropagation calculus | DL4 by (November 3rd, 2017) ► The calculus expressions of backpropagation.
- Transformers (how LLMs work) explained visually | DL5 by (April 1st, 2024) ► A very high-level description of an LLM architecture and the description of embedding and softmax.
- Attention in transformers, step-by-step | DL6 by (April 7th, 2024) ► A description of the transformer architecture.
- Visualizing transformers and attention | Talk for TNG Big Tech Day '24 by (July 5th, 2024) ► A 45-minute summary of the previous videos.
- How might LLMs store facts | DL7↑ by (August 31st, 2024) ► A description of how the multilayer perceptrons complete embedded vectors with related data during inference.
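To make the gradient-descent chapter concrete, a minimal sketch that minimises a toy two-parameter cost function by repeatedly stepping against the gradient computed by autograd; a real network does the same thing, only with millions of parameters and a loss computed over training data.

```python
import torch

# Minimise the toy cost C(w) = (w1 - 3)^2 + (w2 + 1)^2 by following minus the gradient.
w = torch.tensor([0.0, 0.0], requires_grad=True)
lr = 0.1

for step in range(100):
    cost = (w[0] - 3) ** 2 + (w[1] + 1) ** 2
    cost.backward()                   # autograd computes dC/dw
    with torch.no_grad():
        w -= lr * w.grad              # step downhill
        w.grad.zero_()

print(w.tolist())                     # converges toward [3.0, -1.0]
```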
-
Grokking
- Grokking: Generalization beyond Overfitting on small algorithmic datasets (Paper Explained) by (October 6th, 2021) ► A paper ("Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets") on the grokking phenomenon where generalisation seems to happen abruptly and long after fitting the training data.
- "Grokking" : les modèles d'IA sont-ils capables de piger ? — Ce phénomène étonnant, découvert récemment, pourrait changer notre compréhension de l'apprentissage et de la cognition dans les réseaux de neurones... by (September 11th, 2023) ► Another presentation of grokking.
- How Do Machines ‘Grok’ Data? — By apparently overtraining them, researchers have seen neural networks discover novel solutions to problems. by (April 12th, 2024) ► Some researchers understood some cases of grokking, but the phenomenon has only been studied on small neural networks doing modular arithmetic.