- TensorFlow
- Hugging Face: blog📡, learn
- Kaggle
- BigML
- SentenceTransformers
- Lightning AI’s Deep Learning Fundamentals
- PyTorch
- Machine Learning for Software Engineering: a curated list of papers, PhD theses, datasets, and tools
Optimisation
: Homepage, Blog, YouTube📡⇈
: YouTube📡
: Homepage, Ahead of AI📡↑
AI Coffee Break with Letitia ( ): YouTube📡, Substack
Articles and videos
- SDS 771: Gradient Boosting: XGBoost, LightGBM and CatBoost, with Kirill Eremenko↑ (⧉) by and (April 2nd, 2024) ► A good basic presentation of some algorithms based on decision trees: XGBoost, LightGBM, and CatBoost.
- SDS 831: PyTorch Lightning Lit-Serve and Lightning Studios, with Dr. Luca Antiga (⧉) by and (October 29th, 2024) ► The products of Lightning AI (PyTorch Lightning, Lightning Studios, LitServe, and Lightning Thunder) and some thoughts about small language models.
-
Neural networks
- Polyworld: Using Evolution to Design Artificial Intelligence by (November 8th, 2007) ► Using artificial life to optimise neural networks.
- Evolving Neural Networks to Play 2048 by (May 12th, 2014) ► The title says it all.
- Is Dr. Calvin in the Room? by (March 16th, 2017) ► Some ideas and some dubious simple back-of-the-envelope calculations about neural networks.
- How to generate text: using different decoding methods for language generation with Transformers by (March 1st, 2020) ► A presentation of the different methods to control the text generated by a model: greedy search, beam search, sampling, top-K sampling, and top-p (nucleus) sampling.
- The Neural Network, A Visual Introduction by (August 23rd, 2020) ► The title says it all.
- Gradients are Not All You Need (Machine Learning Research Paper Explained) by (November 16th, 2021) ► A paper ("Gradients are Not All You Need") showing that gradient backpropagation does not work properly for some chaotic systems.
- Machine Learning 1: Tour d'horizon et le cas MuZero (feat Dalle2, PaLM) - Passe-science #47 by (June 4th, 2022) ► The latest results of the best AIs and how MuZero is trained.
- ↪Machine Learning 2: Architecture et Alphastar (Transformer, attention) - Passe-science #48 by (June 11th, 2022) ► A description of the architecture of AlphaStar and of transformers.
- TensorFlow in 100 Seconds by (August 3rd, 2022) ► A very short example of using TensorFlow.
- The spelled-out intro to neural networks and backpropagation: building micrograd↑ by (August 17th, 2022) ► A good basic introduction to backpropagation with the code details.
- Understanding Encoder And Decoder LLMs by (June 17th, 2023) ► The title says it all.
- Create a Large Language Model from Scratch with Python – Tutorial↓ by (August 25th, 2023) ► This lengthy tutorial is not worth watching. Many parts lack preparation, some explanations are confusing, and much time is spent on simple concepts while complex ones are skipped… We get some understanding of how to implement an LLM, but this could easily be accomplished in a one-hour video.
- LLM Training: RLHF and Its Alternatives by (September 10th, 2023) ► As said in the title, a clear description of RLHF and its alternatives.
- What is LoRA? Low-Rank Adaptation for finetuning LLMs EXPLAINED by (September 18th, 2023) ► A presentation of LoRA.
- Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation) — Things I Learned From Hundreds of Experiments by (November 19th, 2023) ► Some experiments with LoRA.
- What is Q-Learning (back to basics) by (November 25th, 2023) ► The title says it all (a minimal tabular Q-learning sketch appears after this list).
- Mixture of Experts Explained by , , , , , and (December 11th, 2023) ► A technical history of the MoE models.
- Apple is doing the UNTHINKABLE!!! by (January 6th, 2024) ► Some information about Apple’s MLX Framework.
- LLaMA Pro: Progressive LLaMA with Block Expansion (Paper Explained) by (January 7th, 2024) ► A commentary on a paper ("LLaMA Pro: Progressive LLaMA with Block Expansion") which contains dubious claims about how the researchers improved LLaMA by duplicating some blocks.
- A Guide to Deeplearning4j (January 8th, 2024) ► A short presentation of Deeplearning4j.
- Sampling for Text Generation↑ by (January 16th, 2024) ► A clear overview of the methods used to sample or constrain the output of a generative AI (see the decoding sketch after this list).
- AlphaGeometry: Solving olympiad geometry without human demonstrations (Paper Explained) by (January 22nd, 2024) ► A summary of the "Solving olympiad geometry without human demonstrations" paper describing AlphaGeometry.
- LLMs itself CAN create BETTER LLMs by (January 23rd, 2024) ► A quick presentation of "Self-Rewarding Language Models": a dubious claim that a model can be improved by having it reward and train a new iteration of itself.
- Model Merging, Mixtures of Experts, and Towards Smaller LLMs by (February 3rd, 2024) ► Weight Averaged Reward Models, Tuning Language Models by Proxy, Mixtral of Experts, and TinyLlama.
- Sparse LLMs at inference: 6x faster transformers! | DEJAVU paper explained by (February 3rd, 2024) ► A presentation of "Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time", where some self-attention heads and some MLP neurons are selected by running a simpler neural network.
- Lumiere: A Space-Time Diffusion Model for Video Generation (Paper Explained) by (February 4th, 2024) ► An opinionated presentation of "Lumiere: A Space-Time Diffusion Model for Video Generation".
- "MORE AGENTS" Is All You Need by (February 12th, 2024) ► "More Agents Is All You Need" analyses the gain of generating answers from several LLMs and using a voting mechanism to define the final answer.
- How Quickly Do Large Language Models Learn Unexpected Skills? — A new study suggests that so-called emergent abilities actually develop gradually and predictably, depending on how you measure them. by (February 13th, 2024) ► The debate about emergent capabilities appearing abruptly or continuously is still going on…
- Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch↑ by (February 18th, 2024) ► A clear description of LoRA and DoRA (see the LoRA sketch after this list).
- How Selective Forgetting Can Help AI Learn Better — Erasing key information during training results in machine learning models that can learn new languages faster and more easily.↓ by (February 28th, 2024) ► Only a little, and too basic, information about "forgetting" language models, which are easier to adapt to new languages.
- A LoRA Successor, Small Finetuned LLMs Vs Generalist LLMs, and Transparent LLM Research by (March 3rd, 2024) ► Can small fine-tuned models perform better on some tasks than large models, DoRA, OLMo is a real open-source model, Gemma…
- Tips for LLM Pretraining and Evaluating Reward Models — Discussing AI Research Papers in March 2024 by (March 31st, 2024) ► An analysis of continuous pretraining and a benchmark for evaluating reward models.
- How do mixture-of-experts layers affect transformer models? — This new LLM technique has started improving the results of models without additional training. by (April 4th, 2024) ► A short description of the Mixture of Experts architecture; I guess that if you know enough to understand this, you already know about MoE.
- Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping (Searchformer) by (April 6th, 2024) ► "Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping": using a transformer model to mimic an A* search algorithm; results are better when the model also has to reproduce the search traces.
- Why Recurrent Neural Networks are cursed | LM2 by (April 8th, 2024) ► A presentation of Recurrent Neural Networks.
- Flow Matching for Generative Modeling (Paper Explained) by (April 8th, 2024) ► "Flow Matching for Generative Modeling": a mathematical description of Flow Matching, a mechanism to train Continuous Normalizing Flows.
- How did the Attention Mechanism start an AI frenzy? | LM3 by (April 15th, 2024) ► How the attention mechanism was implemented for RNNs.
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention by (April 25th, 2024) ► A paper ("Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention") describing how to support infinite context length; the reviewer has some doubts about its real effectiveness.
- How AI 'Understands' Images (CLIP) - Computerphile by (April 25th, 2024) ► A basic presentation of CLIP.
- TransformerFAM: Feedback attention is working memory by (April 28th, 2024) ► Yet another paper ("TransformerFAM: Feedback attention is working memory") proposing infinite context by using additional tokens as a short term memory.
- ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained) by (May 1st, 2024) ► A paper ("ORPO: Monolithic Preference Optimization without Reference Model") combining supervised fine-tuning and preference alignment.
- Shapley Values Explained | Interpretability for AI models, even LLMs! by (May 6th, 2024) ► A presentation of Shapley values, a method to explain how much each input impacts the model’s output, and an example with Llama 2 and the SHAP library.
- Has Generative AI Already Peaked? - Computerphile by (May 9th, 2024) ► A basic presentation of a paper ("No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance") claiming that multimodal models require exponentially more data to achieve linear improvements.
- GaLore EXPLAINED: Memory-Efficient LLM Training by Gradient Low-Rank Projection by (May 27th, 2024) ► Yet another pre-training/fine-tuning algorithm: "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection".
- LLM Research Insights: Instruction Masking and New LoRA Finetuning Experiments — Discussing the Latest Model Releases and AI Research in May 2024 by (June 2nd, 2024) ► Three papers: not masking instructions when calculating the loss for instruction finetuning performs better than masking, but only if the answer is short and the number of training examples is small; LoRA learns less and forgets less than full finetuning; MoRA is yet another finetuning algorithm.
- Machine Learning and Logistic Regression↓ by (July 19th, 2024) ► A bad description of logistic regression; the linear part is not explained.
- A New Type of Neural Network Is More Interpretable — Kolmogorov-Arnold Networks could point physicists to new hypotheses↓ by (August 5th, 2024) ► There is little valuable information about KAN networks in this article.
- Reinforcement Learning from Human Feedback (RLHF) Explained by (August 7th, 2024) ► The title says it all: a short presentation of RLHF.
- New LLM Pre-training and Post-training Paradigms — A Look at How Modern LLMs Are Trained by (August 17th, 2024) ► An overview and comparison of the pre- and post-training of Qwen 2, Apple Foundation Model, Gemma 2, and Llama 3.1.
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution – Paper Explained by (August 20th, 2024) ► "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution" describes a diffusion model algorithm for text that gives correct results; previously, such models generated garbage text.
- What is Mixture of Experts? by (August 28th, 2024) ► A clear introduction to the MoE architecture and its advantages and challenges (see the routing sketch after this list).
- Transformer LLMs are Turing Complete after all !? by and (September 5th, 2024) ► LLMs with chain-of-thought are equivalent to probabilistic Turing machines.
- Building A GPT-Style LLM Classifier From Scratch — Finetuning a GPT Model for Spam Classification by (September 21st, 2024) ► This article details how to fine-tune an LLM to use it as a classifier: replace the output layer with a classification head, then post-train while freezing all layers except the last transformer block and the output layer.
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Paper) by (October 5th, 2024) ► A review of "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters", a paper comparing the cost efficiency of best-of-n vs. beam search vs. lookahead search, and the cost efficiency of training vs. inference.
- Graph Language Models EXPLAINED in 5 Minutes! [Author explanation 🔴 at ACL 2024] by and (October 6th, 2024) ► proposes a method to build a Graph Language Model from an LLM.
- Text Classification: AI Techniques and Real-World Applications by (October 15th, 2024) ► A good but limited 101 presentation for managers: the basics of text classification.
- Understanding Multimodal LLMs — An introduction to the main techniques and latest models↑ by (November 3rd, 2024) ► A good presentation of the architectures of models able to take both text and images as input. The two main options are Unified Embedding Decoder and Cross-modality Attention. describes ten recent such models.
- Large Language Models explained briefly by (November 20th, 2024) ► A short, basic, and clear presentation of how LLMs work.
- TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (Paper Explained) by (November 23rd, 2024) ► A rather negative review of "TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters".
- Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty by (December 5th, 2024) ► A prototype of an image-generation UI where the user is prompted to clarify some aspects of what they want, or can edit an interpretable belief graph of the model.
- REPA Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You ... by (December 8th, 2024) ► Improving image generation diffusion models by combining them with vision transformer models.
- Safety Alignment Should be Made More Than Just a Few Tokens Deep (Paper Explained) by (December 10th, 2024) ► Some comments on "Safety Alignment Should be Made More Than Just a Few Tokens Deep", a paper demonstrating that safety fine-tuning mostly impacts the first generated tokens, so you can get the model to generate unsafe content by controlling the first tokens, using a DAN prompt, fine-tuning it…
- Are LLMs capable of non-verbal reasoning? — Processing in the "latent space" could help AI with tricky logical questions. by (December 12th, 2024) ► The subtitle says it all.
- The Dark Matter of AI [Mechanistic Interpretability]↑ by (December 23rd, 2024) ► A presentation of Mechanistic Interpretability and its use of Sparse Autoencoders.
- Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained) by (December 24th, 2024) ► This description of "Byte Latent Transformer: Patches Scale Better Than Tokens" is difficult to understand.
- Noteworthy AI Research Papers of 2024 (Part One) — Six influential AI papers from January to June↑ by (December 31st, 2024) ► has chosen one interesting paper for each month of 2024. He gives some information extracted from each one.
- ↪Noteworthy AI Research Papers of 2024 (Part Two) — Six influential AI papers from July to December↑ by (January 15th, 2025) ► The second half of the year.
- Training large language models to reason in a continuous latent space – COCONUT Paper explained by (January 26th, 2025) ► An overview of the Chain of Thoughts mechanism presented in "Training Large Language Models to Reason in a Continuous Latent Space".
- [GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models by (January 26th, 2025) ► A description of TRPO, PPO, and GRPO. The most interesting part is the conclusion that RL is only able to reshape the base model probabilities, not to add new capabilities to it.
- Diffusion Models for AI Image Generation by (January 30th, 2025) ► A clear, basic, and classical presentation of diffusion models.
- ChatGPT is made from 100 million of these [The Perceptron] by (February 1st, 2025) ► The history of gradient back-propagation.
- What is Semi-Supervised Learning? by (February 3rd, 2025) ► The usual IBM presentation for managers, this time with a little information about semi-supervised learning.
- Speculative Decoding and Efficient LLM Inference with Chris Lott by and (February 3rd, 2025) ► The hardware constraints when running an LLM and some optimisations: KV caching, quantisation, pruning, speculative decoding…
- Understanding Reasoning LLMs — Methods and Strategies for Building and Refining Reasoning Models by (February 5th, 2025) ► A good overview of the current methods to build reasoning models: inference-time scaling, RL, SFT, and distillation.
- Elisa Fromont - Les modèles de diffusion by (February 6th, 2025) ► A mathematical description of the diffusion models.
- New AI text diffusion models break speed barriers by pulling words from noise — New diffusion models borrow technique from AI image synthesis for 10x speed boost. by (February 27th, 2025) ► Inception Labs has released Mercury Coder, a LLaDA (Large Language Diffusion with mAsking).
- How DeepSeek Rewrote the Transformer [MLA] by (March 5th, 2025) ► A presentation of DeepSeek’s Multi-Head Latent Attention.
- The State of LLM Reasoning Model Inference — Part 1: Inference-Time Compute Scaling Methods by (March 8th, 2025) ► summarises several articles among the current flurry of papers related to inference-time scaling.
- On the Biology of a Large Language Model (Part 1) by (April 5th, 2025) ► A presentation of two articles of Anthropic on interpretability: "Circuit Tracing: Revealing Computational Graphs in Language Models" and "On the Biology of a Large Language Model — We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.".
- ↪Exploring the “Biology” of LLMs with Circuit Tracing with Emmanuel Ameisen↑ by and (April 14th, 2025) ► An interview with one of the authors of the previous papers.
- Introducing HELMET: Holistically Evaluating Long-context Language Models by , , , , , , and (April 16th, 2025) ► A presentation of the benchmark and the results of running it on 59 models.
- Direct Preference Optimization: A Technical Deep Dive by , , and (April 17th, 2025) ► A basic description of DPO (see the DPO-loss sketch after this list).
- 4-Bit Training for Billion-Parameter LLMs? Yes, Really. by (April 18th, 2025) ► How to train an FP8 or FP4 model.
- The State of Reinforcement Learning for LLM Reasoning — Understanding GRPO and New Insights from Reasoning Model Papers by (April 19th, 2025) ► A description of RLHF, PPO, GRPO, and RLVR.
- Les 4 étapes pour entrainer un LLM by (April 25th, 2025) ► A basic description of the building of a LLM.
- ↪Les 4 étapes pour entrainer un LLM by (April 25th, 2025) ► Some additional information.
- The Strange Physics That Gave Birth to AI — Modern thinking machines owe their existence to insights from the physics of complex materials. by (April 30th, 2025) ► How had the idea to use models of spin glasses to build neural networks with memory.
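To make the tabular Q-learning mentioned in the "What is Q-Learning" entry concrete, here is a minimal sketch on a hypothetical five-state corridor; the environment, constants, and variable names are illustrative and not taken from the video.

```python
import random

# States 0..4, actions 0 (left) / 1 (right), reward 1 for reaching state 4.
N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount, exploration

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Hypothetical environment: move left/right, episode ends at state 4."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

for _ in range(500):                       # episodes
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection (ties broken at random)
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            best = max(q[(state, a)] for a in ACTIONS)
            action = random.choice([a for a in ACTIONS if q[(state, a)] == best])
        nxt, reward, done = step(state, action)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        target = reward + (0.0 if done else GAMMA * max(q[(nxt, a)] for a in ACTIONS))
        q[(state, action)] += ALPHA * (target - q[(state, action)])
        state = nxt

# greedy policy learned per state (1 = move right toward the reward)
print({s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES)})
```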
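The decoding methods covered by the text-generation and sampling entries above (greedy, top-k, and top-p/nucleus sampling) can be sketched as follows; this is a simplified stand-alone PyTorch version, not the code from those articles.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int | None = None, top_p: float | None = None) -> int:
    """Sample one token id from a vector of logits (shape: [vocab_size])."""
    logits = logits / temperature
    if top_k is not None:
        # keep only the k most likely tokens
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        # nucleus sampling: keep the smallest set of tokens whose probability mass >= top_p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        cut = cum_probs > top_p
        cut[1:] = cut[:-1].clone()   # shift so the token crossing the threshold is kept
        cut[0] = False               # always keep the most likely token
        logits[sorted_idx[cut]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# greedy decoding is simply torch.argmax(logits).item()
logits = torch.randn(50_000)             # stand-in for a model's output
print(sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9))
```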
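For the LoRA entries above, a minimal sketch of the idea of low-rank adaptation: the pretrained weight matrix is frozen and a small trainable update B·A is added. It is a simplified illustration, not the implementation used in the referenced articles (DoRA's weight decomposition is not shown).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the pretrained weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0 => no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))              # only A and B receive gradients
print(out.shape, sum(p.numel() for p in layer.parameters() if p.requires_grad))
```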
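For the Mixture-of-Experts entries above, a minimal sketch of sparse top-k routing: a router scores the experts, each token is sent to its k best experts, and the expert outputs are combined with the renormalised router weights. The layer sizes and loop-based dispatch are illustrative only; production implementations batch the dispatch and add load-balancing losses.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """A sparse mixture-of-experts layer with top-k routing per token."""
    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [tokens, d_model]
        gate_logits = self.router(x)                       # [tokens, n_experts]
        weights, idx = torch.topk(gate_logits, self.k, dim=-1)
        weights = torch.softmax(weights, dim=-1)           # renormalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(10, 64)).shape)                      # torch.Size([10, 64])
```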
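For the Direct Preference Optimization entry above, a minimal sketch of the DPO loss: it increases the implicit-reward margin between the chosen and the rejected answer relative to a frozen reference model. The per-answer log-probabilities are assumed to be precomputed, and the numbers in the example are toy values.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss for a batch of preference pairs.

    Each argument is the summed log-probability of the chosen / rejected answer
    under the trained policy or the frozen reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # maximise the margin between the implicit rewards of chosen and rejected answers
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# toy numbers just to show the call
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                torch.tensor([-13.0]), torch.tensor([-14.8]))
print(loss.item())
```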
-
Tokenisers
- Let's build the GPT Tokenizer⇈ by (February 20th, 2024) ► A very good description of tokenisation (a minimal byte-pair-merge sketch follows this list).
- So many tokens, so little time: Introducing a faster, more flexible byte-pair tokenizer — We released a new open source byte-pair tokenizer that is faster and more flexible than popular alternatives. by and (December 12th, 2024) ► The explanation of the algorithm is not so clear.
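As a complement to the two tokeniser entries above, here is a minimal sketch of byte-pair-encoding training: start from raw bytes and repeatedly merge the most frequent adjacent pair into a new symbol. It is a toy illustration, not the GPT or GitHub tokenizer code.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int = 50):
    """Minimal byte-pair-encoding training loop."""
    seq = list(text.encode("utf-8"))          # start from raw bytes, as GPT tokenisers do
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                          # nothing worth merging any more
            break
        new_id = 256 + len(merges)             # ids 0-255 are the raw bytes
        merges.append((pair, new_id))
        # replace every occurrence of the pair with the new id
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq

merges, encoded = train_bpe("low lower lowest newer newest " * 10)
print(len(merges), len(encoded))
```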
-
Generative Adversarial Networks
- Generative Adversarial Networks (GANs) - Computerphile by (October 25th, 2017) ► Generative Adversarial Networks and using them to generate images (a minimal training-loop sketch follows this list).
- Zebras, Horses & CycleGAN - Computerphile by (August 1st, 2019) ► A description of CycleGAN, two GANs working in opposite directions.
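To accompany the two videos above, a minimal sketch of a GAN training loop on toy 2-D data: the discriminator is trained to separate real from generated samples, and the generator is trained to fool it. The architectures and data are purely illustrative.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # noise -> sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 2) * 0.5 + 2.0            # stand-in "real" distribution
    fake = G(torch.randn(64, 8))

    # 1) train the discriminator to tell real from fake
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) train the generator to fool the discriminator
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(d_loss.item(), g_loss.item())
```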
-
Transformers
- AI Language Models & Transformers - Computerphile by (June 26th, 2019) ► The usage and implementation of language models, and the new attention-based ones: transformers.
- Transformers, explained: Understand the model behind GPT, BERT, and T5 by (August 19th, 2021) ► A basic presentation of transformers.
- Scaling Transformer to 1M tokens and beyond with RMT (Paper Explained) by (April 27th, 2023) ► An explanation of the "Scaling Transformer to 1M tokens and beyond with RMT" paper: an RNN where the base building block is a transformer.
- Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs by (January 14th, 2024) ► A clear and detailed description of self-attention implementation (see the attention sketch after this list).
- Transformers explained | The architecture behind LLMs by (January 21st, 2024) ► Yet another explanation of the transformer architecture. This one is correct.
- SDS 759: Full Encoder-Decoder Transformers Fully Explained, with Kirill Eremenko (⧉) by and (February 20th, 2024) ► Yet another presentation of transformers; this one is so-so.
- À quoi ressemble ChatGPT ? 🌶️↓ by (October 29th, 2024) ► This explanation of transformers is much too fast to be understandable.
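As a complement to the entries above, a minimal sketch of single-head scaled dot-product self-attention with an optional causal mask; multi-head attention, per-head projections, and batching are left out for brevity.

```python
import math
import torch

def self_attention(x: torch.Tensor, w_q, w_k, w_v, causal: bool = True) -> torch.Tensor:
    """Single-head self-attention over a sequence x of shape [seq_len, d_model].
    w_q, w_k, w_v are [d_model, d_head] projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(k.shape[-1])            # [seq_len, seq_len]
    if causal:
        # each position may only attend to itself and earlier positions
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v              # weighted sum of values

seq_len, d_model, d_head = 6, 16, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)             # torch.Size([6, 8])
```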
-
MAMBA
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained) by (December 24th, 2023) ► The paper presenting the Mamba architecture: "Mamba: Linear-Time Sequence Modeling with Selective State Spaces".
- This CRAZY Paper on Mamba has got some REAL Juice!!! by (February 7th, 2024) ► A shallow presentation of the "Repeat After Me: Transformers are Better than State Space Models at Copying" paper.
- MAMBA and State Space Models explained | SSM explained by (February 17th, 2024) ► Another presentation of MAMBA.
- The FIRST Production-grade Mamba-based LLM!!! by (March 31st, 2024) ► A presentation of ai21labs/Jamba-v0.1, a hybrid Mamba/Transformer architecture.
- Attention!!! JAMBA Instruct - Mamba LLM's new Baby!!! by (May 3rd, 2024) ► Jamba-Instruct, a chatbot based on Jamba.
-
xLSTM
- xLSTM: Extended Long Short-Term Memory by (June 2nd, 2024) ► A presentation of "xLSTM: Extended Long Short-Term Memory", a study of LSTM variants scaled to billions of parameters.
- xLSTM Explained in Detail!!! by and (July 1st, 2024) ► , an author of the previous paper, is presenting xLSTM.
-
Ternary models
- Floating Points are no more, Changes everything for LLMs!!!↓ by (February 28th, 2024) ► "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" describes an LLM using only -1, 0, and 1 as weights.
- Scalable MatMul-free Language Modeling (Paper Explained) by (July 8th, 2024) ► presents and comments "Scalable MatMul-free Language Modeling", a paper proposing to replace matrix multiplication with ternary weights.
- Understanding 1.58-bit Large Language Models by (September 7th, 2024) ► A good presentation of the state of the art of ternary models (see the quantisation sketch after this list).
- Microsoft’s “1‑bit” AI model runs on a CPU only, while matching larger systems — Future AI might not need supercomputers thanks to models like BitNet b1.58 2B4T. by (April 18th, 2025) ► Some basic information about Microsoft’s BitNet b1.58 2B4T.
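To illustrate the entries above, a minimal sketch of rounding a weight matrix to the ternary values {-1, 0, +1} with an absolute-mean scale, in the spirit of the 1.58-bit papers; the exact quantisation recipe varies between papers, and this is a simplified per-tensor version.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Round a weight matrix to {-1, 0, +1} with a single scale factor."""
    scale = w.abs().mean().clamp(min=eps)      # absmean scale
    w_q = (w / scale).round().clamp(-1, 1)     # ternary weights
    return w_q, scale                          # approximate reconstruction: w ≈ scale * w_q

w = torch.randn(4, 4)
w_q, scale = ternary_quantize(w)
print(w_q)                                     # entries are only -1, 0, or 1
print((w - scale * w_q).abs().mean())          # quantisation error
```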
-
PyTorch
- Débuter avec PyTorch by (March 18th, 2021) ► A short introduction to PyTorch with a small example.
- Building a Single Layer Neural Network in PyTorch by (April 8th, 2023) ► A complete and very simple example (a comparable minimal sketch follows this list).
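In the same spirit as the two tutorials above, a minimal PyTorch sketch: a single linear layer fitted by gradient descent on toy data (the data and hyperparameters are illustrative).

```python
import torch
import torch.nn as nn

# Fit y = 3x + 1 (plus noise) with a single linear layer.
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 3 * x + 1 + 0.1 * torch.randn_like(x)

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                   # compute gradients
    optimizer.step()                  # update weight and bias

print(model.weight.item(), model.bias.item())   # close to 3 and 1
```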
-
Get SH*T Done with PyTorch
- Getting Started with PyTorch by (February 6th, 2020) ► A short introduction to PyTorch.
- Build Your First Neural Network with PyTorch by (February 21st, 2020) ► A simple neural network.
-
3blue1brown’s Deep learning
- But what is a neural network? | Deep learning chapter 1 by (October 5th, 2017) ► A basic introduction to the structure of neural networks and how such a structure could work.
- Gradient descent, how neural networks learn | DL2 by (October 16th, 2017) ► Training a neural network amounts to minimising a cost function, and gradient descent is used to perform this minimisation (see the gradient-descent sketch after this list).
- Backpropagation, intuitively | DL3 by (November 3rd, 2017) ► Getting a feeling of how backpropagation works.
- Backpropagation calculus | DL4 by (November 3rd, 2017) ► The calculus expressions of backpropagation.
- Transformers (how LLMs work) explained visually | DL5 by (April 1st, 2024) ► A very high-level description of an LLM architecture and the description of embedding and softmax.
- Attention in transformers, step-by-step | DL6 by (April 7th, 2024) ► A description of the transformer architecture.
- Visualizing transformers and attention | Talk for TNG Big Tech Day '24 by (July 5th, 2024) ► A 45-minute summary of the previous videos.
- How might LLMs store facts | DL7↑ by (August 31st, 2024) ► A description of how the multilayer perceptrons complete embedded vectors with related data during inference.
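To make the gradient-descent chapter concrete, a minimal sketch that minimises a toy two-parameter cost function by repeatedly stepping against the gradient computed by autograd; a real network does the same thing, only with millions of parameters and a loss computed over training data.

```python
import torch

# Minimise the toy cost C(w) = (w1 - 3)^2 + (w2 + 1)^2 by following minus the gradient.
w = torch.tensor([0.0, 0.0], requires_grad=True)
lr = 0.1

for step in range(100):
    cost = (w[0] - 3) ** 2 + (w[1] + 1) ** 2
    cost.backward()                   # autograd computes dC/dw
    with torch.no_grad():
        w -= lr * w.grad              # step downhill
        w.grad.zero_()

print(w.tolist())                     # converges toward [3.0, -1.0]
```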
-
Grokking
- Grokking: Generalization beyond Overfitting on small algorithmic datasets (Paper Explained) by (October 6th, 2021) ► A paper ("Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets") on the grokking phenomenon where generalisation seems to happen abruptly and long after fitting the training data.
- "Grokking" : les modèles d'IA sont-ils capables de piger ? — Ce phénomène étonnant, découvert récemment, pourrait changer notre compréhension de l'apprentissage et de la cognition dans les réseaux de neurones... by (September 11th, 2023) ► Another presentation of grokking.
- How Do Machines ‘Grok’ Data? — By apparently overtraining them, researchers have seen neural networks discover novel solutions to problems. by (April 12th, 2024) ► Some researchers understood some cases of grokking, but the phenomenon has only been studied on small neural networks doing modular arithmetic.