# Understanding AI (keep updating)


I’ve been wanting to work through the underpinnings of ChatGPT since before it became popular, so I’m taking this opportunity to write the basics down and keep them up to date over time. This article covers a range of fundamentals: supervised and unsupervised learning, shallow neural networks, loss functions, gradient descent, how to measure network performance, regularization, convolution, residual networks, transformers, reinforcement learning, and so on. It will keep updating.

**Three main subfields in Machine Learning**

Reinforcement Learning, Supervised Learning, and Unsupervised Learning.

**Universal Approximation Theorem**

A shallow neural network can approximate any continuous function (on a compact domain) to arbitrary accuracy, given enough hidden units.

**The relationship between ReLU and linear regions for shallow networks**

For a one-dimensional input, a shallow network with D ReLU hidden units produces at most D + 1 linear regions: number of ReLUs + 1 = number of regions.
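A minimal sketch of this count, using made-up weights for a 1-D shallow ReLU network. Each ReLU contributes one "joint" where it switches on or off, so three ReLUs carve the input line into four linear regions:

```python
import numpy as np

# A 1-D shallow network: f(x) = sum_i v[i] * relu(w[i]*x + b[i]) + c
w = np.array([1.0, -1.0, 2.0])
b = np.array([0.0, 1.0, -4.0])   # joints at x = -b/w = 0, 1, 2
v = np.array([1.0, 0.5, -2.0])
c = 0.3

def f(x):
    return np.maximum(np.outer(x, w) + b, 0.0) @ v + c

# Each ReLU contributes one joint, so 3 ReLUs give 3 + 1 = 4 regions.
joints = np.sort(-b / w)          # array([0., 1., 2.])
regions = len(joints) + 1         # 4

# Check that f is linear strictly inside one region, e.g. (0, 1):
xs = np.linspace(0.1, 0.9, 5)
slopes = np.diff(f(xs)) / np.diff(xs)
assert np.allclose(slopes, slopes[0])
```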

**Why not just use (very wide) shallow networks**

Compared with a wide shallow network that has the same number of hidden units, a deep network:

- can represent more linear subregions;
- seems to learn faster;
- seems to need fewer data points to achieve the same accuracy.

**The logarithm function**

Because the logarithm is monotonically increasing, the maximum of log f(x) is achieved at the same x (or input) value as the maximum of the original function f(x). This is why maximizing a log-likelihood yields the same parameters as maximizing the likelihood itself.
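A quick numerical check of this property, using an arbitrary positive function with its peak at x = 2:

```python
import numpy as np

x = np.linspace(0.1, 5.0, 1000)
f = np.exp(-(x - 2.0) ** 2) + 1.0   # positive function, peak at x = 2

# log is strictly increasing, so it preserves the location of the maximum
assert np.argmax(f) == np.argmax(np.log(f))
```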

**The softmax function**

The softmax function maps an arbitrary real-valued input vector to what looks like a probability distribution: every output is positive and the outputs sum to one.
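A standard implementation, with the usual max-subtraction trick for numerical stability (softmax is invariant to adding a constant to every input):

```python
import numpy as np

def softmax(z):
    # Subtracting the max avoids overflow in exp; the output is unchanged
    # because softmax is invariant to adding a constant to all inputs.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, -3.0, 100.0])   # arbitrary real inputs
p = softmax(z)
# p is positive everywhere and sums to 1: a probability distribution
```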

**Gradient descent**

It might get stuck in a local minimum, but in practice this does not seem to matter much in high-dimensional settings.
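The basic update rule, shown on a toy one-dimensional loss where the gradient can be written by hand:

```python
# Minimize L(w) = (w - 3)^2 by repeatedly stepping against the gradient.
# The gradient is dL/dw = 2 * (w - 3), so the minimum is at w = 3.
w = 0.0
lr = 0.1          # learning rate (step size)
for _ in range(100):
    grad = 2.0 * (w - 3.0)
    w -= lr * grad            # step downhill
# w is now very close to the minimum at 3
```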

**Nesterov momentum**

Nesterov momentum is a variant of the simple momentum idea where the momentum step is performed before the gradient computation.
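A sketch of the difference on the same toy quadratic as above: plain momentum evaluates the gradient at the current parameters, while Nesterov momentum first takes the momentum step and evaluates the gradient at the resulting "look-ahead" point:

```python
def grad(w):                  # gradient of L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0               # parameter and velocity
lr, beta = 0.1, 0.9           # learning rate and momentum coefficient
for _ in range(200):
    lookahead = w + beta * v              # momentum step first...
    v = beta * v - lr * grad(lookahead)   # ...then gradient at look-ahead
    w += v
# w converges to the minimum at 3
```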

**Backpropagation algorithm forward pass**

To compute the derivatives with respect to the weights, we need to calculate and store the activations at the hidden layers. This is known as the forward pass, since it involves running the network equations sequentially from input to output; at the end, we obtain the loss.
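A sketch on a tiny two-layer network with made-up shapes: the pre-activations and activations are kept around because the backward pass will need them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer network: 3 inputs -> 4 hidden ReLUs -> 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

x = np.array([0.5, -1.2, 0.3])
y = 1.0

z1 = W1 @ x + b1              # pre-activation, stored for backprop
h1 = np.maximum(z1, 0.0)      # hidden activation, stored for backprop
y_hat = (W2 @ h1 + b2)[0]     # network output
loss = 0.5 * (y_hat - y) ** 2 # the loss is known at the end of the pass
```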

**Backpropagation backward pass**

The backward pass first computes derivatives at the end of the network and then works backward to exploit the inherent redundancy of these computations.
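A sketch of the backward pass on the same kind of tiny network (hypothetical shapes and seed). Each intermediate derivative is computed once and reused by the next step of the chain rule, and the result is sanity-checked against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2 = rng.normal(size=(1, 4))
x, y = np.array([0.5, -1.2, 0.3]), 1.0

def forward(W1):
    z1 = W1 @ x + b1
    h1 = np.maximum(z1, 0.0)
    y_hat = (W2 @ h1)[0]
    return z1, h1, y_hat, 0.5 * (y_hat - y) ** 2

z1, h1, y_hat, loss = forward(W1)

# Backward pass: start from dL/dy_hat at the output and work backward,
# reusing each intermediate derivative instead of recomputing it.
d_yhat = y_hat - y                 # dL/dy_hat
d_h1 = W2[0] * d_yhat              # dL/dh1, reuses d_yhat
d_z1 = d_h1 * (z1 > 0)             # through the ReLU, reuses d_h1
dW1 = np.outer(d_z1, x)            # dL/dW1, reuses d_z1

# Sanity check one entry against a finite difference
eps = 1e-6
Wp = W1.copy(); Wp[0, 0] += eps
num = (forward(Wp)[3] - loss) / eps
assert abs(num - dW1[0, 0]) < 1e-4
```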

**exploding gradient and vanishing gradient**

If we initialize the weights with values that are too large, the magnitude of the gradients grows rapidly as we pass backward through the network; if we initialize them too small, it shrinks rapidly. These are known as the exploding gradient and vanishing gradient problems, respectively.
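A rough demonstration (hypothetical layer width and depth): backpropagating a unit gradient through a stack of random linear layers makes its norm blow up or collapse depending on the initialization scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_norm_through(depth, scale):
    # Backpropagate a unit gradient through `depth` random linear layers
    # whose weights are drawn with standard deviation `scale`.
    g = np.ones(16)
    for _ in range(depth):
        W = rng.normal(scale=scale, size=(16, 16))
        g = W.T @ g          # chain rule through one linear layer
    return np.linalg.norm(g)

big = gradient_norm_through(20, 1.0)     # grows rapidly: exploding
small = gradient_norm_through(20, 0.01)  # shrinks rapidly: vanishing
```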

**Noise, Bias, and Variance**

These are data-oriented concepts: by comparing performance on the training and test data, we can often attribute a model’s poor performance to one of these three sources.

Noise is an inherent limitation that cannot be improved.

Bias refers to a model that is not flexible enough to fit the underlying function. For example, attempting to express a sine function with only three ReLU units has limitations that could be addressed by increasing the capacity of the network.

Variance refers to the fact that the fitted model varies from run to run: it depends on the particular training sample drawn and on the stochastic learning algorithm. It can be reduced by increasing the amount of training data.

**Bias-Variance trade-off**

Increasing the complexity of the model reduces bias, but it does not always improve test performance: an overly complex model with insufficient data can easily overfit, because the reduction in bias is outweighed by the increase in variance.

There exists an intermediate state where the model’s capacity and the amount of data are balanced, and at this point, the bias is minimal, and the variance does not increase significantly.

**Double descent**

In datasets with noise, there is often a phenomenon known as double descent: as model complexity increases, the test error first decreases, then rises around the point where the model can exactly fit the training data, and then decreases a second time. The reason for this phenomenon is not yet fully understood.

**Three kinds of data**

Training data (which is used to learn the model parameters), validation data (which is used to choose the hyperparameters), and test data (which is used to estimate the final performance).
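A minimal sketch of producing these three splits from a dataset of 1000 examples, using a hypothetical 80/10/10 ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = rng.permutation(n)   # shuffle before splitting

# Hypothetical 80/10/10 split
train_idx = indices[:800]      # fit the model parameters
val_idx   = indices[800:900]   # choose the hyperparameters
test_idx  = indices[900:]      # estimate final performance, used once
```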

**Regularization**

Regularization refers to a family of methods used to reduce the gap between training and test performance. Strictly speaking, it means adding extra terms to the loss function that influence which parameters are selected.
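A sketch of the strict sense, using an L2 penalty on a least-squares loss as one common example (the data here is arbitrary):

```python
import numpy as np

def loss(w, X, y, lam):
    # Squared error plus an L2 penalty added to the loss;
    # lam controls how strongly large weights are discouraged.
    residual = X @ w - y
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(w ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

# The penalty term only adds to the loss for nonzero weights,
# steering the optimizer toward smaller-magnitude parameters.
assert loss(w, X, y, lam=0.1) >= loss(w, X, y, lam=0.0)
```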
