Understanding AI (continuously updated)

I’ve wanted to work through the underpinnings of ChatGPT since before it became popular, so I’m taking this opportunity to write these notes and keep them up to date over time. This article covers a range of very basic knowledge: supervised and unsupervised learning, shallow neural networks, loss functions, gradient descent, how to measure network performance, regularization, convolution, residual networks, transformers, reinforcement learning, and so on. It will keep being updated.

Three main subfields in Machine Learning

Reinforcement Learning, Supervised Learning, and Unsupervised Learning.

Universal Approximation Theorem

A shallow neural network with enough hidden units can approximate any continuous function on a compact domain to arbitrary precision.

The relationship between ReLU and linear regions for shallow networks

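For a shallow network with a one-dimensional input, each ReLU hidden unit contributes at most one joint where the slope changes, so the output has at most: number of ReLUs + 1 = regions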
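
The counting rule can be checked with a minimal sketch, assuming a one-dimensional input and hidden units of the form relu(w * x + b). The helper name `count_linear_regions` is hypothetical. It returns the maximum number of regions, since an output weight of zero would hide a joint.

```python
def count_linear_regions(weights, biases):
    """Count the (maximum) linear regions of a 1D shallow ReLU network.

    Each hidden unit computes relu(w * x + b) and changes slope at x = -b / w.
    Units with w == 0 never switch on or off, so they add no joint.
    """
    joints = {-b / w for w, b in zip(weights, biases) if w != 0}
    return len(joints) + 1


# Three hidden units with distinct joints at x = 1, 0 and -2.
regions = count_linear_regions([1.0, 2.0, -1.0], [-1.0, 0.0, -2.0])
print(regions)  # 4
```

Two units that share the same joint only create one boundary, which is why the joints are collected in a set.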

Why not just use (very wide) shallow networks

A deep network with the same number of hidden units as a wide shallow network can represent more subregions.

Deep networks with the same number of hidden units as a wide shallow network seem to learn faster.

Deep networks with the same number of hidden units as a wide shallow network seem to need fewer data points to achieve the same accuracy.

The logarithm function

The logarithm is monotonically increasing, so the maximum of the log of a positive function is still achieved at the same x (or input) value as for the original function.
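
A small numerical check of this property, as a sketch. The function `likelihood` is a hypothetical positive function with a single peak, chosen only for illustration.

```python
import math


def likelihood(x):
    # Hypothetical positive function with a single peak at x = 2.
    return math.exp(-(x - 2.0) ** 2)


xs = [i / 10 for i in range(0, 41)]  # grid from 0.0 to 4.0
best_raw = max(xs, key=likelihood)
best_log = max(xs, key=lambda x: math.log(likelihood(x)))
print(best_raw, best_log)  # both maximizers coincide
```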

The softmax function

The softmax function maps an arbitrary vector of real-valued inputs to outputs that are positive and sum to one, so the result can be interpreted as a probability distribution.
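
A minimal sketch of softmax in plain Python. Subtracting the maximum before exponentiating is a common numerical-stability trick and is an addition here, not something the text above requires.

```python
import math


def softmax(values):
    # Subtract the maximum first for numerical stability; the result is unchanged.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, -1.0, 0.5])
print(probs, sum(probs))
```

The largest input always receives the largest probability, and very large inputs do not overflow because of the subtraction.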

Gradient descent

Gradient descent might get stuck in a local minimum, but in practice that does not seem to matter much in high-dimensional settings.
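
For reference, the basic update rule can be sketched as follows. The function name `gradient_descent`, the learning rate, and the quadratic test function are assumptions chosen for illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)  # close to 3
```

This toy function is convex, so there is no local minimum to get stuck in; it only shows the mechanics of the update.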

Nesterov momentum

Nesterov momentum is a variant of the simple momentum idea where the momentum step is performed before the gradient computation.
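
The ordering of the two steps can be sketched like this, in the common "look-ahead" formulation. The names `nesterov_descent`, `beta`, and `lookahead` are assumptions, and other equivalent formulations exist.

```python
def nesterov_descent(grad, x0, learning_rate=0.1, beta=0.9, steps=200):
    """Gradient descent with Nesterov momentum on a scalar parameter."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        lookahead = x + beta * velocity  # momentum step is taken first
        velocity = beta * velocity - learning_rate * grad(lookahead)
        x = x + velocity  # gradient was measured at the look-ahead point
    return x


# Same toy problem as plain gradient descent: f(x) = (x - 3)^2.
x_nesterov = nesterov_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_nesterov)  # close to 3
```

Plain momentum would evaluate `grad(x)` instead of `grad(lookahead)`; that single line is the whole difference.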

Backpropagation algorithm forward pass

To compute the derivatives of the loss with respect to the weights, we need to calculate and store the activations at the hidden layers. This is known as the forward pass since it involves running the network equations sequentially, from input to output. At the end of the forward pass we know the value of the loss.

Backpropagation backward pass

The backward pass first computes derivatives at the end of the network and then works backward to exploit the inherent redundancy of these computations.
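
The two passes can be sketched for a tiny scalar network with one ReLU hidden unit and a squared-error loss. The network shape, the parameter names, and the example values are assumptions made for illustration.

```python
def forward(params, x, target):
    """Forward pass: compute the loss and store the intermediate values."""
    w1, b1, w2, b2 = params
    pre = w1 * x + b1          # pre-activation of the hidden unit
    hidden = max(0.0, pre)     # ReLU activation
    y = w2 * hidden + b2       # network output
    loss = (y - target) ** 2
    return loss, (x, pre, hidden, y, target)


def backward(params, cache):
    """Backward pass: start at the loss and reuse each derivative going back."""
    w1, b1, w2, b2 = params
    x, pre, hidden, y, target = cache
    d_y = 2.0 * (y - target)                       # dLoss/dy
    d_w2 = d_y * hidden
    d_b2 = d_y
    d_hidden = d_y * w2                            # reused for both w1 and b1
    d_pre = d_hidden * (1.0 if pre > 0 else 0.0)   # derivative of ReLU
    d_w1 = d_pre * x
    d_b1 = d_pre
    return [d_w1, d_b1, d_w2, d_b2]


params = [0.5, 0.1, -1.5, 0.2]
loss, cache = forward(params, x=2.0, target=1.0)
grads = backward(params, cache)
print(loss, grads)
```

Note how `d_y` and `d_hidden` are computed once and shared by several parameter derivatives; that sharing is the redundancy the backward pass exploits.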

Exploding gradient and vanishing gradient

If we initialize the weights with values that are too large, the magnitude of the gradients increases rapidly as we pass backward through the network. If we initialize them with values that are too small, the magnitude decreases rapidly instead. These are known as the exploding gradient and vanishing gradient problems, respectively.
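
The effect can be reproduced with a rough simulation. This is a sketch under simplifying assumptions: the layers are purely linear (activation functions are ignored), the weights are Gaussian, and the helper name `gradient_norm_after` is hypothetical.

```python
import math
import random


def gradient_norm_after(depth, width, weight_std, seed=0):
    """Push a gradient vector back through `depth` random linear layers."""
    rng = random.Random(seed)
    grad = [1.0] * width
    for _ in range(depth):
        weights = [[rng.gauss(0.0, weight_std) for _ in range(width)]
                   for _ in range(width)]
        grad = [sum(w * g for w, g in zip(row, grad)) for row in weights]
    return math.sqrt(sum(g * g for g in grad))


width, depth = 50, 20
base_std = 1.0 / math.sqrt(width)  # keeps the magnitude roughly stable
stable = gradient_norm_after(depth, width, base_std)
exploded = gradient_norm_after(depth, width, 2.0 * base_std)
vanished = gradient_norm_after(depth, width, 0.5 * base_std)
print(stable, exploded, vanished)
```

Doubling or halving the standard deviation changes the gradient norm by a factor of about two per layer, which compounds to roughly a millionfold difference after 20 layers.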


Noise, bias, and variance are three sources of error. By comparing performance on the training data and the test data, we can often attribute a model’s poor performance to one of these three causes.

Noise is an inherent limitation of the data and cannot be reduced by improving the model.

Bias refers to a model that is not flexible enough to fit the true function. For example, a network with only three ReLU units cannot represent a sine function well; this limitation can be addressed by increasing the complexity of the network.

Variance refers to the fact that the fitted model depends on the particular training sample and on the stochastic learning algorithm, so the result differs from run to run instead of converging to the underlying function. It can be reduced by increasing the amount of training data.

Bias-Variance trade-off

Increasing the complexity of the model does not always reduce the test error. More capacity reduces bias, but for a fixed amount of data it increases variance, so an overly complex model with insufficient data can easily overfit.

There is an intermediate point where the model’s capacity and the amount of data are balanced: the bias is already small and the variance has not yet increased significantly.

Double descent

In datasets with noise, there is often a phenomenon known as double descent: as the model complexity increases, the test error decreases, rises again, and then decreases a second time. The reason for this phenomenon is not yet fully understood.

Three kinds of data

Training data (which is used to learn the model parameters), validation data (which is used to choose the hyperparameters), and test data (which is used to estimate the final performance).
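
A minimal sketch of such a split. The function name `split_dataset` and the 80/10/10 proportions are assumptions, not a recommendation from the text.

```python
import random


def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then cut the data into train, validation and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

The three parts never overlap, so the test data is not seen while fitting parameters or choosing hyperparameters.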


Regularization refers to a family of methods used to reduce the performance gap between the training and test sets. Strictly speaking, it means adding explicit terms to the loss function that favor certain choices of the parameters.


Weight initialization matters. If all weights are initialized to 0.0, then most of the network will be redundant. If the initial weights are too big, then training will break, resulting in NaN values. If the initial weights are too small, then training will take forever.


During long training runs, the validation error (the fraction of misclassified examples) can stay more or less constant while the validation loss starts to increase again. This is a sign of overfitting.

Implicit regularization

Neither gradient descent nor stochastic gradient descent follows the gradient exactly: the parameters are updated in discrete steps whose size is set by hyperparameters such as the learning rate. This stepping process is itself a form of regularisation, which discourages solutions where the loss changes sharply and so tends to give smoother models. In addition, because stochastic gradient descent uses randomly sampled batches, the parameters are updated differently each time, which can also be considered a kind of regularisation that helps the model generalise.

Early stopping and L2

Both early stopping and L2 regularisation attempt to avoid overfitting. Early stopping monitors performance on the validation data and ends training early when it stops improving. L2 regularisation adds the sum of squares of the model weights to the loss function as a regularisation term.
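
Both ideas can be sketched in a few lines. The names `l2_regularized_loss`, `early_stopping_epoch`, `lam`, and `patience` are assumptions, and the stopping rule shown (stop after `patience` epochs without improvement) is one common choice among several.

```python
def l2_regularized_loss(data_loss, weights, lam=0.01):
    """Add lam times the sum of squared weights to the data loss."""
    return data_loss + lam * sum(w * w for w in weights)


def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch whose weights would be kept under early stopping."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


val_losses = [1.0, 0.7, 0.5, 0.55, 0.6, 0.4]
stop_at = early_stopping_epoch(val_losses)
print(stop_at)  # 2
print(l2_regularized_loss(1.0, [1.0, -2.0], lam=0.1))
```

In this made-up loss curve training stops before the later dip to 0.4 is ever seen, which illustrates the risk of choosing too small a patience.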


Ensembling is a way to get a single result from a set of models by combining their outputs according to some method, for example by calculating the mean.
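
A minimal sketch of combining models by the mean. The three toy "models" are hypothetical functions that differ only by a small offset.

```python
def ensemble_mean(models, x):
    """Average the predictions of several models for the same input."""
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)


# Hypothetical models whose individual errors cancel out on average.
models = [lambda x: 2.0 * x, lambda x: 2.0 * x + 0.3, lambda x: 2.0 * x - 0.3]
combined = ensemble_mean(models, 1.0)
print(combined)
```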


Dropout randomly sets the outputs of a certain fraction of hidden units to zero at each training step. Because any unit may be switched off, the network avoids overfitting by not relying too heavily on a single unit.
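
A sketch of the training-time operation. Rescaling the surviving activations by 1 / keep_prob ("inverted dropout") is an assumption added here so that the expected activation stays the same; the text above does not specify it.

```python
import random


def dropout(activations, drop_prob=0.5, rng=random):
    """Zero each activation with probability drop_prob, rescale the rest."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
dropped = dropout([1.0] * 1000, drop_prob=0.5, rng=rng)
print(dropped.count(0.0))  # roughly half of the units are switched off
```

At test time dropout is normally turned off, which corresponds to calling the function with `drop_prob=0.0`.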


Noise can be added in three places: the input data, the weights, and the labels. Each mini-batch gets a different random perturbation, so the model does not lean too heavily on any one data point, which helps avoid overfitting.

Transfer learning

When the training data is insufficient, a model already trained on another task can be borrowed and fine-tuned to give better results. The two tasks should be somewhat related. The output layer is removed from the trained model and a new output layer (or several new layers) is added; then either only the newly added layers are trained, or the whole model is fine-tuned. Transfer learning works because most of the parameters have already been initialised sensibly by the first model, which makes it easier to get a good result.

Multi-task learning

The model is trained to produce outputs for several tasks at the same time, for example predicting both segmentation and depth. Because each individual task brings in more relevant data, performance can improve on every task.

Self-supervised learning

There are two families of approaches to self-supervised learning: generative and contrastive. Generative example: take a large collection of articles, randomly remove some words, and train the model to predict what is missing. GPT-style models such as ChatGPT are trained with a closely related objective, predicting the next word, and it has a claimed 170T words. Contrastive example: take one picture, create a second one by deforming it, and train the model to recognise that the two come from the same original rather than from unrelated pictures.

Data augmentation

Data augmentation is a way of adding data by artificial means when the data set is insufficient. For example, the same image can be rotated, flipped, or blurred to create new examples whose labels are still the same.
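
A minimal sketch with a horizontal flip, assuming an image stored as a list of rows. The helper names `horizontal_flip` and `augment` are hypothetical.

```python
def horizontal_flip(image):
    """Mirror an image (a list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def augment(image, label):
    """Return the original and a flipped copy, both with the same label."""
    return [(image, label), (horizontal_flip(image), label)]


tiny_image = [[1, 2, 3],
              [4, 5, 6]]
examples = augment(tiny_image, "cat")
print(examples[1][0])  # [[3, 2, 1], [6, 5, 4]]
```

This only works for transformations that really leave the label unchanged; flipping a handwritten digit, for example, may not.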

Why 224*224

When images are classified with a CNN, they are often brought to a size of 224*224. The reason is that smaller crops are taken from 256*256 images. Because 8*32 = 256, 7*32 = 224 is a good choice for a uniform crop size: it leaves a margin of 32 pixels in both height and width. If the top left corner of the crop is placed anywhere inside the 32*32 area in the top left corner of the image, about 1000 (32*32) different 224*224 images can be produced from one original.
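
The cropping scheme can be sketched as follows, assuming an image stored as a list of rows. The helper name `random_crop` is hypothetical.

```python
import random


def random_crop(image, crop_size=224, rng=random):
    """Cut a crop_size x crop_size window at a random top-left position."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - crop_size + 1)
    left = rng.randrange(width - crop_size + 1)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]


# A 256 x 256 placeholder image whose pixels record their own coordinates.
image = [[(r, c) for c in range(256)] for r in range(256)]
crop = random_crop(image, 224, random.Random(0))
print(len(crop), len(crop[0]), crop[0][0])
```

Counting offsets 0 to 32 inclusive gives 33 * 33 = 1089 possible positions, the same order of magnitude as the estimate above.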

Invariance and equivariance

f[t[x]] = f[x]: this is invariance, the transformed image still gives the same result as the original image. f[t[x]] = t[f[x]]: this is equivariance, transforming the image and then computing gives the same result as computing and then transforming. Convolutional networks are designed around these two properties. The convolution operation extracts features from an image and is equivariant to translation: if the image shifts, the feature map shifts with it. The pooling operation makes the features more invariant to small translations, which helps the network classify an object regardless of its exact position.
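
Both properties can be checked on a one-dimensional signal. This sketch assumes circular (wrap-around) boundaries so that the equalities hold exactly, and uses a global maximum as a stand-in for pooling; the helper names are hypothetical.

```python
def shift(signal, k):
    """Circularly shift a list k positions to the right (the transform t)."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def circular_conv(signal, kernel):
    """Slide the kernel over the signal, wrapping around at the end."""
    n = len(signal)
    return [sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
            for i in range(n)]


x = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0]
kernel = [1.0, -1.0]

# Equivariance: convolving the shifted input equals shifting the output.
equivariant = circular_conv(shift(x, 2), kernel) == shift(circular_conv(x, kernel), 2)

# Invariance: a global max over the feature map ignores the shift entirely.
invariant = max(circular_conv(shift(x, 2), kernel)) == max(circular_conv(x, kernel))
print(equivariant, invariant)
```

With zero padding instead of wrap-around, the equalities hold only away from the borders.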


There are three parameters in the convolution process: kernel size, stride, and dilation. The kernel size determines how many elements are combined in one operation. The stride determines how many elements the kernel moves between one operation and the next. The dilation determines the spacing between the elements used within a single operation. If there are not enough elements at the boundary, zero padding is used to fill in the missing values. There is also the option of dropping the boundary positions, but this reduces the size of the output.
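
The three parameters and the two boundary options can be sketched in one dimension. The function name `conv1d` and its argument names are assumptions; as is common in deep learning, the kernel is not flipped.

```python
def conv1d(signal, kernel, stride=1, dilation=1, zero_pad=True):
    """1D convolution with kernel size, stride, dilation and optional padding."""
    span = dilation * (len(kernel) - 1)  # distance covered by one placement
    if zero_pad:
        left = span // 2
        padded = [0.0] * left + list(signal) + [0.0] * (span - left)
    else:
        padded = list(signal)  # drop positions where the kernel does not fit
    outputs = []
    for start in range(0, len(padded) - span, stride):
        outputs.append(sum(w * padded[start + j * dilation]
                           for j, w in enumerate(kernel)))
    return outputs


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [1.0, 1.0, 1.0]
print(conv1d(signal, kernel))                  # same length as the input
print(conv1d(signal, kernel, stride=2))        # every second position
print(conv1d(signal, kernel, dilation=2))      # gaps inside the kernel
print(conv1d(signal, kernel, zero_pad=False))  # shorter output
```

With zero padding and stride 1 the output keeps the input length; without padding it shrinks by the span of the kernel.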


Applying several convolution kernels in parallel gives a set of results, and each such result is called a channel. The deeper a layer sits in the convolutional network, the larger the receptive field of its individual elements, meaning each element can represent a larger region of the input.
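
The growth of the receptive field can be computed with a short sketch. The helper name `receptive_field` and the (kernel size, stride, dilation) tuple layout are assumptions; the calculation is for one spatial dimension.

```python
def receptive_field(layers):
    """Receptive field of one output element after a stack of conv layers.

    layers: list of (kernel_size, stride, dilation) tuples, input to output.
    """
    field, jump = 1, 1  # jump = input distance between neighboring outputs
    for kernel_size, stride, dilation in layers:
        field += (kernel_size - 1) * dilation * jump
        jump *= stride
    return field


print(receptive_field([(3, 1, 1)]))                        # 3
print(receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 1)]))  # 7
print(receptive_field([(3, 2, 1), (3, 1, 1), (3, 1, 1)]))  # 11
```

Each extra layer widens the field, and an early stride makes every later layer widen it faster.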

