I’ve been wanting to work through the underpinnings of ChatGPT since before it became popular, so I’m taking this opportunity to write them down and keep the notes up to date over time. This article covers a range of very basic knowledge: supervised and unsupervised learning, shallow neural networks, loss functions, gradient descent, how to measure networks, regularization, convolution, residual networks, transformers, reinforcement learning, and so on, so it will keep updating.
Three main subfields in Machine Learning
Reinforcement Learning, Supervised Learning, and Unsupervised Learning.
Universal Approximation Theorem
A shallow neural network can approximate any continuous function, if allowed enough hidden units.
The relationship between ReLU and linear regions for shallow networks
For a shallow network with a one-dimensional input: number of ReLUs + 1 = maximum number of linear regions
Why not just use (very wide) shallow networks
A deep network with the same number of hidden units as a wide shallow network can represent more subregions.
Deep networks with the same number of hidden units as a wide shallow network seem to learn faster.
Deep networks with the same number of hidden units as a wide shallow network seem to need fewer data points to achieve the same accuracy.
The logarithm function
Because the logarithm is monotonically increasing, the maximum of the log of a function is still achieved for the same x (or input) value as for the original function.
The softmax function
The softmax function generates what looks like a probability distribution for arbitrary input.
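As a minimal sketch of the softmax described above (the function name and the example scores are illustrative, not from the original notes), the scores are exponentiated and then divided by their sum. Subtracting the maximum first is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Map arbitrary real scores to positive values that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)      # avoids overflow; the output is unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical scores, including a negative one.
probs = softmax([2.0, -1.0, 0.5])
```

The largest score still receives the largest probability, which ties in with the note on the logarithm above: monotonic transformations keep the position of the maximum.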
Gradient descent might get stuck in a local minimum, but that does not seem to matter in high-dimensional settings.
Nesterov momentum is a variant of the simple momentum idea where the momentum step is performed before the gradient computation.
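A rough sketch of one common way to write the Nesterov update is shown here. The names `phi`, `m`, `lr` and `beta`, the (1 - beta) weighting of the gradient, and the quadratic test loss are all assumptions made for illustration. The key point is that the gradient is evaluated at the look-ahead position reached by the momentum step, not at the current parameters.

```python
import numpy as np

def nesterov_step(phi, m, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov momentum update (illustrative formulation)."""
    lookahead = phi - lr * beta * m                        # momentum step first
    m_new = beta * m + (1.0 - beta) * grad_fn(lookahead)   # gradient at the look-ahead point
    phi_new = phi - lr * m_new
    return phi_new, m_new

# Hypothetical loss 0.5 * phi**2, whose gradient is simply phi.
grad_fn = lambda phi: phi
phi, m = np.array([5.0]), np.array([0.0])
for _ in range(500):
    phi, m = nesterov_step(phi, m, grad_fn)
```

After the loop, `phi` has moved very close to the minimum at zero.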
Backpropagation algorithm forward pass
To compute the derivatives of the loss with respect to the weights, we need to calculate and store the activations at the hidden layers. This is known as the forward pass since it involves running the network equations sequentially from input to output. At the end of the forward pass, the loss is known.
Backpropagation backward pass
The backward pass first computes derivatives at the end of the network and then works backward to exploit the inherent redundancy of these computations.
exploding gradient and vanishing gradient
If we initialize the weights with too large a variance, the magnitude of the gradients increases rapidly as we pass back through the network. If we initialize with too small a variance, the magnitude decreases. These are known as the exploding gradient and vanishing gradient problems, respectively.
Looking at the data, and comparing performance on the training and test data, we can often attribute a model’s poor performance to three sources: noise, bias, and variance.
Noise is an inherent limitation that cannot be improved.
Bias refers to a model that is not flexible enough to fit the function perfectly. For example, a sine function cannot be represented well with only three ReLU units; this limitation can be addressed by increasing the complexity of the network.
Variance refers to the fact that, with limited training data, the fitted model does not converge to the underlying function, and the result varies with the particular training sample and with the stochastic learning algorithm. This can be addressed by increasing the amount of training data.
Increasing the complexity of the model does not always help, however: it reduces bias, but an overly complex model with insufficient data has high variance and can easily overfit.
There exists an intermediate state where the model’s capacity and the amount of data are balanced, and at this point, the bias is minimal, and the variance does not increase significantly.
In datasets with noise, there is often a phenomenon known as double descent, where the test error decreases twice as we increase the model complexity. However, the reason for this phenomenon is not yet clear.
Three kinds of data
Training data (which is used to learn the model parameters), validation data (which is used to choose the hyperparameters), and test data (which is used to estimate the final performance).
Regularization refers to a series of methods used to reduce the performance gap between the training and test sets. Strictly speaking, it involves adding terms to the loss function and then selecting the parameters.
If all weights are initialized to 0.0, then most of the network will be redundant. If the initial weights are too big, then training will break, resulting in NaN values. If the initial weights are too small, then training will take forever.
The validation error stays more or less constant, while the validation loss might increase again. It is over-fitting.
In both gradient descent and stochastic gradient descent algorithms, the model parameters are updated and adjusted during the update process by hyperparameters such as the learning rate. This adjustment process is itself a form of regularisation, which prevents the model from oscillating when overfitting, thus making it smoother. In addition, as the stochastic gradient descent algorithm uses random sampling, the parameters are updated differently each time, which can also be considered as a kind of regularisation, thus making the model more generalisable.
Early stopping and L2
Both early stopping and L2 regularisation attempt to avoid overfitting. Early stopping monitors performance on the validation data and ends training early when it stops improving; L2 regularisation adds the sum of squares of the model weights to the loss function as a regularisation term.
Ensembling is a way to get one result by running a set of models together and combining their outputs, for example by taking the mean.
Dropout randomly sets a subset of hidden units to zero, which disables those units and prevents the network from relying too heavily on any single unit, thereby reducing overfitting. It is a regularization technique. A different random subset is chosen for each mini-batch. It seems to work like an implicit ensemble of many neural networks.
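The random zeroing described above (commonly called dropout) can be sketched in a few lines. The function name, the drop probability and the rescaling by 1 / (1 - drop_prob), often called inverted dropout, are assumptions added for illustration; the notes do not say how the surviving units are scaled.

```python
import numpy as np

def dropout(activations, drop_prob, rng):
    """Zero each hidden unit with probability drop_prob (training time only)."""
    keep = rng.random(activations.shape) >= drop_prob
    # Assumption: rescale the survivors so the expected activation is unchanged.
    return activations * keep / (1.0 - drop_prob)

rng = np.random.default_rng(0)
hidden = np.ones(1000)                  # hypothetical hidden-layer activations
dropped = dropout(hidden, 0.5, rng)     # a fresh mask would be drawn per mini-batch
```

At test time the function would simply not be applied, which is why the rescaling during training is convenient.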
Noise can be added in three places: the input data, the weights, and the labels. Each mini-batch uses a different randomization, so the model does not lean towards any one point, which helps avoid overfitting.
Transfer learning: when the training data is insufficient, a model trained on another task is borrowed and fine-tuned to give better results. The two tasks should be somewhat related. The output layer is removed from the trained model and a new output layer (or several layers) is added; then either only the newly added layers are trained, or the whole model is fine-tuned. Transfer learning works because most of the parameters have already been sensibly initialised by the first model, so it is easier to get a good result.
Multi-task learning: the model is trained to produce several outputs at the same time, for example predicting both segmentation and depth. Because each individual task brings in more relevant data, there is a performance improvement for each task.
Self-supervised learning is useful if we have a lot of unlabelled data. There are two families of approaches: generative and contrastive. Generative example: take a bunch of articles, randomly remove some words, and train the model to predict what is missing. ChatGPT is trained in a related way, predicting the next token rather than a removed word, and its training text has been claimed to contain 170T words. Contrastive example: take one picture, produce a second one by deforming it, and train the model to recognise that the two come from the same original, as opposed to other images.
Data augmentation is a way of adding data by artificial means when the data set is insufficient. For example, if the same image is rotated, flipped, blurred, etc., the labels are still the same.
Images are often cropped before being classified with a CNN. The common 224*224 input size comes from taking crops out of slightly larger 256*256 images. Because 8*32=256, 7*32=224 is a good choice for a uniform crop size: both the height and the width leave a margin of 32. If the top-left corner of the crop is chosen anywhere within the 32*32 area at the top left of the image, this produces about 1000 (32*32) different 224*224 images.
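The cropping scheme above can be sketched as follows. The function name and the blank test image are hypothetical. Note that if both end points of the margin are counted (offsets 0 to 32 inclusive), there are 33*33 = 1089 possible positions, which is consistent with the "about 1000" figure.

```python
import numpy as np

def random_crop(image, crop=224, rng=None):
    """Cut a crop x crop patch at a random top-left position."""
    if rng is None:
        rng = np.random.default_rng()
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop + 1)    # upper bound is exclusive
    left = rng.integers(0, width - crop + 1)
    return image[top:top + crop, left:left + crop]

image = np.zeros((256, 256, 3))                 # hypothetical 256*256 RGB image
patch = random_crop(image, rng=np.random.default_rng(0))
n_positions = (256 - 224 + 1) ** 2              # number of distinct crop positions
```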
Invariance and equivariance
f[t[x]] = f[x] is invariance: the transformed image still gives the same result as the original image. f[t[x]] = t[f[x]] is equivariance: transforming the image and then computing gives the same result as computing and then transforming. Convolutional networks are designed around these two properties. The convolution operation extracts features from an image and is translation equivariant. The pooling operation in a convolutional neural network adds a degree of translation invariance to the features, which helps the network classify images regardless of where an object appears.
There are three parameters in the convolution process: kernel size, stride, and dilation. The kernel size determines how many elements are combined at once. Stride determines how far the kernel moves between successive applications. Dilation determines the spacing between the elements used within a single application. If there are not enough elements at the boundary, zero padding fills in the missing positions with zeros. There is also the option of dropping the boundary positions, but this reduces the size of the output.
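To make the three parameters concrete, here is a small 1D sketch. The function `conv1d` and its argument names are illustrative, and it assumes an odd kernel size so that the zero padding is symmetric. With `zero_pad=False` the boundary positions are dropped and the output shrinks.

```python
import numpy as np

def conv1d(x, kernel, stride=1, dilation=1, zero_pad=True):
    """1D convolution with kernel size, stride, dilation and optional zero padding."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    span = dilation * (len(kernel) - 1) + 1       # input extent covered by one application
    if zero_pad:
        x = np.pad(x, (span - 1) // 2)            # zeros on both sides
    n_out = (len(x) - span) // stride + 1
    out = np.empty(n_out)
    for i in range(n_out):
        start = i * stride                        # stride: shift between applications
        window = x[start:start + span:dilation]   # dilation: gap inside one application
        out[i] = np.dot(window, kernel)
    return out

x = [1, 2, 3, 4, 5, 6]
k = [1, 1, 1]
same = conv1d(x, k)                      # same length as the input
valid = conv1d(x, k, zero_pad=False)     # boundary dropped, shorter output
strided = conv1d(x, k, stride=2)         # roughly half the outputs
dilated = conv1d(x, k, dilation=2)       # each output sees every second element
```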
Convolution with multiple kernels in parallel gives a set of results, and such results are called channels. The further back in the convolutional network the layer is, the larger the receptive field of its individual elements (which can represent a larger range of content.)
In the 2D case, the kernel becomes k*k. If the zero-padding is not done, it is a VALID convolution and some edges are lost. If a 2D RGB image is targeted, a convolution kernel of k*k*3 is required, as there are 3 layers of RGB channels.
Downsampling and pooling
Downsampling is used to change the data of a 2D image from large to small. There are three ways to do this: the first is to take a subset, directly taking the values at a fixed position in each region to form a new array. The second is to find the maximum value in each region, also called max pooling. The third is to find the average value in each region, also known as average pooling.
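The three downsampling options above can be sketched on a small array. The function name `downsample`, the `mode` strings and the 2*2 region size are assumptions chosen for illustration.

```python
import numpy as np

def downsample(image, mode="max", size=2):
    """Shrink a 2D array by working on non-overlapping size x size regions."""
    h, w = image.shape
    trimmed = image[:h - h % size, :w - w % size]            # drop incomplete regions
    blocks = trimmed.reshape(h // size, size, w // size, size)
    if mode == "subsample":
        return blocks[:, 0, :, 0]          # fixed position (top-left) of each region
    if mode == "max":
        return blocks.max(axis=(1, 3))     # max pooling
    if mode == "mean":
        return blocks.mean(axis=(1, 3))    # average pooling
    raise ValueError(f"unknown mode: {mode}")

image = np.arange(16, dtype=float).reshape(4, 4)
sub = downsample(image, "subsample")
mx = downsample(image, "max")
avg = downsample(image, "mean")
```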
Upsampling is the opposite operation to downsampling. The simplest way is to expand directly by copying elements in the relevant area. The more common approach is bilinear interpolation of the missing elements.
Transposed convolution
Convolution can be used for upsampling as well as downsampling, enlarging or shrinking the picture. A strided convolution downsamples; a transposed convolution (also called deconvolution) with a stride greater than one upsamples. Whether the convolutional form or the previous approaches are better is largely a matter of preference.
The 1*1 convolution kernel looks useless and can actually be used to change the number of channels without pooling for multi-channel data. Combined with the bias and activation functions, it is equivalent to running the same fully connected network on the channels at each location.
AlexNet in 2012 was the first convolutional neural network to perform well on the ImageNet dataset. It has 8 hidden layers: the first 5 are convolutional and the last 3 are fully connected. It achieves 16.4% top-5 error and 38.1% top-1 error. VGG takes AlexNet and increases the complexity of the convolutional part but leaves the underlying structure unchanged; it achieves 6.8% top-5 error and 23.7% top-1 error. In comparison, AlexNet has 60M parameters, while VGG has 144M parameters.
Classical model for object detection. YOLO’s paper was published in 2016. The first half of its network is similar to VGG, with a final convolutional layer of 7x7x1024. In each of the 7x7 positions, a box drawing operation is performed, with each box having a different size and a confidence value. After the network is running, the heuristic algorithm is used to select the boxes, remove the boxes with low confidence values, and finally the final result will be framed.
The semantic segmentation result is obtained by expanding the previous box selection to each image. The most recent network for semantic segmentation is based on VGG in the first half and then docked with a mirror equivalent of VGG in the back. Also called encoder-decoder structure networks.
From AlexNet’s 8 layers to VGG’s 18 layers, performance gets better and better, but adding even more layers does not help. The reason is not yet fully clear; one explanation is that, because the layers are processed sequentially, the gradient that reaches the earliest layers changes unpredictably and carries little useful information. Residual networks therefore add shortcut connections between layers so that information can pass directly. The residual network does not have the problem of vanishing gradients, but there is a possibility of exploding gradients. The gradient explosion can be controlled by batch normalisation as mentioned below.
The maximum pooling mentioned earlier, when size=(2,2) and strides=(2,2), halves its width and height and keeps the number of channels the same.
It can compute a weighted sum of the input channels.
It assigns a label to every single pixel of an image. It needs a post-processing phase that merges pixels with the same label into regions. It generally uses some form of encoder-decoder network.
Contrary to the improvement achieved by increasing size from AlexNet to VGG, using much larger versions of VGG does not seem to produce even better accuracy on ImageNet.
Grad-CAM can be used to generate heatmaps that help explain the actual segmentation of a classification task for each layer of the network. It is a great tool for interpreting deep neural networks in image-related projects. A similar tool is imageLIME.
The residual block is the basic building block of a ResNet, used to address the problems of vanishing and exploding gradients in deep neural networks. The residual block contains two convolutional layers and a shortcut (skip) connection. The shortcut connection adds the input feature map directly to the output of the convolutional layers, forming the residual connection. This design allows for a deeper network while avoiding vanishing gradients. In general, a residual block applies some transformation to the input and then adds the original input back.
Order of operations in the additive function of a residual network
The additive function refers to the operation whose output the residual network adds back through the shortcut connection. If the ReLU comes last in the block (after the linear transformation), the block can only add non-negative quantities, as ReLU sets negative values to 0. Using the opposite order, with the ReLU first and the linear transformation last, both positive and negative quantities can be added. However, we must then add a linear transformation at the beginning of the network in case the input is all negative, since the first ReLU would otherwise zero it out. In practice, a residual block usually contains several layers.
Variance of a residual network
In a residual network, the output of each block is added back to its input, so the variance roughly doubles at each block and therefore grows exponentially with depth. One way to counter this is to rescale the signal by 1/√2 between residual blocks. A second method is to use batch normalisation (BN) as the first step in the residual block and to initialise the associated offset δ to zero and the scaling γ to 1.
By normalising the input data for each layer in a neural network, the data distribution in the middle layers of the network is made more stable, thus speeding up the training process. BN is commonly applied to deep convolutional neural networks and can significantly improve the accuracy and convergence speed of the model. The main idea is to calculate the mean and standard deviation on each small batch of data, and then normalise the input data to have a mean of 0 and a variance of 1.
The biggest benefit is the large learning rate that can be used.
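The normalisation step described above can be sketched as follows. The function name and the random test batch are hypothetical; γ and δ are the scaling and offset mentioned earlier, and the small `eps` is a common safeguard against division by zero rather than something stated in these notes.

```python
import numpy as np

def batch_norm(h, gamma=1.0, delta=0.0, eps=1e-5):
    """Normalise each feature over the mini-batch, then rescale and shift."""
    mean = h.mean(axis=0)                         # per-feature mean over the batch
    var = h.var(axis=0)                           # per-feature variance over the batch
    normalised = (h - mean) / np.sqrt(var + eps)
    return gamma * normalised + delta

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=5.0, size=(64, 4))   # 64 examples, 4 features
out = batch_norm(batch)
```

With γ = 1 and δ = 0 the output has mean 0 and variance close to 1 per feature; at inference time, stored statistics are normally used in place of the batch statistics.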
Representative of residual networks
ResNet, DenseNet, U-Net.
Compared to sequential convolutional networks like VGG, two innovations of residual networks like ResNet are:
- adding connections between residual blocks, so that the input of a block is connected to the output.
- batch normalisation
Residual networks allow us to train even larger networks without worrying about vanishing or exploding gradients. It can be thought of as a large network made up of a group of small networks strung together by additional connections. A residual network can lead to simpler and smoother loss surfaces than a simple convolutional network.
Transformers were developed mainly for natural language processing, where the input sequences are variable in length and, unlike pictures, cannot easily be resized.
The transformer uses dot-product self-attention to solve two problems: 1. sharing parameters so that texts of different lengths can be processed; 2. the fact that the same word may have different meanings depending on the context. N inputs are taken, each of size D. Each output can be regarded as a different weighted combination of these N inputs, where the weights are non-negative and sum to 1.
In the self-attention mechanism, each position in the sequence is mapped by linear transformations to three vectors: a query, a key, and a value.
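A minimal single-head sketch of dot-product self-attention follows. The matrix names `Wq`, `Wk`, `Wv`, the random weights, and the scaling by the square root of the key dimension are assumptions for illustration. Inputs are stored as rows here (N x D). The same parameters are applied to sequences of different lengths, which is the parameter-sharing point made above.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X has shape (N, D): one row per input. Returns outputs and attention weights."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[1])         # dot products between positions
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights                    # each output mixes all N values

rng = np.random.default_rng(0)
D = 4
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
out_short, w_short = self_attention(rng.standard_normal((3, D)), Wq, Wk, Wv)
out_long, w_long = self_attention(rng.standard_normal((7, D)), Wq, Wk, Wv)
```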
Used to represent position information in a sequence.
Two approaches are used: Absolute position embeddings and Relative position embeddings. The first approach uses a position vector, which can be learned in advance, while the second considers relative position computation, which is usually more resource intensive.
The original size D is split across H heads, each of size D/H. The heads can be computed separately or in parallel. Another linear transformation is used to put them back together.
The advantage of multi-headed self-attention is that it improves the expressiveness and generalisation of the model, and is particularly effective when dealing with long sequences. Multi-headed self-attention can also improve the performance of a model by letting different heads focus on different features. It is widely used in models such as the Transformer.
Self-attention is only one part of a transformer layer. The layer consists of a multi-headed self-attention unit followed by a fully connected network applied to each position; both are wrapped in residual connections (meaning one more path from input to output), and each is followed by a LayerNorm normalization operation.
Common pipeline for natural language processing
Firstly, a tokeniser breaks the text into smaller constituent units (tokens) of the vocabulary. These units are then mapped into an embedding with all the information to obtain vector values.
A Token is not actually a word. For example, a proper noun may not be a normal word. Punctuation still has to be taken into account. Different versions of the same word in the content may have different Tokens.
There are some common approaches such as substring tokenisers by word frequency.
Each token is mapped to a word embedding via a fixed vocabulary. The same token always maps to the same word embedding. Each token’s index is effectively a one-hot vector with the length of the vocabulary (i.e. a vector with all positions 0 except for a 1 at the correct position).
The transformer model is formed by passing a matrix of vectors of tokens through a series of transformer layers. There are three types of transformer model:
- encoder model
- decoder model
- encoder-decoder model
More on this later.
BERT encoder model
BERT is an encoder model that uses a vocabulary of 30,000 tokens. Each token is embedded in 1024 dimensions. It has 24 transformer layers. The overall number of parameters is 340M.
The BERT-like encoder model makes full use of transfer learning, with self-supervised learning from a large corpus of text in a pre-training phase, the goal of which is to learn general information in a statistical sense of the language. In the fine-tuning phase, the final network is used to solve a specific task with some small amount of supervised training data.
During the pre-training phase, the BERT model mainly predicts missing words in an Internet corpus. This way syntactic rules are learned, for example that adjectives are often followed by nouns. Nevertheless, this kind of understanding is very limited.
The fine-tuning phase of BERT requires additional layers on top of the transformer network to turn the output vectors into the desired result. For example: text classification (sentiment analysis), word classification (person and place recognition), text span prediction (finding the answer to a question in a passage).
GPT3 decoder model
The goal of the decoder model is to produce the next token in a sequence. GPT3 is strictly an autoregressive language model. In the self-attention mechanism of the transformer layer, when processing a given token, the subsequent tokens are masked out so that they receive zero weight after the softmax. This is called masked self-attention. What is special about the decoder model is that, because of masked self-attention, each position in the transformer layer attends only to the current and previous tokens.
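Masked self-attention can be sketched by setting the scores of future positions to minus infinity before the softmax, so that they end up with zero weight. The function name and the all-zero example scores are hypothetical; row i holds the attention weights used when processing token i.

```python
import numpy as np

def masked_attention_weights(scores):
    """scores[i, j]: how strongly token i wants to attend to token j."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)    # True where j > i
    masked = np.where(future, -np.inf, scores)            # hide tokens that come later
    masked = masked - masked.max(axis=1, keepdims=True)   # diagonal keeps each row max finite
    weights = np.exp(masked)                              # exp(-inf) = 0
    return weights / weights.sum(axis=1, keepdims=True)

weights = masked_attention_weights(np.zeros((4, 4)))
```

With equal scores, the first token attends only to itself, while the last token spreads its attention evenly over all four positions.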
GPT3 has up to 175B parameters. It produces predictions that are plausible but not always correct.
One surprising thing about a model of this size is that it can do many things without fine-tuning. GPT3 is considered a few-shot learner: it can pick up a task from a small number of examples, but it is not clear where it learned to do so, and the performance is inconsistent.
Machine translation - encoder-decoder model
For machine translation, the transformer layers in the decoder are modified so that they also attend to the encoder’s output: in addition to masked self-attention over the tokens generated so far, each decoder layer attends to the encoded source sentence. This is known as encoder-decoder attention or cross-attention.
Long text transformer
Whether encoding or decoding, there is an interaction between each token and all other tokens, and so there is a quadratic growth in computation as the length of the text increases. There are ways to reduce token interactions using convolutional structures.
Transformer of images
Transformers were originally designed for text, and after good results were achieved some people started to use them on images. ImageGPT is a decoder model that takes part of an image and predicts the rest. The complexity of its transformer network means that its 6.8B parameters can only handle 64x64 images.
Vision Transformer (ViT)
ViT solves the image resolution problem. It is an encoder model. However, its performance does not exceed the best CNNs. To compete with CNNs, much more training data will be required.
The main difference between the encoder and decoder models
The encoder is a plain/full self-attention mechanism, while the decoder model uses masked self-attention.
Graph neural network matrices A, X and E
A: adjacency matrix, NxN; the entry for any two connected nodes is set to 1. X: matrix of node embeddings, with N nodes, each with D attributes, so DxN. E: matrix of edge embeddings, with E edges and D attributes per edge, so DxE.
Three tasks for graph neural networks
- Graph-level task: for an entire graph, predict the class it belongs to or the value it produces (classification and regression). For example, determining whether a chemical structure is harmful or not. To obtain the loss function, the output node embeddings are first combined, for example by averaging, and the result is then mapped to the prediction.
- Node-level tasks: predicting the classification or value for each node in the graph. For example, predicting whether a point in a 3d point cloud is part of an aircraft. At this point the loss function only looks at a single node and no longer averages.
- Edge prediction task: predict whether there is an edge between two nodes. For example, friend recommendation based on a social network. One way to get the loss function is to take the dot product of the two node embeddings and pass it through a sigmoid to obtain a probability.
Inductive and transductive models
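Inductive models of graph neural networks: a labelled training set is used to learn the relation between inputs and outputs, and the trained model is then applied to new, unseen nodes or graphs. For example, in a social network, inferring possible acquaintances for newly added users (new nodes and edges) based on friendships that already exist.
Transductive models of graph neural networks: a type of semi-supervised learning on a single graph that contains both labelled and unlabelled nodes, typically a node classification task. For example, in a social network, only a small number of people may have labels, which are then used to predict the labels of the remaining users, for instance through a label-propagation algorithm.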
Splitting large graph approaches
Neighbourhood sampling: in each batch, a random subset of neighbours is sampled at each hop distance from the target nodes, which is a bit like dropout and adds regularization.
Graph partitioning: splits the big graph into subsets of nodes, which can then be treated as multiple datasets.
With the two approaches described above, a single large graph can be made to have datasets of different intents, and also effectively turn a transductive problem into an inductive one. In the stage of making predictions, the results can be obtained by simply selecting the neighbours of k hops for node estimation, which is relatively much less memory intensive.
Operations that can be performed by graph convolutional networks
GCNs can perform most of the operations of CNNs.
Combination of the current node and all neighbouring nodes: diagonal enhancement.
Mean aggregation: one of them is called Kipf normalization.
Maximum pooling aggregation.
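The aggregation options above can be sketched on a tiny graph, using the DxN layout for the node embeddings X and the NxN adjacency matrix A. The 4-node graph, the function names and the numbers are hypothetical, and the sketch assumes every node has at least one neighbour. Kipf normalization is not shown.

```python
import numpy as np

# Hypothetical undirected graph with edges 0-1, 0-2 and 2-3.
N, D = 4, 2
A = np.zeros((N, N))
for i, j in [(0, 1), (0, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
X = np.arange(D * N, dtype=float).reshape(D, N)    # column n is the embedding of node n

def mean_aggregate(X, A, include_self=False):
    """Average the embeddings of each node's neighbours (optionally with itself)."""
    if include_self:
        A = A + np.eye(A.shape[0])      # add the current node to its own neighbourhood
    return (X @ A) / A.sum(axis=0)      # column n: sum over neighbours / their count

def max_aggregate(X, A):
    """Element-wise maximum over the embeddings of each node's neighbours."""
    out = np.empty_like(X)
    for n in range(A.shape[0]):
        neighbours = np.flatnonzero(A[:, n])
        out[:, n] = X[:, neighbours].max(axis=1)
    return out

mean_nb = mean_aggregate(X, A)
mean_self = mean_aggregate(X, A, include_self=True)
max_nb = max_aggregate(X, A)
```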
All the previous workflows were supervised learning; we now move on to unsupervised learning.
Discriminative models map from data x to latent variable z. Generative models map from latent variable z to data x.
There are two main families of generative models: generative adversarial models and probabilistic generative models.
A good generative model should offer:
- high-quality sampling
- a well-behaved latent space
- an interpretable latent space
- adequate likelihood calculation
Quantitative unsupervised learning performance
- test likelihood
- inception score (IS)
- Fréchet inception distance (FID)
- manifold precision/recall
Generative Adversarial Networks
GANs are unsupervised models that aim to generate new samples and make them indistinguishable from training samples. It is only used to generate new samples. It is not easy to evaluate how good the generated samples are.
Models generally have two parts, a generator and a discriminator.
If the discriminator is able to discriminate, it is returned to the generator as a signal to adjust its parameters and regenerate.
This is a Nash equilibrium problem: the discriminator seeks to minimise its loss while the generator seeks to maximise that same loss. By multiplying the generator’s objective by -1, it also becomes a minimisation problem, and the term that does not depend on the generator’s weights can be dropped.
DCGAN Deep Convolutional Generative Adversarial Network
It is not easy to train and needs to satisfy:
the use of strided convolution for both upsampling and downsampling;
using BatchNorm for both the generator and the discriminator, except for the very beginning and the end;
use leaky ReLU for the activation function;
use Adam with lower momentum coefficients.
Mode dropping and mode collapse
Mode dropping: the generated samples are not comprehensive enough and miss part of the data distribution (e.g. faces are generated, but never with a beard).
Mode collapse: the input z is almost completely ignored and all samples collapse to one or a few results.
Wasserstein distance, also known as Earth Mover’s distance (EMD), is a measure of the difference between two probability distributions. It is based on the idea of the Minimum Cost Transport Problem (MCTP) and is used to compare the similarity between two distributions.
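For intuition, in one dimension and with two sample sets of equal size, the Wasserstein (earth mover’s) distance reduces to sorting both sets and averaging the distances between matched points. The sketch below relies on that special case; the function name and the sample values are hypothetical, and it is not the procedure used when training GANs.

```python
import numpy as np

def wasserstein_1d(samples_p, samples_q):
    """Earth mover's distance between two equal-sized 1D sample sets."""
    p = np.sort(np.asarray(samples_p, dtype=float))
    q = np.sort(np.asarray(samples_q, dtype=float))
    if p.shape != q.shape:
        raise ValueError("this sketch assumes equally many samples in both sets")
    # After sorting, the cheapest transport plan matches the i-th smallest points.
    return float(np.mean(np.abs(p - q)))

shifted = wasserstein_1d([0, 1, 2], [3, 4, 5])   # every unit of mass moves by 3
same = wasserstein_1d([2, 0, 1], [0, 1, 2])      # identical sets, nothing to move
```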
The Wasserstein formulation allows for more stable GAN training. In addition, the quality of the output images can be increased by progressive growing, minibatch discrimination, and truncation.
Conditional adversarial network (cGAN): adds a conditional vector c to both the generator and the discriminator.
Auxiliary classifier GAN (ACGAN): c is added to the generator input but not to the discriminator; instead, the discriminator predicts c at its output.
InfoGAN: c is added to the generator input and not to the discriminator, and c is predicted at the output.
Image translation examples: greyscale to colour, noise reduction, sharpening of images, sketch to photorealistic.
Pix2Pix uses before-and-after image pairs for training, CycleGAN uses unpaired images, and StyleGAN uses multiple style vectors for fine-grained control.
Possibilities of Generative Networks
GANs, compared to other generative models such as Normalizing Flows, VAEs, and Diffusion models, are the only models that cannot provide direct probabilities.
Evaluating the Quality of Generative Networks
Both the authenticity and diversity of generated images are equally important.
Distance Metrics for Evaluating Generative Network Results
Wasserstein distance is more suitable for GANs than the Kullback-Leibler divergence.
Conditional Generative Adversarial Networks (cGAN)
Labels from the data are used in both the generator and discriminator.
Deep Networks Learn Faster
Randomizing a portion of the labels in the data does not hinder a large enough network from learning the correct labels.
Inception score, Frechet inception distance, and Manifold precision/recall
These metrics are used to evaluate the quality of generative networks. They rely on a pre-trained classifier, often based on ImageNet, and measure the similarity between generated images and training samples.
Shape of the Loss Function
A flat minimum of the loss function appears to be beneficial for testing accuracy.
Memory Consumption of Networks
Pruning, knowledge distillation, and weight quantization can reduce the memory consumption of networks.
Simply put, in the case of 1D, the goal of Normalizing Flow is to map the latent variable z to x through a function f, so that the distribution of x matches the distribution of real data. The function f is referred to as the Normalizing Flow. The distribution of z is denoted as Pr(z), and the distribution of x is denoted as Pr(x). The relationship between z and x is given by x = f(z, φ). It can be both forward and inverse. Forward direction corresponds to the generation process, while the inverse direction corresponds to the normalization process, where the inverse mapping produces a normalized distribution of z. When extending from 1D to nD, both z and x become vectors, and f becomes a multivariate function, essentially performing a sequence of continuous mappings.
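As a minimal 1D sketch of the forward and inverse directions, assume the simplest possible invertible function, an affine map x = a·z + b with parameters φ = (a, b) and a standard normal Pr(z). These choices and the function names are illustrative assumptions. The density of x follows from the change-of-variables rule Pr(x) = Pr(z) / |df/dz|.

```python
import numpy as np

def forward(z, a, b):
    """Generation direction: latent z -> data x."""
    return a * z + b

def inverse(x, a, b):
    """Normalising direction: data x -> latent z."""
    return (x - b) / a

def log_prob_x(x, a, b):
    """log Pr(x) = log Pr(z) - log|df/dz|, with z = inverse(x) and Pr(z) standard normal."""
    z = inverse(x, a, b)
    log_pz = -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)
    return log_pz - np.log(np.abs(a))

a, b = 2.0, 1.0                                  # hypothetical parameters phi
z = np.array([-1.0, 0.0, 0.5])
z_back = inverse(forward(z, a, b), a, b)         # recovers z, since f is invertible
xs = np.linspace(-20.0, 22.0, 20001)
total_mass = np.sum(np.exp(log_prob_x(xs, a, b))) * (xs[1] - xs[0])
```

The numerical integral of the resulting density is close to 1, which is a quick sanity check that the |df/dz| term has been handled correctly.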
Flows refer to invertible network layers. Normalizing Flow is the combination of invertible network layers. Depending on the specific combination, different types of flow models can be obtained, such as linear flows, non-linear flows (affine coupling flows), autoregressive flows, inverse autoregressive flows, coupling flows, residual flows, and multiscale flows.
GLOW stands for Generative flows. The quality of synthesized images is slightly inferior to GANs, but it is not clear whether this is due to the restriction to invertible layers or to less research effort having been invested. It performs well in interpolating between two real images, such as generating intermediate faces between two real human faces.
Variational Autoencoder (VAE)
VAEs and Normalizing Flows are both probabilistic generative models. The goal of GANs is to generate images that are indistinguishable from the training set. VAEs and Normalizing Flows aim to learn the distribution of the training set and generate new samples. VAE is a latent variable model. It was proposed in 2014 but had a complete explanation only in 2019.
Now it is possible to generate high-quality images using VAE, but it requires debugging and specialized architectural design for each layer. This leads to the diffusion model.
VAEs can also be used for fine-grained modifications of images, such as opening the mouth or shifting the gaze of a face.
The diffusion model consists of an encoder and a decoder. The encoder is the forward process, and the decoder is the reverse process. When training the decoder, a diffusion kernel is used to efficiently compute the intermediate latent variables.
The diffusion model maps data instances through a series of latent variables, repeatedly mixing the current representation with random noise.
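The role of the diffusion kernel can be sketched as follows, assuming the common formulation in which each encoder step mixes the current value with Gaussian noise using a noise schedule β_t. The schedule, the function names and the test values are assumptions for illustration. The kernel jumps directly to step t, instead of applying t noising steps one after another.

```python
import numpy as np

def encode_step_by_step(x, t, betas, rng):
    """Apply t noising steps: z <- sqrt(1 - beta) * z + sqrt(beta) * noise."""
    z = x
    for beta in betas[:t]:
        z = np.sqrt(1.0 - beta) * z + np.sqrt(beta) * rng.standard_normal(x.shape)
    return z

def diffusion_kernel_sample(x, t, betas, rng):
    """Draw z_t directly from x in a single step."""
    alpha_t = np.prod(1.0 - betas[:t])           # fraction of the signal that survives
    noise = rng.standard_normal(x.shape)
    return np.sqrt(alpha_t) * x + np.sqrt(1.0 - alpha_t) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)             # hypothetical noise schedule
x = np.full(20000, 3.0)                          # many copies of one data value
slow = encode_step_by_step(x, 100, betas, rng)
fast = diffusion_kernel_sample(x, 100, betas, rng)
```

Both routes give samples with matching mean and variance, but the kernel needs one noise draw instead of one hundred, which is what makes training efficient.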
For image generation, each denoising step is implemented using U-Net, so the sampling speed is slower compared to other generative models. To improve the generation speed, the diffusion model can be reformulated as deterministic, and sampling can be performed with fewer steps, which yields good results. The diffusion model is currently the best model for text-to-image generation.
Reinforcement learning is a framework for sequential decision-making, where an agent learns to take actions in an environment to maximize the received reward. It is commonly used in game-playing AI.
A well-known approach is the Markov Decision Process. A Markov Decision Process consists of a five-tuple, including a state space, an action space, a transition function, a reward function, and a discount factor.
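To illustrate the role of the discount factor, the quantity an agent tries to maximise is usually the discounted sum of future rewards. The sketch below computes it for a hypothetical reward sequence; the function name and the numbers are illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total    # work backwards from the last step
    return total

near_sighted = discounted_return([1.0, 1.0, 1.0], gamma=0.5)    # 1 + 0.5 + 0.25
far_sighted = discounted_return([0.0, 0.0, 10.0], gamma=0.9)    # 10 * 0.9**2
```

A smaller discount factor makes the agent care more about immediate rewards; a value close to 1 makes distant rewards count almost as much as near ones.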
Policy gradient methods optimize the policy directly instead of assigning values to actions. They generate stochastic policies, which is important when the environment is partially observable. Updates are noisy, and many improvements have been introduced to reduce their variance.
When we cannot interact with the environment and need to learn from historical data, offline reinforcement learning is used. Decision transformers leverage the latest advances in deep learning to build a model