Book summary: Learning From Data

Most deep learning research nowadays revolves around proposing newer models with ever larger numbers of parameters, and novel methods with a weak theoretical basis for why such a model should outperform its competitors. A large part of this is due to the inherent complexity of understanding the huge number of interactions among parameters in a deep learning model. So, the theoretical study of deep learning models has mostly been sidelined in favor of treating them as black boxes and experimentally comparing their performance on a leaderboard.

To begin mitigating this, ‘Learning From Data – A Short Course‘ by Yaser S. Abu-Mostafa et al. is an important book in the sense that it tries to provide a mathematical basis for the learning problem in general – starting from the fundamental question of why it is possible to learn anything at all. The book is a great read for amateurs and specialists of deep learning alike, as the mathematical concepts required to digest the material are elementary. Yet for first-time readers, the newness of the material covered might make the flow of the arguments quite confusing. Hence, this post lists major bullet points from the important chapters (i.e. not all chapters are covered) for my own future reference and for anyone else who might benefit from a summary of this book with added notes of my own.

Chapter 1: The Learning Problem

  1. Difference between Learning and Design approaches:
    • Learning approach uses data to search the hypothesis set \mathcal{H} for a good hypothesis h that can correctly classify training data. The resultant hypothesis is then used to make predictions on out-of-sample data generated by the same unknown target function that generated the training data. Example: the problem of accepting or denying credit for a bank customer. As there are many factors (usually called features in a learning problem) affecting a customer’s ability to pay back debt, we have to analyze all previous data we have of him/her and of similar customers to make a yes/no decision on granting further credit.
    • Design approach uses specifications (along with probability and statistics) to analytically define the unknown target function without any need to see data. Example: the problem of classifying coins. As coins are produced by machines (and verified manually before distribution) with very little error in their diameter and form, a formal design specification with allowance for random error might be sufficient to classify them.
  2. The book only deals with learning approaches, hence a brief discussion of the types of learning methods:
    • Supervised learning: Training data consists of inputs and their corresponding output pairs. The objective of a supervised learner is to be able to match input and output pairs by computing errors in each training iteration and rearranging its internal configuration accordingly. Types of supervised learning are:
      1. Active learning: Here the output for an input data point can be queried from the training set. This allows us to make strategic choices of inputs.
      2. Online learning: Here examples of training data arrive in a stream. So, the learner has no strategic choice of input and has to learn from examples as they come.
    • Reinforcement learning: Training data consists of an input, a corresponding output and a grade for that output. The data does not define how good other outputs would have been for the same input. The grade functions as a reward that the learner tries to maximize, as it defines what the best course of action is.
    • Unsupervised learning: Training data consists of only inputs, so an unsupervised learner is tasked with finding patterns among input data points to categorize them into unlabelled clusters. The evaluation phase consists of correctly classifying out-of-sample data into one of these clusters.
  3. We can create learners that memorize the training set by using a very complex hypothesis with lots of parameters, but can we create a learner from in-sample data that makes predictions on out-of-sample data (generated by a non-random target function)?
    • Deterministically – NO
    • Probabilistically – YES
  4. Introduction to the Bin Model: To establish statement (3) above, we first introduce an analogous problem called the Bin Model, where we model a bin with an unknown (possibly infinite) number of red and green marbles. The problem is to bound \mu = the probability of red marbles in the bin (so (1-\mu) = the probability of green marbles) using \nu = the fraction of red marbles among N marbles randomly sampled (with replacement) from the bin.
  5. Hoeffding’s Inequality: \mathbb{P}[|\nu - \mu| > \epsilon] \leq 2e^{-2\epsilon^{2}N} \text{ for any } \epsilon > 0, where \mathbb{P} is probability and \epsilon is the error threshold within which \nu tracks \mu. This inequality lets us bound how far the sampled fraction \nu can stray from the unknown \mu (a small simulation follows this list).
  6. Notes for Hoeffding’s Inequality equation:
    • The inequality only holds if the samples are randomly drawn.
    • \mu is a constant while \nu is a random variable dependent on the value of \mu.
    • The right side of the inequality only has \epsilon and N. This implies that the bound depends only on how small an error threshold we choose and on the number of samples drawn, but not on the number of marbles in the bin. In fact, the number of marbles in the bin could be infinite.
    • The larger the number of samples N drawn from the bin, the better the probability that \nu tracks \mu within the error threshold.
  7. Extension of the bin model to one hypothesis: The bin framework can be reconstituted to establish statement (3) for a single hypothesis (for now) by redefining red marbles as data points where the hypothesis function does not match the target function, i.e. h(x) \neq f(x), and vice versa for green marbles. For this redefinition to hold, we have to make sure that the training inputs x \in \mathcal{X} are randomly sampled according to an unknown probability distribution P over the input space \mathcal{X}.
  8. Two rules of probability, given events \mathcal{B}_1, \mathcal{B}_2, …, \mathcal{B}_M:
    • If \mathcal{B}_1 implies \mathcal{B}_2, then \mathbb{P}[\mathcal{B}_1] \leq \mathbb{P}[\mathcal{B}_2]
    • Union Bound: \mathbb{P}[\mathcal{B}_1\text{ or }\mathcal{B}_2\text{ or ... or }\mathcal{B}_M] \leq \mathbb{P}[\mathcal{B}_1] + \mathbb{P}[\mathcal{B}_2] + ... + \mathbb{P}[\mathcal{B}_M]
  9. Extension of the bin model to multiple hypotheses: Using the rules from statement (8), we reconstitute statement (7) for multiple hypotheses as: \mathbb{P}[|E_{in}(g)-E_{out}(g)| > \epsilon] \leq 2Me^{-2\epsilon^2N}, where E_{in} is in-sample error, E_{out} is out-of-sample error, g is the final hypothesis and M is the total number of hypotheses in the hypothesis set.
  10. Notes about statement (9):
    • M is infinite (for now) and hence the bound is currently too weak to be of any use. This will be dealt with in the next chapter.
    • Depending on the complexity of different hypotheses, the bounds can be of different sizes, but for simplification they are assumed to be of one size and hence the equation in statement (9) has the factor M instead of a summation.
    • The equation seems to imply that a simple target function can be learned as easily as a complex one with a simple hypothesis. This is because the statement only deals with keeping out-of-sample error close to in-sample error, not with the requirement of making in-sample error small (see statement (11)). The complexity of a hypothesis deals with the latter.
  11. Requirements for actual learning:
    • Out-of-sample error is close to in-sample error
    • In-sample error is small enough
  12. Noisy data: Real-world data is noisy. To accommodate this, we model \mathbb{P}(x,y)=\mathbb{P}(x)\mathbb{P}(y|x) instead, where noise is accommodated in \mathbb{P}(y|x) and input samples are picked with an unknown distribution \mathbb{P}(x).
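
To make the bound in statement (5) concrete, here is a small simulation of the bin model (my own sketch; the values of \mu, N, \epsilon and the number of trials are arbitrary illustrative choices):

    # Simulating the bin model: how often does nu miss mu by more than epsilon?
    import math
    import random

    mu, N, epsilon, trials = 0.6, 100, 0.1, 100_000

    violations = 0
    for _ in range(trials):
        nu = sum(random.random() < mu for _ in range(N)) / N  # fraction of reds drawn
        if abs(nu - mu) > epsilon:
            violations += 1

    print('empirical:', violations / trials)                      # ~0.03 in practice
    print('Hoeffding bound:', 2 * math.exp(-2 * epsilon**2 * N))  # ~0.27

The empirical violation rate always stays below the bound; the bound is loose by design, since it must hold for any \mu.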

Chapter 2: Training versus Testing

  1. Generalization bound: We can rewrite the equation in Chapter 1, statement (9), to create a bound between E_{out} and E_{in} as follows: E_{out}(g) \leq E_{in}(g)+\sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}\text{ where } \delta=2Me^{-2N\epsilon^2}
  2. Reducing infinite hypothesis choices to a polynomial: The term M in Chapter 1, statement (9), can be infinite, which makes the bound useless. Moreover, the union bound in Chapter 1, statement (8), consists of events \mathcal{B}_M that heavily overlap, which makes the bound unnecessarily wide. To remedy this, we first recognize that, given N points for a model to classify, a specific model can only produce a fixed number of dichotomies. The maximum number of such dichotomies, as a function of N, is called the growth function of a model. Bounding the growth function will replace the infinite M term with a polynomial. We will consider binary target functions for simplicity but later extend to real-valued functions.
  3. Growth function: This function gives the maximum number of dichotomies that can be generated by a hypothesis set on any N given points to classify. Mathematically, the growth function is m_{\mathcal{H}}(N)=\max\limits_{\pmb{x}_1\text{,...,}\pmb{x}_N \in \mathcal{X}}|\mathcal{H}(\pmb{x}_1\text{,...,}\pmb{x}_N)|
  4. Break point: If no data set of size k can be shattered (i.e. produce all 2^k possible dichotomies) by \mathcal{H}, then k is said to be a break point for \mathcal{H}. If a model can shatter any N points, then m_{\mathcal{H}}(N)=2^N.
  5. Need to bound the growth function: If m_{\mathcal{H}}(N)=2^N for a hypothesis set (i.e. it is too complex), the generalization error term \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}} (from statement (1)) will never go to zero regardless of the number of training examples. The model will then memorize the training set instead of learning from it. On the other hand, bounding m_{\mathcal{H}} by any polynomial will bring the generalization error to zero as the number of samples N\to\infty.
  6. Bounding the growth function: Using recursive techniques, we find that if m_{\mathcal{H}}(k)<2^k for some value k, then m_{\mathcal{H}}(N)\leq\sum_{i=0}^{k-1}\binom{N}{i}. The RHS of this inequality is a polynomial in N of degree k-1.
  7. Vapnik-Chervonenkis Dimension: The VC dimension d_{VC} is the largest N for which m_{\mathcal{H}}(N)=2^N. If m_{\mathcal{H}}(N)=2^N for all N, then d_{VC}(\mathcal{H})=\infty. We can restate the bound in statement (6) in terms of the VC dimension as: m_{\mathcal{H}}(N)\leq\sum_{i=0}^{d_{VC}}\binom{N}{i}. The VC dimension can be thought of as measuring the ‘effective’ number of parameters a model has to express a diverse set of hypotheses.
  8. VC Generalization bound: Finally, we replace M in the generalization bound of statement (1) to get the VC generalization bound: E_{out}(g)\leq E_{in}(g)+\sqrt{\frac{8}{N}\ln\frac{4m_{\mathcal{H}}(2N)}{\delta}}\text{ for any tolerance }\delta>0\text{ and with probability }\geq 1-\delta.
  9. Notes on VC Generalization bound:
    • The slack in the bound can be attributed to: a) the bound being the same for different values of E_{out}, b) the maximum number of dichotomies being used, and c) m_{\mathcal{H}}(N) being bounded by a simple polynomial
    • Despite the bound being loose, it: a) establishes the feasibility of learning with an infinite hypothesis set, and b) is equally loose for different learning models and hence allows generalized performance comparisons between them.
  10. Sample complexity: Sample complexity is the number of training examples needed to achieve a certain generalization performance. The number of training examples needed for the generalization error to remain at or below \epsilon is given by: N \geq \frac{8}{\epsilon^2}\ln(\frac{4((2N)^{d_{VC}}+1)}{\delta}). Since N appears on both sides, we can use an iterative method to get N for an error threshold \epsilon (see the sketch after this list).
  11. Penalty for model complexity: A model with higher complexity (i.e. higher VC dimension) is more likely to fit the training set but lose generalization on the out-of-sample test set. An intermediate VC dimension (found through training and testing) gives the best level of model complexity, with minimum out-of-sample error and low (but not minimum) in-sample error.
  12. Test set: A part of the data set kept aside for testing the final hypothesis after training is called the test set. E_{test} is considered synonymous with E_{out} for practical purposes, as E_{out} technically cannot be calculated.
    • Because there is only the final hypothesis to calculate error against, its bound is much tighter than the bound over the whole hypothesis set during training.
    • Larger test sets give a more accurate estimate of E_{out}, though too big a test set leads to too small a training set, which in turn degrades the ability of the model to learn anything.
    • When the test set is not used in any form during training or to select the final hypothesis, the test set estimate has no bias, only variance.
  13. Extension from binary to real-valued functions: Bias-variance decomposition of out-of-sample error can be expressed as: \mathbb{E}_{\mathcal{D}}[E_{out}(g^{(\mathcal{D})})] = \mathbb{E}_x[\text{bias}(x)]+\mathbb{E}_x[\text{var}(x)].
  14. Notes on bias-variance:
    • An ideal model has both low bias and low variance.
    • Generally, trying to reduce bias increases variance and vice versa.
    • A good hypothesis has an ‘acceptable’ compromise between bias and variance.
    • In VC analysis, out-of-sample error is the sum of in-sample error and generalization error while in bias-variance analysis it is the sum of bias and variance.
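
As an aside to statement (10): since the sample-complexity inequality has N on both sides, a simple fixed-point iteration suffices to solve it. A sketch (the values of d_{VC}, \epsilon and \delta are illustrative choices):

    # Iteratively solving N >= (8/eps^2) * ln(4*((2N)^d_vc + 1)/delta)
    import math

    def sample_complexity(d_vc, epsilon, delta, n=1000.0, iterations=100):
        for _ in range(iterations):
            # Plug the current N into the RHS and take the result as the new N.
            n = (8 / epsilon**2) * math.log(4 * ((2 * n)**d_vc + 1) / delta)
        return math.ceil(n)

    # With d_vc=3, epsilon=0.1 and delta=0.05 this converges to roughly 30,000.
    print(sample_complexity(d_vc=3, epsilon=0.1, delta=0.05))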

Chapter 4: Overfitting

  1. Effects of:
    • Increase in number of training data – Overfitting decreases
    • Increase in noise – Overfitting increases
    • Increase in target complexity – Overfitting increases
  2. Types of noise:
    • Stochastic: This is the random noise in the data. This noise changes every time data is generated from the target function.
    • Deterministic: This noise comes from the complexity of the target function (not counting added random noise) as a model tries to approximate the target function using its available degrees of freedom. This noise does not change if the same data is generated again, but different models can have different deterministic noise depending on their complexity (and parameters).
  3. In the bias-variance analysis of Chapter 2, statement 13:
    • Stochastic noise adds an extra term to the equation: \mathbb{E}_{\mathcal{D}}[E_{out}(g^{(\mathcal{D})})] = \sigma^2 + \mathbb{E}_x[\text{bias}(x)]+\mathbb{E}_x[\text{var}(x)]\text{ where }\sigma^2\text{ is the variance of the stochastic noise}.
    • Deterministic noise is captured by the bias term, while variance is affected by both types of noise.
  4. Regularization: This reduces overfitting by limiting the degrees of freedom of the model towards the selection of simpler hypotheses. The decision of whether to use regularization is heuristic. Regularization generally increases bias mildly while strongly reducing variance.
  5. Validation set: A test set is an unbiased estimate of E_{out} and is never used in training or in making choices in the learning process, while a validation set, while not used in training directly, is used for certain choices in the learning process. Training with a validation set of K data points consists of first training with (N-K) data points, validating with the K data points to estimate the model with the best E_{out}, and then finally retraining with all N data points to produce the final hypothesis.
  6. Cross validation: Setting aside too large a validation set of K data points reduces the number of training samples, while setting aside too few makes E_{val} a weak estimate of E_{out}. V-fold cross validation is one type of cross validation, where the training data is divided into V disjoint sets, each of size N/V. Training always happens on data not used for validation, and the cross-validation error is the average of the validation errors across the V folds. This makes cross validation more effective than the naive validation method (a minimal sketch follows this list).
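
A minimal sketch of V-fold cross validation as described in statement (6); 'train' and 'error' stand in for any learning algorithm and error measure (both are hypothetical placeholders):

    # V-fold cross validation: every point is used for validation exactly once.
    def v_fold_cv(data, v, train, error):
        fold_size = len(data) // v
        val_errors = []
        for i in range(v):
            val = data[i * fold_size:(i + 1) * fold_size]            # K = N/V points
            trn = data[:i * fold_size] + data[(i + 1) * fold_size:]  # the other N - K
            g = train(trn)                     # never sees the validation fold
            val_errors.append(error(g, val))
        return sum(val_errors) / v             # cross-validation estimate of E_out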

Chapter 5: Three Learning Principles

  1. Occam’s Razor: Simpler models generally produce lower out-of-sample errors. Hence, it is preferable to select a simpler hypothesis than a complex one.
  2. Sampling Bias: If models are trained with a dataset which is sampled in a biased way, the resultant model will also be biased. One way to avoid this would be to train with multiple datasets.
  3. Data Snooping: Looking at test data to choose a learning model compromises the ability to properly learn, as it affects the ability to assess the outcome. A portion of the total data must be set aside as a test set and not be used in anything besides testing the final hypothesis.

Micro:bit controlled LED matrix for CPU/GPU usage display and Audio output visualization

Project code at: https://github.com/sanje2v/RGBLEDMatrixDriver

Since digital displays showing CPU and GPU temperatures on PC AIO water coolers have become popular, PC enthusiasts have made use of tiny screens (connected as separate monitors to their GPUs) inside their PC cases to display similar statistics along with additional information like fan speeds, voltages, etc. These displays, despite being relatively cheap, have high pixel densities and are great for displaying text and images.

But being an engineer, plug-and-play solutions kinda bore me. Why not complicate things by introducing a microcontroller that communicates with the PC over serial while also driving the LED matrix? Hence, this project.

Brief architecture

The architecture of the system is simple enough to understand. The host PC is connected to the micro:bit LED controller using serial over USB, and the controller to the LED matrix using the P0 pin-out on the micro:bit board. The microcontroller is powered by the host via USB, while the LED matrix is powered via the 5V rail on the same power supply’s molex connector. Although the power supply provides current to all the components in the system, it is important to tie the ground of the LED matrix to the ground of the micro:bit controller to counter problems caused by ground loops.

On the host side, a program continuously monitors for a frame synchronization command from the controller. When such a command is received, the host transmits a compressed frame as soon as possible via the same serial connection. The controller sits between the host and the LED matrix. Its role is to send the synchronization command to the host, receive frame data from the host, decompress it in its memory and drive the LED matrix.
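
For illustration, here is a minimal sketch of the host-side loop, assuming the pyserial package (the port name, baud rate and helper functions are hypothetical placeholders, not the project's actual code):

    import serial  # pyserial

    def render_frame():
        # Hypothetical stand-in: a solid dark-red 16x16 frame of RGB triplets.
        return bytes([32, 0, 0]) * 16 * 16

    def compress_frame(frame):
        # Placeholder for the lossy + RLE compression described further below.
        return frame

    with serial.Serial('COM5', 115200, timeout=1) as conn:
        while True:
            if conn.readline() == b'SYNC\r\n':   # controller is ready for a frame
                conn.write(compress_frame(render_frame()))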

Using an SPI-driven RGB LED matrix

In the initial phase of this project, I was adamant about using an SPI-driven RGB LED matrix. The matrix module called EP0075 seemed like a good idea, but this module has two problems. First, the matrix chip in this module does not have a latch to hold state. This means the driving controller has to continuously refresh the state of each LED as fast as possible to make sure there’s no flicker. Such a demanding update requirement is challenging to get right while also handling serial communication without losing data. Secondly, each matrix has 4 pin-outs to wire up. With 4 matrices, there are a total of 16 pin-outs to wire to the controller.

SPI controlled EP0075 LED matrices

Using a WS2812b-based RGB LED matrix

The WS2812b chip solves both of these problems but introduces a new one. Because the WS2812b does not tolerate loose timing for incoming RGB color data, the controller has to temporarily disable interrupts while sending data. This means any data that arrives from the host computer (i.e. the PC) via serial (implemented over USB in our case) during that time will be dropped. We solve this problem by having the controller notify the host that it is ready to receive new frame data by sending a ‘SYNC\r\n’ message. When the host receives this message, it is required to send compressed frame data as soon as possible; it has a minimum of 100 msecs (in the worst-case scenario) to do so. As each byte is received, the controller decompresses it to generate frame data. After a certain interval has elapsed, the controller begins drawing the next frame by communicating with the LED matrix chip.
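
In outline, the controller's loop looks something like this MicroPython-flavoured sketch (not the actual firmware; the panel size and the decompression helper are placeholders):

    from microbit import uart, pin0, sleep
    import neopixel

    NUM_LEDS = 256                        # placeholder panel size, e.g. 16x16
    strip = neopixel.NeoPixel(pin0, NUM_LEDS)

    def decompress_into(strip, data):
        # Placeholder: the real firmware expands the compressed stream
        # (described in the next section) into (r, g, b) values here.
        pass

    while True:
        uart.write('SYNC\r\n')            # tell the host we can take a frame
        sleep(100)                        # the host's worst-case 100 ms window
        data = uart.read() or b''         # collect whatever arrived
        decompress_into(strip, data)
        strip.show()                      # timing-critical; interrupts unusable here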

Frame compression

When transmitting a frame over serial from host to controller, we compress the RGB frame data using a lossy compression followed by an adapted version of run-length encoding (RLE). Using frame compression instead of sending uncompressed data has two advantages. First, there is less data to send, so the host is unlikely to exceed the 100 msec timing window. Secondly, we can use invalid compression bit sequences to interleave commands with frame data. In this project, I make use of the invalid bit sequence ‘xxxxx000’ to make the controller reset its software state while also setting the frame display interval to the ‘enum’ value encoded by the ‘xxxxx’ integer.

Kind of bit sequence     Byte sequence (8-bit)
Channel color            dddddrrr
Reset command            xxxxx000

Here d is a color-data bit, r is a repeat-count bit and x is a frame-display-interval bit.

The frame compression algorithm works as follows. Given 3 bytes of 8-bit RGB color data per pixel, we first perform a lossy compression of each channel’s color data down to 5 bits by an integer division by 8 (i.e. a right shift by 3). On the receiving end, the controller decompresses this by multiplying by 8 and adding a random dithering value. After this lossy compression, we apply an adapted RLE-based compression. For the compressed byte of each channel, we count the number of times this byte appears spatially forward in the same channel, up to a maximum of 7 (i.e. 2^3 – 1). This count is stored in the 3 bits we gained for each channel byte from the lossy compression. We then skip over the repetitions we found and continue the same process. This algorithm is implemented in ‘Compressor.py’.
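
Below is my own compact sketch of this scheme (see 'Compressor.py' in the repo for the real implementation). I assume here that the repeat field stores the number of occurrences, from 1 to 7, which is what leaves 'xxxxx000' free as an invalid sequence for commands:

    def compress_channel(values):
        # values: one channel's 8-bit colour data, pixel by pixel
        out = []
        i = 0
        while i < len(values):
            c5 = values[i] >> 3                      # lossy: 8-bit -> 5-bit colour
            run = 1
            while (run < 7 and i + run < len(values)
                   and values[i + run] >> 3 == c5):  # same 5-bit value ahead
                run += 1
            out.append((c5 << 3) | run)              # pack as dddddrrr
            i += run
        return bytes(out)

    def decompress_channel(data):
        out = []
        for byte in data:
            c8 = (byte >> 3) << 3                    # scale back (dithering omitted)
            out.extend([c8] * (byte & 0b111))        # replay the run
        return out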

Controller program

The controller program for the host can run in two modes. When used without any arguments, it runs in GUI mode with a dialog box as follows.

GUI mode of the controller program

The program can also run in daemon mode. This mode can be used to run the controller in the background with Task Scheduler at user login. To run in daemon mode, pass arguments as follows: --as-daemon '<COM port>' '<Full name of a function, as defined by name() in each function module>'. For example: python main.py --as-daemon 'COM5' 'CPU and GPU usage monitor'. Similarly, to cleanly terminate a background daemon process, pass the --kill-daemon command line parameter.

As of the time of writing this post, two functions have been implemented for the controller program:

CPU and GPU usage monitor

NOTE: This function plugin assumes you have an NVIDIA GPU and uses the nvml Python library to query GPU usage information. The library assumes ‘nvml.dll’ is available at ‘Program Files\NVIDIA Corporation\NVSMI’, but this DLL may be installed in the ‘System32’ directory by NVIDIA’s driver installer. So, instead of copying this DLL, creating a symbolic link with the ‘mklink’ command is advisable.
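
For example, from an elevated command prompt, something along these lines creates the link (the exact paths here are assumptions; adjust them to your system):

    mklink "C:\Program Files\NVIDIA Corporation\NVSMI\nvml.dll" "C:\Windows\System32\nvml.dll"

mklink creates a symbolic link at the first path pointing at the second, so the DLL is visible in both places without keeping a copy.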

Music visualization using FFT on stereo audio output of the host

Final notes

The daemon program can be scheduled to run at a user’s login event using Windows Task Scheduler. The program handles system shutdown/user log-off events by cleanly releasing its resources and setting the LED matrix panel to the OFF state.

Bayes’ Theorem by intuition

There are numerous articles all over the internet explaining Bayes’ theorem. It is thought to be the basis of logical reasoning and is important for all of us to understand. Yet most of us find mathematical symbols confronting and the examples used in explanations hard to follow. So, this post is my attempt to explain the theorem as intuitively and plainly as possible.

Mathematically, it is defined as:

P(X|Y) = \frac{P(Y|X) \times P(X)}{P(Y)}

where P(X) is the probability of event X occurring, while P(X|Y) is the probability of event X happening given that event Y has happened.

To put things into perspective, let’s discuss two simple examples:

Example using percentage data

In a fictional suburb called Livingstone, the census says that on average 70% of the population suffers from a drinking problem. Now imagine you and I are driving around this region. I point to a random person walking on the street and ask you the likelihood that this person has a drinking problem. It is very likely that you’ll say 70%, or that the probability is 0.7 (going from percentage to probability by dividing by 100).

Well, if the person was a man of, say, 18 to 30 years of age, this would most likely be right and you’d be quite confident in your answer. But what if I told you it was a 5-year-old child instead? Saying that a 5-year-old child has a 70% likelihood of having a drinking problem is obviously absurd. Now we see the problem. Yes, I deliberately gave you incomplete information so that you would trick yourself by making false assumptions when you read the words ‘average’ and ‘person’ in the description above.

So, what should you have done to be ‘closer’ to a correct estimate? For one, not assume that one statistic – here a percentage – is enough to tell the full truth. This is why there is a whole branch of mathematics called statistics. Secondly, when someone quotes an average measure of a dataset, it is usually cumulative over different distributions, so we must correct it for our target distribution – which in the latter case was children. This second requirement is exactly what Bayes’ theorem captures.

Example using evidence to support a hypothesis

Bayes’ theorem is usually used for reasoning about how well a hypothesis is supported by evidence. So, let’s now explore the implications of this theorem for supporting a hypothesis using evidence.

You have a cat and a dog as pets. When you return home from work, you see that the milk box is toppled and leaking (observation). You know that cats like milk more (prior information, before looking at the evidence). But then you see some milk on the dog’s face (evidence).

Before we continue, let us evaluate this. Notice that if we don’t consider the evidence, the cat seems like the likely suspect, but when we do, the dog becomes the suspect. The similarity with the previous example should also be evident: there, before we knew that the ‘person’ was a child, we were confident that the answer was 0.7, but after we knew, we doubted that this was the right answer. This is Bayes’ theorem in summation.

This whole scenario can be shown in a probability diagram as follows:

[Diagram: the probability spaces of ‘Hypothesis is true’ and ‘Evidence is true’, overlapping in their intersection]

From the diagram above, we can see that we want to take the part where the evidence agrees with the hypothesis (represented in the diagram by the intersection of P(Evidence is true) and P(Hypothesis is true)) from the probability space of the evidence being true to the probability space of the hypothesis being true given that the evidence is true.

How? First, we should know that the intersection has contributions from both the P(Evidence is true | Hypothesis is true) and P(Hypothesis is true) probability spaces. What we need is the contribution only from P(Evidence is true | Hypothesis is true). For that, we divide the intersection probability by P(Hypothesis is true):

P(\text{Evidence is true}\mid\text{Hypothesis is true}) = \frac{P(\text{Evidence is true and Hypothesis is true})}{P(\text{Hypothesis is true})}

Now we will want to rescale this from P(Evidence is true) to P(Hypothesis is true), in the next section.

Rewriting the equation

We can rewrite the original mathematical Bayes’ theorem equation as:

P(X|Y) = P(Y|X) \times \frac{P(X)}{P(Y)}

Or, written out in words:

P(\text{Hypothesis}\mid\text{Evidence}) = P(\text{Evidence}\mid\text{Hypothesis}) \times \frac{P(\text{Hypothesis})}{P(\text{Evidence})}

If you are not familiar, this rewrite simply uses the associative property of multiplication.

In the rewrite above, how the terms are grouped around the ‘*’ (multiplication) operator is important, as we will see. Multiplication is associative (meaning ‘(A*B)/C’ equals ‘A*(B/C)’), and this property ‘obscures’ the implied meaning when the formula is written in its original form. This is unfortunate, because mathematical statements are supposed to convey truths simply.

First, a minor detour to drive the point home. Remember how we calculate a percentage? If you score 7 out of 10 in a test, your percentage is calculated as:

\text{percentage} = 7 \times \frac{100}{10} = 70\%

Now, notice the similarity with the rewritten equation above. This similarity is not coincidental. When calculating a percentage, we are moving our scale from ‘out of 10’ to ‘out of 100’.

Similarly, in the first example involving percentages, we move from the average rate of drinking problems across all age groups to the rate among a specific age group. And in the second example, we move from the prior belief that cats prefer milk to how well the evidence supports that belief, and away from it because the evidence against it was strong.
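
To make the rescaling concrete, here is the first example worked through with made-up numbers (mine, purely for illustration): suppose 70% of residents have a drinking problem, 10% of residents are children, and children make up 0.1% of those with a drinking problem. Then:

P(\text{drinker}\mid\text{child}) = P(\text{child}\mid\text{drinker}) \times \frac{P(\text{drinker})}{P(\text{child})} = 0.001 \times \frac{0.7}{0.1} = 0.007

So conditioning on the person being a child rescales the headline 70% down to just 0.7%.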

For the first example, if thinking in terms of probability is difficult, the following might make it easier to understand:

[Figure: bayestheorem6]

Summation

We start with an initial belief called the prior. As we collect evidence, its likelihood of being true allows us to shift our belief either towards or away from the prior.

UWP Custom control: Expandable Row ListView

I was working on a Universal Windows Platform app where I needed a ListView whose rows can be expanded to reveal further data underneath them. I believe these controls are called DataRowTables in web UI programming. So, I decided to write one.

ExpandableRowListView usage animation

To use it, you can either bind its ‘ItemsSource’ to an ‘IEnumerable<ExpandableRowListViewItem>’, add static items in XAML, or add static items in code using the ‘ExpandableRowListView.Items.Add(…)’ method. These usages are shown in the project containing the source code.

Source code