
# Machine Learning Engineering

These are notes for the 2AMM15 course at TU/e. This content is based on the course material, and thus, may be copyrighted.

## Introduction

### What is ML?

- Learn to perform a task, based on experience (examples) $X$, minimizing error $\mathcal{E}$. E.g. recognize a flower in an image as accurately as possible
- Learn a model $f$ with some parameters $\theta$ that produces the right output $y$
- Part of a much larger system that provides the data $X$ in the right format

$$f_{\theta}(X) = y$$ $$\underset{\theta}{\operatorname{argmin}} \mathcal{E}(f_{\theta}(X))$$

### Inductive bias

- We put assumptions into the model: the **inductive bias** $b$
- What should the model look like?
    - Mimic the human brain: Neural Networks
    - Logical combinations of inputs: Decision Trees, Linear Models
    - Remember similar examples: Nearest Neighbors, SVMs
    - Probability distributions: Bayesian models

- Hyper parameters: user defined settings. E.g. depth of tree, network architecture.
- Assumptions about the data distribution. E.g. $X \sim N(\mu,\sigma)$.

- We can transfer knowledge from previous tasks: $f_1, f_2, f_3, … \Longrightarrow f_{new}$
    - Choose the right model and hyper parameters
    - Reuse previously learned values for the model parameters $\theta$

- In short: $\underset{\theta,b}{\operatorname{argmin}} \mathcal{E}(f_{\theta, b}(X))$

### Machine Learning vs Statistics

- Both:
    - Predictions of natural phenomena
- Statistics
    - Help humans understand the world
    - Assume data is generated according to an understandable model
- Machine learning
    - Automate a task entirely
    - Assume the data generation process is unknown
    - Engineering-oriented, less mathematical theory

### Types of Machine Learning

- Many times, we combine more than one type

**Supervised Machine Learning**

- Learn a model $f$ from labeled data $(X, y)$, i.e., from ground truth
- Supervised: we know the correct/desired outcome (label)
**Classification**

- Predict a class label (category), discrete and unordered
- Can be **binary** (e.g. spam/not spam) or **multi-class** (e.g. letter recognition)
- Many classifiers can return a **confidence** per class
- The predictions of the model yield a **decision boundary** separating the classes

**Regression**

- Predict a continuous, numeric value, e.g. temperature
- Target variable is numeric
- Some algorithms can return a **confidence interval**
- Find the relationship between predictors and the target

**Semi-Supervised Learning**

- Learn a model from few labeled and many unlabeled examples
- Unlabeled examples add information about which new examples are likely to occur

**Unsupervised Machine Learning**

- Explore the structure of the data $X$ to extract meaningful information
- Given inputs $X$, find which ones are special, similar, anomalous, etc
- Unlabeled data, or data with unknown structure
- There exist many subtypes
**Clustering**

- Organize information into meaningful subgroups (clusters)
- Objects in a cluster share a certain degree of similarity, and dissimilarity to other clusters
- E.g. distinguish different types of customers

**Dimensionality reduction**

- Data can be very high-dimensional and difficult to understand, learn from, store, etc.
- Can compress the data into fewer dimensions, while retaining most of the information
- Contrary to feature selection, the new features lose their original meaning
- The new representation can be a lot easier to model and visualize

**Reinforcement Learning**

- Develop an agent that improves its performance based on interactions with the environment. E.g. Chess, Go, etc
- Search a large space of actions and states
- A **reward function** defines how well a series of actions works
- Learn a series of actions (a policy) that maximizes reward through exploration

### Learning = Representation + Evaluation + Optimization

ML algorithms consist of 3 components:

**Representation**: a model $f_{\theta}$ must be represented in a formal language that the computer can handle

- Defines the “concepts” it can learn, the **hypothesis space**
- E.g. a decision tree, neural network, set of annotated data points

**Evaluation**: an internal way to choose one hypothesis over the other

- Objective function, scoring function, loss function $\mathcal{L}(f_{\theta})$
- E.g. the difference between the correct output and the predictions

**Optimization**: an efficient way to search the hypothesis space

- Start from a simple hypothesis, extend (relax) it if it doesn’t fit the data
- Start with an initial set of model parameters, gradually refine them
- Many methods, differing in speed of learning, number of optima, etc.

**Example (neural networks)**

- Representation: 2 model parameters ($\theta_0,\theta_1$)
- Evaluation: a loss function $\mathcal{L}(f_{\theta})$ computes how good the predictions are
- Estimated on a set of training data with the correct predictions
- We can’t see the full surface, only evaluate specific sets of parameters

- Optimization: find the optimal set of parameters
- Usually a type of search in the hypothesis space
- E.g. gradient descent with learning rate $\eta$: $\theta_i^{new} = \theta_i - \eta \frac{\partial \mathcal{L}(f_{\theta})}{\partial \theta_i}$
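As a toy illustration of gradient descent, here is a sketch with a hypothetical one-parameter quadratic loss (not a neural network, and not from the course material):

```python
# Minimal gradient descent sketch for a single parameter theta,
# using the toy loss L(theta) = (theta - 3)^2 with minimum at theta = 3.

def loss(theta):
    return (theta - 3.0) ** 2

def gradient(theta):
    # Analytical derivative dL/dtheta = 2 * (theta - 3)
    return 2.0 * (theta - 3.0)

def gradient_descent(theta=0.0, learning_rate=0.1, steps=100):
    for _ in range(steps):
        # theta_new = theta - eta * dL/dtheta
        theta = theta - learning_rate * gradient(theta)
    return theta

theta_star = gradient_descent()
print(theta_star)  # converges towards the minimum at theta = 3
```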

### Overfitting and underfitting

- Easy to create a model that is 100% accurate on training data, but very bad on new data

**Overfitting**: building a model that is too complex for the amount of data we have

- Models peculiarities in the training data (noise, biases, etc.)
- Solve by making the model simpler (regularization), or by getting more data
- Most algorithms have hyper parameters that allow regularization

**Underfitting**: building a model that is too simple given the complexity of the data

- Solve by using a more complex model

- There are techniques to detect overfitting (e.g. bias-variance analysis)
- Build ensembles of many models to overcome both underfitting and overfitting

### Model Selection

- Next to the internal loss function, we need an external evaluation function
- Feedback signal: are we actually learning the right thing? Are we under/overfitting?
- Carefully choose to fit the application
- Needed to select between models (and hyper parameter settings)

- Data needs to be split into training and test sets
- Optimize model parameters on the training set
- Evaluate on independent test set

- Avoid data leakage:
- Never optimize hyper parameter settings on the test data
- Never choose preprocessing techniques based on the test data

- To optimize hyper parameters and preprocessing as well, set aside part of training set as a validation set: keep it hidden during all training
- Do not evaluate final models on training data, except for:
- Tracking whether the optimizer converges (learning curves)
- Diagnosing under/overfitting:
    - Underfitting: **low training and test score**
    - Overfitting: **high training score, low test score**

- On small datasets, use multiple train-test splits to avoid sampling bias. E.g. use cross validation
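The k-fold splitting behind cross-validation can be sketched in a few lines (`k_fold_indices` is a hypothetical helper; in practice a library such as scikit-learn provides this):

```python
# Minimal k-fold cross-validation sketch (pure Python):
# every example ends up in the test set of exactly one fold,
# which reduces the sampling bias of a single train-test split.

def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k (train, test) pairs."""
    indices = list(range(n_samples))
    # Distribute the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

for train, test in k_fold_indices(10, 5):
    print(train, test)
```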

### Better Models

- Algorithms need to correctly transform the inputs to the right outputs
- A lot depends on how we present the data to the algorithm
- Transform data to better representation (aka encoding or embedding)
- Can be done E2E (e.g. deep learning) or by first preprocessing the data (e.g. feature selection/generation)

### Feature Engineering

- Most machine learning techniques require humans to build a good representation of the data
- Especially when data is naturally structured. E.g. table with meaningful columns
- Feature engineering is often still necessary to get the best results
- Feature selection, dimensionality reduction, scaling, etc
- Applied machine learning is basically feature engineering

- Nothing beats domain knowledge (when available) to get a good representation

**Learning Data Transformations E2E**

- It’s hard to extract good features for unstructured data. E.g. images, text
- Deep learning: learn your own representation (embedding) of the data
- Through multiple layers of representation (e.g. layers of neurons)
- Each layer transforms the data a bit, based on what reduces the error

**Curse of Dimensionality**

- Adding lots of features and letting the model figure it out does not work
- Our assumptions (inductive biases) often fail in high dimensions:
- Randomly sampled points in an n-dimensional space, e.g. a unit hypercube
- Almost all points become outliers at the edge of the space
- Distances between any two points will become almost identical

**Consequences**

- For every dimension (feature) you add, you need exponentially more data to avoid sparseness
- Affects any algorithm that is based on distances, e.g. kNN, SVMs, kernel-based methods, tree-based methods
- Blessing of non-uniformity: in many applications, the data lives in a very small subspace
- We can drastically improve performance by selecting features or using lower-dimensional data representations
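The distance-concentration effect can be demonstrated with a small experiment, assuming (as in the hypercube example above) uniformly sampled points:

```python
# Sketch of distance concentration in high dimensions:
# uniform random points in a unit hypercube.
import math
import random

def pairwise_distance_spread(n_points, dim, seed=0):
    """Relative spread (max - min) / min of all pairwise Euclidean distances."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points) for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# The relative spread shrinks as the dimension grows: distances become
# almost identical, which hurts distance-based methods such as kNN.
for dim in (2, 10, 100, 1000):
    print(dim, round(pairwise_distance_spread(100, dim), 3))
```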

### More data can beat a cleverer algorithm

- More data reduces the chance of overfitting
- Less sparse data reduces the curse of dimensionality
- Non-parametric models: number of model parameters grows with amount of data
- Tree-based techniques, kNN, SVM, etc
- They can learn any model given sufficient data, but can get stuck in local minima

- Parametric (fixed size) models: fixed number of parameters
- Linear models, neural networks, etc
- Can be given a huge number of parameters to benefit from more data
- Deep learning models can have millions of weights, learn almost any function

- The bottleneck is shifting from data to compute/scalability

### Building ML systems

**Preprocessing**: raw data is rarely ideal for learning

- Feature scaling: bring values into the same range
- Encoding: make categorical features numeric
- Discretization: make numerical features categorical
- Label imbalance correction, e.g. downsampling
- Feature selection: remove uninteresting/correlated features
- Dimensionality reduction can also make data easier to learn
- Using pre-learned embeddings, e.g. word-to-vector, image-to-vector

**Learning and evaluating**: split the data into training, validation and test sets

**Prediction**

- The final optimized model can be used for prediction
- The expected performance is the performance measured on an independent test set

- Together they form a workflow or **pipeline**
- There exist ML methods to automatically build and tune these pipelines
- Pipelines need to be optimized continuously
    - **Concept drift**: the phenomenon you are modelling can change over time
    - **Feedback**: the model's predictions may change future data
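As an illustration of the feature-scaling step mentioned under preprocessing, a minimal sketch (`min_max_scale` and `standardize` are hypothetical helper names, not a specific library's API):

```python
# Two common feature-scaling schemes, applied to one feature column.

def min_max_scale(values):
    """Bring values into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Scale to zero mean and unit variance (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
print(standardize([10, 20, 30]))
```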

## k-Nearest Neighbor

- Make a prediction: find the $k$ closest data points in the training dataset
- Classification: predict the most frequent class of the $k$ neighbors
- Regression: predict the average of the values of the $k$ neighbors
- Both can be weighted by the distance to each neighbor

- Main hyper parameters:
- Number of neighbors $k$ (acts as a regularizer)
- Choice of distance function: Euclidean, etc
- Weighting scheme: uniform, distance, etc

- Model:
- Representation: store training examples. E.g. in kd-tree
- Typical loss functions:
- Classification: accuracy (zero-one loss)
- Regression: root mean squared error

- Optimization: none (no model parameters to tune)
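The prediction step described above can be sketched in pure Python (a naive implementation; real use would rely on an optimized library such as scikit-learn's `KNeighborsClassifier`):

```python
# Naive k-nearest-neighbor classification: find the k closest training
# points and predict their most frequent class.
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the majority class among the k closest training points."""
    neighbors = sorted(zip(X_train, y_train), key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated classes in 2D
X_train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_train = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X_train, y_train, (0.5, 0.5)))  # "a"
print(knn_predict(X_train, y_train, (5.5, 5.5)))  # "b"
```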

### Classification

- $k=1$: look at the nearest neighbor only. Likely to overfit
- $k>1$: do a vote and return the majority or confidence value for each class
- Few neighbors => high model complexity, complicated decision boundary
- Many neighbors => low model complexity, smoother decision boundary

- Nearest Shrunken Centroid
- An alternative to kNN
- Nearest centroid: represents each class by the centroid of its members
- Parametric model, while kNN is non-parametric

- Regularization is possible with the shrink threshold parameter
- Shrinks (scales) each feature value by the within-class variance of that feature
- Soft thresholding: if feature value falls below threshold, it is set to 0
- Effectively removes noisy features

- Suffers when data is not “convex”
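The plain nearest-centroid idea (without the shrinkage step) can be sketched in a few lines; `fit_centroids` and `predict_centroid` are hypothetical helper names:

```python
# Nearest-centroid classifier sketch: represent each class by the mean
# of its members, then predict the class with the closest centroid.
import math
from collections import defaultdict

def fit_centroids(X_train, y_train):
    grouped = defaultdict(list)
    for x, y in zip(X_train, y_train):
        grouped[y].append(x)
    # Per-class mean of each feature column
    return {
        label: tuple(sum(col) / len(points) for col in zip(*points))
        for label, points in grouped.items()
    }

def predict_centroid(centroids, x):
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

centroids = fit_centroids([(0, 0), (0, 2), (4, 4), (4, 6)], ["a", "a", "b", "b"])
print(centroids)                            # {'a': (0.0, 1.0), 'b': (4.0, 5.0)}
print(predict_centroid(centroids, (1, 1)))  # "a"
```

Note this is a parametric model: its memory does not grow with the number of training examples, only with the number of classes.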

**Scalability**

- With n = nr examples and p = nr features
- Nearest shrunken centroid
- Fit: $O(n * p)$
- Memory: $O(nrclasses * p)$
- Predict: $O(nrclasses * p)$

- Nearest neighbors (naive)
- Fit: $O(1)$ (no training, just store the data)
- Memory: $O(n * p)$
- Predict: $O(n * p)$

- Nearest neighbors (with KD trees)
- Fit: $O(p * n log n)$
- Memory: $O(n * p)$
- Predict: $O(k * log n)$

### Regression

- $k=1$: return the target value of the nearest neighbor. Likely to overfit
- $k>1$: returns the mean of the target value of the $k$ nearest neighbors
- Small $k$ leads to overly complex (overfitting) model
- A larger $k$ tends to yield a smoother fit

### Pros and Cons

- Easy to understand and works well in many settings
- Training is very fast, but predicting is slow for large datasets
- Bad at high-dimensional and sparse data (Curse of Dimensionality)
- Nearest centroid is a useful parametric alternative, but only if data is near linearly separable

## Linear Models

- Predictions using a linear function of the input $X$: $f_{\mathbf{w}}(\mathbf{x}) = \sum_{i=1}^{p} w_i \cdot x_i + w_{0}$
- Learn $w$ from $X$, given a loss function $\mathcal{L}$: $\underset{\mathbf{w}}{\operatorname{argmin}} \mathcal{L}(f_\mathbf{w}(X))$
- Many algorithms with different loss functions: Least squares, ridge, lasso, logistic regression, linear SVMs, etc
- Can be very powerful (and fast), especially for large datasets with many features
- Can be generalized to learn non-linear patterns: Generalized Linear Models
- Features can be augmented with polynomials of the original features
- Features can be transformed according to a distribution (Poisson, Tweedie, Gamma, etc)
- Some linear models (e.g. SVMs) can be kernelized to learn non-linear functions

### Regression

- Prediction formula for input features $\mathbf{x}$:
    - $w_1$…$w_p$ are usually called **weights** or **coefficients**, $w_0$ the **bias** or **intercept**
    - Assumes that the errors are $N(0,\sigma)$

$$\hat{y} = \mathbf{w}\mathbf{x} + w_0 = \sum_{i=1}^{p} w_i \cdot x_i + w_0 = w_1 \cdot x_1 + w_2 \cdot x_2 + … + w_p \cdot x_p + w_0 $$

**Linear Regression (Ordinary Least Squares)**

- Loss function is the sum of squared errors (SSE) between predictions $\hat{y}_i$ and true regression targets $y_i$ on the training set

$$\mathcal{L}_{SSE} = \sum_{n=1}^{N} (y_n-\hat{y}_n)^2 = \sum_{n=1}^{N} (y_n-(\mathbf{w}\mathbf{x_n} + w_0))^2$$
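For a single feature, minimizing this SSE loss has a well-known closed-form solution ($w = \operatorname{cov}(x,y)/\operatorname{var}(x)$, $w_0 = \bar{y} - w\bar{x}$); a sketch with hypothetical noise-free data:

```python
# Ordinary least squares for one feature, via the closed-form solution
# that minimizes the sum of squared errors.

def ols_fit(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var               # slope
    w0 = mean_y - w * mean_x    # intercept
    return w, w0

# Hypothetical data generated from y = 2x + 1 without noise,
# so OLS recovers the true line exactly.
w, w0 = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(w, w0)  # 2.0 1.0
```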

**TODO**

## Kernelization

## Model Selection

## Ensemble Learning

## Decision Trees

- Representation: assume we can represent the concept we want to learn with a decision tree
- Repeatedly split the data based on one feature at a time.
- Oblique trees can split on combinations of features.

- Evaluation/loss: one tree is better than another tree according to some heuristic
- Classification: instances in a leaf are all of the same class (pure leaves)
- Regression: instances in a leaf have values close to each other

- Optimization: recursive, heuristic greedy search (Hunt’s algorithm)
- Make the first split based on a heuristic
- In each branch, repeat splitting in the same way

**Heuristics**

- We start from a dataset of $n$ points
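One widely used heuristic for classification splits is Gini impurity; a sketch as an illustration (not necessarily the exact heuristic this course covers in the section above):

```python
# Gini impurity: measures how mixed the class labels in a node are
# (0 = pure leaf). A split is scored by the weighted impurity of its children.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Weighted impurity of a candidate split; lower is better."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["a", "a", "a", "a"]))              # 0.0 (pure leaf)
print(gini(["a", "a", "b", "b"]))              # 0.5 (maximally mixed, 2 classes)
print(split_impurity(["a", "a"], ["b", "b"]))  # 0.0 (perfect split)
```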

**TODO**

## Data Preprocessing

## Bayesian Learning

## Neural Networks

## Convolutional Neural Networks

## Neural Networks for Text
