# Today I Learned

Some of the things I've learned every day since Oct 10, 2016

## 221: A Weird Dynamic Programming Algorithm

Problem:

Given an array $S$ of integers (or reals, equivalently), with $|A| = n$, return an array $T$ with $|T| = n$ such that $T_i$ is the smallest value $j - i$ such that $j > i, S_j > S_i$ if such a $j$ exists, and otherwise is 0.

Solution:

Set $T_n = 0$. For $i = n - 1, \dots, 1$, initialize $j = i + 1$. While $S_j \leq S_i$ and $T_j \neq 0$ [using short-circuiting on this boolean], update $j = j + T_j$. After this loop terminates, set $T_i = 0$ if $T_j = 0$ and $T_i = j - i$ otherwise.

Reason it works:

If we’ve found a $j > i$ such that $S_j > S_i$, we’re done and $j - i$ will be the distance we want. Otherwise, we know $S_j \leq S_i$ and thus the next element higher than $S_i$ will at the soonest be at $j + T_j$, so we can safely skip $j$ ahead by that amount. If we find a $0$ at $T_j$ and $S_j \leq S_i$, we know we’re never going to find what we’re looking for and so we can stop looking.

Runtime (the weird part):

At first glance it’s not clear why this algorithm gives runtime better than $O(n^2)$. After all, it seems in its worst case to approximate a basic nested $for i = 1, \dots, n, for j = i, \dots, n$. However, if you denote $d$ to be the number of indices $k$ such that $S_k \geq S_{k + 1}$ and $u$ to be the number such that $S_k < S_{k + 1}$, and consider that $n = d + u$, it becomes clear that the only indices for which the $j$ update will do a non-constant amount of work is on a subset of $u$. Based on this, you can actually argue that the number of $j$ updates over the course of the algorithm is $O(u) = O(n)$.

This algorithm is a neat example of a non-vanilla dynamic programming solution where the solution building step seems to be doing a non-constant amount of work, but in fact over the course of the algorithm the amortized cost of each step is $O(1)$.

## 220: Hamming Distance, Zero Norm, and Sparsity

The Hamming distance between two vectors $x, y$ of equal length is the number of indices at which their values differ. It can alternatively be expressed as the zero-norm of their difference: $||\mathbf{x} - \mathbf{y}||_0$. (The zero norm is the number of nonzero elements in a vector.)

One important thing to note is that the zero norm is not actually a norm in the formal sense. Namely, it isn’t homogenous, meaning it doesn’t satisfy $||a\mathbf{v}||_0 = |a| ||\mathbf{v}||_0$ for a scalar $a$.

In some applications it’s desirable to have a sparse approximation to a vector, like in the weight vector $\mathbf{w}$ for a least-squares solution, for reasons like saving on computation time. For these applications we are trying to satisfy some sort of constraint akin to $||\mathbf{w}||_0 \leq k$ for some $k$.

Since the zero norm isn’t convex, optimizing a solution using a constraint like this isn’t computationally feasible. However, a range of techniques seek to minimize the zero norm of the solution by relaxing this constraint. One example of this is the LASSO method, which simply relaxes the above constraint to be $||\mathbf{w}||_1 \leq k$, which encourages but does not guarantee sparsity in $\mathbf{w}$.

## 219: Gini Coefficient

The Gini Coefficient (or Gini Index) is a statistical measure of income/wealth inequality. A fair intuition for it is that for a given population, a Gini index $G$ is a number in the range [0,1] and the closer the number is to 1, the less evenly income/wealth is distributed.

Assume for a population with $n$ people you have amounts of wealth $x_i \in \mathbb{R}^+$ for each person $i$. Then the Gini index for that population is

$G = (\sum _{i = 1} ^n \sum _{j = 1} ^n |x_i - x_j|) / (2n\sum _{i = 1} ^n x_i)$.

Note that disregarding the denominator which is just a normalizing factor, this is simply the sum of the wealth difference across all pairs $(x_i, x_j)$ of people.

As with any attempt to describe a real distribution in a single real number, this measure has its problems. One expected result of the lack of injectivity in this reduction is that drastically different distributions have very similar or even identical coefficients.

## 218: Random Forests

In machine learning, random forests are a tool used to reduce the variance of classification using decision trees.

One of the most common problems with decision tree classification is that of overfitting $\approx$ high variance. There are a number of ways to mitigate this in a single decision tree, such as pruning the decision tree so long as validation error doesn’t decrease and using smart stopping criteria (such as limiting the depth of the tree).

Rather than work with a single tree, random forests mitigate variance by creating several randomized decision trees from the available data, and then at classification time have these trees vote on what class the data point belongs to.

Question: How do we ‘randomize’ decision trees over a single data set of size $n$, such that their expected accuracy is still that of a deterministic one?

Answer: The most popular approach is called bagging (bootstrap aggregating): for each tree we take a random subset of $k < n$ data points and build the decision tree deterministically as usual. This way, the expectation of any individual tree’s classification of a point $x$ will be that of a tree trained on the whole data set. Moreover, when we classify a point $x$ by having the random trees vote on it, the classification is more robust (lower variance). In fact, the variance scales down inversely-linearly with the number $m$ of trees in our forest:

$(\sigma^2)' = \frac{\sigma^2}{m}$.

## 217: Convex Functions

A function $f \mathbb{R}^n \rightarrow \mathbb{R}$ (or in general from a convex vector space to $\mathbb{R}$ is said to be convex if

$f(tx_1 + (1-t)x_2) \leq tf(x_1) + (1-t)f(x_2)$

for $t \in [0, 1]$ and $x_1, x_2 \in \mathbb{R}^d$. $f$ is called strongly convex if the above inequality is strict.

Taking $d = 1$ makes this notion easier to visualize. The intuition is that a convex function has the property that if you interpolate a line segment between the points $(x_1, f(x_1)), (x_2, f(x_2))$ then the line segment will never dip below the ‘surface’ of the function itself.

Convex functions can have a number of nice resulting properties, depending on context. For instance, any local minimum over $w$ for the loss function $L(w) = ||Xw - y||_2^2$ for ordinary least squares is also a global minimum because the function is convex. Further, ridge regression’s loss function $L'(w) = ||Xw - y||_2^2 + ||w||_2^2$ is strictly convex and thus has a unique local minimum which is global.

## 216: Universal Approximation Theorem (Neural Nets)

The Universal Approximation Theorem (due to Kurt Hornik and others) essentially states that you can approximate an arbitrary continuous function on a compact subset of $\mathbb{R}^n$ with a neural net to arbitrary precision. (This is similar to the way single-variable polynomials are universal approximators of continuous functions on $\mathbb{R}$.) In fact, you can do this with a net with just a single hidden layer. One possible statement of the theorem is below.

Theorem: Suppose $\sigma : \mathbb{R} \rightarrow \mathbb{R}$ is non-constant, bounded, non-decreasing, and continuous, and that $S \subset \mathbb{R}^d$ is closed and bounded (compact). Then for any continuous function $f : S \rightarrow \mathbb{R}$ and $\epsilon > 0$, there exists a 2-layer neural network of the form

$h(\mathbf{x}) = \sum _{j=1} ^k c_j \sigma(\mathbf{a}_j^T \mathbf{x} + b_j)$

for some real vectors $\mathbf{a}_j$ and scalars $b_j, c_j$, such that

$|h(\mathbf{x}) - f(\mathbf{x})| < \epsilon$

for all $\mathbf{x} \in S$.

Note that a nonlinear function such as $\sigma$ is not even necessary at the output layer for this to work.

More important to note is that while this theorem guarantees there exists a network that approximates any function, it says nothing about how we train this network or obtain its parameters. Hence why we don’t just use the above network to solve all our problems.

An unrolled linked list is a data structure which is a sort of hybrid of an array and a linked list. Typically, a linked list node containing integer values would have attributes something like

int value

Node next.

The unrolled linked list simply replaces the single value of each node with an array of a fixed size, making its attributes look something like

Array values

int numberValues

Node next.

Below is a small unrolled linked list containing ordered integers.

Note that all node arrays have the same size, but not all (in fact none) of them are full. The way one decides to handle this depends on the application, but typically when one array fills up it can be split up into 2 new nodes. Point being that you can re-distribute elements between nodes however is convenient for the problem you’re solving.

One advantage of unrolled linked lists over regular linked lists is that they reduce the amount of memory overhead per element. (In a linked list, the overhead associated with the Node object typically far outweighs the amount of memory actually required to store the value itself.) Another is that in the case of ordered values, they make linear search faster than with a traditional linked list, up to a constant factor. The reason should hopefully be obvious.

## 214: Orthogonal Matrices

In linear algebra, an orthogonal matrix is a square matrix $A$ such that

$AT^T = A^T A = I$.

Equivalently, the rows and columns of $A$ are, respectively, orthonormal vectors: vectors which are all unit length and orthogonal to one another.

Also equivalently, $A^{-1} = A^T$.

A fun fact about orthogonal matrices is that their determinants are always $\pm 1$.

## 213: B-trees

A B-tree is a type of balanced search tree which is well-suited and commonly used for storing large amounts of data, such as in a database system. It is described using a positive constant $m \in \mathbb{Z}$, which defines the maximum number of children a single node can have. (Take BSTs as a specific example with $m = 2$.) Using $m$ the tree is defined by the following properties:

• Every leaf node is at the same depth in the tree
• Where a node has $b$ children, it has $b - 1$ values. The values are used as ‘bookmarks’ sandwiched in between the children in a manner generalized from BSTs: the values are sorted, and a child in between 2 values contains all values between them in the tree.
• Every node except possibly the root is at least half-full, having between $\frac{m}{2}$ and $m$ values.
• Each leaf node obviously has no children, but can still hold multiple values up to some arbitrary maximum (usually defined by disk block size or some other sensible constraint, depending on the setting).

As an example, below is an example of a B-tree with $m = 5$ and where leaves have enough room for 3 values each:

Upon inspection one can see that the sorting invariant similar to that used in BSTs is satisfied, e.g. all leaf values in between 15 and 20 are found by going to the child to the left of ’20’ in the root node, then to the child to the right of ’15’.

For obvious reasons, B-trees result in log-time operations. Lookup is straightforward, although insertions and deletions are more involved and difficult to implement. However, B-trees offer advantages over other search trees such as being optimized for reading/writing large blocks of data without too much unnecessary disk access.