# Today I Learned

Some of the things I've learned every day since Oct 10, 2016

## 217: Convex Functions

A function $f \mathbb{R}^n \rightarrow \mathbb{R}$ (or in general from a convex vector space to $\mathbb{R}$ is said to be convex if

$f(tx_1 + (1-t)x_2) \leq tf(x_1) + (1-t)f(x_2)$

for $t \in [0, 1]$ and $x_1, x_2 \in \mathbb{R}^d$. $f$ is called strongly convex if the above inequality is strict.

Taking $d = 1$ makes this notion easier to visualize. The intuition is that a convex function has the property that if you interpolate a line segment between the points $(x_1, f(x_1)), (x_2, f(x_2))$ then the line segment will never dip below the ‘surface’ of the function itself.

Convex functions can have a number of nice resulting properties, depending on context. For instance, any local minimum over $w$ for the loss function $L(w) = ||Xw - y||_2^2$ for ordinary least squares is also a global minimum because the function is convex. Further, ridge regression’s loss function $L'(w) = ||Xw - y||_2^2 + ||w||_2^2$ is strictly convex and thus has a unique local minimum which is global.

## 216: Universal Approximation Theorem (Neural Nets)

The Universal Approximation Theorem (due to Kurt Hornik and others) essentially states that you can approximate an arbitrary continuous function on a compact subset of $\mathbb{R}^n$ with a neural net to arbitrary precision. (This is similar to the way single-variable polynomials are universal approximators of continuous functions on $\mathbb{R}$.) In fact, you can do this with a net with just a single hidden layer. One possible statement of the theorem is below.

Theorem: Suppose $\sigma : \mathbb{R} \rightarrow \mathbb{R}$ is non-constant, bounded, non-decreasing, and continuous, and that $S \subset \mathbb{R}^d$ is closed and bounded (compact). Then for any continuous function $f : S \rightarrow \mathbb{R}$ and $\epsilon > 0$, there exists a 2-layer neural network of the form

$h(\mathbf{x}) = \sum _{j=1} ^k c_j \sigma(\mathbf{a}_j^T \mathbf{x} + b_j)$

for some real vectors $\mathbf{a}_j$ and scalars $b_j, c_j$, such that

$|h(\mathbf{x}) - f(\mathbf{x})| < \epsilon$

for all $\mathbf{x} \in S$.

Note that a nonlinear function such as $\sigma$ is not even necessary at the output layer for this to work.

More important to note is that while this theorem guarantees there exists a network that approximates any function, it says nothing about how we train this network or obtain its parameters. Hence why we don’t just use the above network to solve all our problems.

An unrolled linked list is a data structure which is a sort of hybrid of an array and a linked list. Typically, a linked list node containing integer values would have attributes something like

int value

Node next.

The unrolled linked list simply replaces the single value of each node with an array of a fixed size, making its attributes look something like

Array values

int numberValues

Node next.

Below is a small unrolled linked list containing ordered integers.

Note that all node arrays have the same size, but not all (in fact none) of them are full. The way one decides to handle this depends on the application, but typically when one array fills up it can be split up into 2 new nodes. Point being that you can re-distribute elements between nodes however is convenient for the problem you’re solving.

One advantage of unrolled linked lists over regular linked lists is that they reduce the amount of memory overhead per element. (In a linked list, the overhead associated with the Node object typically far outweighs the amount of memory actually required to store the value itself.) Another is that in the case of ordered values, they make linear search faster than with a traditional linked list, up to a constant factor. The reason should hopefully be obvious.