Today I Learned

Some of the things I've learned every day since Oct 10, 2016

Category Archives: machine learning

216: Universal Approximation Theorem (Neural Nets)

The Universal Approximation Theorem (due to Kurt Hornik and others) essentially states that you can approximate an arbitrary continuous function on a compact subset of \mathbb{R}^n with a neural net to arbitrary precision. (This is similar to the way single-variable polynomials are universal approximators of continuous functions on closed intervals of \mathbb{R}, by the Weierstrass approximation theorem.) In fact, you can do this with a net with just a single hidden layer. One possible statement of the theorem is below.

Theorem: Suppose \sigma : \mathbb{R} \rightarrow \mathbb{R} is non-constant, bounded, non-decreasing, and continuous, and that S \subset \mathbb{R}^d is closed and bounded (compact). Then for any continuous function f : S \rightarrow \mathbb{R} and \epsilon > 0, there exists a 2-layer neural network of the form

h(\mathbf{x}) = \sum _{j=1} ^k c_j \sigma(\mathbf{a}_j^T \mathbf{x} + b_j)

for some real vectors \mathbf{a}_j and scalars b_j, c_j, such that

|h(\mathbf{x}) - f(\mathbf{x})| < \epsilon

for all \mathbf{x} \in S.

Note that a nonlinear function such as \sigma is not even necessary at the output layer for this to work.

More important to note is that while this theorem guarantees there exists a network approximating any such function, it says nothing about how we train this network or obtain its parameters. This is why we can't simply use the above network to solve all our problems.
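As a toy illustration of the theorem (a hedged sketch, not part of the original post), the network h above can be built directly in NumPy: random hidden-layer parameters a_j, b_j are drawn, and the output coefficients c_j are then fit by least squares rather than by any training procedure. The target function sin(2πx) and all the sizes here are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on the compact set [0, 1].
f = lambda x: np.sin(2 * np.pi * x)

# Hidden layer of k sigmoid units: h(x) = sum_j c_j * sigma(a_j * x + b_j).
k = 200
a = rng.normal(scale=10.0, size=k)   # weights a_j (random, not trained)
b = rng.normal(scale=10.0, size=k)   # biases b_j (random, not trained)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.linspace(0.0, 1.0, 500)
H = sigma(np.outer(x, a) + b)        # hidden activations, shape (500, k)

# Solve for the output coefficients c_j by least squares. The theorem
# guarantees good c_j exist; it says nothing about how to find them --
# here we simply cheat and fit them directly against the target.
c, *_ = np.linalg.lstsq(H, f(x), rcond=None)

err = np.max(np.abs(H @ c - f(x)))
print(f"max |h(x) - f(x)| on the grid: {err:.4f}")
```

Note that the output layer here is purely linear in the c_j, consistent with the remark above that no nonlinearity is needed at the output.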


195: Distributional Key-Finding

One approach to the task of automatically determining the key of a sample of music (assumed to be within the conventions of 'Western' chromatic tuning and major/minor tonality) is called distributional key-finding. Essentially, this involves taking an audio sample and converting it into a discretized sequence of the notes occurring in each time interval via some signal-processing method such as the fast Fourier transform (e.g. [E4, G4, F#4, D4, E4] in a monophonic model). This sequence is then reduced to a sequence of pitch classes (e.g. [E, G, F#, D, E], continuing the previous example), and finally normalized into a distribution across chromatic pitch classes.

This distribution is essentially a vector which says what proportion of time each of the 12 chromatic pitch classes (A, A#, B, C, etc.) was being used. This distribution is then used to determine which key the sample likely belongs to.
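A minimal sketch of this reduction step (assuming note names written in scientific pitch notation, as in the example above):

```python
from collections import Counter

# The 12 chromatic pitch classes, starting from A as in the post.
PITCH_CLASSES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def pitch_class_distribution(notes):
    """Reduce a note sequence (e.g. ['E4', 'G4', 'F#4']) to a normalized
    12-vector giving the proportion of each pitch class used."""
    classes = [note.rstrip("0123456789") for note in notes]  # drop octave digits
    counts = Counter(classes)
    total = len(classes)
    return [counts[pc] / total for pc in PITCH_CLASSES]

dist = pitch_class_distribution(["E4", "G4", "F#4", "D4", "E4"])
# E appears twice out of five notes, so its entry is 0.4.
```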

The reason this works is that music in a given key will tend to use some notes more than others. For instance, a melody in E minor will make the most use of its tonic and dominant notes E and B, followed by its mediant and supertonic notes G and F#, and so on. This is illustrated in the figure below, taken from this paper.


This is in fact a very crude reduction of how key identification actually takes place in the musically trained ear. Despite being an oversimplification, however, it often works well for automatically detecting keys, since the distributions produced by different keys are fairly distinctive. This also lends the approach well to techniques such as using the distribution as a feature vector in a neural net.
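One common way to turn such a distribution into a key estimate (the Krumhansl-Schmuckler approach, which may or may not be the exact method used in the paper referenced above) is to correlate it against a rotated "profile" for each of the 24 major and minor keys and pick the best match. A sketch, using the Krumhansl-Kessler profile values as commonly quoted in the key-finding literature:

```python
import numpy as np

# Krumhansl-Kessler key profiles (tonic first), commonly used in
# distributional key-finding; values as quoted in the literature.
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def best_key(dist):
    """Pick the key whose rotated profile correlates best with the
    observed pitch-class distribution (indexed from C)."""
    dist = np.asarray(dist, dtype=float)
    best, best_r = None, -np.inf
    for tonic in range(12):
        for profile, mode in ((MAJOR, "major"), (MINOR, "minor")):
            r = np.corrcoef(dist, np.roll(profile, tonic))[0, 1]
            if r > best_r:
                best, best_r = f"{NAMES[tonic]} {mode}", r
    return best

# A made-up distribution concentrated on the E natural-minor scale,
# with the tonic E and dominant B emphasized as described above.
em = np.zeros(12)
for pc in ["E", "F#", "G", "A", "B", "C", "D"]:
    em[NAMES.index(pc)] = 1.0
em[NAMES.index("E")] = 3.0
em[NAMES.index("B")] = 2.0
em /= em.sum()
print(best_key(em))  # prints "E minor"
```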

88: Dummy Variables

In statistics, a dummy variable is an independent variable which takes on only the values 0 or 1. It acts as an indicator variable, encoding a categorical attribute of the object in question and adjusting the model accordingly: if the value is 1, the weight of the dummy variable is factored in, but not if the value is 0.
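A tiny hypothetical sketch: a three-level categorical variable encoded as dummy variables (with one level held out as the baseline, to avoid the redundancy known as the dummy variable trap). All names and coefficients here are made up for illustration.

```python
# "north" is the baseline level; the other two levels each get a dummy.
LEVELS = ["south", "west"]

def dummies(region):
    """One dummy per non-baseline level: 1 if it matches, else 0."""
    return [1 if region == lvl else 0 for lvl in LEVELS]

def predict(region):
    """Toy linear model y = 10 + 3*south - 2*west: a dummy's weight is
    only factored in when its value is 1."""
    intercept, weights = 10.0, [3.0, -2.0]
    return intercept + sum(w * d for w, d in zip(weights, dummies(region)))

print(predict("north"))  # 10.0 (baseline: both dummies are 0)
print(predict("south"))  # 13.0 (baseline plus the "south" weight)
```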

76: Probabilistic Classifiers

As used in machine learning, the term probabilistic classifier refers to a function f: X \rightarrow \pi _Y, where X is the set of objects to be classified, Y is the set of classes, and \pi _Y is the set of probability distributions over Y. That is, a probabilistic classifier takes an object to be classified and gives the probability of that object belonging to each of the possible classes.

This contrasts with a non-probabilistic classifier, which instead is simply a function g: X \rightarrow Y assigning a single class to each object. Often such a classifier is derived from a probabilistic one by simply choosing the class with the highest probability.
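As a small sketch (with made-up classes and scores), a softmax over raw per-class scores gives a probabilistic classifier f: X → π_Y, and taking the argmax over its output recovers a non-probabilistic g: X → Y:

```python
import math

CLASSES = ["cat", "dog", "bird"]

def probabilistic_classify(scores):
    """f: X -> pi_Y via softmax: raw per-class scores become a
    probability distribution over the classes Y."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return {c: e / total for c, e in zip(CLASSES, exps)}

def classify(scores):
    """The derived non-probabilistic g: X -> Y: pick the class
    with the highest probability."""
    probs = probabilistic_classify(scores)
    return max(probs, key=probs.get)

dist = probabilistic_classify([2.0, 1.0, 0.1])
# dist is a valid probability distribution over CLASSES (sums to 1)
print(classify([2.0, 1.0, 0.1]))  # prints "cat"
```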

74: Linearly Separable Values (Euclidean)

Two sets X_1, X_2 of points in n-dimensional Euclidean space E^n are linearly separable if and only if there exist a non-zero vector \mathbf{w} \in E^n and a number k such that

\mathbf{w} \cdot \mathbf{x} < k

holds for every \mathbf{x} \in X_1, while

\mathbf{w} \cdot \mathbf{x'} \geq k

holds for every \mathbf{x'} \in X_2. Intuitively, this means that two sets of points in an n-dimensional Euclidean space are linearly separable if there is an (n-1)-dimensional hyperplane that separates the two sets.

(This concept could probably be extended to other spaces sharing the relevant structure with E^n, such as general inner-product spaces, but E^n gives the simplest example.)
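As a sketch of the definition in action (with arbitrary example points, not from the original post), the classic perceptron algorithm is guaranteed to terminate with such a pair (\mathbf{w}, k) whenever the two sets are linearly separable:

```python
import numpy as np

# Two linearly separable point sets in E^2 (arbitrary examples).
X1 = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
X2 = np.array([[3.0, 2.0], [2.5, 2.5], [4.0, 1.0]])

# Perceptron: repeatedly nudge the hyperplane toward misclassified
# points. For linearly separable sets this loop provably terminates
# with w.x < k on all of X1 and w.x > k on all of X2.
X = np.vstack([X1, X2])
y = np.array([-1, -1, -1, 1, 1, 1])       # -1 labels X1, +1 labels X2
w, k = np.zeros(2), 0.0
for _ in range(1000):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi - k) <= 0:        # misclassified (or on the plane)
            w += yi * xi                  # move w toward/away from xi
            k -= yi                       # shift the threshold
            mistakes += 1
    if mistakes == 0:                     # a full clean pass: done
        break

print(w, k)  # a separating direction and threshold
```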