Let us see how we can process textual information to create a vector representation, also known as word embeddings or word vectors, which can be used as input to a neural network.
One-Hot Vector
This is the simplest approach: for each word we create a vector of length equal to the size of the vocabulary, i.e. a vector in $R^{\left|V\right|}$. We set the entry at the word's index to $1$ and all other entries to $0$.
$$W^{apple} = \begin{bmatrix} 1 \\ \vdots \\ \vdots \\ \vdots \\ 0 \\ \end{bmatrix} W^{banana} = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ \vdots \\ 0 \\ \end{bmatrix} W^{king} = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ \end{bmatrix} W^{queen} = \begin{bmatrix} 0 \\ \vdots \\ \vdots \\ 1 \\ 0 \\ \end{bmatrix}$$All these vectors are orthogonal to each other, so this representation doesn't encode any relationship between words:
$$(W^{apple})^TW^{banana}=(W^{king})^TW^{queen}=0$$Also, each vector is very sparse, so this approach requires a large amount of space to encode all our words in vector form.
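A minimal sketch of this in Julia (the tiny vocabulary below is purely an illustrative assumption):

# toy vocabulary; a real vocabulary has tens of thousands of entries
vocab = ["apple", "banana", "king", "queen"];
# one-hot vector: 1 at the word's index in the vocabulary, 0 everywhere else
onehot(word, vocab) = [w == word ? 1 : 0 for w in vocab];
w_apple = onehot("apple", vocab);    # [1, 0, 0, 0]
w_banana = onehot("banana", vocab);  # [0, 1, 0, 0]
w_apple' * w_banana                  # 0: distinct one-hot vectors are orthogonal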
"You shall know a word by the company it keeps" (Firth, J. R. 1957:11)
Word-Document Matrix
In this approach, we create a matrix in which each column represents a document and each row represents a word, so entry $(i, j)$ holds the frequency of word $i$ in document $j$. This matrix scales with the number of documents $\left|D\right|$; its size is $R^{\left|V\right| \times \left|D\right|}$, where $\left|V\right|$ is the size of the vocabulary.
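A rough sketch of how such a matrix could be filled (the toy documents below are hypothetical):

# hypothetical toy corpus: each document is already tokenized
docs = [["apple", "banana", "apple"], ["king", "queen"], ["banana", "king"]];
vocab = unique(vcat(docs...));
# M[i, j] = number of times word i appears in document j
M = zeros(Int, length(vocab), length(docs));
for (j, doc) in enumerate(docs), (i, w) in enumerate(vocab)
    M[i, j] = count(x -> x == w, doc);
end
M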
Word-Word Matrix
In this case, we build a co-occurrence matrix in which both rows and columns represent words from the vocabulary. The benefit of this matrix is that the co-occurrence count of words that are likely to appear together in a sentence will always be high compared to that of words which rarely appear together, so a decent-sized dataset (or collection of documents) is enough. Also, the size of the matrix now depends only on the size of the vocabulary: $R^{\left|V\right| \times \left|V\right|}$.
The beauty of the last two approaches is that we can apply Singular Value Decomposition (SVD) to the matrix and further reduce the dimensionality. Let us see an example on the Word-Word matrix.
Consider our data to have the following 3 sentences:
- I enjoy driving.
- I like banana.
- I like reading.
The co-occurrence matrix (built with a context window of one word on each side) will look like:
$$X = \begin{array}{c|lcr} words & \text{I} & \text{enjoy} & \text{driving} & \text{like} & \text{banana} & \text{reading} &\text{.}\\ \hline \text{I} & 0 & 1 & 0 & 2 & 0 & 0 & 0 \\ \text{enjoy} & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\ \text{driving} & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\ \text{like} & 2 & 0 & 0 & 0 & 1 & 1 & 0 \\ \text{banana} & 0 & 0 & 0 & 1 & 0 & 0 & 1 \\ \text{reading} & 0 & 0 & 0 & 1 & 0 & 0 & 1 \\ \text{.} & 0 & 0 & 1 & 0 & 1 & 1 & 0 \\ \end{array}$$

words = ["I" "enjoy" "driving" "like" "banana" "reading" "."];
X = [0 1 0 2 0 0 0;
1 0 1 0 0 0 0;
0 1 0 0 0 0 1;
2 0 0 0 1 1 0;
0 0 0 1 0 0 1;
0 0 0 1 0 0 1;
0 0 1 0 1 1 0];
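As a sanity check, we can also build the same matrix programmatically from the tokenized sentences; the sketch below assumes the same context window of one word on each side:

sentences = [["I", "enjoy", "driving", "."],
             ["I", "like", "banana", "."],
             ["I", "like", "reading", "."]];
index = Dict(w => i for (i, w) in enumerate(words));
C = zeros(Int, length(words), length(words));
for s in sentences, t in 1:length(s)-1
    i, j = index[s[t]], index[s[t+1]];
    C[i, j] += 1;   # count the neighbour to the right ...
    C[j, i] += 1;   # ... and symmetrically to the left
end
C == X              # true: matches the hand-built matrix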
In Julia, applying SVD to our matrix $X$ gives us $U$, $S$ and $V$:
using LinearAlgebra   # in recent Julia versions svd lives in the LinearAlgebra standard library
U,S,V = svd(X);
U
S
V
"A useful rule of thumb is to retain enough singular values to make up 90% of the energy in Σ. That is, the sum of the squares of the retained singular values should be at least 90% of the sum of the squares of all the singular values." - Jeffrey D. Ullman
The $S$ returned by Julia is a vector holding the diagonal of $\Sigma$, i.e. the singular values, hence the total energy here is:
totEnergy = sum(S.^2)
# cumulative fraction of the total energy retained by the first i singular values
energy = zeros(length(S));
energy[1] = S[1]^2/totEnergy;
for i=2:length(S)
    energy[i] = energy[i-1] + (S[i]^2/totEnergy);
end
energy
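The same cumulative fractions can be computed more compactly with cumsum; this is just an equivalent one-liner, not a different method:

energy ≈ cumsum(S.^2) ./ sum(S.^2)   # true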
using PyPlot
plot(1:length(energy), energy)
xlabel("Dimensions")
ylabel("% Energy Retained")
grid("on")
Looking at the plot, we can determine that keeping 4 dimensions is good enough for us rather than keeping all of them. We can also print/plot the words based on the first two columns of $U$, which correspond to the two biggest singular values.
# keep the first 4 columns of U as our reduced word vectors (one row per word)
Y = U[:,1:4]
figure();
for w=1:length(words)
    text(Y[w,1], Y[w,2], words[w]);
end
xlim((minimum(Y[:,1])-1, maximum(Y[:,1])+1));
ylim((minimum(Y[:,2])-1, maximum(Y[:,2])+1));
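As a quick check that these reduced vectors capture some notion of relatedness (unlike the one-hot vectors, whose pairwise dot products were all zero), we can compare a few rows with cosine similarity. This is just an illustrative sketch on our toy data, using dot and norm from the LinearAlgebra library loaded above:

cosine(a, b) = dot(a, b) / (norm(a) * norm(b));
cosine(Y[5,:], Y[6,:])   # banana vs reading: identical co-occurrence rows, so close to 1
cosine(Y[5,:], Y[2,:])   # banana vs enjoy: typically much lower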
In the coming posts, I'll write about more interesting ways of generating word vectors.