NLP-guidance

Glossary

Bag of words

An approach to embedding in which the order of words in a document is not considered, just the presence or absence (or sometimes quantity) of terms.
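
As a rough illustration (a minimal sketch, not taken from the guidance; the documents are invented), a bag-of-words count in base R:

```r
# Two invented documents
docs <- c("the cat sat on the mat", "the dog sat")

# Tokenise on whitespace; word order is then ignored, only counts are kept
tokens <- strsplit(docs, "\\s+")
vocab <- sort(unique(unlist(tokens)))
counts <- t(sapply(tokens, function(words) table(factor(words, levels = vocab))))
rownames(counts) <- c("doc1", "doc2")
counts
#      cat dog mat on sat the
# doc1   1   0   1  1   1   2
# doc2   0   1   0  0   1   1
```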

Examples

Clustering

A catch-all term for a group of algorithms that aim to collect documents into clusters. The idea is that the documents within each cluster have something in common, and in particular that they have more in common with each other than with documents from outside the cluster.

Clustering algorithms typically require some measure of distance (or, roughly equivalently, similarity) between documents in a vector space. Many different algorithms can be used, each with pros and cons depending on the situation; one is sketched below.
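
As a minimal sketch (the document vectors are invented, and k-means is just one of the many possible algorithms), using base R:

```r
set.seed(1)
# Invented embedded document vectors: six documents in a three-dimensional space,
# constructed so that the first three and last three form two loose groups
doc_vectors <- rbind(matrix(rnorm(9, mean = 0), ncol = 3),
                     matrix(rnorm(9, mean = 5), ncol = 3))
rownames(doc_vectors) <- paste0("doc", 1:6)

# k-means (from base R's stats package) partitions the documents into two clusters
km <- kmeans(doc_vectors, centers = 2)
km$cluster

# Hierarchical clustering on pairwise distances is another common option
hc <- hclust(dist(doc_vectors))
cutree(hc, k = 2)
```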

Corpus

The set of text documents that you are analysing.

Examples

Cosine similarity

A way of measuring similarity between documents after they have been embedded as vectors. The gist is that the similarity between any two documents a and b is judged by the angle $\theta$ between their vectors $\mathbf{a}$ and $\mathbf{b}$. To be specific, we use the cosine of this angle:

$$\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert\mathbf{a}\rVert \, \lVert\mathbf{b}\rVert}$$

The rationale for this is that the vector space into which we embed our documents is defined such that the dimensions in it approximately relate to the concepts within the documents. The vector for a document points in the directions of the concepts that document contains. Therefore two documents with similar conceptual content will have vectors that point in similar directions: the angle between their vectors will be relatively small, so the cosine of this angle will be larger than that between documents with no conceptual similarity.

Note that cosine similarity is a similarity measure rather than a metric.
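
A minimal base R sketch of the formula above, with invented vectors (the lsa package also provides a cosine() function that does the same job):

```r
# Two invented document vectors
a <- c(1, 2, 0, 1)
b <- c(2, 1, 1, 0)

# Cosine of the angle between them: dot product divided by the product of lengths
cosine_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine_sim  # near 1 for vectors pointing in similar directions, 0 for orthogonal ones
```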

Document

A text object, a collection of which makes up your corpus. If you are working on Search or Topics, the documents are the objects between which you will be finding similarities in order to group them topically. The length and definition of a document will depend on the question you are answering.

Examples

Embedding

The process whereby documents or words are encoded as vectors in some (typically very high-dimensional) vector space.

Examples

Inverse Document Frequency (IDF) weighting

Assigning a weight to words in the vocabulary to represent how much we should take note of their appearance or non-appearance in a document, based on what proportion of documents in the corpus those words appear in. There are various IDF schemes; we have always used the standard

$$\mathrm{idf}(t) = \log\left(\frac{N}{n_t}\right)$$

weighting for a word t, where $N$ is the total number of documents in the corpus, and $n_t$ is the number of documents that contain t.
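
A small base R sketch of this weighting, using an invented count matrix with terms in rows and documents in columns:

```r
# Invented term-document count matrix: terms in rows, documents in columns
tdm <- matrix(c(2, 0, 1,
                1, 1, 1,
                0, 0, 3),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("cat", "sat", "mat"), paste0("doc", 1:3)))

N   <- ncol(tdm)         # total number of documents
n_t <- rowSums(tdm > 0)  # number of documents containing each term
idf <- log(N / n_t)      # standard IDF weight for each term
idf
# "sat" appears in every document, so its IDF is log(1) = 0
```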

Normalising

Transforming a vector so that it has unit length, by dividing the initial vector by its (Euclidean) length. If you are using cosine similarity to measure similarities between document vectors, normalising the vectors is often a good idea because, for vectors $\mathbf{a}$ and $\mathbf{b}$,

$$\mathbf{a} \cdot \mathbf{b} = \lVert\mathbf{a}\rVert \, \lVert\mathbf{b}\rVert \cos\theta,$$

where $\theta$ is the angle between $\mathbf{a}$ and $\mathbf{b}$. If we denote the normalised versions of $\mathbf{a}$ and $\mathbf{b}$ as $\mathbf{a}'$ and $\mathbf{b}'$ respectively, we have $\mathbf{a}' = \mathbf{a}/\lVert\mathbf{a}\rVert$ and $\mathbf{b}' = \mathbf{b}/\lVert\mathbf{b}\rVert$, so

$$\mathbf{a}' \cdot \mathbf{b}' = \cos\theta.$$

Dot products are typically much quicker to compute than cosines, and normalisation is quick, so this saves time.
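
A brief base R sketch of the identity above, reusing the invented vectors from the cosine similarity entry:

```r
a <- c(1, 2, 0, 1)
b <- c(2, 1, 1, 0)

# Divide a vector by its Euclidean length to give it unit length
normalise <- function(v) v / sqrt(sum(v^2))

a_norm <- normalise(a)
b_norm <- normalise(b)

# The dot product of the normalised vectors equals the cosine similarity
sum(a_norm * b_norm)
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))  # same value
```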

Stemming

The practice of reducing words to their roots. This reduces the number of words in a vocabulary, and focusses embedding on the concept that the word is trying to encode, rather than the grammatical context of the word. We have generally done it using the Porter algorithm, which has implementations in a number of programming languages including R.
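
For instance (a sketch assuming the SnowballC package, one R implementation of Porter-style stemming):

```r
library(SnowballC)

words <- c("running", "runs", "runner", "connection", "connected")
wordStem(words, language = "english")
# "running" and "runs" both reduce to "run"; "connection" and "connected" to "connect"
```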

Examples

Stopwords

Words routinely removed from documents at an early stage of the analysis.
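
A minimal base R sketch; the stopword list here is a tiny invented one, whereas in practice you would use a standard list such as the one supplied by the tm package's stopwords() function:

```r
# Tiny illustrative stopword list; real lists are much longer
stop_words <- c("the", "a", "of", "on", "and")

tokens <- c("the", "cat", "sat", "on", "the", "mat")
tokens[!tokens %in% stop_words]
# "cat" "sat" "mat"
```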

Examples

Term-document matrix (TDM)

A matrix whose columns are the vectors representing our documents, and whose rows correspond to the words in our vocabulary. Used in Latent Semantic Analysis. It defines a subspace of our initial vector space whose rank is at most the smaller of the number of documents and the size of the vocabulary.

Sometimes algorithms require a Document-Term Matrix (DTM) instead of a TDM; this is just the transpose of the TDM.

Beware that the terminology for these objects can be confused; for example, in R the package lsa contains the key function lsa() which will do the singular value decomposition that you want (see LSA page for details). This function claims that its input must be

…a document-term matrix … containing documents in colums, terms in rows

(emphasis mine). However, the TermDocumentMatrix() function from the tm package, along with Wikipedia, agrees with our definition above.

The naming itself isn't important, but care needs to be taken: when you do a singular value decomposition as part of LSA, you need to know which of the three matrices created corresponds to terms, and which to documents.
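
As a sketch of that point (assuming the tm and lsa packages; the documents are invented), TermDocumentMatrix() gives terms in rows and documents in columns, and the components of the lsa() output can then be matched up accordingly:

```r
# A sketch assuming the tm and lsa packages; the documents are invented
library(tm)
library(lsa)

docs <- c("the cat sat on the mat", "the dog chased the cat", "dogs and cats")
corpus <- VCorpus(VectorSource(docs))

# Terms in rows, documents in columns, matching the definition above
tdm <- as.matrix(TermDocumentMatrix(corpus))
dim(tdm)  # (number of terms) x (number of documents)

# Singular value decomposition via lsa(); in the returned space,
# tk relates to terms, dk to documents, and sk holds the singular values
space <- lsa(tdm, dims = 2)
dim(space$tk)  # terms x 2
dim(space$dk)  # documents x 2
```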

Vocabulary

The set of all words used in the corpus, after stopwords have been removed and stemming has been done (where appropriate).
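
A short sketch pulling together the steps above (assuming the SnowballC package; the documents and stopword list are invented):

```r
library(SnowballC)

docs <- c("The cats sat on the mats", "Dogs sat")
stop_words <- c("the", "on")

tokens <- unlist(strsplit(tolower(docs), "\\s+"))
tokens <- tokens[!tokens %in% stop_words]                        # remove stopwords
vocab  <- sort(unique(wordStem(tokens, language = "english")))   # stem, then deduplicate
vocab
# "cat" "dog" "mat" "sat"
```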

Examples

Weighting scheme

The scheme by which we go from a vector of counts of each word in the vocabulary for a given document to an embedding. Typically made up of three elements: term frequency, inverse document frequency, and normalisation.
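
A base R sketch of one such scheme (raw term frequency weighted by IDF, then normalisation of each document vector), using an invented count matrix; the tm package's weightTfIdf() offers a ready-made version of this kind of combination:

```r
# Invented term-document count matrix: terms in rows, documents in columns
tdm <- matrix(c(2, 0, 1,
                1, 2, 1,
                0, 1, 3),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("cat", "sat", "dog"), paste0("doc", 1:3)))

# Term frequency (here the raw counts) weighted by IDF: each row is scaled by its term's IDF
idf      <- log(ncol(tdm) / rowSums(tdm > 0))
weighted <- tdm * idf

# Normalise each document (column) vector to unit length
embedded <- apply(weighted, 2, function(v) v / sqrt(sum(v^2)))
round(embedded, 3)
```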

Example

Back to contents

Written by Sam Tazzyman, DaSH, MoJ