An approach to embedding in which the order of words in a document is not considered, just the presence or absence (or sometimes quantity) of terms.
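For illustration, here is a minimal sketch in base R of a bag-of-words representation of two made-up documents (in practice a package such as `tm` will build these counts for you):

```r
# A minimal base R sketch: each made-up document is reduced to a vector of
# word counts, ignoring word order.
docs   <- c(doc1 = "the cat sat on the mat",
            doc2 = "the dog sat on the log")
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))
counts <- t(sapply(tokens, function(words) table(factor(words, levels = vocab))))
counts   # one row per document, one column per word in the vocabulary
```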
A catch-all term for a group of algorithms that aim to collect documents into clusters. The idea is that the documents within each cluster have something in common, and in particular that they have more in common with each other than with documents from outside the cluster.
Clustering algorithms typically require some measure of distance (or, roughly equivalently, similarity) between documents in a vector space. There are many different algorithms that can be used, each with pros and cons depending on the situation.
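As a sketch of the general pattern (the document vectors below are random numbers, purely for illustration), one option is to convert cosine similarity into a distance and pass it to base R's hierarchical clustering:

```r
# Cluster made-up "document" vectors by converting cosine similarity into a
# distance and applying base R hierarchical clustering.
set.seed(1)
doc_vectors <- matrix(runif(30), nrow = 6)      # 6 toy documents in a 5-dimensional space
norms       <- sqrt(rowSums(doc_vectors^2))
cos_sim     <- (doc_vectors %*% t(doc_vectors)) / (norms %o% norms)
clusters    <- cutree(hclust(as.dist(1 - cos_sim)), k = 2)
clusters    # cluster label for each document
```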
The set of text documents that you are analysing.
The set of written parliamentary questions that the Ministry of Justice has answered since a given date.
The set of sentences from all prison inspection reports since a given date.
A set of emails sent to a particular person.
A way of measuring similarity between documents after they have been embedded as vectors. The gist is that the similarity between any two documents a and b is judged by the angle $\theta$ between their vectors **a** and **b**. To be specific, we use the cosine of this angle:

$$\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}|\,|\mathbf{b}|}$$
The rationale for this is that the vector space into which we embed our documents is defined such that the dimensions in it approximately relate to the concepts within the documents. The vector for a document points in the directions of the concepts that document contains. Therefore two documents with similar conceptual content will have vectors that point in similar directions: the angle between their vectors will be relatively small, so the cosine of this angle will be larger than that between documents with no conceptual similarity.
Note that cosine similarity is a similarity measure rather than a metric.
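A minimal sketch in base R, with made-up vectors:

```r
# Cosine similarity between two document vectors a and b, in base R.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}
a <- c(2, 0, 1, 3)   # made-up document vectors
b <- c(1, 1, 0, 2)
cosine_similarity(a, b)
```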
A text object, the collection of which makes up your corpus. If you are doing work on Search or Topics, the documents will be the objects between which you will be finding similarities in order to group them topically. The length and definition of a document will depend on the question you are answering.
The process whereby documents or words are coded up as a vector in some (typically very high-dimensional) vector space.
Assigning a weight to words in the vocabulary to represent how much we should take note of their appearance or non-appearance in a document, based on what proportion of documents in the corpus those words appear in. There are various IDF schemes; we have always used the standard

$$\mathrm{idf}(t) = \log\left(\frac{N}{n_t}\right)$$

weighting for a word t, where N is the total number of documents in the corpus, and $n_t$ is the number of documents that contain t.
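As a sketch, this weighting can be computed directly from a term-document matrix in base R (the counts below are made up):

```r
# idf(t) = log(N / n_t) for a toy term-document matrix
# (terms in rows, documents in columns).
tdm <- matrix(c(1, 0, 2,
                0, 3, 1,
                4, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("fox", "ham", "tree"), c("doc1", "doc2", "doc3")))
N   <- ncol(tdm)           # total number of documents
n_t <- rowSums(tdm > 0)    # number of documents containing each term
idf <- log(N / n_t)
idf
```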
Transforming a vector so that it has unit length, by dividing the initial vector by its (Euclidean) length. If you are using cosine similarity to measure similarities between document vectors, normalising the vectors is often a good idea because, for vectors **a** and **b**,

$$\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}|\,|\mathbf{b}|},$$

where $\theta$ is the angle between **a** and **b**. If we denote the normalised versions of **a** and **b** as **a**′ and **b**′ respectively, we have $\mathbf{a}' = \mathbf{a}/|\mathbf{a}|$ and $\mathbf{b}' = \mathbf{b}/|\mathbf{b}|$, so

$$\cos\theta = \mathbf{a}' \cdot \mathbf{b}'.$$
Dot products are typically much quicker to compute than cosines, and normalisation is quick, so this saves time.
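A minimal sketch in base R, with made-up vectors, confirming that the dot product of the normalised vectors matches the cosine similarity of the originals:

```r
# Normalise two made-up vectors and compare their dot product with the
# cosine similarity of the original vectors.
normalise <- function(v) v / sqrt(sum(v^2))
a <- c(2, 0, 1, 3)
b <- c(1, 1, 0, 2)
sum(normalise(a) * normalise(b))                   # dot product of normalised vectors
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))     # cosine similarity: the same value
```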
The practice of reducing words to their roots. This reduces the number of words in a vocabulary, and focusses embedding on the concept that the word is trying to encode, rather than the grammatical context of the word. We have generally done it using the Porter algorithm, which has implementations in a number of programming languages including R.
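For example, a sketch using the SnowballC package, which provides one R implementation of the Porter stemmer (the `tm` package's `stemDocument()` calls the same library under the hood):

```r
library(SnowballC)
# Reduce words to their Porter stems; grammatical variants collapse together.
wordStem(c("house", "houses", "mouse", "eggs", "trains"), language = "porter")
# [1] "hous"  "hous"  "mous"  "egg"   "train"
```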
Words routinely removed from documents at an early stage of the analysis.
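For example, a sketch using the English stopword list shipped with the `tm` package (the sentence is made up):

```r
library(tm)
head(stopwords("english"))                 # the first few words in tm's English stopword list
removeWords(tolower("I would not like them here or there"),
            stopwords("english"))          # stopwords are stripped from the (lowercased) text
```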
A matrix, the columns of which are the vectors representing our documents, and the rows the words in our vocabulary. Used in Latent Semantic Analysis. Defines a subspace of our initial vector space, the rank of which is at most the smaller of the number of documents and the size of the vocabulary.
Sometimes algorithms require a Document-Term Matrix (DTM) instead of a TDM; this is just the transpose of the TDM.
Beware that the terminology for these objects can be confused; for example, in R the package `lsa` contains the key function `lsa()`, which will do the singular value decomposition that you want (see the LSA page for details). This function claims that its input must be

> …a document-term matrix … containing *documents in colums, terms in rows*…

(emphasis mine). However, the `TermDocumentMatrix()` function from the `tm` package, along with Wikipedia, agrees with our definition above.
The semantics aren’t important, but care needs to be taken because when you do a singular value decomposition as part of LSA, you need to know which of the three matrices created corresponds to terms, and which to documents.
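As a sketch, using the `tm` package on a made-up two-document corpus:

```r
library(tm)
# Build a term-document matrix from a made-up two-document corpus:
# terms end up in rows, documents in columns.
corpus <- VCorpus(VectorSource(c("green eggs and ham",
                                 "a mouse in a house")))
tdm <- TermDocumentMatrix(corpus)   # DocumentTermMatrix() gives the transpose
dim(tdm)                            # number of terms, then number of documents
inspect(tdm)
```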
The set of all words used in the corpus, after stopwords have been removed and stemming has been done (where appropriate).
For example, the full set of words appearing in a small toy corpus might be:

a, am, and, anywhere, are, be, boat, box, car, could, dark, do, eat, eggs, fox, goat, good, green, ham, here, house, I, if, in, let, like, may, me, mouse, not, on, or, rain, Sam, say, see, so, thank, that, the, them, there, they, train, tree, try, will, with, would, you

After stopwords have been removed, this reduces to:

boat, box, car, dark, eat, eggs, fox, goat, good, green, ham, house, I, mouse, rain, Sam, train, tree.

After stemming, the vocabulary is:

boat, box, car, dark, eat, egg, fox, goat, good, green, ham, hous, mous, rain, sam, train, tree.
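As a sketch, the vocabulary can be built by removing stopwords and stemming what remains, here using the `tm` and SnowballC packages on a handful of the words above:

```r
library(tm)
library(SnowballC)
# Build the vocabulary: drop stopwords, stem the remaining words, keep unique stems.
words <- c("i", "would", "eat", "green", "eggs", "and", "ham", "in", "a", "house")
kept  <- words[!words %in% stopwords("english")]
vocab <- sort(unique(wordStem(kept, language = "porter")))
vocab
```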
The scheme by which we go from a vector of counts of each word in the vocabulary for a given document to an embedding. Typically made up of three elements: term frequency, inverse document frequency, and normalisation.
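A minimal sketch in base R of one such scheme (term counts weighted by idf, then normalised), applied to a made-up term-document matrix:

```r
# Term counts (term frequency) for a toy corpus: terms in rows, documents in columns.
tf <- matrix(c(1, 0, 2,
               0, 3, 1,
               4, 0, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("fox", "ham", "tree"), c("doc1", "doc2", "doc3")))
idf      <- log(ncol(tf) / rowSums(tf > 0))                       # inverse document frequency
weighted <- tf * idf                                              # scale each term's counts by its idf
embedded <- apply(weighted, 2, function(v) v / sqrt(sum(v^2)))    # normalise each document vector
colSums(embedded^2)                                               # each document vector now has unit length
```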
Written by Sam Tazzyman, DaSH, MoJ