In most corpora it is not the case that each document is about something totally different. Rather, there are usually a few underlying topics which cover the content of the corpus. We would like to be able to work out what these topics are, and which documents are about which topic.
This question can actually be somewhat more general than just determining the topical content of the text. For example, if we take emails to have only one of two topics, ‘spam’ or ‘ham’, then creating a spam filter is an example of a problem in this area.
If you are lucky, you will have a large corpus of texts, the topical content of which is already tagged. Our spam filter example above might be of this nature. Similarly, some departments have a taxonomy for parliamentary questions that details which team should answer them, depending on their topical content.
The reason that having topical tags already assigned to your corpus is fortunate is that it enables you to use supervised machine learning techniques, such as naive Bayes, to classify future documents in the same way as those in the existing data set. It also means that questions about the number, size, and definition of topics are already answered.
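As an illustration, training a naive Bayes classifier on a labelled corpus takes only a few lines. This is a minimal sketch using scikit-learn; the documents and labels are made-up placeholders rather than anything from a real data set.

```python
# Minimal sketch: supervised text classification with naive Bayes.
# The documents and labels below are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

documents = [
    "win a free prize now",
    "cheap meds online",
    "minutes from yesterday's team meeting",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features feeding a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(documents, labels)

# Classify a new, unseen document in a way consistent with the labelled data.
print(model.predict(["claim your free prize"]))
```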
Unfortunately, we have not experienced this happy state of affairs (which might be why we think it’s a happy state of affairs - the grass is always greener). Instead, we have so far been restricted to the situation where our documents are unclassified, and we have to determine the topics all by ourselves. This leaves us in the world of unsupervised machine learning, and in particular trying either Latent Dirichlet Allocation (LDA) or clustering algorithms.
There is a full page on LDA, and our problems trying to implement it, here. Suffice it to say that we haven’t got it to work satisfactorily.
The work we have done on Search relies entirely on the ability to embed our documents in a vector space. If we can do this, we can use our distance metric (or similarity measure) and apply clustering techniques to group together documents that are ‘close’ together in the vector space, and therefore hopefully about similar subjects.
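For example, one common way to get such an embedding is TF-IDF, with cosine similarity as the similarity measure. The sketch below uses scikit-learn and made-up documents; it shows one possible embedding, not necessarily the one used in our Search work.

```python
# Sketch: embed documents with TF-IDF and measure pairwise similarity,
# as a basis for clustering. The documents are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "prison population statistics for last year",
    "number of prisoners held in custody",
    "legal aid funding for family courts",
]

vectoriser = TfidfVectorizer()
X = vectoriser.fit_transform(documents)  # one row per document (sparse matrix)

# Pairwise cosine similarity: values near 1 mean the documents are 'close'.
print(cosine_similarity(X))
```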
There are many different methods for clustering, and I’m not going to go through them here. Suffice it to say that you should try several and compare them for speed of computation and how sensible the results appear to be.
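As a rough illustration of what “try several and compare” might look like, the sketch below runs a few scikit-learn clusterers over the same synthetic embedding; the data and parameter values are illustrative guesses, not recommendations.

```python
# Sketch: run several clustering algorithms over the same embedding and
# compare the groupings by eye. Data and parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(0)
# Stand-in for a document embedding: two loose blobs of points in 5 dimensions.
X = np.vstack([rng.normal(0, 0.5, (20, 5)), rng.normal(3, 0.5, (20, 5))])

clusterers = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=2),
    "dbscan": DBSCAN(eps=2.0, min_samples=3),
}

for name, clusterer in clusterers.items():
    labels = clusterer.fit_predict(X)
    print(name, labels)
```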
Further complicating the issue is that most techniques require you to set some parameter(s), whether that be the number of clusters explicitly, or some related measure such as the density of points required for a group to count as a cluster. This brings up a lot of questions: how do we know that a clustering is the “right” one, or even a “good” one?
For most clustering methodologies there are statistical measures to determine the validity of your parameter choices. For example, for k-means you can look at the silhouette of a clustering as a measure of quality. However, we have found that these measures are lacking: they give a technically justified idea of the number of clusters, but can often result in lots of tight clusters (often including singleton clusters) when frequently the user wants wider, looser collections (and in particular probably never wants a singleton cluster). There is no substitute for you (or the user) looking at the clusters and deciding whether or not they are useful.
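For instance, a silhouette-based comparison of different values of k for k-means might look like the sketch below (synthetic data, illustrative only). As discussed above, treat the score as a guide rather than the final word.

```python
# Sketch: compare silhouette scores for different choices of k in k-means.
# The "embedding" here is synthetic, with three obvious blobs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (20, 5)) for c in (0, 3, 6)])  # toy embedding

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```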
Using an unsupervised machine learning technique for obtaining topics means that there is inevitably an element of black box to proceedings. This air of mystery can have the unwanted side effects that for some topics it can be hard to tell why the documents have been grouped together, and that other groupings turn out to be driven by features of the text that are not actually of interest.
The first point is arguably the harder to deal with - if you’re not sure why some documents have been grouped, it’s hard to know how to make changes to your feature selection or embedding in order to ungroup them. It can also make it hard to convince the users that the topical clusters you have produced have any semantic meaning. In the worst case, you can see a definition of a topic (in terms of a list of words or of documents) but you can’t articulate what it is about. If this happens too much, your topical discovery is essentially useless to the user. I have found this to be a constant problem when trying LDA.
The second point can sometimes be fixed by changing feature selection or embedding schemes. For example, with Parliamentary Questions, there was a cluster forming around questions containing the relatively rare word “steps”, because there are some questions asking the Secretary of State “what steps [he/she] will take” to solve some issue or other. We want to focus on the issue, rather than this piece of parliamentary fluff language. Finding this cluster allowed us to add the word “steps” to our stopword list which led to those questions being correctly categorised with others about the same topics.
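A sketch of how such a stopword tweak might look, assuming a scikit-learn TF-IDF vectoriser (the pipeline we actually used may differ):

```python
# Sketch: add a domain-specific stopword (here "steps") so that parliamentary
# fluff language doesn't drive the clustering. Illustrative only.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

custom_stopwords = list(ENGLISH_STOP_WORDS) + ["steps"]

vectoriser = TfidfVectorizer(stop_words=custom_stopwords)
X = vectoriser.fit_transform([
    "what steps the Secretary of State will take to reduce court backlogs",
    "what steps the Secretary of State will take to improve prison safety",
])

# "steps" no longer appears among the features used for the embedding.
print(sorted(vectoriser.get_feature_names_out()))
```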
In all cases, time spent looking at your topics/clusters is usually well spent, as it gives you a feel for what your complex bit of algorithmic machinery is actually doing.
Written by Sam Tazzyman, DaSH, MoJ