A Text Search Method Using Similarities by K-Nearest Neighbours [Machine Learning]
A significant part of being an academic or a
researcher is keeping up with the literature. A comprehensive literature review
is essential, especially if you are taking up a new scientific problem or a
research project, and the way to do that is an online search. The search
tools out there for journal articles (or books) have never been sufficiently
satisfactory for me. Typically, a simple search is performed using one or
more keywords in select fields such as title, author, date, etc. Many good
online databases, such as the Web of Science, also
offer the option of an advanced search, where you can narrow down your query to
filter out the unnecessary stuff. I have always found the simple search to be too
broad: it usually returns too many articles, and skimming all those
articles to find the useful ones becomes a very tedious and time-consuming
task. On the other hand, the advanced search usually narrows the results so
much that I would miss some very useful articles. This is a natural drawback of
conducting a search by one-to-one matching of keywords (or phrases), even
though it is possible to use a combination of keywords and/or search fields.
Therefore, I wanted to test an unsupervised learning algorithm, namely the
k-nearest neighbours method, to alleviate these issues with a more sophisticated
matching algorithm and hopefully improve the quality of the search results. This
way, a full text rather than single keywords can be used as the search input (e.g.,
one can enter a paragraph from a book to search for texts similar to the subjects
mentioned in that paragraph). The method can of course be applied to any text:
for example, you can enter the recipe of a dish you love to find similar dishes,
matched not just by similar ingredients, but also by the amounts of the
ingredients, the way they are cooked, etc. in a combined fashion.
I implemented and tested the algorithm on the
abstracts of articles in two fields, atmospheric science and modelling, published
by the American Meteorological Society (AMS). An
abstract is a distilled version of the article, giving brief information about
the research problem, the methods, and the key findings; therefore, it serves as a
good search field for finding articles similar to what is sought. A total of 44,886 scientific
abstracts spanning the period 1967-2018 were gathered from the AMS
Journals website by a web scraping algorithm. I then applied the algorithm to
texts from a different field, philosophy, scraped from the Notre Dame Reviews (4,813 book reviews). The Python code I developed for this work can be found under this GitHub repository.
The Method
As I
implied above, the algorithm I used to quantify similarity is an unsupervised machine
learning method called k-nearest neighbours. The initial step in the method is to construct vectors that
contain the words appearing in each abstract together with their numbers
of occurrences. Then the TF-IDF (Term Frequency – Inverse Document Frequency) matrix is formed
from these vectors. The TF-IDF statistic is a measure of how important a word
is in a document: it weighs the word proportionally to its frequency of
occurrence while assigning lower weights to commonly used words such as ‘the’, ‘a’,
‘have’, etc. The similarities between texts are then quantified as distances by applying
the k-nearest neighbours algorithm to the TF-IDF matrix, and these distances can
be used to identify the texts most similar to the text used for the search. Such an
approach results in a more probabilistic matching between entire texts instead
of the limited one-to-one matching of keywords. This method is implemented
using the scikit-learn
package
in Python (here is a tutorial you can use).
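This pipeline can be sketched with scikit-learn roughly as follows; the abstracts and target text here are invented placeholders, not the AMS data:

```python
# Sketch of the TF-IDF + k-nearest-neighbours search described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

abstracts = [
    "Shallow cumulus convection moistens the layer above the boundary layer.",
    "A new radar method for estimating rainfall over complex terrain.",
    "Boundary layer deepening driven by cumulus mass flux over land.",
]

# Build the TF-IDF matrix from the word-count vectors of the abstracts;
# common words are down-weighted (and here also dropped as stop words).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(abstracts)

# Fit k-NN on the TF-IDF rows; cosine distance suits sparse text vectors.
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(tfidf)

# Use a whole text (not single keywords) as the query.
target = ["cumulus convection and the boundary layer"]
distances, indices = knn.kneighbors(vectorizer.transform(target))
print(indices[0])  # indices of the most similar abstracts
```

With these toy inputs, the two convection-related abstracts are returned ahead of the unrelated radar one.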
Including n-grams in the Search
N-grams
are combinations of n adjacent words in a given text. For example,
"I am the" is a 3-gram and "my own" is a 2-gram (see my previous post on n-grams). Including
n-grams in the search algorithm (i.e., finding similarities using the TF-IDF
matrices of different n-grams) can improve the accuracy of the results, since it
diversifies the probabilistic matching. For example, one may want to
access texts that emphasize the concept of ‘potential vorticity’ (a 2-gram) rather than
the single words ‘potential’ and ‘vorticity’, which have different meanings and
would possibly return some irrelevant articles.
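In scikit-learn, which n-grams enter the TF-IDF vocabulary is controlled by the ngram_range parameter of TfidfVectorizer; a minimal illustration with an invented sentence:

```python
# How n-grams change the TF-IDF vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["potential vorticity anomalies steer the cyclone"]

# Only 1-grams (n1) vs. 1- and 2-grams (n12): ngram_range=(min_n, max_n).
v1 = TfidfVectorizer(ngram_range=(1, 1))
v12 = TfidfVectorizer(ngram_range=(1, 2))
v1.fit(docs)
v12.fit(docs)

print("potential vorticity" in v1.vocabulary_)   # False: no 2-grams kept
print("potential vorticity" in v12.vocabulary_)  # True: kept as a unit
```

With n12, the phrase ‘potential vorticity’ survives as a single feature instead of being split into two unrelated words.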
I tested
the algorithm on a subject I started working on as a postdoctoral research
fellow: shallow convection and how
it is simulated in climate models. I don’t have extensive knowledge of or experience with shallow
convection in the atmosphere, so what I want is a search that returns articles
containing introductory material. For this purpose, I use the text below
on shallow convection as the target text:
“Shallow convection is what
the model uses to deal with cumulus-topped mixed layers. It assumes the cumuli
are not precipitating, meaning integrated column moisture and heat content are
unchanged, only the vertical distribution is rearranged. In order to account
for the latent heating in the cumuli and the evaporation from air mixing with the
cumuli, the model formulation has to be more complex than simply dry mixing.
Remember, the cumulus are sub-grid scale and the air on average is not
saturated in the cumulus layer.
In nature, the cumulus
transport moisture out of the boundary layer and push the inversion upward,
warming the bottom of the cumulus layer and cooling the top. The model tries to
reproduce these effects. So we expect to see drying in the boundary layer,
moistening above the boundary layer, and deepening of the boundary layer.”
The aim here is to use the above text as the query
(instead of single words in the search field) against the abstract fields of the
44,886 AMS articles, hoping the machine learning algorithm will return articles
related to this text. This text is particularly good for test
purposes because it defines shallow convection in a general sense and
touches on several related atmospheric concepts (e.g., boundary layer,
cumulus, mixed layer, etc.), which should help return the introductory material I
want. Also, no word in the text is overused, e.g., the word
“shallow” appears only once, which poses a challenge to the algorithm in
finding related articles without being too broad.
Since the applied method is an unsupervised learning technique, there is no training dataset with labels indicating the abstracts’ relevance to the target text above. So, I assessed the results subjectively (by actually reading the returned abstracts) in terms of their similarity to the target text. I labelled the articles in the search results using 4 categories:
Excellent Match (EM): Article directly related to shallow convection and its simulation by models.
Good Match (GM): Article not directly related to shallow convection, but containing good introductory knowledge about strongly related subjects such as convection, the boundary layer, etc.
Mediocre Match (MM): Article contains words that are present in the target text but is only weakly related to it in context.
Bad Match (BM): Article contains words that are present in the target text but is irrelevant to its context.
Results
The search results of the algorithm change depending
on the ‘n’ in the n-grams used in the calculation of the TF-IDF matrix. I used
four different combinations of n-grams:
· Only 1-grams (n1)
· 1- and 2-grams (n12)
· 1-, 2-, and 3-grams (n123)
· 1-, 2-, 3-, and 4-grams (n1234)
The table below shows the results for the 10 most similar
articles (ranked by the k-nearest neighbour search) returned by each of
the four combinations.
              n1    n12   n123   n1234
EM             4      3      1       1
GM             5      3      3       3
MM             1      4      4       4
BM             0      0      2       2
% Correct     90     60     40      40

Table 1. Number of returned articles in each match category for the four n-gram combinations.
Let’s consider excellent and good matches (EM and GM) as correct, and mediocre and bad matches (MM and BM) as incorrect, search returns by the algorithm. By this criterion, n1 yields the most accurate results at 90% (9 of the 10 search returns are labelled as either EM or GM). The accuracy of the algorithm decreases with the inclusion of 2-grams (n12) and is worst with n123 and n1234. However, the results show that even though they underperform, n123 and n1234 can still return useful articles. For example, an article labelled EM, “A New Bulk Shallow-Cumulus Model and Implications for Penetrative Entrainment Feedback on Updraft Buoyancy”, which is directly related to shallow convection, does not appear in the results of n1, but it is returned as the 5th most similar article by n12 and the 2nd by both n123 and n1234. Therefore, populating a result set by merging the results from each of the 4 n-gram combinations should yield the most accurate search results.
Merging Results by Simple Averaging
Such merging can be achieved by simply averaging
the rankings of each article in the results. For example, if an article is
returned as the 6th most similar by n1, 2nd by n12, and 4th
by n1234, and doesn’t appear in the results of n123, its final score
can be computed as the average (6 + 2 + 4)/3 = 4. A final list can then be
populated by sorting these average scores from smallest to largest (a smaller
score means a higher ranking). However, this
approach has a caveat: it assigns unduly good scores to articles that appear
in the results of only one or a couple of the n-gram combinations (e.g.,
if an article appears only in the results of n1234 as the 3rd most
similar, its final score will be 3/1 = 3, which places it near the top). The
first 10 articles in the final list populated by this simple averaging method contain
4 EMs and 4 GMs, but also 2 MMs as the 6th and 7th most
similar. Also, a couple of EMs appeared lower in the list, as the 11th
and 12th. We want to keep mediocre matches (MM) out of our results and
improve the scores of the excellent matches (EM) so that they appear higher in
the list. So maybe applying some scaling to the merge will improve the
final list.
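The simple rank averaging can be sketched as follows; the article ids and ranks here are made up for illustration, and a smaller average means a better final ranking:

```python
# Merging the n-gram result lists by simple rank averaging.
# Each dict maps article id -> rank within that combination's result list.
n1    = {"A": 6, "B": 1}
n12   = {"A": 2, "B": 3, "C": 5}
n123  = {"B": 2, "C": 2}
n1234 = {"A": 4, "C": 3}

all_lists = [n1, n12, n123, n1234]
articles = set().union(*all_lists)

# Average an article's rank over only the lists in which it appears.
scores = {
    a: sum(r[a] for r in all_lists if a in r) / sum(a in r for r in all_lists)
    for a in articles
}

# Sort from smallest average rank (most similar) to largest.
final = sorted(scores, key=scores.get)
print(scores["A"])  # (6 + 2 + 4) / 3 = 4.0
```

Note how article "A", which matches the worked example in the text, ends up with the average 4.0.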
Merging Results by Simple Averaging over Scaled Rankings
The averaged results can be improved by scaling them
to weigh the results of each n-gram combination separately. In this shallow
convection search test, Table 1 shows that using only 1-grams (n1) yields the
most accurate similarity results, so we can give the results of n1 precedence
over the other n-gram combinations. Say we scale the n-gram
combinations using the coefficients 1.25, 1.75, 2.5, and 2.5 for n1, n12,
n123, and n1234, respectively; the final score of the example above then becomes:
(6×1.25 + 2×1.75 + 4×2.5) / (1.25 + 1.75 + 2.5) = 3.82
(the coefficient of n123 is left out of the denominator because the article does not appear in its results).
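The scaled score can be computed like this; the ranks are those of the worked example, and the n123 coefficient drops out because the article is absent from that list:

```python
# Scaled rank averaging: weight each rank by its combination's coefficient,
# dividing by the sum of coefficients of only the lists the article appears in.
coeffs = {"n1": 1.25, "n12": 1.75, "n123": 2.5, "n1234": 2.5}
ranks  = {"n1": 6, "n12": 2, "n1234": 4}   # no entry for n123

num = sum(ranks[k] * coeffs[k] for k in ranks)
den = sum(coeffs[k] for k in ranks)
score = num / den
print(round(score, 2))  # 3.82
```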
This time, the first 10 articles in the final list
populated by averaging over the scaled rankings contain 6 EMs, 3 GMs, and 1
MM, which is an improvement over the simple averaging method. The EMs that
weren’t in the 10 most similar articles under simple averaging now appear
as the 6th and the 9th in the final list.
I’d say it is doing an OK job. For the more interested
reader, I list the first 5 articles returned by this method in the Appendix
below. But before that, how about another application, this time with
philosophical texts?
Application to Notre Dame Reviews
A friend of mine who is working on her philosophy PhD
said this kind of search could be useful to her and sent me a recent abstract of hers
so that I could search for similar book reviews in Notre Dame Reviews. The
abstract is a long one, about vulnerability from a feminist point of
view. Here is a sample of it:
“Vulnerability seems to be at the core of several
philosophical discussions, which indicates the breadth and the complexity of
the concept. However, contemporary debates on vulnerability usually underpin a
negative interpretation of the term, focusing on one’s susceptibility to harm
and injury…”
I merged the results using both
the simple averaging and the averaging over scaled rankings, and this time the two gave similar
results (my friend was kind enough to skim the results
and label them with the 4 categories described above). The results are not as
good as in the shallow convection example: among the 10 most similar texts the search
returns, there are 2 excellent, 3 good, 3 mediocre, and 3 bad matches. However,
the number of texts in this case is about ten times smaller than that of
the AMS abstracts used in the previous search, and it is quite possible that there
simply aren’t many book reviews closely related to the specific subject we are looking for.
Besides, in this test I also included another text in the search, again written by my
friend on a very similar subject, to see whether the search would return it.
That text was returned as the 3rd most similar, and the other
excellent match came first, so I’d say the method is performing OK. Moreover, she
said she found some very interesting book reviews as a result of my search, so
I call it a success! The results of this search are also in the Appendix (clicking on each result will take you to the article or the review).
Appendix
Top 5 search results for shallow convection text in AMS abstracts:
- The Significance of Thermodynamic Forcing by Cumulus Convection in a General Circulation Model
- Subcloud-Layer Feedbacks Driven by the Mass Flux of Shallow Cumulus Convection over Land
- Regional Model Simulations of Marine Boundary Layer Clouds over the Southeast Pacific off South America. Part II: Sensitivity Experiments
- A Multiple Mass Flux Parameterization for the Surface-Generated Convection. Part II: Cloudy Cores
- A New Bulk Shallow-Cumulus Model and Implications for Penetrative Entrainment Feedback on Updraft Buoyancy