A Text Search Method Using Similarities by K-Nearest Neighbours [Machine Learning]
A significant part of being an academic or a
researcher is keeping up with the literature. A comprehensive literature review
is essential, especially if you are taking up a new scientific problem or a
research project, and the way to do that is an online search. The search
tools out there for journal articles (or books) have never been sufficiently
satisfactory for me. Typically, a simple search is performed using one or
more keywords in select fields such as title, author, date, etc. Many good
online databases, such as the Web of Science, also
offer the option of an advanced search, where you can narrow down your query to
filter out the unnecessary stuff. I have always found the simple search to be too
broad: it usually returns too many articles, and skimming all those
articles to find the useful ones becomes a very tedious and time-consuming
task. On the other hand, the advanced search usually narrows the results so
much that I would miss some very useful articles. This is a natural drawback of
conducting a search by one-to-one matching of keywords (or phrases), even
though it is possible to use a combination of keywords and/or search fields.
Therefore, I wanted to test an unsupervised learning algorithm, namely the
k-nearest neighbours method, to alleviate these issues with a more sophisticated
matching algorithm and hopefully improve the quality of the search results. This
way, a full text rather than single keywords can be used as the search input (e.g.,
one can enter a paragraph from a book to search for texts similar to the subjects
mentioned in that paragraph). The method can of course be applied to any text:
for example, you can enter the recipe of a dish you love to find similar dishes,
matched not just by similar ingredients, but also by the amounts of the
ingredients, the way they are cooked, etc. in a combined fashion.
I implemented and tested the algorithm on the
abstracts of articles in two fields, atmospheric science and modelling, published
by the American Meteorological Society (AMS). An
abstract is a distilled version of the article, giving brief information about
the research problem, the methods, and the key findings; therefore, it serves as a
good search field for finding articles similar to what is sought. A total of 44,886 scientific
abstracts spanning the period 1967-2018 were gathered from the AMS
Journals website by a web scraping algorithm. I then applied the algorithm to
texts from a different field, philosophy, scraped from the Notre Dame Reviews (4,813 book reviews). The Python code I developed for this work can be found under this GitHub repository.
The Method
As I
implied above, the algorithm I used to quantify similarity is an unsupervised machine
learning method called k-nearest neighbours. The initial step in the method is to construct vectors that
contain the words appearing in each abstract together with their numbers
of occurrences. Then the TF-IDF (Term Frequency – Inverse Document Frequency) matrix is formed
from these vectors. The TF-IDF statistic is a measure of how important a word
is in a document: it weighs the word proportionally to its frequency of
occurrence while assigning lower weights to commonly used words such as ‘the’, ‘a’,
‘have’, etc. The similarities between texts are then quantified as distances by applying
the k-nearest neighbours algorithm to the TF-IDF matrix, and these distances can
be used to identify the texts most similar to the text used for the search. Such an
approach results in a more probabilistic matching between entire texts instead
of the limited one-to-one matching of keywords. This method is implemented
using the scikit-learn
package
in Python (here is a tutorial you can use).
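This pipeline can be sketched with scikit-learn roughly as follows; the abstracts and target text here are invented placeholders, not the AMS data:

```python
# Sketch of the TF-IDF + k-nearest-neighbours search described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

abstracts = [
    "Shallow cumulus convection moistens the layer above the boundary layer.",
    "A new radar method for estimating rainfall over complex terrain.",
    "Boundary layer deepening driven by cumulus mass flux over land.",
]

# Build the TF-IDF matrix from the word-count vectors of the abstracts;
# common words are down-weighted (and here also dropped as stop words).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(abstracts)

# Fit k-NN on the TF-IDF rows; cosine distance suits sparse text vectors.
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(tfidf)

# Use a whole text (not single keywords) as the query.
target = ["cumulus convection and the boundary layer"]
distances, indices = knn.kneighbors(vectorizer.transform(target))
print(indices[0])  # indices of the most similar abstracts
```

With these toy inputs, the two convection-related abstracts are returned ahead of the unrelated radar one.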
Including n-grams in the Search
N-grams
are combinations of n adjacent words in a given text. For example,
"I am the" is a 3-gram and "my own" is a 2-gram (see my previous post on n-grams). Including
n-grams in the search algorithm (i.e., finding similarities using the TF-IDF
matrices of different n-grams) can improve the accuracy of the results, since it
diversifies the probabilistic matching. For example, one may want to
access texts that emphasize the concept of ‘potential vorticity’ (a 2-gram) rather than
the single words ‘potential’ and ‘vorticity’, which have different meanings and
would possibly return some irrelevant articles.
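In scikit-learn, which n-grams enter the TF-IDF vocabulary is controlled by the ngram_range parameter of TfidfVectorizer; a minimal illustration with an invented sentence:

```python
# How n-grams change the TF-IDF vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["potential vorticity anomalies steer the cyclone"]

# Only 1-grams (n1) vs. 1- and 2-grams (n12): ngram_range=(min_n, max_n).
v1 = TfidfVectorizer(ngram_range=(1, 1))
v12 = TfidfVectorizer(ngram_range=(1, 2))
v1.fit(docs)
v12.fit(docs)

print("potential vorticity" in v1.vocabulary_)   # False: no 2-grams kept
print("potential vorticity" in v12.vocabulary_)  # True: kept as a unit
```

With n12, the phrase ‘potential vorticity’ survives as a single feature instead of being split into two unrelated words.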
I tested
the algorithm on a subject I started working on as a postdoctoral research
fellow: shallow convection and how
it is simulated in climate models. I don’t have extensive knowledge of or experience with shallow
convection in the atmosphere, so what I want is a search that returns articles
containing introductory material. For this purpose, I use the text below
on shallow convection as the target text:
“Shallow convection is what
the model uses to deal with cumulus-topped mixed layers. It assumes the cumuli
are not precipitating, meaning integrated column moisture and heat content are
unchanged, only the vertical distribution is rearranged. In order to account
for the latent heating in the cumuli and the evaporation from air mixing with the
cumuli, the model formulation has to be more complex than simply dry mixing.
Remember, the cumulus are sub-grid scale and the air on average is not
saturated in the cumulus layer.
In nature, the cumulus
transport moisture out of the boundary layer and push the inversion upward,
warming the bottom of the cumulus layer and cooling the top. The model tries to
reproduce these effects. So we expect to see drying in the boundary layer,
moistening above the boundary layer, and deepening of the boundary layer.”
The aim here is to use the above text as the query
(instead of single words in the search field) against the abstract fields of the
44,886 AMS articles, hoping the machine learning algorithm will return articles
related to this text. This text is particularly good for test
purposes because it defines shallow convection in a general sense and
touches on several related atmospheric concepts (e.g., boundary layer,
cumulus, mixed layer, etc.), which should help return the introductory material I
want. Also, no word in the text is overused, e.g., the word
“shallow” appears only once, which poses a challenge to the algorithm in
finding related articles without being too broad.
Since the applied method is an unsupervised learning technique, there is no training dataset with labels indicating the abstracts’ relevance to the target text above. So, I assessed the results subjectively (by actually reading the returned abstracts) in terms of their similarity to the target text. I labelled the articles in the search results using 4 categories:
Excellent Match (EM): Article directly related to shallow convection and its simulation by models.
Good Match (GM): Article not directly related to shallow convection, but containing good introductory knowledge about strongly related subjects such as convection, the boundary layer, etc.
Mediocre Match (MM): Article contains words that are present in the target text but is only weakly related to it in context.
Bad Match (BM): Article contains words that are present in the target text but is irrelevant to its context.
Results
The search results of the algorithm change depending
on the ‘n’ in the n-grams used in the calculation of the TF-IDF matrix. I used
four different combinations of n-grams:
· Only 1-grams (n1)
· 1- and 2-grams (n12)
· 1-, 2-, and 3-grams (n123)
· 1-, 2-, 3-, and 4-grams (n1234)
The table below shows the results for the 10 most similar
articles (ranked by the k-nearest neighbour search) returned by each of
the four combinations.
              n1    n12   n123   n1234
EM             4      3      1       1
GM             5      3      3       3
MM             1      4      4       4
BM             0      0      2       2
% Correct     90     60     40      40

Table 1. Number of returned articles in each match category for the four n-gram combinations.
Let’s consider excellent and good matches (EM and GM) as correct, and mediocre and bad matches (MM and BM) as incorrect, search returns by the algorithm. By this criterion, n1 yields the most accurate results at 90% (9 of the 10 search returns are labelled as either EM or GM). The accuracy of the algorithm decreases with the inclusion of 2-grams (n12) and is worst with n123 and n1234. However, the results show that even though they underperform, n123 and n1234 can still return useful articles. For example, an article labelled EM, “A New Bulk Shallow-Cumulus Model and Implications for Penetrative Entrainment Feedback on Updraft Buoyancy”, which is directly related to shallow convection, does not appear in the results of n1, but it is returned as the 5th most similar article by n12 and the 2nd by both n123 and n1234. Therefore, populating a result set by merging the results from each of the 4 n-gram combinations should yield the most accurate search results.
Merging Results by Simple Averaging
Such merging can be achieved by simply averaging
the rankings of each article in the results. For example, if an article is
returned as the 6th most similar by n1, 2nd by n12, and 4th
by n1234, and doesn’t appear in the results of n123, its final score
can be computed as the average (6 + 2 + 4)/3 = 4. A final list can then be
populated by sorting these average scores from smallest to largest (a smaller
score means a higher ranking). However, this
approach has a caveat: it assigns unduly good scores to articles that appear
in the results of only one or a couple of the n-gram combinations (e.g.,
if an article appears only in the results of n1234 as the 3rd most
similar, its final score will be 3/1 = 3, which places it near the top). The
first 10 articles in the final list populated by this simple averaging method contain
4 EMs and 4 GMs, but also 2 MMs as the 6th and 7th most
similar. Also, a couple of EMs appeared lower in the list, as the 11th
and 12th. We want to keep mediocre matches (MM) out of our results and
improve the scores of the excellent matches (EM) so that they appear higher in
the list. So maybe applying some scaling to the merge will improve the
final list.
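The simple rank averaging can be sketched as follows; the article ids and ranks here are made up for illustration, and a smaller average means a better final ranking:

```python
# Merging the n-gram result lists by simple rank averaging.
# Each dict maps article id -> rank within that combination's result list.
n1    = {"A": 6, "B": 1}
n12   = {"A": 2, "B": 3, "C": 5}
n123  = {"B": 2, "C": 2}
n1234 = {"A": 4, "C": 3}

all_lists = [n1, n12, n123, n1234]
articles = set().union(*all_lists)

# Average an article's rank over only the lists in which it appears.
scores = {
    a: sum(r[a] for r in all_lists if a in r) / sum(a in r for r in all_lists)
    for a in articles
}

# Sort from smallest average rank (most similar) to largest.
final = sorted(scores, key=scores.get)
print(scores["A"])  # (6 + 2 + 4) / 3 = 4.0
```

Note how article "A", which matches the worked example in the text, ends up with the average 4.0.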
Merging Results by Simple Averaging over Scaled Rankings
The averaged results can be improved by scaling them
to weigh the results of each n-gram combination separately. In this shallow
convection search test, Table 1 shows that using only 1-grams (n1) yields the
most accurate similarity results, so we can give the results of n1 precedence
over the other n-gram combinations. Say we scale the n-gram
combinations using the coefficients 1.25, 1.75, 2.5, and 2.5 for n1, n12,
n123, and n1234, respectively; the final score of the example above then becomes:
(6×1.25 + 2×1.75 + 4×2.5) / (1.25 + 1.75 + 2.5) = 3.82
(the coefficient of n123 is left out of the denominator because the article does not appear in its results).
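The scaled score can be computed like this; the ranks are those of the worked example, and the n123 coefficient drops out because the article is absent from that list:

```python
# Scaled rank averaging: weight each rank by its combination's coefficient,
# dividing by the sum of coefficients of only the lists the article appears in.
coeffs = {"n1": 1.25, "n12": 1.75, "n123": 2.5, "n1234": 2.5}
ranks  = {"n1": 6, "n12": 2, "n1234": 4}   # no entry for n123

num = sum(ranks[k] * coeffs[k] for k in ranks)
den = sum(coeffs[k] for k in ranks)
score = num / den
print(round(score, 2))  # 3.82
```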
This time, the first 10 articles in the final list
populated by averaging over the scaled rankings contain 6 EMs, 3 GMs, and 1
MM, which is an improvement over the simple averaging method. The EMs that
weren’t in the 10 most similar articles under simple averaging now appear
as the 6th and the 9th in the final list.
I’d say it is doing an OK job. For the more interested
reader, I list the first 5 articles returned by this method in the Appendix
below. But before that, how about another application, this time with
philosophical texts?
Application to Notre Dame Reviews
A friend of mine who is working on her philosophy PhD
said this kind of search could be useful to her and sent me a recent abstract of hers
so that I could search for similar book reviews in Notre Dame Reviews. The
abstract is a long one, about vulnerability from a feminist point of
view. Here is a sample of it:
“Vulnerability seems to be at the core of several
philosophical discussions, which indicates the breadth and the complexity of
the concept. However, contemporary debates on vulnerability usually underpin a
negative interpretation of the term, focusing on one’s susceptibility to harm
and injury…”
I merged the results using both
the simple averaging and the averaging over scaled rankings, and this time the two gave similar
results (my friend was kind enough to skim the results
and label them with the 4 categories described above). The results are not as
good as in the shallow convection example: among the 10 most similar texts the search
returns, there are 2 excellent, 3 good, 3 mediocre, and 3 bad matches. However,
the number of texts in this case is about ten times smaller than that of
the AMS abstracts used in the previous search, and it is quite possible that there
simply aren’t many book reviews closely related to the specific subject we are looking for.
Besides, in this test I also included another text in the search, again written by my
friend on a very similar subject, to see whether the search would return it.
That text was returned as the 3rd most similar, and the other
excellent match came first, so I’d say the method is performing OK. Moreover, she
said she found some very interesting book reviews as a result of my search, so
I call it a success! The results of this search are also in the Appendix (clicking on each result will take you to the article or the review).
Appendix
Top 5 search results for shallow convection text in AMS abstracts:
- The Significance of Thermodynamic Forcing by Cumulus Convection in a General Circulation Model
- Subcloud-Layer Feedbacks Driven by the Mass Flux of Shallow Cumulus Convection over Land
- Regional Model Simulations of Marine Boundary Layer Clouds over the Southeast Pacific off South America. Part II: Sensitivity Experiments
- A Multiple Mass Flux Parameterization for the Surface-Generated Convection. Part II: Cloudy Cores
- A New Bulk Shallow-Cumulus Model and Implications for Penetrative Entrainment Feedback on Updraft Buoyancy