UEFA Champions League Success vs. Local League Success
It is the 2017–18 soccer season and my team Beşiktaş is writing history in the UEFA Champions League (CL), finishing the group stage as unbeaten leaders (the first Turkish team ever to do so) ahead of Monaco, Porto and Leipzig. But at home, Beşiktaş are underachieving in the Turkish Super League, having finished the first half of the season in 4th place, six points behind the leaders Medipol Başakşehir. This is no good, as only the champions of the Turkish Super League qualify directly for next year’s CL group stage, while the runners-up start from the earlier qualifying rounds. History suggests that a Turkish team is unlikely to get through these early qualifying rounds and make it to the group stage. So why is Beşiktaş doing so well in the CL, but failing to achieve glory at home? There are likely to be many interconnected variables at play here. Industrial soccer, with its professional competitions in the information age, is a complicated phenomenon; as a wise man once said, soccer resembles life itself. The common view is that the management and the team are investing so much physical and mental energy in the CL that they have difficulty focusing on local league games. In light of this, I analysed historical data to investigate the relationship between the local league and CL performances of the teams that took part in the CL and, indeed, there seem to be some interesting correlations at play.
I looked at all the teams which played in the CL between 2003–04 and 2016–17. I started with the 2003–04 season because this was when the current tournament structure was introduced: a single group stage (32 teams) followed by a knockout stage starting from the round of 16 and advancing all the way to the final. I wrote a web scraper to read the Wikipedia page of each CL season and get the advance level of each team (final, semi-final, quarter-final etc.) in each competition. Then, for each team, I wrote another web scraper to read the final standings from the Wikipedia pages of their local league seasons. Finally, I compared the CL advance levels and the corresponding final local league standings to investigate a possible relationship between the two variables, stratifying the data where necessary (as will be detailed in the rest of this blog).
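As a rough illustration of the scraping step, here is a minimal sketch (not the code used for this post). It assumes the HTML of a page has already been downloaded as a string, and it pulls the cell text out of every table row using only Python’s standard library. The `TableRowParser` class, the `parse_table_rows` helper and the sample HTML are all hypothetical; real Wikipedia tables would need more cleaning (footnotes, merged cells and so on).

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table_rows(html):
    """Return every table row in the HTML string as a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    return parser.rows


# Hypothetical, heavily simplified league table (not real Wikipedia markup).
SAMPLE_HTML = """
<table>
  <tr><th>Pos</th><th>Team</th><th>Pts</th></tr>
  <tr><td>1</td><td><a href="#">Team A</a></td><td>77</td></tr>
  <tr><td>2</td><td><a href="#">Team B</a></td><td>73</td></tr>
</table>
"""

rows = parse_table_rows(SAMPLE_HTML)
header, body = rows[0], rows[1:]
# Map each team name to its final league position.
standings = {team: int(pos) for pos, team, _pts in body}
print(standings)
```

The same row-by-row idea would apply to the CL season pages, with the advance level read from the knockout and group tables instead of the position column.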
I will try to keep the technicalities of the analysis to a minimum in this blog, but occasionally there will be some details that might appeal to readers who have prior knowledge of statistics. In such cases, I provide references in case you want to learn more about the techniques I employ; however, they are not necessary to follow the story. All the web scraping and the subsequent data analysis were done using Python 3.7.
Abstract (translated from Turkish)
This data analysis compares the CL performances and the local league performances of all the teams that took part in the UEFA Champions League (CL) over the 14 years between the 2003-2004 and 2016-2017 seasons. The main motivation for the analysis is that Beşiktaş, alongside its historic success in the 2017-2018 Champions League, has performed poorly in the Turkish Super League compared with the previous two years, both of which ended with the championship (that teams struggle to concentrate on their local leagues, both physically and mentally, during a CL campaign is a popular topic of debate). I tried to uncover the relationship between CL and local league performances, if there is one, by comparing the teams’ end-of-season placings in the two competitions using correlation analysis. Put simply: a positive correlation can be seen as evidence that the more successful teams are in the CL, the more successful they are in the local league. A negative correlation indicates the exact opposite (a decline in the local league standing as more CL rounds are passed).
Looking at the whole database I used for the analysis, a positive correlation is observed; that is, at first glance it looks as if a team’s increasing success in the CL is reflected positively in the local league too. This is a misleading view, because the full database is a mixture of big, experienced clubs (Real Madrid, Arsenal, Man Utd. etc.) and modest, inexperienced teams (BATE Borisov, Ludogorets etc.), and, because the big teams attend the CL relatively more often, the overall statistics are dominated by their performance. For example, it can be taken as expected that a team like Barcelona, having the squad and the form to reach the CL final, also performs well in La Liga, but does the same hold for modest teams? Or do teams with little CL experience suffer a drop in performance because their concentration on the local league fades as they progress through the CL rounds (as is widely debated in the case of Beşiktaş)? To understand this, I ran the correlation analysis after splitting the database into subsets based on the teams’ CL experience. I used two separate definitions of experience for this analysis:
1) The number of CL attendances.
2) The highest level reached in the CL.
When the database is split along these lines, there is a steady decline in the correlations, from positive to negative, going from the experienced teams to the inexperienced ones (third and fourth figures). This result can be seen as evidence that a team’s capacity to handle the CL and the local league at the same time, and with equal success, is related to that team’s CL experience. The results also show that my two definitions of experience are not independent of each other. The fifth figure compares the highest CL level reached by the teams in the database with their average number of CL attendances, and it shows that, going from last place in the group to the final, the average number of attendances rises from about 1 to 10.2 (in other words, in the current 14-year database, the teams that reached the CL final at least once have an average of 10.2 CL attendances; they were there in roughly 10 of the 14 years).
Besides CL experience, I also ran the correlation analysis with the database split by country, and the results were similar to those of the experience analysis. The sixth figure shows that the highest positive correlations belong, in order, to England, Spain, Italy, Germany, France and Portugal, that is, the six big leagues. As league quality decreases, the correlations decrease too, and negative correlations are observed starting with Austria and going down to Kazakhstan in last place.
So, given that Beşiktaş reached the last 16 of the CL in the 2017-2018 season, can all these results help us predict where it will finish in the Turkish Super League? First of all, success in soccer depends on many parameters, so a prediction based on CL success alone is hardly possible. The low correlations found throughout this analysis (for example, the correlations in Table 1, Figure 3 and Figure 4 peaking at around 0.2-0.3) are an indication of this. Still, in order to develop a prediction algorithm, I added 3 new parameters to the analysis (the club’s current squad value, the club’s age and the club’s stadium capacity) and built a machine learning application known as Random Forest. The aim of this machine learning algorithm is to determine where a team will finish the season in its local league, using the known values of these parameters. I trained the algorithm on roughly two thirds of the whole database and then had it predict the local league standings in the remaining third. As a result, only 23% of the predictions were correct (slightly better than random). So, despite the extra parameters, it is not possible to determine a soccer team’s local league performance, and this unpredictable nature of soccer is seen as one of the biggest reasons for its popularity.
However, in light of this analysis, it would not be wrong to say that Beşiktaş having attended the CL only 4 times in the last 14 years, and having gone no further than third place in the group during those attendances, reduces its chances of becoming Turkish Super League champions at the end of this season, in which it reached the last 16. This is because the teams with 4 or fewer attendances in Figure 3b, and the teams that went no further than third place in the group in Figure 4b, show negative correlations. Given all this, my personal prediction is that our Beşiktaş will not get past the last 16 in the CL and will, unfortunately, finish the league in second or third place at best.
Preliminary Analysis
Analysis of 14 years of CL competition is done with a
focus on group stage onwards. The analysis consists of two main approaches:
- Data sample populated by including all final local league standings (except when relegated) of each team for every year during the analysis period. This means the inclusion of the final league standing of a team even when it didn’t make it to the CL group stage in that year. Hereafter, this sample will be referred to as ‘full’.
- Data samples populated by including the local league standings of each team only when they attended the CL group stage and onwards. Hereafter, this sample will be referred to as ‘limited’.
Over the 14 years, 32 teams make up a sample size of 14x32
= 448 for the limited sample.
The full sample number goes up to 1426 with relegations excluded. The histograms
and the distributions of the final league standings and the CL advance levels of
the teams in both the full and the limited samples are given in Figure 1.
Figure 1: The final league standing and the CL advance level histograms
for the (a, c) full, and (b, d) limited analysis approaches. The league
standing histograms are overlaid with the probability density functions
estimated using Gaussian kernel density estimation.
The CL advance level is a categorical variable where:
- NA: Didn’t Attend
- G4: 4th Place in the Group Stage
- G3: 3rd Place in the Group Stage
- R16: Round of Sixteen
- QF: Quarter Final
- SF: Semi Final
- F: Final
The final league standings exhibit a geometric
distribution skewed towards the top of the league table. Of course, these are
some pretty strong teams; strong enough to compete in the CL (151 of the 448
topped their league, and 103 came second within the limited sample).
Most of the elements in the full sample are non-attendances (NA), as can be seen in Figure 1c; i.e., there are many teams which attended the CL only once or a few times during the 14 years of analysis. The histogram of CL advance levels for the limited sample (Figure 1d) is a zoomed-in version of Figure 1c without the NA. Figure 1d shows that, among the teams which attended the CL during the analysis period, the number that advanced to the round of 16 almost equals the number that finished 4th, and the number that finished 3rd, in the group stage. The number of advances to the QF and onwards, of course, decreases steadily. Before looking at some distributions of CL advance level versus final league standing, here are some of the facts that came out of this preliminary analysis which I found interesting:
Some Facts from the Preliminary Analysis:
- A total of 31 countries were represented in the CL competition between 2003 and 2016.
- Spain was represented by the highest number of teams (12), followed by Germany (9), Italy and France (8) and England (7).
- A total of 107 teams attended the CL competition between 2003 and 2016. During this period, 26 of these teams were either relegated from their local league (Monaco is one notable example among the relegated teams) or promoted to the top league of their country.
- 37 of the 107 teams had the chance of competing in the CL only once between 2003 and 2016.
- Belarus (BATE Borisov), Croatia (Dinamo Zagreb), Norway (Rosenborg), Sweden (Malmö), Hungary (Debrecen), Poland (Legia Warsaw), Serbia/Serbia and Montenegro (Partizan), Slovenia (Maribor) and Kazakhstan (Astana) were each represented by only one team. Among these teams, Legia Warsaw, Debrecen, Partizan, Maribor and Astana attended the competition only once between 2003 and 2016. BATE Borisov attended five times (i.e., more often than Beşiktaş and Fenerbahçe of Turkey, with four times each). Although Austria sent two teams (Austria Wien and Rapid Wien) to the competition, they each attended only once (2013 and 2005 respectively), making Austria one of the most underrepresented countries in the competition between 2003 and 2016.
- Arsenal and Real Madrid attended all 14 competitions. They are followed by Barcelona, Bayern Munich, Chelsea and Porto with 13 attendances.
- The teams which had the worst final league standings (15th or lower) are:
  - Celta Vigo, 19th (2003 – CL advance level = R16)
  - Villarreal, 18th (2011 – CL advance level = G4)
  - Real Sociedad, 15th (2003 – CL advance level = R16)

  …La Liga is tough! In fact, the worst official finish is by Juventus in the 2005–06 Serie A season (20th), while they advanced to the QF in the CL. However, this was due to the penalty they received after a match-fixing scandal was uncovered (they actually finished 1st in the league that year).
- The teams who have reached the CL final AND finished the league 1st in the same season are:
  - Barcelona (2005, 2008, 2010, 2014)
  - Manchester United (2007, 2008, 2010)
  - Bayern Munich (2009, 2012)
  - Juventus (2014, 2016)
  - Internazionale (2009)
  - Real Madrid (2016)
  - Porto (2003)
  - Atletico Madrid (2013)
Back to visuals. Here is the so-called violin plot of
the league standings (y-axis) vs. the CL advance levels (x-axis):
Figure 2: The violin plot of the final league standings versus the CL
advance levels derived from the full
sample.
So, this nice and colourful plot shows a first-order relationship between the two variables, with some statistics on top. It gives an idea of what the final league standings of the teams look like if they didn’t attend the CL, or if they did but only advanced as far as the R16, and so on. Each of the coloured shapes in Figure 2 (supposedly looking like a violin, but in this case they really look like a kemençe, so I will refer to them as kemençe plots) shows the distribution for one category. All the category (CL advance level) distributions in Figure 2 resemble the geometric distribution apparent in Figures 1a and 1b, only sideways and symmetric, thus forming the shape of an upright string instrument. The thick black lines in the middle of each kemençe show the interquartile ranges, while the white dots show the median final league standing for each category. We can see that the median final league standing for the teams which didn’t attend the CL (NA) is 4, whereas it is 2 for the teams which advanced to any level other than the final (F). The mighty finalists usually finish their leagues as champions (thus the median value of 1). We also see that the distribution of the finalists bulges towards the lower numbers in Figure 2 (the top of the league table), with a short tail pointing upwards (the shorter the tail, the higher the confidence in the median values and the ranges shown on the kemençe). So, this means that the CL finalists are likely to finish at the top of their leagues, whereas the final standings of the rest (that is, all the NA, G4, G3, R16, QF and SF) can vary between the top and the bottom, with a higher likelihood towards the top.
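The white dots on the kemençe are simply per-category medians. Here is a minimal sketch of that grouping step; the records are made up for illustration and the function name `median_standing_by_level` is my own, not something taken from the original analysis.

```python
from collections import defaultdict
from statistics import median

# CL advance levels in the order used on the x-axis.
LEVELS = ["NA", "G4", "G3", "R16", "QF", "SF", "F"]

# Made-up (CL advance level, final league standing) records, for illustration only.
records = [
    ("NA", 3), ("NA", 4), ("NA", 6),
    ("G4", 1), ("G4", 2), ("G4", 5),
    ("R16", 2), ("R16", 2), ("R16", 3),
    ("F", 1), ("F", 1), ("F", 2),
]


def median_standing_by_level(records):
    """Group the standings by CL level and return the median of each group."""
    groups = defaultdict(list)
    for level, standing in records:
        groups[level].append(standing)
    # Keep the levels in plotting order and skip the ones with no data.
    return {lvl: median(groups[lvl]) for lvl in LEVELS if lvl in groups}


medians = median_standing_by_level(records)
print(medians)
```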
So, we should be sufficiently confident that a CL finalist will finish their league in 1st or 2nd place, or at least 3rd place. Or should we? So far, both figures I’ve shown demonstrate what we can already guess: the further a team progresses in the CL competition, the stronger they are, and thus the more likely they are to finish at the top of their local league.
Well,
this doesn’t really agree with my initial motivation of Beşiktaş doing really badly
in the local league because of their success in the CL. Before jumping to quick
conclusions and claiming that my team sucks in the Turkish Super League because
the players (such as Ricardo Quaresma) just don’t care about local league
anymore, we should look at the data in more depth and detail. The results so
far are only out of the bulk data, so next I will look at some correlations
over specific stratifications of both the full and the limited samples.
Analysis of the Correlation between CL Advance Levels and the Final League Standings
Here, I will investigate whether there is any correlation between two variables, namely, CL advance levels and final league standings. If so, is it a positive correlation (i.e., the higher the CL advance level, the higher the final league standing) or a negative one? Before delving into the results, I would like to present some information about the correlation metrics I used in this analysis. First, I assigned monotonic values to the CL levels, such that NA=7, G4=6, G3=5, …, F=1, to convert this categorical/ordinal data to numerical values. Then I calculated Spearman’s rank correlation coefficient to quantify any correlation between the two variables. This statistic is suitable for this case, as it is a nonparametric measure of rank correlation, which is more reliable for non-normal distributions (as is the case in Figures 1a and 1b).
The conversion of the categorical CL level data to monotonic numerical values has some caveats, though. For example, the difficulty of reaching the final (F, value = 1) is probably exponentially, not linearly, higher than that of reaching, say, the quarter final (QF, value = 3). However, the same can be said of the final league standings, which are already monotonic (1 to 20), thus making it a fair comparison. I played with the conversion, changing it to weighted values etc., to see if it makes a difference to the results, and didn’t observe any significant changes. For the sake of completeness, I also added the Pearson correlation coefficient, which is a measure of linear correlation between two variables and, as you will see next, Pearson and Spearman agree on the final correlations.
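To make the encoding and the two statistics concrete, here is a minimal pure-Python sketch, assuming average ranks for ties (the usual convention). The sample data is made up and the helper names are mine. In practice, `scipy.stats.spearmanr` and `scipy.stats.pearsonr` would be the usual choice, and they also return the p values, which this sketch does not compute.

```python
from math import sqrt

# Numeric codes for the CL advance levels, as described in the text (F=1 ... NA=7).
LEVEL_CODE = {"F": 1, "SF": 2, "QF": 3, "R16": 4, "G3": 5, "G4": 6, "NA": 7}


def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up sample: (CL advance level, final league standing).
sample = [("F", 1), ("SF", 2), ("QF", 1), ("R16", 3), ("G3", 2), ("G4", 5), ("NA", 4)]
cl_codes = [LEVEL_CODE[level] for level, _ in sample]
standings = [standing for _, standing in sample]

print(round(spearman(cl_codes, standings), 2), round(pearson(cl_codes, standings), 2))
```

With this encoding, a low code (deep CL run) paired with a low standing (top of the table) gives a positive coefficient, which is why "positive" here reads as "good in the CL, good at home".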
Let’s
look at some numbers and try to understand what they mean in relation to the
question at hand.
| Statistic | Full Sample | Limited Sample |
|---|---|---|
| Spearman’s correlation coeff. | 0.28 | 0.08 |
| Spearman’s p value | 0.00 | 0.07 |
| Pearson’s correlation coeff. | 0.24 | 0.12 |
| Pearson’s p value | 0.00 | 0.01 |
Table 1: Spearman’s and Pearson’s correlation coefficients and their corresponding p values for the full and limited samples.
The correlation coefficient tells us how much correlation exists, and the p value tells us how much confidence we can have in the coefficient: it is the probability of seeing a correlation at least this strong by chance if the two variables were in fact unrelated (so the lower the p value, the higher the confidence). In statistics, a p value less than or equal to 0.05 is usually accepted as the threshold for sufficient confidence (the 95% level) in the coefficient. In Table 1, the coefficient is almost zero (0.08/0.12) for the limited sample and a bit higher (0.28/0.24) for the full sample. The p values are very low for the full sample, indicating high confidence; for the limited sample, Pearson’s p value (0.01) is below the threshold, while Spearman’s (0.07) is slightly above it. A perfect positive correlation would be 1, which in our case would mean that the higher up a team progresses in the CL, the higher up they finish in the local league, in perfect lockstep (and the opposite for a perfect negative correlation of -1). In both the limited and the full samples, we have positive correlations; that is, good performance in the CL tends to go with good performance in the local league (something Beşiktaş is failing to do so far). The fact that the correlation for the full sample is higher can be attributed to the inclusion of the non-attendances to the CL (NA), which comprise a very large portion of the full sample (Figure 1c). This creates a larger contrast between the local league standings and the CL levels. The low values all make sense: after all, the CL advance level is definitely not the only influence on a team’s final local league standing. There are, of course, many other parameters at play, such as a club’s wealth, the difficulty of the league etc. And here, in both samples, we have a mixture of beasts like Real Madrid and Manchester United along with underdogs such as Artmedia, Zilina etc.
If you follow the CL at all, you will know that experience is everything in that competition. It is imperative for a team to have competed in the CL for a while, to understand the atmosphere and get used to what it is like to play against the top teams in the world, before being able to even think about advancing towards, say, the QF. And advancing even higher is usually reserved for select teams such as Barcelona, Real Madrid, Bayern Munich, Liverpool and a few others which UEFA loves (!!!). But let’s assume for a moment that we are living in a fair world and that there is total meritocracy in the CL. So, next I will investigate how CL experience affects the correlation between the CL advance levels and final league standings. I will look at this problem from two different angles by defining ‘CL experience’ in two different ways:
- The number of seasons a team attended the CL competition (Quantity)
- The highest level a team advanced to during a CL season (Quality)
The first definition (quantity) assumes that the more a team has been involved in the competition and listened to that CL music before the games, the more they will be used to competing in both the CL and the local league at the same time. The second definition (quality) assumes that the higher a team ascends the mountain, the more they will be attuned to the atmosphere of the CL and the less they will be consumed by its glitter while competing in the local league.
CL Experience: Quantity
Here, I stratified the teams in both the full and the limited samples by the number of their CL attendances, cumulatively (i.e., teams with 14 or fewer attendances, then 13 or fewer, 12 or fewer and so on, down to 1). Then, I calculated correlation coefficients and their corresponding p values for each stratum. Additionally, I removed the teams with small standard deviations (less than 1.5) in their league standings, to eliminate teams such as Celtic, Olympiacos, BATE Borisov, Ludogorets Razgrad etc., which are consistently placed 1st or 2nd in their local leagues regardless of their performance in the CL and thus contaminate the signal I am looking for in the correlation analysis. The results are in Figure 3.
Figure 3: The Spearman’s (blue lines) and Pearson’s (green lines)
correlation coefficients and their corresponding p values (dashed lines)
between the CL advance levels and the final league standings with respect to
the CL experience: attendance over (a) the full sample and (b) the limited
sample.
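The stratification behind Figure 3 can be sketched as follows, for the limited sample only. This is a minimal illustration under assumptions of my own: the sample standard deviation is used, teams with a single season are kept (a standard deviation cannot be computed for them), and the records and the `stratify_by_attendance` name are made up. A correlation coefficient would then be computed on each stratum.

```python
from collections import defaultdict
from statistics import stdev

MIN_STD = 1.5  # threshold quoted in the text

# Made-up limited-sample records: (team, CL level code, final league standing).
records = [
    ("Alpha", 4, 1), ("Alpha", 3, 4), ("Alpha", 6, 2), ("Alpha", 5, 6),
    ("Bravo", 6, 1), ("Bravo", 5, 1), ("Bravo", 6, 2),  # always near the top
    ("Charlie", 6, 3), ("Charlie", 4, 7),
    ("Delta", 5, 2),
]


def stratify_by_attendance(records, max_seasons=14):
    """Return {k: rows of all kept teams with at most k CL attendances}."""
    by_team = defaultdict(list)
    for team, level, standing in records:
        by_team[team].append((level, standing))

    kept = {}
    for team, rows in by_team.items():
        standings = [s for _, s in rows]
        # Assumption: single-season teams are kept; others need enough spread.
        if len(standings) < 2 or stdev(standings) >= MIN_STD:
            kept[team] = rows

    strata = {}
    for k in range(max_seasons, 0, -1):
        # Cumulative stratum: every kept team with k or fewer attendances.
        strata[k] = [row for rows in kept.values() if len(rows) <= k for row in rows]
    return strata


strata = stratify_by_attendance(records, max_seasons=4)
for k, rows in strata.items():
    print(k, len(rows))
```

In this toy data, Bravo is dropped because its standings barely vary, which mirrors the reason given above for removing the perennial champions of smaller leagues.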
Finally, the data starts telling something. The CL experience of the teams is highest at the leftmost side of Figures 3a and 3b, decreasing towards the right. Figure 3 shows how the correlation between CL advance level and final local league standing changes with diminishing CL experience: the correlations diminish as well! The correlation never goes negative when we look at the full sample (Figure 3a), but it does in the limited sample (Figure 3b), indicating that the less experienced a team is, the more poorly they are likely to perform in their local league as they advance higher in the CL. The p values increase towards the low-experience side in both Figures 3a and 3b, due to the reduction of the sample size as the cumulative experience level goes down. However, the consistent decrease in the correlation coefficients suggests the analysis is reliable. Besides, the p values start heading down again after experience level 6 in Figure 3b, indicating that even though the sample size is smaller, the negative correlation becomes clearer, and so the confidence in the results increases. One question that pops up in my head looking at Figure 3 is: why do the correlation coefficients go all the way down to -0.4 in the limited sample, whereas they never drop below 0.0 in the full sample? This may be attributed to the long streaks of non-attendance by a significant number of teams in the full sample, introducing an irrelevant relationship to the final results (that is, irrelevant to the CL advance level vs. final local league standing). The limited sample only contains data from the seasons in which a team attended the CL, thus giving a more accurate view of the correlation I am looking for. My main motivation for including the full sample in my analysis was to see if there is a clear effect of non-attendance (i.e., if the teams were performing significantly better in their local leagues when they didn’t attend the CL). However, it seems hard to discern such information from this analysis, due to the possible introduction of the irrelevant relationships I mentioned above.
CL Experience: Quality
In
the case of quality, I looked for the maximum CL level reached by teams and
stratified accordingly. For example, if a team attended CL only twice during
the 14-year analysis period, but reached the quarter final in one of them, that
team goes to the QF bin. The results are in Figure 4.
Figure 4: Same as Figure 3 but
with respect to the CL experience: highest advance level.
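The binning by highest level reached can be sketched like this. It is a minimal illustration with made-up season records; the function names are hypothetical, and the bins here are simple (not cumulative), which is my reading of the description above.

```python
# Numeric codes as in the correlation analysis: a lower code means a deeper CL run.
LEVEL_CODE = {"F": 1, "SF": 2, "QF": 3, "R16": 4, "G3": 5, "G4": 6}

# Made-up season records: (team, CL advance level in that season).
seasons = [
    ("Alpha", "G4"), ("Alpha", "QF"),
    ("Bravo", "G3"), ("Bravo", "G4"), ("Bravo", "G3"),
    ("Charlie", "R16"),
]


def best_level_by_team(seasons):
    """Return the highest CL level each team reached across all its seasons."""
    best = {}
    for team, level in seasons:
        if team not in best or LEVEL_CODE[level] < LEVEL_CODE[best[team]]:
            best[team] = level
    return best


def bin_seasons_by_best_level(seasons):
    """Put every season of a team into the bin of that team's best level."""
    best = best_level_by_team(seasons)
    bins = {}
    for team, level in seasons:
        bins.setdefault(best[team], []).append((team, level))
    return bins


best = best_level_by_team(seasons)
bins = bin_seasons_by_best_level(seasons)
print(best)
print(bins)
```

Alpha attended only twice but reached the quarter final once, so both of its seasons land in the QF bin, as in the example in the text.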
What we see in Figure 4 is quite similar to what we saw in Figure 3, and this time the decrease in the correlation coefficient is more linear and sharper for the limited sample (Figure 4b). The correlation becomes negative below the QF level, suggesting that a team which has only reached the round of 16 or lower during the analysis period is likely to struggle in their local league if they advance to higher levels in the CL. You might notice there is no reported value for G4 in the limited sample. That is because there is no variability in the sample for the lowest advance level (it only contains G4), so it does not yield a correlation. It is possible, however, to calculate a correlation coefficient for G4 in the full sample (Figure 4a), since it contains two categories, namely G4 and NA. However, having only two categories means weak variability, which causes a bump in the p value for G4 in Figure 4a. The p values for the limited sample reach and exceed 0.6 for SF and QF, reducing the reliability of the statistics for these two levels. However, the consistency of the monotonic decrease in the results is convincing enough (for me, at least) that they have value.
As a matter of fact, it is no surprise that the
results of the quantity (Figure 3) and the quality (Figure 4) analyses are
similar, since these two approaches are not independent of each other. To
demonstrate that, I calculated the average CL attendances for each of the CL
advance level categories and plotted them (in other words, I plotted the
quantity versus quality).
Figure 5:
CL advance level vs. the average CL attendance
Indeed,
there is a very strong relationship between the experience of a team in the CL
and the highest level it advances to. Figure 5 demonstrates that the average
attendance for the finalists between 2003–2016 is 10.2, that is, they attended
more than 10 of the 14 competitions on average. This number is 1.3 for the
teams who didn’t go higher than placing last in the initial group stage (G4).
You can see from Figure 5 that you might want your team to be in the CL at
least 5 out of 14 times to see the quarter finals or higher. Did you also
notice the abrupt jump from around 5.9 to 10.2, the average CL attendances for
SF and F respectively? Crunch time is when you need experience the most.
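The quantity-versus-quality comparison in Figure 5 boils down to a grouped average. A minimal sketch, with made-up teams and a hypothetical function name:

```python
from collections import defaultdict

# CL advance levels from the lowest to the highest.
LEVELS = ["G4", "G3", "R16", "QF", "SF", "F"]

# Made-up teams: name -> (highest CL level reached, number of CL attendances).
teams = {
    "Alpha": ("F", 12), "Bravo": ("F", 8),
    "Charlie": ("QF", 5), "Delta": ("QF", 3),
    "Echo": ("G4", 1), "Foxtrot": ("G4", 2),
}


def mean_attendance_by_best_level(teams):
    """Average the number of CL attendances over teams sharing a best level."""
    counts = defaultdict(list)
    for best_level, attendances in teams.values():
        counts[best_level].append(attendances)
    return {lvl: sum(counts[lvl]) / len(counts[lvl]) for lvl in LEVELS if lvl in counts}


averages = mean_attendance_by_best_level(teams)
print(averages)
```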
Correlation Coefficient by Country
I
also wanted to see what the correlation coefficients look like if we focus on
the data stratified by country. I did that analysis over the full sample to
make use of the larger dataset and to see, unlike the analysis over CL
experience, if negative correlations exist for some countries even in the full
sample.
Figure 6: Spearman’s rank correlation coefficients stratified by country using the full sample. The white numbers on the bars indicate the p value of the statistic for the corresponding country.
And yes, we do see negative correlations for some countries over the full sample, suggesting a strong country signal in the correlations. In other words, the strength of the local league has an impact on how a team performs while also competing in the CL. Or should we say popularity instead of strength? Money? Maybe experience, in line with the findings of my experience analysis. As you can see in Figure 6, the top six leagues (England, Spain, Italy, Germany, France and Portugal) are conveniently placed in front of the remaining bunch. They have the highest positive correlation coefficients (in the vicinity of 0.5–0.6) with p values of 0.0 (the white numbers on the bars), indicating very high confidence in the results. There are also some seemingly unusual countries such as Ukraine, Greece, Scotland, and Turkey (!!) showing up among the ivy league. But if you look closely, their relatively high correlation coefficients are accompanied by relatively higher p values, reducing the reliability of their placement (other than Greece, which seems to do OK). Overall, as the league quality goes down, the correlation coefficients go down too, all the way to around -0.4 for Kazakhstan.
What happens to Beşiktaş, then?
Going back to the motivation for all this: what will happen to Beşiktaş in the Turkish Super League now that they have advanced to the R16 in the 2017–18 CL season? Can we give a confident answer to this question in light of the above analysis? The answer is, of course, no, because I only looked at one small, possible effect on a soccer team’s success, that is, its CL performance. As I mentioned before, a team’s success depends on many other variables, such as squad value. In fact, this may be why the correlation coefficients were capped at around 0.3 in the experience analysis (Figures 3 and 4b), meaning there is no strong correlation in general between CL performance and final league standings. The informative aspect of this analysis is not the correlation coefficient itself, but how it changes with team experience and with the country/league. I think I managed to demonstrate, quantitatively, how important experience is in the CL, especially if you are not in a strong league (i.e., one of the top leagues in Figure 6). The more experienced you are in the CL, the better you know how to compete in both your local league and the CL in the same season. So, although Beşiktaş is from Turkey, a country which seems to rank among the stronger ones (Figure 6), they lack experience, with only 4 attendances in 14 competitions and nothing beyond G3 during this period. If we were to plug these numbers into Figures 3b and 4b, we would see negative correlations for both attendance and highest advance level. And we are doing great in the CL this year, having reached the R16 already! Does that mean we, as fans, should forget about being the champions of the Turkish Super League this year? Well, of course I refuse to think so, and I will refuse no matter what comes out of any kind of analysis. As a proper Beşiktaş fan, I firmly believe that Beşiktaş is the biggest and grandest soccer club on earth (yes, even bigger than Real Madrid or A.C. Milan), but I can still look at some more data and try to get some predictions about the end of the season. To make up for the variables missing from the analysis so far, I wrote another web scraper, this time for Transfermarkt, to read some characteristic data for each team, this time unrelated to the CL.
The additional pieces of information I read from the Transfermarkt website were: the current squad value (as a measure of club wealth), the year the club was founded (as a measure of club tradition) and the number of seats in the stadium (as a measure of spectator support). I added these three variables to the three I already had (country, CL advance level and CL experience, i.e. the number of attendances) and formed a 6-variable predictor list to see if I could predict a team’s final league standing. To do so, I used a machine learning approach called random forest. Curious readers can look at the references to understand the details of these methods, but in summary, what we do with machine learning is train some smart algorithms with historical data (in this case the full sample I have) and use the trained algorithms to predict future states. To quantify how well the algorithm is doing, the historical data is separated into a training set (to train the algorithm) and a test set (to test the algorithm). In my case, I used 1000 of the 1426 entries in the full sample for training and the remaining 426 for testing (the statistical details of this analysis are not given here). As a result, my random forest classifier performed badly, with an abysmal prediction score of 23%. This means that, of the 426 entries I used for testing, the algorithm predicted only 23% of the final league standings correctly. This is another indication of how unpredictable soccer results are. The 6 predictors simply aren’t enough to confidently predict a team’s long-term performance. In fact, we usually can’t even predict a match score when we make our pick after half-time! This is just one of the reasons why soccer is the most popular sport in the world: the element of surprise.
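The evaluation procedure (not the random forest itself) can be sketched as below. This is a minimal illustration on synthetic data: a trivial "always predict the most common standing" model stands in for the classifier, and the helper names and the random seed are my own. In practice something like scikit-learn’s `RandomForestClassifier` would take the place of `MajorityBaseline`.

```python
import random
from collections import Counter


def train_test_split(rows, n_train, seed=0):
    """Shuffle the rows reproducibly and cut them into train and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true value exactly."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


class MajorityBaseline:
    """Stand-in model: always predicts the most common training label."""

    def fit(self, features, labels):
        self.most_common_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        return [self.most_common_ for _ in features]


# Synthetic rows, for illustration only: (features, final league standing).
rng = random.Random(42)
rows = [((rng.random(),), rng.randint(1, 18)) for _ in range(1426)]

# Same split sizes as in the text: 1000 for training, 426 for testing.
train, test = train_test_split(rows, n_train=1000)
model = MajorityBaseline().fit([f for f, _ in train], [y for _, y in train])
score = accuracy(model.predict([f for f, _ in test]), [y for _, y in test])
print(len(train), len(test), round(score, 2))
```

The prediction score quoted above is this kind of accuracy: the share of test entries whose final standing was predicted exactly.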
In summary, although I can’t make a solid prediction of where Beşiktaş will end up in the table at the end of the season, the statistics say that the inexperience of the team and its currently high advance level in the CL will hurt their chances of being the champions this season. My terrible random forest classifier predicted 2nd place for Beşiktaş, provided we don’t beat Bayern Munich in the R16 and advance to the QF. If we do, the classifier, thinking Turkey is one of the strong countries in the competition (Figure 6), predicts us finishing at the top of the league. But again, with a 23% success rate, these are not at all confident predictions. So, we will keep our fingers crossed, hope Beşiktaş will up their performance in the second half, and see what happens.
Thanks
for reading (though I seriously doubt anyone made it this far in this post).