It is the 2017–18 soccer season and my team Beşiktaş is writing history in the UEFA Champions League (CL), finishing the group stage as unbeaten leaders (the first Turkish team ever to do so) ahead of Monaco, Porto and Leipzig. But at home, Beşiktaş are underachieving in the Turkish Super League, having finished the first half of the season in 4th place, six points behind the leader Medipol Başakşehir. This is no good, as only the champions of the Turkish Super League qualify directly for next year’s CL group stage, while the team that finishes second starts from the earlier qualifying rounds. History suggests that a Turkish team is unlikely to get through these early qualifying rounds and make it to the group stage. So why is Beşiktaş doing so well in the CL, but failing to achieve glory at home? There are likely to be many interconnected variables at play here. Industrial soccer, with its professional competitions in an information age, is a complicated phenomenon; as a wise man once said, soccer resembles life itself. The common view is that the management and the team are investing so much physical and mental energy in the CL that they have difficulty focusing on local league games. In light of this, I analysed historical data to investigate the relationship between the local league and CL performances of the teams that took part in the CL and, indeed, there seem to be some interesting correlations at play.

I looked at all the teams which played in the CL between 2003–04 and 2016–17. I started with the 2003–04 season because this was when the current tournament structure was introduced: a single group stage (32 teams) followed by a knockout stage that runs from the round of 16 all the way to the final. I implemented a web scraping algorithm to read the Wikipedia page of each CL season and extract the advance level of each team (final, semi-final, quarter-final etc.) in each competition. Then, for each team, I implemented another web scraping algorithm to read the final standings from the Wikipedia pages of their local league seasons. Finally, I compared the CL advance levels and the corresponding final local league standings to investigate a possible relationship between the two variables, making use of suitable data stratifications when necessary (as will be detailed in the rest of this blog).
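To give a feel for the scraping step, here is a minimal sketch that pulls rows out of an HTML table using only the Python standard library. It is not the code used for this analysis: the `TableRowParser` class, the `SAMPLE_HTML` snippet and the team names are made up for illustration, and real Wikipedia tables have richer markup (merged cells, footnotes) that would need extra handling.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical stand-in for a league table; not real data.
SAMPLE_HTML = """
<table>
  <tr><th>Pos</th><th>Team</th><th>Pts</th></tr>
  <tr><td>1</td><td><a href="#">Team A</a></td><td>77</td></tr>
  <tr><td>2</td><td><a href="#">Team B</a></td><td>75</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
header, *body = parser.rows

# Map each team to its final league standing.
standings = {team: int(pos) for pos, team, _pts in body}
print(standings)
```

In practice the HTML string would come from downloading each season's page, and one such table would be parsed per league and season.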

I will try to keep the technicalities of the analysis to a minimum in this blog, but occasionally there will be some details that might appeal to readers who have prior knowledge of statistics. In such cases, I provide references in case you want to learn more about the techniques I employ; however, they are not necessary to follow the story. All the web scraping and the subsequent data analysis were conducted using Python 3.7.

Abstract (translated from Turkish)

This data analysis compares the CL performances and the local league performances of all teams that took part in the UEFA Champions League (CL) over the 14 years between the 2003-2004 and 2016-2017 seasons. The main motivation for the analysis is that, alongside its historic success in the 2017-2018 Champions League, Beşiktaş has performed poorly in the Turkish Super League compared with the previous two years, both of which ended in championships (that teams struggle to concentrate on their local leagues, both physically and mentally, during a CL campaign is a popular topic of debate). I tried to uncover the relationship, if any, between CL and local league performance by comparing the teams’ end-of-season positions in the two competitions using correlation analysis. Put simply: a positive correlation can be seen as evidence that the more successful teams are in the CL, the more successful they are in the local league as well. A negative correlation indicates the opposite (the more CL rounds a team gets through, the lower it finishes in the local league).

Looking at the entire dataset I used for the analysis, a positive correlation is observed; that is, at first glance it seems as if a team’s growing success in the CL is reflected positively in the local league too. This is a misleading view, because the full dataset is a mixture of big, experienced clubs (Real Madrid, Arsenal, Man Utd. etc.) and modest, inexperienced teams (BATE Borisov, Ludogorets etc.), and because the big teams attend the CL relatively more often, the overall statistics are dominated by their performance. For instance, it is to be expected that a team like Barcelona, having the squad and the form to reach the CL final, would also perform well in La Liga, but does the same hold for modest teams? Or do teams with little CL experience suffer a drop in performance because their concentration on the local league decreases as they progress through the CL rounds (as is widely debated in the case of Beşiktaş)? To understand this, I performed the correlation analysis after splitting the dataset into subsets based on the teams’ CL experience. I used two separate definitions of experience for this analysis:

1) The number of CL attendances.

2) The highest level reached in the CL.

When the dataset is split along these lines, the correlations show a steady decrease from positive to negative going from experienced to inexperienced teams (the third and fourth figures). This result can be seen as evidence that a team’s capacity to handle the CL and the local league at the same time, and with equal success, is related to its CL experience. The results also show that my two definitions of experience are not independent of each other. The fifth figure compares the highest CL level reached by the teams in the dataset with their average number of CL attendances, and shows that going from last place in the group to the final, the average number of attendances rises from about 1 to 10.2 (in other words, in the current 14-year dataset, the teams that reached the CL final at least once attended the CL 10.2 times on average; they were there in roughly 10 of the 14 years).

In addition to CL experience, I also ran the correlation analysis with the dataset split by country, and the results were similar in nature to those of the experience analysis. The sixth figure shows that the highest positive correlations belong, in order, to England, Spain, Italy, Germany, France and Portugal, i.e., the six big leagues. As league quality decreases, the correlations decrease too, and negative correlations are observed starting with Austria and going down to Kazakhstan in last place.

So, given that Beşiktaş reached the round of 16 in the CL in the 2017-2018 season, can all these results help us predict where it will finish in the Turkish Super League? First of all, success in soccer depends on many parameters, so it is hardly possible to make a prediction based on CL success alone. The low correlations found throughout this analysis (for example, the correlations in Table 1, Figure 3 and Figure 4 peaking at around 0.2-0.3) are an indication of this. Still, to develop a prediction algorithm, I added 3 new parameters to the analysis (the club’s current squad value, the club’s age, and the club’s stadium capacity) and built a machine learning application known as Random Forest. The aim of this algorithm is to determine where a team will finish the season in its local league using the known values of these parameters. I trained the algorithm using roughly two thirds of the full dataset and then had it predict the local league standings in the remaining third. As a result, only 23% of the predictions were correct (slightly better than random). So, despite the extra parameters, it is not possible to determine a soccer team’s local league performance, and this unpredictable nature of soccer is seen as one of the biggest reasons for its popularity.

However, in light of this data analysis, it would not be wrong to say that the fact that Beşiktaş attended the CL only 4 times in the last 14 years, and at best finished third in the group in those attendances, reduces its chances of becoming Turkish Super League champion at the end of this season, in which it reached the round of 16. This is because the teams with 4 or fewer attendances in Figure 3b, and the teams whose best result is third place in the group in Figure 4b, show negative correlations. Taking all of this into account, my personal prediction is that our Beşiktaş will not get beyond the round of 16 in the CL and, unfortunately, will finish the league in second or third place at best.

Preliminary Analysis

The analysis of 14 years of CL competition focuses on the group stage onwards. It consists of two main approaches:

1) A data sample populated by including all final local league standings (except when relegated) of each team for every year during the analysis period. This means that the final league standing of a team is included even when it didn’t make it to the CL group stage in that year. Hereafter, this sample will be referred to as ‘full’.
2) A data sample populated by including the local league standings of each team only for the years in which it reached the CL group stage or beyond. Hereafter, this sample will be referred to as ‘limited’.

With 32 teams in each of the 14 seasons, the limited sample has a size of 14 × 32 = 448. The full sample size goes up to 1426 with relegations excluded. The histograms and the distributions of the final league standings and the CL advance levels of the teams in both the full and the limited samples are given in Figure 1.

Figure 1: The final league standing and the CL advance level histograms for the (a, c) full, and (b, d) limited analysis approaches. The league standing histograms are overlaid with probability density functions estimated using Gaussian kernel density estimation.

The CL advance level is a categorical variable where:

NA: Didn’t Attend
G4: 4th Place in the Group Stage
G3: 3rd Place in the Group Stage
R16: Round of Sixteen
QF: Quarter Final
SF: Semi Final
F: Final

The final league standings exhibit a geometric distribution skewed towards the top of the league table. Of course, these are some pretty strong teams; strong enough to compete in the CL (151 of the 448 topped their league, and 103 came second within the limited sample).

Most of the elements in the full sample are non-attendances (NA), as can be seen in Figure 1c; i.e., there are many teams that attended the CL only once or a few times during the 14 years of analysis. The histogram of CL advance levels for the limited sample (Figure 1d) is a zoomed-in version of Figure 1c without the NA. Figure 1d shows that, among the teams that attended the CL during the analysis period, the number that advanced to the round of 16 is almost equal to the number of 4th-place and of 3rd-place finishes in the group stage. The number of advances to the QF and beyond, of course, decreases steadily. Before looking at some distributions of CL advance level versus the final league standings, here are some facts that came out of this preliminary analysis which I found interesting:

Some Facts from the Preliminary Analysis:

A total of 31 countries have been represented in the CL competition between 2003–2016.

Spain represents the highest number of teams (12), followed by Germany (9), Italy and France (8) and England (7).

A total of 107 teams attended the CL competition between 2003–2016. During this period, 26 of these teams were either relegated from their local league (Monaco is one notable example among the relegated teams) or promoted to their top local competition league.
37 of the 107 teams had the chance of competing in the CL only once between 2003–2016.

Belarus (BATE Borisov), Croatia (Dinamo Zagreb), Norway (Rosenborg), Sweden (Malmo), Hungary (Debrecen), Poland (Legia Warsaw), Serbia/Serbia and Montenegro (Partizan), Slovenia (Maribor) and Kazakhstan (Astana) have been represented by only one team. Among these teams, Legia Warsaw, Debrecen, Partizan, Maribor and Astana attended the competition only once between 2003–2016. BATE Borisov attended five times (i.e., more often than Beşiktaş and Fenerbahce of Turkey, with four attendances each). Although Austria sent two teams (Austria Wien and Rapid Wien) to the competition, each attended only once (2013 and 2005 respectively), making Austria one of the most underrepresented countries in the competition between 2003–2016.

Arsenal and Real Madrid attended all 14 competitions. They are followed by Barcelona, Bayern Munich, Chelsea and Porto with 13 attendances.

The teams which had the worst final league standings (15th or lower) are:

o Celta Vigo, 19th (2003 – CL advance level = R16)

o Villarreal, 18th (2011 – CL advance level = G4)

o Real Sociedad, 15th (2003 – CL advance level = R16)

…La Liga is tough! In fact, the worst official finish is by Juventus in the 2005–06 season in Serie A (20th), while they advanced to the QF in the CL. However, this was due to the penalty they received after a match-fixing scandal was uncovered (they actually finished 1st in the league that year).

The teams who have reached the CL final AND finished the league 1st in the same season are:

o Barcelona (2005, 2008, 2010, 2014)

o Manchester United (2007, 2008, 2010)

o Bayern Munich (2009, 2012)

o Juventus (2014, 2016)

o Internazionale (2009)

o Real Madrid (2016)

o Porto (2003)

o Atletico Madrid (2013)

Back to visuals. Here is the so-called violin plot of the league standings (y-axis) vs. the CL advance levels (x-axis):

Figure 2: The violin plot of the final league standings versus the CL advance levels derived from the full sample.

So, this nice and colourful plot shows a first-order relationship between the two variables, with some statistics on it. The plot gives some idea of what the final league standings of the teams look like if they didn’t attend the CL, or if they did but only advanced as far as the R16, and so on. Each of the coloured shapes in Figure 2 (supposedly looking like a violin, but in this case they really look like a kemençe, so I will refer to them as kemençe plots) shows the distribution of one category. All the category (CL advance level) distributions in Figure 2 resemble the geometric distribution apparent in Figures 1a and 1b, only sideways and symmetric, thus forming the shape of an upright string instrument. The thick black lines in the middle of each kemençe show the interquartile ranges, while the white dots show the median value of the final league standings for each category. We can see that the median final league standing for the teams which didn’t attend the CL (NA) is 4, whereas the value for every advance level short of the final (F) is 2. The mighty finalists usually finish their leagues as champions (thus the median value 1). We also see that the distribution of the finalists bulges towards the lower numbers in Figure 2 (the top of the league table), with a short tail pointing upwards (the shorter the tail, the higher the confidence in the median values and the ranges shown on the kemençe). So, this means that the CL finalists are likely to finish at the top of their leagues, whereas the final standings of the rest (that is, all the NA, G4, G3, R16, QF and SF) can vary between the top and the bottom, with a higher likelihood towards the top.

So, we should be sufficiently confident that a CL finalist will finish their league in 1st or 2nd place, or at least 3rd place. Or should we? So far, both figures I’ve shown demonstrate what we can already guess: the further a team progresses in the CL, the stronger it is likely to be, and thus the more likely it is to finish at the top of its local league.

Well, this doesn’t really agree with my initial motivation of Beşiktaş doing really badly in the local league because of their success in the CL. Before jumping to quick conclusions and claiming that my team sucks in the Turkish Super League because the players (such as Ricardo Quaresma) just don’t care about local league anymore, we should look at the data in more depth and detail. The results so far are only out of the bulk data, so next I will look at some correlations over specific stratifications of both the full and the limited samples.

Analysis of the Correlation between CL Advance Levels and the Final League Standings

Here, I will investigate whether there is any correlation between two variables, namely, CL advance levels and final league standings. If so, is it a positive correlation (i.e., the higher the CL advance level, the higher the final league standing) or a negative one? Before delving into the results, I would like to present some information about the correlation metrics I used in this analysis. First, I assigned monotonic values to each CL level, such that NA=7, G4=6, G3=5…, F=1, to convert this categorical/ordinal data to numerical values. Then I calculated Spearman’s rank correlation coefficient to quantify any correlation between the two variables. This statistic is suitable here, as it is a nonparametric measure of rank correlation and is therefore more reliable for non-normal distributions (as is the case in Figures 1a and 1b).

The conversion of the categorical CL level data to monotonic numerical values has some caveats, though. For example, the difficulty of reaching the final (F, value = 1) is probably exponentially, not linearly, higher than that of, say, the quarter final (QF, value = 3). However, the same can be said of the final league standings, which are already monotonic (1 to 20), thus making it a fair comparison. I played with the conversion, changing it to weighted values etc., to understand whether it makes a difference to the results, and didn’t observe any significant changes. For the sake of completeness, I also added the Pearson correlation statistic, which is a measure of linear correlation between two variables and, as you will see next, both Pearson and Spearman agree on the final correlations.
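For readers who want to see the two statistics spelled out, here is a minimal pure-Python sketch, not the code used for this analysis. The `CL_LEVEL` mapping follows the encoding described above; the function names and the toy data are assumptions made for illustration. Spearman's coefficient is computed as Pearson's coefficient applied to the ranks, with tied values sharing their average rank. Libraries such as SciPy provide `scipy.stats.spearmanr` and `scipy.stats.pearsonr`, which also return p values.

```python
from math import sqrt

# Ordinal encoding described in the text: a lower number is a better result.
CL_LEVEL = {"F": 1, "SF": 2, "QF": 3, "R16": 4, "G3": 5, "G4": 6, "NA": 7}


def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up team-seasons: (CL advance level, final league standing).
toy = [("F", 1), ("SF", 2), ("QF", 1), ("R16", 3), ("G3", 2), ("G4", 5), ("NA", 4)]
levels = [CL_LEVEL[level] for level, _ in toy]
standings = [standing for _, standing in toy]
print(round(spearman(levels, standings), 2), round(pearson(levels, standings), 2))
```

With this encoding, a positive coefficient means that a better CL result (smaller number) goes with a better league finish (smaller number).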

Let’s look at some numbers and try to understand what they mean in relation to the question at hand.

Statistic	Full Sample	Limited Sample
Spearman’s correlation coefficient	0.28	0.08
Spearman’s p value	0.00	0.07
Pearson’s correlation coefficient	0.24	0.12
Pearson’s p value	0.00	0.01

Table 1: Spearman’s and Pearson’s correlation coefficients and their corresponding p values for the full and limited samples.

The correlation coefficient tells us how much correlation exists, and the p value tells us how much confidence we have in the coefficient (the lower the p value, the higher the confidence). In statistics, a p value less than or equal to 0.05 is usually accepted as giving sufficient confidence (the 95% level) in the coefficient. In Table 1, the coefficient is almost zero (0.08/0.12) for the limited sample and a bit higher (0.28/0.24) for the full sample. The p values are very low for the full sample, indicating high confidence, whereas for the limited sample Pearson’s p value (0.01) is below the 0.05 threshold and Spearman’s (0.07) falls just above it. A perfect positive correlation would be 1, which in our case would mean that the higher up a team progresses in the CL, the higher up it finishes in the local league, in a perfectly linear fashion (and vice versa for a perfect negative correlation of -1). In other words, a positive correlation means that good performance in the CL goes together with good performance in the local league (which Beşiktaş is failing to achieve so far). In both the limited and the full samples, we have positive correlations. The fact that the correlation for the full sample is higher can be attributed to the inclusion of the non-attendances to the CL (NA), which comprise a very large portion of the full sample (Figure 1c). This creates a larger contrast between the local league standings and the CL levels. The weakness of the correlations makes sense: after all, the CL advance level is definitely not the only influence on a team’s final local league standing. There are, of course, many other parameters at play, such as a club’s wealth, the difficulty of the league etc. And here, in both samples, we have a mixture of beasts like Real Madrid and Manchester United along with underdogs such as Artmedia, Zilina etc.

If you follow the CL at all, you would know that experience is everything in that competition. It is imperative for a team to have competed in the CL for a while, to understand the atmosphere and get used to what it is like to play against the top teams in the world before being able to even think about advancing towards, say, QF. And advancing even higher is usually reserved for select teams such as Barcelona, Real Madrid, Bayern Munich, Liverpool and a few others which UEFA loves (!!!). But let’s assume for a moment that we are living in a fair world and that there is total meritocracy in the CL. So, next I will investigate how CL experience affects the correlation between the CL advance levels and final league standings. I will look at this problem via two different angles by defining the ‘CL experience’ in two different ways:

1) The number of seasons a team attended the CL competition (Quantity)
2) The highest level a team advanced to during a CL season (Quality)

The first definition (quantity) assumes that the more often a team has been involved in the competition and listened to that CL music before the games, the more used it will be to competing in the CL and the local league at the same time. The second definition (quality) assumes that the higher a team ascends the mountain, the more attuned it will be to the atmosphere of the CL, and the less it will be consumed by its glitter while competing in the local league.

CL Experience: Quantity

Here, I stratified the teams in both the full and the limited samples by the number of their attendances to the CL (i.e., less than or equal to 14 times, 13 times, 12 times and so on, down to 1). Then, I calculated correlation coefficients and their corresponding p values for each stratification. Additionally, I removed the teams which have small standard deviations (less than 1.5) in their league standings, to eliminate teams such as Celtic, Olympiacos, BATE Borisov, Ludogorets Razgrad etc., which consistently finish 1st or 2nd in their local leagues regardless of their performance in the CL and thus contaminate the signal I am looking for in the correlation analysis. The results are in Figure 3.

Figure 3: The Spearman’s (blue lines) and Pearson’s (green lines) correlation coefficients and their corresponding p values (dashed lines) between the CL advance levels and the final league standings with respect to the CL experience: attendance over (a) the full sample and (b) the limited sample.
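The stratification behind Figure 3 can be sketched roughly as follows. This is an illustration rather than the code used for the analysis: the record layout, the function name, the toy data and the use of the population standard deviation (`statistics.pstdev`) are all assumptions. Only the 1.5 threshold and the cumulative "at most k attendances" strata come from the description above.

```python
from collections import defaultdict
from statistics import pstdev


def stratify_by_attendance(records, min_std=1.5, max_seasons=14):
    """Group team-seasons into cumulative strata by CL attendance count.

    records: iterable of (team, cl_level, league_standing) tuples, one per
    team-season; cl_level "NA" marks a season without CL football.
    Returns {k: [records of teams that attended at most k times]}.
    """
    records = list(records)
    standings = defaultdict(list)
    attendances = defaultdict(int)
    for team, level, standing in records:
        standings[team].append(standing)
        if level != "NA":
            attendances[team] += 1

    # Drop teams whose league finishes barely vary (threshold from the text).
    kept = {team for team, s in standings.items() if pstdev(s) >= min_std}

    # Cumulative strata: attended at most k times, for k = max_seasons .. 1.
    strata = {}
    for k in range(max_seasons, 0, -1):
        strata[k] = [r for r in records
                     if r[0] in kept and attendances[r[0]] <= k]
    return strata


# Made-up team-seasons, purely for illustration.
toy = [
    ("A", "F", 1), ("A", "SF", 1), ("A", "QF", 2),    # steady finisher, dropped
    ("B", "G4", 2), ("B", "NA", 8),                   # one attendance
    ("C", "R16", 1), ("C", "G3", 5), ("C", "NA", 9),  # two attendances
]
strata = stratify_by_attendance(toy, max_seasons=3)
print({k: sorted({team for team, _, _ in rows}) for k, rows in strata.items()})
```

A correlation coefficient and its p value would then be computed on the CL levels and league standings within each stratum.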

Finally, the data starts telling us something. The CL experience of the teams is highest at the leftmost side of Figures 3a and 3b and decreases towards the right. Figure 3 shows that, as CL experience diminishes, the correlation between CL advance level and final local league standing diminishes as well! It never goes negative when we look at the full sample (Figure 3a), but it does in the limited sample (Figure 3b), indicating that the less experienced a team is, the more poorly it is likely to perform in its local league as it advances higher in the CL. The p values increase towards the low-experience side in both Figures 3a and 3b, due to the reduction of the sample size as the cumulative experience level goes down. However, the consistent decrease in the correlation coefficients suggests that the analysis is reliable. Besides, the p values start heading down again after experience level 6 in Figure 3b, indicating that even though the sample size is smaller, the negative correlation becomes clearer, and so the confidence in the results increases. One question that pops into my head looking at Figure 3 is: why do the correlation coefficients go all the way down to -0.4 in the limited sample, whereas they never drop below 0.0 in the full sample? This may be attributed to the long streaks of non-attendance by a significant number of teams in the full sample, which introduce a relationship irrelevant to the one of interest (CL advance level vs. final local league standing). The limited sample only contains data from the seasons in which each team attended the CL, thus giving a more accurate view of the correlation I am looking for. My main motivation for including the full sample in my analysis was to see whether there is a clear effect of non-attendance (i.e., whether the teams were performing significantly better in their local leagues when they didn’t attend the CL). However, it seems hard to discern such information from this analysis due to the possible introduction of the irrelevant relationships mentioned above.

CL Experience: Quality

In the case of quality, I looked for the maximum CL level reached by teams and stratified accordingly. For example, if a team attended CL only twice during the 14-year analysis period, but reached the quarter final in one of them, that team goes to the QF bin. The results are in Figure 4.

Figure 4: Same as Figure 3 but with respect to the CL experience: highest advance level.
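The binning by highest advance level can be sketched as below. Again, this is a hedged illustration and not the original code: the record layout and the function name are assumptions, and the bins here are non-cumulative (each team lands in exactly one bin together with all of its seasons), which is one possible reading of the description.

```python
from collections import defaultdict

# Ordinal encoding from the text: a lower number is a better result.
CL_LEVEL = {"F": 1, "SF": 2, "QF": 3, "R16": 4, "G3": 5, "G4": 6, "NA": 7}


def bin_by_highest_level(records):
    """Put every team-season into the bin of its team's best CL level.

    records: list of (team, cl_level, league_standing) tuples.
    Returns {best_level: [records of all teams whose best level is that one]}.
    """
    records = list(records)
    best = {}
    for team, level, _standing in records:
        if team not in best or CL_LEVEL[level] < CL_LEVEL[best[team]]:
            best[team] = level

    bins = defaultdict(list)
    for record in records:
        bins[best[record[0]]].append(record)
    return dict(bins)


# The example from the text: two attendances, one of them a quarter final.
toy = [
    ("X", "G4", 3), ("X", "QF", 2), ("X", "NA", 4),
    ("Y", "G4", 1), ("Y", "G4", 2),
]
bins = bin_by_highest_level(toy)
print({level: sorted({team for team, _, _ in rows}) for level, rows in bins.items()})
```

Team Y illustrates why no correlation can be computed for the G4 bin when only attended seasons are kept: every record in that bin has the same CL level.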

What we see in Figure 4 is quite similar to what we saw in Figure 3, and this time the decrease in the correlation coefficient is more linear and sharper for the limited sample (Figure 4b). The correlation becomes negative below the QF level, suggesting that a team which only reached the round of 16 or lower during the analysis period is likely to struggle in its local league if it advances to higher levels in the CL. You might notice there is no reported value for G4 in the limited sample. That is because there is no variability in the sample for the lowest advance level (it only contains G4), so no correlation can be computed. It is possible, however, to calculate a correlation coefficient for G4 in the full sample (Figure 4a), since it contains two categories, namely G4 and NA. However, having only two categories leads to weak variability, which produces a bump in the p value for G4 in Figure 4a. The p values for the limited sample reach and exceed 0.6 for SF and QF, reducing the reliability of the statistics for these two levels. However, the consistency of the monotonic decrease in the results is convincing enough (for me, at least) that they have value.

As a matter of fact, it is no surprise that the results of the quantity (Figure 3) and the quality (Figure 4) analyses are similar, since these two approaches are not independent of each other. To demonstrate that, I calculated the average CL attendances for each of the CL advance level categories and plotted them (in other words, I plotted the quantity versus quality).

Figure 5: CL advance level vs. the average CL attendance

Indeed, there is a very strong relationship between the experience of a team in the CL and the highest level it advances to. Figure 5 demonstrates that the average attendance for the finalists between 2003–2016 is 10.2, that is, they attended more than 10 of the 14 competitions on average. This number is 1.3 for the teams who didn’t go higher than placing last in the initial group stage (G4). You can see from Figure 5 that you might want your team to be in the CL at least 5 out of 14 times to see the quarter finals or higher. Did you also notice the abrupt jump from around 5.9 to 10.2, the average CL attendances for SF and F respectively? Crunch time is when you need experience the most.
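The averages plotted in Figure 5 amount to a simple group-by: find each team's best CL level, count its attendances, and average the counts per level. A minimal sketch under assumed names and made-up data (not the original code) could look like this:

```python
from collections import defaultdict
from statistics import mean

# Ordinal encoding from the text: a lower number is a better result.
CL_LEVEL = {"F": 1, "SF": 2, "QF": 3, "R16": 4, "G3": 5, "G4": 6, "NA": 7}


def mean_attendance_by_best_level(records):
    """Average number of CL attendances per highest advance level.

    records: iterable of (team, cl_level) pairs, one per team-season;
    seasons marked "NA" are not counted as attendances.
    """
    attended = defaultdict(int)
    best = {}
    for team, level in records:
        if level == "NA":
            continue
        attended[team] += 1
        if team not in best or CL_LEVEL[level] < CL_LEVEL[best[team]]:
            best[team] = level

    # Collect the attendance counts of all teams sharing the same best level.
    groups = defaultdict(list)
    for team, level in best.items():
        groups[level].append(attended[team])
    return {level: mean(counts) for level, counts in groups.items()}


# Made-up team-seasons, purely for illustration.
toy = [
    ("A", "F"), ("A", "SF"), ("A", "QF"),  # three attendances, best: final
    ("B", "F"), ("B", "NA"),               # one attendance, best: final
    ("C", "G4"),                           # one attendance, best: G4
]
averages = mean_attendance_by_best_level(toy)
print(averages)
```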

Correlation Coefficient by Country

I also wanted to see what the correlation coefficients look like if we focus on the data stratified by country. I did that analysis over the full sample to make use of the larger dataset and to see, unlike the analysis over CL experience, if negative correlations exist for some countries even in the full sample.

Figure 6: Spearman’s rank correlation coefficients stratified by country using the full sample. The white numbers on the bars indicate the p value of the statistic for the corresponding country.

And yes, we do see negative correlations for some countries over the full sample, suggesting a strong correlation signal by country. In other words, the strength of the local league has an impact on how a team performs while also competing in the CL. Or should we say popularity instead of strength? Money? Maybe experience instead of strength, in line with the findings of my experience analysis. As you can see in Figure 6, the top six leagues (England, Spain, Italy, Germany, France and Portugal) are conveniently placed ahead of the remaining bunch. They have the highest positive correlation coefficients (in the vicinity of 0.5–0.6), with p values that round to 0.0 (the white numbers on the bars), indicating high confidence in the results. There are also some seemingly unusual countries, such as Ukraine, Greece, Scotland, and Turkey (!!), showing up among the ivy league. But if you look closely, their relatively high correlation coefficients are accompanied by relatively higher p values, reducing the reliability of their placement (other than Greece, which seems to do OK). Overall, as the league quality goes down, the correlation coefficients go down too, all the way to around -0.4 for Kazakhstan.

What happens to Beşiktaş, then?

Going back to the motivation of all this: what will happen to Beşiktaş in the Turkish Super League now that they have advanced to the R16 in the 2017–18 CL season? Can we give a confident answer to this question in light of the above analysis? The answer is, of course, no, because I only looked at one small, possible influence on a soccer team’s success, namely its CL performance. As I mentioned before, a team’s success depends on many other variables, such as the squad worth. In fact, this may be why the correlation coefficients were capped around 0.3 for the experience analysis (Figures 3 and 4b), meaning there is no strong correlation in general between the CL performance and the final league standings. The informative aspect of this analysis is not the correlation coefficient itself, but its change with respect to team experience and the country/league. I think I managed to demonstrate, quantitatively, how important experience is in the CL, especially if you are not in a strong league (i.e., one of the top leagues in Figure 6). The more experienced you are in the CL, the better you know how to compete in both your local league and the CL in the same season. So, although Beşiktaş is from Turkey, a country which seems to rank among the stronger countries (Figure 6), they lack experience, with only 4 attendances in 14 competitions, reaching only as far as G3 during this period. If we were to plug these numbers into Figures 3b and 4b, we would see negative correlations for both attendance and highest advance level. And we are doing great in the CL this year, having reached the R16 already! Does that mean we, as fans, should forget about being the champions of the Turkish Super League this year? Well, of course I refuse to think so and I will refuse no matter what comes out of any kind of analysis. As a proper Beşiktaş fan, I firmly believe that Beşiktaş is the biggest and grandest soccer club on earth (yes, even bigger than Real Madrid or A.C. Milan), but I can still look at some more data and try to get some predictions about the end of the season. To bring more effects/variables into the analysis, I wrote another web scraping algorithm, this time for transfermarkt, to read some characteristic data for each team that are unrelated to the CL.

The additional information I read from the transfermarkt website was: current squad value (as a measure of club wealth), the year the club was founded (as a measure of club tradition), and the number of seats in the stadium (as a measure of spectator support). I added these three variables to the three already existing ones (country, CL advance level, and CL experience as the number of attendances) and formed a 6-variable predictor list to see whether I can predict a team’s final league standing. To do so, I used a machine learning approach called random forest. The curious reader can look at the references to understand the details of these methods, but in summary, what we do with machine learning is train some smart algorithms with historical data (in this case the full sample I have) and use those trained algorithms to predict future states. To quantify whether the algorithm is doing fine, the historical data is separated into a training set (to train the algorithm) and a test set (to test the algorithm). In my case, I used 1000 of the 1426 entries in the full sample for training and the remaining 426 for testing (the statistical details of this analysis are not given here). As a result, my random forest classifier performed badly, with an abysmal prediction score of 23%. This means that, of the 426 entries I used for testing, the algorithm predicted only 23% of the final league standings correctly. This is another indication of how unpredictable soccer results are. The 6 predictors simply aren’t enough to confidently predict a team’s long-term performance. In fact, we usually can’t even predict a match score when we make our pick after half-time! This is just one of the reasons why soccer is the most popular sport in the world: the element of surprise.
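The train/test procedure and the prediction score can be sketched in a few lines of standard-library Python. To be clear about what is assumed: the random forest itself is not reproduced here. `MajorityBaseline` is a deliberately trivial stand-in classifier, the rows are made up, and the function names are hypothetical. Any classifier with `fit` and `predict` methods (for example scikit-learn's `RandomForestClassifier`) could take its place. Only the 1000/426 split and the idea of scoring by the fraction of correct predictions come from the text.

```python
import random
from collections import Counter


def train_test_split(rows, n_train, seed=0):
    """Shuffle the rows reproducibly and cut them into train and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows[:n_train], rows[n_train:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels exactly."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


class MajorityBaseline:
    """Stand-in classifier: always predicts the most common training label."""

    def fit(self, features, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        return [self.label_ for _ in features]


# Made-up rows of (features, final league standing); not real data.
rows = [((i % 7,), 1 if i % 3 else 2) for i in range(1426)]

train, test = train_test_split(rows, n_train=1000)
model = MajorityBaseline().fit([f for f, _ in train], [y for _, y in train])
score = accuracy(model.predict([f for f, _ in test]), [y for _, y in test])
print(len(train), len(test), round(score, 2))
```

Comparing a real model's score with such a baseline is a quick way to judge whether the predictors carry any useful signal at all.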

In summary, although I can’t make a solid prediction of where Beşiktaş will end up in the table at the end of the season, the statistics say that the inexperience of the team and its currently high advance level in the CL will hurt their chances of being the champions this season. My terrible random forest classifier predicted 2nd place for Beşiktaş, provided we don’t beat Bayern Munich in the R16 and advance to the QF. If we do, the classifier, thinking Turkey is one of the strong countries in the competition (Figure 6), predicts us finishing at the top of the league. But again, with a 23% success rate, these are not at all confident predictions. So, we will keep our fingers crossed, hope Beşiktaş will up their performance in the second half of the season, and see what happens.

Thanks for reading (though I seriously doubt anyone made it this far in this post).

Sky Above, Great Wind

UEFA Champions League Success vs. Local League Success
