Exploring Stromboli volcano with unsupervised Machine Learning

The 2019 explosions of Stromboli

Visiting an active volcano is a breathtaking experience and each year thousands of tourists around the world are seeking out the thrill to experience the power of the restless earth. One of the most visited active volcanoes in the world is Stromboli, a volcano known as the “Lighthouse of the Mediterranean” on a small island of the same name just off the coast of Sicily. Its persistent explosive Strombolian activity consists of several hundred of small eruptions per day. Visiting this fiery mountain (that allegedly J. R. R. Tolkien identified with his fictional volcano Mount Doom in the Lord of the Rings) comes at a certain risk though – on very rare occasions it produces major explosions, when as a tourist you might want to be as far away as possible from this island. To protect the tourists and the inhabitants of the island, the Italian National Institute for Geology and Volcanology (INGV) has installed a network that measures the seismic activity of Stromboli volcano and constantly monitors various parameters that characterize its state. Every morning the INGV researchers have a briefing with the local government and if the volcano shows anomalous behavior they can suspend the possibility to access the top of the volcano.

Figure 1: The left panel shows a typical Strombolian explosion that occurs several times per day. The right panel displays a snapshot of the major explosion on July 3.

Despite all of these measures, Stromboli erupted on the 3rd of July 2019 without showing any anomaly in the routinely monitored parameters. The explosion caused a pyroclastic flow (a fast-moving current of hot gas and volcanic matter) that extended for several hundred meters into the sea. Unfortunately, the event also claimed one fatality and could have been much worse and lead to a catastrophe if it had occurred a few hours later in the evening when many people climb the volcano in guided groups. On the 28th of August 2019 a second paroxysm occurred similar to that of July 3 with another pyroclastic flow, luckily this time without any fatalities.

The quest for predictive parameters

Since the routinely monitored seismic parameters did not predict this major eruption, the researchers of the INGV are now analyzing the collected seismic data in order to understand whether other seismic parameters showed any anomalies in the months prior to the 2019 eruptions. Ideally one wants to find new parameters that potentially can predict major eruptions in the future to make the experience of visiting Stromboli safer for future tourists and inhabitants. A major challenge is hereby that the 2019 explosions are the first measurements of such strength at Stromboli – the previous one occurred in 2002 when the network was not installed yet. Therefore at the moment it is impossible to quantify whether possible anomalies in the seismic parameters are indeed able to predict a major explosion – only the future will tell.

Within the context of my secondment at the INGV in Naples I had the possibility to work with the INGV seismologists to explore new seismic parameters by making use of unsupervised Machine Learning. In the past it had been conjectured that there is a connection between the waveforms of the seismic signals that are associated with the daily occurring Strombolian explosions (see left panel of Figure 1) and the physical state of the volcano, e.g. the gas mixture. Therefore we carried out a cluster analysis of the waveforms of the seismic signals to study how different clusters of waveforms behave in the time period of the 2019 paroxysms.

Visualization of seismic signals

In the period from May 2019 to September 2019 the INGV recorded around 20.000 seismic events that stem from the daily Strombolian explosions of the volcano. An example for a single signal is shown in Figure 2 below. The seismic signals that are associated with the explosions can be interpreted as time-series. In order to clean the signals, they are preprocessed by applying a band-pass filter that only allows frequencies between 0.05 – 0.5 Hz and the signals are downsampled to reduce the dimensionality of the time-series.

Figure 2: The left panel shows a seismic signal of a Strombolian explosion; the top panel shows the raw signal, the middle panel shows the same signal after applying a bandpass filter and the bottom panel shows the downsampled signal. The right panel shows the output of the t-SNE algorithm.

In data science several algorithms to visualize high dimensional data are available. In our case we use the t-SNE algorithm that allows to embed high dimensional data in a low dimensional space. Specifically, it models each high-dimensional object, in our case a high dimensional signal, by a two dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. As such, t-SNE can be used to visualize clusters in the high-dimensional data. Running this algorithm with our dataset of seismic signals produces a 2D distribution that is shown in the right panel of Figure 2 above. Each of the displayed points corresponds to a high dimensional signal and it becomes visible that there are structures in the data that we can exploit with a clustering algorithm.

Clustering of seismic signals

Clustering defines structures by separating unlabeled datasets into homogeneous groups. For our analysis we use the K-Means algorithm which clusters data by trying to separate samples groups of equal variance, minimizing a criterion called within-cluster sum-of-squares. One of the challenges using K-Means is that the number of clusters has to be specified beforehand. Since we are working with unlabeled data, we don’t know the optimal number of clusters and there is no ground truth that could help us to assess the performance of our clustering. Thus the performance has to be evaluated using the model itself. One way to do so is to use Silhouette scores, that measure the separation distance between the resulting clusters. Higher scores relate to a model with better defined clusters. The Silhouette coefficient is calculated using the mean distance between a sample and all other points in the same class (a) and the distance between a sample and the nearest cluster that the sample is not a part of (b). The Silhouette coefficient for a single sample is then calculated as: (b - a) / max(a, b).

Figure 3: The top panel displays the Silhouette coefficients (left) and t-SNE projection (right) for K-Means with n=2 clusters. The bottom panel shows the Silhouette coefficients (left) and t-SNE projection (right) for K-Means with n=4 clusters.

Silhouette coefficients near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters and negative values indicate that those samples might have been assigned to the wrong cluster.

In Figure 3 the silhouette coefficients and the t-SNE projection of the resulting clusters is shown for the K-Means algorithm with n=2 and n=4 clusters for the seismic signals associated with Strombolian explosions. The red line indicates the average silhouette score of all samples. In our analysis we found that the highest average silhouette score is obtained with n=2 clusters which indicates that the seismic signals form two main clusters. In Figure 4 the centroids of the two main clusters are shown. As expected, the waveforms of the two cluster centroids are very different. The example signal shown in Figure 2 can be clearly associated with Cluster 1.

Figure 4: The left panel shows the centroids of the two main clusters obtained with the K-Means algorithm. The right panel shows the time evolution of the two main clusters.

Since we have access to the time stamps of the signals we can display the time evolution of the signals in the two main clusters. The resulting distribution from May 2019 to September 2019 is shown in the right panel of Figure 4 and the 2019 explosions are marked with red lines. We observe a strong asymmetry between the two clusters in this time period.


The researchers of the INGV conjecture that there is a connection between the waveforms of the signals associated to the Strombolian explosions and the physical state of the volcano. The observed asymmetry in the time evolution of the waveform clusters might therefore indicate large changes of the physical state of the volcano in the eruption period. The ratio between the two main waveform clusters might be an important variable to monitor the state of the volcano. In principle this variable can be monitored in real time by automatically assigning selected signals to one of the two clusters. The next important steps that the INGV researchers will take is analyzing the waveforms of the seismic signals for larger time scales and at the same time analyze controlled experiments set up in a laboratory to better understand the connection between the physical state of volcanic systems and the resulting seismic signals.

Moreover, given that most of the routinely monitored seismic parameters showed no anomaly prior to the major eruptions in 2019, the ratio of the two main waveform clusters may be potentially important to predict future explosions. However, as data analysts we are in a dilemma – on the one hand we need more data to understand which variables are correlated with major explosions of Stromboli – on the other hand we don’t want to hope for further major explosions that endanger the life of tourists and inhabitants. All that we can do is wait what the nature will bring us while trying our best to make use of the limited data we have to make visiting Stromboli a safe experience.

Walking, talking, and quacking like a Higgs boson

The ATLAS Collaboration at Europe’s Particle Physics Lab CERN, have reported a study of the Higgs boson, the elementary particle discovered at CERN’s Large Hadron Collider (LHC) in 2012. The results were presented at the biennial International Conference of High Energy Physics (ICHEP), hosted this year by the University of Prague but held entirely online because of the Covid-19 crisis. They have found that the “strength” with which the Higgs interacts with other particles agrees extremely well with the predictions of our best theory, the so-called Standard Model of Particle Physics.

Figure 1:  A depiction of a proton-proton collision in the ATLAS Experiment at CERN’s Large Hadron Collider resulting in production of a Higgs particle and a Z boson.  The Higgs boson decays into two other Z particles; one of the Zs decays into a pair of muons indicated in red and one of the Z bosons into an electron-positron pair shown in green (figure from the ATLAS Collaboration).

In 2012 physicists at CERN discovered a new particle with properties consistent with those predicted for the Higgs boson. In particular one could measure how often the Higgs boson would disintegrate or “decay” into other types of particles. But these and other properties were measured with limited precision, and as the world of elementary particles is large and complex, one could still question whether the new particle was really the Higgs. Over the last 8 years, studies by the ATLAS Collaboration as well as the competitor experiment called CMS have continued to reduce any doubt about whether the new particle is in fact the Higgs boson.  

The latest results from ATLAS for the coupling strengths are shown in Fig. 2 below. In both the upper and lower plots, the data points show on the vertical axis a quantity related to the “coupling strength” of the Higgs to other known elementary particles, namely, the muon (μ), tau lepton (τ), b-quark, W/Z bosons and the top-quark (t), while the horizontal axis indicates the mass of those particles. The dashed line shows the relation between these quantities predicted by the Standard Model. The data points are seen to agree very well indeed, with remaining small discrepancies consistent with estimated measurement uncertainties as indicated by the vertical bars on the points.

Figure 2:  Measurements of the coupling strength of the Higgs to other particles versus the mass of the particle with which it interacts.  The dashed lines indicate theoretical predictions of the Standard Model (figure from the ATLAS Collaboration).

Is the level of agreement enough to prove that the particle we’ve found is the Higgs?  Technically no, there is always some small room for doubt. But one should keep in mind that if the particle were to have some other non-Higgs explanation, then there would be no reason to expect anything like the pattern found in the figure above. So if it walks, talks and quacks like a Higgs, then we can regard this crucial part of the Standard Model to be well confirmed.

To scrutinise subtle signs of deviations of data taken at the LHC with respect to the Standard Model predictions, the Higgs measurements reported in this study can be reconciled in the framework of an Effective Field Theory. The framework helps in understanding how the signatures of new phenomena manifest in our detector even when the new phenomena occurs at distances even smaller than those directly probed by the LHC. This is an active area of work ongoing in the ATLAS Collaboration, stay tuned for updates in the future.

The ATLAS Collaboration is an international consortium of 181 universities, including 5 ESRs from INSIGHTS. Rahul Balasubramanian and supervisor Wouter Verkerke contributed to this analysis. INSIGHTS Scientific Coordinator Glen Cowan (Royal Holloway, University of London), also played a role in the analysis through development of the statistical methods used as well as Chair of its Editorial Board.

Infections and identified cases of COVID-19 from random testing data – Allen Caldwell, Max Planck Institute for Physics, Munich

Abstract: There are many hard-to-reconcile numbers circulating concerning Covid-19. Using reports from random testing, the fatality ratio per infection is evaluated and used to extract further information on the actual fraction of infections and the success of their identification for different countries.

The PDF can be downloaded here.

The value that is keeping Italy in lockdown

Yesterday the Italian government released the analysis that motivated the very mild release of the lockdown. This is an impressive analysis and essentially takes into account all the points I raised last week; the population is divided into age groups with different susceptibility and transmission rate, it is done at the regional level and takes into account an incredible amount of information. Really fantastic work.

However, all the predictions rely on one crucial data: the CFR in Lombardia which is 0.657%. This is incredibly pessimistic and is caused by the fact that the Lombardia health system was overwhelmed. The same value for the rest of Italy would be much lower, about half. In Sweden, the CFR is 0.3% and in other countries is even lower. If a value of 0.35% had been used, the estimate of critical cases would have also been (more or less) halved and the rate of spread would have also been reduced considering that twice as many people would have been categorised as immune.

So why was the CFR of Lombardia used for the whole of Italy? The model is run on each regional independently, so why not use the regional CFR too? I am sure the results would have been significantly different and the Italians would have been less upset.


Not an exponential curve fit: a data analysis exercise on COVID-19 Italian data trying to estimate the number of infected people

Update at 15:30: The Swedish study has just been withdrawn, so IP3 is no longer valid. The two other models are not affected.  This shows how difficult is to make an analysis of live data.

This post can be downloaded as PDF.

How many people are really infected? Nobody knows. The number of asymptomatic people is high, but nobody has a definitive number. Mostly because any estimate would only be accurate on a small region since the number of tests carried out differ significantly from place to place.  Knowing the number of infected is useful for society as many cured people would allow to re-open Europe, at least partially. Since our network is aiming to help society using particle physics tools, I will try to give an estimate based on some of the data available from Italy. Will my estimate be accurate? Probably not, but even if they only help someone to better understand the complex nature of analysing COVID data, I think it is worth sharing them.

Let me start by saying that this is not a quantitative analysis, I do not have the background to do that. I will make some assumptions which should be correct enough for a qualitative analysis and give an approximate result which should be confirmed by experts.

Review of existing estimate

Since I started developing this method, two estimates were produced in Italy. One is described in a preprint, the second is a survey from Doxa. The former uses the Case Fatality Ratio (CFR), defined as the probability to die if infected, in small countries (0.2%), on the Diamond Princess (1%) or in Italian hospital (10%). I find these numbers not very precise as the early cases in small countries like Island, Luxembourg was imported from people travelling. This is usually a relatively young population which we know has a lower fatality rate than the average, so I find this a bit optimistic. The estimate using the cruise ship Diamond Princess is based on the 7 deaths out of more than 700 infected, so the statistical uncertainty is large. Furthermore, the population on a cruise ship probably does not reproduce the demographic of a country. Getting a good age distribution is crucial given the much higher mortality for the older age groups; this is the problem of using the hospital population which is significantly skewed toward the older population. The Doxa analysis is based on 1000 people spread across Italy and counting those describing symptoms associated with the virus. There are several assumptions, but the main problem is that 1000 people is not enough to sample all regions and age groups accurately. In general, all methods try to extrapolate to the whole Italian population which, in my opinion, the biggest problem as the diffusion depends strongly on geography. About half of all Italian cases are in Lombardia so a more granular approach is needed.

The data

Moving to the data, the most granular information available is in the ISS (Istituto Superiore della Sanita’) appendice con dettaglio regionale (appendix with region-by-region information). This appendix is published every week and the latest available data is from the 16th of April. The number of infected is provided for each province but the demographic curves of the infected and the number of deaths are only presented for the regions. It would be great if ISS could share this additional data so that the predictions could be improved.

I also want to make a consideration on the quality of this data. A lot of people on the internet question the validity and the usefulness of all these numbers. As in all data analysis, there is no wrong data, at most, there is bad data and it the job of the analyser to understand it. In general, domain expertise would allow doing this correctly, but in this case, the confusion is generated by the attempt to extrapolate or aggregate to national level what should be analysed as a localised problem. This is what I will try in my analysis.

 The analysis

The figure that intrigued me and led me to start this analysis is the demographic curve of infected people in Italy (an example is provided here). This is very different from the demographic curve of the whole Italian population (link). If we could find a segment of the population where most of the infected are identified, we could get a first approximation of the all infected population. I think such a group exists and are the males over 80. I base this assumption on the fact that 40% of the infected male over 80 died because of the virus. Given such high mortality, it is unlikely many will be asymptomatic like most under 40. Since most infected people in this group will be very symptomatic, most of them should be tested and identified. There are several problems with this assumption, for example, people may die before being tested and a small fraction may have very mild or no symptoms at all. Hence, any result based on this assumption is a lower limit. The rate of asymptomatic in this age group could be easily measured by a dedicated study monitoring this group in a region or province; since this is a constant number it could then be applied to the whole country. Indeed, this information could be extracted by the available tests carried out in hospices. I selected male because an important factor in this pandemic is the sex of the infected; in the over 80 population, a similar number of male and female is infected although the population is dominated by women. I do not think there is a substantial difference in the lifestyle of the two sexes, so both should be infected at the same rate. My assumption here is that women simply get milder symptoms and therefore are less tested but will have a similar infection rate. This is supported by the fact that the mortality rate for women is only 20%; it looks like women are more resistant, so could be also more asymptomatic. Since for people over 90, the women-man ratio in the population and in the infected is significantly different from the 80-90, I will focus on the 80-90 population only (all demographic curves use 10 years ranges, so this simplify some of the data extraction too). The first hypothesis is that at least as many people in each region are infected as the 80-90 age group; this hypothesis is denoted IP1 from now on. This assumption is supported by a recent study done on the completely locked-down town of Vo, in Veneto. The 80-90 age group had an infection rate comparable to the whole population; it must also be noted that only 5 cases were identified, so there is large statistical uncertainty in this confirming result. One side note about this study; it showed that children do not get the infection and are not super-spreader like in the normal influenza, so the closing school did not make a big impact and should be among the first measured to be lifted as already done in Denmark, Norway and Germany.

In normal conditions, people in their 80s are not the most socially active group around. So, I would expect a significantly higher transmission and infection rate in younger age groups. As a physicist, what I find interesting about this estimate, is that it is easy to correct because it does not require to know the absolute infection rate of the virus, but the relative infection rate between age groups! This could be derived from studies like the one cited above done on the population of Vo. In particle physics, we measure a lot of ratios because we get a lot of effects that cancel out. I had a look at publications in this area, which would be based in non-lockdown conditions, and I could only find this paper. In summary, people below 50 are twice more likely to get infected than 65+. This factor two may not be the most appropriate for a lockdown scenario, it may be smaller (if everyone is really isolated at home as was the case of the study on Vo mentioned above where the factor was 1) or higher if a significant fraction of the younger working population continue to interact (50% of companies are still open as deemed essential) while the older population practice tight isolation. It is possible to get a rough estimate this factor from the Vo study as they measured the rate of infection at the beginning of the lockdown (so those cases would have been caused by normal circulation of people) and after two weeks of total lockdown. Unfortunately, the new cases were so few that any extrapolation based on the ratio between age group would be meaningless (error bigger than 100%). Comparing the age-inclusive rate has a smaller (but still large) statistical uncertainty; 0.3% during lockdown compared to 2.6% before. Taking the mid-point for the working group would give 1.5%. The ratio between locked down older generation over the working population would be 5 to 1, higher than the factor two mentioned before. When extrapolating to the total population this value must be reduced as the school-age population is in lockdown and those working would probably take measure to reduce infection. Combining all these effects a rough estimate could be that the whole population is only twice as infected as the old; I welcome ideas to improve this estimate as I know it is not great. I will denote this as IP2 from now on.

Let me stress that my estimates do not use the mortality rate at all and can rely on the bigger numbers of identified infected people to extrapolate to the actual number of infected, so statistical errors are smaller and the estimate more accurate. The results are summarised in Table 1 [only extrapolated results are shown in this post, more columns are shown on the tables in the downloadable Study on Italian data on Covid19- MFG – 22-04PDF]. I decided to include some of the intermediated steps for completeness. The only man-women division based on demographic that I could find was on a national scale (here), indicating that 49.4% of the cases in the 80-90 group were men. I have no reason to suspect that this ratio should be significantly different in any Italian region, so I will assume that this is true in all regions. This is data available to the ISS, so it would be great if they could provide it to make the prediction more precise. The column “Fraction of infected All [%]” is the fraction of infected in each region; the large differences are caused by the different level of spread but also by the significant difference in the number tests performed by each region. Many people complain that COVID data does not make sense citing these numbers; actually, the data is correct but must be understood, for example by avoiding such easy (but wrong) comparisons or by aggregating nonhomogeneous data. By analysing each region independently, I avoid this problem and I find that data make more sense. I provide the summary for Italy and Italy without Lombardia only to stress this point. The latter is also useful to show that Italy is doing much better, especially in the number of deaths per inhabitants, is if the worse affected region is removed from the national calculation.

Table 1 Summary of data and estimates for all Italian regions

Region Infected Male 80-90 Fraction infected Male 80-90 [%] (IP1) Fraction of infected All [%] Ratio
IP2 [%]
Lombardia 4692 2.08 0.54 3.81 4.15
Lazio 254 0.20 0.07 2.75 0.40
Campania 121 0.12 0.05 2.40 0.25
Sicilia 89 0.09 0.04 2.40 0.17
Veneto 932 0.85 0.26 3.22 1.70
Emilia-Romagna 1444 1.24 0.41 3.04 2.48
Piemonte 1091 0.92 0.28 3.24 1.84
Puglia 174 0.20 0.07 2.95 0.40
Toscana 368 0.36 0.14 2.59 0.73
Calabria 35 0.08 0.04 1.98 0.16
Sardegna 74 0.20 0.06 3.67 0.40
Liguria 278 0.58 0.19 3.11 1.16
Marche 405 0.96 0.31 3.09 1.92
Abruzzo 93 0.28 0.14 1.96 0.56
Friuli Venezia Giulia 125 0.41 0.16 2.48 0.81
Trentino-Alto Adige 365 1.67 0.48 3.46 3.35
Umbria 44 0.18 0.14 1.33 0.36
Basilicata 5 0.03 0.03 1.29 0.07
Molise 13 0.16 0.08 2.15 0.33
Valle d’Aosta 77 2.61 0.70 3.75 5.22
Italy 10681 0.77 0.23 3.39 1.53
Italy w/out Lombardy 5989 0.51 0.16 3.17 1.02

To further prove the need for a province-level granularity of the data, I analysed the two provinces of Trentino Alto Adige since the ISS provide all the details for both. The results are shown in Table 2; Trento, a province closer to the epicentre of the pandemic, has twice as many cases as Bolzano. Using the region data only show the average between the two that is not an accurate description of the spread of the virus.

Table 2 Comparison of provinces in Trentino-Alto Adige

Region Infected Male 80-90 Fraction infected Male 80-90 [%] (IP1) Fraction of infected All [%] Ratio
IP2 [%]
Trentino-Alto Adige 365 1.67 0.48 3.46 3.35
Bolzano 133 1.27 0.35 3.61 2.54
Trento 232 2.05 0.61 3.33 4.10

Study on CFR, a digression

The CFR has been another source of confusion that led people to claim that the available data is wrong or useless. Again, the data is correct and simply should not be compared because of the definition of CFR includes a value that it not comparable between regions. Let me go back to the definition; CFR is the fraction of deaths over infected. While the numerator (the number of deaths) is a well-defined number which is common to all regions, the denominator (the number of infected) suffers from the different number of tests per inhabitants performed in the various regions. For example, Veneto has performed as many tests as Lombardia but has a population that is less than half. This resulted in more people being identified as positive, particularly in lower age groups. This resulted in Veneto having an average age in the infected of 58 while Lombardia average age for the infected in 65.

This problem could be avoided if the demographic of the deceased could be provided as it is done for the infected. With this breakdown, it would be possible to use the 80-90 age control group to compare the CFR between regions; again, I would assume that all regions are identifying all (or at least the majority) of the infected in this age group.

The 80-90 age group can also provide an evaluation of the performance of the regionalised health system in Italy by counting the fraction of deaths in this group. The guidelines to treat patients in case of shortage of ICS places state that priority should be given to the healthier and younger patients that have a higher success rate in surviving. Therefore, a significantly higher number of deaths my control group would be evidence of doctors having to make difficult decisions due to lack of resources.

Unfortunately, the data to carry out this study is not available as only the national breakdown of death in age groups is provided. What can be done is to divide the number of deaths in this age group proportionately to the deaths in each region. This approximation would hide some of the excesses that the study aims to find as the deaths are equally distributed, so any excess found will be underestimated. I already calculated the number of infected in the 80-90 age group which are used as the denominator for the calculation of the fatality. The results for all regions are shown in Table 3. The CFR in Lombardia is the highest of all Italy and is almost twice the average in the rest of Italy. It is also interesting to notice that Liguria has a very high CFR too, something that has not been picked by the media (probably due to the bigger numbers in Lombardia). Emilia-Romagna, the second most affected region also show a higher rate than the average, it would be interesting to have the breakdown by provinces to see if the rate would get worse in the provinces closer to Lombardia which are more affected by the virus. These numbers do not definitively prove that medics had to make difficult choices but hint in this direction. A definitive answer could be given if ISS would provide the additional information mentioned above (the breakdown of deaths by age in each region and provinces).

 Table 3 Estimated number of deaths and CFR for the 80-90 age group in all regions

Region Dead 80-90 CFR 80-90 [%]
Lombardia 4594 39.19
Lazio 105 15.42
Campania 59 19.90
Sicilia 58 25.16
Veneto 396 17.60
Emilia-Romagna 1120 32.40
Piemonte 615 20.31
Puglia 119 25.79
Toscana 151 15.86
Calabria 21 27.08
Sardegna 33 16.80
Liguria 214 34.56
Marche 172 17.23
Abruzzo 15 6.55
Friuli Venezia Giulia 84 29.81
Trentino-Alto Adige 220 23.95
Umbria 23 25.44
Basilicata 8 77.05
Molise 6 22.27
Valle d’Aosta 55 31.59
Italy 8070 30.22
Italy without Lombardy 3476 23.20

A hope from the north of Europe?

The fraction of infected evaluated with IP1 and IP2 are rather low and, even in the most affected regions, the numbers are far from those needed for herd immunity. However, these are rather conservative values and the infection may be more widespread.

Some hope is coming from Sweden than just announced that some regions of the country already show sign of herd immunity and should reach the required levels of infected (more than 60%?) in about a month. While Sweden as been described as a nut case by the media by not imposing a lockdown, it does not have any count (fraction of infections or deaths) as high as Italy. So, can Italy, or at least north Italy be already close to herd immunity?

In the week of the 2nd of April, they tested 773 people in Stockholm and 2.5% resulted infected. Extrapolating to the 9th of April, they estimate that (7.5±2.5)% of the population in the capital region were infected. Some detail can be found here. Translating these numbers to the Italian data is not easy; it is not possible to use the number of infected people (the CFR) as the testing strategy in Sweden is different from Italy. The only possible common factor is the fraction of deaths over the whole population. There are two problems in using this fraction, one is the different demographics of the two countries, the second is the fact that in Sweden all Covid deaths are considered while in Italy only those in hospitals are considered. The former difference can be corrected by comparing the fraction of over-70, the largest fraction of deaths being above this age. In Italy 17.2% of the population is over-70 while only 14.8% of Swedes are in the same age group. Therefore, we could expect a 16% (17.2/14.8=1.16) higher mortality in Italy simply due to demographics. A similar comparison done by experts can be found here. It is difficult to set a correction for the different way of counting the deaths, the number of deaths in Italy should be higher than those reported but we could assume that in a non-overwhelmed region, all people with acute symptoms were transported in hospitals and therefore most deaths were counted.

Ok, moving to the math. Stockholm has a population of about 2.3 million and on the 9th had 486 deaths giving a fatality ratio (FR) of 0.02%. Please note that in this case the fatality is calculated over the whole population as this is a measure of the progress of the infection; this is different from the CFR where the denominator is the infected people. Actually, the two value are identical once the whole population is infected; at the current pace, Stockholm should have a CFR of 0.3, slightly higher than the lowest estimate reported in the review section but significantly lower than the Diamond Princes data suggested. The FR is then increased by 16% to be compared to the Italian regions fatality rates as described above. It is also crucial to say that Stockholm was not overwhelmed by the pandemic (a 400 places field hospital is still unused), so this number can only be compared to regions that are not overwhelmed, i.e. I will not show Lombardia. The results of this study are presented in Table 4. The extrapolation of the Swedish model is denoted as IP3 and both min and max values derived from the original statistical error are shown for completeness.

Table 4 Current fatality rate over the whole population and predicted fraction of infected based on the Swedish study

Region Deaths / population [%] IP3 min [%] IP3 [%] IP3 max [%]
Lombardia 0.1313 NA NA NA
Lazio 0.0044 0.9 1.4 1.9
Campania 0.0025 0.5 0.8 1.1
Sicilia 0.0029 0.6 0.9 1.2
Veneto 0.0200 4.2 6.3 8.4
Emilia-Romagna 0.0622 13.1 19.7 26.2
Piemonte 0.0350 7.4 11.1 14.7
Puglia 0.0073 1.5 2.3 3.1
Toscana 0.0101 2.1 3.2 4.2
Calabria 0.0026 0.6 0.8 1.1
Sardegna 0.0050 1.1 1.6 2.1
Liguria 0.0342 7.2 10.8 14.4
Marche 0.0280 5.9 8.9 11.8
Abruzzo 0.0028 0.6 0.9 1.2
Friuli Venezia Giulia 0.0172 3.6 5.4 7.3
Trentino-Alto Adige 0.0509 10.7 16.1 21.5
Umbria 0.0066 1.4 2.1 2.8
Basilicata 0.0037 0.8 1.2 1.6
Molise 0.0052 1.1 1.7 2.2
Valle d’Aosta 0.1090 23.0 34.5 46.0
Italy 0.0331 7.0 10.5 14.0
Italy without Lombardy 0.0171 3.6 5.4 7.2

The hope is that, if these numbers are confirmed, the northern regions are approaching infection rates that will allow a natural reduction of the spread of the virus and a return to a life closer to the pre-pandemic period. It must also be stressed that all regions in centre-south of Italy have a very low infection rate even in this model. So, a prudent and conservative approach is really needed to avoid a second wave in these regions. Italy really looks divided in two by these numbers and any policy should reflect them, to maximise the benefits in restoring personal freedoms in the north and to protect the fragile health system in the south.

Most affected provinces

While all the media was focussed on Bergamo due to the highest absolute counts of infected and deaths, the province that has the highest identified number of infected is Cremona followed by Lodi and Piacenza. To provide the estimates in these provinces I will need to make some additional assumption as some data is not available. For example, I will use the region demographic curve to calculate the number of 80-90 people affected in the province. The national male-female ratio is also used for the provinces. A difference in the infected demographic may be caused by different testing policies in different areas; overwhelmed areas may only test the severe cases while less affected areas may still test a larger spectrum of the population. Since older people are more affected, a bias in the tests based on the severity of the symptoms is also a bias in the demographic. Different provinces may also have different testing capabilities, resulting in a different sampling of the population. All these biases are likely limited by two factors, the regional based sanity system is likely to have provided similar resources per capita and that most cases are from the recent days when the whole region was under similar stress. The three models are presented in Table 5.

The conservative approaches described in IP1 and IP2 show that these provinces should already have a non-negligible fraction of the population infected, probably enough to already be having an impact on the transmission rate (as observed in Stockholm).

Since these provinces may have been overwhelmed, I used the national average (w/out Lombardy) and multiplied it by the ratio of IP1 between the province and Italy (w/out Lombardy) to scale the number of deaths for IP3. The results are shown in Table 5. According to this estimate, (49±16)% of people have been infected in the province of Cremona. This is very close to the levels required for heard immunity but there are many assumptions to reach these values which would need further scrutiny and more granular data to be sure they are correct.

Table 5 Estimates for the most affected provinces

Province Infected 80-90 Infected M 80-90 IP1 [%] IP2 [%] IP3 min [%] IP3 [%] IP3 max [%]
Cremona 980 484 5.74 11.5 32.8 49.1 65.5
Lodi 487 240 5.31 10.6 30.3 45.5 60.6
Piacenza 512 253 3.29 6.6 18.8 28.2 37.6


I presented a way to analyse the available data on a regional base which better describe the fragmented Italian health system. Using the assumption that could be easily be verified by ISS which has more data, I provided 3 estimates of the number of infected in Italy as of the 16th of April. If more data could be made available, more precise prediction could be provided. Two predictions are rather conservative so can probably be considered lower limits. A recent study in Sweden provided a new way to estimate the infected that, if confirmed, would put the most affected provinces in Italy close to the levels required for herd immunity, or at least high enough to provide a significant rate reduction once the lockdown measures will be lifted. The Swedish are conducting a new study with higher statistics that will provide more accurate data and each Italian region should follow the same example as these tests are much cheaper than the mass tests that many regions are planning.

Let us hope that the last figures are correct and that we will be able to return to normal life soon.

Michele Faucci Giannelli

PS: Thanks to all people that provided feedback during the preparation of this post.