Visiting the Dutchies: A Statistical Modeling Workshop

The modern scientific method has its origins in the 17th century and has been developing ever since. Even though procedures vary from one science to another, a crucial part of all of them is the comparison of experimental data with theoretical predictions. To draw any conclusion and solve physical problems based on observation and theory, one needs to develop a statistical model that connects the two. Therefore one of our supervisors, professor Wouter Verkerke, who happens to be an expert on the matter, gave a 3-day workshop for both INSIGHTS and local PhD students at the Dutch National Institute for Subatomic Physics (Nikhef) in Amsterdam.

As a Dutchman, I found it a welcome excuse to travel back to my country and revisit the institute where I used to work before my PhD. Most of the ESRs arrived on the evening before the workshop. Because it had been a while since we last saw each other, we used the opportunity to catch up and exchange stories about our first few months. And to give the other ESRs a taste of Dutch culture, we did so whilst enjoying some drinks and “stamppot met rookworst” in the center of Amsterdam. I thought it would ease them in instead of throwing them in at the deep end with the raw herring and onions tradition. Maybe next time!

The next day we started our workshop, which had a clear-cut structure and a good build-up in complexity and detail. In the morning Wouter gave us lectures on the theory, and in the afternoon we got to apply the concepts with a set of exercises in RooFit, one of the most used statistical modeling software packages at CERN and the brainchild of David Kirkby and Wouter himself. Throughout the workshop we covered everything from basic concepts such as typical probability density models, p-values and likelihood ratios to more advanced topics such as the incorporation of nuisance parameters, unfolding and Effective Lagrangian Morphing.

The workshop closed with a talk from former Nikhef PhD student Max Baak, currently chief data scientist at KPMG. Because many PhD students and post-docs continue in industry or business, Max was invited to talk about his experience there. He told us how he applies the knowledge he acquired in academia and used some of his recent business cases as examples. Good to see what some of the non-academic possibilities are!

Kudos to Wouter Verkerke for giving us such a complete and clear picture of statistical modeling in particle physics including hands-on experience in RooFit. It was a great workshop and hopefully we can come back soon for a follow-up!

 


Meet the ESRs: Pim Verschuuren

Hi everybody!

My name is Pim Verschuuren and I am the ESR at Royal Holloway, University of London.

I am originally from the Netherlands, where I obtained my bachelor’s and master’s degrees in Physics at Utrecht University. Utrecht is the fourth biggest city of the Netherlands and has a history that traces back to the Romans, who laid the first foundations for the “Domstad”. Nowadays, however, Utrecht is a modern and progressive city where culture and science play an increasingly important role, with a vibrant student life for both natives and internationals. But apart from offering all these things that I enjoyed whilst living there, it is also the place where I developed my interest in particle physics.

I found my passion for particle physics as early as my bachelor’s. I therefore tried to immerse myself as much as possible in courses and research projects within this field and quickly came into contact with the organization that is the nexus of particle physics: CERN. Over the past few decades this combined effort of thousands of technicians, engineers and physicists from all over the world has proven to be very fruitful, with the Higgs boson as the most recent crown jewel. I myself was lucky to contribute to both the ALICE and the ATLAS experiments, where my biggest project entailed measurements of Higgs boson properties.

After multiple projects within particle physics at CERN I was convinced that a PhD in this field would be the right next step for me. But apart from the standard PhD characteristics, like the analysis of complex and abstract problems, I was also looking for some additional specifics. Machine learning is becoming part of more and more areas of our society, and the scientific community at CERN is no exception. I therefore sought a PhD that combines particle physics and the newest machine learning techniques, to be part of this surge of innovation. Taking into account my love for traveling, for working with a diverse group of scientists from all over the world and for keeping the learning curve as steep as possible, I came to the conclusion that a PhD in the INSIGHTS network was a perfect fit for me.

My main research topic will be machine learning techniques in unfolding, under the supervision of professor Glen Cowan from Royal Holloway, University of London. Just as in any scientific experiment, the measuring devices in particle physics are never perfect. The measurements that should reflect nature perfectly actually give a convoluted picture specific to the measuring device. The whole game of unfolding is to take this convoluted picture and try to retrieve the true result that correctly reflects nature.
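
For readers who like to see the idea in numbers, here is a toy sketch in Python. Everything in it is invented for illustration: the two-bin `truth` histogram, the `response` matrix and the helper names are assumptions, and plain matrix inversion is only the simplest textbook way to unfold, not the machine learning approach of the project. With real, fluctuating data, naive inversion tends to become unstable, which is one reason more refined techniques are studied.

```python
# Hypothetical true distribution: event counts in two bins.
truth = [100.0, 300.0]

# Assumed response matrix: response[i][j] is the probability that an
# event from true bin j ends up in measured bin i (columns sum to 1).
response = [[0.8, 0.1],
            [0.2, 0.9]]


def smear(matrix, counts):
    """Multiply a matrix with a vector of counts."""
    return [sum(m * c for m, c in zip(row, counts)) for row in matrix]


def invert_2x2(matrix):
    """Invert a 2x2 matrix with the closed-form formula."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# The "convoluted picture" the imperfect detector shows on average.
measured = smear(response, truth)

# Naive unfolding: apply the inverse response to the measurement.
unfolded = smear(invert_2x2(response), measured)

print("measured:", measured)
print("unfolded:", unfolded)
```

In this noise-free toy the unfolded counts agree with the assumed truth up to rounding, because the measurement was generated with exactly the same response matrix that is inverted.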

After the few events that we have had with the network I feel even more excited and motivated to contribute to INSIGHTS. All of the other ESRs and supervisors clearly feel the same way and have already proven to be a great source of inspiration and creativity. With still a large part of the program ahead of me, I look forward to collaborating with all of them!

Subjective probability and data-driven decision making

We finished our last post by observing that, as human beings, we are not that good at evaluating uncertainties, and that this can heavily affect the outcome of our decisions, both in our work and in our private life. It does not help that the most appropriate mathematical concepts to quantify uncertainties are too often presented through arcane formulas that can hardly be understood outside trivial didactic examples (dice throws, coin flips, card draws, etc.), and that they seem unsuitable to describe situations as complex as real business phenomena.

The key idea to overcome these problems in a business context, based on PangeaF’s experience, is to introduce the concept of subjective probability. That is, to quantify the probability of an event through the degree of belief that it will occur, based on the available information.

Thomas Bayes
(image from wikipedia.org)

This concept is a crucial step towards bringing probability into business applications, since it allows us to define probabilities for events which have never been observed before (e.g. the launch of a new product, the expansion towards a new market, etc.) and to include different degrees of information in the evaluations. Such an approach also gives, through Bayes’ rule, an easy way to update each evaluation in the presence of new sources of information.
To fix the idea, you could ask two different persons to evaluate how probable it is that the value of a company’s shares will double: typically, they would answer with a very small probability, because doubling the value is a very large increase. However, if one of the two persons has some insider contact who reveals that the company is going to release a revolutionary new product, then this person would assign a higher probability to the hypothetical doubling (typically still small, but not as small as before). Neither of the two would be wrong in their evaluation: it is just that, with different levels of knowledge about the event of interest, different quantifications follow.
Moreover, subjective does not mean arbitrary: while subjects with different states of information can evaluate the probability of the same event differently, they must provide rational and factual assessments, by relying on probability rules to evaluate multiple related events playing a role in the same problem.
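
The share-price example can be made concrete with a few lines of Python. The numbers below (a 2% prior and the two likelihoods of hearing the insider tip) are invented purely for illustration, and the function name `bayes_update` is our own; only Bayes’ rule itself is standard.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(event | evidence) computed with Bayes' rule."""
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / evidence


# Both persons start from the same (assumed) small degree of belief
# that the share value will double.
prior = 0.02

# Assumed likelihoods: a tip about a revolutionary product is heard in
# 60% of the cases that end in a doubling, and in 10% of the others.
posterior = bayes_update(prior, p_evidence_if_true=0.60, p_evidence_if_false=0.10)

print(f"without the tip: {prior:.3f}")
print(f"with the tip:    {posterior:.3f}")
```

With these made-up inputs the person holding the tip ends up with a probability several times larger than the prior, yet still far below 50%: higher than before, but still small, exactly as in the story above.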

By using subjective probabilities and Bayesian networks to deal with complex connections among the measured quantities, it is possible: 

  • to perform proper inference processes, unravelling the cause-effect relationship hidden in data in order to find the most probable reasons behind the observed events, even in the presence of complex scenarios and multiple competing causes;
  • to integrate the experts’ knowledge about a given problem, through appropriate relationships among the elements of a descriptive model and suitable probability distributions associated with different situations;
  • to obtain true probabilities from the computations, and not some hard-to-interpret estimate, informing us of how much we have to weigh the occurrence of each event, given the information we received.
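
As a minimal sketch of the first point, consider a toy network with two possible causes and one observed effect. The scenario (a drop in sales that may be due to a competitor’s discount or to a website outage), all the probabilities and all the names are hypothetical, chosen only to show how competing causes can be ranked by exact enumeration; real business models are of course much larger.

```python
from itertools import product

# Assumed prior probabilities of two independent possible causes.
p_cause = {"competitor_discount": 0.30, "website_outage": 0.05}

# Assumed conditional probability table:
# P(sales drop | competitor_discount, website_outage).
p_drop = {
    (False, False): 0.05,
    (True, False): 0.40,
    (False, True): 0.70,
    (True, True): 0.90,
}


def joint(discount, outage):
    """P(discount, outage, sales drop) for one configuration of causes."""
    p_d = p_cause["competitor_discount"] if discount else 1 - p_cause["competitor_discount"]
    p_o = p_cause["website_outage"] if outage else 1 - p_cause["website_outage"]
    return p_d * p_o * p_drop[(discount, outage)]


# Sum over all configurations to get P(sales drop), then condition on it.
configs = list(product([False, True], repeat=2))
p_evidence = sum(joint(d, o) for d, o in configs)
posterior = {
    "competitor_discount": sum(joint(d, o) for d, o in configs if d) / p_evidence,
    "website_outage": sum(joint(d, o) for d, o in configs if o) / p_evidence,
}

for cause, p in posterior.items():
    print(f"P({cause} | sales drop) = {p:.2f}")
```

The outputs are genuine probabilities of each cause given the observed drop, so they can be compared directly with the priors to see how much the observation shifted our belief in each explanation.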

These aspects are crucial in all decision making processes and they allow the agents to make their best assessment, through exploitation of all available information (i.e. data). And they come with great flexibility, since they can be applied to a variety of statistical distributions and of business sectors.

It is important to stress that moving towards data-driven decisions does not mean automating such decisions or removing the human factor from them. Algorithms should mostly be exploited for what they are good at: integrating the available information consistently, without biases interfering with the quantitative evaluation.
Then, the results provided by such algorithms have to be combined, by human decision makers, with the external factors that can hardly be modeled into algorithms (no matter what some vendors claim): what is the risk level that a company can accept at the specific moment a decision has to be taken? What is the impact on stakeholders, in terms of long-term scenarios and company reputation? What are the ethical implications of one decision versus another?

Data-driven decisions, at least as PangeaF sees them, should be the moment to bring together the best that domain experts, data scientists and human decision makers can offer: experts can help spot the key meaningful relations among measured quantities in a business process; data scientists can turn such relations and what historical data say into a coherent and effective model, trading off advanced solutions against actual performance gains; human decision makers can take the results of the models and use them to make more effective choices, optimizing resources or focusing efforts on the important parts of the process.

In the next posts, we will present some of the exciting experiences PangeaF developed by building bridges between real world problems and advanced machine learning techniques.

Stay tuned!

Meet the ESRs: Daria Morozova

Buongiorno da Roma!

My name is Daria Morozova and I am currently the ESR hired by Pangea Formazione within the «INSIGHTS» Innovative Training Network. Without a doubt, the Network is a great opportunity for a young researcher to contribute to science and society. I would be really glad to share with you all the details of this amazing journey and keep you updated on the highlights of every step of the program.

The research project I am involved in is carried out in Rome. It focuses on applying the latest Machine and Deep Learning techniques to image and sound recognition. In particular, my goal is twofold: on the one hand, to estimate traffic through crossroads and to identify special-class vehicles (e.g. police and ambulance) in order to prioritize them; on the other hand, to develop a tool to coordinate and synchronize drone swarms for emergency services, especially in search-and-rescue scenarios. This will be done using audio and video data streams collected by sensors on Unmanned Aerial Vehicles (or «UAVs») in order to detect other UAVs in the surroundings for collision avoidance (even during loss of ground communication!), and to detect search-and-rescue targets.

About me: I was born and raised in Moscow, which is the northernmost and coldest megacity on Earth. I graduated from a 5-year Specialist’s Degree program in Applied Mathematics and Information Theory at Lomonosov Moscow State University and hold a Master’s degree in Economics from the National Research University «Higher School of Economics». I also had the chance to study abroad: a 5-month stay at the Catholic University of the Sacred Heart in Milan, Italy, which gave me the opportunity to improve my linguistic and intercultural skills and facilitated my relocation to Rome. 🙂

(picture from: versus.com)

In the following blog posts I am going to report on current events and present a step-by-step approach to carrying out an exciting project: stay tuned!

See you soon!

(written by Daria Morozova)


Data-driven decision making

In my first post about Pangea Formazione (PangeaF in the following), I mentioned a few times that our company has set as its mission to help other companies make good use of the data they own, in order to move towards a data-driven decision process.

Is this really something useful and/or needed? In fact, it is. 

Since the late 70s there have been plenty of studies revealing the huge impact that biases and heuristics can have on our quantitative decisions, not because of lack of expertise or plain ignorance, but because of the way the human brain has evolved. A typical example is the so-called “framing effect”, studied by Kahneman and Tversky in the early 80s [1].

Daniel Kahneman (picture from: wikipedia.org)

Two separate groups of participants are each presented with a differently worded version of the same scenario, related to the outbreak of an Asian epidemic that would affect six thousand people. Participants are asked to choose between two possible courses of action, based on their rational preferences. The first group was presented with the following choices:

  • with plan A, 2000 persons will be saved;
  • with plan B, we have a 1/3 probability of saving 6000 persons (everybody), and a 2/3 probability that nobody is saved.

The second group was presented with the following choices:

  • with plan C, 4000 persons will die;
  • with plan D, we have a 1/3 probability that nobody dies, and a 2/3 probability that 6000 persons (everybody) die.

PLAN A
2000 saved

PLAN B
A 33% chance of saving all 6000 people,
66% possibility of saving no one.

PLAN C
4000 dead

PLAN D
A 33% chance that no people will die,
66% possibility that all 6000 will die.

What has been observed both in the original experiment and in many replications is that in the first case around 70% of the participants prefer plan A, while in the second case almost 80% of the participants prefer plan D. But plan A is the same as plan C, and plan B is the same as plan D! The only change is in the frame used to present the decision making problem, which affects the choice much more than any rational decision making theory would allow. [*]
The problem is that the description of the experiment in the two settings triggers different areas of our brain: when the choice is presented in terms of gains (first group), risk-aversion mechanisms take precedence, while when the choice is presented in terms of losses (second group), we are much more prone to choose a risky option because of loss aversion.

Other examples can be found in Kahneman’s book “Thinking, Fast and Slow” [2], which the famous psychologist and 2002 Nobel laureate in Economic Sciences wrote to present the results of decades of experiments on the psychology of judgment and decision-making, as well as on behavioral economics.

And this is not just an example taken from some psychological study to “push our agenda”, with no true impact on the business world: it is something that is continuously seen in action. A monitoring study spanning more than 20 years, covering public, private and non-profit companies throughout the USA, Europe and Canada [3], has shown that typically 50% of business decisions end in failure, 33% of all decisions made are never implemented, and half of the decisions which do get implemented are discontinued after 2 years. One of the causes of this (depressing) trend is that in two cases out of three, choices are made based either on failure-prone methods or on fads that are popular but not based on actual evidence.
In several cases it has also been shown that failure-prone methods are still followed because of difficulties in dealing correctly with the uncertainties that are intrinsic to decision making processes in strategic and business contexts.

There are several types of uncertainty which can affect a decision making process: factors that there is no time or money to monitor effectively, factors that are outside our control, like competitors’ moves or other stakeholders’ decisions, and factors that are truly random and unexpected and that can lead the same decision towards very different results. Uncertainty assessment is a critical element in such scenarios, and we always find it surprising to see how often it is underestimated: typically, it is only considered when assessing the global risk level of a production process or “a posteriori”, when a decision has had undesired outcomes.

The difficulties described above in evaluating uncertainties quantitatively are fully in line with the psychological research we mentioned, but there also seems to be an additional inertia towards the adoption of software-based tools that could provide more coherent and consistent probability evaluations in different scenarios.

What can be done to address such problems? How can we improve our skills in dealing with uncertainties? We will provide a possible answer in the next post, which shall complete the overview of the main points of the approach followed by PangeaF when implementing software solutions to support decision making processes.

Stay tuned!

[*] On a side note, you might want to note that the expected value of each plan is always the same, so that, assuming human choices follow a model based on perfect information, and defining rationality along the lines of von Neumann & Morgenstern’s game theory, we would conclude that any “rational” decision maker would be indifferent among the four possible plans.
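
The arithmetic behind this remark can be checked in a few lines of Python, using only the numbers quoted in the experiment above. The variable names and the choice to express every plan as “number of people saved” are ours.

```python
from fractions import Fraction

TOTAL = 6000  # people affected in the scenario described above

# Each plan is a list of (probability, number of people saved) outcomes.
# Plans C and D are stated in terms of deaths, so saved = TOTAL - deaths.
plans = {
    "A": [(Fraction(1), 2000)],
    "B": [(Fraction(1, 3), TOTAL), (Fraction(2, 3), 0)],
    "C": [(Fraction(1), TOTAL - 4000)],
    "D": [(Fraction(1, 3), TOTAL - 0), (Fraction(2, 3), TOTAL - TOTAL)],
}

# Expected value: sum of probability times outcome for each plan.
expected_saved = {
    name: sum(p * saved for p, saved in outcomes)
    for name, outcomes in plans.items()
}

for name, value in expected_saved.items():
    print(f"plan {name}: expected number saved = {value}")
```

All four plans come out at an expected 2000 people saved, which is why only the framing, and not the expected outcome, distinguishes them.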

Bibliography

[1] A. Tversky & D. Kahneman. The Framing of Decisions and the Psychology of Choice. Science 211 (4481), 453–458 (1981). doi:10.1126/science.7455683

[2] D. Kahneman. Thinking, Fast and Slow. Farrar, Straus and Giroux, New York, 2011. ISBN: 0374533555

[3] P. C. Nutt. Why Decisions Fail. Berrett-Koehler Publishers, Oakland, California, 2002. ISBN: 1576751503