Observing the Higgs with over one petabyte of new CMS Open Data

By Achintya Rao

20 December 2017: The CMS Collaboration at CERN is pleased to announce the release of the third batch of high-level open data from the CMS detector at the Large Hadron Collider (LHC), available on the CERN Open Data portal. This batch contains over 550 terabytes, or 11.6 inverse femtobarns (fb⁻¹), of proton-proton collision data recorded in 2012 at a centre-of-mass energy of 8 TeV — around half the data from that year. Around 510 terabytes of Monte Carlo simulation data are also being made available now, with more in the data-transfer pipeline. This release includes datasets that were used to discover the Higgs boson and the data are available in the same format that CMS scientists use for research. CMS is also providing smaller, simplified datasets for use in educational contexts. As previously, the data are released into the public domain under the Creative Commons CC0 waiver.

A collision event recorded by CMS in 2012 and available in the latest release of CMS Open Data. File for visualisation prepared by Tom McCauley

LHC data are complicated and they are big. The LHC began operations in 2010 and since then, under nominal conditions, intense bunches of protons collide inside the CMS detector up to 40 million times every second. CMS researchers have recorded petabytes of data from these collisions and have so far published hundreds of scientific papers with these data. “Our data are an important element of the CMS Collaboration’s rich scientific legacy,” says CMS Spokesperson, Joel Butler. “We would like to ensure that they are not only preserved in the long run but are also available to the public, so that both CMS members and external researchers can re-examine them in the future. This is part of our commitment to openness and long-term data preservation.”

It takes over 2000 people to do all the work necessary to operate CMS, develop the reconstruction software and analyse the data. The analysis itself is a deeply complex process requiring detailed knowledge of how the data were obtained. Therefore, providing the collision data alone is only one step towards full openness. Preserving the knowledge of how to analyse them is critical for any future re-examination or independent exploration of these datasets. In addition to releasing the simulated datasets necessary for comparing experimental data with theoretical predictions, the CMS Data Preservation and Open Data team has assembled a comprehensive collection of supplementary materials, including example code for performing relatively simple analyses as well as metadata such as information on how data were selected and what the LHC’s running conditions were during the time of data collection. And of course, you do not have to download all the datasets to your local machine in order to explore them. The CernVM team has provided custom virtual-machine images for analysing CMS data that come with the official (and open source) CMS analysis framework, CMSSW, and a means to access data from the CERN Open Data portal on an on-demand basis. “We recognise that working with CMS Open Data comes with its challenges; analysing experimental data is a complex process and making data public will not make the process simpler,” says Kati Lassila-Perini, the CMS Data Preservation and Open Access co-coordinator. “We are continuously improving and adding to the information we provide with the datasets, making sure that research-level analyses can be conducted with CMS Open Data. Our setup has already turned out to be an excellent test bench for data preservation in terms of long-term usability for our earlier data from 2010.”

Having researchers from outside the CMS Collaboration conduct independent and novel research with CMS data has been a stated goal of the CMS Open Data endeavour from the start. Only a few months ago, the first two such research papers were published by a team of theorists led by Jesse Thaler from MIT. The theorists were interested in performing a measurement that CMS scientists had themselves not done: specifically they wanted to measure particular substructures in clusters of particles known as “jets” produced in proton-proton collisions at the LHC. They began working on CMS data that were collected by the detector in 2010 and released via the CERN Open Data portal in 2014, receiving advice and guidance from CMS scientists on how to interpret these data. “Though there was considerable initial scepticism that a team of five theoretical physicists could perform a publishable analysis based on open collider data, much of that scepticism dissipated once it became clear that our analysis was based largely on the same workflow used by CMS,” Jesse says in a recent blog post. “Our two publications are a proof of principle that open collider analyses are feasible and potentially impactful.”

Left: The official CMS plot for the “Higgs to four leptons” channel, shown on the day of the Higgs discovery announcement. Right: A similar plot produced by Nur Zulaiha Jomhari et al. using CMS Open Data from 2011 and 2012. Although the plots appear similar, the analysis with CMS Open Data uses more data (at 8 TeV and overall) than the official CMS one from the original discovery but is a lot less sophisticated and is not scrutinised by the wider CMS community of experts.

The latest release of CMS Open Data also carries the fascinating possibility of allowing one to repeat the analysis that led to the Higgs discovery by studying the same data used by CMS scientists to announce the particle’s existence in 2012. This announcement came after months of work by hundreds of experts within the CMS Collaboration, who could also rely on the resources of the Worldwide LHC Computing Grid for conducting their analyses. So, although any external efforts to retrace these steps would require a deep understanding of experimental particle physics and come at great cost in terms of time and effort, the possibility nonetheless exists. As a proof of concept, Kati’s fellow co-coordinator Achim Geiser set CMS doctoral student Nur Zulaiha Jomhari the task of recreating part of one of the analyses that contributed to the Higgs discovery. In simple terms, when a Higgs boson is produced in particle collisions, it transforms almost instantaneously into lighter particles, which CMS can observe directly or indirectly. Sometimes, the Higgs transforms into four leptons (electrons or muons), and Nur Zulaiha was asked to extract this signal from the CMS Open Data. “We knew it would not be easy,” says Achim, “in particular since the bulk of the data from collisions and simulations only became available to Nur Zulaiha a bit more than two months ago. But she managed to analyse the data and produce plots similar to some of those shown when the Higgs discovery was announced by CMS five years ago. Now, this analysis is a lot less sophisticated than the official CMS one and the signal we see does not have the same statistical significance, but it’s great to demonstrate the potential of our open data.” The code for recreating Nur Zulaiha’s analysis is also available on the CERN Open Data portal, although the full analysis can take around a month to run on a single machine.
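To give a flavour of what “extracting the four-lepton signal” involves, here is a minimal Python sketch of the central quantity in such an analysis: the invariant mass of four leptons, computed from their transverse momentum, pseudorapidity, azimuthal angle and mass. This is not the code used by Nur Zulaiha or by CMS; the function names `four_vector` and `invariant_mass` and the lepton values below are hypothetical, chosen purely for illustration, and a real analysis would add event selection, lepton pairing and background estimation.

```python
import math
from typing import Sequence, Tuple

# A four-vector is (E, px, py, pz), with all components in GeV.
FourVector = Tuple[float, float, float, float]


def four_vector(pt: float, eta: float, phi: float, mass: float) -> FourVector:
    """Build (E, px, py, pz) from collider coordinates (pt, eta, phi, mass)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return (energy, px, py, pz)


def invariant_mass(particles: Sequence[FourVector]) -> float:
    """Invariant mass of a set of particles: sqrt(E^2 - |p|^2) of the summed four-vector."""
    energy = sum(p[0] for p in particles)
    px = sum(p[1] for p in particles)
    py = sum(p[2] for p in particles)
    pz = sum(p[3] for p in particles)
    mass_squared = energy * energy - (px * px + py * py + pz * pz)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(mass_squared, 0.0))


if __name__ == "__main__":
    MUON_MASS = 0.105658  # GeV
    # Hypothetical four-muon candidate: illustrative numbers, not real CMS data.
    muons = [
        four_vector(pt=45.0, eta=0.3, phi=0.2, mass=MUON_MASS),
        four_vector(pt=38.0, eta=-0.5, phi=2.9, mass=MUON_MASS),
        four_vector(pt=22.0, eta=1.1, phi=-1.4, mass=MUON_MASS),
        four_vector(pt=15.0, eta=-0.9, phi=1.7, mass=MUON_MASS),
    ]
    print(f"Four-lepton invariant mass: {invariant_mass(muons):.2f} GeV")
```

Filling a histogram with this quantity for every selected event gives a plot of the kind described above, in which a Higgs signal would appear as an excess of events near the particle’s mass.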

Analysing CMS Open Data in a web browser without any additional installation of software or download of data. Animated GIF produced by Achintya Rao

Not only do example analyses like Nur Zulaiha’s Higgs work show researchers what they can do with CMS Open Data, they also serve as valuable teaching tools for high-school and university education. CMS resources on the CERN Open Data portal include detailed guides for research and for education as well as lightweight tutorials. For example, using a tool known as Binder, you can get a quick introduction to analysing CMS Open Data using Python with just a modern browser and an internet connection, without downloading any data onto your computer. And the ever-popular visualisation tool that allows you to look at individual collision events from CMS directly on the portal has been upgraded with new features including a stereoscopic view for use with Google Cardboard; you can also use it to look at collisions from the latest batch of CMS Open Data.

A collision event recorded by CMS in 2012 showing a “Higgs candidate”, available on the CERN Open Data portal with the latest release of CMS Open Data. File for visualisation prepared by Tom McCauley; animated GIF produced by Achintya Rao

At the moment, CMS has committed to releasing up to 50% of each year’s recorded data a few years after they were collected, once CMS scientists finish most of their analysis of these datasets. “To see our open data in use outside CMS has been very rewarding,” says Kati. “It has been a great motivation for us and we look forward to continuing our pioneering efforts to release research-quality open data from the LHC in the years to come.”

Acknowledgements

The CERN Open Data portal is openly developed on GitHub by the CERN Information Technology and CERN Scientific Information Services teams, in cooperation with the experimental collaborations who release open data on it. CMS would like to thank CERN for providing resources and expertise to build and maintain the portal. We would also like to acknowledge the support of many of our collaboration members who have helped us release this latest batch of CMS Open Data.

Key points

This is the third release of high-level CMS Open Data, following release of 2010 data in 2014, and 2011 data in 2016
Latest release contains around half of data collected in 2012
557+ terabytes of collision data (11.6 inverse femtobarns); 510+ terabytes of simulation data with more to come in the coming months
Contains datasets used for the Higgs discovery of 2012
Supplementary information and metadata provided
Novel research has been done by external scientists with previously released datasets
Examples of simplified datasets — suitable for educational contexts — and code to analyse them also available
