By Dmytro Kovalskyi

 

Data taking is restarting in CMS for the continuation of Run 3 - and with it comes the enormous responsibility of keeping the data safe and making sure it is usable for physics analyses.

In April 2023, the Large Hadron Collider again started to deliver stable collisions to the experiments at CERN. Restarting data taking after a long break is always a challenge. Despite running many tests and taking numerous runs of cosmic data, we have to be prepared to face issues, since nothing compares in complexity to real collisions. The ride can become bumpy very quickly if serious issues emerge.

The CMS Tier-0 service is responsible for the prompt processing of the data collected by the CMS experiment, which happens almost immediately after the data become available, and for their initial distribution. The data are then sent around the world to Tier-1 computing sites for future reprocessing and long-term storage, and to Tier-2 sites to enable analysis by the Collaboration.

The amount of data we receive annually is on the order of 15-20 petabytes, a volume equivalent to the size of the digital content collections held by the Library of Congress of the United States. It is essential to ensure that all of this data is not only stored safely but also remains usable. We conduct an initial online data quality monitoring check on a small percentage of the data before proceeding to a comprehensive offline scrutiny through a certification process. The data collected by the detector need to be reformatted and tightly compressed to minimise storage space.

Simultaneously, we carry out the first step of data processing, which involves calibrating the detector data. This must be completed in time for the reconstruction of all collision events, a process that automatically begins 48 hours after data collection and uses up to 50,000 CPUs. This entire procedure requires meticulous coordination: if data processing stalls, we risk filling the available disk space in just a few days, which could lead to an interruption in data collection and the loss of valuable beam time.
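To get a feel for the numbers, here is a rough back-of-envelope estimate of how quickly a stalled processing chain would fill the Tier-0 input buffer. This is only a minimal sketch: the 10 gigabytes per second rate is the peak quoted later in this article, while the buffer size of a couple of petabytes is an illustrative assumption rather than an official figure.

    # Back-of-envelope estimate: how long until the Tier-0 disk buffer fills
    # if prompt processing stalls and nothing is drained to tape.
    incoming_rate_gb_per_s = 10        # peak detector output quoted in the text (GB/s)
    buffer_size_pb = 2                 # assumed buffer size in petabytes (illustrative only)

    buffer_size_gb = buffer_size_pb * 1_000_000   # 1 PB = 10^6 GB in decimal units
    seconds_to_fill = buffer_size_gb / incoming_rate_gb_per_s
    days_to_fill = seconds_to_fill / 86_400        # seconds per day

    print(f"A {buffer_size_pb} PB buffer fills in about {days_to_fill:.1f} days "
          f"at {incoming_rate_gb_per_s} GB/s")
    # Prints roughly 2.3 days, ignoring the LHC duty cycle and any data
    # already being moved off to the Tier-1 sites - hence "just a few days".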

As with many other big IT services, we all expect Tier-0 to just work. People never consider the possibility that Google search might be unavailable when they need to look something up, or that Netflix might be down when they finally have some time to take a break. Hardly anyone knows the people working on such a project unless a disaster happens, and no one wants to become known for that.

Problems tend to show up unannounced, for example when some rare failure that hardly ever happens eventually does. Detecting problems early helps us avoid spending days or even weeks fixing the consequences. At such moments we are grateful to have colleagues responsible for the various other critical services, who are there to help when things fall apart. With their help we manage to put things back together, investigate what happened, take action to prevent such failures in the future, and relax until another rare or unexpected failure shows up.

Fortunately, as the data taking conditions stabilise, the processing experience improves significantly. Past issues fade from memory and all that remains is the final result, which is really impressive. In the first year of Run 3, the CMS experiment managed to take data at 50% higher rates than in the last year of Run 2. We achieved a maximum data rate of 10 gigabytes per second, setting a new record that is almost a factor of two higher than what was achieved during the most computationally heavy weeks at the end of 2018. Overall, we will remember 2022 as a successful year for CMS, and we will do our best to make 2023 even better.



 
