
Corresponding author: David V. Nguyen
E-mail: david.nguyen@yale.edu

The Simons Observatory: Alarms and Detector Quality Monitoring

David V. Nguyen (ORCID: 0000-0002-7575-8145), Wright Laboratory, Department of Physics, Yale University, New Haven, CT 06511, USA
Sanah Bhimani (ORCID: 0000-0002-9763-1663), Wright Laboratory, Department of Physics, Yale University, New Haven, CT 06511, USA
Nicholas Galitzki (ORCID: 0000-0001-7225-6679), Department of Physics, University of Texas at Austin, Austin, TX 78712, USA; Weinberg Institute for Theoretical Physics, Texas Center for Cosmology and Astroparticle Physics, Austin, TX 78712, USA
Brian J. Koopman (ORCID: 0000-0003-0744-2808), Wright Laboratory, Department of Physics, Yale University, New Haven, CT 06511, USA
Jack Lashner (ORCID: 0000-0002-6522-6284), Wright Laboratory, Department of Physics, Yale University, New Haven, CT 06511, USA
Laura Newburgh (ORCID: 0000-0002-7333-5552), Wright Laboratory, Department of Physics, Yale University, New Haven, CT 06511, USA
Max Silva-Feaver (ORCID: 0000-0001-7480-4341), Wright Laboratory, Department of Physics, Yale University, New Haven, CT 06511, USA
Kyohei Yamada (ORCID: 0000-0003-0221-2130), Joseph Henry Laboratories of Physics, Jadwin Hall, Princeton University, Princeton, NJ 08544, USA; Department of Physics, The University of Tokyo, Tokyo 113-0033, Japan
Abstract

The Simons Observatory (SO) is a group of modern telescopes dedicated to observing the polarized cosmic microwave background (CMB), transients, and more. The Observatory consists of four telescopes and instruments, with over 60,000 superconducting detectors in total, located at ∼5,200 m altitude in the Atacama Desert of Chile. During observations, it is important to ensure the detectors, telescope platforms, calibration and receiver hardware, and site hardware are within operational bounds. To facilitate rapid response when problems arise with any part of the system, it is essential that alerts are generated and distributed to appropriate personnel if components exceed these bounds. Similarly, alerts are generated if the quality of the data has become degraded. In this paper, we describe the SO alarm system we developed within the larger Observatory Control System (OCS) framework, including the data sources, alert architecture, and implementation. We also present results from deploying the alarm system during the commissioning of the SO telescopes and receivers.

keywords:
cosmology, Cosmic Microwave Background, Simons Observatory, control software, monitoring, data acquisition, alarms, data quality, visualization

1 Introduction

Measurements from CMB experiments have resulted in precise constraints on ΛCDM parameters and derived quantities. [1] The Simons Observatory (SO) is a ground-based CMB experiment being commissioned in the Atacama Desert in Chile. It consists of three 0.5 m diameter small-aperture telescopes (SATs) and one 6 m diameter large-aperture telescope (LAT), totaling over 60,000 antenna-coupled, multi-chroic transition edge sensor (TES) bolometers across a frequency range of 27-280 GHz. [2, 3, 4, 5] The detectors use a microwave multiplexing (μMUX) readout architecture and SLAC Microresonator Radio Frequency (SMuRF) readout electronics. [6, 7] The increased detector count provides improved sensitivity compared to previous generations of telescopes. As a result, SO will advance a broad set of science goals, including constraints on theories of inflation, the number of light relic particles, lensing of large-scale structure, the kinetic and thermal Sunyaev-Zel’dovich effects in galaxy clusters, extra-galactic point sources, and transient events. [8]

To achieve this increased detector count across a wide variety of science cases, SO has deployed detectors across four separate telescopes. The resulting telescope platforms, millimeter receivers, and site infrastructure include over 5,000 slow data fields covering cryogenics, power distribution, computing, networking, weather, etc. We call this non-detector data, acquired at rates slower than the detector sampling rate, housekeeping (HK) data. To perform control, data acquisition, and monitoring of HK systems across SO, we have developed the Observatory Control System (ocs, https://github.com/simonsobs/ocs). [9] A key component of observations is the monitoring of, and alarming on, metrics from these subsystems. The alarm system must be able to monitor the health of the entire telescope, provide quick overviews of the state of the system, emit alerts that sufficiently describe the issue, and notify researchers via various methods depending on priority level and user preference. It must also be easily modified and scalable, as new alarms continue to be added with more subsystems and additional telescopes. Using the data feeds monitored by ocs, alarms are generated from Grafana alert rules and distributed by campana through various notification methods. These alarms are monitored daily to ensure proper observations.

In this paper, we present data sources for alarms in Section 2, including both HK and detector data. In Section 3, we describe an overview of the alarm system, detailing its requirements, architecture, and implementation. Next, in Section 4, we describe how the alarm system is deployed at the site in Chile, improvements and successes, and plans moving forward. Finally, we conclude in Section 5. Appendix A contains the table of acronyms used within this paper.

2 Data Sources for Alarms

SO has a large collection of data sets for assessing the health of the telescopes, site equipment, and other subsystems, along with tracking the detector data quality for science analysis. From these housekeeping and detector-related data sources, we create alarms depending on user-defined conditions (e.g., safe ranges, binary states, or combinations of thresholds). As of this writing, SO acquires data from over 5,000 fields; we use >500 of those fields to generate alarms at various levels. We will describe these data sources in this section.

2.1 Housekeeping (HK) Data

HK data sets come from any hardware devices except those for detector data acquisition. In SO, we typically divide them into two broad categories: telescope-specific systems and site-wide systems. To target the most relevant researchers when an HK system is in an alarm state, we separate the alarms into the following groups (an example of an alarm state for most of these groups is given in Table 1):

  • Computing. These HK data fields come from computers at the site and are usually acquiring data such as disk usage, CPU usage, and memory usage (via Telegraf, https://www.influxdata.com/time-series-platform/telegraf/, instead of ocs).

  • SMuRF. [6, 7] These HK data fields track the detector readout system health. These include board temperatures, board current draw, and coolant leak sensors.

  • Cryogenics. These HK data fields acquire data from the dilution refrigerator (DR), which cools down the detectors to superconducting temperatures, and associated cryostat sources. These include DR temperatures, pressures, and flow; compressor state, temperatures, and pressure; cryostat pressure and temperatures.

  • Half-Wave Plate. Each SAT has a cryogenic half-wave plate that spins at ∼2 Hz to modulate the incident polarization. [10] HK data fields from this subsystem include rotation angle, rotation speed, and IRIG (absolute) timing.

  • Power. These HK data fields acquire data from power storage systems and generators. These include uninterruptible power supply (UPS) health (e.g., battery state, charge remaining) and diesel generator health (e.g., fuel level, shutdown status, electrical trip status).

  • Platform. These HK data fields acquire data from the telescope movable platform and its antenna control unit (ACU) which drives the platform. These fields include position and velocity, mode state (safe or remote), and time synchronization.

  • Agents. As described in Section 3.2, HK agents communicate between hardware devices and the overall OCS software architecture. Monitoring the agent operation status catches agent crashes.

  • Timing. The SO timing system is centralized; timing is distributed either over the network as PTP or over dedicated fiber as specially generated timing signals defined by the SMuRF systems for the detector timing. [11] The central timing device and the edge-clocks that receive timing signals for calibrators on the platforms have HK data fields such as GPS synchronization state, PTP state, and PTP accuracy.

  • Environmental. These HK data fields are used to monitor the state of the site conditions. These include weather metrics such as wind speed, temperature, and precipitable water vapor (PWV).

  • Remote Observing Schedule. SO has a scheduler system that runs a set of commands in order of line number. [12] We monitor whether that system has resulted in a fault and is no longer observing.

Some metrics are more important to monitor than others, as abnormal conditions may cause hardware damage or safety concerns. The alarm system distributes notifications appropriately depending on the severity level of alarms. At the time of writing, four HK metrics trigger phone calls (in addition to other notification methods), two of which are in Table 1. These are:

  • DR temperature: During nominal observations, if the 100 mK stage exceeds 120 mK, the cryogenic system is in a critical state and needs immediate attention.

  • Pulse Tube Cryocoolers (PTCs) state: PTCs are used to cool the DR and the cryostat radiation shields for all receivers. PTC shutdowns require immediate attention.

  • Wind speed: Wind speed is important because it determines whether it is safe for people to work at the site or for the telescope platforms to move. A phone call is generated if gusts exceed 70 km/hour, since the platforms cannot observe in those conditions. Note: a separate alert is distributed via other notification methods at >50 km/hour for the safety of site personnel, who often cannot receive phone calls due to limited reception at the site.

  • Sun avoidance: A phone call is triggered if the boresight of the telescope is within a telescope-dependent angular distance of the Sun. Sun avoidance has been implemented within the platform control; however, under certain conditions (e.g., manual control of the telescope) pointing too close to the Sun is still possible.
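The examples above all map to the highest severity level. As a minimal sketch of how severity can select notification methods (the group names, severity labels, and mapping below are illustrative, not the actual campana configuration):

```python
# Hypothetical severity-routing table: an illustration of how severity levels
# can select notification methods, not the actual campana configuration.
SEVERITY_METHODS = {
    "critical": ["slack", "email", "sms", "phone"],  # e.g., DR temperature, wind gusts
    "high": ["slack", "email", "sms"],
    "normal": ["slack"],
}

def notification_methods(alert: dict) -> list:
    """Return the notification methods for an alert based on its severity label."""
    return SEVERITY_METHODS.get(alert.get("severity", "normal"), ["slack"])

# Example: wind gusts above the platform safety limit are treated as critical.
alert = {"name": "site wind gusts > 70 km/h", "group": "environmental", "severity": "critical"}
print(notification_methods(alert))  # ['slack', 'email', 'sms', 'phone']
```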

2.2 Detector Quality Metrics

Detector data is important to monitor since it provides the most direct assessment of observation quality. There are two main detector data quality sources for the alarms. The first source uses SMuRF HK feeds: in particular, data feeds that track the number of detectors in the superconducting transition (i.e., in a usable state for CMB observations). If too few detectors are in transition, we trigger an alarm under the assumption that the automatic detector biasing routine failed in some way.
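The underlying check amounts to a fraction threshold; a minimal sketch (the threshold value and detector counts below are illustrative):

```python
def too_few_in_transition(n_in_transition: int, n_total: int, min_frac: float = 0.7) -> bool:
    """Alarm condition: too small a fraction of detectors are biased into the
    superconducting transition (min_frac is an illustrative threshold)."""
    if n_total == 0:
        return True
    return n_in_transition / n_total < min_frac

# Example: 800 of 1495 detectors in transition -> the alarm fires.
print(too_few_in_transition(800, 1495))  # True
```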

The second source of alarms for detector data quality comes from the data processing pipeline. The processing pipeline reads in the raw detector data, applies data quality flags based on the common-mode response to the sky and the HWP signal (for the SATs), and finds and corrects glitches; this produces an output required for quick-look maps (sotodlib, https://github.com/simonsobs/sotodlib). The processing is run at the site separately for each telescope within 12 hours of data acquisition and is automated using prefect (https://www.prefect.io/). [12] The same alarm system described in Section 3 is used to alert on both metrics produced from the data processing pipeline and the HK fields described above. This is useful for catching otherwise unnoticed issues, not for real-time application. For example, we can be notified if a script that cuts detectors based on defined criteria removes too many of them, which may indicate that detector calibration was unsuccessful.
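A minimal sketch of how such a pipeline stage can be automated with prefect is shown below; the task and flow names, the cut-fraction threshold, and the numbers are illustrative and do not reproduce the actual sotodlib pipeline.

```python
from prefect import flow, task

@task
def process_observation(obs_id: str) -> dict:
    # Placeholder for the sotodlib processing of one observation; here we just
    # return illustrative detector counts.
    return {"obs_id": obs_id, "n_detectors": 1495, "n_cut": 564}

@task
def check_detector_cuts(result: dict, max_cut_frac: float = 0.5) -> float:
    # Compute the metric that feeds the alarm; raising marks the flow run as
    # failed, which can itself be alerted on.
    cut_frac = result["n_cut"] / result["n_detectors"]
    if cut_frac > max_cut_frac:
        raise ValueError(f"{cut_frac:.0%} of detectors cut in {result['obs_id']}")
    return cut_frac

@flow
def quicklook_processing(obs_id: str) -> None:
    result = process_observation(obs_id)
    check_detector_cuts(result)

if __name__ == "__main__":
    quicklook_processing("obs_example_0001")
```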

Figure 1: An example 4-set Venn diagram to demonstrate the detector bias cuts. Each set is a parameter that causes the detector to be removed. The bias group (bg) is flagged if <0, since unassigned detectors are labeled with -1. The detector resistance (r_tes) is flagged if ≤0. The fractional resistance (r_frac) is flagged if outside of the defined range (0.05, 0.9). The saturation power (p_sat) is flagged if outside of the defined range (0, 20). Overlaps indicate detectors that are cut in common between different parameters. This example uses a detector wafer from commissioning of one of the SATs, observed under less desirable conditions to illustrate the effects of the cuts. In this case, 564 out of 1495 total detectors were cut, 214 of which were common to all flags. This visualization indicates that the r_frac cut is the most aggressive, and all detectors removed through the r_tes and bg cuts are shared with other parameters, implying those cuts are less discriminatory.

The pipeline also produces visualizations at each step, showing the transformed signal and related statistics, which aids inspection to ensure that data quality is adequate for CMB maps. For example, at the start of each observation, bias steps are taken to measure the response to small steps in detector bias voltage. Several parameters can be deduced from this procedure: the location of the detector in its superconducting transition (r_frac), the estimated TES resistance (r_tes), whether detectors could not be assigned to a bias group due to a bad fit, and the number of saturated (i.e., non-responsive) detectors. We remove detectors whose bias parameters are outside the expected range. These cuts can be implemented separately; however, they frequently overlap because they can probe the same poor behavior in the transition (e.g., channels saturated by higher loading when the water vapor content in the atmosphere is large). To visualize how many detectors are cut by each parameter, we plot their overlap in Figure 1 as a 4-set Venn diagram. Generating this figure is included in the automated processing scripts, allowing us to rapidly assess behavior in conjunction with alarms.
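The cuts summarized in Figure 1 reduce to boolean masks over per-detector parameters. A minimal numpy sketch of the flagging and overlap counting (the parameter arrays here are randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_det = 1495

# Illustrative per-detector parameters; in practice these come from the bias-step analysis.
bg = rng.integers(-1, 12, n_det)         # bias group; -1 means unassigned
r_tes = rng.normal(0.004, 0.002, n_det)  # TES resistance [ohm]
r_frac = rng.uniform(0.0, 1.0, n_det)    # fractional resistance in the transition
p_sat = rng.normal(10.0, 8.0, n_det)     # saturation power [pW]

# Flag conditions matching the cuts described in the Figure 1 caption.
cuts = {
    "bg": bg < 0,
    "r_tes": r_tes <= 0,
    "r_frac": (r_frac < 0.05) | (r_frac > 0.9),
    "p_sat": (p_sat < 0) | (p_sat > 20),
}

removed = np.logical_or.reduce(list(cuts.values()))
common = np.logical_and.reduce(list(cuts.values()))
print(f"cut {removed.sum()} of {n_det} detectors; {common.sum()} flagged by all four cuts")
```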

3 Alarm System Overview

The SO alarm system is a collection of software packages intended to monitor the observatory and alert observers of critical errors. Different groups of people, such as Remote Observing Coordinators (ROCs), site engineers, and system experts, can receive alerts. ROCs are researchers who take shifts to monitor the status of each telescope and respond to alarms. ROCs receive alerts efficiently, facilitating an immediate response to recover systems for regular operations. This system was designed to allow ease of use through rapid inspection and multiple methods of receiving alerts. In this section, we describe the requirements needed for successful operation of the alarm system as well as the architecture and implementation.

Figure 2: The SO alarm system architecture. ocs Agents monitor hardware and SMuRF HK metrics, sending data feeds to InfluxDB (via the Influx Publisher Agent). The data processing pipeline also produces detector data quality metrics, which are sent to InfluxDB. Using the InfluxDB as the data source, Grafana monitors the data feeds and emits an alert when thresholds are surpassed. Users create visualizations and define alert rules on Grafana. When an alarm triggers, alerts are sent to Slack as well as campana, which sends emails, SMS, and phone calls. Users subscribe to campana to receive alerts. Nodes colored blue are data sources as described in Section 2 and those colored red consist of the architecture and contact points described in Section 3.

3.1 Requirements

The alarm system must be able to monitor the health of every component, both hardware and software, across all telescopes and the site itself. The state of the system should be easily discernible by any SO member, whether that may be the ROC, site engineer, or system expert. When issues arise, the system must emit alerts that clearly describe the problem and contain a link to the live monitor for the data which sourced the alarm. Additionally, the system must emit alerts that link to failed observation schedules. These alerts should be distributed to the appropriate people who can best respond to the situation; thus, the system should accommodate various notification methods and emit alerts according to their defined group and severity level. It must also be scalable since new alarms will continue to be added with additional subsystems and telescopes.

3.2 Architecture and Implementation

The architecture and interfaces of the SO alarm system, as well as software dependencies, are shown in Figure 2. This system is described in more detail in this section.

Figure 3: An example Grafana dashboard used to monitor metrics for one of the SATs. In this case, we show the health of the SMuRF system and the cryogenics system. This dashboard has a combination of different display types: time-series plots of temperatures, some with multiple sensors shown in the same plot, as well as ‘single-stat’ panels, which give the most recent value (for this case, the FPGA currents/temperatures and DR temperatures/pressures). Also shown are the single-stat states for binary flags from systems (in this case, that no leak is present in the system). Here, panels show green for normal operations. When thresholds are exceeded and alarms are triggered, the panels show red. Panels are grouped by the alarm groups described in Section 2.
Figure 4: The Grafana alert rules page shows all active alerts grouped by the alarm overview dashboards. The list can be filtered to show currently firing alerts.

3.2.1 Visualization and Alert Generation

The first component of the alarm system is the interface between hardware and the ocs. The ocs is the central software stack that monitors and controls the telescope and site hardware via ocs Agents. [9] Agents are individual software servers, each connecting to a different piece of hardware. Agents are written in Python and typically deployed using Docker (https://www.docker.com/) containers with mounted configuration files. As noted in Section 2, some examples of metrics monitored by Agents include the temperature of the DR, the PWV levels from the radiometer, the fuel level of the diesel generators that power the entire observatory, and some SMuRF health metrics such as the number of detectors that are in their superconducting transition after a bias step.
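A schematic sketch of the Agent pattern is shown below: a class that reads a sensor and publishes to a housekeeping feed. The feed and field names are made up, and the real ocs Agent API (feed registration, session handling, process registration) is simplified here with a stand-in class, so this illustrates the structure rather than working ocs code.

```python
import time

class FakeFeedAgent:
    """Stand-in for the ocs agent object, mimicking its feed-publishing role."""
    def register_feed(self, name, record=True):
        self.feed_name = name

    def publish_to_feed(self, name, message):
        print(f"[{name}] {message}")

class ThermometerAgent:
    """Schematic HK Agent: read a sensor and publish it to a housekeeping feed,
    which the Influx Publisher Agent would forward to InfluxDB."""
    def __init__(self, agent, read_temperature):
        self.agent = agent
        self.read_temperature = read_temperature
        self.agent.register_feed("temperatures", record=True)

    def acq_once(self):
        message = {
            "timestamp": time.time(),
            "block_name": "temps",
            "data": {"still_temp_K": self.read_temperature()},
        }
        self.agent.publish_to_feed("temperatures", message)

if __name__ == "__main__":
    agent = ThermometerAgent(FakeFeedAgent(), read_temperature=lambda: 0.098)
    agent.acq_once()
```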

ocs sends HK data to InfluxDB (https://www.influxdata.com/), a time series database, via the Influx Publisher Agent. The real-time detector data processing pipeline also produces metrics that are published to InfluxDB. Grafana (https://grafana.com/) uses InfluxDB as the data source to visualize this time series data, whether from ocs Agents, Telegraf, or the processing pipeline. Grafana is a web application used to visualize and analyze time series data through the use of “dashboards”. A dashboard consists of “panels”, each displaying the data in a user-specified way. The panels can be arranged to group subsystem components together (e.g., as described in Section 2). An example of part of an SO dashboard is shown in Figure 3, with more detail provided in the caption. We lay out these dashboards to optimize space and give the most crucial information at a glance. The example shown in Figure 3 is an alarm overview page for one of the SAT telescopes and contains a subset of the most essential data to capture the health of the SAT. Additional dashboards can be created by users for specific purposes (e.g., investigating hardware issues); however, only the alarm overview dashboards are used to define the alarms. There are five alarm overview dashboards: one for each of the 3 SATs, one for the LAT, and one for the site. The panels show green or red depending on defined thresholds to illuminate which subsystem is in an error state. These thresholds are defined in Grafana via “alert rules”; all alert rules can be viewed on a single page on Grafana, providing an overview of the alarm states of all observatory systems (Figure 4). Since Grafana alert rules are easily duplicated, setting new alerts is easy and can be centralized to one person to avoid miscommunication. Grafana generates the alarms used by SO based on these alert rules. Grafana’s “contact points” determine where alert notifications are sent.
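Conceptually, an alert rule is a periodic query against InfluxDB plus a threshold condition. A minimal Python sketch of the equivalent check, reusing the illustrative feed and field names from the Agent sketch above (the database name, measurement, and threshold are assumptions for illustration):

```python
from influxdb import InfluxDBClient  # Python client for InfluxDB 1.x

# Hypothetical database/measurement/field names; the real names are set by the
# ocs Agents and the Influx Publisher Agent.
client = InfluxDBClient(host="localhost", port=8086, database="ocs_feeds")
query = (
    'SELECT mean("still_temp_K") AS t FROM "temperatures" '
    "WHERE time > now() - 5m GROUP BY time(1m)"
)
points = list(client.query(query).get_points())

# The same logic a Grafana alert rule applies: fire if the recent mean exceeds a threshold.
THRESHOLD_K = 0.120  # 120 mK limit on the 100 mK stage
if any(p["t"] is not None and p["t"] > THRESHOLD_K for p in points):
    print("ALERT: DR 100 mK stage above 120 mK")
```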

3.2.2 Alert Contact Points

SO uses two primary contact points: Slack and campana. The SO collaboration uses Slack as a primary messaging tool and an essential part of coordinating observatory operations; sending alerts via Slack allows for user-friendly notification. Each group of alarms is connected to a separate Slack channel using webhooks pushed from Grafana to Slack. Each user can join channels relevant to systems they want to monitor to receive notifications and respond accordingly.
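For illustration, the underlying mechanism is a short HTTP POST to a Slack incoming webhook; Grafana performs this automatically for each contact point, but the same call can be made directly, e.g., to test a channel (the webhook URL and message text below are placeholders):

```python
import requests

# Placeholder webhook URL; each alarm group has its own Slack channel/webhook.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

payload = {
    "text": ":red_circle: [cryogenics] DR 100 mK stage > 120 mK "
            "-- see the alarm overview dashboard for details."
}
response = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10)
response.raise_for_status()
```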

Figure 5: The campana webpage where ROCs can enter their information, subscribe to alert groups, and toggle active notification methods. The inset shows the drop-down menu of groups to select. In this example, the user would receive text messages, but not emails and phone calls.

Slack receives all alerts, regardless of priority level. Since any of the >500 possible alerts can be sent to Slack, the Slack channels can become congested; thus, ROCs can easily miss notifications or develop notification fatigue. To mitigate these issues, we developed a second contact point, campana, which is a software package that subscribes to alerts and sends notifications via email, SMS, and phone calls. Users subscribe to alert groups on campana, filtering out alarms unrelated to their expertise. The four high-priority alarms described in Section 2.1 are distributed via phone calls to ensure immediate attention. campana consists of three code repositories: the server backend, the web frontend, and the core library.

The campana backend consists of a REST API, written using the Flask framework, that receives alert information in the form of JSON data from Grafana via the HTTP POST method. The Flask server then publishes the data to a Redis (https://redis.io/) database. We use Redis as an alert queue system, using Pub/Sub as the messaging paradigm to which the core library subscribes. Thus, Redis bridges the connection between alerts emitted by Grafana and the users receiving alerts distributed by the core campana library.
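A minimal sketch of this kind of webhook receiver is shown below; the endpoint path, channel name, and ports are illustrative rather than campana's actual configuration.

```python
import json

import redis
from flask import Flask, request

app = Flask(__name__)
queue = redis.Redis(host="localhost", port=6379)

@app.route("/alerts", methods=["POST"])
def receive_alert():
    # Grafana POSTs the alert payload as JSON; publish it to the Redis channel
    # that the notification daemon subscribes to.
    alert = request.get_json(force=True)
    queue.publish("alerts", json.dumps(alert))
    return {"status": "queued"}, 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```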

The campana frontend, shown in Figure 5, is developed using Vue (https://vuejs.org/). Users can enter their email addresses and phone numbers as well as subscribe to groups of alerts they wish to receive. Subscriptions are handled in Vue by toggling notification methods and pre-defined groups. This easily allows ROCs to begin their shift by turning on their notification methods. Users can turn on phone calls triggered by the most critical, time-sensitive alarms. The user address and subscription information is stored in an SQLite (https://www.sqlite.org/) database. The Flask server used for the backend is also used as the API for the web frontend; it reads the HTTP requests from the Vue interface and updates the SQLite database.
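For illustration, a minimal sketch of such a subscription store with SQLite; the schema, group names, and contact details are hypothetical and not necessarily campana's layout.

```python
import sqlite3

conn = sqlite3.connect("subscriptions.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS subscriptions (
           user TEXT, email TEXT, phone TEXT,
           groups TEXT, emails INTEGER, sms INTEGER, calls INTEGER)"""
)
# A ROC starting a shift: subscribed to two groups, SMS on, email and calls off.
conn.execute(
    "INSERT INTO subscriptions VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("roc_on_shift", "roc@example.edu", "+15555550100", "sat1,site", 0, 1, 0),
)
conn.commit()

# Who should receive SMS for an alert in the 'sat1' group?
rows = conn.execute(
    "SELECT phone FROM subscriptions WHERE sms = 1 AND groups LIKE ?", ("%sat1%",)
).fetchall()
print(rows)  # [('+15555550100',)]
```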

The core campana library consists of classes and methods to interact with the software required for alert formatting and distribution. The API is used by campanad, a systemd service that performs the above functions. As a systemd service, the alert system will function as long as the site computer running campana is operational. Each JSON-formatted alert from the Redis server contains information including what thresholds are being triggered and which groups to notify. campanad is subscribed to the Redis server and transforms the alert from JSON to text appropriate for email or SMS messaging. campana also reads the SQLite database for subscription information to send alerts to the appropriate users via the active notification methods. campana uses the Gmail SMTP server to send emails and Twilio (https://www.twilio.com/) to send SMS messages and place phone calls. Phone calls are generated from the alerts using Twilio’s text-to-speech, but only contain the name of the alert that is firing to indicate which subsystem to investigate. Each notification is distributed according to the groups labeled by the alert and the methods users have activated.
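A condensed sketch of this kind of daemon loop is shown below, using the standard Redis, smtplib, and Twilio client APIs; the credentials, channel name, recipients, and message formatting are placeholders and do not reflect campanad's internals.

```python
import json
import smtplib
from email.message import EmailMessage

import redis
from twilio.rest import Client

queue = redis.Redis(host="localhost", port=6379)
pubsub = queue.pubsub()
pubsub.subscribe("alerts")

twilio = Client("ACCOUNT_SID", "AUTH_TOKEN")  # placeholder credentials

def send_email(subject, body, to_addr):
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, "alerts@example.edu", to_addr
    msg.set_content(body)
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login("alerts@example.edu", "app-password")
        smtp.send_message(msg)

for message in pubsub.listen():
    if message["type"] != "message":
        continue
    alert = json.loads(message["data"])
    text = f"[{alert.get('group', 'site')}] {alert.get('title', 'alarm firing')}"
    # In practice, recipients and methods come from the SQLite subscription table.
    send_email("SO alarm", text, "roc@example.edu")
    twilio.messages.create(to="+15555550100", from_="+15555550101", body=text)
```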

To ensure the robustness of the alarm system, the software can be automatically restarted after unexpected shutdowns due to power outages. The campana backend and frontend run in Docker containers that are configured to start on reboot. The campanad service also emits a Slack notification if it crashes for any reason. However, network outages will cause interruptions, since a connection from the site to the internet is required for distribution using the services described above. This also means that local site engineers will not receive alerts while at the site if the network is interrupted. SO recently established a fiber connection to North America through the ALMA telescope site, and we also have a backup radio link to a low site to maintain network connectivity in the event of fiber issues.

4 Deployment at Site

The SO alarm system described in Section 3 was first tested in-lab at Yale University and is currently deployed on-site as an integral part of observatory operations. The system monitors and alerts on data from all four telescopes and the site. Each telescope uses its own computing node, which hosts the ocs Agents. [11, 13] The alarm system software (Grafana, InfluxDB, campana, etc.) runs on a special computing node designated for site services. While the core function is the same throughout the observatory, we separate alarms into six overarching groups: one for the LAT, one for each of the 3 SATs, one for general site metrics, and one for data packaging/processing. Each group has its own Grafana alarm dashboard, campana notification group, and Slack notification channel. This allows easier separation of tasks for each group of SO researchers.

The alarm system has been operational since September 2023. When power and network connectivity were available, the system did not crash during this time and successfully emitted every triggered alert. The number of alert rules on Grafana has gradually increased to over 500 individual alerts at the time of writing. Many improvements have been added to the alarm system, including the addition of groups. The alert groups, which users can subscribe to using campana as described in Section 3.2.2, help prevent notification fatigue since each telescope’s team usually does not need to be aware of another’s status. We also make use of Grafana’s silencing feature to turn off alarms during situations such as DR cooldowns or warmups, which would otherwise trigger many unnecessary alerts. We have also implemented alerts for the observing scheduler, which is neither an ocs Agent nor part of the data processing pipeline. [14] This allows researchers to be aware of situations quickly and not lose precious observation time.

Figure 6: Slack message showing an alert that caused researchers to investigate the telescope. The alert is sent to one of the channels appointed for receiving alarms. The notification is formatted to provide context on the condition that caused the trigger. In this case, the alert was triggered when the PTC helium pressure changed by >30 psi over a week. Researchers found a leak from a high-pressure helium line for the PTC that cools the DR.

We have had significant situations in which alarms helped save observation time and prevent irreparable damage to the telescopes. For example, schedule crash alerts have been especially useful for ROCs. Another common situation is the high disk usage alarm, as described in Table 1. In one critical situation, there was a pressure leak in the PTC helium compressor lines. A small leak is usually difficult to discover because the pressure decreases slowly over a long period. We use an alarm that triggers when the PTC helium pressure changes by >30 psi over the span of a week. The Slack notification, as shown in Figure 6, notified researchers of this situation, prompting personnel to inspect the telescope to find the location of the leak.
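As an illustration of this type of rate-of-change rule, here is a minimal sketch of the week-long pressure-drop check on synthetic data (in practice the rule is evaluated by Grafana against the InfluxDB feed):

```python
import numpy as np

def pressure_drop_alarm(times_s, pressures_psi, window_s=7 * 24 * 3600, max_drop_psi=30.0):
    """Return True if the PTC helium pressure dropped by more than max_drop_psi
    over the trailing window (one week here, matching the alarm described above)."""
    recent = times_s >= times_s[-1] - window_s
    return (pressures_psi[recent].max() - pressures_psi[-1]) > max_drop_psi

# Synthetic data: a slow leak of ~0.2 psi/hour over ten days of hourly samples.
t = np.arange(0, 10 * 24 * 3600, 3600, dtype=float)
p = 235.0 - 0.2 * (t / 3600.0)
print(pressure_drop_alarm(t, p))  # True: ~34 psi lost over the trailing week
```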

We continuously add alarms as new hardware or new situations appear. At the time of writing, the deployment of data packaging and data processing alarms is in progress. We are developing our software to acquire detector metrics and create alarms to determine detector quality and analysis adequacy. We are working with the initial data processing to define the metrics from each step of the processing pipeline and continue to add visualizations to those steps. The alarm system continues to scale with the growing needs of the observatory.

5 Summary

We have presented an overview of the Simons Observatory alarm system, observation data processing and related data quality monitoring, and the deployment of these systems on-site. With the use of software tools such as Grafana, InfluxDB, and Slack, combined with packages written for SO such as ocs and campana, we employ a robust alarm system that allows successful observatory operations. Due to the many intricacies of a full-fledged observatory, many faults may cause loss of observing time, data corruption, and hardware damage. With the alarm system in place, we can promptly react to and prevent these issues.

We have presented the visualizations of the data processing pipeline, demonstrating additional capabilities for data quality monitoring. Using plots produced by sotodlib scripts running on automatic prefect schedules, we inspect the process for poor data due to internal (e.g., detector performance) and external (e.g., bad weather) factors. With metrics such as the number of detectors cut during the processing steps, we can emit alerts and catch these issues early in the pipeline.

As with much of the SO infrastructure, the alarm system can continue to scale up to meet the needs of efficient and satisfactory operations. This system will expand to assist with monitoring new SATs from SO:UK and SO:Japan, along with additional detectors and a renewable energy system planned for Advanced Simons Observatory (ASO).

Appendix A ACRONYMS

The acronyms used in this paper are described in Table 2.

ACKNOWLEDGMENTS

This work was funded by the Simons Foundation (Award #457687, B.K.) and Yale University. We would like to thank the communities of the many open-source packages in use with campana and sotodlib.

References

  • [1] Aghanim, N., Akrami, Y., Ashdown, M., Aumont, J., Baccigalupi, C., Ballardini, M., Banday, A. J., Barreiro, R. B., Bartolo, N., Basak, S., Battye, R., Benabed, K., Bernard, J.-P., Bersanelli, M., Bielewicz, P., Bock, J. J., Bond, J. R., Borrill, J., Bouchet, F. R., Boulanger, F., Bucher, M., Burigana, C., Butler, R. C., Calabrese, E., Cardoso, J.-F., Carron, J., Challinor, A., Chiang, H. C., Chluba, J., Colombo, L. P. L., Combet, C., Contreras, D., Crill, B. P., Cuttaia, F., de Bernardis, P., de Zotti, G., Delabrouille, J., Delouis, J.-M., Di Valentino, E., Diego, J. M., Doré, O., Douspis, M., Ducout, A., Dupac, X., Dusini, S., Efstathiou, G., Elsner, F., Enßlin, T. A., Eriksen, H. K., Fantaye, Y., Farhang, M., Fergusson, J., Fernandez-Cobos, R., Finelli, F., Forastieri, F., Frailis, M., Fraisse, A. A., Franceschi, E., Frolov, A., Galeotta, S., Galli, S., Ganga, K., Génova-Santos, R. T., Gerbino, M., Ghosh, T., González-Nuevo, J., Górski, K. M., Gratton, S., Gruppuso, A., Gudmundsson, J. E., Hamann, J., Handley, W., Hansen, F. K., Herranz, D., Hildebrandt, S. R., Hivon, E., Huang, Z., Jaffe, A. H., Jones, W. C., Karakci, A., Keihänen, E., Keskitalo, R., Kiiveri, K., Kim, J., Kisner, T. S., Knox, L., Krachmalnicoff, N., Kunz, M., Kurki-Suonio, H., Lagache, G., Lamarre, J.-M., Lasenby, A., Lattanzi, M., Lawrence, C. R., Le Jeune, M., Lemos, P., Lesgourgues, J., Levrier, F., Lewis, A., Liguori, M., Lilje, P. B., Lilley, M., Lindholm, V., López-Caniego, M., Lubin, P. M., Ma, Y.-Z., Macías-Pérez, J. F., Maggio, G., Maino, D., Mandolesi, N., Mangilli, A., Marcos-Caballero, A., Maris, M., Martin, P. G., Martinelli, M., Martínez-González, E., Matarrese, S., Mauri, N., McEwen, J. D., Meinhold, P. R., Melchiorri, A., Mennella, A., Migliaccio, M., Millea, M., Mitra, S., Miville-Deschênes, M.-A., Molinari, D., Montier, L., Morgante, G., Moss, A., Natoli, P., Nørgaard-Nielsen, H. U., Pagano, L., Paoletti, D., Partridge, B., Patanchon, G., Peiris, H. V., Perrotta, F., Pettorino, V., Piacentini, F., Polastri, L., Polenta, G., Puget, J.-L., Rachen, J. P., Reinecke, M., Remazeilles, M., Renzi, A., Rocha, G., Rosset, C., Roudier, G., Rubiño-Martín, J. A., Ruiz-Granados, B., Salvati, L., Sandri, M., Savelainen, M., Scott, D., Shellard, E. P. S., Sirignano, C., Sirri, G., Spencer, L. D., Sunyaev, R., Suur-Uski, A.-S., Tauber, J. A., Tavagnacco, D., Tenti, M., Toffolatti, L., Tomasi, M., Trombetti, T., Valenziano, L., Valiviita, J., Van Tent, B., Vibert, L., Vielva, P., Villa, F., Vittorio, N., Wandelt, B. D., Wehus, I. K., White, M., White, S. D. M., Zacchei, A., and Zonca, A., “Planck2018 results: Vi. cosmological parameters,” Astronomy & Astrophysics 641, A6 (Sept. 2020).
  • [2] Galitzki, N., Tsan, T., Spisak, J., Randall, M., Silva-Feaver, M., Seibert, J., Lashner, J., Adachi, S., Adkins, S. M., Alford, T., Arnold, K., Ashton, P. C., Austermann, J. E., Baccigalupi, C., Bazarko, A., Beall, J. A., Bhimani, S., Bixler, B., Coppi, G., Corbett, L., Crowley, K. D., Crowley, K. T., Day-Weiss, S., Dicker, S., Dow, P. N., Duell, C. J., Duff, S. M., Gerras, R. G., Groh, J. C., Gudmundsson, J. E., Harrington, K., Hasegawa, M., Healy, E., Henderson, S. W., Hubmayr, J., Iuliano, J., Johnson, B. R., Keating, B., Keller, B., Kiuchi, K., Kofman, A. M., Koopman, B. J., Kusaka, A., Lee, A. T., Lew, R. A., Lin, L. T., Link, M. J., Lucas, T. J., Lungu, M., Mangu, A., McMahon, J. J., Miller, A. D., Moore, J. E., Morshed, M., Nakata, H., Nati, F., Newburgh, L. B., Nguyen, D. V., Niemack, M. D., Page, L. A., Sakaguri, K., Sakurai, Y., Rao, M. S., Saunders, L. J., Shroyer, J. E., Sugiyama, J., Tajima, O., Takeuchi, A., Bua, R. T., Teply, G., Terasaki, T., Ullom, J. N., Lanen, J. L. V., Vavagiakis, E. M., Vissers, M. R., Walters, L., Wang, Y., Xu, Z., Yamada, K., and Zheng, K., “The simons observatory: Design, integration, and testing of the small aperture telescopes,” (2024).
  • [3] Zhu, N., Bhandarkar, T., Coppi, G., Kofman, A. M., Orlowski-Scherer, J. L., Xu, Z., Adachi, S., Ade, P., Aiola, S., Austermann, J., Bazarko, A. O., Beall, J. A., Bhimani, S., Bond, J. R., Chesmore, G. E., Choi, S. K., Connors, J., Cothard, N. F., Devlin, M., Dicker, S., Dober, B., Duell, C. J., Duff, S. M., Dünner, R., Fabbian, G., Galitzki, N., Gallardo, P. A., Golec, J. E., Haridas, S. K., Harrington, K., Healy, E., Ho, S.-P. P., Huber, Z. B., Hubmayr, J., Iuliano, J., Johnson, B. R., Keating, B., Kiuchi, K., Koopman, B. J., Lashner, J., Lee, A. T., Li, Y., Limon, M., Link, M., Lucas, T. J., McCarrick, H., Moore, J., Nati, F., Newburgh, L. B., Niemack, M. D., Pierpaoli, E., Randall, M. J., Sarmiento, K. P., Saunders, L. J., Seibert, J., Sierra, C., Sonka, R., Spisak, J., Sutariya, S., Tajima, O., Teply, G. P., Thornton, R. J., Tsan, T., Tucker, C., Ullom, J., Vavagiakis, E. M., Vissers, M. R., Walker, S., Westbrook, B., Wollack, E. J., and Zannoni, M., “The simons observatory large aperture telescope receiver,” The Astrophysical Journal Supplement Series 256, 23 (Sept. 2021).
  • [4] Gudmundsson, J. E., Gallardo, P. A., Puddu, R., Dicker, S. R., Adler, A. E., Ali, A. M., Bazarko, A., Chesmore, G. E., Coppi, G., Cothard, N. F., Dachlythra, N., Devlin, M., Dünner, R., Fabbian, G., Galitzki, N., Golec, J. E., Patty Ho, S.-P., Hargrave, P. C., Kofman, A. M., Lee, A. T., Limon, M., Matsuda, F. T., Mauskopf, P. D., Moodley, K., Nati, F., Niemack, M. D., Orlowski-Scherer, J., Page, L. A., Partridge, B., Puglisi, G., Reichardt, C. L., Sierra, C. E., Simon, S. M., Teply, G. P., Tucker, C., Wollack, E. J., Xu, Z., and Zhu, N., “The simons observatory: modeling optical systematics in the large aperture telescope,” Applied Optics 60, 823 (Jan. 2021).
  • [5] McCarrick, H., Healy, E., Ahmed, Z., Arnold, K., Atkins, Z., Austermann, J. E., Bhandarkar, T., Beall, J. A., Bruno, S. M., Choi, S. K., Connors, J., Cothard, N. F., Crowley, K. D., Dicker, S., Dober, B., Duell, C. J., Duff, S. M., Dutcher, D., Frisch, J. C., Galitzki, N., Gralla, M. B., Gudmundsson, J. E., Henderson, S. W., Hilton, G. C., Ho, S.-P. P., Huber, Z. B., Hubmayr, J., Iuliano, J., Johnson, B. R., Kofman, A. M., Kusaka, A., Lashner, J., Lee, A. T., Li, Y., Link, M. J., Lucas, T. J., Lungu, M., Mates, J. A. B., McMahon, J. J., Niemack, M. D., Orlowski-Scherer, J., Seibert, J., Silva-Feaver, M., Simon, S. M., Staggs, S., Suzuki, A., Terasaki, T., Thornton, R., Ullom, J. N., Vavagiakis, E. M., Vale, L. R., Van Lanen, J., Vissers, M. R., Wang, Y., Wollack, E. J., Xu, Z., Young, E., Yu, C., Zheng, K., and Zhu, N., “The simons observatory microwave squid multiplexing detector module design,” The Astrophysical Journal 922, 38 (Nov. 2021).
  • [6] Henderson, S. W., Ahmed, Z., Austermann, J., Becker, D., Bennett, D. A., Brown, D., Chaudhuri, S., Cho, H.-M. S., D’Ewart, J. M., Dober, B., Duff, S. M., Dusatko, J. E., Fatigoni, S., Frisch, J. C., Gard, J. D., Halpern, M., Hilton, G. C., Hubmayr, J., Irwin, K. D., Karpel, E. D., Kernasovskiy, S. S., Kuenstner, S. E., Kuo, C.-L., Li, D., Mates, J. A. B., Reintsema, C. D., Smith, S. R., Ullom, J., Vale, L. R., Winkle, D. D. V., Vissers, M., and Yu, C., “Highly-multiplexed microwave SQUID readout using the SLAC Microresonator Radio Frequency (SMuRF) electronics for future CMB and sub-millimeter surveys,” in [Millimeter, Submillimeter, and Far-Infrared Detectors and Instrumentation for Astronomy IX ], Zmuidzinas, J. and Gao, J.-R., eds., 10708, 1070819, International Society for Optics and Photonics, SPIE (2018).
  • [7] Kernasovskiy, S. A., Kuenstner, S. E., Karpel, E., Ahmed, Z., Van Winkle, D. D., Smith, S., Dusatko, J., Frisch, J. C., Chaudhuri, S., Cho, H. M., Dober, B. J., Henderson, S. W., Hilton, G. C., Hubmayr, J., Irwin, K. D., Kuo, C. L., Li, D., Mates, J. A. B., Nasr, M., Tantawi, S., Ullom, J., Vale, L., and Young, B., “Slac microresonator radio frequency (smurf) electronics for read out of frequency-division-multiplexed cryogenic sensors,” Journal of Low Temperature Physics 193, 570–577 (May 2018).
  • [8] The Simons Observatory Collaboration, “The simons observatory: science goals and forecasts,” Journal of Cosmology and Astroparticle Physics 2019, 056–056 (Feb. 2019).
  • [9] Koopman, B. J., Lashner, J., Saunders, L. J., Hasselfield, M., Bhandarkar, T., Bhimani, S., Choi, S. K., Duell, C. J., Galitzki, N., Harrington, K., Hincks, A. D., Ho, S.-P. P., Newburgh, L., Reichardt, C. L., Seibert, J., Spisak, J., Westbrook, B., Xu, Z., and Zhu, N., “The Simons Observatory: overview of data acquisition, control, monitoring, and computer infrastructure,” in [Software and Cyberinfrastructure for Astronomy VI ], Guzman, J. C. and Ibsen, J., eds., 11452, 1145208, International Society for Optics and Photonics, SPIE (2020).
  • [10] Yamada, K., Bixler, B., Sakurai, Y., Ashton, P. C., Sugiyama, J., Arnold, K., Begin, J., Corbett, L., Day-Weiss, S., Galitzki, N., Hill, C. A., Johnson, B. R., Jost, B., Kusaka, A., Koopman, B. J., Lashner, J., Lee, A. T., Mangu, A., Nishino, H., Page, L. A., Randall, M. J., Sasaki, D., Song, X., Spisak, J., Tsan, T., Wang, Y., and Williams, P. A., “The Simons Observatory: Cryogenic half wave plate rotation mechanism for the small aperture telescopes,” Review of Scientific Instruments 95, 024504 (02 2024).
  • [11] Koopman, B. J., “The Simons Observatory: Deployment of the Observatory Control System and supporting infrastructure,” in [Software and Cyberinfrastructure for Astronomy VIII ], International Society for Optics and Photonics, SPIE (in press).
  • [12] Guan, Y., “Simons Observatory: Observatory Scheduler and Automated Data Processing,” in [Software and Cyberinfrastructure for Astronomy VIII ], International Society for Optics and Photonics, SPIE (in press).
  • [13] Bhimani, S., “The Simons Observatory: Deployment and current configuration of the Observatory Control System for SAT-MF1 and data access software systems,” in [Millimeter, Submillimeter, and Far-Infrared Detectors and Instrumentation for Astronomy XI ], International Society for Optics and Photonics, SPIE (in press).
  • [14] Sakuma, T., Guan, Y., Hasselfield, M., Koopman, B., Newburgh, L., and Nguyen, D., “Nextline.” https://doi.org/10.5281/zenodo.11451619 (2020).
  • [15] Saunders, L. J., Hasselfield, M., Koopman, B. J., and Newburgh, L., “The Simons Observatory: antenna control software integration and implementation,” in [Millimeter, Submillimeter, and Far-Infrared Detectors and Instrumentation for Astronomy XI ], Zmuidzinas, J. and Gao, J.-R., eds., 12190, 121902P, International Society for Optics and Photonics, SPIE (2022).
Table 1: Example alarms and their conditions for each alarm group. Alarms marked with an asterisk (*) are high-severity alarms that trigger phone calls.

| Alarm Group | Metric | Alarm Condition | System Failure Prevention |
|---|---|---|---|
| computing | site computer disk usage | >90% | A full disk causes software such as ocs Agents to crash, and HK/observational data can be lost. Experts can clear disk space before this happens. |
| SMuRF | SMuRF FPGA current (A) and temperature (°C) | (>50 A and >100 °C) or (>53 A and >73 °C) | While the SMuRFs have internal shutdown limits, these alarms allow ROCs to stop operations before hardware damage. |
| cryogenics* | DR temperature sensor | >120 mK | Since the TESs need to be superconducting, high temperatures degrade data quality. ROCs can address DR issues before more observations continue. |
| HWP | HWP spin frequency | >3 Hz | HWP frequencies higher than spec can cause hardware damage. ROCs can spin down the HWP before this happens. |
| power | UPS state | “on battery” | During power outages, UPSs must prevent crucial hardware from losing power. Along with automatic shutdown procedures, ROCs can safely stop operations before hardware damage. |
| platform | ACU lockout status | “platform remote control locked out” | Controlled by the ACU Agent [15], the telescope platform requires maintenance by the site crew in certain situations. For safety, ROCs cannot remotely move the telescope when the platform is locked out. |
| agents | data acquisition status | “failed” | If any Agents crash for various reasons, these alarms allow ROCs to take the necessary actions to reboot them. |
| timing | timing device GPS sync | “unsynchronized” | This system is critical for timing synchronization of HK/observational data. When the central timing device becomes unsynchronized from GPS, ROCs can recognize the cause of any inconsistencies in the signal timestamps. |
| environmental* | site wind speed gusts | >70 km/hr | While the site crew is aware of wind speed, ROCs must also be informed since the telescope platform is rated to move only within certain conditions. This allows ROCs to interrupt observations before hardware damage. |
Table 2: Acronyms.
Acronym Definition
ACU Antenna Control Unit
API Application Programming Interface
ASO Advanced Simons Observatory
CMB Cosmic microwave background
CPU Central processing unit
DR Dilution refrigerator
FPGA Field Programmable Gate Array
HK Housekeeping
HTTP Hypertext Transfer Protocol
HWP Half-wave plate
IRIG Inter-range instrumentation group timecodes
JSON JavaScript Object Notation
LAT Large-aperture telescope
ocs Observatory Control System
PTC Pulse Tube Cryocoolers
PTP Precision Time Protocol
PWV Precipitable water vapor
REST Representational State Transfer
ROC Remote Observing Coordinator
SAT Small-aperture telescope
SMS Short Message Service
SMTP Simple Mail Transfer Protocol
SMuRF SLAC Microresonator Radio Frequency
SO Simons Observatory
TES Transition edge sensor
μMUX Microwave multiplexing
UPS Uninterruptible Power Supply