
Energy Systems Runtime Observatory (ESRO)

A lightweight, open-source, real-time monitoring platform that provides unified observability across local machines, Docker containers, and HPC clusters. It is built to close the visibility gap that standard infrastructure tools leave open for scientific AI workflows running across heterogeneous compute environments.

ESRO: Unified observability for scientific AI workflows across local, container, and HPC environments.

Motivation

Scientific AI and energy systems research increasingly require running experiments across a mix of local machines, Docker containers, and high-performance computing (HPC) clusters. However, standard monitoring tools rarely provide a unified, real-time view across all these environments. This lack of observability can lead to missed bottlenecks, harder debugging, and challenges in reproducing results.

What is ESRO?

Energy Systems Runtime Observatory (ESRO) is an open-source platform for unified, real-time monitoring of scientific workflows. It collects metrics from local systems, Docker containers, and remote HPC jobs, and presents resource usage, process health, and system metrics in a single dashboard.

  • Unified Monitoring: Real-time metrics from local, container, and HPC jobs.
  • Lightweight & Extensible: Minimal overhead, easy deployment, and plugin support for new metrics or environments.
  • Open Source: Built with Streamlit, psutil, and standard Python libraries for transparency and accessibility.
Figure 1: ESRO Architecture Flow
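
To make the plugin idea above concrete, here is a minimal sketch of how a metric-collector registry could look. It is not ESRO's actual API: the names `COLLECTORS`, `register`, and `sample` are hypothetical, and the sketch uses only the Python standard library (rather than psutil) so that it runs without extra dependencies.

```python
import os
import shutil
import time
from typing import Callable, Dict

# Hypothetical registry: maps a metric name to a zero-argument collector.
COLLECTORS: Dict[str, Callable[[], float]] = {}


def register(name: str):
    """Decorator that adds a collector function to the registry."""
    def wrap(func: Callable[[], float]) -> Callable[[], float]:
        COLLECTORS[name] = func
        return func
    return wrap


@register("cpu_count")
def cpu_count() -> float:
    # os.cpu_count() may return None on some platforms; fall back to 0.
    return float(os.cpu_count() or 0)


@register("disk_used_percent")
def disk_used_percent() -> float:
    # Disk usage of the filesystem that holds the current directory.
    usage = shutil.disk_usage(os.getcwd())
    return 100.0 * usage.used / usage.total if usage.total else 0.0


def sample(source: str) -> dict:
    """Run every registered collector and return one timestamped record."""
    record = {"source": source, "timestamp": time.time()}
    for name, collector in COLLECTORS.items():
        record[name] = collector()
    return record
```

Under this design, a new metric is added by writing one decorated function, and every environment (local, container, or HPC) can emit records with the same shape by calling `sample("local")`, `sample("docker")`, and so on.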

Quickstart

  1. Clone the repository:
    git clone https://github.com/Jayapreethi/energy-system-runtime-observatory.git
  2. Install dependencies:
    pip install -r requirements.txt
  3. Run the dashboard using Streamlit:
    streamlit run dashboard.py
Figure 2: ESRO Dashboard Display

Key Features

  • Live dashboards for CPU, memory, GPU, and network usage.
  • Process-level monitoring and anomaly detection.
  • Support for Docker, SLURM, and custom HPC schedulers.
  • Exportable logs for reproducibility and audit trails.
Figure 3: ESRO Metrics Timeline Example
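
The post does not say which anomaly detection method or log format ESRO uses. As an illustration only, the sketch below flags a metric value when it lies more than a chosen number of standard deviations from a rolling window of recent values, and writes records as JSON Lines. The class `RollingAnomalyDetector`, the function `export_jsonl`, and the default window and threshold are all assumptions made for this example.

```python
import io
import json
import statistics
from collections import deque


class RollingAnomalyDetector:
    """Flags values far from the mean of a rolling window (z-score test)."""

    def __init__(self, window: int = 30, threshold: float = 3.0):
        self.values = deque(maxlen=window)  # most recent observations
        self.threshold = threshold          # allowed distance in std devs

    def update(self, value: float) -> bool:
        """Record a value and return True if it looks anomalous."""
        is_anomaly = False
        # Wait for a few observations before judging anything.
        if len(self.values) >= 5:
            mean = statistics.fmean(self.values)
            stdev = statistics.pstdev(self.values)
            if stdev > 0:
                is_anomaly = abs(value - mean) / stdev > self.threshold
        self.values.append(value)
        return is_anomaly


def export_jsonl(records, stream) -> None:
    """Write one JSON object per line so logs are easy to diff and replay."""
    for record in records:
        stream.write(json.dumps(record, sort_keys=True) + "\n")


# Usage: score a short CPU series, then export it to an in-memory buffer.
detector = RollingAnomalyDetector(window=10, threshold=3.0)
cpu_series = [10, 11, 10, 12, 11, 10, 11, 95]
records = [{"cpu_percent": v, "anomaly": detector.update(v)} for v in cpu_series]
buffer = io.StringIO()
export_jsonl(records, buffer)
```

In this series only the final spike to 95 is flagged. A line-oriented format such as JSON Lines is one reasonable choice for audit trails because each sample can be appended as it arrives.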

Usage

  • Open the dashboard in your browser (Streamlit provides a local URL).
  • Use the sidebar to enable local system monitoring or connect to an HPC cluster.
  • For HPC jobs, enter your SSH credentials in the sidebar or use the example config below.

Example HPC Config

Reference JSON for HPC connection (see example_hpc_config.json):

{
  "host": "talon.und.edu",
  "username": "your_username",
  "password": "your_password"
}
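
A config like the one above can be loaded and checked before any connection is attempted. The following is a minimal sketch, not ESRO's own loader: `load_hpc_config` and the `ESRO_HPC_PASSWORD` environment variable are hypothetical names introduced here, the latter to show one way of keeping the password out of the JSON file.

```python
import json
import os

REQUIRED_KEYS = ("host", "username", "password")


def load_hpc_config(text: str, env=os.environ) -> dict:
    """Parse the HPC config JSON and verify that all required keys are set."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("HPC config must be a JSON object")

    # Hypothetical override: prefer a password supplied via the environment.
    env_password = env.get("ESRO_HPC_PASSWORD")
    if env_password:
        config["password"] = env_password

    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("missing HPC config keys: " + ", ".join(missing))
    return config


# Usage with the example values from this post.
example = '{"host": "talon.und.edu", "username": "your_username", "password": "your_password"}'
config = load_hpc_config(example, env={})
```

Failing early with a clear message about missing keys is usually easier to debug than an SSH error raised later from inside the dashboard.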

Why ESRO Matters

  • Improves experiment reproducibility and reliability by capturing full infrastructure context.
  • Accelerates debugging and performance tuning with real-time, unified metrics.
  • Enables sharing of infrastructure and runtime context for published results.
ESRO closes the observability gap for scientific AI, making complex, distributed workflows more transparent and reproducible.

Written by Jaya Preethi Mohan. The ESRO source code is available at https://github.com/Jayapreethi/energy-system-runtime-observatory.