Metrics¶
User-defined metrics¶
To log time-series metrics, use the log_metrics method, for example:
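A minimal sketch (the run name and metric names here are purely illustrative):

from simvue import Run

with Run() as run:
    run.init(name="example-run")

    # Log two metric values at the current point in time
    run.log_metrics({"loss": 0.25, "accuracy": 0.91})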
The log_metrics method can be called as many times as necessary during a run, and the time of each call is recorded with microsecond precision. The timestamp can be overridden if necessary, for example if the metrics are being extracted from another source with its own timestamps. For example:
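A sketch with illustrative values (the exact timestamp format accepted should be checked against the client documentation):

# Attach an explicit timestamp taken from an external source
run.log_metrics(
    {"temperature": 300.2},
    timestamp="2024-05-01 12:00:00.000000",
)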
The relative time since init() was called is also recorded. The relative time can be set manually if needed, e.g.:
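A sketch with an illustrative value:

# Set the relative time (since init) explicitly
run.log_metrics({"temperature": 300.4}, time=0.05)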
where time is a floating point number.
Furthermore, an integer step is recorded for each metric; typically it represents some measure of the progress of the simulation. By default this starts at 0 and increments by 1 each time log_metrics is called.
Alternatively, it can be defined explicitly, e.g.:
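A sketch with illustrative values:

# Record this metric against an explicit step number
run.log_metrics({"residual": 1.3e-4}, step=100)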
Naming metrics¶
It can be useful to employ similar prefixes for metric names. The web interface allows you to group metrics with the same prefix together into a single plot. For example, metrics with names:
residuals.Ux
residuals.Uy
residuals.Uz
would be displayed on the same panel. In order to distinguish between sub-categories we can extend this prefix, for example in the following case:
residuals.Ux
residuals.Uy
residuals.Uz
residuals.p
we can separate the components of U from p:
residuals.U.x
residuals.U.y
residuals.U.z
residuals.p
This results in residuals.U.x, residuals.U.y and residuals.U.z being displayed in one panel and residuals.p in another.
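For instance, such metrics could be logged in a single call (values are illustrative):

run.log_metrics({
    "residuals.U.x": 1.2e-3,
    "residuals.U.y": 3.4e-3,
    "residuals.U.z": 0.9e-3,
    "residuals.p": 5.6e-2,
})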
Note
You can of course select exactly what you want displayed in a metrics plot, but for the default view, which displays all metrics on a single page, the dot notation is taken into account.
Resource usage metrics¶
Resource usage metrics are collected automatically by the Python client (unless disabled using the config method). These consist of:
resources/cpu.usage.percent: CPU usage as a percentage, where 100% indicates one CPU is 100% utilised. For example, 800% indicates that 8 CPUs are fully utilised.
resources/memory.usage: memory usage in MB.
resources/gpu.utilisation.percent.i: GPU utilisation as a percentage.
resources/gpu.memory.percent.i: GPU memory utilisation as a percentage.
In the above, i is the GPU index. If multiple GPUs are used, metrics will be available for each separately.
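To disable collection of these metrics, a call of the following form can be used; note that the parameter name disable_resources_metrics is an assumption and should be checked against the config method's documentation:

# disable_resources_metrics is an assumed parameter name
run.config(disable_resources_metrics=True)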
By default the resource usage of the Python script itself is monitored (including any processes added by Run.add_process). To monitor an external code execution not handled by the client, for example a Fortran or C++ simulation code, the (parent) PID needs to be specified using the set_pid method of the Run class, e.g.:
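A minimal sketch, assuming the monitoring script was launched by the simulation executable so that the parent PID is the process to monitor (the run name is illustrative):

import os

from simvue import Run

with Run() as run:
    # Register the parent process (e.g. the simulation executable) for monitoring
    run.set_pid(os.getppid())
    run.init(name="external-simulation-run")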
Note that this needs to be done before calling the init method.
Important
There is currently a limitation for MPI jobs: only the resource usage of the processes on the node running the Python client will be measured, not those on other nodes (in the case of multi-node jobs). This limitation will hopefully be removed in the future.
Emissions Metrics¶
The Simvue server supports the recording of CO2 emission estimates, collected as additional metrics by the simvue.Run class. The feature makes use of the CO2 Signal API to obtain a CO2 intensity value for the user's current region, using the CPU and GPU percentage utilisation to calculate a rough estimate of the equivalent CO2 a run has produced.
Relative not absolute comparison
The values given by emission metrics should not be taken as accurate representations of the exact CO2 emission. These values are intended only to demonstrate the relative impact different choices can have (e.g. aborting simulations based on termination criteria) in terms of environmental impact.
Calculation¶
The CPU/GPU percentage used for calculating emissions is the percentage of CPU/GPU time the monitored processes used during the interval between the previous measurement and the current time. Firstly, the total power consumption, \(P\), is calculated:
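Assuming the CPU and GPU contributions simply add, this takes the approximate form (a sketch rather than necessarily the exact expression used):

\[ P = \mathrm{CPU}_{\%}\,\mathrm{TDP}_{\mathrm{CPU}} + N_{\mathrm{GPU}}\,\overline{\mathrm{GPU}_{\%}}\,\mathrm{TDP}_{\mathrm{GPU}} \]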
where \(\mathrm{TDP}_{\mathrm{CPU}}\), \(\mathrm{TDP}_{\mathrm{GPU}}\), \(\mathrm{CPU}_{\%}\), \(\overline{\mathrm{GPU}_{\%}}\), and \(N_{\mathrm{GPU}}\) are the CPU TDP, GPU TDP, CPU percentage, average GPU percentage across all GPUs, and the number of GPUs respectively. The equivalent \(\mathrm{CO}_{2}\) production in kg is then calculated as:
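Up to unit conversions this is the product of the power, the carbon intensity and the elapsed time (again a sketch):

\[ \mathrm{CO}_{2} = P \, I_{\mathrm{C}} \, \Delta t \]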
where \(I_{\mathrm{C}}\) is the carbon intensity and \(\Delta t\) the time interval of the measurement.
Configuration¶
To use the feature you will need to provide a CO2 Signal API token available here, and update your simvue.toml configuration file to include an additional eco section:
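A sketch of such a section; the key name co2_signal_api_token is an assumption and should be checked against the configuration reference:

[eco]
co2_signal_api_token = "<your-token>"  # assumed key name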
In addition, it is strongly recommended that the configuration reflects the specifications of your system by specifying the CPU and GPU Thermal Design Power (TDP) values; this information can easily be found online. If not specified, the arbitrary values of 80W and 130W are used:
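For example (the key names cpu_thermal_design_power and gpu_thermal_design_power are assumptions, and the wattages shown are illustrative):

[eco]
co2_signal_api_token = "<your-token>"
cpu_thermal_design_power = 95   # assumed key name, value in watts
gpu_thermal_design_power = 250  # assumed key name, value in watts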
The rate of requests to the CO2 Signal API is limited to 30 per hour. As multiple Simvue runs will utilise the same API token, the Python API caches the CO2 intensity value locally, only refreshing it at the specified interval, the default being every day. This interval can be set manually in the configuration, either as an integer (in seconds) or as a time descriptor string. The location of the file containing the cached value is set to be $SIMVUE_OFFLINE_DIRECTORY (default is $HOME/.simvue); this can also be updated:
[eco]
...
intensity_refresh_interval = "2 days" # or alternatively 172800
local_data_directory = "/home/shared/simvue_cache"
Finally you can skip use of the CO2 Signal API altogether by specifying a value for the CO2 intensity within the configuration file:
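For example (the key name co2_intensity is an assumption, as is the example value and its units):

[eco]
co2_intensity = 0.04  # assumed key name and illustrative value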
Enabling for Runs¶
Emission metrics are disabled by default, but can be enabled using the config method:
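A sketch; the parameter name enable_emission_metrics should be verified against the config method's documentation:

# enable_emission_metrics is assumed to be the relevant option
run.config(enable_emission_metrics=True)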
Offline Run Emissions¶
For offline runs the currently cached CO2 Signal API value is used to calculate emission estimates. If simvue_sender is then called on a system that has internet access, this will instead ensure the CO2 intensity value is kept updated.