About MAAS performance

The MAAS engineering team actively works to improve the performance of MAAS. This article explains how performance measurements work, and provides reference tables for the various metrics used by MAAS.

Recent performance measurements

Recently, we improved the API performance of MAAS by testing it with simulated loads, built on a fixed set of modelling assumptions.

To measure performance, we use continuous performance monitoring, arranged like this:

On a daily basis, we generate simulation data based on these assumptions, for 10, 100, and 1000 machines. These three data points help us get a sense of how our performance improvements scale. A Jenkins job exercises both the REST API and the WebSocket API, capturing the results in a database, from which we build a dashboard.
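
To illustrate the shape of such a harness (this is not the actual Jenkins tooling, and every name in it is hypothetical), a results-capture step can be as simple as appending one timing sample per run to a small database that a dashboard later trends:

# Illustrative sketch only: store one timing sample per benchmark run so a
# dashboard can plot trends per release and dataset size.
import sqlite3
import time

def record_result(db_path: str, release: str, machines: int, seconds: float) -> None:
    """Append one timing sample to the results database."""
    with sqlite3.connect(db_path) as db:
        db.execute(
            "CREATE TABLE IF NOT EXISTS timings"
            " (ts REAL, release TEXT, machines INTEGER, seconds REAL)"
        )
        db.execute(
            "INSERT INTO timings VALUES (?, ?, ?, ?)",
            (time.time(), release, machines, seconds),
        )

# e.g. record_result("perf.db", "3.2", 1000, 4.2) after timing one run
# against each of the 10/100/1000-machine datasets.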

Note that we always compare the current stable release with the release in development, so that we can spot issues before they become harder to find and fix. We also capture profiling data that allows us to find bottlenecks, generating a heatmap that shows which parts of the code are causing issues at the moment.

For example, comparing MAAS 3.2 to MAAS 3.1, machine listings load, on average, 32% faster for the datasets we’re using.

Performance efforts to date

Our performance work to date spans several larger, sustained efforts, along with a constant focus on weeding out slowdowns as we come across them.

Collecting your own metrics

It’s possible to collect your own MAAS metrics – and even share them with the MAAS engineering team. We are keen to know everything we can about machine counts, network sizes, and MAAS performance in all areas. Please use the Discourse performance forum to share your feedback and observations.
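
As a starting point, MAAS controllers expose these metrics in Prometheus text format over HTTP, so you can inspect them with nothing more than an HTTP client. The sketch below polls a metrics endpoint and prints selected samples; the host, port, and endpoint path are assumptions for illustration – check where your MAAS version exposes its metrics.

# Minimal sketch: scrape a MAAS metrics endpoint and print selected samples.
# The URL is an assumption for illustration; substitute the metrics endpoint
# your MAAS version actually exposes.
import requests

METRICS_URL = "http://maas.example.com:5240/MAAS/metrics"  # hypothetical host

def fetch_samples(prefix: str) -> list[str]:
    """Return Prometheus text-format lines whose metric name starts with prefix."""
    resp = requests.get(METRICS_URL, timeout=10)
    resp.raise_for_status()
    return [
        line for line in resp.text.splitlines()
        if line.startswith(prefix)  # skips # HELP / # TYPE comments too
    ]

if __name__ == "__main__":
    for sample in fetch_samples("maas_machines"):
        print(sample)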

Recent developments

As part of the MAAS 3.2 development effort, we have taken steps to improve the performance of machine listings. To date, we have measured listing a large number (100-1000) of machines via the REST API to be 32% faster, on average.
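
If you want to reproduce this kind of measurement against your own MAAS, a rough approach is to time repeated calls to the machine-listing endpoint. This sketch authenticates with OAuth1 using a MAAS API key (the usual consumer:key:secret triplet); the host and key shown are placeholders, and the numbers you get will depend heavily on your hardware and dataset.

# Rough timing sketch for the REST API machine listing.
# MAAS_URL and API_KEY are placeholders; substitute your own.
import statistics
import time

import requests
from requests_oauthlib import OAuth1

MAAS_URL = "http://maas.example.com:5240/MAAS"  # placeholder
API_KEY = "consumer:key:secret"                 # placeholder

consumer_key, token_key, token_secret = API_KEY.split(":")
auth = OAuth1(consumer_key, "", token_key, token_secret,
              signature_method="PLAINTEXT")

timings = []
for _ in range(10):
    start = time.monotonic()
    resp = requests.get(f"{MAAS_URL}/api/2.0/machines/", auth=auth, timeout=120)
    resp.raise_for_status()
    timings.append(time.monotonic() - start)

print(f"mean {statistics.mean(timings):.2f}s over {len(timings)} runs")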

Next steps

Currently, we are actively working to improve MAAS performance for other operations, such as search.

Tables of available metrics

Currently, there are three types of available metrics: cluster metrics, node performance metrics, and performance metrics. The tables below are generally self-explanatory.

Cluster metrics

All cluster metrics are gauges labelled with maas_id (the MAAS cluster UUID); any additional labels are shown in parentheses after the metric name.

maas_machines (status: machine status)
  Number of machines known by MAAS, by status. Unit: count of machines.
  Alerts: any fast variation in the number of machines should trigger an alert.

maas_nodes (type: node type, i.e. machine, device, or controller)
  Number of nodes known by MAAS, per type. Unit: count of nodes.
  Alerts: any fast variation in the number of nodes should trigger an alert.

maas_net_spaces
  Number of network spaces. Unit: count of spaces.
  Alerts: none defined.

maas_net_fabrics
  Number of network fabrics. Unit: count of fabrics.
  Alerts: none defined.

maas_net_vlans
  Number of network VLANs. Unit: count of VLANs.
  Alerts: none defined.

maas_net_subnets_v4
  Number of IPv4 subnets. Unit: count of IPv4 subnets.
  Alerts: none defined.

maas_net_subnets_v6
  Number of IPv6 subnets. Unit: count of IPv6 subnets.
  Alerts: none defined.

maas_net_subnet_ip_count (status: available or used)
  Number of IPs in a subnet, by status. Unit: count of IPs.
  Alerts: monitor the number of "available" IPs, as depletion of the pool can prevent new deployments.

maas_net_subnet_ip_dynamic (status: available or used; cidr: subnet address)
  Number of used dynamic IPs in a subnet. Unit: count of used dynamic IPs.
  Alerts: monitor the number of "available" IPs, as depletion of the pool can prevent new deployments.

maas_net_subnet_ip_reserved (status: available or used; cidr: subnet address)
  Number of used reserved IPs in a subnet. Unit: count of used reserved IPs.
  Alerts: monitor the number of "available" IPs, as depletion of the pool can prevent new deployments.

maas_net_subnet_ip_static (status: available or used; cidr: subnet address)
  Number of used static IPs in a subnet. Unit: count of used static IPs.
  Alerts: monitor the number of "available" IPs, as depletion of the pool can prevent new deployments.

maas_machines_total_mem
  Combined memory of all machines. Unit: megabytes of memory.
  Alerts: none defined.

maas_machines_total_cpu
  Combined CPU count of all machines. Unit: count of CPUs.
  Alerts: none defined.

maas_machines_total_storage
  Combined storage space of all machines. Unit: bytes of storage.
  Alerts: none defined.

maas_kvm_pods
  Number of KVM hosts. Unit: count of KVM hosts.
  Alerts: none defined.

maas_kvm_machines
  Number of virtual machines allocated on KVM hosts. Unit: count of virtual machines.
  Alerts: none defined.

maas_kvm_cores (status: available or used)
  Total number of CPU cores present on KVM hosts. Unit: count of KVM cores.
  Alerts: monitor the number of "available" CPU cores, as depletion of the pool can prevent new deployments.

maas_kvm_memory (status: available or used)
  Total amount of RAM present on KVM hosts. Unit: megabytes of memory.
  Alerts: monitor the amount of "available" RAM, as depletion of the pool can prevent new deployments.

maas_kvm_storage (status: available or used)
  Total amount of storage space present on KVM hosts. Unit: bytes of storage.
  Alerts: monitor the amount of "available" storage space, as depletion of the pool can prevent new deployments.

maas_kvm_overcommit_cores
  Total number of CPU cores present on KVM hosts, adjusted by the overcommit setting. Unit: overcommitted number of cores.
  Alerts: none defined.

maas_kvm_overcommit_memory
  Total amount of RAM present on KVM hosts, adjusted by the overcommit setting. Unit: overcommitted megabytes of memory.
  Alerts: none defined.

maas_machine_arches (arch: machine architecture)
  Total number of machines per architecture. Unit: count of machines.
  Alerts: none defined.

maas_custom_static_images_uploaded (base_image: custom image base OS; file_type: image file type)
  Number of custom OS images present in MAAS. Unit: count of images.
  Alerts: none defined.

maas_custom_static_images_deployed
  Number of deployed machines running custom OS images. Unit: count of images.
  Alerts: none defined.

maas_vmcluster_projects
  Number of KVM clusters. Unit: count of projects.
  Alerts: none defined.

maas_vmcluster_hosts
  Total number of KVM hosts in clusters. Unit: count of VM hosts.
  Alerts: none defined.

maas_vmcluster_vms
  Total number of virtual machines in KVM clusters. Unit: count of virtual machines.
  Alerts: none defined.
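
Several of the alert suggestions above amount to watching the "available" series of a pool metric. Assuming you scrape these metrics into Prometheus, a query along the following lines surfaces subnets running low on addresses, via Prometheus's standard HTTP query API; the Prometheus URL and the threshold are illustrative.

# Illustrative check: list subnets whose count of available IPs has dropped
# below a threshold. PROM_URL and the threshold are placeholders.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # placeholder
QUERY = 'maas_net_subnet_ip_count{status="available"} < 20'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"][1])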

Node performance metrics

maas_node_cpu_time is a counter, measured in jiffies, with labels host (controller IP address), maas_id (MAAS cluster UUID), service_type (region or rack), and state (CPU state, as listed in /proc/stat). It reports the standard Linux CPU performance counters; see the proc(5) man page. High rates of work performed in either 'user' or 'system' can indicate that the controller is overloaded, while increasing amounts of time spent in 'iowait' can indicate disk degradation.

The remaining node metrics cover memory. Each is a gauge measured in kB (the HugePages metrics are plain counts with no unit), carries the labels host (controller IP address), maas_id (MAAS cluster UUID), pid (process system ID), and service_type (region or rack), and has no alert levels defined:

maas_node_mem_AnonPages: total anonymous (non-file) pages mapped into the page tables
maas_node_mem_Buffers: total temporary storage elements in memory
maas_node_mem_Cached: total page cache size
maas_node_mem_CommitLimit: total memory currently available for allocation on the system
maas_node_mem_Committed_AS: total memory already allocated on the system
maas_node_mem_Dirty: total memory currently waiting to be written back to disk
maas_node_mem_HugePages_Free: total number of unallocated huge pages
maas_node_mem_HugePages_Rsvd: number of huge pages reserved for allocation from the pool
maas_node_mem_HugePages_Surp: number of surplus huge pages
maas_node_mem_HugePages_Total: total size of the huge pages pool
maas_node_mem_Mapped: total memory used by mmapped files
maas_node_mem_MemAvailable: total RAM available to processes
maas_node_mem_MemFree: total free RAM
maas_node_mem_MemTotal: total usable RAM
maas_node_mem_PageTables: total memory consumed by page tables
maas_node_mem_Shmem: total memory used by shared memory and tmpfs
maas_node_mem_Slab: total memory used by the kernel-level data structure cache
maas_node_mem_SReclaimable: total memory in reclaimable parts of the slab cache
maas_node_mem_SUnreclaim: total memory in unreclaimable parts of the slab cache
maas_node_mem_SwapCached: total recently used swap memory
maas_node_mem_SwapTotal: total amount of swap space available
maas_node_mem_SwapFree: total unused swap space
maas_node_mem_VmallocUsed: total size of the used vmalloc memory space
maas_node_mem_Writeback: total memory currently being written back to disk
maas_node_mem_WritebackTmp: total temporary buffer for writebacks
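
The alert note for maas_node_cpu_time suggests watching 'iowait'; since the metric is a counter, the quantity of interest is its rate over time rather than its absolute value. Assuming a Prometheus scrape, a query like the one below trends it per controller; the Prometheus URL is a placeholder.

# Rate of CPU time spent in 'iowait' per controller over the last five
# minutes; a rising trend can point at disk degradation. URL is a placeholder.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # placeholder
QUERY = 'rate(maas_node_cpu_time{state="iowait"}[5m])'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("host"), result["value"][1])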

Performance metrics

All performance metrics carry the labels host (controller IP address) and maas_id (MAAS cluster UUID). The four maas_http_* metrics additionally carry method (HTTP method), op (REST API operation name), path (REST API endpoint), and status (HTTP response status code); any other extra labels are shown in parentheses after the metric name.

maas_http_request_latency (histogram; unit: seconds)
  The time MAAS takes to process a REST API call, excluding any network time such as proxy processing.
  Alerts: the average response time depends on the endpoint, the controller hardware, and the size of the cluster; in most cases you can expect a response in less than 30 seconds.

maas_http_response_size (histogram; unit: bytes)
  The size of REST API responses.
  Alerts: none defined.

maas_http_request_query_count (histogram; no unit)
  The number of database operations executed per REST API call.
  Alerts: the expected maximum number of DB operations in a single request depends on the MAAS version; currently, values over 30 are considered potential bugs.

maas_http_request_query_latency (histogram; unit: seconds)
  The time required to perform a single database operation during a REST API call, measured from the moment MAAS starts a transaction until it gets the response.
  Alerts: the average latency depends on the cluster size and the load on the database; increasing latency trends can point to database degradation or the need to add resources.

maas_rack_region_rpc_call_latency (histogram; unit: seconds; call: RPC operation)
  The time a region controller takes to perform an RPC call to a rack controller, measured from request to response.
  Alerts: the average latency depends on the controllers involved and the load on the rack. Spikes might point to transient network issues, while high average values can indicate the need to scale up MAAS, either with a higher-capacity controller or by adding more controllers.

maas_region_rack_rpc_call_latency (histogram; unit: seconds; call: RPC operation)
  The time a rack controller takes to perform an RPC call to a region controller, measured from request to response.
  Alerts: the average latency depends on the controllers involved and the load on the region. Spikes might point to transient network issues, while high average values can indicate the need to scale up MAAS, either with a higher-capacity controller or by adding more controllers.

maas_websocket_call_query_count (histogram; no unit; call: WebSocket operation)
  The number of database operations executed per WebSocket call.
  Alerts: the expected maximum number of DB operations in a single request depends on the MAAS version; currently, values over 30 are considered potential bugs.

maas_websocket_call_query_latency (histogram; unit: seconds; call: WebSocket operation)
  The time required to perform a single database operation during a WebSocket call, measured from the moment MAAS starts a transaction until it gets the response.
  Alerts: the average latency depends on the cluster size and the load on the database; increasing latency trends can point to database degradation or the need to add resources.

maas_websocket_call_latency (histogram; unit: seconds; call: WebSocket operation)
  The time MAAS takes to process a WebSocket call, excluding any network time such as proxy processing.
  Alerts: the average response time depends on the operation, the controller hardware, and the size of the cluster; in most cases you can expect a response in less than 30 seconds.

maas_dns_update_latency (histogram; unit: seconds; update_type: reload or dynamic)
  The time MAAS takes to set up all zones in the DNS service, per update type: 'reload' (cold start) or 'dynamic' (RNDC operation).
  Alerts: the average time of a reload operation depends on the number of zones and records, and can take a few seconds to complete; a dynamic operation should be much faster, usually under 2 seconds.

maas_dns_full_zonefile_write_count (counter; zone: DNS zone name)
  Count of full DNS zone rewrite operations.
  Alerts: full DNS zone rewrites should not occur very often, so a high rate of this operation could indicate something abnormal in MAAS.

maas_dns_dynamic_update_count (counter; zone: DNS zone name)
  Count of dynamic DNS zone update operations.
  Alerts: MAAS prefers dynamic updates whenever possible, so the rate of this operation should be similar to the rate of machine operations.

maas_rpc_pool_exhaustion_count (counter)
  Number of times the RPC connection pool allocated its maximum number of connections.
  Alerts: MAAS automatically manages its connection pool size, so some occurrences are normal. Starting a new connection adds latency to RPC calls, so you might want to tune the `max_idle_rpc_connections` and `max_rpc_connections` parameters if this is happening too frequently.

maas_lxd_fetch_machine_failure (counter)
  Total number of failures when fetching LXD machines.
  Alerts: failures to fetch LXD VM information can have many causes, including network issues and high load on the KVM host.

maas_lxd_disk_creation_failure (counter)
  Total number of failures of LXD disk creation.
  Alerts: failures to allocate storage for an LXD VM are mainly caused by storage pool exhaustion.

maas_virsh_storage_pool_creation_failure (counter)
  Total number of failures of virsh storage pool creation.
  Alerts: failures to allocate storage for a virsh VM are mainly caused by storage pool exhaustion.

maas_virsh_fetch_mac_failure (counter)
  Total number of failures of virsh interface enumeration.
  Alerts: failures to fetch virsh VM information can have many causes, including network issues, virsh configuration errors, and high load on the KVM host.

maas_virsh_fetch_description_failure (counter)
  Total number of failures of virsh domain description.
  Alerts: failures to fetch virsh VM information can have many causes, including network issues, virsh configuration errors, and high load on the KVM host.

maas_tftp_file_transfer_latency (histogram; unit: seconds; filename: file requested)
  Time required to transfer a file to a machine using TFTP.
  Alerts: the average time required to transfer a file depends on the file size, network load, and the machine's TFTP client implementation; increasing transfer times could indicate network connectivity issues or link congestion.
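
Because the latency metrics above are histograms, percentiles are derived with histogram_quantile over the per-bucket series that Prometheus histograms conventionally expose under a _bucket suffix. Assuming these metrics are scraped into Prometheus, a sketch like the following reports 95th-percentile REST API latency per endpoint; the Prometheus URL is a placeholder.

# 95th-percentile REST API latency per endpoint over the last 15 minutes,
# computed from the conventional _bucket series. URL is a placeholder.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # placeholder
QUERY = (
    "histogram_quantile(0.95, "
    "sum by (path, le) (rate(maas_http_request_latency_bucket[15m])))"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("path"), result["value"][1])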