The MAAS engineering team actively works to improve the performance of MAAS. This article explains how performance measurements work, and provides reference tables for the various metrics used by MAAS.
Recent performance measurements
Recently, we improved the API performance of MAAS by testing it with simulated loads. For this testing, we made a number of simplifying assumptions about the simulated environment.
To measure performance, we use continuous performance monitoring, arranged like this:
On a daily basis, we generate simulation data based on the assumptions above, for 10, 100, and 1000 machines. These three datapoints help us get a sense of how our performance improvements scale. A Jenkins tool exercises both the REST API and the WebSocket API, capturing the results in a database, from which we build a dashboard.
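For illustration, here is a minimal sketch of the kind of timing probe such a harness can run against the MAAS REST API. This is not our actual harness code: the host and API key are placeholder assumptions, and the three-part API key format and PLAINTEXT OAuth1 signing follow standard MAAS CLI conventions.

```python
import time

from oauthlib.oauth1 import SIGNATURE_PLAINTEXT
from requests_oauthlib import OAuth1Session

MAAS_URL = "http://maas.example.com:5240/MAAS"  # hypothetical host
API_KEY = "consumer:token:secret"               # as printed by `maas apikey`

# MAAS API keys have three colon-separated parts; the consumer secret is empty.
consumer_key, token_key, token_secret = API_KEY.split(":")
session = OAuth1Session(
    consumer_key,
    resource_owner_key=token_key,
    resource_owner_secret=token_secret,
    signature_method=SIGNATURE_PLAINTEXT,
)

# Time a single machine-listing call, excluding JSON decoding.
start = time.monotonic()
response = session.get(f"{MAAS_URL}/api/2.0/machines/")
elapsed = time.monotonic() - start

response.raise_for_status()
print(f"listed {len(response.json())} machines in {elapsed:.2f}s")
```

A real harness would repeat calls like this across dataset sizes and record the timings in a database rather than printing them.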
Note that we always compare the current stable release with the release in development, so that we can spot issues before they become harder to find and fix. We also capture profiling data that allows us to find bottlenecks, generating a heatmap that shows which parts of the code are causing issues at the moment.
For example, comparing MAAS 3.2 to MAAS 3.1, machine listings load, on average, 32% faster for the datasets we’re using.
Here's a short history of our performance efforts to date:

- A post documents recent efforts to improve MAAS performance, with quantitative results.
- A sustained effort was made to improve the performance of the UI.

Note that this list only captures the bigger, sustained efforts, although there is a constant focus on weeding out slowdowns when we come across them.
It's possible to collect your own MAAS metrics, and even share them with the MAAS engineering team. We are keen to know everything we can about machine counts, network sizes, and MAAS performance in all areas. Please use the Discourse performance forum to share your feedback and observations.
As part of the MAAS 3.2 development effort, we have taken steps to improve the performance of machine listings. To date, we have measured listing a large number (100-1000) of machines via the REST API to be 32% faster, on average.
Currently, we are actively working to improve MAAS performance for other operations, such as search.
Currently, there are three types of available metrics, each covered by one of the tables below: cluster metrics, controller system metrics, and performance metrics. The tables are generally self-explanatory.
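MAAS serves these metrics in Prometheus text format; in a default install, region controllers typically expose them on port 5239 and rack controllers on 5249, though you should confirm the ports for your own setup. As a sketch, assuming a hypothetical host and the `prometheus_client` package, you can fetch and inspect the raw data like this:

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Region controller metrics endpoint (rack controllers typically use 5249).
resp = requests.get("http://maas.example.com:5239/metrics")
resp.raise_for_status()

# Parse the Prometheus text exposition format and print MAAS metrics.
for family in text_string_to_metric_families(resp.text):
    if family.name.startswith("maas_"):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)
```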
Metric name | Labels | Description | Alert levels |
----|----|----|----|
maas_machines | maas_id: MAAS cluster UUID<br>status: machine status | The number of machines known by MAAS, by status<br>Type: gauge<br>Unit: count of machines | Any fast variation in the number of machines should trigger an alert |
maas_nodes | maas_id: MAAS cluster UUID<br>type: type of node (machine/device/controller) | Number of nodes known by MAAS per type (machine, device or controller)<br>Type: gauge<br>Unit: count of nodes | Any fast variation in the number of nodes should trigger an alert |
maas_net_spaces | maas_id: MAAS cluster UUID | Number of network spaces<br>Type: gauge<br>Unit: count of spaces | None defined |
maas_net_fabrics | maas_id: MAAS cluster UUID | Number of network fabrics<br>Type: gauge<br>Unit: count of fabrics | None defined |
maas_net_vlans | maas_id: MAAS cluster UUID | Number of network VLANs<br>Type: gauge<br>Unit: count of VLANs | None defined |
maas_net_subnets_v4 | maas_id: MAAS cluster UUID | Number of IPv4 subnets<br>Type: gauge<br>Unit: count of IPv4 subnets | None defined |
maas_net_subnets_v6 | maas_id: MAAS cluster UUID | Number of IPv6 subnets<br>Type: gauge<br>Unit: count of IPv6 subnets | None defined |
maas_net_subnet_ip_count | maas_id: MAAS cluster UUID<br>status: available or used | Number of IPs in a subnet, by status<br>Type: gauge<br>Unit: count of IPs | You should monitor the number of "available" IPs, as the depletion of the pool can prevent new deployments |
maas_net_subnet_ip_dynamic | maas_id: MAAS cluster UUID<br>status: available or used<br>cidr: subnet address | Number of used dynamic IPs in a subnet<br>Type: gauge<br>Unit: count of used dynamic IPs | You should monitor the number of "available" IPs, as the depletion of the pool can prevent new deployments |
maas_net_subnet_ip_reserved | maas_id: MAAS cluster UUID<br>status: available or used<br>cidr: subnet address | Number of used reserved IPs in a subnet<br>Type: gauge<br>Unit: count of used reserved IPs | You should monitor the number of "available" IPs, as the depletion of the pool can prevent new deployments |
maas_net_subnet_ip_static | maas_id: MAAS cluster UUID<br>status: available or used<br>cidr: subnet address | Number of used static IPs in a subnet<br>Type: gauge<br>Unit: count of used static IPs | You should monitor the number of "available" IPs, as the depletion of the pool can prevent new deployments |
maas_machines_total_mem | maas_id: MAAS cluster UUID | Combined memory of all machines<br>Type: gauge<br>Unit: megabytes of memory | None defined |
maas_machines_total_cpu | maas_id: MAAS cluster UUID | Combined CPU count of all machines<br>Type: gauge<br>Unit: count of CPUs | None defined |
maas_machines_total_storage | maas_id: MAAS cluster UUID | Combined storage space of all machines<br>Type: gauge<br>Unit: bytes of storage | None defined |
maas_kvm_pods | maas_id: MAAS cluster UUID | Number of KVM hosts<br>Type: gauge<br>Unit: count of KVM hosts | None defined |
maas_kvm_machines | maas_id: MAAS cluster UUID | Number of virtual machines allocated in KVM hosts<br>Type: gauge<br>Unit: count of virtual machines | None defined |
maas_kvm_cores | maas_id: MAAS cluster UUID<br>status: available or used | Total number of CPU cores present on KVM hosts<br>Type: gauge<br>Unit: count of KVM cores | You should monitor the number of "available" CPU cores, as the depletion of the pool can prevent new deployments |
maas_kvm_memory | maas_id: MAAS cluster UUID<br>status: available or used | Total amount of RAM present on KVM hosts<br>Type: gauge<br>Unit: megabytes of memory | You should monitor the amount of "available" RAM, as the depletion of the pool can prevent new deployments |
maas_kvm_storage | maas_id: MAAS cluster UUID<br>status: available or used | Total amount of storage space present on KVM hosts<br>Type: gauge<br>Unit: bytes of storage | You should monitor the amount of "available" storage space, as the depletion of the pool can prevent new deployments |
maas_kvm_overcommit_cores | maas_id: MAAS cluster UUID | Total number of CPU cores present on KVM hosts, adjusted by the overcommit setting<br>Type: gauge<br>Unit: overcommitted number of cores | None defined |
maas_kvm_overcommit_memory | maas_id: MAAS cluster UUID | Total amount of RAM present on KVM hosts, adjusted by the overcommit setting<br>Type: gauge<br>Unit: overcommitted megabytes of memory | None defined |
maas_machine_arches | maas_id: MAAS cluster UUID<br>arch: machine architecture | Total number of machines per architecture<br>Type: gauge<br>Unit: count of machines | None defined |
maas_custom_static_images_uploaded | maas_id: MAAS cluster UUID<br>base_image: custom image base OS<br>file_type: image file type | Number of custom OS images present in MAAS<br>Type: gauge<br>Unit: count of images | None defined |
maas_custom_static_images_deployed | maas_id: MAAS cluster UUID | Number of deployed machines running custom OS images<br>Type: gauge<br>Unit: count of images | None defined |
maas_vmcluster_projects | maas_id: MAAS cluster UUID | Number of KVM clusters<br>Type: gauge<br>Unit: count of projects | None defined |
maas_vmcluster_hosts | maas_id: MAAS cluster UUID | Total number of KVM hosts in clusters<br>Type: gauge<br>Unit: count of VM hosts | None defined |
maas_vmcluster_vms | maas_id: MAAS cluster UUID | Total number of virtual machines in KVM clusters<br>Type: gauge<br>Unit: count of virtual machines | None defined |
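As a worked example of the alert guidance in this table, the sketch below queries a Prometheus server that scrapes MAAS for available subnet IPs and flags clusters whose pool is running low. The Prometheus URL and the threshold of 10 are illustrative assumptions, not recommendations.

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # hypothetical server
QUERY = 'maas_net_subnet_ip_count{status="available"}'
THRESHOLD = 10  # example threshold; tune to your deployment rate

# The Prometheus HTTP API returns an instant vector of matching series.
resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    value = float(result["value"][1])
    if value < THRESHOLD:
        maas_id = result["metric"].get("maas_id")
        print(f"low available IPs ({value:g}) in cluster {maas_id}")
```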
Metric name | Labels | Description | Alert levels |
----|----|----|----|
maas_node_cpu_time | host: controller IP address<br>maas_id: MAAS cluster UUID<br>service_type: region or rack<br>state: CPU state (see /proc/stat) | Standard Linux CPU performance counters; see proc(5)<br>Type: counter<br>Unit: jiffies | High rates of work performed by either 'user' or 'system' can indicate that this controller is overloaded, while increasing amounts of time spent in 'iowait' could indicate disk degradation |
maas_node_mem_AnonPages | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total anon (non-file) pages mapped into the page tables<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_Buffers | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total temporary storage elements in memory<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_Cached | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total page cache size<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_CommitLimit | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total memory currently available for allocation on the system<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_Committed_AS | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total memory already allocated on the system<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_Dirty | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total memory waiting to be written back to disk<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_HugePages_Free | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total number of unallocated huge pages<br>Type: gauge<br>Unit: none | None defined |
maas_node_mem_HugePages_Rsvd | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Number of huge pages reserved for allocation from the pool<br>Type: gauge<br>Unit: none | None defined |
maas_node_mem_HugePages_Surp | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Number of surplus huge pages<br>Type: gauge<br>Unit: none | None defined |
maas_node_mem_HugePages_Total | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total size of the huge pages pool<br>Type: gauge<br>Unit: none | None defined |
maas_node_mem_Mapped | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total memory used by memory-mapped (mmap) files<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_MemAvailable | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total RAM available to processes<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_MemFree | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total free RAM<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_MemTotal | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total usable RAM<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_PageTables | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total amount of memory consumed by page tables<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_Shmem | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total memory used by shared memory and tmpfs<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_Slab | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total memory used by the kernel-level data structures cache<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_SReclaimable | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total memory in reclaimable parts of Slab<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_SUnreclaim | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total memory in unreclaimable parts of Slab<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_SwapCached | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total recently used swap memory<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_SwapTotal | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total amount of swap space available<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_SwapFree | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total unused swap space<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_VmallocUsed | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total size of the used vmalloc memory space<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_Writeback | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total memory currently being written back to disk<br>Type: gauge<br>Unit: kB | None defined |
maas_node_mem_WritebackTmp | host: controller IP address<br>maas_id: MAAS cluster UUID<br>pid: process system id<br>service_type: region or rack | Total temporary buffer for writebacks<br>Type: gauge<br>Unit: kB | None defined |
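Since `maas_node_cpu_time` is a counter measured in jiffies, the usual way to act on the 'iowait' guidance above is a PromQL `rate()` over a few minutes. A sketch, again assuming a hypothetical Prometheus server that scrapes MAAS:

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # hypothetical server
# Per-second CPU time spent in iowait over the last 5 minutes, per controller.
QUERY = 'rate(maas_node_cpu_time{state="iowait"}[5m])'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    print(labels.get("host"), labels.get("service_type"), result["value"][1])
```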
Metric name | Labels | Description | Alert levels |
----|----|----|----|
maas_http_request_latency | host: controller IP address<br>maas_id: MAAS cluster UUID<br>method: HTTP method<br>op: REST API operation name<br>path: REST API endpoint<br>status: HTTP response status code | The time MAAS takes to process a REST API call; it doesn't include any network time, such as proxy processing<br>Type: histogram<br>Unit: seconds | The average response time depends on the endpoint, the controller hardware and the size of the cluster. In most cases you can expect a response in less than 30 seconds |
maas_http_response_size | host: controller IP address<br>maas_id: MAAS cluster UUID<br>method: HTTP method<br>op: REST API operation name<br>path: REST API endpoint<br>status: HTTP response status code | The size of REST API responses<br>Type: histogram<br>Unit: bytes | None defined |
maas_http_request_query_count | host: controller IP address<br>maas_id: MAAS cluster UUID<br>method: HTTP method<br>op: REST API operation name<br>path: REST API endpoint<br>status: HTTP response status code | The number of database operations executed per REST API call<br>Type: histogram<br>Unit: none | The expected maximum number of DB operations performed in a single request depends on the MAAS version. Currently, values over 30 are considered potential bugs |
maas_http_request_query_latency | host: controller IP address<br>maas_id: MAAS cluster UUID<br>method: HTTP method<br>op: REST API operation name<br>path: REST API endpoint<br>status: HTTP response status code | The time required to perform a single database operation during a REST API call. The database latency is measured from the moment MAAS starts a transaction until it gets the response<br>Type: histogram<br>Unit: seconds | The average latency depends on the cluster size and the load on the database. Increasing latency trends can point to database degradation or the need to increase resources |
maas_rack_region_rpc_call_latency | call: RPC operation<br>host: controller IP address<br>maas_id: MAAS cluster UUID | The time a region controller takes to perform an RPC call to a rack controller, measured from the request to the response<br>Type: histogram<br>Unit: seconds | The average latency depends on the controllers involved and the load on the rack controller. Spikes might point to transient network issues, while high average values can indicate the need to scale up MAAS, either with a higher-capacity controller or by adding more controllers |
maas_region_rack_rpc_call_latency | call: RPC operation<br>host: controller IP address<br>maas_id: MAAS cluster UUID | The time a rack controller takes to perform an RPC call to a region controller, measured from the request to the response<br>Type: histogram<br>Unit: seconds | The average latency depends on the controllers involved and the load on the region controller. Spikes might point to transient network issues, while high average values can indicate the need to scale up MAAS, either with a higher-capacity controller or by adding more controllers |
maas_websocket_call_query_count | call: WS operation<br>host: controller IP address<br>maas_id: MAAS cluster UUID | The number of database operations executed per WebSocket call<br>Type: histogram<br>Unit: none | The expected maximum number of DB operations performed in a single request depends on the MAAS version. Currently, values over 30 are considered potential bugs |
maas_websocket_call_query_latency | call: WS operation<br>host: controller IP address<br>maas_id: MAAS cluster UUID | The time required to perform a single database operation during a WebSocket call. The database latency is measured from the moment MAAS starts a transaction until it gets the response<br>Type: histogram<br>Unit: seconds | The average latency depends on the cluster size and the load on the database. Increasing latency trends can point to database degradation or the need to increase resources |
maas_websocket_call_latency | call: WS operation<br>host: controller IP address<br>maas_id: MAAS cluster UUID | The time MAAS takes to process a WebSocket call; it doesn't include any network time, such as proxy processing<br>Type: histogram<br>Unit: seconds | The average response time depends on the operation, the controller hardware and the size of the cluster. In most cases you can expect a response in less than 30 seconds |
maas_dns_update_latency | host: controller IP address<br>maas_id: MAAS cluster UUID<br>update_type: reload or dynamic | The time MAAS takes to set up all zones in the DNS service, per update type, which can be 'reload' (cold start) or 'dynamic' (RNDC operation)<br>Type: histogram<br>Unit: seconds | The average time of a reload operation depends on the number of zones and records, and can take a few seconds to complete. A dynamic operation should be much faster, usually under 2 seconds |
maas_dns_full_zonefile_write_count | host: controller IP address<br>maas_id: MAAS cluster UUID<br>zone: DNS zone name | Count of full DNS zone rewrite operations<br>Type: counter<br>Unit: none | Full DNS zone rewrite operations should not occur very often, so a high rate of this operation could indicate something abnormal in MAAS |
maas_dns_dynamic_update_count | host: controller IP address<br>maas_id: MAAS cluster UUID<br>zone: DNS zone name | Count of dynamic DNS zone update operations<br>Type: counter<br>Unit: none | MAAS prefers dynamic updates whenever possible, so the rate of this operation should be similar to the rate of machine operations |
maas_rpc_pool_exhaustion_count | host: controller IP address<br>maas_id: MAAS cluster UUID | Number of times the RPC connection pool allocated its maximum number of connections<br>Type: counter<br>Unit: none | MAAS automatically manages its connection pool size, so some occurrences of this are normal. Starting a new connection adds latency to RPC calls, so you might want to tune the `max_idle_rpc_connections` and `max_rpc_connections` parameters if this occurs too frequently |
maas_lxd_fetch_machine_failure | host: controller IP address<br>maas_id: MAAS cluster UUID | Total number of failures to fetch LXD machines<br>Type: counter<br>Unit: none | Failures to fetch LXD VM information can have many causes, including network issues and high load on the KVM host |
maas_lxd_disk_creation_failure | host: controller IP address<br>maas_id: MAAS cluster UUID | Total number of failures of LXD disk creation<br>Type: counter<br>Unit: none | Failures to allocate storage for a LXD VM are mainly caused by storage pool exhaustion |
maas_virsh_storage_pool_creation_failure | host: controller IP address<br>maas_id: MAAS cluster UUID | Total number of failures of virsh storage pool creation<br>Type: counter<br>Unit: none | Failures to allocate storage for a virsh VM are mainly caused by storage pool exhaustion |
maas_virsh_fetch_mac_failure | host: controller IP address<br>maas_id: MAAS cluster UUID | Total number of failures of virsh interface enumeration<br>Type: counter<br>Unit: none | Failures to fetch virsh VM information can have many causes, including network issues, virsh configuration errors and high load on the KVM host |
maas_virsh_fetch_description_failure | host: controller IP address<br>maas_id: MAAS cluster UUID | Total number of failures to fetch the virsh domain description<br>Type: counter<br>Unit: none | Failures to fetch virsh VM information can have many causes, including network issues, virsh configuration errors and high load on the KVM host |
maas_tftp_file_transfer_latency | host: controller IP address<br>maas_id: MAAS cluster UUID<br>filename: file requested | Time required to transfer a file to a machine using TFTP<br>Type: histogram<br>Unit: seconds | The average time required to transfer a file depends on the file size, network load and the machine's TFTP client implementation. Increasing transfer times could mean network connectivity issues or link congestion |
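The latency metrics in this table are Prometheus histograms, so percentile latency comes from `histogram_quantile()` over the per-bucket rate; the `_bucket` suffix below is the standard Prometheus histogram convention rather than anything MAAS-specific. A sketch for p95 REST API latency per endpoint, assuming the same hypothetical Prometheus server as above:

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # hypothetical server
# 95th-percentile request latency over 5 minutes, aggregated per endpoint.
QUERY = (
    "histogram_quantile(0.95, "
    "sum by (le, path) (rate(maas_http_request_latency_bucket[5m])))"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    print(result["metric"].get("path"), "p95:", result["value"][1], "s")
```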