Neo4j Aura Metrics | Questions and General feedback

planned

hanna.shevchenko@tecalliance.net

Hello,
I use the instance metrics endpoint to extract metrics from Neo4j Aura, send them to Amazon CloudWatch, and visualize them in Grafana. I have questions regarding metrics aggregation in relation to time intervals. To formulate my questions precisely, I provide the following concrete examples:
•	neo4j_aura_out_of_memory_errors_total{aggregation="SUM", instance_id="instance_id"}
This metric shows the sum of out-of-memory errors. What time interval does this sum cover? Is it the total number of errors that have occurred from the first metric scrape until the last one?
•	neo4j_dbms_page_cache_usage_ratio{aggregation="MIN", instance_id="instance_id"}
What time interval does the MIN refer to? Is it the lowest ratio of the allocated page cache in use, calculated over the time range from the first metric scrape until the last one?
•	neo4j_database_transaction_committed_total{aggregation="MAX", database="neo4j", instance_id="instance_id"}
What time interval does the MAX refer to? Is it the highest number of committed transactions within the time range from the first metric scrape until the last one?
•	neo4j_dbms_page_cache_hit_ratio_per_minute{aggregation="AVG", instance_id="instance_id"}
Is this the average of all metric values from the first scrape until the last one?
Unfortunately, the Neo4j Aura documentation does not answer these questions.
Thanks in advance for your help!

April 24, 2025

Chris Shelmerdine marked this post as planned

neo4j_aura_out_of_memory_errors_total: On Aura Business Critical or Virtual Dedicated Cloud instances, high availability is provided by multiple servers in a cluster. The aggregation here adds up the out-of-memory error counts of all servers in the cluster. If a server is replaced (for example, during an upgrade), its counter is reset, so the total may appear to decrease periodically. Any non-zero value could indicate that the instance size is insufficient for the workload.
neo4j_dbms_page_cache_usage_ratio: Again, with a clustered deployment, each server maintains its own page cache. We return the MIN value (worst case) across all servers in the cluster.
neo4j_database_transaction_committed_total: As above, the largest value across all Neo4j servers is returned. Since it is a counter, the value is the last known total of transactions committed since the server started.
neo4j_dbms_page_cache_hit_ratio_per_minute: Again, with multiple servers, this is the average page cache hit ratio across all servers. It is based on the Neo4j metric that calculates the hit ratio over a sliding window covering the latest minute.
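To illustrate the answers above, here is a minimal Python sketch of how one scrape could combine per-server values into the single value reported for the instance, and how a downstream consumer might cope with counter resets. The per-server readings and both function names are hypothetical, and the reset handling (treating any drop as a restart from zero) is a common convention for counters rather than something Aura is confirmed to do.

```python
from statistics import mean

# Map each aggregation label to the function applied across cluster servers.
AGGREGATIONS = {"SUM": sum, "MIN": min, "MAX": max, "AVG": mean}


def aggregate_across_servers(values, aggregation):
    """Combine one reading per server into a single instance-level value."""
    return AGGREGATIONS[aggregation](values)


def increase_with_resets(samples):
    """Total growth of a counter series, treating any drop as a reset to zero.

    This is an assumed client-side convention, useful because a replaced
    server restarts its counter and the scraped total can decrease.
    """
    total = 0
    for prev, curr in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


# Hypothetical readings from a three-server cluster at a single scrape.
oom_errors = [0, 2, 1]                  # reported with aggregation="SUM"
page_cache_usage = [0.91, 0.87, 0.95]   # reported with aggregation="MIN"
committed_tx = [10500, 10480, 10510]    # reported with aggregation="MAX"
hit_ratio = [0.99, 0.97, 0.98]          # reported with aggregation="AVG"

print(aggregate_across_servers(oom_errors, "SUM"))
print(aggregate_across_servers(page_cache_usage, "MIN"))
print(aggregate_across_servers(committed_tx, "MAX"))
print(aggregate_across_servers(hit_ratio, "AVG"))

# Hypothetical scraped SUM values over time; the drop from 5 to 1 marks a reset.
print(increase_with_resets([3, 5, 1, 4]))
```

In other words, the aggregation label describes how values from different servers are combined at one point in time, not a time range between scrapes. Any per-interval figure, such as errors in the last hour, would have to be derived on the consumer side from successive scrapes.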
Thank you for taking the time to ask the questions and clarify the use of these metrics. We have planned an improvement to the documentation pages to make this clearer for other users.

Chris Shelmerdine marked this post as under review

Thanks for the feedback. We will review your questions and let you know.