Neo4j Aura Metrics
planned
hanna.shevchenko@tecalliance.net
Hello,
I use the instance metrics endpoint to extract metrics from Neo4j Aura, send them to Amazon CloudWatch, and visualize them in Grafana. I have questions regarding metrics aggregation in relation to time intervals. To formulate my questions precisely, I provide the following concrete examples:
• neo4j_aura_out_of_memory_errors_total{aggregation="SUM", instance_id="instance_id"}
This metric shows the sum of out-of-memory errors. What time interval does this sum cover? Is it the total number of errors that have occurred from the first metric scrape until the last one?
• neo4j_dbms_page_cache_usage_ratio{aggregation="MIN", instance_id="instance_id"}
What time interval does the MIN refer to? Is it the lowest ratio of the allocated page cache in use, calculated over the time range from the first metric scrape until the last one?
• neo4j_database_transaction_committed_total{aggregation="MAX", database="neo4j", instance_id="instance_id"}
What time interval does the MAX refer to? Is it the highest number of committed transactions within the time range from the first metric scrape until the last one?
• neo4j_dbms_page_cache_hit_ratio_per_minute{aggregation="AVG", instance_id="instance_id"}
Is this the average of all metric values from the first scrape until the last one?
Unfortunately, the Neo4j Aura documentation does not address these questions.
Thanks in advance for your help!
Chris Shelmerdine
planned
neo4j_aura_out_of_memory_errors_total: Aura Business Critical and Virtual Dedicated Cloud instances offer high availability by running multiple servers in a cluster. The aggregation here adds up the out-of-memory error counts of all servers in the cluster. If a server is replaced (for example, during an upgrade), the counters are reset, so the total may appear to decrease periodically. Any non-zero value could indicate that the instance is too small for the workload.
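Because the summed counter can drop when a server is replaced, a consumer that wants "new errors since the last scrape" cannot simply subtract the first sample from the last. A minimal Python sketch of one conservative way to handle this follows. It assumes that every decrease between two scrapes is caused by a counter reset; the function name and the sample values are hypothetical and not part of any Aura API. Since only part of a summed total may have been reset, the result is best read as a lower bound.

```python
def observed_increase(samples):
    """Sum only the positive steps between consecutive scraped counter values.

    A drop between two scrapes is assumed to come from a server replacement
    (counter reset) and contributes nothing, so the result is a lower bound
    on the number of new errors.
    """
    total = 0
    for previous, current in zip(samples, samples[1:]):
        delta = current - previous
        if delta > 0:
            total += delta
    return total


# Hypothetical scraped values of the SUM-aggregated counter, oldest first.
scrapes = [0, 2, 3, 1, 4]
new_errors = observed_increase(scrapes)  # the drop from 3 to 1 is ignored
```

The per-interval value computed this way, rather than the raw running total, is usually the easier number to alert on.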
neo4j_dbms_page_cache_usage_ratio: Again, with a clustered deployment, each server maintains its own page cache. We return the MIN value (worst case) across all servers in the cluster.
neo4j_database_transaction_committed_total: As above, the largest value across all Neo4j servers is returned. Since it is a counter, it is the last known total number of transactions counted since the server started.
neo4j_dbms_page_cache_hit_ratio_per_minute: Again, with multiple servers, this is the average page cache hit ratio across all the servers. It is based on the Neo4j metric, which calculates the hit ratio over a sliding window covering the latest minute.
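To illustrate the point that the aggregation label describes how values from the servers of one cluster are combined at a single scrape, and not a time range, here is a small Python sketch. The server names and values are made up for illustration, and the `aggregate` helper is hypothetical; it is not how Aura is implemented internally, only a model of the behavior described above.

```python
from statistics import mean

# Maps the aggregation label seen on a metric to the function that
# combines the per-server values taken at one scrape.
AGGREGATIONS = {
    "SUM": sum,
    "MIN": min,
    "MAX": max,
    "AVG": mean,
}


def aggregate(per_server_values, aggregation):
    """Collapse the values of all servers in a cluster into one reported value."""
    return AGGREGATIONS[aggregation](list(per_server_values))


# Hypothetical committed-transaction counters of a three-server cluster.
committed = {"server-1": 1200, "server-2": 1180, "server-3": 1195}
reported_max = aggregate(committed.values(), "MAX")

# Hypothetical out-of-memory counters of the same cluster.
oom_errors = {"server-1": 0, "server-2": 2, "server-3": 1}
reported_sum = aggregate(oom_errors.values(), "SUM")
```

Any aggregation over time (for example, the minimum over the last hour) would then be a separate step applied to the scraped series in CloudWatch or Grafana.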
Thank you for taking the time to ask the questions and clarify the use of these metrics. We have planned an improvement to the documentation pages to make this clearer for other users.
hanna.shevchenko@tecalliance.net
Chris Shelmerdine Thank you for the answer!
I have a few points that still need clarification:
- What is meant by "cluster"? Does it refer to a single instance or a group of instances? I understand the cluster as a group of instances.
- Metrics for a single instance: I read metrics for an instance running on one server. Does this mean metric aggregations are irrelevant in my case since there’s only one server? Or are all instances in the cluster included in the metrics, even when I call the endpoint for a specific instance?
Chris Shelmerdine
under review
Thanks for the feedback. We will review your questions and let you know.