Events and alerts in Automate Turboscale
Event Alerts provide real-time visibility into the health of your self-hosted grid, so you can spot potential problems and resolve issues before they disrupt your testing. You can:
- View real-time events when you open the page.
- Review the last 7 days of event history for a cluster.
- See impact and fix suggestions for each error or warning.
- Surface cluster-level events in the build dashboard so teams notice issues while looking at tests.
By monitoring key infrastructure events, you can identify and resolve issues that might impact your test runs on your own, often before they cause widespread failures.
This guide explains how to use Event Alerts to monitor your grid, understand event logs, and take corrective action.
Prerequisites
- Events are collected only from clusters where the BrowserStack Agent is installed and running.
- Install and configure the Agent on every cluster you want to monitor.
- If the Agent is not installed, the Grid Management → Events page will show setup steps and no events will appear for that cluster.
- You need access to Grid Management in the project.
Find events
Event alerts appear in two locations, giving you both a high-level overview and a build-specific view of your grid’s health.
Cluster-level alerts
Cluster-level alerts provide a centralized view of your entire grid’s health. You can find them in the Grid Management section. These events are captured continuously, even when no tests are running, helping you monitor the overall stability of your infrastructure.
- Use this view to monitor the overall health of nodes and pods across your entire cluster.
- These alerts appear under Grid Management > Events.
Build-level alerts
Build-level alerts appear directly on the Automate Dashboard when an infrastructure event occurs that may have directly impacted a specific test build. This helps you quickly correlate a test failure with an underlying grid issue.
- Use this view to quickly diagnose if a test failure was caused by a problem with the grid infrastructure.
- This alert appears as a banner on the Build Details page.
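The correlation that the build-level banner performs can be pictured as a time-window match. The sketch below is illustrative only: the event records, the field names (`type`, `time`), and the two-minute `slack` are assumptions made for this example, not the product's actual data model or matching logic.

```python
from datetime import datetime, timedelta, timezone

def events_during_build(events, build_start, build_end, slack=timedelta(minutes=2)):
    """Return events whose timestamp falls inside the build window.

    The window is widened by `slack` on both sides, because an
    infrastructure problem can start shortly before tests begin to fail.
    """
    lo, hi = build_start - slack, build_end + slack
    return [e for e in events if lo <= e["time"] <= hi]

# Hypothetical events; in practice these would come from the Events page.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    {"type": "NodeNotReady", "time": t0 + timedelta(minutes=5)},
    {"type": "OOMKilling", "time": t0 + timedelta(minutes=40)},
]

# A build that ran for the first ten minutes after t0.
hits = events_during_build(events, t0, t0 + timedelta(minutes=10))
print([e["type"] for e in hits])
```

Only events that overlap the build window are returned, which is the same question the banner helps answer: did something happen on the grid while this build was running?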
View and filter events
To access and filter your cluster’s event logs:
- From the left-hand navigation menu, select Grid Management.
- Click the Events tab to open the event log view.
- By default, you will see events from the last hour. To change the time range, use the filter buttons:
  - 15m, 30m, 1H, 1D: Select a predefined time range.
  - Custom: Click the date field to open a calendar and select a custom date range. You can view logs for up to the last 7 days.

After following these steps, you’ll see a list of all cluster events that occurred within your selected time frame.
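To make the time-range options concrete, here is a minimal sketch of how the presets and the 7-day limit translate into a start and end time. The function names and the clamping behavior for custom ranges are assumptions for illustration; the page itself only states that logs are available for up to the last 7 days.

```python
from datetime import datetime, timedelta, timezone

# Preset filter buttons described in the steps above.
PRESETS = {
    "15m": timedelta(minutes=15),
    "30m": timedelta(minutes=30),
    "1H": timedelta(hours=1),
    "1D": timedelta(days=1),
}
MAX_HISTORY = timedelta(days=7)  # event history retained per cluster

def preset_window(name, now):
    """Return (start, end) for a preset button, ending at `now`."""
    return now - PRESETS[name], now

def custom_window(start, end, now):
    """Return a custom range clamped to the last 7 days (assumed behavior)."""
    if start > end:
        raise ValueError("start must not be after end")
    earliest = now - MAX_HISTORY
    return max(start, earliest), min(end, now)

now = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(preset_window("1H", now))
print(custom_window(now - timedelta(days=10), now - timedelta(days=1), now))
```

In the second call, the requested start is 10 days back, so the sketch moves it forward to the 7-day boundary.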
Understanding event details
Each event log provides details to help you diagnose the issue. For critical errors, we also provide suggested actions.
Here are some common build events you might encounter:
Event Type | Cause and Impact | Fix |
---|---|---|
NodeNotReady | Cause: Kubelet stops heartbeats (network, crash, overload). Impact: Pods unreachable, may be evicted. | Check kubelet logs; verify network; restart kubelet; check node CPU/memory. |
NodeHasDiskPressure | Cause: Disk usage > 85%. Impact: Node unschedulable, pods evicted. | Clean unused images; remove old logs; add disk space; configure log rotation; check large files. |
OOMKilling | Cause: Kernel killed process (OOM). Impact: Pod restart, disruption. | Increase pod memory limits; add node memory; review memory usage; enable swap; optimize app memory usage. |
FailedScheduling | Cause: Scheduler cannot place pod. Impact: Pod stuck Pending. | Check requests vs capacity; review affinity/taints; add nodes; adjust constraints. |
FailedCreatePodContainer | Cause: Runtime failed to create container. Impact: Pod creation fails. | Verify image exists; check runtime health; review security contexts; check resources. |
Evicted | Cause: Pod removed (resource pressure, policy). Impact: Pod terminated, rescheduled. | Reduce resource pressure; review QoS/priority; check eviction policies; increase requests. |
BackoffLimitExceeded | Cause: Job exceeded retries. Impact: Job marked failed. | Increase backoffLimit; fix app issues; review job settings; check resources. |
FailedRetrieveImagePullSecret | Cause: Cannot access registry secret. Impact: Image pull fails. | Verify secret exists; check credentials; review serviceAccount settings; test registry auth. |
FailedCreatePodSandBox | Cause: Runtime cannot create sandbox. Impact: Pod creation fails. | Check runtime logs; verify CNI; review network config; check resources. |
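If you also have `kubectl` access to the cluster, you can cross-check which of these events occur most often. The sketch below is not part of Automate Turboscale; it assumes you have saved the output of `kubectl get events --all-namespaces -o json` and counts Warning events by their `reason` field. The sample data is made up for the example.

```python
from collections import Counter

def count_warning_reasons(event_list):
    """Count Warning events by reason in a Kubernetes event list (parsed JSON)."""
    return Counter(
        item.get("reason", "Unknown")
        for item in event_list.get("items", [])
        if item.get("type") == "Warning"
    )

# Made-up sample shaped like the JSON list that kubectl returns.
sample = {
    "items": [
        {"type": "Warning", "reason": "FailedScheduling"},
        {"type": "Warning", "reason": "OOMKilling"},
        {"type": "Normal", "reason": "Scheduled"},
        {"type": "Warning", "reason": "FailedScheduling"},
    ]
}

counts = count_warning_reasons(sample)
for reason, n in counts.most_common():
    print(f"{reason}: {n}")
```

Normal events such as `Scheduled` are skipped, so the counts cover only the reasons that map to rows in the table above.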
Here are some common cluster events you might encounter:
Event Type | Cause and Impact | Fix |
---|---|---|
KubeletIsDown | Cause: Kubelet stopped/crashed. Impact: Node management stops, pods unresponsive, node NotReady. | Restart kubelet; check logs; verify API connectivity; check system resources. |
PIDPressure | Cause: PID limit exceeded. Impact: Node unschedulable, new processes cannot start. | Increase PID limit; restart nodes; review workloads; monitor PID usage. |
MemoryPressure | Cause: Node memory exceeded threshold. Impact: Node unschedulable, pods evicted. | Review requests/limits; scale down workloads; add memory; enable swap; monitor usage. |
FailedMount | Cause: Volume mount failed. Impact: Pod startup fails. | Verify volume; check permissions; review storage class; check PV/PVC binding. |
WorkflowFailed | Cause: Workflow step failed or timed out. Impact: Entire workflow failed. | Check step logs; review retry policies; fix failing steps; check timeouts/resources. |
NetworkNotReady | Cause: CNI/plugin issue. Impact: Pod cannot communicate. | Check CNI; restart network daemon; verify policies; check node config. |
FailedToUpdateEndpointSlices | Cause: Service controller failed. Impact: Service routing broken. | Check controller logs; verify RBAC; restart controller; review selector config. |
SyncLoadBalancerFailed | Cause: Cloud load balancer sync failed. Impact: External access broken. | Check cloud API; review controller logs; check load balancer config; verify quotas. |
FailedAttachVolume | Cause: Cannot attach PV. Impact: Pod cannot start. | Check volume availability; verify zone; review storage class; check limits. |
FilesystemIsReadOnly | Cause: Filesystem mounted read-only. Impact: Apps cannot write. | Check for corruption; remount read-write; review mount options; check storage health. |
InvalidDiskCapacity | Cause: Requested disk invalid. Impact: Pod cannot start. | Review limits; check storage class; verify cloud limits; adjust PVC request. |
FailedToScaleUpGroup | Cause: Autoscaler cannot add nodes. Impact: Pods pending. | Check quotas; verify autoscaler config; review node group; check subnet/security groups. |
ScaleUpTimedOut | Cause: Provisioning too slow. Impact: Scaling aborted. | Check provisioning time; increase timeout; check availability; review config. |
KubeletServingCertificateInvalid | Cause: Expired/misconfigured certificate. Impact: Insecure API communication. | Renew certificates; check validity; restart kubelet; verify CA chain. |
Drain | Cause: Node being drained. Impact: Workloads moved. | Monitor evacuation; handle stuck pods; verify redistribution; check storage. |
ContainerRuntimeIsDown | Cause: Runtime stopped/unresponsive. Impact: Node cannot manage containers. | Restart runtime; check logs; verify resources; review config. |
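Several of the events above (for example MemoryPressure and PIDPressure) correspond to standard Kubernetes node conditions. As a rough cross-check outside the Events page, the sketch below scans the parsed output of `kubectl get nodes -o json` and lists nodes that are not Ready or that report a pressure condition. The sample node data is invented for the example, and the function name is hypothetical.

```python
# Node conditions that indicate a problem when their status is "True".
PRESSURE_TYPES = {"MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable"}

def unhealthy_nodes(node_list):
    """Map node name -> list of problem conditions from a parsed node list."""
    problems = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        flagged = []
        for cond in node.get("status", {}).get("conditions", []):
            ctype, status = cond.get("type"), cond.get("status")
            # Ready is healthy only when "True"; "False" and "Unknown" are problems.
            if ctype == "Ready" and status != "True":
                flagged.append("NotReady")
            elif ctype in PRESSURE_TYPES and status == "True":
                flagged.append(ctype)
        if flagged:
            problems[name] = flagged
    return problems

# Invented sample with one healthy node and one node under pressure.
nodes = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [
             {"type": "MemoryPressure", "status": "False"},
             {"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [
             {"type": "MemoryPressure", "status": "True"},
             {"type": "Ready", "status": "Unknown"}]}},
    ]
}

report = unhealthy_nodes(nodes)
print(report)
```

A node that appears in this report is a good place to start when the Events page shows a matching event type.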