Chaos Engineering with Playwright: 3 Practical Methods to Build Resilient Systems
"The tests are green, but production is down."
In 2026, we know that automated E2E tests often miss real outages because they operate in a "happy path" vacuum. They assume low latency, 100% API availability, and unlimited CPU resources. To build truly resilient systems, we must validate Graceful Degradation: how the system behaves when things go wrong.
Playwright offers powerful, low-level APIs to inject chaos without external tools. Here is how to move from "testing features" to "testing resilience."
1. Simulated API Failure (HTTP 500 Injection)
The Risk: If a critical API (e.g., billing or auth) fails, does the user face a white screen of death, or a meaningful "Retry" flow? This isn't just a UI check; it's a safeguard against data inconsistency.
Scenario: Intercept a billing request and return a 500 Internal Server Error.
Python (pytest-playwright):

from playwright.sync_api import Page, expect

def test_billing_resilience(page: Page):
    # Intercept the payment endpoint to simulate a backend crash
    page.route("**/api/v1/billing/pay", lambda route: route.fulfill(
        status=500,
        content_type="application/json",
        body='{"error": "Database Deadlock"}',
    ))
    page.goto("/checkout")  # relative URL: requires a configured base URL
    page.get_by_role("button", name="Complete Purchase").click()
    # Assert graceful degradation: a toast appears and the user can still retry
    expect(page.locator(".error-toast")).to_be_visible()
    expect(page.get_by_role("button", name="Complete Purchase")).to_be_enabled()

TypeScript (@playwright/test):

import { test, expect } from '@playwright/test';

test('should handle billing API failure gracefully', async ({ page }) => {
  // Intercept the payment endpoint to simulate a backend crash
  await page.route('**/api/v1/billing/pay', route => route.fulfill({
    status: 500,
    contentType: 'application/json',
    body: JSON.stringify({ error: 'Database Deadlock' }),
  }));
  await page.goto('/checkout');
  await page.getByRole('button', { name: 'Complete Purchase' }).click();
  await expect(page.locator('.error-toast')).toBeVisible();
  // Ensure the UI allows a retry and doesn't freeze
  await expect(page.getByRole('button', { name: 'Complete Purchase' })).toBeEnabled();
});
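To exercise the "Retry" flow itself rather than only the error state, one option is a handler that fails a fixed number of times and then lets traffic through. The sketch below is a minimal illustration under stated assumptions: `fail_first` and its `calls` counter are hypothetical helper names, not part of Playwright, while `route.fulfill` and `route.continue_` are the standard Python route methods.

```python
from typing import Any, Callable


def fail_first(times: int, status: int = 500) -> Callable[[Any], None]:
    """Return a route handler that fails the first `times` requests, then passes through."""
    calls = {"count": 0}

    def handler(route: Any) -> None:
        calls["count"] += 1
        if calls["count"] <= times:
            # Deterministic failure: fixed status and body on every injected error
            route.fulfill(
                status=status,
                content_type="application/json",
                body='{"error": "Injected failure"}',
            )
        else:
            # Later attempts go to the real backend unchanged
            route.continue_()

    # Exposed so a test can assert how many requests the page actually sent
    handler.calls = calls
    return handler


# Usage inside a test (assumes clicking the button again re-sends the request):
#   handler = fail_first(1)
#   page.route("**/api/v1/billing/pay", handler)
#   ...click, assert the toast, click again...
#   assert handler.calls["count"] == 2
```

Counting requests also gives a cheap check that a single click does not fire the payment call twice.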
2. Network Instability (Sudden Offline Mode)
The Risk: SPAs (Single Page Applications) often break when the WebSocket or long-poll connection drops mid-session. Testing offline transitions reveals missing error boundaries.
Scenario: Start a heavy file upload and suddenly turn off the internet.
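Python (pytest-playwright):

from playwright.sync_api import BrowserContext, Page, expect

def test_upload_recovery(page: Page, context: BrowserContext):
    page.goto("/upload")
    # Assumes the app starts uploading as soon as a file is selected
    page.get_by_label("Data Source").set_input_files("heavy_archive.zip")
    # Chaos: drop the network while the upload is in flight
    context.set_offline(True)
    # Verify the system offers a Resume/Retry instead of crashing
    expect(page.get_by_text("Connection lost. Tap to resume")).to_be_visible()
    context.set_offline(False)  # Restore connectivity; recovery assertions go after this

TypeScript (@playwright/test):

import { test, expect } from '@playwright/test';

test('recovery logic on network drop', async ({ page, context }) => {
  await page.goto('/upload');
  // Assumes the app starts uploading as soon as a file is selected
  await page.getByLabel('Data Source').setInputFiles('heavy_archive.zip');
  // Chaos: drop the network while the upload is in flight
  await context.setOffline(true);
  // Asserting UI resilience: check for a resume prompt
  await expect(page.getByText('Connection lost. Tap to resume')).toBeVisible();
  // Restore connectivity; recovery assertions go after this
  await context.setOffline(false);
});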
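If an assertion fails while the context is offline, the network stays down for anything else that reuses that context. A small context manager can guarantee the restore step. This is a sketch only: `offline` is a hypothetical helper name, and the only Playwright call it relies on is `context.set_offline`.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def offline(context: Any) -> Iterator[None]:
    """Take the browser context offline for the duration of the `with` block."""
    context.set_offline(True)
    try:
        yield
    finally:
        # Runs even when an assertion inside the block raises
        context.set_offline(False)


# Usage inside a test:
#   with offline(context):
#       expect(page.get_by_text("Connection lost. Tap to resume")).to_be_visible()
#   # back online here; recovery assertions follow
```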
3. CPU Throttling: Catching Race Conditions
The Risk: Many race conditions stay hidden on high-end developer MacBooks but explode on low-end mobile devices. Slowing down the CPU changes the relative timing of script execution and network responses, which can reorder async callbacks and expose those races.
Python (pytest-playwright):

from playwright.sync_api import Page, expect

def test_race_condition_under_load(page: Page):
    # Emulate a 6x slower CPU via the Chrome DevTools Protocol (CDP); Chromium only
    client = page.context.new_cdp_session(page)
    client.send("Emulation.setCPUThrottlingRate", {"rate": 6})
    page.goto("/dashboard")
    page.get_by_role("button", name="Initialize Data").click()
    # If the SDK hasn't loaded properly due to high CPU load,
    # the status message will fail to update.
    expect(page.locator("#init-status")).to_contain_text("Ready", timeout=10000)

TypeScript (@playwright/test):

import { test, expect } from '@playwright/test';

test('performance race condition under CPU load', async ({ page }) => {
  // Emulate a 6x slower CPU via the Chrome DevTools Protocol (CDP); Chromium only
  const client = await page.context().newCDPSession(page);
  await client.send('Emulation.setCPUThrottlingRate', { rate: 6 });
  await page.goto('/dashboard');
  await page.getByRole('button', { name: 'Initialize Data' }).click();
  await expect(page.locator('#init-status')).toContainText('Ready', { timeout: 10000 });
});
🛡️ Strategy & Constraints: The Professional View
The Cross-Browser Limitation
It is critical to note that CDP-based CPU throttling is Chromium-only. Playwright itself is cross-browser, and request interception (page.route) and offline mode work in all three engines, but CDP sessions are not available on WebKit (Safari) or Firefox. For a consistent cross-browser chaos strategy, you may eventually need to move network-level injections to a proxy rather than the browser driver; CPU throttling has no proxy equivalent and stays a Chromium-only check.
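One way to keep a CPU throttling test from erroring on other engines is to check the engine before opening a CDP session. The helper below is a minimal sketch: `throttle_cpu_if_supported` is a hypothetical name, and it assumes the engine name can be read from `page.context.browser.browser_type.name`, which Playwright reports as "chromium", "firefox", or "webkit".

```python
from typing import Any, Optional


def throttle_cpu_if_supported(page: Any, rate: float = 6) -> Optional[Any]:
    """Apply CDP CPU throttling on Chromium; return None on other engines."""
    browser = page.context.browser  # may be None for some context types
    engine = browser.browser_type.name if browser is not None else None
    if engine != "chromium":
        # Caller decides: skip the test or run it unthrottled
        return None
    client = page.context.new_cdp_session(page)
    client.send("Emulation.setCPUThrottlingRate", {"rate": rate})
    return client


# Usage inside a pytest test:
#   if throttle_cpu_if_supported(page) is None:
#       pytest.skip("CPU throttling requires Chromium")
```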
CI/CD Integration: Avoiding Flakiness
Chaos tests are inherently slower and more volatile. Do not run them on every Pull Request.
- Best Practice: Run a dedicated "Resilience Suite" during nightly builds or pre-release cycles.
- Deterministic Chaos: Use fixed status codes and known latencies to avoid "flaky" failures that lose the team's trust.
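A fixed status code with a known latency can be expressed as a small handler factory. This is a sketch under assumptions: `fixed_response` and its `sleep` parameter are hypothetical names introduced here, and the delay is a plain constant with no randomness. With Playwright's sync API, sleeping inside a handler is likely to pause the test thread as well, so asserting on an intermediate loading state may need the async API instead.

```python
import time
from typing import Any, Callable


def fixed_response(
    status: int,
    delay_s: float = 0.0,
    body: str = "{}",
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[Any], None]:
    """Return a route handler that always waits `delay_s`, then answers with `status`."""

    def handler(route: Any) -> None:
        if delay_s > 0:
            sleep(delay_s)  # constant delay, identical on every run
        route.fulfill(status=status, content_type="application/json", body=body)

    return handler


# Usage inside a test: every billing call takes 2 seconds and returns 503
#   page.route("**/api/v1/billing/pay", fixed_response(503, delay_s=2.0))
```

Because the same inputs always produce the same response, a failure in the nightly run can be reproduced locally with the same two numbers.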
Beyond the UI: Why This Matters
We aren't just checking whether a "toast" appears. The goal is to confirm that a 500 error doesn't leave the backend in a partial state or cost the user their money or data. Because a mocked response never reaches the real backend, pair these UI tests with server-side checks for that part. Chaos Engineering is a shield for your business revenue.