Chaos Engineering with Playwright: 3 Practical Methods to Build Resilient Systems

"The tests are green, but production is down."

In 2026, we know that automated E2E tests often miss real outages because they operate in a "happy path" vacuum. They assume low latency, 100% API availability, and unlimited CPU resources. To build truly resilient systems, we must validate Graceful Degradation: how the system behaves when things go wrong.

Playwright offers powerful, low-level APIs to inject chaos without external tools. Here is how to move from "testing features" to "testing resilience."

1. Simulated API Failure (HTTP 500 Injection)

The Risk: If a critical API (e.g., billing or auth) fails, does the user face a white screen of death, or a meaningful "Retry" flow? This isn't just a UI check; it's a safeguard against data inconsistency.

Scenario: Intercept a billing request and return a 500 Internal Server Error.

Python (pytest-playwright):

from playwright.sync_api import Page, expect


def test_billing_resilience(page: Page):
    # Intercept the payment endpoint to simulate a backend crash
    page.route("**/api/v1/billing/pay", lambda route: route.fulfill(
        status=500,
        content_type="application/json",
        body='{"error": "Database Deadlock"}'
    ))

    # Relative URLs assume a base URL is configured (e.g. pytest --base-url)
    page.goto("/checkout")
    page.get_by_role("button", name="Complete Purchase").click()

    # Assert graceful degradation: a toast appears and the user can still retry
    expect(page.locator(".error-toast")).to_be_visible()
    expect(page.get_by_role("button", name="Complete Purchase")).to_be_enabled()

TypeScript (@playwright/test):

import { test, expect } from '@playwright/test';

test('should handle billing API failure gracefully', async ({ page }) => {
  // Intercept the payment endpoint to simulate a backend crash
  await page.route('**/api/v1/billing/pay', route => route.fulfill({
    status: 500,
    contentType: 'application/json',
    body: JSON.stringify({ error: 'Database Deadlock' }),
  }));

  await page.goto('/checkout');
  await page.getByRole('button', { name: 'Complete Purchase' }).click();

  await expect(page.locator('.error-toast')).toBeVisible();
  // Ensure the UI allows a retry and doesn't freeze
  await expect(page.getByRole('button', { name: 'Complete Purchase' })).toBeEnabled();
});
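
A "Retry" flow is only proven when the second attempt can actually succeed. As a minimal sketch, the handler factory below fails the first n matching requests with the same 500 payload used above and lets later ones through to the real backend. The name fail_first_n is a hypothetical helper, not part of Playwright; it relies only on route.fulfill and route.continue_ from the Python API.

```python
from typing import Any, Callable


def fail_first_n(n: int, status: int = 500,
                 body: str = '{"error": "Database Deadlock"}') -> Callable[[Any], None]:
    """Build a route handler that fails the first `n` requests, then passes the rest.

    Hypothetical helper for illustration; `route` is a Playwright Route object.
    """
    calls = {"count": 0}

    def handler(route: Any) -> None:
        calls["count"] += 1
        if calls["count"] <= n:
            # Simulated backend failure with a fixed status and body
            route.fulfill(status=status, content_type="application/json", body=body)
        else:
            # Later attempts reach the real backend unchanged
            route.continue_()

    return handler
```

In a test it would be registered with page.route("**/api/v1/billing/pay", fail_first_n(1)), followed by two clicks on "Complete Purchase". What to assert after the second click depends on the application's success screen.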

2. Network Instability (Sudden Offline Mode)

The Risk: SPAs (Single Page Applications) often break when the WebSocket or long-poll connection drops mid-session. Testing offline transitions reveals missing error boundaries.

Scenario: Start a heavy file upload and suddenly cut the network connection.

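Python (pytest-playwright):

from playwright.sync_api import BrowserContext, Page, expect


def test_upload_recovery(page: Page, context: BrowserContext):
    page.goto("/upload")
    # Assumes the app starts uploading as soon as a file is selected
    page.get_by_label("Data Source").set_input_files("heavy_archive.zip")

    # Chaos: drop the network during the upload stream
    context.set_offline(True)

    # Verify the system offers a Resume/Retry instead of crashing
    expect(page.get_by_text("Connection lost. Tap to resume")).to_be_visible()

    # Restore connectivity; follow with an app-specific assertion to verify recovery
    context.set_offline(False)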
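
TypeScript (@playwright/test):

import { test, expect } from '@playwright/test';

test('recovery logic on network drop', async ({ page, context }) => {
  await page.goto('/upload');
  // Assumes the app starts uploading as soon as a file is selected
  await page.getByLabel('Data Source').setInputFiles('heavy_archive.zip');

  // Chaos: drop the network during the upload stream
  await context.setOffline(true);

  // Asserting UI resilience: Check for a resume prompt
  await expect(page.getByText('Connection lost. Tap to resume')).toBeVisible();

  // Restore connectivity; follow with an app-specific assertion to verify recovery
  await context.setOffline(false);
});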
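
One way to keep the offline window explicit is a small context manager around context.set_offline. This is a sketch under the assumption that later steps in the same test need a live network again; the name offline is a hypothetical helper, not a Playwright API.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def offline(context: Any) -> Iterator[None]:
    """Cut the network inside a `with` block, then always restore it.

    Hypothetical helper; `context` is a Playwright BrowserContext.
    """
    context.set_offline(True)
    try:
        yield
    finally:
        # Runs even when an assertion inside the block raises
        context.set_offline(False)
```

With this in place, the resume-prompt assertion goes inside "with offline(context):" and the recovery assertions follow the block, so the boundary between outage and recovery is visible in the test body.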

3. CPU Throttling: Catching Race Conditions

The Risk: Many race conditions stay hidden on high-end developer MacBooks but explode on low-end mobile devices. By slowing down the CPU, we shift the execution order of async scripts.

Python (pytest-playwright):

import pytest
from playwright.sync_api import Page, expect


@pytest.mark.only_browser("chromium")  # CDP sessions are Chromium-only
def test_race_condition_under_load(page: Page):
    # Emulate a 6x slower CPU via Chrome DevTools Protocol (CDP)
    client = page.context.new_cdp_session(page)
    client.send("Emulation.setCPUThrottlingRate", {"rate": 6})

    page.goto("/dashboard")
    page.get_by_role("button", name="Initialize Data").click()

    # If the SDK hasn't loaded properly due to high CPU load,
    # the status message will fail to update.
    expect(page.locator("#init-status")).to_contain_text("Ready", timeout=10000)

TypeScript (@playwright/test):

import { test, expect } from '@playwright/test';

test('performance race condition under CPU load', async ({ page, browserName }) => {
  test.skip(browserName !== 'chromium', 'CDP sessions are Chromium-only');

  const client = await page.context().newCDPSession(page);
  await client.send('Emulation.setCPUThrottlingRate', { rate: 6 });

  await page.goto('/dashboard');
  await page.getByRole('button', { name: 'Initialize Data' }).click();

  await expect(page.locator('#init-status')).toContainText('Ready', { timeout: 10000 });
});
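
If a test needs to throttle only one phase (for example, page load but not the final assertions), the CDP calls can be wrapped so the rate is always reset. The sketch below assumes a Chromium page; cpu_throttled is a hypothetical helper name. A rate of 1 means no throttling in the CDP Emulation domain.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def cpu_throttled(page: Any, rate: int = 6) -> Iterator[Any]:
    """Slow the CPU by `rate`x inside a `with` block, then reset to full speed.

    Hypothetical helper; `page` is a Playwright Page running in Chromium.
    """
    client = page.context.new_cdp_session(page)
    client.send("Emulation.setCPUThrottlingRate", {"rate": rate})
    try:
        yield client
    finally:
        # Rate 1 disables throttling; then release the CDP session
        client.send("Emulation.setCPUThrottlingRate", {"rate": 1})
        client.detach()
```

Used as "with cpu_throttled(page, rate=6):" around the goto and click steps, it keeps the slowdown scoped to the part of the flow where the race is expected.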

🛡️ Strategy & Constraints: The Professional View

The Cross-Browser Limitation

It is critical to note that CDP, and therefore CPU throttling, is Chromium-only. While Playwright is cross-browser, this low-level emulation won't run on WebKit (Safari) or Firefox; request interception (page.route) and offline mode (setOffline) work in all three engines. For a consistent cross-browser chaos strategy, you may eventually need to move these injections to the network layer (Proxy) rather than the browser driver.
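
The constraint can be written down as a small support table that tests consult before injecting chaos. This is only a sketch: the technique names and the is_supported helper are hypothetical, and the engine names match the values of pytest-playwright's browser_name fixture.

```python
from typing import Dict, FrozenSet

ALL_ENGINES: FrozenSet[str] = frozenset({"chromium", "firefox", "webkit"})

# Hypothetical technique names mapped to the engines that can run them
SUPPORT: Dict[str, FrozenSet[str]] = {
    "api_failure": ALL_ENGINES,                 # page.route
    "offline": ALL_ENGINES,                     # context.set_offline
    "cpu_throttling": frozenset({"chromium"}),  # CDP session required
}


def is_supported(technique: str, browser_name: str) -> bool:
    """Return True when the given engine can run the chaos technique."""
    return browser_name in SUPPORT[technique]
```

A test can then call pytest.skip when is_supported("cpu_throttling", browser_name) is false, instead of failing with a CDP error on Firefox or WebKit.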

CI/CD Integration: Avoiding Flakiness

Chaos tests are inherently slower and more volatile. Do not run them on every Pull Request.

  • Best Practice: Run a dedicated "Resilience Suite" during nightly builds or pre-release cycles.

  • Deterministic Chaos: Use fixed status codes and known latencies to avoid "flaky" failures that lose the team's trust.
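
Deterministic chaos can be as simple as a fixed fault plan that every run installs the same way. In the sketch below, FAULTS, install_faults, and the /api/v1/profile endpoint are assumptions made for illustration; only the billing route comes from the examples above.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical fault plan: URL pattern -> fixed HTTP status, identical on every run
FAULTS: Dict[str, int] = {
    "**/api/v1/billing/pay": 500,
    "**/api/v1/profile": 503,  # illustrative endpoint, not from the article
}


def _fixed_status(status: int) -> Callable[[Any], None]:
    """Build a handler that always answers with the same status and JSON body."""
    def handler(route: Any) -> None:
        route.fulfill(
            status=status,
            content_type="application/json",
            body=json.dumps({"error": f"injected {status}"}),
        )
    return handler


def install_faults(page: Any, faults: Dict[str, int]) -> None:
    """Register one fixed-status route per pattern; no randomness involved."""
    for pattern, status in faults.items():
        page.route(pattern, _fixed_status(status))
```

Tests that call install_faults(page, FAULTS) could carry a custom pytest marker (for example a hypothetical "resilience" marker registered in the pytest config) so the nightly job selects them with pytest -m resilience while Pull Request runs leave them out.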

Beyond the UI: Why This Matters

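We aren't just checking that a "toast" appears. The goal is to confirm that the backend isn't left in a partial state and that the user's money and data aren't lost during a 500 error. Keep in mind that a response mocked with page.route never reaches the real backend, so backend state has to be verified separately, for example with an API-level check after the failed flow. Chaos Engineering is the ultimate shield for your business revenue.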
