Connectivity Issue Between SQL Server Databases and Windows Failover Cluster

Mir Majeed 20 Reputation points
2025-10-06T14:31:39.9866667+00:00

image1.png.pdf

We are currently experiencing connectivity issues between the Active/Active SQL Server databases and the Windows Failover Cluster. As a result, the cluster is not functioning properly, and all load is currently running on a single node.

We’ve observed that this issue typically occurs when the Trellix security scan runs on the cluster machines. Although we have excluded all SQL Server–related folders from the scan, the error continues to occur on a daily basis.

I’ve attached screenshots of the relevant logs for your review and troubleshooting.

Please advise on the next steps to restore stable cluster operation.

SQL Server Database Engine

Answer recommended by moderator
  1. Jay Pham (WICLOUD CORPORATION) 3,325 Reputation points Microsoft External Staff Moderator
    2025-10-07T08:01:09.6766667+00:00

    Hi @Mir Majeed,

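    Thank you for contacting Microsoft Q&A. You appear to be facing repeated "lease expired" errors in a SQL Server Always On availability group (AG), where the primary replica fails to renew its lease with WSFC within the default 20-second lease timeout. The likely trigger is the daily Trellix scan, which can cause CPU and I/O spikes. Below is a breakdown of possible causes and fixes, based on Microsoft documentation.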

    • Resource strain: high CPU (>90%), disk latency (>5-10 ms), or low memory (paging), e.g., from scans locking files.
    • Communication issues: RPC/shared memory failures or quorum loss.
    • OS interference: hangs, VM throttling, or AV scanning of cluster paths (e.g., C:\Windows\Cluster).
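
    To check whether the resource-strain thresholds above are actually being crossed, one option is to sample a few performance counters on each node while the scan is running. This is only a rough sketch: the counter paths assume an English-locale Windows installation, and the 5-second interval and 12-sample count are arbitrary choices, not values from Microsoft documentation.

    ```powershell
    # Counter paths assume an English-locale Windows install (assumption).
    $counters = @(
        '\Processor(_Total)\% Processor Time',        # compare against the ~90% CPU mark
        '\PhysicalDisk(_Total)\Avg. Disk sec/Read',   # reported in seconds: 0.010 = 10 ms
        '\PhysicalDisk(_Total)\Avg. Disk sec/Write',  # reported in seconds: 0.010 = 10 ms
        '\Memory\Available MBytes'                    # low values suggest memory pressure
    )

    # Take 12 samples, 5 seconds apart (about one minute). Run this during the scan window.
    Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12 |
        ForEach-Object {
            $timestamp = $_.Timestamp
            foreach ($sample in $_.CounterSamples) {
                # Flatten each sample into one row per counter per timestamp.
                [pscustomobject]@{
                    Time    = $timestamp
                    Counter = $sample.Path
                    Value   = [math]::Round($sample.CookedValue, 4)
                }
            }
        } |
        Format-Table -AutoSize
    ```

    Running the same sampling outside the scan window gives a baseline to compare against.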

    I have some solutions you can try:

    1. Share the timestamps of the last 2-3 events (from the AG Dashboard or Event Viewer: Applications and Services Logs > Microsoft > Windows > FailoverClustering > Operational).
    2. Confirm the Trellix scan times and which modules are enabled.
    3. Add AV exclusions. Paths: data, log, and TempDB files; backups; ERRORLOG, .xel, and .trc files; Full-Text, FILESTREAM, and replication directories; C:\Windows\Cluster; CSV roots. Processes: sqlservr.exe, sqlagent.exe, fdhost.exe, fdlauncher.exe, Launchpad.exe (if ML Services is installed), SQLBrowser.exe, clussvc.exe.
    4. Shift full scans to off-peak hours and enable throttling/scan-on-write for the databases. See the Trellix KB: https://kcm.trellix.com/corporate/index?page=content&id=KB67211.
    5. If the events persist while the fixes above are being put in place, extend the lease timeout in Failover Cluster Manager (Roles > AG resource > Properties > LeaseTimeout = 60000 ms > Apply, then take the resource offline and bring it back online). Or use PowerShell: Get-ClusterResource "[AG_Name]" | Set-ClusterParameter -Name LeaseTimeout -Value 60000. Revert to 20000 after 2 stable scans. To monitor, in SSMS go to Always On > right-click the AG > Dashboard.
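
    Steps 1 and 5 can be sketched in PowerShell roughly as follows. This is a sketch under assumptions, not a definitive procedure: 'MyAvailabilityGroup' is a hypothetical placeholder for your AG cluster resource name, the 3-day lookback is an arbitrary choice, and the script is assumed to run in an elevated session on a cluster node with the FailoverClusters module available.

    ```powershell
    Import-Module FailoverClusters

    # Hypothetical placeholder: replace with the name of your AG cluster resource.
    $agResource = 'MyAvailabilityGroup'

    # Step 1: list critical, error, and warning events from the last 3 days of the
    # failover clustering operational log, so timestamps can be compared with scan times.
    Get-WinEvent -FilterHashtable @{
        LogName   = 'Microsoft-Windows-FailoverClustering/Operational'
        Level     = 1, 2, 3                    # 1 = Critical, 2 = Error, 3 = Warning
        StartTime = (Get-Date).AddDays(-3)     # lookback window is an assumption
    } -ErrorAction SilentlyContinue |
        Sort-Object TimeCreated |
        Select-Object TimeCreated, Id, LevelDisplayName, Message

    # Step 5: record the current lease timeout before changing it.
    Get-ClusterResource -Name $agResource |
        Get-ClusterParameter -Name LeaseTimeout

    # Raise the lease timeout to 60000 ms, as suggested in step 5.
    Get-ClusterResource -Name $agResource |
        Set-ClusterParameter -Name LeaseTimeout -Value 60000

    # The new value applies only after the resource is taken offline and brought
    # back online. This interrupts access to the AG databases, so these two lines
    # are left commented out; run them in a maintenance window.
    # Stop-ClusterResource  -Name $agResource
    # Start-ClusterResource -Name $agResource
    ```

    To revert after two stable scans, the same Set-ClusterParameter call can be repeated with -Value 20000.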

    Here are some references I found in the Microsoft documentation:

    I hope this helps you get things back on track quickly! If this answer resolves your issue, feel free to accept or upvote it.


0 additional answers
