Blue Screen of Death on DAGs

It’s a disconcerting event when, one day, one of your Exchange 2010 SP1 or later DAG members blue-screens. A wide range of emotions accompanies an event like this, including a sense of dread and an “OH NO” feeling that something is seriously wrong. After all, it IS a critical application AND it’s meant to be highly available, and then it blue-screens, or in Microsoft terms, it bugchecks.

After you finish panicking, the server reboots and everything looks fine again, and you could be tempted to write it off. You may not have been there to catch the error code when it blue-screened, and with a bit of luck it was a passing thing: a patch installing, a transient driver error, or a phase of the moon. Denial is great at suggesting options, after all.

And then it happens again, and again, seemingly at random. By this time you may be tempted to brush up your CV-writing skills; fear not, however, there are other new skills that can help.

Configuring a memory dump

 

Your server has suffered a BSOD event, or even a few of them, and you weren’t there to witness them. You may wonder if you have any options at this point. Actually, you do. By default, Windows is configured to write something when it crashes, and that something, depending on your configuration, could be a complete image of memory at the time of the error or a summary, also known as a Mini Dump. You can review these settings under My Computer | Properties | Advanced System Settings | Advanced | Startup and Recovery Settings. On the screen below you are able to choose the level of detail written during a BSOD, aka a bugcheck.

[Screenshot: Startup and Recovery settings, showing the “Write debugging information” options]
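
If you prefer the console to the GUI, the same settings live in the registry under the standard CrashControl key. This is just a quick PowerShell spot check, not a substitute for the dialog above; the value meanings in the comment are the commonly documented ones.

# Read the crash dump configuration from the registry (standard CrashControl key).
# CrashDumpEnabled: 0 = no dump, 1 = complete memory dump, 2 = kernel dump, 3 = small (mini) dump
Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\CrashControl' |
    Select-Object CrashDumpEnabled, DumpFile, MinidumpDir, AutoReboot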

 

Irrespective of what kind of dump you have, let’s assume you have at least one.
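
If you are not sure whether anything was actually written, a quick look in the default dump locations will tell you; adjust the paths if your Startup and Recovery settings point somewhere else.

# Complete / kernel dump (default location)
Test-Path "$env:SystemRoot\MEMORY.DMP"
# Mini dumps, newest first (default location)
Get-ChildItem "$env:SystemRoot\Minidump" -Filter '*.dmp' |
    Sort-Object LastWriteTime -Descending |
    Select-Object Name, LastWriteTime, Length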

 

Tools for reading a memory dump

 

Looking at our server, we can see that there have been a number of BSOD events:

[Screenshot: System event log showing repeated bugcheck events]
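
On a rebooted server the same history can be pulled from the System log, where each bugcheck is normally recorded as event 1001 once the dump has been saved. A rough PowerShell query along these lines should surface them:

# List bugcheck (event 1001) entries from the System log
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 1001 } |
    Where-Object { $_.Message -like '*bugcheck*' } |
    Select-Object TimeCreated, Message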

Now we need some tools, specifically a debugger. I downloaded mine by navigating to the Windows Software Development Kit (SDK) for Windows 8 page on MSDN, running the installer and choosing only the debugger.

Since I’m running 64-bit Windows 8, I chose the 64-bit debugger. At this stage one or two primers may be helpful, so for the sake of getting started and not repeating excellent content I would refer you to read both http://www.petri.co.il/bsod-troubleshooting.htm# and http://blogs.msdn.com/b/ntdebugging/archive/2009/07/27/debugging-a-bugcheck-0xf4.aspx. The first will push you out of the door, so to speak, on your debugging skills, while the second will certainly raise your appreciation for what is possible using the debugger.

Moving along, we launch our new debugger and press CTRL-S to configure the Symbol Search Path as “SRV*c:\symbols*http://msdl.microsoft.com/download/symbols”.

[Screenshot: Symbol Search Path dialog]
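
These GUI steps can also be scripted: WinDbg accepts the symbol path and a dump file on the command line via -y and -z. The install path below is the default for the Windows 8 SDK debugger on my machine and may differ on yours.

# Launch WinDbg with the symbol path pre-configured (-y) and the dump pre-loaded (-z)
$windbg = 'C:\Program Files (x86)\Windows Kits\8.0\Debuggers\x64\windbg.exe'   # adjust to your install
& $windbg -y 'SRV*c:\symbols*http://msdl.microsoft.com/download/symbols' -z 'C:\temp\Minidump\050813-33477-01.dmp'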

Symbols are rather important to the readability of the output. Next, press CTRL-D to open the browse dialog for opening a dump. After finding and opening the dump, the debugger displays a bit of info and then seems to hang while it downloads the required symbols.

When it finishes, the output may be similar to the following:

Microsoft (R) Windows Debugger Version 6.2.9200.20512 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\temp\Minidump\050813-33477-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows 7 Kernel Version 7601 (Service Pack 1) MP (8 procs) Free x64
Product: Server, suite: Enterprise TerminalServer SingleUserTS
Built by: 7601.18113.amd64fre.win7sp1_gdr.130318-1533
Machine Name:
Kernel base = 0xfffff800`01811000 PsLoadedModuleList = 0xfffff800`01a54670
Debug session time: Wed May  8 08:58:14.950 2013 (UTC + 2:00)
System Uptime: 0 days 14:39:17.213
Loading Kernel Symbols
...............................................................
................................................................
..............
Loading User Symbols
Loading unloaded module list
.......
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck F4, {3, fffffa8013070b30, fffffa8013070e10, fffff80001b8d350}

----- ETW minidump data unavailable-----
Probably caused by : wininit.exe

Followup: MachineOwner
---------

Depending on the version of your debugger you may or may not have the !analyze -v hyperlink; however, whether you click the link or type the command into the prompt at the bottom left of the debugger, the result is identical.

4: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

CRITICAL_OBJECT_TERMINATION (f4)
A process or thread crucial to system operation has unexpectedly exited or been
terminated.
Several processes and threads are necessary for the operation of the
system; when they are terminated (for any reason), the system can no
longer function.
Arguments:
Arg1: 0000000000000003, Process
Arg2: fffffa8013070b30, Terminating object
Arg3: fffffa8013070e10, Process image file name
Arg4: fffff80001b8d350, Explanatory message (ascii)

Debugging Details:
------------------

----- ETW minidump data unavailable-----

PROCESS_OBJECT: fffffa8013070b30

IMAGE_NAME:  wininit.exe

DEBUG_FLR_IMAGE_TIMESTAMP:  0

MODULE_NAME: wininit

FAULTING_MODULE: 0000000000000000

PROCESS_NAME:  msexchangerepl

BUGCHECK_STR:  0xF4_msexchangerepl

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT_SERVER

CURRENT_IRQL:  0

LAST_CONTROL_TRANSFER:  from fffff80001c14d22 to fffff80001886c00

STACK_TEXT:
fffff880`08cc5b08 fffff800`01c14d22 : 00000000`000000f4 00000000`00000003 fffffa80`13070b30 fffffa80`13070e10 : nt!KeBugCheckEx
fffff880`08cc5b10 fffff800`01bc108b : ffffffff`ffffffff fffffa80`13b7e2c0 fffffa80`13070b30 fffffa80`1356d590 : nt!PspCatchCriticalBreak+0x92
fffff880`08cc5b50 fffff800`01b41144 : ffffffff`ffffffff 00000000`00000001 fffffa80`13070b30 00000000`00000008 : nt! ?? ::NNGAKEGL::`string'+0x17486
fffff880`08cc5ba0 fffff800`01885e93 : fffffa80`13070b30 fffff880`ffffffff fffffa80`13b7e2c0 00000000`00003548 : nt!NtTerminateProcess+0xf4
fffff880`08cc5c20 00000000`771b15da : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`4b29e718 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x771b15da

STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

FAILURE_BUCKET_ID:  X64_0xF4_msexchangerepl_IMAGE_wininit.exe

BUCKET_ID:  X64_0xF4_msexchangerepl_IMAGE_wininit.exe

Followup: MachineOwner
---------

From the output above we know that the event was an F4 (Critical Process Termination) and that the process directly responsible for causing the crash was “msexchangerepl”, otherwise known as the Microsoft Exchange Replication Service. Remembering that we have multiple parameters associated with the crash, let’s look at the fourth parameter,

Arg4: fffff80001b8d350, Explanatory message (ascii)

which may hold another clue. We will do this by typing dc fffff80001b8d350 at the prompt, which gives us the following output:

4: kd> dc fffff80001b8d350
fffff800`01b8d350  6d726554 74616e69 20676e69 74697263  Terminating crit
fffff800`01b8d360  6c616369 6f727020 73736563 25783020  ical process 0x%
fffff800`01b8d370  25282070 000a2973 90909090 90909090  p (%s)..........
fffff800`01b8d380  61657242 6f202c6b 67492072 65726f6e  Break, or Ignore
fffff800`01b8d390  69622820 00203f29 90909090 90909090   (bi)? .........
fffff800`01b8d3a0  74697243 6c616369 72687420 20646165  Critical thread
fffff800`01b8d3b0  70257830 6e692820 29732520 69786520  0x%p (in %s) exi
fffff800`01b8d3c0  0a646574 90909000 90909090 90909090  ted.............

 

Understanding the cause

 

Let’s recap. We know we’ve had a BSOD or bugcheck on a DAG member, and it appears that we know the cause: It’s Exchange!

Why would Exchange cause an intentional BSOD? It’s by design.

We are able to confirm this by examining the Crimson Channel, which is the Exchange 2010/2013 high availability event log. It can be found in Event Viewer under “Applications and Services Logs”, as both the “High Availability” and “MailboxDatabaseFailureItems” channels.
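
The crimson channel can also be queried from PowerShell. The channel names below are how they appear on my server; treat them as a starting point if your build names them slightly differently.

# Pull the most recent high availability and failure item events from the crimson channel
Get-WinEvent -LogName 'Microsoft-Exchange-HighAvailability/Operational' -MaxEvents 50 |
    Select-Object TimeCreated, Id, Message
Get-WinEvent -LogName 'Microsoft-Exchange-MailboxDatabaseFailureItems/Operational' -MaxEvents 50 |
    Select-Object TimeCreated, Id, Message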

Depending on the nature of the event, we may find ESE events 507, 508, 509 or 510. ESE may have logged these events after discovering that an IO had been outstanding for longer than 4 minutes, in other words a write had not committed to disk for 4 minutes. As a result, the Microsoft Exchange Replication Service (MSExchangeRepl.exe) terminates the wininit.exe process and intentionally forces a BSOD.
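
ESE writes those events to the Application log, so a filtered query around the time of the bugcheck is a quick way to confirm or rule them out; if nothing matches, Get-WinEvent simply reports that no events were found.

# Look for the long-outstanding-IO events from ESE mentioned above
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; ProviderName = 'ESE'; Id = 507,508,509,510 } |
    Select-Object TimeCreated, Id, Message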

Assuming you have the corroborating ESE events, you are now in a position to diagnose your BSOD: there is a strong likelihood that your storage subsystem was given too high an IO load for too long, or that you have a storage failure.
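
As a rough sanity check of the storage layer going forward, you can watch read and write latency on the database and log volumes. The sketch below is a quick sample, not a full performance baseline.

# Sample read and write latency (in seconds) across logical disks for roughly a minute
$counters = '\LogicalDisk(*)\Avg. Disk sec/Write', '\LogicalDisk(*)\Avg. Disk sec/Read'
Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples | Select-Object InstanceName, Path, CookedValue }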

 

If, on the other hand, the operating system hung and the event log service was unable to write the event logs, you may have other issues affecting the operating system. Making the distinction between an OS issue and a storage issue is critical, so I would encourage you to parse the event logs in the Crimson Channel for further detail.

Where to next

 

Understanding why your server bugchecked and knowing what to do about it are not the same thing. So where to next? If you have the corroborating ESE events, you can follow Neil Johnson’s guidance on modifying the timeout values.

Assuming you don’t have any corroborating events, you may still be facing an overburdened Exchange server. You may need to change the distribution of active databases and/or your mailbox distribution pattern. You may also need to look at the health of the OS underpinning Exchange. There may be additional event log items to support your search, specifically AV or virtualization related events.
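
As a starting point, the Exchange Management Shell sketch below shows one way to see where the active copies currently live, along with the rebalancing script that ships with Exchange; DAG01 is a placeholder for your DAG name.

# Which server currently hosts the active (mounted) copy of each database?
Get-MailboxDatabaseCopyStatus * |
    Where-Object { $_.Status -eq 'Mounted' } |
    Sort-Object Name |
    Select-Object Name, Status, CopyQueueLength, ReplayQueueLength

# Exchange ships a script (in the scripts directory) to rebalance active copies by activation preference
.\RedistributeActiveDatabases.ps1 -DagName DAG01 -BalanceDbsByActivationPreference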
