[转载]Stack frame omission (FPO) optimization and consequences when debugging

Part1:

During the course of debugging programs, you’ve probably ran into the term “FPO” once or twice. FPO refers to a specific class of compiler optimizations that, on x86, deal with how the compiler accesses local variables and stack-based arguments.

With a function that uses local variables (and/or stack-based arguments), the compiler needs a mechanism to reference these values on the stack. Typically, this is done in one of two ways:
Access local variables directly from the stack pointer (esp). This is the behavior if FPO optimization is enabled. While this does not require a separate register to track the location of locals and arguments, as is needed if FPO optimization is disabled, it makes the generated code slightly more complicated. In particular, the displacement from esp of locals and arguments actually changes as the function is executed, due to things like function calls or other instructions that modify the stack. As a result, the compiler must keep track of the actual displacement from the current esp value at each location in a function where a stack-based value is referenced. This is typically not a big deal for a compiler to do, but in hand written assembler, this can get a bit tricky.
Dedicate a register to point to a fixed location on the stack relative to local variables and and stack-based arguments, and use this register to access locals and arguments. This is the behavior if FPO optimization is disabled. The convention is to use the ebp register to access locals and stack arguments. Ebp is typically setup such that the first stack argument can be found at [ebp+08], with local variables typically at a negative displacement from ebp.

A typical prologue for a function with FPO optimization disabled might look like this:
push   ebp               ; save away old ebp (nonvolatile)
mov    ebp, esp          ; load ebp with the stack pointer
sub    esp, sizeoflocals ; reserve space for locals
...                      ; rest of function

The main concept is that FPO optimization is disabled, a function will immediately save away ebp (as the first operation touching the stack), and then load ebp with the current stack pointer. This sets up a stack layout like so (relative to ebp):
[ebp-01]   Last byte of the last local variable
[ebp+00]   Old ebp value
[ebp+04]   Return address
[ebp+08]   First argument...

Thereafter, the function will always use ebp to access locals and stack based arguments. (The prologue of the function may vary a bit, especially with functions using a variation __SEH_prolog to setup an initial SEH frame, but the end result is always the same with respect to the stack layout relative to ebp.)

This does (as previously stated) make it so that the ebp register is not available for other uses to the register allocator. However, this performance hit is usually not enough to be a large concern relative to a function compiled with FPO optimization turned on. Furthermore, there are a number of conditions that require a function to use a frame pointer which you may hit anyway:
Any function using SEH must use a frame pointer, as when an exception occurs, there is no way to know the displacement of local variables from the esp value (stack pointer) at exception dispatching (the exception could have happened anywhere, and operations like making function calls or setting up stack arguments for a function call modify the value of esp).
Any function using automatic C++ objects with destructors must use SEH for compiler unwind support. This means that most C++ functions end up with FPO optimization disabled. (It is possible to change the compiler assumptions about SEH exceptions and C++ unwinding, but the default [and recommended setting] is to unwind objects when an SEH exception occurs.)
Any function using _alloca to dynamically allocate memory on the stack must use a frame pointer (and thus have FPO optimization disabled), as the displacement from esp for local variables and arguments can change at runtime and is not known to the compiler at compile time when code is being generated.

Because of these restrictions, many functions you may be writing will already have FPO optimization disabled, without you having explicitly turned it off. However, it is still likely that many of your functions that do not meet the above criteria have FPO optimization enabled, and thus do not use ebp to reference locals and stack arguments.

Now that you have a general idea of just what FPO optimization does, I’ll cover cover why it is to your advantage to turn off FPO optimization globally when debugging certain classes of problems in the second half of this series. (It is actually the case that most shipping Microsoft system code turns off FPO as well, so you can rest assured that a real cost benefit analysis has been done between FPO and non-FPO optimized code, and it is overall better to disable FPO optimization in the general case.)

Update: Pavel Lebedinsky points out that the C++ support for SEH exceptions is disabled by default for new projects in VS2005 (and that it is no longer the recommended setting). For most programs built prior to VS2005 and using the defaults at that time, though, the above statement about C++ destructors causing SEH to be used for a function (and thus requiring the use of a frame pointer) still applies.

Part2:
Last time, I outlined the basics as to just what FPO does, and what it means in terms of generated code when you compile programs with or without FPO enabled. This article builds on the last, and lays out just what the impacts of having FPO enabled (or disabled) are when you end up having to debug a program.

For the purposes of this article, consider the following example program with several do-nothing functions that shuffle stack arguments around and call eachother. (For the purposes of this posting, I have disabled global optimizations and function inlining.)
__declspec(noinline)
void
f3(
   int* c,
   char* b,
   int a
   )
{
   *c = a * 3 + (int)strlen(b);

__debugbreak();
}

__declspec(noinline)
int
f2(
   char* b,
   int a
   )
{
   int c;

   f3(
      &c,
      b + 1,
      a - 3);

return c;
}

__declspec(noinline)
int
f1(
   int a,
   char* b
   )
{
   int c;

   c = f2(
      b,
      a + 10);

c ^= (int)rand();

return c + 2 * a;
}

int
__cdecl
wmain(
   int ac,
   wchar_t** av
   )
{
   int c;

   c = f1(
      (int)rand(),
      "test");

printf("%d\n",
c);

return 0;
}

If we run the program and break in to the debugger at the hardcoded breakpoint, with symbols loaded, everything is as one might expect:
0:000> k
ChildEBP RetAddr
0012ff3c 010015ef TestApp!f3+0x19
0012ff4c 010015fe TestApp!f2+0x15
0012ff54 0100161b TestApp!f1+0x9
0012ff5c 01001896 TestApp!wmain+0xe
0012ffa0 77573833 TestApp!__tmainCRTStartup+0x10f
0012ffac 7740a9bd kernel32!BaseThreadInitThunk+0xe
0012ffec 00000000 ntdll!_RtlUserThreadStart+0x23

Regardless of whether FPO optimization is turned on or off, since we have symbols loaded, we’ll get a reasonable call stack either way. The story is different, however, if we do not have symbols loaded. Looking at the same program, with FPO optimizations enabled and symbols not loaded, we get somewhat of a mess if we ask for a call stack:
0:000> k
ChildEBP RetAddr
WARNING: Stack unwind information not available.
Following frames may be wrong.
0012ff4c 010015fe TestApp+0x15d8
0012ffa0 77573833 TestApp+0x15fe
0012ffac 7740a9bd kernel32!BaseThreadInitThunk+0xe
0012ffec 00000000 ntdll!_RtlUserThreadStart+0x23

Comparing the two call stacks, we lost three of the call frames entirely in the output. The only reason we got anything slightly reasonable at all is that WinDbg’s stack trace mechanism has some intelligent heuristics to guess the location of call frames in a stack where frame pointers are used.

If we look back to how call stacks are setup with frame pointers (from the previous article), the way a program trying to walk the stack on x86 without symbols works is by treating the stack as a sort of linked list of call frames. Recall that I mentioned the layout of the stack when a frame pointer is used:
[ebp-01]   Last byte of the last local variable
[ebp+00]   Old ebp value
[ebp+04]   Return address
[ebp+08]   First argument...

This means that if we are trying to perform a stack walk without symbols, the way to go is to assume that ebp points to a “structure” that looks something like this:
typedef struct _CALL_FRAME
{
struct _CALL_FRAME* Next;
void* ReturnAddress;
} CALL_FRAME, * PCALL_FRAME;

Note how this corresponds to the stack layout relative to ebp that I described above.

A very simple stack walk function designed to walk frames that are compiled with frame pointer usage might then look like so (using the _AddressOfReturnAddress intrinsic to find “ebp”, assuming that the old ebp is 4 bytes before the address of the return address):
LONG
StackwalkExceptionHandler(
   PEXCEPTION_POINTERS ExceptionPointers
   )
{
   if (ExceptionPointers->ExceptionRecord->ExceptionCode
      == EXCEPTION_ACCESS_VIOLATION)
      return EXCEPTION_EXECUTE_HANDLER;

return EXCEPTION_CONTINUE_SEARCH;
}

void
stackwalk(
   void* ebp
   )
{
   PCALL_FRAME frame = (PCALL_FRAME)ebp;

printf("Trying ebp %p\n",
ebp);

   __try
   {
      for (unsigned i = 0;
          i < 100;
          i++)
      {
         if ((ULONG_PTR)frame & 0x3)
         {
            printf("Misaligned frame\n");
            break;
         }

         printf("#%02lu %p [@ %p]\n",
            i,
            frame,
            frame->ReturnAddress);

         frame = frame->Next;
      }
   }
   __except(StackwalkExceptionHandler(
      GetExceptionInformation()))
   {
      printf("Caught exception\n");
   }
}

#pragma optimize("y", off)
__declspec(noinline)
void printstack(
   )
{
   void* ebp = (ULONG*)_AddressOfReturnAddress()
     - 1;

stackwalk(
ebp);
}
#pragma optimize("", on)

If we recompile the program, disable FPO optimizations, and insert a call to printstack inside the f3 function, the console output is something like so:
Trying ebp 0012FEB0
#00 0012FEB0 [@ 0100185C]
#01 0012FED0 [@ 010018B4]
#02 0012FEF8 [@ 0100190B]
#03 0012FF2C [@ 01001965]
#04 0012FF5C [@ 01001E5D]
#05 0012FFA0 [@ 77573833]
#06 0012FFAC [@ 7740A9BD]
#07 0012FFEC [@ 00000000]
Caught exception

In other words, without using any symbols, we have successfully performed a stack walk on x86.

However, this all breaks down when a function somewhere in the call stack does not use a frame pointer (i.e. was compiled with FPO optimizations enabled). In this case, the assumption that ebp always points to a CALL_FRAME structure is no longer valid, and the call stack is either cut short or is completely wrong (especially if the function in question repurposed ebp for some other use besides as a frame pointer). Although it is possible to use heuristics to try and guess what is really a call/return address record on the structure, this is really nothing more than an educated guess, and tends to be at least slightly wrong (and typically missing one or more frames entirely).

Now, you might be wondering why you might care about doing stack walk operations without symbols. After all, you have symbols for the Microsoft binaries that your program will be calling (such as kernel32) available from the Microsoft symbol server, and you (presumably) have private symbols corresponding to your own program for use when you are debugging a problem.

Well, the answer to that is that you will end up needing to record stack traces without symbols in the course of normal debugging for a wide variety of problems. The reason for this is that there is a lot of support baked into NTDLL (and NTOSKRNL) to assist in debugging a class of particularly insidious problems: handle leaks (and other problems where the wrong handle value is getting closed somewhere and you need to find out why), memory leaks, and heap corruption.

These (very useful!) debugging features offer options that allow you to configure the system to log a stack trace on each heap allocation, heap free, or each time a handle is opened or closed. Now the way these features work is that they will capture the stack trace in real time as the heap operation or handle operation happens, but instead of trying to break into the debugger to display the results of this output (which is undesirable for a number of reasons), they save a copy of the current stack trace in-memory and then continue execution normally. To display these saved stack traces, the !htrace, !heap -p, and !avrf commands have functionality that locates these saved traces in-memory and prints them out to the debugger for you to inspect.

However, NTDLL/NTOSKRNL needs a way to create these stack traces in the first place, so that it can save them for later inspection. There are a couple of requirements here:
The functionality to capture stack traces must not rely on anything layed above NTDLL or NTOSKRNL. This already means that anything as complicated as downloading and loading symbols via DbgHelp is instantly out of the picture, as those functions are layered far above NTDLL / NTOSKRNL (and indeed, they must make calls into the same functions that would be logging stack traces in the first place in order to find symbols).
The functionality must work when symbols for everything on the call stack are not even available to the local machine. For instance, these pieces of functionality must be deployable on a customer computer without giving that computer access to your private symbols in some fashion. As a result, even if there was a good way to locate symbols where the stack trace is being captured (which there isn’t), you couldn’t even find the symbols if you wanted to.
The functionality must work in kernel mode (for saving handle traces), as handle tracing is partially managed by the kernel itself and not just NTDLL.
The functionality must use a minimum amount of memory to store each stack trace, as operations like heap allocation, heap deallocation, handle creation, and handle closure are extremely frequent operations throughout the lifetime of the process. As a result, options like just saving the entire thread stack for later inspection when symbols are available cannot be used, since that would be prohibitively expensive in terms of memory usage for each saved stack trace.

Given all of these restrictions, the code responsible for saving stack traces needs to operate without symbols, and it must furthermore be able to save stack traces in a very concise manner (without using a great deal of memory for each trace).

As a result, on x86, the stack trace saving code in NTDLL and NTOSKRNL assumes that all functions in the call frame use frame pointers. This is the only realistic option for saving stack traces on x86 without symbols, as there is insufficient information baked into each individual compiled binary to reliably perform stack traces without assuming the use of a frame pointer at each call site. (The 64-bit platforms that Windows supports solve this problem with the use of extensive unwind metadata, as I have covered in a number of past articles.)

So, the functionality exposed by pageheap’s stack trace logging, and handle tracing are how stack traces without symbols end up mattering to you, the developer with symbols for all of your binaries, when you are trying to debug a problem. If you make sure to disable FPO optimization on all of your code, then you’ll be able to use tools like pageheap’s stack tracing on heap operations, UMDH (the user mode heap debugger), and handle tracing to track down heap-related problems and handle-related problems. The best part of these features is that you can even deploy them on a customer site without having to install a full debugger (or run your program under a debugger), only later taking a minidump of your process for examination in the lab. All of them rely on FPO optimizations being disabled (at least on x86), though, so remember to turn FPO optimizations off on your release builds for the increased debuggability of these tough-to-find problems in the field.

posted on 2009-10-29 18:01 大熊的口袋阅读(855) 评论(0) 编辑收藏引用所属分类: cpp

只有注册用户登录后才能发表评论。


相关文章: TRACE改进版（静态tls在dll中使用的问题）使用 C++ 编写内核模式驱动程序的优点与缺点 vc下格式化输出调试诊断信息（TRACE） [转载]Stack frame omission (FPO) optimization and consequences when debugging vc6下支持格式化输出的调试输出功能的简单实现

网站导航: 博客园博客园最新博文博问管理

大熊的口袋

[转载]Stack frame omission (FPO) optimization and consequences when debugging

导航

统计

公告

常用链接

留言簿(2)

随笔分类

随笔档案

win32 & debug

搜索

积分与排名

最新评论

阅读排行榜

评论排行榜