Simon Green, NVIDIA
The long-awaited EXT_framebuffer_object extension was released last month. It is the result of the efforts of the uberbuffers workgroup, and although it does not currently contain all of the functionality promised by uberbuffers, it does include much of it, most importantly a much-improved render-to-texture capability. It is also designed to have additional functionality (e.g. rendering to vertex attributes) built on top of it.
By way of review, render-to-texture is used for procedural textures, reflections, and various multipass techniques (e.g. anti-aliasing, motion blur, depth of field, image processing, GPGPU). It allows you to have the contents of the framebuffer placed directly into a texture. The advantage of this is that it avoids a copy and uses less memory (since there is only one copy of the image). However, in practice, the driver may have to perform a copy anyway, due to using separate memory for the framebuffer and textures, or because of different internal representations.
Pbuffers were introduced as a solution for render-to-texture. They are designed for off-screen rendering. The available formats for pbuffers are determined using the ChoosePixelFormat()/DescribePixelFormat() APIs.
Pbuffers come with a number of major drawbacks. Each requires its own OpenGL rendering context (though of course, texture objects and display lists can be shared between contexts using wglShareLists()). Besides being painful to manage and bug-prone, switching between pbuffers requires an expensive context switch. In addition, each has its own depth, stencil, and aux buffers, which are often not necessary.
The Windows-specific WGL_ARB_render_texture was built on top of pbuffers to improve their functionality. It allows you to bind the color or depth buffer of a pbuffer directly to a texture (using wglBindTexImageARB() and wglReleaseTexImageARB()). When using this extension, the format of the texture is determined by the pixel format of the pbuffer. In order to write portable applications, you need to have a separate pbuffer for each renderable texture. To make this slightly more efficient, you can bind the front and back buffers of the pbuffer to separate textures (using glDrawBuffer(GL_FRONT/GL_BACK)), giving you two textures per buffer/context.
Besides the previously listed limitations of existing render-to-texture methods, using them with antialiasing is complicated at best. Render-to-texture doesn't work with multisampling, because current hardware can't read from a multisample buffer into a texture (though a slow copy could be done in the driver). One solution to this is creating a normal multisampled pbuffer and using glCopyTexImage(), which will downsample automatically. The application can also do supersampling itself if necessary for post-processing effects, but this will be much more expensive.
The framebuffer object (FBO) extension presents a better and simpler method of doing render-to-texture. Its advantages include:
- It only requires a single GL context, so switching between framebuffers is faster than switching between pbuffers.
- It doesn't require the complex pixel format selection that pbuffers use, since the format of the framebuffer is determined by the texture or renderbuffer format. This puts the burden of finding compatible formats on developers.
- It's much more similar to D3D's render target model, making code porting easier.
- Renderbuffer images and texture images can be shared among framebuffers (for example, sharing a single depth buffer between multiple color targets), resulting in memory savings.
So what exactly does the framebuffer object extension entail? You'll recall that in OpenGL, the framebuffer is a collection of logical buffers, including the color, depth, stencil, and accumulation buffers. With the FBO extension, you can render to destinations other than those provided by the window system, allowing you to render-to-texture in a window system independent manner (notice that this is a GL extension, not a WGL extension). These destinations are known as "framebuffer-attachable images" (2D arrays of pixels that can be attached to a framebuffer). They can be off-screen buffers (Renderbuffers) or texture images. Each texture image in a texture object can be attached to a buffer in a framebuffer object. Renderbuffer images can be attached as well.
Two new objects have been added to OpenGL as part of this extension. The first, framebuffer objects, consist of a collection of framebuffer-attachable images. They are equivalent to a window system drawable. The second, renderbuffer objects, can be rendered to but can't be used as texture images.
Managing FBOs (as well as renderbuffers) is similar to managing texture objects. They are created and deleted with glGenFramebuffersEXT() and glDeleteFramebuffersEXT(), and they are bound with glBindFramebufferEXT(). When an FBO is bound, its attached images are the source and destination for fragment operations. Binding an FBO id of 0 causes operations to occur on the default framebuffer.
Textures can be attached to a framebuffer using glFramebufferTexture2DEXT() (or the 1D or 3D variations). This API allows you to attach images from a texture object to one of the logical buffers of the currently bound framebuffer.
When using renderbuffers, glRenderbufferStorageEXT() is used to define the format and dimensions of the renderbuffer. This call is similar to glTexImage(), but doesn't include any image data. The contents of a renderbuffer can be read or written using glReadPixels()/glDrawPixels(), or similar APIs. A renderbuffer can be attached to a framebuffer by using glFramebufferRenderbufferEXT().
Finally, you can automatically generate mipmaps for texture images attached to a target by using glGenerateMipmapEXT().
Similar to textures, framebuffers define a "completeness" that must be satisfied when rendering in order to work properly. A framebuffer is considered complete if the following conditions are met:
- The texture formats make sense for the attachment points (i.e. don't try to attach a depth texture to a color attachment).
- All attached images have the same width and height.
- All images attached to COLOR_ATTACHMENTn_EXT have the same format.
Since completeness can be implementation dependent, the extension includes the ability to check the framebuffer status via glCheckFramebufferStatusEXT(). If it is incomplete, an error code will indicate why. If the return value is GL_FRAMEBUFFER_UNSUPPORTED_EXT, you should keep trying different format combinations until you find one that works.
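The three conditions listed above can be modeled on the CPU as a quick sanity check before touching the driver. The sketch below only illustrates those listed rules and is not the driver's actual logic: the `Attachment` record, the attachment-point strings, the format labels, and the `is_complete` helper are all hypothetical names, and treating an FBO with nothing attached as incomplete is an assumption of this sketch. A real application would still call glCheckFramebufferStatusEXT(), since completeness can be implementation dependent.

```python
from dataclasses import dataclass


# Hypothetical CPU-side description of one framebuffer-attachable image.
@dataclass(frozen=True)
class Attachment:
    point: str   # illustrative names: "color0", "color1", "depth", "stencil"
    kind: str    # what the image format holds: "color", "depth", or "stencil"
    fmt: str     # format label, e.g. "RGBA8" (labels are illustrative)
    width: int
    height: int


def is_complete(attachments):
    """Return True if the attachments satisfy the three rules listed above."""
    if not attachments:
        return False  # assumption: an FBO with nothing attached is unusable
    # Rule 1: the image format must make sense for its attachment point.
    for a in attachments:
        expected = "color" if a.point.startswith("color") else a.point
        if a.kind != expected:
            return False
    # Rule 2: all attached images share the same width and height.
    if len({(a.width, a.height) for a in attachments}) != 1:
        return False
    # Rule 3: all color attachments share the same format.
    color_formats = {a.fmt for a in attachments if a.point.startswith("color")}
    return len(color_formats) <= 1
```

For example, a 256x256 color texture paired with a 256x256 depth renderbuffer passes this model, while adding a second color attachment of a different size or format does not.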
To get optimal performance from FBOs, use the following tips:
- Don't create and destroy them every frame (just like other objects, but worse)
- Try to avoid modifying textures used as rendering destinations using glTexImage(), glCopyTexImage(), etc.
In order to render to only the depth and stencil buffers (for example, when using shadow volumes or shadow maps), you have to set the draw and read buffers to GL_NONE.
When using multiple rendering destinations, there are several ways to switch between them. The following are listed in order of increasing performance. Note that these may vary by platform, and because the extension is very immature, these may change over time.
- Using multiple FBOs, with a separate FBO for each texture, switching between them using glBindFramebufferEXT(). This is the most straightforward approach, and it is at least twice as fast as having to switch contexts.
- Using a single FBO with multiple texture attachments. This requires that the textures have the same format and dimensions. Textures can be switched between using glFramebufferTexture(). This is slightly more lightweight than using multiple FBOs, so it may be faster.
- Using a single FBO with multiple texture attachments and attaching the textures to different color attachments. glDrawBuffer() is used to switch between the attachments.
Framebuffer objects are currently an EXT extension, but they will almost certainly be promoted to ARB status once the design is proven. The ARB welcomes feedback on the extension from developers. The following additional functionality is likely to be built on top of this extension in the future:
- Render to vertex attribute (similar to VBO w/ PBO extensions on NVIDIA hardware). This is useful for particle systems and other applications. This functionality will likely be built on top of renderbuffers.
- The addition of format groups. Similar to pixel formats, these define groups of formats that work together for a given implementation.
- Multisampling and accumulation buffer support.
The framebuffer object extension is currently available in beta drivers from both NVIDIA and ATI. It will be fully supported on NV30 and R300 and later, and possibly on NV1x and R200 as well.
OpenGL 2.0 and New Extensions
Cass Everitt, NVIDIA
The OpenGL 2.0 spec was released last year at SIGGRAPH, but it is still relatively new in that it isn't supported in commercial drivers yet. The most noteworthy addition in 2.0 was the adoption of the OpenGL Shading Language, but it also promoted multiple draw buffers (being able to draw to more than one target per fragment), non-power-of-two textures, point sprites, separate stencil (i.e. two sided stencil, so that the stencil buffer gets updated differently for front and back facing polygons), and separate blend functions and equations (RGB vs. alpha) to the core.
The OpenGL Shading Language adds shader objects, promoted from the ARB_shader_objects extension, to manage shader and program objects, and shader programs, promoted from ARB_vertex_shader/ARB_fragment_shader. The presence of the shading language is advertised via ARB_shading_language_100 and a compliant implementation must support version 1.10. In promoting the language to the core, the GLhandleARB type was replaced with GLuint.
The ability to write to multiple color buffers at once is based on the ATI_draw_buffers extension. This addition can reduce the number of rendering passes, as it allows a fragment to output more than just four values.
With non power-of-two textures, the requirement that texture dimensions be a power-of-two has been relaxed.
Point sprites were promoted from the ARB_point_sprite extension. They provide points with varying texture coordinates, which are useful for particle systems. An issue with point sprites in the past has been that their orientation was different from most OpenGL images (origin at the top left rather than lower left) due to this feature being imported from Direct3D. Now, the orientation can be changed by using GL_POINT_SPRITE_COORD_ORIGIN with GL_UPPER_LEFT or GL_LOWER_LEFT.
Two-sided stencil allows you to set stencil operation separately depending on whether a primitive is front- or back-facing. This allows you to do stencil shadow volumes in one pass instead of two.
The EXT_blend_func_separate and EXT_blend_equation_separate extensions were promoted to the core to allow for more flexible blending.
There were many other features that were considered for promotion to the core in 2.0 that didn't make it. These may appear in future updates.
- ARB_vertex_program and ARB_fragment_program are widely used, but there isn't much interest in making them part of the core. They will likely live on as ARB extensions for quite some time.
- Pixel buffer objects (now ARB_pixel_buffer_object) use the same APIs as VBOs, using GL_PIXEL_PACK_BUFFER and GL_PIXEL_UNPACK_BUFFER binding points. They allow for more efficient texture downloads and asynchronous read-back.
- Floating point throughout the pipeline, allowing for more general purpose use.
- ARB_color_buffer_float adds pixel formats/visuals with floating point RGBA color components. The extension includes the ability to enable clamping for various operations, including vertex colors (after lighting), fragment color, and pixel reads.
- ARB_texture_float adds floating point internal formats. It also supports queries to determine component type.
- ARB_half_float_pixel defines an external format for fp16 pixel data from the CPU.
- ARB_texture_rectangle has no power-of-two constraints, but it also doesn't support mipmapping, repeat, or borders, and texture coordinates are not normalized (0..w, 0..h). Although it has more restricted functionality than ARB_texture_non_power_of_two, it is more widely supported in hardware (back to GeForce 256).
- ARB_fragment_program_shadow allows fragment programs to fetch shadow map values. It adds new texture targets for shadow map comparisons in fragment programs. To use this, textures must have depth format and have depth comparison mode set.
- ARB_texture_mirrored_repeat is useful for attenuation/spotlight maps, and can also help by reducing the size of symmetric textures.
The ARB wants to know where developers want them to focus efforts, i.e. on demos/whitepapers, by revising the spec more frequently, holding developer conferences, etc. Be sure to let them know what you think.
OpenGL Shading Language
Bill Licea-Kane, ATI Technology
When the OpenGL Shading Language was promoted from an ARB extension to the core, a number of changes were made to it. The GLhandleARB type was removed and replaced with GLuint. The word "object" was dropped from the APIs (so ShaderObject is now just Shader, ProgramObject is Program, etc.). You can now pass GL_SHADING_LANGUAGE_VERSION to glGetString() to determine the shading language version. Advertising the version via the extension string has been frozen (i.e. implementations will never report anything other than ARB_shading_language_100).
Several new preprocessor directives were added. #version lets you set what version of the shading language your program was written for (the default is 110). #extension lets you control how extensions to the language are used. The usage is #extension name : behavior where name is the name of the extension, and behavior is require, enable, warn, or disable. The default is #extension all : disable (which means assume 2.0 shading language with no extensions).
For convenience, more derived matrix states have been added (inverse, transpose, inversetranspose).
Function calls that include arrays have been changed to require that the number of elements in the array be declared. The number of parameters is considered part of the function's signature and thus can be used for overloading.
Constructors have been simplified in a way that won't even be noticed by most people. If you try to pass too many arguments, an error will be generated. You can also use combinations of compound data types to initialize values.
gl_FragData[n] has been (re)added. You can now use gl_FragColor to write to all buffers, or gl_FragData[n] to write to multiple draw buffers, but you can't use both in the same shader.
A couple of new built-in functions were added, including refract, and some existing functions were fixed (e.g. defining the domain of exponentials, and fixing step to step at the edge).
Shadows have been clarified to be the same as ARB_fragment_program.
With the release of the 2.0 spec, several new reservations were made. For variables and derived data types, gl_ is reserved for GL, and gl_VEN (where VEN is vendor specific (NV, ATI, etc.), EXT, OML, OES, GL2, or ARB) is reserved for extensions. For keywords, __ is reserved to GL and __VEN is reserved for extensions. In order to overload an operator, ARB approval is required.
Simon Green, NVIDIA; Cass Everitt, NVIDIA; Evan Hart, ATI; Bill Licea-Kane, ATI
Should a new branch of OpenGL that does not maintain backwards compatibility be created? (ala DX versions and original OpenGL 2.0 proposal)
Evan Hart: There are names that don't make sense any more and state that doesn't really need to be maintained that we may be able to get rid of. OpenGL ES is a good example of getting rid of things that aren't really useful while maintaining the core features of OpenGL.
Cass Everitt: Something to be said for the fact that applications written 10 or more years ago can still be compiled effortlessly. I don't feel that breaking backwards compatibility is likely to happen. OpenGL ES is a good example of how to handle needing to have a lighter-weight version of OpenGL when needed.
Simon Green: With Direct3D, almost all of the applications written are games, so maintaining backwards compatibility isn't a concern. With OpenGL, you have more than just games to consider.
Will there be an instancing API for OpenGL?
Bill Licea-Kane: The ARB's answer was that OpenGL already has this: Immediate Mode.
Cass Everitt: I'm in the camp that thinks that the OpenGL model already handles the issues related to instancing pretty well (better than D3D) so an extension for this would not be likely to improve performance. Further, the existing shader model – which already handles this pretty well – will probably evolve to handle it even better in the future.
Any work on an effects framework for OpenGL?
Bill Licea-Kane: ARB has been discussing an open effects framework. Right now, COLLADA will probably be well-suited to this.
Simon Green: NVIDIA is still supporting CgFX, which is going to be updated soon. It can be used with OpenGL across multiple hardware platforms.
Have you thought about non-RGB pixel formats (i.e. video formats)?
Simon Green: Nothing currently in the works.
Image Processing Tricks in OpenGL
Simon Green, NVIDIA
Image processing in games is becoming increasingly important as games become more like movies. Much of the "look" of a game is determined in post-processing (color correction, blurs, depth of field, motion blur, etc.). Image processing is also important for offline tools (pre-processing such as lightmaps, texture compression).
Image histograms give the frequency of occurrence for each intensity level in an image (useful in image analysis and HDR tone mapping algorithms). OpenGL includes histogram functionality, but it isn't widely supported in hardware. As an alternative, you can calculate histograms using multiple passes and occlusion queries. The algorithm for this is as follows:
- Render scene to texture
- For each bucket in histogram:
  - Begin occlusion query
  - Draw quad using fragment program that discards fragments outside the bucket
  - End occlusion query
  - Count fragments
- Process histogram
This approach requires n passes for n buckets.
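As a rough CPU-side model of that loop (not GPU code), the sketch below treats the rendered texture as a flat list of intensities in [0, 1). Each "pass" counts the pixels that survive the discard test, which stands in for the sample count an occlusion query would report. The function names and the equal-width bucket layout are assumptions made for illustration.

```python
def count_bucket(pixels, lo, hi):
    """One 'pass': count the fragments that are not discarded (lo <= p < hi).

    This stands in for the sample count an occlusion query would return.
    """
    return sum(1 for p in pixels if lo <= p < hi)


def histogram_multipass(pixels, n_buckets):
    """Build a histogram with one pass per bucket over intensities in [0, 1)."""
    width = 1.0 / n_buckets
    counts = []
    for b in range(n_buckets):
        lo = b * width
        # Pin the top edge to 1.0 so rounding in (b + 1) * width cannot drop pixels.
        hi = 1.0 if b == n_buckets - 1 else (b + 1) * width
        counts.append(count_bucket(pixels, lo, hi))
    return counts
```

Each call to `count_bucket` walks every pixel, which mirrors the cost noted above: the whole image is processed once per bucket.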
OpenGL Performance Tuning: OpenGL Performance in a Shader-Centric World
Evan Hart, ATI Technology
Shaders are being widely used because of the better visual quality they offer, because they let you do exactly what you want to do, and because they allow you to offload more to the GPU. So making sure that they are performing optimally is critical to overall performance.
In general, performance is analyzed and improved by finding the bottleneck, balancing performance, and repeating.
You can find the bottleneck by reducing the workload of different stages. If you reduce the workload for a particular stage and performance doesn't change, then you know that that stage isn't the bottleneck. If it does change, you can consider permanently reducing the workload, or determining what else you can do while the system waits for the bottleneck.
Pixel bottlenecks are the easiest to detect since you can simply change the resolution and see how it affects performance. Pixel bottlenecks can be caused by many things. If memory bandwidth is the issue, then disabling blending or reducing texture bit depth can help. If the shader itself is the issue, using a simplified shader can isolate the problem. If texture filtering is the bottleneck, disabling trilinear or anisotropic filtering can identify the problem.
Vertex bottlenecks are harder to detect, and less frequent. To identify them, render only half the triangles of each object, and reduce the complexity of the vertex shader. If both scale performance, then you definitely have a vertex bottleneck. If reducing the triangle count scales performance but a simplified vertex shader does not, then you have a submission/fetch bottleneck.
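The decision rules for the two vertex experiments can be written down as a small helper. This is only a sketch of the reasoning in the paragraph above: the function names, the use of frame times in milliseconds, and the 10% threshold for deciding that an experiment "scaled performance" are assumptions, and the case where only the simpler shader helps is not covered by the talk, so it is reported as inconclusive.

```python
def improved(baseline_ms, experiment_ms, threshold=0.10):
    """True if the experiment cut frame time by more than `threshold` (a fraction)."""
    return (baseline_ms - experiment_ms) / baseline_ms > threshold


def classify_vertex_experiments(baseline_ms, half_triangles_ms, simple_shader_ms):
    """Apply the decision rules from the paragraph above to three frame times."""
    fewer_tris_helped = improved(baseline_ms, half_triangles_ms)
    simpler_shader_helped = improved(baseline_ms, simple_shader_ms)
    if fewer_tris_helped and simpler_shader_helped:
        return "vertex bottleneck"
    if fewer_tris_helped:
        return "submission/fetch bottleneck"
    if simpler_shader_helped:
        return "inconclusive"  # not covered by the rules in the text
    return "not vertex-limited"
```

For instance, a frame that drops from 20 ms to 12 ms with half the triangles but stays near 20 ms with a simplified vertex shader would be classified as submission/fetch bound.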
CPU bottlenecks are the most common today on high-end systems. Profilers like VTune® can help determine API versus application time. If you are application limited, then you can increase rendering quality and make your game prettier since it is essentially free.
To improve the performance of your shaders, there are many things you can do.
Keep floating point basics in mind. Operations aren't associative – (a * b) * c != a * (b * c) – which limits the ability of a compiler to reorder code. GPU compilers are likely to be fairly aggressive, but do not rely on extreme reordering.
Remember hardware limitations. Falling back to software is the ultimate performance penalty (based on the OpenGL philosophy of always being correct). Dynamic addressing is not available everywhere, and vertex texturing is limited, so be careful when using these to avoid falling back to software.
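To make the point about associativity concrete, here is a small sketch showing a product whose result depends on grouping. It uses Python's 64-bit floats purely for illustration; shader hardware typically works in fp32 or fp16, where the same effect appears at different magnitudes. The values and the helper name are chosen for this example and do not come from the talk.

```python
import math

a, b, c = 1e308, 10.0, 0.1

left = (a * b) * c    # a * b overflows to infinity before c can scale it back
right = a * (b * c)   # b * c is evaluated first, so the product stays finite


def regroup_changes_result(x, y, z):
    """True if regrouping the product x * y * z changes the floating-point result."""
    return (x * y) * z != x * (y * z)
```

Because the two groupings can differ this much, a compiler that must preserve results cannot freely reorder the multiplications, which is why such reordering should not be relied on.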
Try to balance your three major resources: computation, texturing, and interpolation. For example, if you're doing too much computation in a fragment shader, you may be able to replace it with a texture lookup or an interpolant. When balancing resource usage, keep the following hints in mind:
- Bandwidth is scarce. A simple texture fetch can consume up to 32 GB/s
- Bias toward ALU operations since they don't consume bandwidth. MAD is cheaper than sqrt (in general). This tip will become increasingly valuable moving forward.
- Using more varyings is almost always better, because it moves computation back to the vertex shader
As much as possible, you should strive to write clear code. Compilers are pretty good today, and GPU compilers are getting much better. Convoluted code may actually hurt the compiler's ability to optimize.
Shader conditionals are a powerful feature, but they come with complex performance implications. They can increase the instruction count. Don't assume that it'll be faster to skip work, since conditionals have a setup cost, and multiple paths may have to be executed anyway.
Make use of the const qualifier whenever you can. You should always use const with in. Don't use inout if you really mean just in or just out. Compile time constants are highly efficient and can be easily optimized, whereas uniforms require JIT optimizations. It's often more efficient to use multiple shaders rather than having a shader that branches based on a uniform that only has a few possible values.
Reducing the register pressure can help, but it's less important than it was in the past, due to better compilers and hardware evolution. Still, you should try to minimize array sizes and use types that you really need, since compilers can take advantage of smaller types.
Utilize vector instructions, since they help the compiler out by specifying explicit parallelism. They also reliably access "special" instructions.
You need to understand the architectures. Modern hardware uses unpublished instruction sets that are difficult to understand and properly tune, and they change from generation to generation.
State changes can be expensive, so you should organize your data to avoid them as much as possible. You should always compile your shaders at startup and leave them around for the duration of your program, rather than compiling them as needed. Similarly, you should avoid frequent sampler remapping. Queries should be done infrequently; you should ask once and cache the value (e.g. uniform locations). You only need to validate a shader once. As long as you bind it with the same samplers, if it's valid once, it'll always be valid.
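The "ask once and cache the value" advice can be sketched as a tiny lookup cache. The class name and the injected `query_location` callable are hypothetical; the callable stands in for a real driver query such as glGetUniformLocation so that the sketch runs without a GL context.

```python
class UniformLocationCache:
    """Ask the driver for each uniform location once, then reuse the answer."""

    def __init__(self, query_location):
        # query_location(program, name) stands in for a real driver query
        # (e.g. glGetUniformLocation); it is injected so this runs without GL.
        self._query = query_location
        self._locations = {}

    def get(self, program, name):
        """Return the cached location, querying only on the first request."""
        key = (program, name)
        if key not in self._locations:
            self._locations[key] = self._query(program, name)
        return self._locations[key]
```

Keying on both the program and the uniform name matters because the same name can sit at different locations in different programs.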
Test "uber" shaders carefully. Switching between multiple specialized shaders may be faster than using one large shader that does everything.
In addition to the recommendations listed so far for shaders, there are many things you can do in the fixed function pipeline to improve performance.
Depth buffers are critical to performance, and they are highly optimized on modern GPUs. To get the most out of them:
- Clear depth and stencil together (since they are stored in interleaved memory)
- Clear the depth buffer (can be done very fast, don't try to do tricks to avoid having to clear the depth buffer)
- Draw in rough front to back order for opaque objects to take advantage of early Z test. Draw your skybox last to take full advantage of this.
- If your fragment cost is high (e.g. expensive fragment shaders), consider doing a depth fill pass first to avoid processing fragments that won't be visible anyway.
- Avoid killing pixels while updating the depth buffer (using alpha test or discard)
- Avoid reading the depth buffer
- Use a consistent depth function (GL_LEQUAL is ideal)
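The draw-order tip in the list above (rough front to back for opaque objects, skybox last) could be sketched as a sort on the CPU before submitting draws. The `Drawable` record, its fields, and the `opaque_draw_order` helper are hypothetical, and a per-object camera distance is assumed to be available.

```python
from dataclasses import dataclass


@dataclass
class Drawable:
    name: str
    distance: float          # distance from the camera (illustrative units)
    is_skybox: bool = False


def opaque_draw_order(drawables):
    """Sort opaque objects nearest-first and push any skybox to the end."""
    # False sorts before True, so skyboxes land last; the rest order by distance.
    return sorted(drawables, key=lambda d: (d.is_skybox, d.distance))
```

Only a rough ordering is needed for early Z to help, so a coarse per-object distance is usually enough for this purpose.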
Use modern graphics programming methods. Extensions are often performance focused, so don't be afraid to use them (e.g. FBOs, VBOs, PBOs).
Minimize state changes, since toggling states costs CPU time. Avoid resetting states to default values "just because". Some state changes are worse than others. Shader changes and texture format changes are expensive, while things like enabling or disabling blending are cheap.
OpenGL Performance Tools
Sébastien Dominé, NVIDIA
NVIDIA has a couple of very useful tools for improving performance that I'll summarize briefly here.
NVShaderPerf allows you to evaluate the cost of shaders offline. You can specify which shader model you want to estimate (GLSL (fragments), !!FP1.0, !!ARBfp1.0, Cg). It's available on GeForce FX, GeForce 6, and Quadro FX. It outputs assembly code, the number of cycles (under optimal circumstances), the number of temporary registers used, and pixel throughput. It also allows you to force the program to use all fp16 or all fp32 to compare the performance difference.
Future versions of NVShaderPerf will provide more complete support for vertex programs, including vertex throughput and GLSL vertex scheduling, and will support multiple driver versions in one release.
NVPerfKIT is a complete performance instrumentation solution that includes an instrumented driver, the NVIDIA Developer Control Panel, support for PDH (Performance Data Helper, a Windows standard), and code samples for OpenGL and Direct3D. It requires instrumented applications to be authorized, which requires recompilation.
The instrumented driver exposes GPU and driver performance counters, supports SLI counters, and supports OpenGL, but requires a GeForce FX or 6. The NVIDIA developer control panel is required to sample the counters.
The OpenGL driver counters include:
- FPS & frame time
- AGP texture, VBO, and total memory used
- Video texture, VBO, and total memory used
- Driver sleep time (driver waits for GPU)
The hardware/GPU counters include:
- GPU idle
- Pixel shader utilization
- Vertex attribute count
- Vertex shader utilization
- Texture waits for shader
- Shader waits for texture
- Shader waits for framebuffer
- FastZ utilization (UltraShadow) (making sure you're taking best advantage of the hardware when using shadows)
- Pixel, vertex, triangle, primitive, and culled primitive counts
Microsoft Performance Data Helper (PDH) is part of WMI (Windows Management Instrumentation). Through this standard, the counters are available in VTune and Perfmon. They can also be accessed by your own application.
Developers will have beta access to this tool this week, with broader access in the coming weeks.
Advanced OpenGL Debugging & Profiling with gDEBugger
Yaki Tebeka, Avi Shapira, Graphic Remedy
Why is OpenGL debugging difficult?
- Application views the graphics system as a black box
- You cannot put a breakpoint on an OpenGL function
- Can't watch OpenGL state variables
- Cannot view allocated graphic objects (textures, VBOs)
- Render context is a huge state machine
- Commonly used features use a lot of state variables
- OpenGL is a low level API: thousands of calls per frame
- OpenGL error model very limited
An OpenGL debugger lets you watch state variables, put breakpoints on OpenGL functions, view allocated objects, break automatically on OpenGL errors, view the OpenGL call stack, etc.
The gDEBugger GUI was designed for graphics applications. Its features include a small footprint, customizable views, viewers, toolbars, and always-on-top mode.
gDEBugger lets you dump the entire state machine to a file, so you can use diff to compare states at different times to track down bugs. It also includes a texture viewer that allows you to save the texture to a file.
Gremedy added the GL_GREMEDY_string_marker extension to allow you to mark segments of the log to make it more readable.
Additional features that gDEBugger offers to make debugging your OpenGL apps easier:
- Interactive mode. This lets you view your graphic scene as it is being rendered by forcing OpenGL to draw directly into the front buffer. It also lets you slow down OpenGL. Because it issues a flush after every OpenGL call, you can see your scene as it is being drawn, including any offscreen rendering. This enables you to break the app run when the desired object is drawn.
- Forced raster mode lets you force OpenGL polygon raster mode (points, lines, fill)
- In Profiling mode you can turn on an FPS counter and progressively disable pipeline stages to determine where your bottlenecks are.
The current version of gDEBugger (1.3) supports OpenGL 2.0 and dozens of extensions. Gremedy is currently adding a number of new features to gDEBugger, including
- Function call statistics (times called, spent time)
- The ability to track allocated OpenGL resources
- Rendered primitive statistics
- Primitive draw batch sizes
- Buffer viewer (all buffer types)
- Shader source code viewer & editor
- The ability to disable one or more given extensions
You can find out more about gDEBugger at www.gremedy.com. If you're a GDNet+ member, you're eligible to get a 15% discount on gDEBugger, which will more than pay for the cost of membership.