Shader games with Vulkan

What in tarnation?

Hey there. Wanna build games out of shaders? They have been around for some time now. For example, there’s quite the list on this website. Of course, the scope was always very limited because the technology was not meant to do this. To be fair, it is still the case today, but you’ll see that it’s no longer 100% true with Vulkan.

About

The goal of this project is to move as much logic as possible from the CPU to the GPU. As a demo, I’ve decided to create an Asteroids clone, as it includes two of the features I wanted to see the most on the GPU: draw call management and asset management, without adding too much clutter.

I’ll spill the beans right away: the magic is done by exposing the vertex buffer, the index buffer, and the indirect draw buffer in the compute shaders. Once this is done, it’s just a matter of recording a simple command buffer once and submitting it as many times as required. Continue reading to learn how it was done.

Before going into the details, the source code is available at https://github.com/gabdube/asteroids-shader . This article will also pinpoint the interesting parts.

A high level overview

The project contains 4 files:

In the setup script, because all the fun stuff is done on the GPU, all that’s left is a simple checklist.

After creating all the required Vulkan resources, it’s time to record the game command buffer. This is done once per framebuffer in the record function. The commands are recorded in the following order:

  1. Bind the compute pipeline
  2. Execute the game logic compute shader
  3. Use a barrier to make sure the execution is done before rendering
  4. Begin the render pass
  5. Bind all the required resources
  6. Execute the available CmdDrawIndexedIndirectCount* function
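The order above can be sketched with a fake command buffer that only logs command names. This is a rough illustration, not the project's record function: `FakeCommandBuffer` and `record_game_commands` are hypothetical, and the exact binding calls under step 5 (plus the closing `vkCmdEndRenderPass`) are my assumption about what "all the required resources" covers.

```python
class FakeCommandBuffer:
    """Hypothetical stand-in for a Vulkan command buffer: it only logs command names."""

    def __init__(self):
        self.commands = []

    def record(self, name):
        self.commands.append(name)


def record_game_commands(cmd, draw_command="vkCmdDrawIndexedIndirectCountKHR"):
    """Log one frame's commands in the order described in the list above."""
    cmd.record("vkCmdBindPipeline (compute)")   # 1. bind the compute pipeline
    cmd.record("vkCmdDispatch")                 # 2. run the game logic shader
    cmd.record("vkCmdPipelineBarrier")          # 3. compute writes must land before drawing
    cmd.record("vkCmdBeginRenderPass")          # 4. begin the render pass
    # 5. bind the resources (assumed set of calls, not taken from the project)
    cmd.record("vkCmdBindPipeline (graphics)")
    cmd.record("vkCmdBindDescriptorSets")
    cmd.record("vkCmdBindVertexBuffers")
    cmd.record("vkCmdBindIndexBuffer")
    cmd.record(draw_command)                    # 6. draw, with the count read from a buffer
    cmd.record("vkCmdEndRenderPass")            # assumed: a render pass has to be closed
    return cmd.commands


if __name__ == "__main__":
    for name in record_game_commands(FakeCommandBuffer()):
        print(name)
```

Because nothing in this sequence depends on CPU-side game data, the same recording can be submitted every frame without being rebuilt.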

An important thing to remember is that CmdDrawIndexedIndirectCount* is a pretty recent function (on non-AMD hardware). It depends on the extension VK_KHR_draw_indirect_count which, according to GPUINFO, is only available in very recent graphics drivers.

Before the run loop begins, the initialization compute shader is recorded and executed once here.

Once the initialization is done, the script listens to the system events and calls the render function as fast as possible. Just before the game command buffer is executed, the game state is updated with the update function.

There you go. That’s the meat of the runtime.

Vulkan initialization

The Python setup script initializes all the Vulkan resources. At 2000 lines of code, it might seem like a lot, but that’s just how verbose Vulkan is.

There’s nothing interesting in the instance setup, so let’s skip to the device setup.

For the device setup, a feature and an extension must be enabled.

Extension: Draw indirect count

Vulkan alone supports indirect drawing through the function vkCmdDrawIndexedIndirect. There is one problem though: the draw count is a parameter of the command itself, so it is fixed when the command buffer is recorded and the GPU has no way to decide how many draw parameters there are in the buffer. This limitation can be removed with either of these two extensions: VK_AMD_draw_indirect_count and VK_KHR_draw_indirect_count. Both extensions work in exactly the same way by exposing the vkCmdDrawIndexedIndirectCount* function.

The difference is that VK_AMD_draw_indirect_count is exclusive to AMD hardware and has been available since the beginning of time, while VK_KHR_draw_indirect_count is available from any hardware vendor (including AMD), but only in recent drivers (it requires Vulkan 1.1 support). For example, my Intel laptop with an iGPU does not support this function because it is stuck at Vulkan 1.0.43.
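Picking between the two extensions boils down to a lookup in the list of extension names reported by the device. Here is a minimal sketch, assuming that list has already been queried; `pick_draw_indirect_count` is a hypothetical helper, and preferring the KHR variant is my choice, not something the project dictates.

```python
def pick_draw_indirect_count(available_extensions):
    """Return (extension name, draw command name), preferring the cross-vendor KHR variant."""
    candidates = (
        ("VK_KHR_draw_indirect_count", "vkCmdDrawIndexedIndirectCountKHR"),
        ("VK_AMD_draw_indirect_count", "vkCmdDrawIndexedIndirectCountAMD"),
    )
    available = set(available_extensions)
    for extension, command in candidates:
        if extension in available:
            return extension, command
    raise RuntimeError("no draw indirect count extension on this device")


if __name__ == "__main__":
    # Hypothetical extension list of an older AMD driver
    print(pick_draw_indirect_count(["VK_KHR_swapchain", "VK_AMD_draw_indirect_count"]))
```

The returned extension name is the one to add to the enabled device extensions, and the command name is the entry point to load afterwards.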

Feature: shaderStorageBufferArrayDynamicIndexing

By default, dynamic indexing is not supported in shaders. This can be fixed by enabling shaderStorageBufferArrayDynamicIndexing or shaderUniformBufferArrayDynamicIndexing in the device features. As these features are widely supported, this should not be a problem.

Note that because write access is required, the shaders only use buffer blocks and not uniform blocks, so dynamic indexing for uniforms does not need to be enabled.

Feature: shaderFloat64

I like to send the time to my shaders using a double instead of a float. This is a personal preference and is not required for this demo.

Buffers creation

The next important step is the buffer allocation and the state buffer allocation.

In this step, three buffers are allocated: one for the vertices and indices, another for the game data, and a last one for the shared game state. Because they will be exposed as storage buffers in the shaders, the usage flag VK_BUFFER_USAGE_STORAGE_BUFFER_BIT must be added.

Because only the GPU will access the vertex attributes and the game data, these buffers can be safely backed by MEMORY_PROPERTY_DEVICE_LOCAL_BIT memory.

The state buffer must be backed by MEMORY_PROPERTY_HOST_VISIBLE_BIT memory because it will be updated from the host.
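To make the flag choices concrete, here is a small sketch of the usage masks and of the usual memory type search. The numeric values are the ones from the Vulkan headers. Which non-storage usage flags go on which buffer is my assumption (the text only says the storage bit must be added), and `find_memory_type` is a hypothetical helper working on a plain list of property flags instead of the real VkPhysicalDeviceMemoryProperties structure.

```python
# Values copied from the Vulkan headers (vulkan_core.h)
BUFFER_USAGE_STORAGE_BUFFER_BIT = 0x00000020
BUFFER_USAGE_INDEX_BUFFER_BIT = 0x00000040
BUFFER_USAGE_VERTEX_BUFFER_BIT = 0x00000080
BUFFER_USAGE_INDIRECT_BUFFER_BIT = 0x00000100
MEMORY_PROPERTY_DEVICE_LOCAL_BIT = 0x00000001
MEMORY_PROPERTY_HOST_VISIBLE_BIT = 0x00000002
MEMORY_PROPERTY_HOST_COHERENT_BIT = 0x00000004

# Assumed usage masks: every buffer gets the storage bit so the shaders can see it
MESH_BUFFER_USAGE = (BUFFER_USAGE_VERTEX_BUFFER_BIT | BUFFER_USAGE_INDEX_BUFFER_BIT
                     | BUFFER_USAGE_STORAGE_BUFFER_BIT)
GAME_BUFFER_USAGE = BUFFER_USAGE_INDIRECT_BUFFER_BIT | BUFFER_USAGE_STORAGE_BUFFER_BIT
STATE_BUFFER_USAGE = BUFFER_USAGE_STORAGE_BUFFER_BIT


def find_memory_type(type_bits, memory_type_flags, required):
    """Return the first memory type allowed by `type_bits` that has all `required` flags.

    `type_bits` plays the role of VkMemoryRequirements.memoryTypeBits and
    `memory_type_flags` lists the property flags of each memory type, in order.
    """
    for index, flags in enumerate(memory_type_flags):
        allowed = type_bits & (1 << index)
        if allowed and (flags & required) == required:
            return index
    raise LookupError("no compatible memory type")


if __name__ == "__main__":
    # Hypothetical device with one device-local type and one host-visible type
    types = [MEMORY_PROPERTY_DEVICE_LOCAL_BIT,
             MEMORY_PROPERTY_HOST_VISIBLE_BIT | MEMORY_PROPERTY_HOST_COHERENT_BIT]
    print(find_memory_type(0b11, types, MEMORY_PROPERTY_DEVICE_LOCAL_BIT))  # mesh and game data
    print(find_memory_type(0b11, types, MEMORY_PROPERTY_HOST_VISIBLE_BIT))  # state buffer
```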

Two very important things to remember

Data layout

The compute shaders expose 4 buffer bindings:

The vertex shader only exposes the game buffer.

In the vertex shader, the game buffer binding must be marked as readonly. Otherwise the binding counts as writable, which will crash the program because stores and atomics in the vertex stage are a feature (vertexPipelineStoresAndAtomics) that must be enabled in the device.

Every binding uses the std430 layout (which is only available on buffer bindings). This allows the data to be tightly packed in arrays and structs. The default std140 can introduce a lot of unnecessary padding. For more information, see this.
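The padding difference is easy to see on arrays. The sketch below is a simplified model that only covers arrays of scalars, vec2 and vec4 (element sizes of 4, 8 and 16 bytes): std140 rounds the array stride up to a multiple of 16 bytes, while std430 keeps the natural element size. `array_stride` and `array_size` are hypothetical helpers, not part of any Vulkan binding.

```python
def array_stride(element_size, layout):
    """Byte stride between array elements for scalar, vec2 or vec4 elements."""
    if layout == "std430":
        return element_size
    if layout == "std140":
        # std140 rounds the stride up to the alignment of a vec4 (16 bytes)
        return (element_size + 15) // 16 * 16
    raise ValueError(f"unknown layout: {layout}")


def array_size(element_size, count, layout):
    """Total byte size of an array of `count` elements."""
    return array_stride(element_size, layout) * count


if __name__ == "__main__":
    # A `uint indices[6]` array: 6 * 16 bytes in std140, 6 * 4 bytes in std430
    print(array_size(4, 6, "std140"), array_size(4, 6, "std430"))
```

For index data stored as an array of uint, that is a fourfold difference in buffer size.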

Using storage buffers also effectively removes the buffer range limit, even on Nvidia hardware (which is usually the most restrictive). For example, according to GPUINFO, the maximum range on an Nvidia GTX 1070 is 65536 bytes for a uniform buffer, but roughly 4294970000 bytes (about 4 GiB) for a storage buffer.

This makes life a bit easier when defining the write sets because VK_WHOLE_SIZE can be used without worries.

Writing meshes from compute shaders

Attribute and index data must be defined as arrays in the shader. Here is an example. A loop must then iterate over the values and save them in the buffers.

Not only is this very painful to write, it also produces very bloated SPIR-V code. It was fine for this example because the meshes are very simple, but uploading a complex mesh would probably make the SPIR-V binary explode in size.

Also, this is probably very slow. Luckily, mesh uploading only happens in the initialization compute shader.
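The copy loop itself is trivial; the useful part is keeping track of where each mesh lands in the shared buffers. Below is a CPU-side Python imitation of what the initialization shader does, under the assumption that all meshes share one vertex array and one index array. `upload_mesh` and the two tiny meshes are made up for the illustration.

```python
# Hypothetical meshes, defined as constant arrays like they would be in the shader
SHIP_VERTICES = [(0.0, 0.5), (0.5, -0.5), (-0.5, -0.5)]
SHIP_INDICES = [0, 1, 2]
ROCK_VERTICES = [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)]
ROCK_INDICES = [0, 1, 2, 2, 3, 0]


def upload_mesh(vertex_buffer, index_buffer, vertices, indices):
    """Append a mesh to the shared buffers and return (firstIndex, vertexOffset, indexCount)."""
    first_index = len(index_buffer)
    vertex_offset = len(vertex_buffer)
    for vertex in vertices:      # the loop the shader has to spell out by hand
        vertex_buffer.append(vertex)
    for index in indices:        # indices stay relative to the mesh; vertexOffset rebases them
        index_buffer.append(index)
    return first_index, vertex_offset, len(indices)


if __name__ == "__main__":
    vertex_buffer, index_buffer = [], []
    print(upload_mesh(vertex_buffer, index_buffer, SHIP_VERTICES, SHIP_INDICES))
    print(upload_mesh(vertex_buffer, index_buffer, ROCK_VERTICES, ROCK_INDICES))
```

The three returned values are exactly what a draw command needs later in its firstIndex, vertexOffset and indexCount fields.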

Managing draw calls from compute shaders

Both the count buffer and the draw parameter buffer are exposed in the shader inside the game binding. The draw count is exposed in game.drawCount and the draw parameters are exposed in game.objects. Each draw parameter is associated with a game object. By sending a stride that is bigger than the VkDrawIndexedIndirectCommand structure, it is possible to append extra information to the draw commands.

Adding a draw call to the program is done by increasing the count buffer and appending a new value to the game object array.

Removing an object is done by setting the VkDrawIndexedIndirectCommand.indexCount member of the selected object to 0 and flagging the draw command as “unused” using the extra data.

Don’t even think about moving the VkDrawIndexedIndirectCommand in the array. Seriously. I spent 5 hours trying to debug only to rewrite the whole thing because I couldn’t understand why the draw commands were being corrupted.
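The add and remove logic can be modelled on the CPU with a byte array. This is only a sketch of the idea: the five leading fields follow the real VkDrawIndexedIndirectCommand layout (20 bytes), but the single trailing `status` word, the 24-byte stride, the `GameBuffer` class and the reuse of unused slots are assumptions made for the illustration.

```python
import struct

DRAW_COMMAND = struct.Struct("<IIIiI")   # indexCount, instanceCount, firstIndex, vertexOffset, firstInstance
GAME_OBJECT = struct.Struct("<IIIiII")   # draw command followed by a hypothetical status word
STRIDE = GAME_OBJECT.size                # 24 bytes: larger than the 20-byte draw command
UNUSED, ACTIVE = 0, 1


class GameBuffer:
    """CPU model of the game binding: a draw count plus an array of game objects."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.draw_count = 0
        self.objects = bytearray(STRIDE * capacity)

    def read(self, slot):
        return GAME_OBJECT.unpack_from(self.objects, slot * STRIDE)

    def add(self, index_count, first_index, vertex_offset):
        """Write a draw command into a free slot and return the slot index."""
        for slot in range(self.draw_count):
            if self.read(slot)[5] == UNUSED:   # assumed policy: recycle removed slots
                break
        else:
            if self.draw_count == self.capacity:
                raise OverflowError("game object array is full")
            slot = self.draw_count
            self.draw_count += 1               # grow the count buffer
        # firstInstance is set to the slot so the object can be found again later
        GAME_OBJECT.pack_into(self.objects, slot * STRIDE,
                              index_count, 1, first_index, vertex_offset, slot, ACTIVE)
        return slot

    def remove(self, slot):
        """Disable a draw in place: the record never moves inside the array."""
        struct.pack_into("<I", self.objects, slot * STRIDE, 0)                         # indexCount = 0
        struct.pack_into("<I", self.objects, slot * STRIDE + DRAW_COMMAND.size, UNUSED)


if __name__ == "__main__":
    game = GameBuffer(capacity=4)
    ship = game.add(index_count=3, first_index=0, vertex_offset=0)
    rock = game.add(index_count=6, first_index=3, vertex_offset=3)
    game.remove(ship)
    print(game.draw_count, game.read(ship), game.read(rock))
```

A removed object still occupies its slot and is still counted, but with indexCount at 0 it draws nothing, which is why nothing ever has to be shuffled around.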

Pointing to the right matrix in the vertex shader

The last important thing to remember is that the vertex shader does not have any clue about which draw command it is processing. Luckily for us, VkDrawIndexedIndirectCommand can be self-referencing using the firstInstance field.

Basically, setting the firstInstance field to the current game object index means that the vertex shader will be able to fetch the right game object using the gl_InstanceIndex value.

Note that gl_InstanceIndex is only available in Vulkan shaders. For more information, see the accepted answer here.
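The trick works because, in Vulkan, gl_InstanceIndex starts counting at firstInstance instead of at zero. A tiny Python model of that behaviour, with hypothetical helper names and a list of strings standing in for the game object array:

```python
def instance_indices(first_instance, instance_count):
    """Values of gl_InstanceIndex seen by the vertex shader for one draw command."""
    return [first_instance + i for i in range(instance_count)]


def vertex_shader_lookup(game_objects, first_instance):
    """Model of the vertex shader fetching its game object for a single-instance draw."""
    (gl_instance_index,) = instance_indices(first_instance, 1)
    return game_objects[gl_instance_index]


if __name__ == "__main__":
    objects = ["ship", "rock_a", "rock_b"]   # stand-ins for the game object array
    print(vertex_shader_lookup(objects, first_instance=2))
```

With an instance count of 1, every draw command sees exactly one index, and that index is its own position in the array.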

Debugging game logic from shaders

this page was intentionally left blank

On a more serious note, debugging is next to impossible when all your logic is executed on the GPU. The best way I found is to map the whole game state and print it out.
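Once the bytes are on the host, printing them is a matter of unpacking with the same layout the shaders use. Here is a minimal sketch, assuming a layout of one uint draw count followed by 24-byte game objects (a draw command plus one status word). That layout and the `dump_game_state` helper are hypothetical, and `raw` stands for whatever the mapped memory gives back.

```python
import struct

HEADER = struct.Struct("<I")          # assumed: the draw count comes first
OBJECT = struct.Struct("<IIIiII")     # VkDrawIndexedIndirectCommand + hypothetical status word


def dump_game_state(raw):
    """Format the mapped game buffer as text, one line per game object."""
    (draw_count,) = HEADER.unpack_from(raw, 0)
    lines = [f"drawCount = {draw_count}"]
    for slot in range(draw_count):
        (index_count, instance_count, first_index,
         vertex_offset, first_instance, status) = OBJECT.unpack_from(
            raw, HEADER.size + slot * OBJECT.size)
        lines.append(
            f"[{slot}] indexCount={index_count} instanceCount={instance_count} "
            f"firstIndex={first_index} vertexOffset={vertex_offset} "
            f"firstInstance={first_instance} status={status}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Fake buffer content standing in for the mapped memory
    raw = HEADER.pack(2) + OBJECT.pack(3, 1, 0, 0, 0, 1) + OBJECT.pack(0, 1, 3, 3, 1, 0)
    print(dump_game_state(raw))
```

Since the structures are declared once in the shaders and once more on the host, the two definitions have to be kept in sync by hand.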

Conclusion

Because of the serious limitations (everything must fit in GPU memory, debugging is next to impossible, most structures are defined more than once, etc.), game logic should stay a CPU task. Nevertheless, this project was a lot of fun to code. 10/10 would not program DOOM.