# Transform Feedback implementation on Metal back-end ### Overview - OpenGL ES 3.0 introduces Transform Feedback as a way to capture vertex outputs to buffers before the introduction of Compute Shader in later versions. - Metal doesn't support Transform Feedback natively but it is possible to be emulated using Compute Shader or Vertex Shader to write vertex outputs to buffers directly. - If Vertex Shader writes to buffers directly as well as to stage output (i.e. `[[position]]`, varying variables, ...) then the Metal runtime won't allow the `MTLRenderPipelineState` to be created. It is only allowed to either write to buffers or to stage output not both on Metal. This brings challenges to implement Transform Feedback when `GL_RASTERIZER_DISCARD` is not enabled, because in that case, by right OpenGL will do both the Transform Feedback and rasterization (feeding stage output to Fragment Shader) at the same time. ### Current implementation - Transform Feedback will be implemented by inserting additional code snippet to write vertex's varying variables to buffers called XFB buffers at compilation time. The buffers' offsets are calculated based on `[[vertex_id]]`/`gl_VertexIndex` & `[[instance_id]]`/`gl_InstanceID`. - When Transform Feedback ends, a memory barrier must be inserted because the XFB buffers could be used as vertex inputs in future draw calls. Due to Metal not supporting explicit memory barrier (currently only macOS 10.14 and above supports it, ARM based macOS doesn't though), the only reliable way to insert memory barrier currently is ending the render pass. - In order to support Transform Feedback capturing and rasterization at the same time, the draw call must be split into 2 passes: - First pass: Vertex Shader will write captured varyings to XFB buffers. `MTLRenderPipelineState`'s rasterization will be disabled. This can be done in `spirv-cross` translation step. `spirv-cross` can convert the Vertex Shader to a `void` function, effectively won't produce any stage output values for Fragment Shader. - Second pass: Vertex Shader will write to stage output normally, but the XFB buffers writing snippet are disabled. Note that the Vertex Shader in this pass is essential the same as the first pass's, only difference is the output route (stage output vs XFB buffers). This effectively executes the same Vertex Shader's internal logic twice. - If `GL_RASTERIZER_DISCARD` is enabled when Transform Feedback is enabled: - Only first pass above will be executed, the render pass will use 1x1 empty texture attachment because rasterization is not needed and small texture attachment's load & store at render pass's start & end boundary could be cheap. Recall that we have to end the render pass to enforce XFB buffers' memory barrier as mentioned above. - If `GL_RASTERIZER_DISCARD` is enabled and Transform Feedback is NOT enabled, we cannot disable `MTLRenderPipelineState`'s rasterization because if doing so, Metal runtime requires the Vertex Shader to be a `void` function, i.e. not returning any stage output values. In order to work-around this: - `MTLRenderPipelineState`'s rasterization will still be enabled this case. - However, the Vertex Shader will be translated to write `(-3, -3, -3, 1)` to `[[position]]`/`gl_Position` variable at the end. Effectively forcing the vertex to be clipped and preventing it from being sent down to Fragment Shader. Note that the `(-3, -3, -3, 1)` writing are controlled by a specialized constant, thus it could be turned on and off base on `GL_RASTERIZER_DISCARD` state. It is more efficient doing this way than re-translating the whole shader code again using `spirv-cross` to turn it to a `void` function. ### Future improvements - Use explicit memory barrier on macOS devices supporting it instead of ending the render pass. - Instead of executing the same Vertex Shader's logic twice, one alternative approach is writing the vertex outputs to a temporary buffer. Then in second pass, copy the varyings from that buffer to XFB buffers. If rasterization is still enabled, then the 3rd pass will be invoked to use the temporary buffer as vertex input, the Vertex Shader in 3rd pass might just a simple passthrough shader: 1. Original VS -> All outputs to temp buffer. 2. Temp buffer -> Copy captured varying to XFB buffers. Could be done in a Compute Shader. 3. Temp buffer -> VS pass through to FS for rasterization. - However, this approach might even be slower than executing the Vertex Shader twice. Because a memory barrier must be inserted after 1st step. This prevents multiple draw calls with Transform Feedback to be parallelized. Furthermore, on iOS devices or devices not supporting explicit barrier, the render pass must be ended and restarted after each draw call. - Most of the time, the application usually uses Transform Feedback with `GL_RASTERIZER_DISCARD` enabled, the original approach will just simply executes the Vertex Shader once and use a cheap 1x1 render pass, thus it should be fast enough.