GPU

Latest revision as of 20:37, 10 May 2024

Classes

See GPU Classes.

Mapping Memory

First, to map a memory region into the GPU Address Space, caching must be disabled using svcSetMemoryAttribute. The Address passed is the Virtual Address of the region that will be mapped, the size is the region size, and State0/1 are both set to 8 to disable caching of the memory region. This ensures that the GPU can actually "see" the data written there, rather than the data being held back in a cache.

Then, NVMAP_IOC_CREATE is used to create an nvmap object with the desired size. Next, NVMAP_IOC_ALLOC is used to allocate the memory on the GPU Address Space and to map data from the process Address Space into it, by passing the Virtual Address as the input addr parameter along with the Handle returned from NVMAP_IOC_CREATE. Lastly, the actual mapping is done using NVGPU_AS_IOCTL_MAP_BUFFER_EX, and the GPU Virtual Address is returned in the offset parameter. It is also possible to choose the offset at which the mapping is made in the GPU Address Space, by passing the address in the "offset" parameter and setting bit 0 of the flags parameter to 1. For this to work, however, the desired GPU Virtual Address must have been reserved beforehand using NVGPU_AS_IOCTL_ALLOC_SPACE.

The above process is used to map all data that will be used by the GPU, like Textures, Command Lists (a.k.a. Push Buffers), Vertex/Index buffers and Shaders. They usually have their own mapping, but Command Lists can share the same mapping.

FIFO Commands

The GPU uses Nvidia's push buffer format for its PFIFO engine. PFIFO is a special engine responsible for receiving user command lists and routing them to the appropriate engines (2D, 3D, DMA).

Commands are submitted to the GPU's PFIFO engine through NVGPU_IOCTL_CHANNEL_SUBMIT_GPFIFO.

This ioctl takes an array of gpfifo entries where each entry points to a FIFO command list. This list is composed of alternating 32-bit words containing FIFO commands and their respective arguments.

Command Structure

Bits    Description
0-11    Method address
12      Reserved
13-15   Method subchannel
16-28   Method count, immediate-data or tertiary opcode
29-31   Secondary opcode

Methods are treated as 4-byte addressable locations, so their numbers are conventionally written down multiplied by 4. The command's arguments, when present, immediately follow the command word.
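The field layout above can be sketched as a pair of pack/unpack helpers. This is only an illustration of the bit positions in the table: the function names are made up for this example, and `method_from_byte_offset` assumes that a method number written "multiplied by 4" is turned into the 12-bit field value by dividing it by 4.

```python
# Bit positions taken from the Command Structure table.
METHOD_MASK = 0xFFF                     # bits 0-11
SUBCH_SHIFT, SUBCH_MASK = 13, 0x7       # bits 13-15
COUNT_SHIFT, COUNT_MASK = 16, 0x1FFF    # bits 16-28
OPCODE_SHIFT, OPCODE_MASK = 29, 0x7     # bits 29-31


def pack_command(opcode, method, subchannel=0, count=0):
    """Build one 32-bit command word from raw field values."""
    for name, value, mask in (
        ("opcode", opcode, OPCODE_MASK),
        ("method", method, METHOD_MASK),
        ("subchannel", subchannel, SUBCH_MASK),
        ("count", count, COUNT_MASK),
    ):
        if not 0 <= value <= mask:
            raise ValueError(f"{name} out of range: {value:#x}")
    return (
        (opcode << OPCODE_SHIFT)
        | (count << COUNT_SHIFT)
        | (subchannel << SUBCH_SHIFT)
        | method
    )


def unpack_command(word):
    """Split a 32-bit command word back into its fields."""
    return {
        "opcode": (word >> OPCODE_SHIFT) & OPCODE_MASK,
        "count": (word >> COUNT_SHIFT) & COUNT_MASK,
        "subchannel": (word >> SUBCH_SHIFT) & SUBCH_MASK,
        "method": word & METHOD_MASK,
    }


def method_from_byte_offset(offset):
    """Assumption: a method written as a byte offset is stored divided by 4."""
    if offset % 4:
        raise ValueError("byte offsets must be multiples of 4")
    return offset >> 2
```

For example, `pack_command(1, 0x54C, count=1)` yields the word `0x2001054C`, and unpacking it returns the same four fields.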

Secondary opcode

Mode   Description
0      GRP0_USE_TERT
1      INC_METHOD
2      GRP2_USE_TERT
3      NON_INC_METHOD
4      IMMD_DATA_METHOD
5      ONE_INC
6      Reserved
7      END_PB_SEGMENT

GRP0_USE_TERT

Tells PFIFO to read the tertiary opcode from bits 16-17 of the command word.

INC_METHOD

Tells PFIFO to read as many arguments as specified by method count, while automatically incrementing the method address value. This means that each argument will be written to a different method location.

GRP2_USE_TERT

Tells PFIFO to read the tertiary opcode from bits 16-17 of the command word.

NON_INC_METHOD

Tells PFIFO to read as many arguments as specified by method count. However, all arguments will be written to the same method location.

IMMD_DATA_METHOD

Tells PFIFO to read immediate-data from bits 16-28 of the command word, thus eliminating the need to pass additional words for the arguments.

ONE_INC

Tells PFIFO to read as many arguments as specified by method count, incrementing the method address value only once (after the first argument).

END_PB_SEGMENT

Tells PFIFO to stop processing any further methods.
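How the secondary opcodes differ is easiest to see by expanding a command list into the individual method writes it would produce. The following is a minimal sketch based only on the descriptions above; `decode` is a hypothetical helper, it does not handle the two USE_TERT opcodes, and it is not meant to mirror the exact behaviour of the hardware.

```python
INC_METHOD = 1
NON_INC_METHOD = 3
IMMD_DATA_METHOD = 4
ONE_INC = 5
END_PB_SEGMENT = 7


def decode(words):
    """Expand a list of 32-bit words into (subchannel, method, value) writes."""
    writes = []
    i = 0
    while i < len(words):
        word = words[i]
        i += 1
        opcode = (word >> 29) & 0x7
        field = (word >> 16) & 0x1FFF  # method count or immediate data
        subchannel = (word >> 13) & 0x7
        method = word & 0xFFF

        if opcode == END_PB_SEGMENT:
            break  # stop processing any further methods
        if opcode == IMMD_DATA_METHOD:
            # The argument lives in the command word itself.
            writes.append((subchannel, method, field))
            continue
        if opcode not in (INC_METHOD, NON_INC_METHOD, ONE_INC):
            raise NotImplementedError(f"opcode {opcode} is not covered by this sketch")

        args = words[i:i + field]
        if len(args) != field:
            raise ValueError("command list ends before all arguments were read")
        i += field
        for n, value in enumerate(args):
            if opcode == INC_METHOD:
                target = method + n          # a new method for every argument
            elif opcode == ONE_INC:
                target = method + min(n, 1)  # increments once, then stays
            else:
                target = method              # always the same method
            writes.append((subchannel, target, value))
    return writes
```

With an INC_METHOD header such as `0x20020100` followed by two arguments, the sketch reports writes to methods `0x100` and `0x101`; the same arguments behind a NON_INC_METHOD header (`0x60020100`) both land on `0x100`.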

Tertiary opcode

Mode   Description
0      GRP0_INC_METHOD or GRP2_NON_INC_METHOD
1      GRP0_SET_SUB_DEV_MASK
2      GRP0_STORE_SUB_DEV_MASK
3      GRP0_USE_SUB_DEV_MASK

SetObject

In order to bind an engine object to a specific subchannel, method 0 (SetObject) must be used first. The target subchannel is specified in bits 13-15 of the command word.

After the engine object is bound to the desired subchannel, setting its value in bits 13-15 of any subsequent command word will make PFIFO forward the command to the target engine.

This method only takes one argument, a GPU Class ID.
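A SetObject command can therefore be sketched as a two-word sequence: a command word for method 0 on the chosen subchannel, followed by the class ID. The helper name is hypothetical, and the use of INC_METHOD with a method count of 1 is an assumption, since the text does not say which secondary opcode is used for binding.

```python
INC_METHOD = 1       # assumed opcode for the binding command
SET_OBJECT = 0x000   # method 0


def set_object(subchannel, class_id):
    """Return the words that bind `class_id` to `subchannel`."""
    if not 0 <= subchannel <= 7:
        raise ValueError("subchannel must fit in bits 13-15")
    if not 0 <= class_id <= 0xFFFFFFFF:
        raise ValueError("class ID must fit in one 32-bit word")
    header = (INC_METHOD << 29) | (1 << 16) | (subchannel << 13) | SET_OBJECT
    return [header, class_id]


# 0xB197 is used here only as an example class ID; see the GPU Classes page
# for the values that are actually valid.
words = set_object(0, 0xB197)
```

Later commands for the same engine would then carry the same subchannel number in bits 13-15.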

Macro

Macros are small programs that can be uploaded to the GPU and are capable of reading and writing the 3D engine registers. Macros also accept parameters, which are stored in a FIFO. Macros are called using methods starting at 0xe00: the first method of each pair triggers the macro execution, and the second one pushes parameters to the FIFO. The macro program reads them with an instruction called parm, which pops the FIFO and returns the next parameter. This also allows programs to take a variable number of parameters if desired.

The first parameter is written to 0xe00 + n * 2 (where n is the macro index), and all subsequent parameters should be pushed to the FIFO using 0xe01 + n * 2. The first parameter is placed in the general purpose register R1 of the macro program when execution starts.
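The method arithmetic for a macro call can be sketched as follows. The helper names are made up for this example, and the result is a list of (method, value) writes rather than packed command words, because the text does not state which opcode games use for the call.

```python
MACRO_BASE = 0xE00


def macro_methods(index):
    """Return the (trigger, push) method pair for macro `index`."""
    trigger = MACRO_BASE + index * 2
    return trigger, trigger + 1


def macro_index(method):
    """Return the macro index that a trigger or push method belongs to."""
    if method < MACRO_BASE:
        raise ValueError("not a macro method")
    return (method - MACRO_BASE) // 2


def macro_call_writes(index, params):
    """Return the (method, value) writes that call macro `index`."""
    if not params:
        raise ValueError("the trigger method needs a first parameter")
    trigger, push = macro_methods(index)
    writes = [(trigger, params[0])]              # starts execution, lands in R1
    writes += [(push, p) for p in params[1:]]    # read later with parm
    return writes
```

This is the same write pattern that a single ONE_INC command produces (first argument to the method address, all others to the next one), so that opcode looks like a natural fit for macro calls.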

Official games use these macros to conditionally write registers. One example is the macro at 0xe24, which is used to set shader registers (including the shader address and the binding of the c1 Constant Buffer to the shader). In some cases, macros are also used to set registers unconditionally.

Fences

Command lists can contain fences to ensure that commands are executed in the correct order, and that subsequent commands are only sent once the previously sent commands have been processed by the GPU. Fences use the ReportSemaphore* registers, and work like this:

  • First, the ReportSemaphoreOffset registers are set to the high and low 32-bit halves of the 64-bit GPU Virtual Address where the fence is located. This GPU Virtual Address needs to be mapped to the process Virtual Address beforehand.
  • Then, ReportSemaphorePayload is set to a sequence number. This number is basically an incrementing counter, so the first Command List can set ReportSemaphorePayload = 1, the next one to 2, then 3, 4... and so on.
  • Finally, ReportSemaphoreControl is written; it contains the mode and other, still unknown, data.

The above commands are added using the incrementing mode (INC_METHOD), since all 4 of those registers are sequential.

Official games set Operation to 0 (Release), bit 4 to 1, bits 12-15 (Unit) to 0xF, and bit 28 to 1 (OneWord). The ReportSemaphorePayload value is then written by the GPU to the address pointed to by ReportSemaphoreOffset. On the CPU side, the game code should wait until the value at the address pointed to by ReportSemaphoreOffset is greater than or equal to the last written value. Official code waits for this condition in a loop, and won't send any further commands before it becomes true.
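The fence sequence can be sketched as four register writes plus a polling loop on the CPU side. Everything named here is hypothetical: `base_method` stands in for the method number of the first ReportSemaphore register, which is not given above, placing Operation in the lowest two bits is an assumption, and `read_value` is any callable that returns the current 32-bit value at the fence address.

```python
def semaphore_control(operation=0, bit4=1, unit=0xF, one_word=1):
    """Build the ReportSemaphoreControl value used by official games."""
    return (
        (operation & 0x3)        # assumed to occupy bits 0-1
        | ((bit4 & 0x1) << 4)
        | ((unit & 0xF) << 12)
        | ((one_word & 0x1) << 28)
    )


def fence_writes(base_method, gpu_va, payload):
    """Return the 4 sequential (method, value) writes for one fence."""
    return [
        (base_method + 0, (gpu_va >> 32) & 0xFFFFFFFF),  # offset, high half
        (base_method + 1, gpu_va & 0xFFFFFFFF),          # offset, low half
        (base_method + 2, payload & 0xFFFFFFFF),         # sequence number
        (base_method + 3, semaphore_control()),
    ]


def wait_for_fence(read_value, payload, max_polls=1_000_000):
    """Poll until the GPU has written a value >= payload; False on give-up."""
    for _ in range(max_polls):
        if read_value() >= payload:
            return True
    return False
```

With the default arguments, `semaphore_control()` evaluates to `0x1000F010`. The comparison in `wait_for_fence` ignores counter wrap-around, which a long-running program would have to handle separately.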

Vertex Data Submission

Note: This is an observation of how the game Puyo Puyo Tetris sends textured squares to the GPU.

  1. VERTEX_ATTRIB_FORMAT (0-15) are set (only the first 3 are really used, the rest are set to float, with Size = 1 and offset at 0).
  2. VERTEX_ARRAY_FETCH (0) is set with the lower 12 bits set to 0x1c (Stride) and bit 12 to 1 (Enabled).
  3. VERTEX_ARRAY_START_HIGH/LOW (0) are set to the GPU Virtual Address where the Vertex Data is located.
  4. VERTEX_ARRAY_LIMIT_HIGH/LOW (0) are set to the GPU Virtual Address where the Vertex Data is located, plus the Vertex Data size in bytes minus 1.
  5. VERTEX_BEGIN_GL is used with the primitive type set to TRIANGLE_STRIP.
  6. VERTEX_BUFFER_FIRST with value 0 (indicating the index of the first primitive to render?).
  7. VERTEX_BUFFER_COUNT is set to 4, because the Vertex Buffer with the square has 4 vertices.
  8. VERTEX_END_GL is used with value 0 (currently unknown what this value means).
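The values written in steps 2 to 4 follow from simple arithmetic, sketched below. The helper names and the example address are made up; only the stride of 0x1c, the enable bit and the "start plus size minus 1" rule come from the list above.

```python
def vertex_array_fetch(stride, enabled=True):
    """Stride in the lower 12 bits, enable flag in bit 12."""
    if not 0 <= stride <= 0xFFF:
        raise ValueError("stride must fit in 12 bits")
    return stride | (int(enabled) << 12)


def split_address(address):
    """Split a GPU Virtual Address into its (high, low) 32-bit words."""
    return (address >> 32) & 0xFFFFFFFF, address & 0xFFFFFFFF


def vertex_array_range(start, size_in_bytes):
    """Return the START and LIMIT register pairs for a vertex buffer."""
    limit = start + size_in_bytes - 1  # address of the last valid byte
    return split_address(start), split_address(limit)


# A square with 4 vertices of 0x1c bytes each, at a made-up address.
STRIDE = 0x1C
fetch = vertex_array_fetch(STRIDE)
start_regs, limit_regs = vertex_array_range(0x123456000, 4 * STRIDE)
```

Here `fetch` is `0x101C`, and the limit is `0x70 - 1` bytes past the start address.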

Texture View

Texture information such as address, format and size is sent to the GPU through a structure known as a Texture View (a.k.a. Texture Image Control, or TIC). Each texture that the game uses needs a separate TIC, and those TICs are written to a table, one after the other. Each TIC entry is 0x20 bytes long, and is composed of eight 32-bit words into which the texture information is packed.

The indexes of the TIC entries that should be used by the shader are sent to the GPU with the CB_POS/CB_DATA (0) methods. Games usually follow these steps to write the TIC entry indexes:

  • Macro 0xe1a is used to set the CB_ADDRESS_HIGH/LOW registers to the GPU Virtual Address of the Constant Buffer selected by register 0x982 (the Texture Constant Buffer index register), and also sets CB_SIZE.
  • CB_POS is used to set the write offset of the Constant Buffer to n * 4, where n is the index of the Handle being used in the shader program (this index starts at 8, so CB_POS should be at least 8 * 4 = 0x20).
  • The CB_DATA (0) method is used to write the value into the Constant Buffer. The value is a Handle in which the lower 20 bits are the TIC index and the upper 12 bits are the TSC (Texture Sampler Control) index.

The address of a given TIC entry can be calculated as:

tic_entry_address = tic_base_address + tic_index * 0x20

Where tic_base_address is the address written to TIC_ADDRESS_HIGH/LOW (methods 0x1574 and 0x1578), tic_index is the lower 20 bits of the word written into the Constant Buffer with CB_DATA (0), and 0x20 is the size of each TIC entry in bytes.

The texture is accessed in the shader using one of the texture sampling instructions (usually the TEXS instruction). One of the parameters of this instruction is the Handle index. This index starts at 8, so index 8 will access the handle at 8 * 4 = 0x20 in the Texture Constant Buffer. Each shader stage has a separate Constant Buffer, so for fragment shaders the handle is located at CB_ADDRESS + 4 * CB_SIZE + TEXS_index * 4 (where the first 4 is the index of the fragment shader stage, and the second 4 is the size of a word in bytes).
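The handle layout and the address formulas of this section can be put together in a short sketch. All function names are made up for illustration; the constants (20-bit TIC index, 12-bit TSC index, 0x20-byte entries, first handle index 8, fragment stage index 4) are the ones given in the text.

```python
TIC_ENTRY_SIZE = 0x20
FIRST_HANDLE_INDEX = 8
FRAGMENT_STAGE = 4
WORD_SIZE = 4


def pack_handle(tic_index, tsc_index):
    """Lower 20 bits: TIC index. Upper 12 bits: TSC index."""
    if not 0 <= tic_index <= 0xFFFFF:
        raise ValueError("TIC index must fit in 20 bits")
    if not 0 <= tsc_index <= 0xFFF:
        raise ValueError("TSC index must fit in 12 bits")
    return (tsc_index << 20) | tic_index


def unpack_handle(handle):
    """Return (tic_index, tsc_index)."""
    return handle & 0xFFFFF, (handle >> 20) & 0xFFF


def tic_entry_address(tic_base_address, tic_index):
    return tic_base_address + tic_index * TIC_ENTRY_SIZE


def handle_cb_pos(handle_index):
    """Value for CB_POS when writing the handle with index `handle_index`."""
    if handle_index < FIRST_HANDLE_INDEX:
        raise ValueError("handle indexes start at 8")
    return handle_index * WORD_SIZE


def fragment_handle_address(cb_address, cb_size, texs_index):
    """Where the fragment stage reads the handle for `texs_index`."""
    return cb_address + FRAGMENT_STAGE * cb_size + texs_index * WORD_SIZE
```

For instance, a handle with TIC index 3 and TSC index 1 packs to `0x100003`, and with a TIC table at `0x10000` the corresponding entry sits at `0x10060`.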

TIC Structure

Word   Bits    Description
0      0-6     Texture Format
0      7-9     R Channel Data Type
0      10-12   G Channel Data Type
0      13-15   B Channel Data Type
0      16-18   A Channel Data Type
1      0-31    Lower 32-bits of the Texture GPU Virtual Address
2      0-15    Higher 16-bits of the Texture GPU Virtual Address
4      0-15    Texture Width minus 1
5      0-15    Texture Height minus 1

Channel Data Type

Value   Type
1       SNORM
2       UNORM
3       SINT
4       UINT
5       SNORM_FORCE_FP16
6       UNORM_FORCE_FP16
7       FLOAT
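Packing the documented fields into a TIC entry can be sketched as below. This covers only the fields listed in the two tables above and leaves every other bit at zero, so it should not be read as a complete TIC encoder. The function names are hypothetical, the little-endian byte order is an assumption, and the format value `0x08` in the example is only a placeholder (see the Texture Formats page for real values).

```python
import struct

UNORM = 2  # from the Channel Data Type table


def pack_tic_words(texture_format, r, g, b, a, address, width, height):
    """Return the eight 32-bit words of a TIC entry (known fields only)."""
    if not 0 <= texture_format <= 0x7F:
        raise ValueError("texture format must fit in 7 bits")
    if any(not 0 <= t <= 0x7 for t in (r, g, b, a)):
        raise ValueError("channel data types must fit in 3 bits")
    if not 0 <= address < (1 << 48):
        raise ValueError("address must fit in 48 bits")
    if not (1 <= width <= 0x10000 and 1 <= height <= 0x10000):
        raise ValueError("width and height minus 1 must fit in 16 bits")

    words = [0] * 8
    words[0] = texture_format | (r << 7) | (g << 10) | (b << 13) | (a << 16)
    words[1] = address & 0xFFFFFFFF        # lower 32 bits of the address
    words[2] = (address >> 32) & 0xFFFF    # higher 16 bits of the address
    words[4] = width - 1
    words[5] = height - 1
    return words


def tic_bytes(words):
    """Serialize a TIC entry to 0x20 bytes (assumes little-endian words)."""
    return struct.pack("<8I", *words)


# 256x128 texture with all four channels UNORM at a made-up address.
entry = pack_tic_words(0x08, UNORM, UNORM, UNORM, UNORM, 0x123456000, 256, 128)
```

The resulting first word is `0x24908`, and `tic_bytes(entry)` is exactly 0x20 bytes long, matching the entry size given in the Texture View section.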

Shaders

See GPU Shaders.

References

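FIFO engine overview: https://envytools.readthedocs.io/en/latest/hw/fifo/intro.html

Method values from the Fermi family GPU (a bit older than the Tegra X1, but values seem to be mostly the same): https://github.com/envytools/envytools/blob/master/rnndb/graph/gf100_3d.xml

TIC structure used on a Maxwell GPU: https://github.com/envytools/envytools/blob/master/rnndb/graph/gm200_texture.xml

Values for some types used on the above XML: https://github.com/envytools/envytools/blob/master/rnndb/graph/nv_3ddefs.xml

Command word packing code used on Mesa3d: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_winsys.h

TIC entry pack/write code used on Mesa3d: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_tex.c#n65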