Difference between revisions of "GPU"

From Nintendo Switch Brew

Revision as of 19:23, 2 February 2018

Mapping Memory

First, to map a memory region into the GPU Address Space, caching needs to be disabled with svcSetMemoryAttribute. The Address passed is the Virtual Address of the region that will be mapped, the size is the region size, and State0/1 are both set to 8 to disable caching of the memory region. This ensures that the GPU can actually "see" the data written there, rather than the data lingering in a cache.

Then, NVMAP_IOC_CREATE is used to create an nvmap object with the desired size. Next, NVMAP_IOC_ALLOC is used to allocate the memory on the GPU Address Space and map data from the process Address Space into it, by passing the Virtual Address as the input addr parameter together with the Handle returned from NVMAP_IOC_CREATE. Lastly, the actual mapping is done with NVGPU_AS_IOCTL_MAP_BUFFER_EX, and the GPU Virtual Address is returned in the offset parameter. It is also possible to choose the GPU Virtual Address manually, by passing the address in the "offset" parameter and setting bit 0 of the flags parameter to 1. For this to work, the desired GPU Virtual Address must have been reserved beforehand with NVGPU_AS_IOCTL_ALLOC_SPACE.
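
The order of the calls above can be sketched as follows. This is only an illustration of the sequence: FakeDriver and all of its method names are hypothetical stand-ins that record calls instead of issuing real SVCs or ioctls, and the GPU addresses it returns are made up.

```python
class FakeDriver:
    """Hypothetical stand-in: records calls instead of talking to the kernel or nvdrv."""

    def __init__(self):
        self.calls = []
        self._next_handle = 1
        self._next_gpu_va = 0x100000  # made-up starting address

    def svc_set_memory_attribute(self, addr, size, state0, state1):
        self.calls.append(("svcSetMemoryAttribute", addr, size, state0, state1))

    def nvmap_create(self, size):
        self.calls.append(("NVMAP_IOC_CREATE", size))
        handle = self._next_handle
        self._next_handle += 1
        return handle

    def nvmap_alloc(self, handle, addr):
        self.calls.append(("NVMAP_IOC_ALLOC", handle, addr))

    def as_map_buffer_ex(self, handle, flags=0, offset=0):
        self.calls.append(("NVGPU_AS_IOCTL_MAP_BUFFER_EX", handle, flags, offset))
        if flags & 1:
            # Bit 0 of flags set: the caller chose the GPU Virtual Address.
            return offset
        gpu_va = self._next_gpu_va
        self._next_gpu_va += 0x10000
        return gpu_va


def map_to_gpu(drv, addr, size, fixed_gpu_va=None):
    """Run the mapping steps in the order described above; returns the GPU Virtual Address."""
    # 1. Disable caching for the region (State0/1 both set to 8).
    drv.svc_set_memory_attribute(addr, size, 8, 8)
    # 2. Create the nvmap object and keep its handle.
    handle = drv.nvmap_create(size)
    # 3. Allocate, passing the process Virtual Address and the handle.
    drv.nvmap_alloc(handle, addr)
    # 4. Map into the GPU Address Space.
    if fixed_gpu_va is None:
        return drv.as_map_buffer_ex(handle)
    # A fixed address must have been reserved with NVGPU_AS_IOCTL_ALLOC_SPACE first.
    return drv.as_map_buffer_ex(handle, flags=1, offset=fixed_gpu_va)
```

The real ioctl argument structures are not modelled here; only the ordering and the meaning of flags bit 0 are taken from the text.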

The above process is used to map all data that will be used by the GPU, like Textures, Command Lists (a.k.a. Push Buffers), Vertex/Index buffers and Shaders. They usually have their own mapping, but Command Lists can share the same mapping.

FIFO Commands

The GPU implements a variation of Tegra's push buffer format for its PFIFO engine. PFIFO is a special engine responsible for receiving user command lists and routing them to the appropriate engines (2D, 3D, DMA).

Commands are submitted to the GPU's PFIFO engine through NVGPU_IOCTL_CHANNEL_SUBMIT_GPFIFO.

This ioctl takes an array of gpfifo entries where each entry points to a FIFO command list. This list is composed of alternating 32-bit words containing FIFO commands and their respective arguments.

Command Structure

{| class="wikitable"
! Bits !! Description
|-
| 12-0 || Method
|-
| 15-13 || Subchannel
|-
| 28-16 || Argument count (in 32-bit words) or inline data (see below)
|-
| 31-29 || Submission mode
|}

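Note: Methods are treated as 4-byte addressable locations, so external references usually list their numbers multiplied by 4. The value is divided by 4 when packing it into the command word and multiplied by 4 when unpacking it.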

Note: The command's arguments, when present, follow the command word immediately.
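
A minimal sketch of packing and unpacking a command word according to the bit layout above. The function names are made up for this example; only the field positions come from the table.

```python
def pack_command(method, subchannel, count_or_data, mode):
    """Build a 32-bit command word from its four fields."""
    if not 0 <= method < (1 << 13):
        raise ValueError("method must fit in 13 bits")
    if not 0 <= subchannel < (1 << 3):
        raise ValueError("subchannel must fit in 3 bits")
    if not 0 <= count_or_data < (1 << 13):
        raise ValueError("argument count / inline data must fit in 13 bits")
    if not 0 <= mode < (1 << 3):
        raise ValueError("mode must fit in 3 bits")
    return (mode << 29) | (count_or_data << 16) | (subchannel << 13) | method


def unpack_command(word):
    """Split a 32-bit command word back into its fields."""
    return {
        "method": word & 0x1FFF,                 # bits 12-0
        "subchannel": (word >> 13) & 0x7,        # bits 15-13
        "count_or_data": (word >> 16) & 0x1FFF,  # bits 28-16
        "mode": (word >> 29) & 0x7,              # bits 31-29
    }


def method_from_byte_offset(offset):
    """Convert a 'multiplied by 4' method number into the value stored in the word."""
    return offset >> 2
```

For example, pack_command(0xE30, 0, 3, 5) gives 0xA0030E30, which matches the DrawArrays entry in the command list on this page.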

Submission mode

{| class="wikitable"
! Mode !! Description !! Official name
|-
| 0 || Increasing mode (old) ||
|-
| 1 || Increasing mode - Tells PFIFO to read as many arguments as specified by argument count, while automatically incrementing the method value. This means that each argument will be written to a different method location. || INCR
|-
| 2 || Non-increasing mode (old) ||
|-
| 3 || Non-increasing mode - Tells PFIFO to read as many arguments as specified by argument count. However, all arguments will be written to the same method location. || NONINCR
|-
| 4 || Inline mode - Tells PFIFO to read inline data from bits 28-16 of the command word, thus eliminating the need to pass additional words for the arguments. || IMM
|-
| 5 || Increase-once mode - Tells PFIFO to read as many arguments as specified by argument count, automatically incrementing the method value only once. ||
|}
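
The effect of each submission mode can be illustrated by expanding a command list into individual (subchannel, method, value) writes. This is a sketch, not a description of the hardware: the old modes 0 and 2 are rejected because their format is not documented here, and increase-once is interpreted as "first argument goes to the method, all later ones to method + 1".

```python
INCR, NONINCR, IMM, INCR_ONCE = 1, 3, 4, 5


def expand_commands(words):
    """Expand a list of 32-bit words into (subchannel, method, value) writes."""
    writes = []
    i = 0
    while i < len(words):
        word = words[i]
        i += 1
        method = word & 0x1FFF
        subchannel = (word >> 13) & 0x7
        field = (word >> 16) & 0x1FFF  # argument count, or inline data in IMM mode
        mode = (word >> 29) & 0x7

        if mode == IMM:
            # The value travels inside the command word; no argument words follow.
            writes.append((subchannel, method, field))
            continue
        if mode not in (INCR, NONINCR, INCR_ONCE):
            raise ValueError("unsupported submission mode %d" % mode)

        args = words[i:i + field]
        if len(args) != field:
            raise ValueError("command list ends before all arguments were read")
        i += field

        for n, arg in enumerate(args):
            if mode == INCR:
                target = method + n          # a new method location per argument
            elif mode == NONINCR:
                target = method              # always the same method location
            else:
                target = method + min(n, 1)  # increment once, then stay
            writes.append((subchannel, target, arg))
    return writes
```

Feeding it [0x200204EC, 10, 20] (increasing mode, two arguments) yields writes to methods 0x4EC and 0x4ED.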

Command List

All methods with values < 0x100 are special and executed by the PFIFO's DMA puller. The others are forwarded to the engine object currently bound to a given subchannel.

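{| class="wikitable"
! Command !! Method !! Subchannel !! Arg Count !! Mode !! Name
|-
| 0x2001?000 || 0x000 || Variable || 1 || 1 || BindObject
|-
| 0xA0020E00 || 0xE00 || 0 || 2 || 5 || BeginTransformFeedback
|-
| 0xA0030E30 || 0xE30 || 0 || 3 || 5 || DrawArrays
|-
| 0xA0050E36 || 0xE36 || 0 || 5 || 5 || DrawElements
|-
| 0xA0020E2E || 0xE2E || 0 || 2 || 5 || PopDebugGroupId
|-
| 0xA0040E2C || 0xE2C || 0 || 4 || 5 || PushDebugGroup
|-
| 0x2001054C || 0x54C || 0 || 1 || 1 || ResetCounter
|-
| 0x8001047F || 0x47F || 0 || 1 || 4 || ResolveDepthBuffer
|-
| 0x200104C4 || 0x4C4 || 0 || 1 || 1 || SetAlphaRef
|-
| 0x200404C7 || 0x4C7 || 0 || 4 || 1 || SetBlendColor
|-
| 0x2001064F || 0x64F || 0 || 1 || 1 || SetDepthClamp
|-
| 0x200200CD || 0xCD || 0 || 2 || 1 || SetInnerTessellationLevels
|-
| 0x200204EC || 0x4EC || 0 || 2 || 1 || SetLineWidth
|-
| 0x200400C9 || 0xC9 || 0 || 4 || 1 || SetOuterTessellationLevels
|-
| 0x8???0373 || 0x373 || 0 || Variable || 4 || SetPatchSize
|-
| 0x20010546 || 0x546 || 0 || 1 || 1 || SetPointSize
|-
| 0x20030554 || 0x554 || 0 || 3 || 1 || SetRenderEnableConditional
|-
| 0x200403EF || 0x3EF || 0 || 4 || 1 || SetSampleMask
|-
| 0x200103D9 || 0x3D9 || 0 || 1 || 1 || SetTiledCacheTileSize
|}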

Note: These still need to be heavily verified and could be wrong.

BindObject

In order to bind an engine object to a specific subchannel, method 0 (BindObject) must be used first. The target subchannel is specified in bits 15-13 of the command word.

After the engine object is bound to the desired subchannel, setting that subchannel's index in bits 15-13 of any subsequent command word will make PFIFO forward the command to the target engine.

This method only takes one argument, an engine ID.

Engine IDs

{| class="wikitable"
! ID !! Engine
|-
| 0x902D || FERMI_TWOD_A (2D)
|-
| 0xB197 || MAXWELL_B (3D)
|-
| 0xB1C0 || MAXWELL_COMPUTE_B
|-
| 0xA140 || KEPLER_INLINE_TO_MEMORY_B
|-
| 0xB0B5 || MAXWELL_DMA_COPY_A (DMA)
|}
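
Binding and routing can be sketched as below. The helper names and the SubchannelTable class are invented for this example and only model the behaviour described on this page: method 0 binds an engine object, other methods below 0x100 stay with the puller, and everything else is forwarded to whatever is bound to the subchannel.

```python
ENGINE_IDS = {
    "2D": 0x902D,                # FERMI_TWOD_A
    "3D": 0xB197,                # MAXWELL_B
    "COMPUTE": 0xB1C0,           # MAXWELL_COMPUTE_B
    "INLINE_TO_MEMORY": 0xA140,  # KEPLER_INLINE_TO_MEMORY_B
    "DMA": 0xB0B5,               # MAXWELL_DMA_COPY_A
}


def bind_object(subchannel, engine_id):
    """Return the two words of a BindObject command (increasing mode, one argument)."""
    if not 0 <= subchannel <= 7:
        raise ValueError("subchannel must be in 0..7")
    word = (1 << 29) | (1 << 16) | (subchannel << 13) | 0x000
    return [word, engine_id]


class SubchannelTable:
    """Toy model of how commands are routed after binding."""

    def __init__(self):
        self.bound = {}

    def submit(self, word, args):
        method = word & 0x1FFF
        subchannel = (word >> 13) & 0x7
        if method == 0:
            # BindObject: remember which engine object owns this subchannel.
            self.bound[subchannel] = args[0]
            return ("puller", None)
        if method < 0x100:
            # Other special methods are executed by the DMA puller itself.
            return ("puller", None)
        # Everything else goes to the engine bound to the subchannel (None if unbound).
        return ("engine", self.bound.get(subchannel))
```

For instance, bind_object(1, ENGINE_IDS["3D"]) produces [0x20012000, 0xB197], which fits the 0x2001?000 pattern listed for BindObject.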

Fences

Command lists can contain fences to ensure that commands are executed in the correct order, and that subsequent commands are only sent once the previously sent commands have been processed by the GPU. Fences use the QUERY_* commands and work like this:

  • First, QUERY_ADDRESS_HIGH and QUERY_ADDRESS_LOW commands are added to the Command List, with the High/Low 32-bit parts of the 64-bit GPU Virtual Address where the fence is located. This GPU Virtual Address needs to be mapped to the process Virtual Address beforehand.
  • Then, QUERY_SEQUENCE is added with a sequential number. This number is basically an incrementing counter, so the first Command List can have QUERY_SEQUENCE = 1, the next one QUERY_SEQUENCE = 2, then 3, 4 and so on.
  • Finally, QUERY_GET is added and contains the mode and other unknown data.

The above commands are added using the increasing mode, since the method IDs of these four registers are consecutive.
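
A sketch of emitting the four fence words with a single increasing-mode command. The method number used for QUERY_ADDRESS_HIGH (0x6C0, i.e. 0x1B00 divided by 4) is an assumption taken from the Fermi 3D definitions in envytools and has not been verified here for the Switch GPU; the function name is also made up.

```python
# Assumption: Fermi value from envytools (0x1B00 / 4); LOW, SEQUENCE and GET follow it.
QUERY_ADDRESS_HIGH = 0x6C0


def fence_commands(gpu_va, sequence, query_get, subchannel=0):
    """Return the command word plus its four arguments for one fence."""
    # Increasing mode (1) with 4 arguments: HIGH, LOW, SEQUENCE, GET.
    header = (1 << 29) | (4 << 16) | (subchannel << 13) | QUERY_ADDRESS_HIGH
    return [
        header,
        (gpu_va >> 32) & 0xFFFFFFFF,  # QUERY_ADDRESS_HIGH
        gpu_va & 0xFFFFFFFF,          # QUERY_ADDRESS_LOW
        sequence & 0xFFFFFFFF,        # QUERY_SEQUENCE
        query_get & 0xFFFFFFFF,       # QUERY_GET
    ]
```

Because the mode is increasing, the four arguments land on four consecutive methods starting at QUERY_ADDRESS_HIGH.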

QUERY_GET Structure

{| class="wikitable"
! Bits !! Description
|-
| 1-0 || Mode
|-
| 4 || Fence
|-
| 15-12 || Unit
|}

QUERY_GET Mode

{| class="wikitable"
! Value !! Mode
|-
| 0 || Write
|-
| 1 || Sync
|-
| 2 || Write ?
|-
| 3 || Write ?
|}

TODO: Move this to a separate page with all GPU Commands with descriptions. Also figure out what the other values mean.

Some of the other fields are still unknown/unobserved.

Official games set Mode to 0, Fence to 1 and Unit to 0xF. The QUERY_SEQUENCE value is then written by the GPU to the address pointed to by QUERY_ADDRESS. On the CPU side, the game code should wait until the value at the address pointed to by QUERY_ADDRESS is greater than or equal to the last written SEQUENCE value. Official code waits for this condition in a loop and does not send any further commands before it becomes true.
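
The QUERY_GET value and the CPU-side wait can be sketched as follows. Only the three documented fields are packed, all other bits are left at zero, and read_fence is a hypothetical callable standing in for reading the 32-bit value at the fence address. The poll limit is an addition for the example; the text only says that official code loops until the condition holds.

```python
def pack_query_get(mode=0, fence=1, unit=0xF):
    """Pack the known QUERY_GET fields; defaults match what official games use."""
    return ((unit & 0xF) << 12) | ((fence & 0x1) << 4) | (mode & 0x3)


def wait_for_fence(read_fence, sequence, max_polls=1000000):
    """Poll until the fence value reaches `sequence`; False if the poll limit is hit."""
    for _ in range(max_polls):
        if read_fence() >= sequence:
            return True
    return False
```

With the default arguments pack_query_get() returns 0xF010. The comparison does not account for the sequence counter wrapping around.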

Vertex Data Submission

Note: This is an observation of how the game Puyo Puyo Tetris sends textured squares to the GPU.

  1. VERTEX_ATTRIB_FORMAT (0-15) are set (only the first 3 are really used, the rest are set to float, with Size = 1 and offset at 0).
  2. VERTEX_ARRAY_FETCH (0) is set with the lower 12 bits set to 0x1c (Stride) and bit 12 to 1 (Enabled).
  3. VERTEX_ARRAY_START_HIGH/LOW (0) are set to the GPU Virtual Address where the Vertex Data is located.
  4. VERTEX_ARRAY_LIMIT_HIGH/LOW (0) are set to the GPU Virtual Address where the Vertex Data is located, plus the Vertex Data size in bytes minus 1.
  5. VERTEX_BEGIN_GL is used with the primitive type set to TRIANGLE_STRIP.
  6. VERTEX_BUFFER_FIRST with value 0 (indicating the index of the first primitive to render?).
  7. VERTEX_BUFFER_COUNT is set to 4, because the Vertex Buffer with the square has 4 vertices.
  8. VERTEX_END_GL is used with value 0 (currently unknown what this value means).
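
The values written in steps 2 to 4 can be computed as in the sketch below. The helper names are invented for this example, and only the bit positions and the "start plus size minus 1" rule come from the list above.

```python
def vertex_array_fetch(stride, enabled=True):
    """Value for VERTEX_ARRAY_FETCH: stride in the lower 12 bits, enable flag in bit 12."""
    if not 0 <= stride < (1 << 12):
        raise ValueError("stride must fit in 12 bits")
    return (int(enabled) << 12) | stride


def split_address(gpu_va):
    """Split a 64-bit GPU Virtual Address into its (high, low) 32-bit halves."""
    return (gpu_va >> 32) & 0xFFFFFFFF, gpu_va & 0xFFFFFFFF


def vertex_array_range(gpu_va, size_bytes):
    """Return ((start_high, start_low), (limit_high, limit_low)) for a vertex buffer."""
    start = split_address(gpu_va)
    # The limit is the address of the last byte of the vertex data.
    limit = split_address(gpu_va + size_bytes - 1)
    return start, limit
```

For the observed square, vertex_array_fetch(0x1C) gives 0x101C, and a buffer of 4 vertices with a 0x1C-byte stride occupies 0x70 bytes, so the limit is the start address plus 0x6F.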

References

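FIFO engine overview: https://envytools.readthedocs.io/en/latest/hw/fifo/intro.html

Method values from the Fermi family GPU (a bit older than the Tegra X1, but the values seem to be mostly the same): https://github.com/envytools/envytools/blob/master/rnndb/graph/gf100_3d.xml

Values for some types used in the above XML: https://github.com/envytools/envytools/blob/master/rnndb/graph/nv_3ddefs.xml

Command word packing code used in Mesa3d: https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nvc0/nvc0_winsys.h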