Difference between revisions of "GPU"

From Nintendo Switch Brew

Revision as of 21:32, 30 March 2018

Mapping Memory

First, to map a memory region into the GPU Address Space, caching must be disabled with svcSetMemoryAttribute. The Address passed is the Virtual Address of the region that will be mapped, the size is the region size, and State0/1 are both set to 8 to disable caching of the memory region. This ensures that the GPU can actually "see" the data written there, rather than having it held back in a CPU cache.

Then, NVMAP_IOC_CREATE is used to create an nvmap object with the desired size. Afterwards, NVMAP_IOC_ALLOC is used to allocate the memory on the GPU Address Space and to map data from the process Address Space into the GPU Address Space, by passing the Virtual Address as the input addr parameter together with the Handle returned from NVMAP_IOC_CREATE. Lastly, the actual mapping is done with NVGPU_AS_IOCTL_MAP_BUFFER_EX, and the GPU Virtual Address is returned in the offset parameter. It is also possible to manually choose where the mapping is placed in the GPU Address Space, by passing the address in the "offset" parameter and setting bit 0 of the flags parameter to 1. However, for this to work, the desired GPU Virtual Address must have been reserved beforehand using NVGPU_AS_IOCTL_ALLOC_SPACE.

The above process is used to map all data that will be used by the GPU, like Textures, Command Lists (a.k.a. Push Buffers), Vertex/Index buffers and Shaders. They usually have their own mapping, but Command Lists can share the same mapping.

FIFO Commands

The GPU implements a variation of Tegra's push buffer format for its PFIFO engine. PFIFO is a special engine responsible for receiving user command lists and routing them to the appropriate engines (2D, 3D, DMA).

Commands are submitted to the GPU's PFIFO engine through NVGPU_IOCTL_CHANNEL_SUBMIT_GPFIFO.

This ioctl takes an array of gpfifo entries where each entry points to a FIFO command list. This list is composed of alternating 32-bit words containing FIFO commands and their respective arguments.

Command Structure

Bits Description
12-0 Method
15-13 Subchannel
27-16 Argument count (in 32-bits Words) or inline data (see below)
28 Unknown
31-29 Submission mode

Note: Methods are treated as 4-byte addressable locations, and hence their numbers are written down multiplied by 4.

Note: The command's arguments, when present, follow the command word immediately.
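
As an illustration, the bit layout in the table above can be packed and unpacked with a few shifts and masks. This is a minimal sketch: the function names are hypothetical, and bit 28 is simply left at zero when packing because the table marks it as unknown.

```python
def pack_command(method, subchannel, count, mode):
    """Pack a FIFO command word from the fields listed in the table."""
    if not 0 <= method <= 0x1FFF:
        raise ValueError("method must fit in bits 12-0")
    if not 0 <= subchannel <= 0x7:
        raise ValueError("subchannel must fit in bits 15-13")
    if not 0 <= count <= 0xFFF:
        raise ValueError("count must fit in bits 27-16")
    if not 0 <= mode <= 0x7:
        raise ValueError("mode must fit in bits 31-29")
    # Bit 28 is left clear: its meaning is unknown.
    return method | (subchannel << 13) | (count << 16) | (mode << 29)


def unpack_command(word):
    """Split a FIFO command word back into its fields."""
    return {
        "method": word & 0x1FFF,
        "subchannel": (word >> 13) & 0x7,
        "count": (word >> 16) & 0xFFF,
        "bit28": (word >> 28) & 0x1,
        "mode": (word >> 29) & 0x7,
    }


if __name__ == "__main__":
    print(hex(pack_command(0x47, 0, 2, 1)))
    print(unpack_command(0xA0030E30))
```

Packing method 0x47, subchannel 0, two arguments and mode 1 reproduces the word 0x20020047 from the Command List table.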

Submission mode

Mode Description Official name
0 Increasing mode (old)
1 Increasing mode - Tells PFIFO to read as many arguments as specified by argument count, while automatically incrementing the method value. This means that each argument will be written to a different method location. INCR
2 Non-increasing mode (old)
3 Non-increasing mode - Tells PFIFO to read as many arguments as specified by argument count. However, all arguments will be written to the same method location. NONINCR
4 Inline mode - Tells PFIFO to read inline data from bits 28-16 of the command word, thus eliminating the need to pass additional words for the arguments. IMM
5 Increase-once mode - Tells PFIFO to read as many arguments as specified by argument count, automatically incrementing the method value only once.
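
The difference between the submission modes is easiest to see by listing which method each argument ends up being written to. The sketch below is a small software model of that behaviour, not PFIFO's actual implementation; expand_command is a hypothetical name, and the "old" modes 0 and 2 are deliberately not modelled because the page does not describe their layout.

```python
def expand_command(words):
    """Return the (method, value) writes produced by one command.

    words[0] is the command word; its arguments, if any, follow it.
    """
    header = words[0]
    method = header & 0x1FFF
    count = (header >> 16) & 0xFFF
    mode = (header >> 29) & 0x7

    if mode == 4:
        # Inline mode: the data lives in bits 28-16 of the command word.
        return [(method, (header >> 16) & 0x1FFF)]

    args = words[1:1 + count]
    if len(args) != count:
        raise ValueError("command list is truncated")

    if mode == 1:
        # Increasing: every argument goes to the next method.
        return [(method + i, arg) for i, arg in enumerate(args)]
    if mode == 3:
        # Non-increasing: every argument goes to the same method.
        return [(method, arg) for arg in args]
    if mode == 5:
        # Increase-once: the method is incremented after the first argument only.
        return [(method + min(i, 1), arg) for i, arg in enumerate(args)]
    raise ValueError(f"submission mode {mode} is not modelled here")


if __name__ == "__main__":
    for m, v in expand_command([0xA0030E30, 10, 20, 30]):
        print(hex(m), v)
```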

Command List

All methods with values < 0x100 are special and executed by the PFIFO's DMA puller. The others are forwarded to the engine object currently bound to a given subchannel.

Command Method Subchannel Arg Count Mode Name
0x2001?000 0x0000 Variable 1 1 BindObject
0x80000040 0x40 0 0 4 ?
0xA???0045 0x45 0 Variable 5 SetGraphMacroCode
0x20020047 0x47 0 2 1 SetGraphMacroEntry
0x800?0049 0x49 0 Variable 4 ?
0x20056080 0x80 3 Variable 1 ?
0x20016085 0x85 3 1 1 ?
0x20026086 0x86 3 2 1 ?
0x20026088 0x88 3 2 1 ?
0x2004608C 0x8C 3 Variable 1 ?
0x20016091 0x91 3 1 1 ?
0x20026092 0x92 3 2 1 ?
0x20026094 0x94 3 2 1 ?
0x800160B5 0xB5 3 1 4 ?
0x800000BA 0xBA 0 0 4 ?
0x601000BE 0xBE 0 16 3 ?
0x200100BF 0xBF 0 1 1 ?
0x200180C0 0xC0 4 1 1 ?
0x200400C9 0xC9 0 4 1 SetOuterTessellationLevels
0x200200CD 0xCD 0 2 1 SetInnerTessellationLevels
0x200100DC 0xDC 0 1 1 ?
0x800?00DF 0xDF 0 Variable 4 SetRasterizerDiscard?
0x20048100 0x100 4 4 1 ?
0x20028102 0x102 4 2 1 ?
0x20018104 0x104 4 1 1 ?
0x20018105 0x105 4 1 1 ?
0x20018106 0x106 4 1 1 ?
0x200181C0 0x1C0 4 1 1 ?
0x200181C2 0x1C2 4 1 1 ?
0x200181C3 0x1C3 4 1 1 ?
0x200281C4 0x1C4 4 2 1 ?
0x200181C5 0x1C5 4 1 1 ?
0x200181C6 0x1C6 4 1 1 ?
0x200181C7 0x1C7 4 1 1 ?
0x200181C8 0x1C8 4 1 1 ?
0x200181CA 0x1CA 4 1 1 ?
0x200181CB 0x1CB 4 1 1 ?
0x200181CC 0x1CC 4 1 1 ?
0x200181CD 0x1CD 4 1 1 ?
0x200181CF 0x1CF 4 1 1 ?
0x800001D1 0x1D1 0 0 4 ?
0x200301F0 0x1F0 0 3 1 ?
0x800001F3 0x1F3 0 0 4 ?
0x200201F8 0x1F8 0 2 1 ?
0x200401FA 0x1FA 0 4 1 ?
0x20016223 0x223 3 1 1 ?
0x2004622C 0x22C 3 4 1 ?
0x20046230 0x230 3 4 1 ?
0x20046234 0x234 3 4 1 ?
0x200203?? 0x3?? Variable 0 2 1 SetScissors?
0x20040360 0x360 0 4 1 ?
0x20010364 0x364 0 1 1 ?
0x800?0368 0x368 0 Variable 4 ?
0x800?036B 0x36B 0 Variable 4 ?
0x800?036C 0x36C 0 Variable 4 ?
0x2001036F 0x36F 0 1 1 ?
0x80000370 0x370 0 0 4 ?
0x80000371 0x371 0 0 4 ?
0x80000372 0x372 0 0 4 ?
0x8???0373 0x373 0 Variable 4 SetPatchSize
0x80000374 0x374 0 0 4 ?
0x20010376 0x376 0 1 1 ?
0x800?03D5 0x3D5 0 Variable 4 ?
0x800?03D6 0x3D6 0 Variable 4 ?
0x800?03D7 0x3D7 0 Variable 4 ?
0x200103D9 0x3D9 0 1 1 SetTiledCacheTileSize
0x80?003DE 0x3DE 0 Variable 4 ?
0x800003E0 0x3E0 0 0 4 ?
0x200203E7 0x3E7 0 2 1 ?
0x800003ED 0x3ED 0 0 4 ?
0x800203EE 0x3EE 0 2 4 ?
0x200403EF 0x3EF 0 4 1 SetSampleMask
0x800003F5 0x3F5 0 0 4 ?
0x800103F6 0x3F6 0 1 4 ?
0x200503F8 0x3F8 0 5 1 ?
0x200203FD 0x3FD 0 2 1 ?
0x2004040C 0x40C 0 4 1 ?
0x200C0420 0x420 0 12 1 ?
0x80000446 0x446 0 0 4 ?
0x80000451 0x451 0 0 4 ?
0x800?0452 0x452 0 Variable 4 ?
0x20100458 0x458 0 16 1 ?
0x20040478 0x478 0 4 1 ?
0x8000047C 0x47C 0 0 4 ?
0x8000047E 0x47E 0 0 4 ?
0x8001047F 0x47F 0 1 4 ResolveDepthBuffer
0x2003048A 0x48A 0 3 1 ?
0x800?04B3 0x4B3 0 Variable 4 ?
0x800104B9 0x4B9 0 1 4 ?
0x800?04BA 0x4BA 0 Variable 4 ?
0x800004BB 0x4BB 0 0 4 ?
0x800?04C3 0x4C3 0 Variable 4 ?
0x200104C4 0x4C4 0 1 1 SetAlphaRef
0x200404C7 0x4C7 0 4 1 SetBlendColor
0x800004E0 0x4E0 0 0 4 ?
0x800?04E5 0x4E5 0 Variable 4 ?
0x800?04E6 0x4E6 0 Variable 4 ?
0x800?04E7 0x4E7 0 Variable 4 ?
0x200204EC 0x4EC 0 2 1 SetLineWidth
0x800?050D 0x50D 0 Variable 4 ?
0x80000519 0x519 0 0 4 ?
0x80000540 0x540 0 0 4 ?
0x20010546 0x546 0 1 1 SetPointSize
0x2001054C 0x54C 0 1 1 ResetCounter
0x800?054E 0x54E 0 Variable 4 ?
0x20030554 0x554 0 3 1 SetRenderEnableConditional
0x800?0556 0x556 0 Variable 4 SetRenderEnable?
0x2001055B 0x55B 0 1 1 ?
0x2001056F 0x56F 0 1 1 ?
0x80000572 0x572 0 0 4 ?
0x800?0574 0x574 0 Variable 4 ?
0x8000057F 0x57F 0 0 4 ?
0x80000580 0x580 0 0 4 ?
0x80000591 0x591 0 0 4 ?
0x20010592 0x592 0 1 1 ?
0x200205F2 0x5F2 0 2 1 ?
0x800?05F6 0x5F6 0 Variable 4 ?
0x2001061F 0x61F 0 1 1 ?
0x800?0620 0x620 0 Variable 4 ?
0x80010646 0x646 0 1 4 ?
0x800?0648 0x648 0 Variable 4 ?
0x2001064F 0x64F 0 1 1 SetDepthClamp
0x800?066F 0x66F 0 Variable 4 ?
0x120020671 0x671 0 2 9 ?
0x20010674 0x674 0 1 1 ?
0x8000068B 0x68B 0 0 4 ?
0x200406C0 0x6C0 0 4 1 ?
0x20010703 0x703 0 1 1 ?
0x200207?? 0x7?? Variable 0 2 1 ?
0x80300830 0x830 0 48 4 ?
0x80400840 0x840 0 64 4 ?
0x80500850 0x850 0 80 4 ?
0x200308E0 0x8E0 0 3 1 ?
0x200308E3 0x8E3 0 3 1 CB_POS(Const buffer position)
0x200308E4 0x8E4 0 3 1 ?
0x80??0904 0x904 0 Variable 4 ?
0x80??090C 0x90C 0 Variable 4 ?
0x80??0914 0x914 0 Variable 4 ?
0x80??091C 0x91C 0 Variable 4 ?
0x80??0924 0x924 0 Variable 4 ?
0x80000D1E 0xD1E 0 0 4 ?
0x800?0D28 0xD28 0 Variable 4 ?
0x20010D29 0xD29 0 1 1 ?
0x20010D34 0xD34 0 1 1 ?
0xA0020E00 0xE00 0 2 5 BeginTransformFeedback
0x20010E02 0xE02 0 1 1 ?
0x80020E04 0xE04 0 2 4 ?
0x20010E06 0xE06 0 1 1 ?
0xA0030E0A 0xE0A 0 3 5 ?
0x80050E0C 0xE0C 0 5 4 ?
0x800?0E0E 0xE0E 0 Variable 4 ?
0x20010E10 0xE10 0 1 1 ?
0xA0040E12 0xE12 0 4 5 ?
0x20010E1A 0xE1A 0 1 1 ?
0xA0040E1C 0xE1C 0 4 5 ?
0x81900E1E 0xE1E 0 400 4 ?
0x80000E20 0xE20 0 0 4 ?
0x80000E24 0xE24 0 0 4 ?
0xA0040E2C 0xE2C 0 4 5 PushDebugGroup
0xA0020E2E 0xE2E 0 2 5 PopDebugGroupId
0xA0030E30 0xE30 0 3 5 DrawArrays
0xA0050E32 0xE32 0 5 5 DrawArraysIndirect?
0xA0050E34 0xE34 0 5 5 DrawArraysInstanced?
0xA0050E36 0xE36 0 5 5 DrawElements
0xA0060E38 0xE38 0 6 5 DrawElementsIndirect?
0xA0060E3A 0xE3A 0 6 5 DrawElementsInstanced?
0xA0050E42 0xE42 0 5 5 ?
0xA0060E44 0xE44 0 6 5 ?

Note: These still need to be heavily verified and could be wrong.

BindObject

In order to bind an engine object to a specific subchannel, method 0 (BindObject) must be used first. The target subchannel is specified in bits 15-13 of the command word.

After the engine object is bound to the desired subchannel, setting its value in bits 15-13 of any subsequent command word will make PFIFO forward the command to the target engine.

This method only takes one argument, an engine ID.

Engine IDs

ID Engine
0x902D FERMI_TWOD_A (2D)
0xB197 MAXWELL_B (3D)
0xB1C0 MAXWELL_COMPUTE_B
0xA140 KEPLER_INLINE_TO_MEMORY_B
0xB0B5 MAXWELL_DMA_COPY_A (DMA)
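
A BindObject command is therefore two words: a command word for method 0 with one argument in increasing mode, followed by the engine ID. The sketch below builds that pair; bind_object and the ENGINE_IDS dictionary are hypothetical names, and the subchannel numbers used in the example are arbitrary rather than the ones official software uses.

```python
# Engine IDs copied from the table above.
ENGINE_IDS = {
    "FERMI_TWOD_A": 0x902D,
    "MAXWELL_B": 0xB197,
    "MAXWELL_COMPUTE_B": 0xB1C0,
    "KEPLER_INLINE_TO_MEMORY_B": 0xA140,
    "MAXWELL_DMA_COPY_A": 0xB0B5,
}


def bind_object(subchannel, engine_id):
    """Build the two words that bind engine_id to subchannel."""
    if not 0 <= subchannel <= 7:
        raise ValueError("subchannel must fit in bits 15-13")
    # Method 0, one argument, submission mode 1 (increasing).
    header = 0x0000 | (subchannel << 13) | (1 << 16) | (1 << 29)
    return [header, engine_id]


if __name__ == "__main__":
    print([hex(w) for w in bind_object(0, ENGINE_IDS["MAXWELL_B"])])
```

The resulting command word follows the 0x2001?000 pattern listed for BindObject, with the subchannel filling in the unknown digit.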

Macro

Macros are small programs that can be uploaded to the GPU and are capable of reading and writing the 3D engine registers. Macros also accept parameters, which are stored in a FIFO. Macros are called using methods starting at 0xe00, where the first method triggers the macro execution and the second one pushes parameters to the FIFO. The macro program reads them with an instruction called parm, which pops the FIFO and returns the next parameter; this also allows programs to take a variable number of parameters if desired.

The first parameter is written to 0xe00 + n * 2 (where n is the macro index), and all subsequent parameters should be pushed to the FIFO using 0xe01 + n * 2. The first parameter is placed in the general purpose register R1 of the macro program when execution starts.

Official games use these macros to conditionally write registers. One example is the macro at 0xe24, which is used to set shader registers (including the shader address and binding the c1 Constant Buffer to the shader). In some cases, macros are also used to set registers unconditionally.
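
The macro method numbering described above can be sketched as follows. The helper names are hypothetical, and the use of submission mode 5 (increase-once) is an assumption based on the 0xE00+ entries in the Command List table: with that mode the first argument lands on the trigger method and the remaining ones on the parameter method.

```python
MACRO_BASE = 0xE00  # first macro method, as stated in the text


def macro_methods(index):
    """Return (trigger_method, parameter_method) for macro number index."""
    trigger = MACRO_BASE + index * 2
    return trigger, trigger + 1


def macro_call(index, params, subchannel=0):
    """Build a command that starts a macro and pushes its parameters.

    Assumes increase-once mode (5), so params[0] is written to the
    trigger method and params[1:] to the parameter method.
    """
    if not params:
        raise ValueError("a macro call needs at least one parameter")
    trigger, _ = macro_methods(index)
    header = trigger | (subchannel << 13) | (len(params) << 16) | (5 << 29)
    return [header, *params]


if __name__ == "__main__":
    print([hex(m) for m in macro_methods(0x12)])  # macro at 0xe24
    print([hex(w) for w in macro_call(0x18, [1, 2, 3])])
```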

Fences

Command lists can contain fences to ensure that commands are executed in the correct order, and that subsequent commands are only sent once the previously sent commands have been processed by the GPU. Fences use the QUERY_* commands and work like this:

  • First, QUERY_ADDRESS_HIGH and QUERY_ADDRESS_LOW commands are added to the Command List, with the high and low 32 bits of the 64-bit GPU Virtual Address where the fence is located. This GPU Virtual Address needs to be mapped to the process Virtual Address beforehand.
  • Then, QUERY_SEQUENCE is added with a sequential number. This number is basically an incrementing counter, so the first Command List can have QUERY_SEQUENCE = 1, the next one QUERY_SEQUENCE = 2, then 3, 4, and so on.
  • Finally, QUERY_GET is added and contains the mode and other unknown data.

The above commands are added using the increasing mode, since the IDs of these 4 registers are sequential.
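
The steps above amount to one increasing-mode command with four arguments. The sketch below builds it under stated assumptions: fence_commands is a hypothetical helper, and the method number of QUERY_ADDRESS_HIGH is passed in as a parameter because this page does not list it. The example uses 0x6C0, which is an assumption taken from the Fermi method listing and happens to match the four-argument entry 0x200406C0 in the Command List table.

```python
def fence_commands(query_address_high_method, fence_gpu_va, sequence, query_get):
    """Build the fence command: ADDRESS_HIGH, ADDRESS_LOW, SEQUENCE, GET."""
    # Four sequential registers, so one increasing-mode (1) command suffices.
    header = query_address_high_method | (4 << 16) | (1 << 29)
    return [
        header,
        (fence_gpu_va >> 32) & 0xFFFFFFFF,  # QUERY_ADDRESS_HIGH
        fence_gpu_va & 0xFFFFFFFF,          # QUERY_ADDRESS_LOW
        sequence & 0xFFFFFFFF,              # QUERY_SEQUENCE
        query_get,                          # QUERY_GET
    ]


if __name__ == "__main__":
    # 0x6C0 and the QUERY_GET value 0xF010 are illustrative assumptions.
    words = fence_commands(0x6C0, 0x123456000, 7, 0xF010)
    print([hex(w) for w in words])
```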

QUERY_GET Structure

Bits Description
1-0 Mode
4 Fence
15-12 Unit

QUERY_GET Mode

Value Mode
0 Write
1 Sync
2 Write ?
3 Write ?

TODO: Move this to a separate page with all GPU Commands with descriptions. Also figure out what the other values mean.

Some of the other fields are still unknown/unobserved.

Official games set Mode to 0, Fence to 1 and Unit to 0xF. The QUERY_SEQUENCE value is then written by the GPU to the address pointed to by QUERY_ADDRESS. On the CPU side, the game code should wait until the value at the address pointed to by QUERY_ADDRESS is greater than or equal to the last written SEQUENCE value. Official code waits for this condition in a loop, and won't send any further commands before it is true.
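
The QUERY_GET fields and the CPU-side wait can be sketched as below. Both function names are hypothetical, and read_fence stands in for whatever reads the 32-bit value at the fence address; the polling interval and the optional poll limit are additions for the sake of a testable example, not something observed in official code.

```python
import time


def pack_query_get(mode, fence, unit):
    """Pack the known QUERY_GET fields; all other bits are left at zero."""
    return (mode & 0x3) | ((fence & 0x1) << 4) | ((unit & 0xF) << 12)


def wait_for_fence(read_fence, sequence, poll_interval=0.0, max_polls=None):
    """Poll until the value written by the GPU reaches sequence.

    read_fence is a caller-supplied callable returning the current value
    at the fence address. Returns False if max_polls is exhausted first.
    """
    polls = 0
    while read_fence() < sequence:
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        time.sleep(poll_interval)
    return True


if __name__ == "__main__":
    # Mode 0, Fence 1, Unit 0xF: the values official games are said to use.
    print(hex(pack_query_get(0, 1, 0xF)))
```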

Vertex Data Submission

Note: This is an observation of how the game Puyo Puyo Tetris sends textured squares to the GPU.

  1. VERTEX_ATTRIB_FORMAT (0-15) are set (only the first 3 are really used, the rest are set to float, with Size = 1 and offset at 0).
  2. VERTEX_ARRAY_FETCH (0) is set with the lower 12 bits set to 0x1c (Stride) and bit 12 to 1 (Enabled).
  3. VERTEX_ARRAY_START_HIGH/LOW (0) are set to the GPU Virtual Address where the Vertex Data is located.
  4. VERTEX_ARRAY_LIMIT_HIGH/LOW (0) are set to the GPU Virtual Address where the Vertex Data is located, plus the Vertex Data size in bytes minus 1.
  5. VERTEX_BEGIN_GL is used with the primitive type set to TRIANGLE_STRIP.
  6. VERTEX_BUFFER_FIRST with value 0 (indicating the index of the first primitive to render?).
  7. VERTEX_BUFFER_COUNT is set to 4, because the Vertex Buffer with the square has 4 vertices.
  8. VERTEX_END_GL is used with value 0 (currently unknown what this value means).
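
The register values from steps 2 to 4 can be derived as shown below. This is a sketch with hypothetical helper names, and it assumes the vertex data is tightly packed, so that its size in bytes is stride times vertex count.

```python
def split_address(gpu_va):
    """Split a GPU Virtual Address into (high, low) 32-bit halves."""
    return (gpu_va >> 32) & 0xFFFFFFFF, gpu_va & 0xFFFFFFFF


def vertex_array_setup(vertex_gpu_va, stride, vertex_count):
    """Compute VERTEX_ARRAY_FETCH / START / LIMIT values for array 0."""
    if not 0 <= stride <= 0xFFF:
        raise ValueError("stride must fit in the lower 12 bits")
    size = stride * vertex_count          # assumes tightly packed vertices
    fetch = stride | (1 << 12)            # bit 12 enables the array
    start_high, start_low = split_address(vertex_gpu_va)
    limit_high, limit_low = split_address(vertex_gpu_va + size - 1)
    return {
        "fetch": fetch,
        "start": (start_high, start_low),
        "limit": (limit_high, limit_low),
    }


if __name__ == "__main__":
    # Four vertices with a stride of 0x1c, as in the observation above.
    print(vertex_array_setup(0x100000000, 0x1C, 4))
```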

Texture View

Texture information such as address, format and size is sent to the GPU through a structure known as Texture View (a.k.a. Texture Image Control, or TIC). Each texture that the game uses needs a separate TIC, and those TICs are written to a table, one after the other. Each TIC entry is 0x20 bytes long and is composed of 8 32-bit words into which the texture information is packed.

The indexes of the TIC entries that should be used by the shader are sent to the GPU with the CB_POS/CB_DATA (0) methods. Games usually follow these steps to write the TIC entry indexes:

  • Macro 0xe1a is used to set CB_ADDRESS_HIGH/LOW registers to the GPU Virtual Address of the Constant Buffer set on the register 0x982 (the Texture Constant Buffer index register), and also sets CB_SIZE.
  • CB_POS is used to set the write offset of the Constant Buffer to 0x20 + n * 4, where n is the index of the Handle being used on the shader sampler.
  • The CB_DATA (0) method is used to write the value into the Constant Buffer. The value is a Handle where the lower 20 bits are the TIC index, and the higher 12 bits are the TSC (Texture Sampler Control) index.

The address of a given TIC entry can be calculated as:

tic_entry_address = tic_base_address + tic_index * 0x20

Where tic_base_address is the address written to TIC_ADDRESS_HIGH/LOW (methods 0x1574 and 0x1578), tic_index is the lower 20 bits of the word written into the Const Buffer with CB_DATA (0), and 0x20 is the size of each TIC entry in bytes.
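
A short sketch of the Handle layout and the address calculation above. The function names are hypothetical; the field widths and the entry size come from the text.

```python
TIC_ENTRY_SIZE = 0x20  # bytes per TIC entry


def pack_handle(tic_index, tsc_index):
    """Combine a TIC index (lower 20 bits) and a TSC index (upper 12 bits)."""
    if not 0 <= tic_index <= 0xFFFFF:
        raise ValueError("tic_index must fit in 20 bits")
    if not 0 <= tsc_index <= 0xFFF:
        raise ValueError("tsc_index must fit in 12 bits")
    return tic_index | (tsc_index << 20)


def unpack_handle(handle):
    """Return (tic_index, tsc_index) from a Handle word."""
    return handle & 0xFFFFF, (handle >> 20) & 0xFFF


def tic_entry_address(tic_base_address, handle):
    """Address of the TIC entry referenced by a Handle."""
    tic_index, _ = unpack_handle(handle)
    return tic_base_address + tic_index * TIC_ENTRY_SIZE


if __name__ == "__main__":
    handle = pack_handle(5, 2)
    print(hex(handle), hex(tic_entry_address(0x10000, handle)))
```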

The texture is accessed in the shader using one of the texture sampling instructions (usually the TEXS instruction). One of the parameters of this instruction is the Handle index. This index starts at 8, so index 8 accesses the handle at 8 * 4 = 0x20 in the Texture Constant Buffer. Each shader stage has a separate Constant Buffer, so for fragment shaders the handle is located at CB_ADDRESS + 4 * CB_SIZE + TEXS_index * 4 (where the first 4 is the index of the fragment shader stage, and the second 4 is the size of a word, 4 bytes).
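
The location of a Handle for a given shader stage can be worked out as in the sketch below. The function and constant names are hypothetical, and the addresses in the example are made up.

```python
WORD_SIZE = 4             # bytes per Handle word
FRAGMENT_STAGE_INDEX = 4  # index of the fragment shader stage, per the text


def texture_handle_address(cb_address, cb_size, stage_index, texs_index):
    """Address of the Handle that a TEXS instruction with texs_index reads."""
    return cb_address + stage_index * cb_size + texs_index * WORD_SIZE


if __name__ == "__main__":
    # First usable Handle index (8) of the fragment stage.
    addr = texture_handle_address(0x20000, 0x1000, FRAGMENT_STAGE_INDEX, 8)
    print(hex(addr))
```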

TIC Structure

Word Bits Description
0 6-0 Texture Format
0 9-7 R Channel Data Type
0 12-10 G Channel Data Type
0 15-13 B Channel Data Type
0 18-16 A Channel Data Type
1 31-0 Lower 32-bits of the Texture GPU Virtual Address
2 15-0 Higher 16-bits of the Texture GPU Virtual Address
4 15-0 Texture Width minus 1
5 15-0 Texture Height minus 1

Channel Data Type

Value Type
1 SNORM
2 UNORM
3 SINT
4 UINT
5 SNORM_FORCE_FP16
6 UNORM_FORCE_FP16
7 FLOAT
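
Putting the two tables together, a TIC entry could be assembled as sketched below. Only the fields documented above are filled in and every other bit is left at zero, which is unlikely to be enough for a working texture. pack_tic is a hypothetical name, little-endian word order is an assumption, and the Texture Format value in the example is a placeholder since the page does not list format values.

```python
import struct

# Channel Data Type values from the table above.
CHANNEL_TYPES = {
    "SNORM": 1,
    "UNORM": 2,
    "SINT": 3,
    "UINT": 4,
    "SNORM_FORCE_FP16": 5,
    "UNORM_FORCE_FP16": 6,
    "FLOAT": 7,
}


def pack_tic(texture_format, r, g, b, a, gpu_va, width, height):
    """Pack the documented TIC fields into a 0x20-byte entry."""
    words = [0] * 8
    words[0] = (
        (texture_format & 0x7F)
        | (CHANNEL_TYPES[r] << 7)
        | (CHANNEL_TYPES[g] << 10)
        | (CHANNEL_TYPES[b] << 13)
        | (CHANNEL_TYPES[a] << 16)
    )
    words[1] = gpu_va & 0xFFFFFFFF       # lower 32 bits of the address
    words[2] = (gpu_va >> 32) & 0xFFFF   # higher 16 bits of the address
    words[4] = (width - 1) & 0xFFFF
    words[5] = (height - 1) & 0xFFFF
    # Assumption: words are stored little-endian.
    return struct.pack("<8I", *words)


if __name__ == "__main__":
    # 0x08 is a placeholder format value, not a documented one.
    entry = pack_tic(0x08, "UNORM", "UNORM", "UNORM", "UNORM",
                     0x123456000, 256, 128)
    print(entry.hex())
```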

References

FIFO engine overview: [1]

Method values from the Fermi family GPU (a bit older than the Tegra X1, but values seem to be mostly the same): [2]

TIC structure used on a Maxwell GPU: [3]

Values for some types used on the above XML: [4]

Command word packing code used on Mesa3d: [5]

TIC entry pack/write code used on Mesa3d: [6]