Nintendo Switch Brew - User contributions [en]

Sockets services

2020-07-03T05:26:57Z

Rodrigo: /* Initalize */

= bsd:u, bsd:s =
This is "nn::socket::sf::IClient".

All the services commands but the first two return -1 on failure and set errno when that happens. Although Nintendo has the FreeBSD kernel's to socket stack, '''the errno macro definitions being in use actually come from Linux (and not from FreeBSD as one would expect!)'''.

These services have max_sessions 0x40. There's 22 IPC handler threads for bsd:u, and 11 for bsd:s.

{| class="wikitable" border="1"
|-
! Cmd || Name
|-
| 0 || RegisterClient ([[#Initialize]])
|-
| 1 || StartMonitoring
|-
| 2 || [[#Socket]]
|-
| 3 || [[#Socket|#SocketExempt]]
|-
| 4 || [[#Open]]
|-
| 5 || Select
|-
| 6 || Poll
|-
| 7 || [[#Sysctl]]
|-
| 8 || Recv
|-
| 9 || RecvFrom
|-
| 10 || Send
|-
| 11 || SendTo
|-
| 12 || Accept
|-
| 13 || Bind
|-
| 14 || Connect
|-
| 15 || GetPeerName
|-
| 16 || GetSockName
|-
| 17 || GetSockOpt
|-
| 18 || Listen
|-
| 19 || [[#Ioctl]]
|-
| 20 || [[#Fcntl]]
|-
| 21 || SetSockOpt
|-
| 22 || Shutdown
|-
| 23 || ShutdownAllSockets
|-
| 24 || Write
|-
| 25 || Read
|-
| 26 || Close
|-
| 27 || [[#DuplicateSocket]]
|-
| 28 || [[#GetResourceStatistics]]
|-
| 29 || [3.0.0+] [[#RecvMMsg]]
|-
| 30 || [3.0.0+] [[#SendMMsg]]
|-
| 31 || [7.0.0+] EventFd
|-
| 32 || [7.0.0+] [[#RegisterResourceStatisticsName]]
|-
| 33 || [10.0.0+] [[#Initialize2]]
|}

== Initialize ==
Takes a [[#BsdBufferConfig]] (made-up name), the PID, the size of the transfer memory and a copy-handle of the latter.

=== BsdBufferConfig ===
/// Configuration structure for bsdInitalize
typedef struct {
u32 version; ///< Observed 1 on 2.0 LibAppletWeb, 2 on 3.0.

u32 tcp_tx_buf_size; ///< Size of the TCP transfer (send) buffer (initial or fixed).
u32 tcp_rx_buf_size; ///< Size of the TCP recieve buffer (initial or fixed).
u32 tcp_tx_buf_max_size; ///< Maximum size of the TCP transfer (send) buffer. If it is 0, the size of the buffer is fixed to its initial value.
u32 tcp_rx_buf_max_size; ///< Maximum size of the TCP receive buffer. If it is 0, the size of the buffer is fixed to its initial value.

u32 udp_tx_buf_size; ///< Size of the UDP transfer (send) buffer (typically 0x2400 bytes).
u32 udp_rx_buf_size; ///< Size of the UDP receive buffer (typically 0xA500 bytes).

u32 sb_efficiency; ///< Number of buffers for each socket (standard values range from 1 to 8).
} BsdBufferConfig;

The transfer memory must be larger than a the computed size below. Should the transfer memory be smaller than that, the BSD sockets service would only send ZeroWindow packets (for TCP), resulting in a transfer rate not exceeding 1 byte/s.

static size_t _bsdGetTransferMemSizeForBufferConfig(const BsdBufferConfig *config)
{
u32 tcp_tx_buf_max_size = config->tcp_tx_buf_max_size != 0 ? config->tcp_tx_buf_max_size : config->tcp_tx_buf_size;
u32 tcp_rx_buf_max_size = config->tcp_rx_buf_max_size != 0 ? config->tcp_rx_buf_max_size : config->tcp_rx_buf_size;
u32 sum = tcp_tx_buf_max_size + tcp_rx_buf_max_size + config->udp_tx_buf_size + config->udp_rx_buf_size;

sum = ((sum + 0xFFF) >> 12) << 12; // page round-up
return (size_t)(config->sb_efficiency * sum);
}

== Socket ==
FreeBSD's <code>socket</code> command.
<code>bsd:u</code> disallows the usage of the <code>SOCK_SEQPACKET</code> and <code>SOCK_RAW</code> types, with the exception of <code>(AF_INET, SOCK_RAW, IPPROTO_ICMP)</code> (IPv4 <code>ping</code>), <code>bsd:s</code> needs to be used for those.

The only registered domains are <code>AF_INET</code> and <code>AF_ROUTE</code>.

SocketExempt: same as socket but the socket is immediately shutdown (disconnected) on creation.

== Open ==
FreeBSD's <code>open</code> command, limited to opening <code>/dev/bpf</code>. This can be used, for example, to enable promiscuous mode, see FreeBSD's <code>/dev/bpf</code> for more details.

== Sysctl ==
FreeBSD's <code>sysctl</code> command. Privileged operations are reserved for <code>bsd:s</code>.
<code>CTL_NET</code>, <code>CTL_VM</code>, <code>CTL_KERN</code> and <code>CTL_DEBUG</code> commands are implemented (?).

== Ioctl ==
FreeBSD's <code>ioctl</code> function. The following ioctls are whitelisted, refer to FreeBSD's headers for more details: SIOCATMARK, BIOCGBLEN, BIOCSETF BIOCIMMEDIATE, BIOCSETIF, BIOCVERSION, FIONSPACE, FIONWRITE, FIONREAD, SIOCGETSGCNT, SIOCGIFMETRIC, SIOCSIFMETRIC, SIOCDIFADDR, SIOCGIFINDEX, SIOCGIFADDR, SIOCGIFCONF, SIOCGIFNETMASK, SIOCAIFADDR, SIOCGIFMTU, SIOCSIFMTU, SIOCGIFMEDIA, SIOCSIFLLADDR and SIOCGIFXMEDIA.

Nintendo use the following definition (hence changing all ioctls using this structure):
struct bpf_program {
u_int bf_len;
struct bpf_insn bf_insns[BPF_MAXINSNS]; // [512]. This is a pointer in the official structure
};

== Fcntl ==
FreeBSD's <code>fcntl</code>, limited to <code>F_GETFL</code> and <code>F_SETFL</code> with <code>O_NONBLOCK</code>.

== DuplicateSocket ==
Takes a socket file descriptor and an unused u64. Duplicates the socket (FreeBSD's <code>dup</code>). Reserved to <code>bsd:s</code>.

== GetResourceStatistics ==
Takes a total of 8-bytes of input, a PID, a type-0x22 output buffer, and returns a total of 8-bytes of output.

[4.0.0+] Now takes an additional 8-bytes of input.

[7.0.0+] Now takes an additional type-0x21 input buffer.

== RecvMMsg ==
Takes a total of 0x20-bytes of input, a type-0x22 output buffer, and returns a total of 8-bytes of output.

[7.0.0+] The buffer was replaced with a type-0x6 output buffer.

== SendMMsg ==
Takes a total of 0xC-bytes of input, two type-0x21 input buffers, and returns a total of 8-bytes of output.

[7.0.0+] The buffers were replaced with a type-0x6 output buffer.

== RegisterResourceStatisticsName ==
With [10.0.0+] this now takes an additional 8-bytes of input.

== Initialize2 ==
Same input/output as [[#Initalize]] except this doesn't take an input handle.

The work-buffer (size is still specified with this cmd) is backed by memory in the sysmodule, instead of TransferMemory. The size (after alignment) must be >=0x1000 and <=0x13F000.

sdknso will only use this cmd when two flags in the input config are set: the first one being set indicates that the bsd:s service is used, while the second flag enables using this cmd. An error is thrown if the work-buffer size is <0x1000. Otherwise when these flags aren't set, [[#Initialize]] is used as usual.

= bsdcfg =
This is "nn::bsdsocket::cfg::ServerInterface".

{| class="wikitable" border="1"
|-
! Cmd || Name
|-
| 0 || [[#SetIfUp]]
|-
| 1 || [[#SetIfUpWithEvent]]
|-
| 2 || CancelIf
|-
| 3 || SetIfDown
|-
| 4 || GetIfState
|-
| 5 || DhcpRenew
|-
| 6 || AddStaticArpEntry
|-
| 7 || RemoveArpEntry
|-
| 8 || LookupArpEntry
|-
| 9 || LookupArpEntry2
|-
| 10 || ClearArpEntries
|-
| 11 || ClearArpEntries2
|-
| 12 || PrintArpEntries
|}

== SetIfUp ==
Takes a total of 0x28-bytes of input and a type-0x5 input buffer, no output.

[3.0.0+] Takes an additional 4-bytes of input.

== SetIfUpWithEvent ==
Takes a total of 0x28-bytes of input and a type-0x5 input buffer, returns an output handle.

[3.0.0+] Takes an additional 4-bytes of input.

= ethc:c =
This is "nn::eth::sf::IEthInterface".

{| class="wikitable" border="1"
|-
! Cmd || Name
|-
| 0 || Initialize
|-
| 1 || Cancel
|-
| 2 || GetResult
|-
| 3 || GetMediaList
|-
| 4 || SetMediaType
|-
| 5 || GetMediaType
|}

= ethc:i =
This is "nn::eth::sf::IEthInterfaceGroup".

{| class="wikitable" border="1"
|-
! Cmd || Name
|-
| 0 || GetReadableHandle
|-
| 1 || Cancel
|-
| 2 || GetResult
|-
| 3 || GetInterfaceList
|-
| 4 || GetInterfaceCount
|}

= sfdnsres =
This is "nn::socket::resolver::IResolver".

This service uses <code>bionic/libc/dns</code> to perform its tasks.

This has max_sessions 8 and 8 IPC handler threads.

{| class="wikitable" border="1"
|-
! Cmd || Name
|-
| 0 || SetDnsAddressesPrivateRequest (stubbed, returns 0x7FE03)
|-
| 1 || GetDnsAddressPrivateRequest (stubbed, returns 0x7FE03)
|-
| 2 || GetHostByNameRequest
|-
| 3 || GetHostByAddrRequest
|-
| 4 || GetHostStringErrorRequest
|-
| 5 || GetGaiStringErrorRequest
|-
| 6 || [[#GetAddrInfoRequest]]
|-
| 7 || GetNameInfoRequest
|-
| 8 || GetCancelHandleRequest
|-
| 9 || CancelRequest
|-
| 10 || [5.0.0+] GetHostByNameRequestWithOptions
|-
| 11 || [5.0.0+] GetHostByAddrRequestWithOptions
|-
| 12 || [5.0.0+] GetAddrInfoRequestWithOptions
|-
| 13 || [5.0.0+] GetNameInfoRequestWithOptions
|-
| 14 || [5.0.0+] ResolverSetOptionRequest
|-
| 15 || [5.0.0+] ResolverGetOptionRequest
|}

== GetAddrInfoRequest ==
Takes three type 5 buffers (host, port, and hints), and a type 6 buffer (the output addrinfos). Also takes a u8 (padded to 4 bytes so the next raw parameter can align), a u32, and a u64. The u8 is a boolean for whether to enable "nsd resolve" (1) or not (0). Not sure what the u32 is. It seems to either come from a parameter to <tt>GetAddrInfo</tt> or be zero. The u64 is most likely a placeholder for the server to copy the PID into and should be zero. Both hints and the output buffer contain serialized addrinfo chains. The hints buffer is sized 0x400 bytes long by default, and the output buffer 0x1000 bytes.

=== Addrinfo Serialization Format ===
Each struct addrinfo in the linked list is serialized according to this format and then written to the buffer. All numbers are in network byte order.

{| class="wikitable" border="1"
|-
! Size (bytes) || Name || Notes
|-
| 4 || Magic || Needs to be 0xBEEFCAFE. Seriously.
|-
| 4 || ai_flags ||
|-
| 4 || ai_family ||
|-
| 4 || ai_socktype ||
|-
| 4 || ai_protocol ||
|-
| 4 || ai_addrlen ||
|-
| ai_addrlen ? ai_addrlen : 4 || ai_addr ||
|-
| null-terminated string || ai_canonname ||
|}

If <tt>ai_addrlen</tt> is zero, <tt>ai_addr</tt> will occupy 4 bytes.

{| class="wikitable" border="1"
|-
! Size (bytes) || Name || Notes
|-
| 4 || ai_addr
|}

If <tt>ai_family</tt> is recognized as AF_INET6 (28) or AF_INET (2), <tt>ai_addr</tt> is read as <tt>struct sockaddr_in</tt> or <tt>struct sockaddr_in6</tt>. Otherwise, it's just read as <tt>u8[ai_addrlen]</tt>.

The list should be terminated with a sentinel four-byte zero value.

= nsd:u, nsd:a =
This is "nn::nsd::detail::IManager".

{| class="wikitable" border="1"
|-
! Cmd || Name
|-
| 10 || GetSettingName
|-
| 11 || [[#GetEnvironmentIdentifier]]
|-
| 12 || GetDeviceId
|-
| 13 || DeleteSettings
|-
| 14 || ImportSettings
|-
| 15 || [4.0.0+] SetChangeEnvironmentIdentifierDisabled
|-
| 20 || Resolve
|-
| 21 || ResolveEx
|-
| 30 || GetNasServiceSetting
|-
| 31 || GetNasServiceSettingEx
|-
| 40 || GetNasRequestFqdn
|-
| 41 || GetNasRequestFqdnEx
|-
| 42 || GetNasApiFqdn
|-
| 43 || GetNasApiFqdnEx
|-
| 50 || GetCurrentSetting
|-
| 51 || [9.0.0+] WriteTestParameter
|-
| 52 || [9.0.0+] ReadTestParameter
|-
| 60 || [[#ReadSaveDataFromFsForTest]]
|-
| 61 || [[#WriteSaveDataToFsForTest]]
|-
| 62 || [[#DeleteSaveDataOfFsForTest]]
|-
| 63 || [4.0.0+] IsChangeEnvironmentIdentifierDisabled
|-
| 64 || [10.0.0+] SetWithoutDomainExchangeFqdns
|-
| 100 || [10.0.0+] GetApplicationServerEnvironmentType
|-
| 101 || [10.0.0+] SetApplicationServerEnvironmentType
|-
| 102 || [10.0.0+] DeleteApplicationServerEnvironmentType
|}

== GetEnvironmentIdentifier ==
Takes a type-0x16 buffer with size 8. Returns a string.

The output string is used by [[NIM_services|NIM]] for the "eid:%s" in the User-Agent strings.

This is the "lp1" string also used in domains.

== ReadSaveDataFromFsForTest ==
Requires the <code>nsd!test_mode</code> setting to be equal to 1.

Mounts the system save data for bsdsockets as <code>nsdsave</code> and reads from <code>nsd:/file</code> to the specified buffer, at the specified size and offset with no checks whatsoever. <code>nsdsave</code> is then unmounted.

== WriteSaveDataToFsForTest ==
Requires the <code>nsd!test_mode</code> setting to be equal to 1.

Mounts the system save data for bsdsockets as <code>nsdsave</code> and writes to <code>nsd:/file</code> (appending is allowed) using the specified buffer, at the specified size and offset, with no checks whatsoever. <code>nsdsave</code> is then commited and unmounted.

== DeleteSaveDataOfFsForTest ==
Requires the <code>nsd!test_mode</code> setting to be equal to 1.

Deletes the system save data for bsdsockets.

[[Category:Services]]

GPU Shaders

2019-05-14T12:46:19Z

Rodrigo: Replace italics with " for consistency

= Overview =
Like most D3D11-era GPUs the Tegra X1 has 5 pipeline shader stages: Vertex, tessellation control, tessellation evaluation, geometry and fragment. Maxwell GPUs have two vertex shader stages: A and B, using both is not mandatory or common. There are also compute shaders or kernels which run on a separate channel.

== Header ==
Pipeline stage shaders unlike compute kernels have a header documented by Nvidia[https://download.nvidia.com/open-gpu-doc/Shader-Program-Header/1/Shader-Program-Header.html]. It is a 0x50 byte structure that describes shader's behavior and enabled features that don't belong to the code itself (e.g. local memory size, whether it writes global memory). Some of these entries are not mandatory and are not emitted by Nvidia's compiler.

It is located at the beginning of the program address. Program code starts right after it (at address 0x50, address 0 on compute kernels).

== Text ==
Every 4 instructions starting from the first one there's a scheduler instruction. All instructions are 64 bits long, including scheduler instructions.

Shaders emitted by Nvidia's compiler include a branching instruction pointing to itself after the final unconditional exit. The intention of this "sign" is unknown.

== Registers ==
Maxwell GPUs have 255 type-less general purpose registers and one special register with id 255, ''nvdisasm'' shows it as RZ and ''envydis'' as 0x0. Writing here is a no-op unless there are side effects. Reading from RZ returns zero. The fewer registers a shader uses, the more it can be parallelized.

General purpose registers or GPRs are 32 bits long and are the same for all operations, these are given meaning on the instructions. Half float instructions are SIMD and operate on 16 bit pairs, meanwhile double instructions take two registers to operate. uint64 values are emulated using uint32 instructions extending their domain through condition codes; when an instruction has to read an uint64 value it reads two subsequent registers.

It is a common technique to read or write subsequent registers with a single instruction. For example TEXS (used to sample a texture) reads the texture coordinates from Ra and Ra+1, although it gets more complex with other layouts like 3D textures (it's not Ra+2). The result of the sample is given in an Rd and their subsequent registers.

TODO Add nvdisasm example

== Predicates ==
There are 6 general purpose "1-bit" predicates. These can be used to conditionally execute a given instructions. There is an extra predicate that always evaluates as true (and false when negated), writing here is a no-op unless there are side-effects. It is shown as "PT" in ''nvdisasm''.

Most of the time predicates can be negated.

TODO Add nvdisasm example

== Condition codes ==
These mostly match x86 condition codes. They are explicitly set by instructions with a ".CC" flag or by special instructions like R2P.

These can be analyzed with the P2R instruction (which moves predicates or condition codes to a register).

= Analysis Tools =

== Disassembling ==
To disassemble a Maxwell shader or compute kernel there are two applications:
* nvdisasm[https://developer.nvidia.com/cuda-toolkit]
* envydis[https://github.com/envytools/envytools/tree/master/envydis]

=== nvdisasm ===
Nvidia's disassembler. Since it's made by Nvidia it usually knows more instruction variants. It tends to print the intention of the instructions rather than their internal encoding, helping visual inspection. It can show the binary representation with an option.

One of its major problems is that the input file has to be perfectly aligned with a stride of 0x20 bytes or it will otherwise fail. Appending zeroes to the file will make it happy since those are NOP instructions. Invalid instructions will also abort the disassemble.

To disassemble a Maxwell shader or kernel in a file this command can be used:

$ nvdisasm -b SM53 <file>

For more information about this program it's recommended to read its help entry and CUDA Toolkit Documentation[https://docs.nvidia.com/cuda/cuda-binary-utilities/].

=== envydis ===
''envydis'' is nouveau's assembler and disassembler. Unlike Nvidia's disassembler this one doesn't care about alignment or invalid instructions, so it is useful to analyze a raw memory dump while searching for a shader or kernel. Its declarative-style C file[https://github.com/envytools/envytools/blob/master/envydis/gm107.c] describes instructions internal encoding.

To disassemble with it:

$ envydis -m gm107 -i <file>

''envydis'' also has an assembler variant: ''envyas''. To edit existing shaders or write shaders from scratch this is easier than assembling the bits manually.

Since it's the product of reverse engineering the GPU and ''nvdisasm'' itself, it might contain errors.

== Dumping existing shaders ==
One method to dump shaders from commercial games and homebrew applications is running it on an emulator.

Ryujinx[https://ryujinx.org/#/] offers in its configuration file an entry to dump the encountered shaders in two directories. ''Code'' contains shader dumps that are compatible with 'nvdisasm' and 'envydis' as they are. ''Full'' is the same as ''Code'' but it includes the header, it's intended to be decompiled with ''Ryujinx.ShaderTools''.

yuzu[https://yuzu-emu.org/] can dump shaders using its disk-based shader cache with an external tool. ''maxwell-dump''[https://gist.github.com/ReinUsesLisp/7ba72d3162e60cab283194fcca3474b2] is a small C application for this, it follows the same convention as Ryujinx for ''Code'' and ''Full''. To find the shader file dumped by the emulator, right click the game from the Qt front-end and select the option to view the transferable cache file. It's important to highlight that the file format might change in the future leaving this tool unable to extract the binaries until it's updated.

= Maxwell Instruction Set Architecture =
TODO Define instructions

GPU Shaders

2019-05-14T12:33:36Z

Rodrigo: Fix predicates count and remove "always false"

= Overview =
Like most D3D11-era GPUs the Tegra X1 has 5 pipeline shader stages: Vertex, tessellation control, tessellation evaluation, geometry and fragment. Maxwell GPUs have two vertex shader stages: A and B, using both is not mandatory or common. There are also compute shaders or kernels which run on a separate channel.

== Header ==
Pipeline stage shaders unlike compute kernels have a header documented by Nvidia[https://download.nvidia.com/open-gpu-doc/Shader-Program-Header/1/Shader-Program-Header.html]. It is a 0x50 byte structure that describes shader's behavior and enabled features that don't belong to the code itself (e.g. local memory size, whether it writes global memory). Some of these entries are not mandatory and are not emitted by Nvidia's compiler.

It is located at the beginning of the program address. Program code starts right after it (at address 0x50, address 0 on compute kernels).

== Text ==
Every 4 instructions starting from the first one there's a scheduler instruction. All instructions are 64 bits long, including scheduler instructions.

Shaders emitted by Nvidia's compiler include a branching instruction pointing to itself after the final unconditional exit. The intention of this "sign" is unknown.

== Registers ==
Maxwell GPUs have 255 type-less general purpose registers and one special register with id 255, ''nvdisasm'' shows it as RZ and ''envydis'' as 0x0. Writing here is a no-op unless there are side effects. Reading from RZ returns zero. The fewer registers a shader uses, the more it can be parallelized.

General purpose registers or GPRs are 32 bits long and are the same for all operations, these are given meaning on the instructions. Half float instructions are SIMD and operate on 16 bit pairs, meanwhile double instructions take two registers to operate. uint64 values are emulated using uint32 instructions extending their domain through condition codes; when an instruction has to read an uint64 value it reads two subsequent registers.

It is a common technique to read or write subsequent registers with a single instruction. For example TEXS (used to sample a texture) reads the texture coordinates from Ra and Ra+1, although it gets more complex with other layouts like 3D textures (it's not Ra+2). The result of the sample is given in an Rd and their subsequent registers.

TODO Add nvdisasm example

== Predicates ==
There are 6 general purpose "1-bit" predicates. These can be used to conditionally execute a given instructions. There is an extra predicate that always evaluates as true (and false when negated), writing here is a no-op unless there are side-effects. It is shown as ''PT'' in ''nvdisasm''.

Most of the time predicates can be negated.

TODO Add nvdisasm example

== Condition codes ==
These mostly match x86 condition codes. They are explicitly set by instructions with a ".CC" flag or by special instructions like R2P.

These can be analyzed with the P2R instruction (which moves predicates or condition codes to a register).

= Analysis Tools =

== Disassembling ==
To disassemble a Maxwell shader or compute kernel there are two applications:
* nvdisasm[https://developer.nvidia.com/cuda-toolkit]
* envydis[https://github.com/envytools/envytools/tree/master/envydis]

=== nvdisasm ===
Nvidia's disassembler. Since it's made by Nvidia it usually knows more instruction variants. It tends to print the intention of the instructions rather than their internal encoding, helping visual inspection. It can show the binary representation with an option.

One of its major problems is that the input file has to be perfectly aligned with a stride of 0x20 bytes or it will otherwise fail. Appending zeroes to the file will make it happy since those are NOP instructions. Invalid instructions will also abort the disassemble.

To disassemble a Maxwell shader or kernel in a file this command can be used:

$ nvdisasm -b SM53 <file>

For more information about this program it's recommended to read its help entry and CUDA Toolkit Documentation[https://docs.nvidia.com/cuda/cuda-binary-utilities/].

=== envydis ===
''envydis'' is nouveau's assembler and disassembler. Unlike Nvidia's disassembler this one doesn't care about alignment or invalid instructions, so it is useful to analyze a raw memory dump while searching for a shader or kernel. Its declarative-style C file[https://github.com/envytools/envytools/blob/master/envydis/gm107.c] describes instructions internal encoding.

To disassemble with it:

$ envydis -m gm107 -i <file>

''envydis'' also has an assembler variant: ''envyas''. To edit existing shaders or write shaders from scratch this is easier than assembling the bits manually.

Since it's the product of reverse engineering the GPU and ''nvdisasm'' itself, it might contain errors.

== Dumping existing shaders ==
One method to dump shaders from commercial games and homebrew applications is running it on an emulator.

Ryujinx[https://ryujinx.org/#/] offers in its configuration file an entry to dump the encountered shaders in two directories. ''Code'' contains shader dumps that are compatible with 'nvdisasm' and 'envydis' as they are. ''Full'' is the same as ''Code'' but it includes the header, it's intended to be decompiled with ''Ryujinx.ShaderTools''.

yuzu[https://yuzu-emu.org/] can dump shaders using its disk-based shader cache with an external tool. ''maxwell-dump''[https://gist.github.com/ReinUsesLisp/7ba72d3162e60cab283194fcca3474b2] is a small C application for this, it follows the same convention as Ryujinx for ''Code'' and ''Full''. To find the shader file dumped by the emulator, right click the game from the Qt front-end and select the option to view the transferable cache file. It's important to highlight that the file format might change in the future leaving this tool unable to extract the binaries until it's updated.

= Maxwell Instruction Set Architecture =
TODO Define instructions

GPU Shaders

2019-05-14T12:17:17Z

Rodrigo: Fix up registers count

= Overview =
Like most D3D11-era GPUs the Tegra X1 has 5 pipeline shader stages: Vertex, tessellation control, tessellation evaluation, geometry and fragment. Maxwell GPUs have two vertex shader stages: A and B, using both is not mandatory or common. There are also compute shaders or kernels which run on a separate channel.

== Header ==
Pipeline stage shaders unlike compute kernels have a header documented by Nvidia[https://download.nvidia.com/open-gpu-doc/Shader-Program-Header/1/Shader-Program-Header.html]. It is a 0x50 byte structure that describes shader's behavior and enabled features that don't belong to the code itself (e.g. local memory size, whether it writes global memory). Some of these entries are not mandatory and are not emitted by Nvidia's compiler.

It is located at the beginning of the program address. Program code starts right after it (at address 0x50, address 0 on compute kernels).

== Text ==
Every 4 instructions starting from the first one there's a scheduler instruction. All instructions are 64 bits long, including scheduler instructions.

Shaders emitted by Nvidia's compiler include a branching instruction pointing to itself after the final unconditional exit. The intention of this "sign" is unknown.

== Registers ==
Maxwell GPUs have 255 type-less general purpose registers and one special register with id 255, ''nvdisasm'' shows it as RZ and ''envydis'' as 0x0. Writing here is a no-op unless there are side effects. Reading from RZ returns zero. The fewer registers a shader uses, the more it can be parallelized.

General purpose registers or GPRs are 32 bits long and are the same for all operations, these are given meaning on the instructions. Half float instructions are SIMD and operate on 16 bit pairs, meanwhile double instructions take two registers to operate. uint64 values are emulated using uint32 instructions extending their domain through condition codes; when an instruction has to read an uint64 value it reads two subsequent registers.

It is a common technique to read or write subsequent registers with a single instruction. For example TEXS (used to sample a texture) reads the texture coordinates from Ra and Ra+1, although it gets more complex with other layouts like 3D textures (it's not Ra+2). The result of the sample is given in an Rd and their subsequent registers.

TODO Add nvdisasm example

== Predicates ==
There are 5 general purpose "1-bit" predicates. These can be used to conditionally execute a given instructions. There are 2 more predicates which evaluate as always true and always false.

Most of the time predicates can be negated.

TODO Add nvdisasm example

== Condition codes ==
These mostly match x86 condition codes. They are explicitly set by instructions with a ".CC" flag or by special instructions like R2P.

These can be analyzed with the P2R instruction (which moves predicates or condition codes to a register).

= Analysis Tools =

== Disassembling ==
To disassemble a Maxwell shader or compute kernel there are two applications:
* nvdisasm[https://developer.nvidia.com/cuda-toolkit]
* envydis[https://github.com/envytools/envytools/tree/master/envydis]

=== nvdisasm ===
Nvidia's disassembler. Since it's made by Nvidia it usually knows more instruction variants. It tends to print the intention of the instructions rather than their internal encoding, helping visual inspection. It can show the binary representation with an option.

One of its major problems is that the input file has to be perfectly aligned with a stride of 0x20 bytes or it will otherwise fail. Appending zeroes to the file will make it happy since those are NOP instructions. Invalid instructions will also abort the disassemble.

To disassemble a Maxwell shader or kernel in a file this command can be used:

$ nvdisasm -b SM53 <file>

For more information about this program it's recommended to read its help entry and CUDA Toolkit Documentation[https://docs.nvidia.com/cuda/cuda-binary-utilities/].

=== envydis ===
''envydis'' is nouveau's assembler and disassembler. Unlike Nvidia's disassembler this one doesn't care about alignment or invalid instructions, so it is useful to analyze a raw memory dump while searching for a shader or kernel. Its declarative-style C file[https://github.com/envytools/envytools/blob/master/envydis/gm107.c] describes instructions internal encoding.

To disassemble with it:

$ envydis -m gm107 -i <file>

''envydis'' also has an assembler variant: ''envyas''. To edit existing shaders or write shaders from scratch this is easier than assembling the bits manually.

Since it's the product of reverse engineering the GPU and ''nvdisasm'' itself, it might contain errors.

== Dumping existing shaders ==
One method to dump shaders from commercial games and homebrew applications is running it on an emulator.

Ryujinx[https://ryujinx.org/#/] offers in its configuration file an entry to dump the encountered shaders in two directories. ''Code'' contains shader dumps that are compatible with 'nvdisasm' and 'envydis' as they are. ''Full'' is the same as ''Code'' but it includes the header, it's intended to be decompiled with ''Ryujinx.ShaderTools''.

yuzu[https://yuzu-emu.org/] can dump shaders using its disk-based shader cache with an external tool. ''maxwell-dump''[https://gist.github.com/ReinUsesLisp/7ba72d3162e60cab283194fcca3474b2] is a small C application for this, it follows the same convention as Ryujinx for ''Code'' and ''Full''. To find the shader file dumped by the emulator, right click the game from the Qt front-end and select the option to view the transferable cache file. It's important to highlight that the file format might change in the future leaving this tool unable to extract the binaries until it's updated.

= Maxwell Instruction Set Architecture =
TODO Define instructions

GPU Shaders

2019-05-14T12:01:28Z

Rodrigo: Rephrase uint64

= Overview =
Like most D3D11-era GPUs the Tegra X1 has 5 pipeline shader stages: Vertex, tessellation control, tessellation evaluation, geometry and fragment. Maxwell GPUs have two vertex shader stages: A and B, using both is not mandatory or common. There are also compute shaders or kernels which run on a separate channel.

== Header ==
Pipeline stage shaders unlike compute kernels have a header documented by Nvidia[https://download.nvidia.com/open-gpu-doc/Shader-Program-Header/1/Shader-Program-Header.html]. It is a 0x50 byte structure that describes shader's behavior and enabled features that don't belong to the code itself (e.g. local memory size, whether it writes global memory). Some of these entries are not mandatory and are not emitted by Nvidia's compiler.

It is located at the beginning of the program address. Program code starts right after it (at address 0x50, address 0 on compute kernels).

== Text ==
Every 4 instructions starting from the first one there's a scheduler instruction. All instructions are 64 bits long, including scheduler instructions.

Shaders emitted by Nvidia's compiler include a branching instruction pointing to itself after the final unconditional exit. The intention of this "sign" is unknown.

== Registers ==
Maxwell GPUs have 254 type-less general purpose registers and one special register with id 255, ''nvdisasm'' shows it as RZ and ''envydis'' as 0x0. Writing here is a no-op unless there are side effects. Reading from RZ returns zero. The fewer registers a shader uses, the more it can be parallelized.

General purpose registers or GPRs are 32 bits long and are the same for all operations, these are given meaning on the instructions. Half float instructions are SIMD and operate on 16 bit pairs, meanwhile double instructions take two registers to operate. uint64 values are emulated using uint32 instructions extending their domain through condition codes; when an instruction has to read an uint64 value it reads two subsequent registers.

It is a common technique to read or write subsequent registers with a single instruction. For example TEXS (used to sample a texture) reads the texture coordinates from Ra and Ra+1, although it gets more complex with other layouts like 3D textures (it's not Ra+2). The result of the sample is given in an Rd and their subsequent registers.

TODO Add nvdisasm example

== Predicates ==
There are 5 general purpose "1-bit" predicates. These can be used to conditionally execute a given instructions. There are 2 more predicates which evaluate as always true and always false.

Most of the time predicates can be negated.

TODO Add nvdisasm example

== Condition codes ==
These mostly match x86 condition codes. They are explicitly set by instructions with a ".CC" flag or by special instructions like R2P.

These can be analyzed with the P2R instruction (which moves predicates or condition codes to a register).

= Analysis Tools =

== Disassembling ==
To disassemble a Maxwell shader or compute kernel there are two applications:
* nvdisasm[https://developer.nvidia.com/cuda-toolkit]
* envydis[https://github.com/envytools/envytools/tree/master/envydis]

=== nvdisasm ===
Nvidia's disassembler. Since it's made by Nvidia it usually knows more instruction variants. It tends to print the intention of the instructions rather than their internal encoding, helping visual inspection. It can show the binary representation with an option.

One of its major problems is that the input file has to be perfectly aligned with a stride of 0x20 bytes or it will otherwise fail. Appending zeroes to the file will make it happy since those are NOP instructions. Invalid instructions will also abort the disassemble.

To disassemble a Maxwell shader or kernel in a file this command can be used:

$ nvdisasm -b SM53 <file>

For more information about this program it's recommended to read its help entry and CUDA Toolkit Documentation[https://docs.nvidia.com/cuda/cuda-binary-utilities/].

=== envydis ===
''envydis'' is nouveau's assembler and disassembler. Unlike Nvidia's disassembler this one doesn't care about alignment or invalid instructions, so it is useful to analyze a raw memory dump while searching for a shader or kernel. Its declarative-style C file[https://github.com/envytools/envytools/blob/master/envydis/gm107.c] describes instructions internal encoding.

To disassemble with it:

$ envydis -m gm107 -i <file>

''envydis'' also has an assembler variant: ''envyas''. To edit existing shaders or write shaders from scratch this is easier than assembling the bits manually.

Since it's the product of reverse engineering the GPU and ''nvdisasm'' itself, it might contain errors.

== Dumping existing shaders ==
One method to dump shaders from commercial games and homebrew applications is running it on an emulator.

Ryujinx[https://ryujinx.org/#/] offers in its configuration file an entry to dump the encountered shaders in two directories. ''Code'' contains shader dumps that are compatible with 'nvdisasm' and 'envydis' as they are. ''Full'' is the same as ''Code'' but it includes the header, it's intended to be decompiled with ''Ryujinx.ShaderTools''.

yuzu[https://yuzu-emu.org/] can dump shaders using its disk-based shader cache with an external tool. ''maxwell-dump''[https://gist.github.com/ReinUsesLisp/7ba72d3162e60cab283194fcca3474b2] is a small C application for this, it follows the same convention as Ryujinx for ''Code'' and ''Full''. To find the shader file dumped by the emulator, right click the game from the Qt front-end and select the option to view the transferable cache file. It's important to highlight that the file format might change in the future leaving this tool unable to extract the binaries until it's updated.

= Maxwell Instruction Set Architecture =
TODO Define instructions

GPU Shaders

2019-05-14T11:57:24Z

Rodrigo: Fix up grammar

= Overview =
Like most D3D11-era GPUs the Tegra X1 has 5 pipeline shader stages: Vertex, tessellation control, tessellation evaluation, geometry and fragment. Maxwell GPUs have two vertex shader stages: A and B, using both is not mandatory or common. There are also compute shaders or kernels which run on a separate channel.

== Header ==
Pipeline stage shaders unlike compute kernels have a header documented by Nvidia[https://download.nvidia.com/open-gpu-doc/Shader-Program-Header/1/Shader-Program-Header.html]. It is a 0x50 byte structure that describes shader's behavior and enabled features that don't belong to the code itself (e.g. local memory size, whether it writes global memory). Some of these entries are not mandatory and are not emitted by Nvidia's compiler.

It is located at the beginning of the program address. Program code starts right after it (at address 0x50, address 0 on compute kernels).

== Text ==
Every 4 instructions starting from the first one there's a scheduler instruction. All instructions are 64 bits long, including scheduler instructions.

Shaders emitted by Nvidia's compiler include a branching instruction pointing to itself after the final unconditional exit. The intention of this "sign" is unknown.

== Registers ==
Maxwell GPUs have 254 type-less general purpose registers and one special register with id 255, ''nvdisasm'' shows it as RZ and ''envydis'' as 0x0. Writing here is a no-op unless there are side effects. Reading from RZ returns zero. The fewer registers a shader uses, the more it can be parallelized.

General purpose registers or GPRs are 32 bits long and are the same for all operations, these are given meaning on the instructions. Half float instructions are SIMD and operate on 16 bit pairs, meanwhile double instructions take two registers to operate. uint64 instructions are read two subsequent registers but operate on individual uint32 values extending their domain through the carry flag.

It is a common technique to read or write subsequent registers with a single instruction. For example TEXS (used to sample a texture) reads the texture coordinates from Ra and Ra+1, although it gets more complex with other layouts like 3D textures (it's not Ra+2). The result of the sample is given in an Rd and their subsequent registers.

TODO Add nvdisasm example

== Predicates ==
There are 5 general purpose "1-bit" predicates. These can be used to conditionally execute a given instructions. There are 2 more predicates which evaluate as always true and always false.

Most of the time predicates can be negated.

TODO Add nvdisasm example

== Condition codes ==
These mostly match x86 condition codes. They are explicitly set by instructions with a ".CC" flag or by special instructions like R2P.

These can be analyzed with the P2R instruction (which moves predicates or condition codes to a register).

= Analysis Tools =

== Disassembling ==
To disassemble a Maxwell shader or compute kernel there are two applications:
* nvdisasm[https://developer.nvidia.com/cuda-toolkit]
* envydis[https://github.com/envytools/envytools/tree/master/envydis]

=== nvdisasm ===
Nvidia's disassembler. Since it's made by Nvidia it usually knows more instruction variants. It tends to print the intention of the instructions rather than their internal encoding, helping visual inspection. It can show the binary representation with an option.

One of its major problems is that the input file has to be perfectly aligned with a stride of 0x20 bytes or it will otherwise fail. Appending zeroes to the file will make it happy since those are NOP instructions. Invalid instructions will also abort the disassemble.

To disassemble a Maxwell shader or kernel in a file this command can be used:

$ nvdisasm -b SM53 <file>

For more information about this program it's recommended to read its help entry and CUDA Toolkit Documentation[https://docs.nvidia.com/cuda/cuda-binary-utilities/].

=== envydis ===
''envydis'' is nouveau's assembler and disassembler. Unlike Nvidia's disassembler this one doesn't care about alignment or invalid instructions, so it is useful to analyze a raw memory dump while searching for a shader or kernel. Its declarative-style C file[https://github.com/envytools/envytools/blob/master/envydis/gm107.c] describes instructions internal encoding.

To disassemble with it:

$ envydis -m gm107 -i <file>

''envydis'' also has an assembler variant: ''envyas''. To edit existing shaders or write shaders from scratch this is easier than assembling the bits manually.

Since it's the product of reverse engineering the GPU and ''nvdisasm'' itself, it might contain errors.

== Dumping existing shaders ==
One method to dump shaders from commercial games and homebrew applications is running it on an emulator.

Ryujinx[https://ryujinx.org/#/] offers in its configuration file an entry to dump the encountered shaders in two directories. ''Code'' contains shader dumps that are compatible with 'nvdisasm' and 'envydis' as they are. ''Full'' is the same as ''Code'' but it includes the header, it's intended to be decompiled with ''Ryujinx.ShaderTools''.

yuzu[https://yuzu-emu.org/] can dump shaders using its disk-based shader cache with an external tool. ''maxwell-dump''[https://gist.github.com/ReinUsesLisp/7ba72d3162e60cab283194fcca3474b2] is a small C application for this, it follows the same convention as Ryujinx for ''Code'' and ''Full''. To find the shader file dumped by the emulator, right click the game from the Qt front-end and select the option to view the transferable cache file. It's important to highlight that the file format might change in the future leaving this tool unable to extract the binaries until it's updated.

= Maxwell Instruction Set Architecture =
TODO Define instructions

GPU Shaders

2019-05-14T06:51:51Z

Rodrigo: Created page with "= Overview = Like most D3D11-era GPUs the Tegra X1 has 5 pipeline shader stages: Vertex, tessellation control, tessellation evaluation, geometry and fragment. Maxwell GPUs hav..."

= Overview =
Like most D3D11-era GPUs the Tegra X1 has 5 pipeline shader stages: Vertex, tessellation control, tessellation evaluation, geometry and fragment. Maxwell GPUs have two vertex shader stages: A and B, using both is not mandatory or common. There are also compute shaders or kernels which run on a separate channel.

== Header ==
Pipeline stage shaders unlike compute kernels have a header documented by Nvidia[https://download.nvidia.com/open-gpu-doc/Shader-Program-Header/1/Shader-Program-Header.html]. It is a 0x50 byte structure that describes shader's behavior and enabled features that don't belong to the code itself (e.g. local memory size, whether it writes global memory). Some of these entries are not mandatory and are not emitted by Nvidia's compiler.

It is located at the beginning of the program address. Program code starts right after it (at address 0x50, address 0 on compute kernels).

== Text ==
Every 4 instructions starting from the first one there's a scheduler instruction. All instructions are 64 bits long, including scheduler instructions.

Shaders emitted by Nvidia's compiler include a branching instruction pointing to itself after the final unconditional exit. The intention of this "sign" is unknown.

== Registers ==
Maxwell GPUs have 254 type-less general purpose registers and one special register with id 255, ''nvdisasm'' shows it as RZ and ''envydis'' as 0x0. Writing here is a no-op unless there are side effects. Reading from RZ is returns zero. The fewer registers a shader uses, the more it can be parallelized.

General purpose registers or GPRs are 32 bits long and are the same for all operations, these are given meaning on the instructions. Half float instructions are SIMD and operate on 16 bit pairs, meanwhile double instructions take two registers to operate. uint64 instructions are read two subsequent registers but operate on individual uint32 values extending their domain through the carry flag.

It is a common technique to read or write subsequent registers with a single instruction. For example TEXS (used to sample a texture) reads the texture coordinates from Ra and Ra+1, although it gets more complex with other layouts like 3D textures (it's not Ra+2). The result of the sample is given in an Rd and their subsequent registers.

TODO Add nvdisasm example

== Predicates ==
There are 5 general purpose "1-bit" predicates. These can be used to conditionally execute a given instructions. There are 2 more predicates which evaluate as always true and always false.

Most of the time predicates can be negated.

TODO Add nvdisasm example

== Condition codes ==
These mostly match x86 condition codes. They are explicitly set by instructions with a ".CC" flag or by special instructions like R2P.

These can be analyzed with the P2R instruction (which moves predicates or condition codes to a register).

= Analysis Tools =

== Disassembling ==
To disassemble a Maxwell shader or compute kernel there are two applications:
* nvdisasm[https://developer.nvidia.com/cuda-toolkit]
* envydis[https://github.com/envytools/envytools/tree/master/envydis]

=== nvdisasm ===
Nvidia's disassembler. Since it's made by Nvidia it usually knows more instruction variants. It tends to print the intention of the instructions rather than their internal encoding, helping visual inspection. It can show the binary representation with an option.

One of its major problems is that the input file has to be perfectly aligned with a stride of 0x20 bytes or it will otherwise fail. Appending zeroes to the file will make it happy since those are NOP instructions. Invalid instructions will also abort the disassemble.

To disassemble a Maxwell shader or kernel in a file this command can be used:

$ nvdisasm -b SM53 <file>

For more information about this program it's recommended to read its help entry and CUDA Toolkit Documentation[https://docs.nvidia.com/cuda/cuda-binary-utilities/].

=== envydis ===
''envydis'' is nouveau's assembler and disassembler. Unlike Nvidia's disassembler this one doesn't care about alignment or invalid instructions, so it is useful to analyze a raw memory dump while searching for a shader or kernel. Its declarative-style C file[https://github.com/envytools/envytools/blob/master/envydis/gm107.c] describes instructions internal encoding.

To disassemble with it:

$ envydis -m gm107 -i <file>

''envydis'' also has an assembler variant: ''envyas''. To edit existing shaders or write shaders from scratch this is easier than assembling the bits manually.

Since it's the product of reverse engineering the GPU and ''nvdisasm'' itself, it might contain errors.

== Dumping existing shaders ==
One method to dump shaders from commercial games and homebrew applications is running it on an emulator.

Ryujinx[https://ryujinx.org/#/] offers in its configuration file an entry to dump the encountered shaders in two directories. ''Code'' contains shader dumps that are compatible with 'nvdisasm' and 'envydis' as they are. ''Full'' is the same as ''Code'' but it includes the header, it's intended to be decompiled with ''Ryujinx.ShaderTools''.

yuzu[https://yuzu-emu.org/] can dump shaders using its disk-based shader cache with an external tool. ''maxwell-dump''[https://gist.github.com/ReinUsesLisp/7ba72d3162e60cab283194fcca3474b2] is a small C application for this, it follows the same convention as Ryujinx for ''Code'' and ''Full''. To find the shader file dumped by the emulator, right click the game from the Qt front-end and select the option to view the transferable cache file. It's important to highlight that the file format might change in the future leaving this tool unable to extract the binaries until it's updated.

= Maxwell Instruction Set Architecture =
TODO Define instructions