Only scalar half-precision operations are supported at the moment.
- Adds general support for the 'half' type in NVPTX.
- fp16 math operations are supported on sm_53+ GPUs only (can be disabled with --nvptx-no-f16-math).
- type conversions to/from fp16 are supported on all GPU variants.
- On GPU variants that do not have full fp16 support (or if it's disabled), fp16 operations are promoted to fp32 and results are converted back to fp16 for storage.
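To make the promotion concrete, here is a hedged CUDA sketch of what the lowering on such GPUs is equivalent to: operands are widened to fp32, the arithmetic happens in fp32, and the result is narrowed back to fp16 for storage. The kernel and its names are illustrative only, not code from the patch.

```cuda
#include <cuda_fp16.h>

// Illustrative kernel: explicit fp16 <-> fp32 conversions spell out the
// pattern the backend falls back to when native fp16 math is unavailable
// (or disabled with --nvptx-no-f16-math).
__global__ void half_add_promoted(const __half *a, const __half *b,
                                  __half *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float fa = __half2float(a[i]);  // cvt.f32.f16
    float fb = __half2float(b[i]);  // cvt.f32.f16
    float fr = fa + fb;             // add.f32
    out[i] = __float2half(fr);      // cvt.rn.f16.f32, stored as 16 bits
  }
}
```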
ptxas is rather peculiar when it comes to fp16-related syntax, so a few of its quirks had to be worked around:
- there's no way to represent an immediate fp16 argument as a hex value. We load such constants into a .b16 register first (see the inline-PTX sketch after this list).
- there are no .f16 variants of mov/ld/st instructions.
- ptxas supports .f16 registers on sm_53+ only. It does accept .b16 registers for all supported fp16-related operations on all GPU variants, so that's the register type the patch uses.
- The NVPTX ABI explicitly states that fp16 can't be used as a function argument or return value. It also states that arguments and return values must be at least 32 bits wide. The patch follows the doc and uses .b32 for fp16 arguments and return values. On the other hand, the current fp16 implementation in nvcc uses a struct to represent the fp16 type, which results in nvcc passing fp16 as an aggregate. I'm not sure whether we want/need to follow nvcc and pass fp16 as aggregates, too.
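As a hedged illustration of the immediate-constant workaround, the inline-PTX sketch below builds an fp16 constant the same way: a typed fp16 instruction can't take a hex immediate, but the untyped mov.b16 can, so the bit pattern goes into a 16-bit register first. 0x3C00 is the fp16 encoding of 1.0; the function name is made up for this example.

```cuda
#include <cuda_fp16.h>

// Place the raw fp16 bit pattern for 1.0 into a 16-bit register with an
// untyped mov.b16, then reinterpret those bits as a half.
__device__ __half half_one() {
  unsigned short bits;
  asm("mov.b16 %0, 0x3C00;" : "=h"(bits));  // hex immediate is fine for .b16
  return __ushort_as_half(bits);            // reinterpret the bits as fp16
}
```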
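On the calling-convention point, the sketch below is a hedged example of where the two approaches diverge. cuda_fp16.h's __half is a struct wrapping a 16-bit value, so under nvcc a non-inlined call like this one is the aggregate case mentioned above; a frontend that hands the backend LLVM's native 'half' type would instead get the widened .b32 parameter and return value this patch emits. Function names are illustrative only.

```cuda
#include <cuda_fp16.h>

// Non-inlined device function with an fp16 parameter and return value.
// The math itself runs in fp32 via conversions that are legal on all
// GPU variants.
__device__ __noinline__ __half scale_half(__half x) {
  return __float2half(__half2float(x) * 2.0f);
}

__global__ void scale_all(__half *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] = scale_half(data[i]);
}
```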
Everywhere we talk about f16s being stored, loaded, or returned as "integers" or "untyped integers", I think we should just say "a b16" or "an untyped value". An "untyped integer" might mean something like LLVM's i16, which is definitely an integer but is "untyped" only inasmuch as it might represent a signed or an unsigned int. That's conceptually different from .b16, which is just a bag of bits.
I've noted places that would need to change with the comment "b16", but you don't have to change all of them to read "b16", so long as we don't say "untyped integer". :)