Decode AMDGPU Kernel descriptors as assembler directives.
Details
- Reviewers
- t-tye - scott.linder - tpr - arsenm - jhenderson - MaskRay - kzhuravl - • espindola 
- Commits
- rG528057c19755: [AMDGPU] Support disassembly for AMDGPU kernel descriptors
 rG487a80531006: [AMDGPU] Support disassembly for AMDGPU kernel descriptors
 rGcacfb02d28a3: [AMDGPU] Support disassembly for AMDGPU kernel descriptors
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
The problem is that you're trying to do it all at once. There's no need to implement the full disassembly at once. You can do it in a series of patches, one after the other that build towards the primary goal. For example, there are many if (<X is some value>) type cases, which you could omit - just assume those are all false, in earlier versions, and add them in (with corresponding testing) in a later patch. Similarly, you could assume that all results of x & y return some specific value, e.g. 0, and just print that for now. Yes, that means you won't support everything from the point at which this patch lands, but it will make each individual patch easier to reason with. This fits much better with LLVM's preferred approach - please see https://llvm.org/docs/DeveloperPolicy.html#incremental-development.
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1264–1267 | I don't see a response to this suggestion. | |
| 1325 | next -> Next | |
| 1471 | /*IsLittleEndian =*/true -> /*IsLittleEndian=*/true (and same for AddressSize) as already asked for once... | |
| 1661 | Nit: missing full stop. | |
| 1663–1664 | Put this comment above the if it is referring to, like the exisitng bit of comment. | |
Oh, you could also put the additional behaviour behind a switch in llvm-objdump. That'll mean you won't break all the existing tests too. Once you've finished the development, you could either remove the switch entirely or just change its default (which would allow users to disable the behaviour if they want the less verbose version of the output).
- Updated to use Cursor. Now there isn't any neet to maintain CurrentIndex.
- Other small changes related to coding conventions.
The problem is that you're trying to do it all at once. There's no need to implement the full disassembly at once. You can do it in a series of patches, one after the other that build towards the primary goal. For example, there are many if (<X is some value>) type cases, which you could omit - just assume those are all false, in earlier versions, and add them in (with corresponding testing) in a later patch. Similarly, you could assume that all results of x & y return some specific value, e.g. 0, and just print that for now. Yes, that means you won't support everything from the point at which this patch lands, but it will make each individual patch easier to reason with. This fits much better with LLVM's preferred approach - please see https://llvm.org/docs/DeveloperPolicy.html#incremental-development.
Oh, you could also put the additional behaviour behind a switch in llvm-objdump. That'll mean you won't break all the existing tests too. Once you've finished the development, you could either remove the switch entirely or just change its default (which would allow users to disable the behaviour if they want the less verbose version of the output).
Thanks for the feedback! I'll keep these points in mind :)
Regarding this patch - I think that because it has been reviewed to quite some extent, it doesn't need to be broken into smaller patches now?
Following tests fail:
LLVM :: CodeGen/AMDGPU/call-encoding.ll
LLVM :: CodeGen/AMDGPU/s_code_end.ll
LLVM :: MC/AMDGPU/branch-comment.s
LLVM :: MC/AMDGPU/data.s
LLVM :: MC/AMDGPU/labels-branch-gfx9.s
LLVM :: MC/AMDGPU/labels-branch.s
LLVM :: MC/AMDGPU/offsetbug_once.s
LLVM :: MC/AMDGPU/offsetbug_one_and_one.s
LLVM :: MC/AMDGPU/offsetbug_twice.s
LLVM :: MC/AMDGPU/s_endpgm.s
LLVM :: Object/AMDGPU/objdump.s
LLVM :: tools/llvm-cov/ignore-filename-regex.test
LLVM :: tools/llvm-objdump/ELF/AMDGPU/source-lines.ll
These tests no longer fail.
Maybe I missed something from some of the other reviewers, but I haven't seen much in the way of commentary on the core functionality of the new code, so I wouldn't say that it has been reivewed to quite some extent. I've personally only skirted around style aspects. I also don't see any evidence of any testing of the new code.
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1476 | Bits or bytes? | |
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1476 | Bits. If there is a bit that is wrong in a particular chunk of bytes, we consider that the entire chunk of bytes is invalid. We then update the Size value. Further, we say that the first Size bytes in a symbol are invalid. Error handling in llvm-objdump will print these bytes using .byte directive. And then we fall back to decoding the remaining bytes in the symbol as instructions. | |
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1476 | We do this because most directives in the kernel descriptor affect single or a very few bits. | |
I'm curious if this is the intent of the document you linked, though? It says "The remaining inter-related work should be decomposed into unrelated sets of changes if possible.", but the disassembly of this directive is not decomposable into unrelated changes; if only part of the descriptor is disassembled then the whole .amdhsa_kernel block is invalid. I suppose in the interest of review one could break down the logical/atomic series of commits even further into inter-dependent patches that must be applied together? I think that causes confusion when doing git-bisect etc. so doesn't seem ideal. Or maybe the patches should be squashed back together after review but before they are committed?
I do think there needs to be more direct testing of the branches taken in the code. If decomposing the patch beyond the logical/atomic decomposition makes it easier to do that as an author and/or a reviewer then I am OK with it. In this case, though, a lot of the length in terms of SLOC is just repetition and a little redundancy, so breaking it up further would only hide that and make it harder to see where things are best removed or factored out.
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1227 | This doesn't seem right. This symbol is not accurate without careful attention from an assembly author, so it will surely be incorrect when used by a disassembler. We can accurately compute a VGPR count from the GRANULATED_WORKITEM_VGPR_COUNT, we just calculate e.g. (GRANULATED_WORKITEM_VGPR_COUNT + 1) * granularity where the + 1 is needed to account for the minimum allocation (i.e. a granulated count encoded as "0" actually indicates a 1 granule allocation) and where the scaling by granularity is device-specific. This will calculate the greatest value possible for .amdhsa_next_free_vgpr which will produce the same granulated count, but equally valid you could choose to calculate the minimum, or any value in between. All we care about is producing disassembly which results in the same descriptor. (Caveat: I would prove this to yourself by referencing https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-compute-pgm-rsrc1-gfx6-gfx10-table as I may have the actual calculation wrong; we just want to calculate the inverse of what the assembler does) Your point that we can't recover the exact value the original author intended to place in the assembly text is true, but that is OK as long as we can always give them a valid input to the assembler which gives them the same output. | |
| 1232 | This can be a char literal, i.e. '\n', same elsewhere. | |
| 1245 | For this and the above case we should have tests to prove this out. I.e. assemble sources to a binary, disassemble and reassemble it, and then compare the two binaries. Ideally we would do this for some edge cases around VGPR/SGPR allocation granularity. There may need to be some fixup between disassembly and reassembly to account for the remaining non-reassembleable bits produced by llvm-objdump, but they should be pretty minor for a trivial kernel, and I would expect you could handle them with just sed which seems to be available to LIT tests. | |
| 1254 | Just as with the VGPR count, we cannot use the symbol to define this, and we can compute a (non-unique) input which produces the same output. | |
| 1264 | This shift isn't necessary, you just need to check for the presence of any set bits. I also noticed when checking the types here that for some reason we declare the enum for these masks/shifts as int32_t, which makes one have to think about both integer promotion rules and then possibly which bitwise-operations are valid for signed integers. I think here the signed mask is promoted to unsigned, and you get what you want, but it may be good to go fix the definition of the masks separately. | |
| 1269 | I think a short-lived well-defined preprocessor macro could make the patch shorter and easier to read. For example: #define PRINT_TRIVIAL_FIELD(DIRECTIVE, MASK)  do { \
  KdStream << Indent << DIRECTIVE " " << ((FourByteBuffer & amdhsa:: ## MASK) >> amdhsa:: ## MASK ## _SHIFT) << '\n';
  } while (0);
PRINT_TRIVIAL_FIELD(".amdhsa_float_round_mode_32", COMPUTE_PGM_RSRC1_FLOAT_ROUND_MODE_32)
PRINT_TRIVIAL_FIELD(".amdhsa_float_round_mode_16_64", COMPUTE_PGM_RSRC1_FLOAT_ROUND_MODE_16_64)
// ...
#undef HANDLE_TRIVIAL_FIELDI think in part this is because the definition of the masks themselves is in terms of preprocessor macros. | |
| 1290 | Unnecessary shift | |
| 1301 | Unnecessary shift, same for all other cases below when checking that the bits under a mask are 0. | |
| 1345 | I don't know the general conventions here, but I don't think I have seen a comment for the end of a function elsewhere in LLVM. I do know that it is required for namespaces, so maybe it is permitted for long functions? | |
| 1387 | All of these comments seem redundant to me, especially when the condition is simplified to just: if (Buffer & MASK) return Fail; At the very least, repeating the actual bit indices here when they are a part of the mask definition seems verbose. | |
| 1472 | I don't understand the intent with carefully maintaining Size for the failure case. Aren't we certain by this point that this should be a kernel descriptor, and so the correct thing to do when we fail is to disassemble everything as .byte directives (i.e. set Size to the max value)? Why would we prefer to return a partial failure and have the disassembler start working on the remaining bytes as if they were instructions? That would also shorten the patch and make it obvious that the Size tracking is correct. | |
| 1484 | Rather than have these comments, which are still just filled with magic numbers, could we define these offsets more explicitly somewhere, as e.g. amdhsa::GROUP_SEGMENT_FIXED_SIZE_OFFSET? For example in llvm/include/llvm/Support/AMDHSAKernelDescriptor.h alongside the other definitions needed by the compiler? We could then also update the static_asserts there to use those definitions, so we aren't relying on inspection to know the same offsets are used everywhere. I.e. the following: 165 static_assert( 1 sizeof(kernel_descriptor_t) == 64, 2 "invalid size for kernel_descriptor_t"); 3 static_assert( 4 offsetof(kernel_descriptor_t, group_segment_fixed_size) == 0, 5 "invalid offset for group_segment_fixed_size"); 6 static_assert( 7 offsetof(kernel_descriptor_t, private_segment_fixed_size) == 4, 8 "invalid offset for private_segment_fixed_size"); 9 static_assert( 10 offsetof(kernel_descriptor_t, reserved0) == 8, 11 "invalid offset for reserved0"); ... Becomes: 165 static_assert( 1 sizeof(kernel_descriptor_t) == 64, 2 "invalid size for kernel_descriptor_t"); 3 static_assert( 4 offsetof(kernel_descriptor_t, group_segment_fixed_size) == GROUP_SEGMENT_FIXED_SIZE_OFFSET, 5 "invalid offset for group_segment_fixed_size"); 6 static_assert( 7 offsetof(kernel_descriptor_t, private_segment_fixed_size) == PRIVATE_SEGMENT_FIXED_SIZE_OFFSET, 8 "invalid offset for private_segment_fixed_size"); 9 static_assert( 10 offsetof(kernel_descriptor_t, reserved0) == RESERVED0_OFFSET, 11 "invalid offset for reserved0"); ... | |
| 1630 | The != 0 here is redundant. | |
| 1635 | Could you move the call to .drop_back(3) out into the calling function, so it appears next to the check for .endswith(StringRef(".kd"))? | |
| 1665 | I'm still not sure what we landed on for the semantics of SoftFail here? | |
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h | ||
| 83 | Could you either rename this to satisfy the linter, or else explicitly suppress the lint with a comment like: // NOLINTNEXTLINE(readability-identifier-naming) | |
| llvm/test/tools/llvm-objdump/ELF/AMDGPU/code-object-v3.ll | ||
| 49 | I think we need more tests to ensure: 
 | |
- Used macro to shorten printing directives
- Replaced "\n" with '\n'
- Removed unnecessary bit shifts for bits that must be 0
- Removed extra comments for bits that must be 0
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1665 | It should be Success / Fail based on what the bytes are for code object v2. But there's nothing we are 'doing' at the moment for v2, I returned SoftFail. | |
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1245 | Right now we can't really re-assemble in the lit-test. This needs to be tested 'informally' by: 
 | |
- Fixed code-object-v3 lit test failure introduced in the previous change to this patch.
- More changes as per inline comments.
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1472 | My initial take was to decode as byte directives to the point of failure indicated by Size. Then going back to the normal flow of disassembling as instructions. But I get what you want to say in this comment. Updated the code based on this comment. | |
| 1635 | Done. | |
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h | ||
| 83 | Done. | |
- Compute .amdhsa_next_free_vgpr based on inverse of what the assembler does to compute GRANULATED_WORKITEM_VGPR_COUNT.
- Some changes to accomodate differences between GFX9 and GFX10
- Updated test case for GFX10 as well
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1245 | Went this route to check whether re-assembled binaries match or not. Turns out that both binaries match, in size (overall size as well as size of sections) and also in terms of all the disassembled content. But a diff object1 object2 says that binary files differ. | |
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1245 | I'm not sure I follow what you are describing; my thought was to start with just an asm source file containing only the kernel descriptor directive in the default section, and compare the output of the following (with, e.g. diff, as you mention): 
 As a trivial example, diff doesn't find any difference for the following example: $ printf '.amdhsa_kernel my_kernel\n.amdhsa_next_free_vgpr 0\n.amdhsa_next_free_sgpr 0\n.end_amdhsa_kernel' >a.s $ release/bin/llvm-mc --triple=amdgcn-amd-amdhsa -mcpu=gfx908 -filetype=obj a.s >a.o $ diff a.o \ <(release/bin/llvm-objdump --triple=amdgcn-amd-amdhsa --mcpu=gfx908 --disassemble-symbols=my_kernel.kd a.o \ | tail -n +8 \ | release/bin/llvm-mc --triple=amdgcn-amd-amdhsa -mcpu=gfx908 -filetype=obj) I don't think you need to use FileCheck for these tests at all, you can just rely on ending the RUN pipeline with diff, which seems to be supported by lit. You can then just copy-paste the test and edit fields in the input to validate edge cases for things like the SGPR/VGPR allocation directives. I think more comprehensive testing, including for other sections and executables/DSOs, would be good eventually but for now we should at least have some tests that explicitly confirm the KD disassembly round-trips. | |
| 1345 | I would lean towards omitting these, especially with the functions becoming shorter. For example, decodeCOMPUTE_PGM_RSRC2() is now <50 lines long at the entire definition now fits on one screen for me. It seems like there are other examples of this in the codebase, though, so I'm OK with it for the longer functions. | |
| 1349 | Can you expand this comment a little and move it to a Doxygen comment for the function? | |
| 1624 | Need to handle the "default" case here: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1510:1: warning: control may reach end of non-void function [-Wreturn-type] | |
| 1630 | Fair enough, in a type-safe language it would be required anyway, so it seems reasonable. | |
| 1665 | If SoftFail isn't applicable I don't think we should return it, even if it is just because we haven't implemented something yet. It existing doesn't mean it needs to be used, I think it has a very narrow definition that doesn't apply here. Maybe just emit a diagnostic and return Fail so we get the "decode as .byte" behavior? What exactly happens now with the current patch as-is? | |
- Handled default statement to silence the warning.
- Expanded comments for decodeCOMPUTE_PGM_RSRC1 and decodeCOMPUTE_PGM_RSRC2.
- Removed extra comment at the end of functions.
- Changed SoftFail to Success for code object v2.
- Replaced the old test case with a small assembly file.
| llvm/test/tools/llvm-objdump/ELF/AMDGPU/kernel-descriptor.s | ||
|---|---|---|
| 1 ↗ | (On Diff #276063) | No need for the <. llvm-mc is quite capable of taking inputs on the command-line as positional arguments. | 
| 2 ↗ | (On Diff #276063) | This line is too long. Please break it up into individual lines: | 
| 5–8 ↗ | (On Diff #276063) | This test is also quite small. Does it actually cover every code path? | 
| 10 ↗ | (On Diff #276063) | Is this a FIXME/TODO? If so, please add "FIXME" or "TODO". | 
| llvm/test/tools/llvm-objdump/ELF/AMDGPU/kernel-descriptor.s | ||
|---|---|---|
| 2 ↗ | (On Diff #276063) | Done. | 
| 5–8 ↗ | (On Diff #276063) | These values must be always specified by the user. Values of some bytes are computed using the values passed here. Consequently there are some cases where we need to test for getting the exact same bytes in the re-assembled binary, even if the disassembled values slightly deviate from the original values. Other bytes/bits hold the exact values specified using the assembler directives. Also they take default values if nothing is specified by the user. So I guess we don't need to test that for re-assembly. | 
| 10 ↗ | (On Diff #276063) | Done. | 
| llvm/test/tools/llvm-objdump/ELF/AMDGPU/kernel-descriptor.s | ||
|---|---|---|
| 1 ↗ | (On Diff #276063) | Done. | 
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1243–1250 | I think this can just be replaced with: NextFreeVGPR = (GranulatedWorkitemVGPRCount + 1) * getVGPRAllocGranule(STI); Or we could add another function called getNumVGPRs(const MCSubtargetInfo *STI, unsigned NumVGPRBlocks, Optional<bool> EnableWavefrontSize32) and put the definition directly next to getNumVGPRBlocks(const MCSubtargetInfo *STI, unsigned NumVGPRs, Optional<bool> EnableWavefrontSize32) so any future changes affecting one also affect the other. I would lean towards this, and documenting that they are the inverse of one another. | |
| 1278–1285 | Same as above, I think a new getNumSGPRs to complement getNumSGPRBlocks would make this easier to read. For the GFX10 case we could either leave the check here, or have the new function return Optional to indicate when there is an error. | |
| 1290 | What is "GS" and why is this commented out? | |
| 1484 | I would still like to see this done, the magic numbers here could lead to a few problems down the line, and presently they just make the code harder to read. | |
| 1572–1576 | Can this just be return decodeKernelDescriptor(...);? | |
| llvm/test/tools/llvm-objdump/ELF/AMDGPU/kernel-descriptor.s | ||
| 6 ↗ | (On Diff #279490) | Can you implement more tests? I don't know if it is feasible to include them all in the same .s file, as you have to work around the output not being re-assembleable in general, but even just copy-pasting and editing is fine with me. Just having some more test cases to cover at least a reasonable sample of the different branches, failure modes, etc. | 
- Got rid of raw number and added an enum for offsets.
- Added new tests.
- Updated the nop-data.ll test so that it passes with this patch.
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1243–1250 | I think this route is good for now : Similarly for SGPRs. I'd like to add the new functions via a separate patch. This patch is already quite big in terms of size. | |
| 1278–1285 | As I mentioned in my other reply, this patch is already quite big. So it'd be better to have a separate patch for the new functions. | |
| 1290 | This was was for printing Granulated wave front SGPR count. | |
| llvm/test/tools/llvm-objdump/ELF/AMDGPU/kernel-descriptor.s | ||
| 6 ↗ | (On Diff #279490) | I have added new tests in separate files. | 
I'm probably not in a position to review the majority of this further. However, I do have big reservations about the testing - there are a high number of possible code paths, but I only see 5 tests, which clearly can't cover all these code paths.
| llvm/test/tools/llvm-objdump/ELF/AMDGPU/kernel-descriptor-1.s | ||
|---|---|---|
| 1 ↗ | (On Diff #281180) | It would be helpful to add a small comment at the start of each of these tests explaining what the are specifically testing. | 
I think some of the difficulty in writing tests for the failure cases is not having a tool to produce them. I think one would have to generate the code object, and then hand-edit it in a hex-editor, copy it into the source tree (i.e. under Inputs/) maybe with some accompanying comment about the nature of the original source and the edits made?
It would be nice to at least cover some of the failure cases, even if they are more awkward to tests. There could also be some similar tests for e.g. GFX7,GFX10 and the edge cases around the {S,V}GPR calculation for each.
It seems to me that the format of the data structure is well-understood (otherwise you wouldn't be able to write code to disassemble it). In similar situations, e.g. the DWARF .debug_line parsing, we didn't rely on the built-in .file/.loc directives to generate our line table, and instead wrote it out by hand using .byte/.half/.long/.quad. It's not the prettiest of things of course, but it's better than code paths that never get exercised in tests. You could also use yaml2obj to do a generate the section with a raw hex blob, and/or write an array of bytes in a gtest unit test. The latter situation might be particularly useful because it would allow you to use the same "base" array, and modify individual bytes in individual test cases to check the behaviour for each.
- Updated tests and added a failure test.
- Checking error when cursor holds Error::success() while handling failure cases.
- Removed Size from subroutines' signatures as we aren't tracking the size.
Hi @rochauha,
I'm not sure this should have been pushed - nobody has reviewed your latest update, and I've had concerns about the testing prior to that, as stated. There is also a "Needs Revision" marker still outstanding, and I usually take that to mean that this shouldn't be pushed until the relevant reviewer is satisfied, regardless of what others have said (note that Phabricator highlighted that this patch landded in a "Needs Review" state).
Please could you revert, as there are potential assertion failures in my reading of this code, at least in the event you hit malformed input, in addition to the other issues raised.
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1220–1225 | This is not how to handle malformed input. This will result in an assertion in debug builds, which is equivalent to a crash, without any useful context to draw on, because you've not reported the error. More below. | |
| 1408 | Rather than all these checkError calls, I'd expect to see a check of the Cursor followed by a return of MCDisassembler::Fail to indicate there was a problem. | |
| 1542–1548 | If this loop terminates due to C being invalid, you don't want to fall out the bottom and return Success, I think. You'd want to check C after loop termination and return Fail. Alternatively, if you return Fail from the decodeKernelDescriptorDirective you can do cantFail(C.takeError()); after the loop. Ideally, we'd actually report the error from C in the event of failure, but currently there's no way of communicating that back up to the caller. | |
| llvm/test/tools/llvm-objdump/ELF/AMDGPU/kd-failure.s | ||
| 1 ↗ | (On Diff #286464) | Please add a comment to the top of this test explaining what this test is actually testing. | 
| 3 ↗ | (On Diff #286464) | Rather than writing text to another input file at run time in this way, you can use the new split-file tool to have all the input inline below, and split it up using the tool into multiple files. | 
| 11 ↗ | (On Diff #286464) | Please don't mix comment markers within the same file. You use '//' here, but ';' everywhere else. Additionally, new LLVM binutils tests tend to use double comment markers to indicate true comments as opposed to RUN/CHECK lines (i.e. ';;' in this context). | 
| 34–35 ↗ | (On Diff #286464) | Please remove the additional blank lines at the end of file. | 
| llvm/test/tools/llvm-objdump/ELF/AMDGPU/kd-sgpr.s | ||
| 1 ↗ | (On Diff #286464) | Please follow the comments from kd-failure.s in all the tests (where applicable) too. | 
Changes based on comments by @jhenderson
- Used llvm::cantFail(C.takeError()) to handle error.
- Removed blank lines at the end of test files.
- Streamlined beginning of comments.
- Used split-file tool to separate SGPR and VGPR test cases.
| llvm/test/tools/llvm-objdump/ELF/AMDGPU/kd-failure.s | ||
|---|---|---|
| 3 ↗ | (On Diff #286464) | In case of kd-failure.s, concatenation is necessary because the disassembled output will be .bytes, and the symbol information is needed to be get the same binary again. I feel that "printf'ing" to a file and concatenating with disassembled text is a simpler compared to split file and concatenating. However, I have used the split-file tool to separate the S/VGPR test cases. | 
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1542–1548 | Thanks for the pointer! I think cantFail seems to be all that is needed, because the cursor was holding Error::success in some cases, which needs to be 'checked' before moving further. Since the kernel descriptor is well defined, all cases where failure needs to be handled are handled using MCDisassembler::Fail. | |
@kzhuravl, do you have any additional comments to add at all, since you rejected the original patch?
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1394 | I think for self-documentation purposes, it would be helpful to assert that Bytes.size() == 64 here. I see that it is verified in the calling function but a) that's not obvious when looking at this function in isolation, and b) in the future, we don't want other places calling this code without that check, so the assert provides a backstop of sorts. | |
| 1542–1548 | Ah, I missed that the input can't be truncated, so C can't get into a failure state itself. Thanks! (See also my comment about an assert above). | |
| llvm/test/tools/llvm-objdump/ELF/AMDGPU/kd-failure.s | ||
| 2 ↗ | (On Diff #290241) | |
| llvm/test/tools/llvm-objdump/ELF/AMDGPU/kd-sgpr.s | ||
| 22 ↗ | (On Diff #290241) | Nit: missing full stop. | 
No more comments from me, but please give @kzhuravl a chance to respond before pushing this patch.
3 of the test cases - kd-sgpr.s, kd-vgpr.s, kd-zeroed-gfx10.s fail with the PowerPC buildbot (http://lab.llvm.org:8011/builders/clang-ppc64be-linux/builds/53608). From what I understand, these tests should only run if AMDGPU target is built. The lit.local.cfg file specifies that.
The PowerPC build bot just means a build bot that is running on a PowerPC host. It actually targets most targets. See the following excerpt from the CMake.
-- Targeting AArch64 -- Targeting AMDGPU -- Targeting ARM -- Targeting AVR -- Targeting BPF -- Targeting Hexagon -- Targeting Lanai -- Targeting Mips -- Targeting MSP430 -- Targeting NVPTX -- Targeting PowerPC -- Targeting RISCV -- Targeting Sparc -- Targeting SystemZ -- Targeting WebAssembly -- Targeting X86 -- Targeting XCore
I've run into this sort of problem before. The issue is almost certainly either a) incorrect assumption about host system endianness, meaning that you've incorrectly/inadvertently assumed the host is little endian, or b) assumed a 64-bit system somewhere. a) is almost certainly the issue here, based on both the test output and build bot name (ppc64*be*). I've skimmed the patch, but can't obviously see where the code is going wrong, but the test output for kd-zeroed-gfx10.s suggests it's around bytes 48-52. There may be other issues elsewhere though, since 0 renders the same regardless of size and endianness.
Yes, considering the failures appear to be exactly the same on PowerPC Big Endian and SystemZ (which is also Big Endian), I would assume that there is a host endianness assumption here (I haven't looked at the code). Please pull this patch until a fix is ready. There are at least the 4 bots that are red because of this for quite a few consecutive builds. Pull the patch so we can get the bots back to green and we can work on a fix afterwards.
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1395 | I think this is where little endian is being 'hardcoded'. However, since AMDGPU relocatable objects are meant to be little endian, I don't understand why they are big endian on pp64. | |
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1395 | I think this is actually a bug with the encoding code. Like you say, we should be host-endianness agnostic when encoding the kernel descriptor, but it seems we aren't. I think I vaguely remember this coming up when we implemented it, but I don't remember why we didn't do this to start. I think it is just an oversight. | |
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1395 | I think https://reviews.llvm.org/D88858 should be the fix, need to confirm if big-endian testers will run it automatically. | |
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1395 | I tried to determine if the pre-checkin builders include any big-endian archs, but gave up and just committed it. I'll keep an eye on the builders and see if it needs to be revered. After that you can proceed with this patch again. | |
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1395 | Thanks! | |
| llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp | ||
|---|---|---|
| 1395 | An update, I missed one test in the initial commit, but I followed up in bf5c1d92d92ef8cee2adbfa17ecca20a8f65dc0e and now the big-endian testers seem to be happy. I think you can try reapplying this. | |
Previous reversions were due to test cases failing on big endian hosts.
 This was due to multibyte values being laid out as-is from host memory.
 Now https://reviews.llvm.org/D88858 addresses the above, and has landed.
Use raw_string_ostream rather than std::stringstream.