This is an archive of the discontinued LLVM Phabricator instance.

[ADT] Simplifying hex string parsing so it runs faster in debug modes.
ClosedPublic

Authored by benvanik on Oct 19 2021, 4:14 PM.

Details

Summary

This expands the lookup table statically and avoids routing through methods that
contain asserts (like StringRef/std::string element accessors and drop_front)
such that performance is more predictable across compilation environments. This
was primarily driven by slow debug mode performance but has a large benefit in
release builds as well.

ssd_mobilenet_v2_face_float (42MB .mlir)
  Debug/MSVC (old):  5.22s
  Debug/MSVC (new):  0.16s
Release/MSVC (old):  0.81s
Release/MSVC (new):  0.02s

huggingface_minilm (536MB .mlir)
  Debug/MSVC (old): 65.31s
  Debug/MSVC (new):  2.03s
Release/MSVC (old):  9.93s
Release/MSVC (new):  0.27s

Now in debug the time is split evenly between lexString, tryGetFromHex, and
element attrs hashing; the next step in making it faster would be to combine
that work (incremental hashing during conversion, etc.). But this is at least in
the right order of magnitude and retains the original API surface.

I have not profiled a build with clang, but this is strictly less code with simpler
data structures, so I'd expect improvements there as well.

This also fixes a bug where 0xFF bytes in the input would read out of bounds.
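
For readers of this archive, a minimal sketch of the general shape of the change (illustrative only - the names, table layout, and validation split are assumptions rather than the literal patch):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

#include "llvm/ADT/StringRef.h"

// One entry per possible byte value (256), so even a 0xFF input byte indexes
// in bounds; -1 marks characters that are not hex digits.
static const int8_t HexDigitLUT[256] = {
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0x00
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0x10
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0x20
     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, -1, -1, -1, -1, -1, -1, // '0'-'9'
    -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 'A'-'F'
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0x50
    -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 'a'-'f'
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0x70
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0x80
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0x90
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0xA0
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0xB0
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0xC0
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0xD0
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0xE0
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 0xF0
};

// Decodes an even-length hex string with raw pointer indexing instead of
// asserting element accessors; returns false on any non-hex byte.
static bool decodeHex(llvm::StringRef Input, std::string &Output) {
  assert(Input.size() % 2 == 0 && "caller validates the length up front");
  Output.resize(Input.size() / 2);
  const uint8_t *Src = reinterpret_cast<const uint8_t *>(Input.data());
  char *Dst = Output.empty() ? nullptr : &Output[0];
  for (size_t I = 0, E = Input.size(); I != E; I += 2) {
    int8_t Hi = HexDigitLUT[Src[I]], Lo = HexDigitLUT[Src[I + 1]];
    if (Hi < 0 || Lo < 0)
      return false; // covers the previously out-of-bounds 0xFF case
    *Dst++ = static_cast<char>((Hi << 4) | Lo);
  }
  return true;
}
```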

Diff Detail

Event Timeline

benvanik created this revision.Oct 19 2021, 4:14 PM
benvanik requested review of this revision.Oct 19 2021, 4:14 PM
benvanik updated this revision to Diff 380835.Oct 19 2021, 6:04 PM

(w/ full context)

We don't usually optimize for debug performance - is there a particular case where this comes up that's worth prioritizing/changing code for (& is there a good heuristic we can use about which debug performance is going to matter and which isn't?)

llvm/include/llvm/ADT/StringExtras.h
71

Can we drop the array dimension here and rely on the array contents to be the right size?

72–87

Does this need to be written out explicitly? How much does this change the performance, why, and in what situations does it matter significantly?

I guess it's enabling some optimization opportunities in some way by making the constant values more obvious? Perhaps we could write the code in some other way that'd satisfy that?

I wonder what it'd be like if 'LUT' were a constexpr std::array returned from a constexpr function that built it?
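
For concreteness, a hypothetical sketch of what that alternative might look like (not code from this review; it assumes C++17, since std::array's non-const operator[] only became constexpr there and LLVM was still on C++14 at the time):

```cpp
#include <array>
#include <cstdint>

// Hypothetical builder (illustrative names, not from the patch): fills the
// 256-entry table inside a constexpr function so the data is computed, in
// principle, at compile time.
constexpr std::array<int8_t, 256> makeHexLUT() {
  std::array<int8_t, 256> LUT{};
  for (int I = 0; I < 256; ++I)
    LUT[I] = -1;
  for (int I = 0; I < 10; ++I)
    LUT['0' + I] = static_cast<int8_t>(I);
  for (int I = 0; I < 6; ++I) {
    LUT['A' + I] = static_cast<int8_t>(10 + I);
    LUT['a' + I] = static_cast<int8_t>(10 + I);
  }
  return LUT;
}

// static storage duration so the table is materialized once rather than
// rebuilt as a local on every call.
static constexpr std::array<int8_t, 256> HexLUT = makeHexLUT();
```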

llvm/unittests/ADT/StringExtrasTest.cpp
93

If this is purely a performance improvement - what's motivating this test change?

benvanik added a comment.EditedOct 21 2021, 9:04 AM

We don't usually optimize for debug performance - is there a particular case where this comes up that's worth prioritizing/changing code for (& is there a good heuristic we can use about which debug performance is going to matter and which isn't?)

This matters when parsing nearly any non-unit-test MLIR file, as constant values over a small size are encoded as hex strings - and in ML models the constants get quite large. The numbers included show that this dramatically improves release performance as well. But what made me look into this is that parsing these files in debug mode is a normal activity: usually one is using a crash reproducer to debug some intermediate stage of compilation where lots of data is encoded in the file. Waiting up to several minutes for a hex string to parse (note the above tests minilm; there are much larger examples) on a 4GHz machine in 2021 before being able to even step into the first line of code is not good :)

Agreed that minor adjustments for debug-only performance improvements aren't usually worth it, but in this case the code was near unusable in its previous form - orders of magnitude matter. For something as foundational as parsing hex strings - where this particular function sits in the critical path and is invoked potentially billions of times - there's little worth in being needlessly slow. One reason to be slow would be to gain additional protection or verification, but there was no meaningful additional verification being performed: the hidden asserts avoided here were already covered higher up by the input string length checks and asserts outside the hot inner loop. Sometimes a conceptual for loop indexing into an array should really just be a for loop indexing into an array :)

llvm/include/llvm/ADT/StringExtras.h
71

For lookup tables where the size of the array is meaningful it's useful to have the size specified - here the table must have exactly one entry per possible byte value, and being able to communicate that to readers and the compiler is important (the compiler rejects extra initializers, and the explicit bound makes a too-short table easy to spot). Having the size specified also made it easy for me to spot the bug here (before this change the table did not have enough elements). Omitting the array dimension and letting it be inferred from the contents is useful for dynamic lists that may be modified frequently; when the size must be a precise value it's usually harmful to elide it.

72–87

Yes, unfortunately it does. A constexpr local is not guaranteed to be expanded statically, and in fact it is not in MSVC (and may not be in other compilers either), in both release and debug builds. This meant that every single hex nibble lookup was reinitializing the entire lookup table. You can find a deeper discussion about this on Discord: https://discord.com/channels/636084430946959380/642426447167881246/900079712930496522
TLDR: https://cdn.discordapp.com/attachments/642426447167881246/900079709486997624/unknown.png

Compiler explorer: https://godbolt.org/z/Pb48Mz34K
Note that even clang in debug mode emits a memcpy of the lookup table to the stack for every single character lookup, so it also benefits here (that could be fixed in the old code by making the table constexpr static, but that doesn't help MSVC and doesn't address the assert overhead of the loop that calls this).

For background, see https://stackoverflow.com/questions/14248235/when-does-a-constexpr-function-get-evaluated-at-compile-time - basically, constexpr is not guaranteed to be evaluated at compile time, especially when used inside a non-constexpr function (as it is here). You only get that guarantee when the constexpr expression is used in a context that must be evaluated at compile time (such as another constexpr).
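
To make the distinction concrete, a tiny stand-in example (not the real hex table; the only difference is where the array lives):

```cpp
#include <cstdint>

// A plain constexpr local: the standard permits, but does not require,
// folding it away, and unoptimized MSVC/clang builds were observed
// re-materializing/copying such tables on every call.
unsigned lookupLocal(uint8_t C) {
  constexpr uint8_t LUT[4] = {0, 1, 2, 3};
  return LUT[C % 4];
}

// static storage duration: the table is emitted once into the binary, so no
// per-call initialization or stack copy is possible even in debug builds.
unsigned lookupStatic(uint8_t C) {
  static constexpr uint8_t LUT[4] = {0, 1, 2, 3};
  return LUT[C % 4];
}
```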

As for std::array, some STL implementations assert in element lookup and would have similarly suboptimal performance characteristics here: https://github.com/microsoft/STL/blob/d8f03cf399d730780b6ca0e5321a9ff4fc76bb0f/stl/inc/array#L568, https://github.com/gcc-mirror/gcc/blob/2606dfea12dbe9692c9d576155e63bf3acebd1bf/libstdc%2B%2B-v3/include/std/array#L217 (libc++ doesn't, but I consider that a bug, not a feature)

llvm/unittests/ADT/StringExtrasTest.cpp
93

I don't like undefined behavior and wanted to verify my new code handled this case correctly. Prior to this change the behavior was undefined if the input contained any byte with value 255 (0xFF), which I think we can all agree is not good regardless of whether most normal inputs include one.

(I'm curious why the question: is updating tests to improve coverage for fixed bugs something that needs extrinsic motivation? I did call out in the description that I fixed this bug as part of this refactoring)

benvanik added inline comments.Oct 21 2021, 9:46 AM
llvm/unittests/ADT/StringExtrasTest.cpp
93

(to clarify: this does not reduce test coverage - the test was only checking that invalid characters are correctly identified as invalid, and I changed the invalid character it tests to 0xFF)
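
For illustration, a minimal sketch of the kind of regression check being described (the actual test in the patch may differ):

```cpp
#include "llvm/ADT/StringExtras.h"
#include "gtest/gtest.h"

// A 0xFF byte used to index past the end of the lookup table; it should now
// simply be reported as an invalid hex character.
TEST(StringExtrasTest, FromHexRejects0xFFBytes) {
  std::string Output;
  EXPECT_FALSE(llvm::tryGetFromHex("\xff\xff", Output));
}
```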

dblaikie added inline comments.Oct 21 2021, 11:29 AM
llvm/include/llvm/ADT/StringExtras.h
72–87

I think I'd be in favor of addressing the clang perf issue with constexpr static and letting MSVC be what it is - performance of LLVM is generally tuned for LLVM-on-LLVM, and with clang-cl that's feasible even on an MSVC platform. It seems like this is in the hot path for MLIR in a particularly extreme way but isn't for an LLVM bootstrap, so not tuning for MSVC perf wouldn't be a huge impediment to bootstrapping and then using clang-cl?

218–243

Might be worth a comment explaining why this uses lower-level operations/raw string manipulation, so this optimization doesn't get deoptimized/cleaned up by someone else in the future.

221

I'd avoid const_cast and use &s[0] or similar - const_cast tends to require a second look to validate that it's safe/correct (which, I agree, it is in this case; it's working around the non-const std::string::data overload that was only added in C++17, which LLVM isn't using yet).
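
A tiny illustration of the two spellings (hypothetical snippet, not the patch diff):

```cpp
#include <string>

// Hypothetical helper showing the alternatives being compared.
void writeInto(std::string &Output) {
  Output.resize(1);
  // char *Dst = const_cast<char *>(Output.data()); // safe here, but readers
  //                                                // must stop and verify it
  char *Dst = &Output[0]; // equally valid pre-C++17, nothing to audit
  *Dst = 'x';
}
```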

llvm/unittests/ADT/StringExtrasTest.cpp
93

Sorry, I missed the mention of the bug fix in the commit - agreed that bug fixes should have tests, and agreed that it is a bug/fix.

benvanik added inline comments.Oct 21 2021, 6:30 PM
llvm/include/llvm/ADT/StringExtras.h
72–87

I agree that addressing the clang perf issue would be good, but as a user of MSVC who is severely affected by this I have to disagree that letting it be is a good idea - even if only as a point of pride in the engineering of what we can control. Keep in mind that we're talking about an interview-question-level for loop and a 256-element lookup table; that it's slow in any compiler of any era from any vendor is not something anyone should feel good about, including the owner of the code.

Though MLIR and LLVM share the ADT, they are not the same thing: an LLVM bias may make sense for the core tip-of-tree LLVM compiler tools themselves (and to be clear, I'm a big fan of clang-cl), but MLIR is deployed independently of LLVM and is used with many other toolchains (gcc/msvc/etc). Even if a user is on a Clang/LLVM toolchain, many production pipelines may not pick that up for years - and the asserts still remain and will always be slow anywhere NDEBUG is undefined. Forcing users to change their toolchain is not possible or practical in all cases (most Python wheel-building infra runs gcc on old CentOS distros, for example), nor is it appealing to tell them that the reason they'd need to change is non-technical politics. A bit of empathy for the people running this code would be appreciated :)

221

good call - will do!

stellaraccident accepted this revision.Oct 21 2021, 6:49 PM
stellaraccident added a subscriber: stellaraccident.

We don't usually optimize for debug performance - is there a particular case where this comes up that's worth prioritizing/changing code for (& is there a good heuristic we can use about which debug performance is going to matter and which isn't?)

Interested who the "we" is. Debug builds should be generally usable and this case was extreme, pushing things into unusable for very common programs.

I'm +1 on accepting patches that improve such gradients -- if the patch was important enough for a busy engineer to take the time to fix in order to improve QoL for all of us, I'll accept that - and the reported gradient is extreme. The code as written is not surprising to me for a foundational, hot item like this.

This revision is now accepted and ready to land.Oct 21 2021, 6:49 PM
dblaikie added inline comments.Oct 21 2021, 6:52 PM
llvm/include/llvm/ADT/StringExtras.h
72–87

not something anyone should feel good about, including the owner of the code.

This tone doesn't seem appropriate here. (suggesting that my feeling OK with this code is a character flaw)

nor is it appealing telling them that the reason they'd need to change is for non-technical politics.

Alternatively a missed opportunity in other toolchains - that could be improved there.

and the asserts still remain and will always be slow anywhere NDEBUG is undefined.

I'm only talking about this particular piece, the table building. The other piece related to assertions and using raw pointer ops rather than StringRef APIs I've got mixed feelings about, but generally seems acceptable to me.

dblaikie accepted this revision.Oct 21 2021, 7:00 PM

We don't usually optimize for debug performance - is there a particular case where this comes up that's worth prioritizing/changing code for (& is there a good heuristic we can use about which debug performance is going to matter and which isn't?)

Interested who the "we" is. Debug builds should be generally usable and this case was extreme, pushing things into unusable for very common programs.

My general understanding of the LLVM community - which is a fluid thing, to be sure. But insofar as any of us try to represent that amorphous set of values, that's what I'm trying to convey here.

(generally if someone else has ongoing/outstanding concerns over code, we also don't usually approve a patch in spite of that - or at least leave a "OK so long as <other person's> concerns are addressed" (I usually don't use "approve" in that case, I leave it to the other person to provide the final approval so as not to confuse things))

But I'll leave this to you folks - as much as a bunch of the substance and process here doesn't sit well with me.

benvanik added inline comments.Oct 21 2021, 7:05 PM
llvm/include/llvm/ADT/StringExtras.h
72–87

I apologize.

We don't usually optimize for debug performance - is there a particular case where this comes up that's worth prioritizing/changing code for (& is there a good heuristic we can use about which debug performance is going to matter and which isn't?)

Interested who the "we" is. Debug builds should be generally usable and this case was extreme, pushing things into unusable for very common programs.

My general understanding of the LLVM community - which is a fluid thing, to be sure. But insofar as any of us try to represent that amorphous set of values, that's what I'm trying to convey here.

(generally if someone else has ongoing/outstanding concerns over code, we also don't usually approve a patch in spite of that - or at least leave a "OK so long as <other person's> concerns are addressed" (I usually don't use "approve" in that case, I leave it to the other person to provide the final approval so as not to confuse things))

But I'll leave this to you folks - as much as a bunch of the substance and process here doesn't sit well with me.

I approved the patch because I believed it was worthy to approve. I feel like that is completely in bounds. Landing it without consensus would have been out of bounds - and the discussion was clearly still going on. I could have stated the intention more clearly, but it was not meant to bypass the discussion.

In any case, thanks for the review and sorry for confusion. The author doesn't have commit access, so I'll land this.

Landed. dblaikie: Thanks for the review. Will work on comms style in the future as it seems like there were a couple of areas for improvement here.


Ah, thanks for the context!