This is an archive of the discontinued LLVM Phabricator instance.

[clang]Fix length threshold for MicrosoftMangle md5 hash
ClosedPublic

Authored by mibintc on Nov 3 2020, 1:45 PM.

Details

Summary

Fix off-by-one bug in determining the threshold for when want to shorten very large external names, Microsoft compatibility.
This fixes the bug reported here, https://bugs.llvm.org/show_bug.cgi?id=38868

Diff Detail

Event Timeline

mibintc created this revision.Nov 3 2020, 1:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 3 2020, 1:45 PM
mibintc requested review of this revision.Nov 3 2020, 1:45 PM
mibintc updated this revision to Diff 302902.Nov 4 2020, 10:38 AM

Add lit test case, based on test case in bug report

Test case?

Thanks, done

Since the same code is used to mangle all these things, probably just test one of them?

Could use macros to stamp out longer names without having to write them out manually?

Since the same code is used to mangle all these things, probably just test one of them?

Could use macros to stamp out longer names without having to write them out manually?

Not sure what technique to use to stamp out longer names, I tried using token pasting but that doesn't work because the tokens aren't macro expanded before pasting? I like the test source I added because it came from a program seen in the wild, and it demonstrates that reasonable length identifiers can generate enormous mangled names.
#define X50 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
#define X45 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
#define X100 X50 ## X50 // X100 is X50X50

rnk accepted this revision.Nov 5 2020, 10:41 AM
rnk added a subscriber: rnk.

lgtm

I think the test case looks good as is, FWIW. Token pasting might make it more readable, but it's improving the existing test, and not necessarily in scope for this patch.

This revision is now accepted and ready to land.Nov 5 2020, 10:41 AM
dblaikie added a comment.EditedNov 5 2020, 11:30 AM

Since the same code is used to mangle all these things, probably just test one of them?

Could use macros to stamp out longer names without having to write them out manually?

Not sure what technique to use to stamp out longer names, I tried using token pasting but that doesn't work because the tokens aren't macro expanded before pasting?
#define X50 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
#define X45 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
#define X100 X50 ## X50 // X100 is X50X50

Yeah, there's a quirk of the preprocessor that means you need an extra level of indirection, like this:

#define C(X) X ## X
// X2: X * 2
#define X2(X) C(X)
// X8: X * 2^3 (8)
#define X8(X) X2(X2(X2(X)))
// X4096: X * 8^4 (4096)
#define X4096(X) X8(X8(X8(X8(X))))
void X4096(x)() {
}

https://godbolt.org/z/r6Evae

I like the test source I added because it came from a program seen in the wild, and it demonstrates that reasonable length identifiers can generate enormous mangled names.

I think using a very plain name (like "lots of 'x'") helps clarify the test case is only about length - not about any other property of the name. A comment describing how templates (& especially template default parameters) can lead to long names could be good, though.

In D90714#2376667, @rnk wrote:

lgtm

I think the test case looks good as is, FWIW. Token pasting might make it more readable, but it's improving the existing test, and not necessarily in scope for this patch.

Oh, sorry, didn't see that the big long contiguous identifiers were existing testing.

I'm even more inclined towards suggesting the new testing is not the way I'd prefer to go - especially in comparison to the existing testing. The new testing seems to really obscure that the length of the identifier is the important property, by adding a bunch of other features (templates, default template arguments, etc, etc) to the test that don't impact the logic under test - certainly for me that makes it quite a bit less clear what the goal of the test is, and so how to read and maintain it in the future.

(but yeah, sorry, didn't mean to suggest the token pasting cleanup as part of this test - I'd misread the delta - but if token pasting makes it easier to add a new test that could be consistent with (after cleaning up the old test - before or after this commit) other tests here, it seems relevant in that regard)

I'm sorry, I don't see how to build up an identifier with an arbitrary number of characters using the token pasting, for example an ident with 4095 characters is not a power of 2. I tried pasting together X2048 X1024 X512 etc but that doesn't work
fu.cpp:12:18: error: pasting ")" and "X16" does not give a valid preprocessing token
#define FUN X32(x) ## X16(x)

^

fu.cpp:13:1: note: in expansion of macro ‘FUN’
I want the test case to show the transition between when md5 is not needed and when it's needed, i want to build strings that might not be power of root string. I could add more comments to the test case, perhaps, "the strlen of the Microsoft mangled name is X" for the one case, and "the strlen of ... is X+1" for the other.

I'm sorry, I don't see how to build up an identifier with an arbitrary number of characters using the token pasting, for example an ident with 4095 characters is not a power of 2. I tried pasting together X2048 X1024 X512 etc but that doesn't work

fu.cpp:12:18: error: pasting ")" and "X16" does not give a valid preprocessing token
 #define FUN X32(x) ## X16(x)
                  ^
fu.cpp:13:1: note: in expansion of macro ‘FUN’

Ah, yeah, that's the awkward thing where you have to indirect macro expansions before concatenation (some discussion here, though couldn't find the perfect reference/discussion on the issue).

Essentially you've got to add an extra level of macro indirection to be able to concatenate like that, eg:

#define FUN_HELPER(x, y) x ## y
#define FUN FUN_HELPER(X32(x), X16(x))

I want the test case to show the transition between when md5 is not needed and when it's needed, i want to build strings that might not be power of root string. I could add more comments to the test case, perhaps, "the strlen of the Microsoft mangled name is X" for the one case, and "the strlen of ... is X+1" for the other.

They're really non-obvious that those particular mangled names produce names that are exactly a certain length, even with a comment that seems pretty subtle compared to a more direct construction.

Let's see how to build a 4095 length function name... bit more code, but I think fairly legible: https://godbolt.org/z/Tbv6zz

#define STR_(X) #X
#define STR(X) STR_(X)

#define C_(P1, P2) P1##P2
#define C2(P1, P2) C_(P1, P2)
#define C4(P1, P2, P3, P4) C2(C2(P1, P2), C2(P3, P4))

#define X2(X) C2(X, X)
#define X4(X) X2(X2(X))
#define X8(X) X2(X4(X))
#define X16(X) X2(X8(X))
#define X32(X) X2(X16(X))
#define X64(X) X2(X32(X))
#define X128(X) X2(X64(X))
#define X256(X) X2(X128(X))
#define X512(X) X2(X256(X))
#define X1024(X) X2(X512(X))
#define X2048(X) X2(X1024(X))
#define X4096(X) X2(X2048(X))

#define X4095(X)                              \
  C2(C2(                                      \
    C4(X,       X2(X),   X4(X),    X8(X)),    \
    C4(X16(X),  X32(X),  X64(X),   X128(X))), \
    C4(X256(X), X512(X), X1024(X), X2048(X)))

void X4096(a)() {}

extern int a[4096];
int a[sizeof(STR(X4096(a))) - 1];

void X4095(b)() {}

extern int b[4095];
int b[sizeof(STR(X4095(b))) - 1];
mibintc updated this revision to Diff 303477.Nov 6 2020, 9:00 AM

Fixed test case according to @dblaikie 's suggestion and input, thanks a lot David, the test case is certainly a lot easier to read and understand now. Anything else?

dblaikie accepted this revision.Nov 6 2020, 10:47 AM

Generally looks good, thanks!

The inline comment trying to understand the 4095 - 4088 == 7 math I think should be answered (possibly in the form of the test/CHECK rephrasing I mentioned to clarify what the extra 7 characters are, and maybe some more detail in the comment too, if needed) - the rest of the suggestions are minor/optional.
It might also be good to split up the patch and commit the new macros/cleanup of the existing tests in one patch, then the fix+new tests in another. This way if there's a problem with one of the patches it'd be easier to identify/separate from the other.

clang/test/CodeGenCXX/mangle-ms-md5.cpp
28–30

After I wrote a few of these and realized the test needs different names (so parameterizing the macro on the character to duplicate helps that), but then the X became ambiguous/confusing with the literal 'x' in the names. To avoid/reduce that confusion it might make sense to use some other letter (I had used X for the macro then a, b, c, etc for the real identifiers - but maybe that's still a bit confusing because "X" does tend to get used in x, y, z example names too, even if this test file was changed to avoid that) - R for Repeated?

Then probably the Z4089 and Z4089 could be removed (since they're only defined on one line and used once immediately after) and the macros used directly like on line 26 above. Though I can appreciate Y4095 being kept because it's used several times (I could go either way on that one)

29

This one's unused & could be removed, I think?

31

Might be worth some more words to describe what the 7 other characters are? (to help explain the "magic" number 4088)

Oh, one way might be to change the CHECK-DAG on 63 to verify the full name. You could use a regex to match all the zs compactly but precisely (eg: @"{{\?z{4088}@@3HA}}")*, rather than what I guess is a sample of 4 zs currently? That way it's clear what the other 7 characters are? (though I only count 6 other characters in my small examples - so I'd expect 4088 z plus those six to reach 4094, not 4095?)

  • might have to muck with the regex syntax a smidge - I always forget which things need escaping and which things don't
37–38

As on line 26, it's probably simple enough to skip the extra macro and write this as int X4088(z) = 1515;

(& as on line 26, probably skip the "= 1515" unless that adds some value (no pun intended) to the test? I'm guessing it's not important what value, if any, the variable is initialized with?)

42–46

Could you align the commas and trailing backslashes here and in the X4088 one?

you're right about the regex syntax. i think the 4088 is over the max repetition count and there's also a problem with balanced. not sure what i did wrong here. if you have a chance to look at this otherwise i'll toil over the weekend, i'm off for the day--cheers. here's what i'm trying now just to see if it would work. can i nest repetition counts?

63:20: error: invalid regex: braces not balanced
// CHECK-DAG: @"?{{\(z\){255}}}@@3HA" = dso_local global i32 1515, align 4

i also tried it without the parens around z or with parens around z but not backslash.

you're right about the regex syntax. i think the 4088 is over the max repetition count and there's also a problem with balanced. not sure what i did wrong here. if you have a chance to look at this otherwise i'll toil over the weekend, i'm off for the day--cheers. here's what i'm trying now just to see if it would work. can i nest repetition counts?

63:20: error: invalid regex: braces not balanced
// CHECK-DAG: @"?{{\(z\){255}}}@@3HA" = dso_local global i32 1515, align 4

i also tried it without the parens around z or with parens around z but not backslash.

Ah, the mismatched braces is from the }}} at the end, I think the first two }} are ending the regex (& then the regex contains an open { without a close). That's why I included the @@3HA inside the regex match in my example/first guess.

Let's see what I can come up with. Ah, yeah, seems there's a limit of 255.

I guess the thing to do would be to just * map the 'z' 'knowing' it'll be the number of zs in the macro, while still demonstrating the extra characters that get up to the precise length the test is intended to exercise.

// CHECK-DAG: @"?{{z+}}@@3HA" = dso_local global i32 1515, align 4
mibintc updated this revision to Diff 303640.Nov 7 2020, 7:44 AM

I submitted the test improvements separately, i'm going to push this version, thanks all

This revision was landed with ongoing or failed builds.Nov 7 2020, 7:44 AM
This revision was automatically updated to reflect the committed changes.