This is an archive of the discontinued LLVM Phabricator instance.

Use trigrams to speed up SpecialCaseList.
ClosedPublic

Authored by krasin on Nov 28 2016, 8:08 PM.

Details

Summary

it's often the case when the rules in the SpecialCaseList
are of the form hel.o*bar. That gives us a chance to build
trigram index to quickly discard 99% of inputs without
running a full regex. A similar idea was used in Google Code Search
as described in the blog post:
https://swtch.com/~rsc/regexp/regexp4.html

The check is defeated, if there's at least one regex
more complicated than that. In this case, all inputs
will go through the regex. That said, the real-world
rules are often simple or can be simplied. That considerably
speeds up compiling Chromium with CFI and UBSan.

As measured on Chromium's content_message_generator.cc:

before, CFI: 44 s
after, CFI: 23 s
after, CFI, no blacklist: 23 s (~1% slower, but 3 runs were unable to show the difference)
after, regular compilation to bitcode: 23 s

Event Timeline

krasin updated this revision to Diff 79506.Nov 28 2016, 8:08 PM
krasin retitled this revision from to Use trigrams to speed up SpecialCaseList..
krasin updated this object.
krasin updated this revision to Diff 79626.Nov 29 2016, 1:10 PM

Add test cases for trigrams into SpecialCaseList unit test.

krasin updated this revision to Diff 79639.Nov 29 2016, 2:00 PM

Simplify code; ensure build() is called.

krasin updated this revision to Diff 79678.Nov 29 2016, 5:03 PM

Remove debug leftovers.

krasin updated this revision to Diff 79684.Nov 29 2016, 6:20 PM

Get rid of build()

krasin updated this revision to Diff 79702.Nov 29 2016, 8:46 PM

Speed improvements.

krasin added a reviewer: pcc.Nov 29 2016, 8:47 PM
krasin updated this object.Nov 29 2016, 8:51 PM
pcc edited edge metadata.Nov 29 2016, 10:54 PM

Nice!

Do you have before/after performance numbers for a large Chromium TU? Maybe also measure without the blacklist for a baseline.

lib/Support/Regex.cpp
199

Does this belong in Regex? It feels more like an implementation detail of TrigramIndex.

lib/Support/SpecialCaseList.cpp
33

Maybe use a SmallVector instead of std::vector since the size is bounded?

This code also needs more comments. Please at least leave comments for TrigramIndex and its fields.

46

Maybe fold this into the use? I found it a little confusing to use a variable named "I" for something other than a loop counter.

49

Maybe use a range for loop here?

66

I wonder how important it is to test Was here. It looks like things would still work if you allow trigrams to have multiple identical indices, and as you mention in the comment above it isn't worth worrying about popular trigrams.

81

Remove braces

krasin updated this object.Nov 29 2016, 11:21 PM
krasin edited edge metadata.

Peter, thank you for your comments. I have added some benchmarks to the description and will address the rest of your points tomorrow.

krasin updated this object.Nov 30 2016, 11:33 AM
krasin updated this revision to Diff 79794.Nov 30 2016, 12:17 PM
krasin marked 2 inline comments as done.

Address the comments.

Please, take another look.

lib/Support/Regex.cpp
199

It belongs to Regex, because it uses RegexAdvancedMetachars constant that is very similar to RegexMetachars. If one is updated, another must be updated as well.

Also, isLiteralERE and isSimpleWildcard are very similar in meaning. If one lives in Regex, why not another?

lib/Support/SpecialCaseList.cpp
33

SmallVector: done
Comments to TrigramIndex and its fields: done.

46

Inlined and added a comment.

66

Actually, pretty important. Consider a test case like the following:

blablablablabla
blabla

If we don't use Was, the index will look the following:

"bla" -> [0, 0, 0, 0]
"lab" -> [0, 0, 0, 0]
"abl" -> [0, 0, 0, 0]

and counts: [13, 0]. One of the counts is zero => check defeated => all queries go through a regex => SLOW.

In the current scheme the index will be:
"bla" -> [0, 1]
"lab" -> [0, 1]
"abl" -> [0, 1]
and counts: [13, 4]. That gives a way to filter out all queries without these trigrams.

A second issue is that when we count trigrams at eval time, duplicates will increase per-rule counts (while we still have non-inflated limits) => regexps are called much more often => SLOW

Unfortunately, while I have a test case like that, it does not really have a way to know if the check is defeated or not. I only test for correctness. It means that speed regressions like I have explained above are still possible without breaking any tests.

And I will need a separate header for the TrigramIndex to make a more granular unit test possible. I would be willing to extract it, if you think that is a good idea (as usual, I don't feel the conventions in the LLVM projects).

pcc added inline comments.Nov 30 2016, 1:41 PM
lib/Support/Regex.cpp
199

I doubt that we'd actually be changing the definition of what an ERE is, the specification comes from POSIX and people are relying on the current semantics. On the other hand, the set of metacharacters for isSimpleWildcard is a property of how powerful TrigramIndex is, and I'd certainly expect it to change if we made TrigramIndex more powerful.

Besides, only SpecialCaseList mangles * into .*, which would seem to be critical to how we define this character list (i.e. you wouldn't be able to extract the trigram "abc" from the pattern "abc*" without this mangling).

lib/Support/SpecialCaseList.cpp
30

"a heuristic"

51

This should specify that these are indices into Counts.

66

Okay, I understand why this is necessary now; I hadn't had the insight that duplicate indices would effectively square the per-rule count.

As for your first point: it would seem that the index would be defeated under the current implementation with something like this:

blabla
blablabla
blablablabla
blablablablabla
blablablablablabla

or perhaps more realistically, something like:

class1.h
class2.h
class3.h
class4.h
class.h

It isn't clear how likely this would be in practice, though, and I suspect that you could mitigate against the more realistic scenarios by sorting the patterns by length beforehand.

Regarding testing: yes, we generally tend to extract algorithmically tricky classes and write unit tests for them (see the tests for CFI and devirtualization internals in llvm/unittests/Transforms/IPO).

krasin updated this revision to Diff 79839.Nov 30 2016, 4:39 PM

Extract TrigramIndex{.h,.cpp,Test.cpp}.

krasin marked an inline comment as done.Nov 30 2016, 4:49 PM

Please, take another look.

lib/Support/Regex.cpp
199

On the other hand, the set of metacharacters for isSimpleWildcard is a property of how powerful TrigramIndex is, and I'd certainly expect it to change if we made TrigramIndex more powerful.

This is a good argument. I buy it. Done.

lib/Support/SpecialCaseList.cpp
51

Sorry, I have lost the context here after moving the code around. Probably addressed, but I am not sure.

66

Actually, in

class1.h
class2.h
class3.h
class4.h

we have trigrams ss1, ss2, ss3, ss4 which will help us to keep it working.

Realistically, the current implementation works for CFI blacklist and almost works for UBSanVptr blacklist in Chromium. The latter fails due to using \+ in the regex, and I hope to address escaping in the next CL.

krasin added inline comments.Nov 30 2016, 4:51 PM
lib/Support/SpecialCaseList.cpp
66

By "fails" I mean that the corresponding TrigramIndex is defeated. The blacklist still works correctly, just slower.

pcc added inline comments.Nov 30 2016, 4:54 PM
lib/Support/SpecialCaseList.cpp
66

Unless I'm mistaken, after adding class[1-4].h, the trigrams would look like this:

"cla" -> [0, 1, 2, 3]
"las" -> [0, 1, 2, 3]
"ass" -> [0, 1, 2, 3]
"ss1" -> [0]
"ss2" -> [1]
"ss3" -> [2]
"ss4" -> [3]

When we add "class.h", we will try to add only the trigrams "cla", "las" and "ass", all of which would fail due to exceeding the bound, no?

krasin added inline comments.Nov 30 2016, 4:57 PM
lib/Support/SpecialCaseList.cpp
66

This is correct, and regexp will be executed for the query. But the check will still correctly predict that we don't need to do that for queries like "foo" or "bar". So, while the check is not exact, it's not defeated either.

krasin added inline comments.Nov 30 2016, 4:59 PM
lib/Support/SpecialCaseList.cpp
66

actually, not. The check will not execute query for class.h. I will add a test case to demonstrate that.

krasin updated this revision to Diff 79843.Nov 30 2016, 5:04 PM

Add one more test case.

pcc added a comment.Nov 30 2016, 5:04 PM

Maybe I'm not making myself clear. This test doesn't pass on my machine.

TEST_F(TrigramIndexTest, foo) {
  std::unique_ptr<TrigramIndex> TI = makeTrigramIndex(
      {"class1.h", "class2.h", "class3.h", "class4.h", "class.h"});
  EXPECT_FALSE(TI->isDefeated());
}
krasin added inline comments.Nov 30 2016, 5:07 PM
lib/Support/SpecialCaseList.cpp
66

What happens during class.h matching is the following: it considers "cla", "las", "ass" and makes counts for all 4 rules equal to 3. But they need to be at least 4 to trigger a check.

An example of a query that will go through is claclaass.h. In this case, the count will become 4, and isDefinitelyOut will return false.

In D27188#610048, @pcc wrote:

Maybe I'm not making myself clear. This test doesn't pass on my machine.

TEST_F(TrigramIndexTest, foo) {
  std::unique_ptr<TrigramIndex> TI = makeTrigramIndex(
      {"class1.h", "class2.h", "class3.h", "class4.h", "class.h"});
  EXPECT_FALSE(TI->isDefeated());
}

Ah, yes, I incorrectly read your example. I thought class.h was a query, not a rule. Yes, that's correct, and I will add this test case as well (although, I have a similar one already).

krasin updated this revision to Diff 79845.Nov 30 2016, 5:10 PM

Add a test case proposed by Peter.

pcc accepted this revision.Nov 30 2016, 5:59 PM
pcc edited edge metadata.

LGTM with nits

include/llvm/Support/TrigramIndex.h
1 ↗(On Diff #79845)

80 cols

30–35 ↗(On Diff #79845)

LLVM headers come before system headers

42 ↗(On Diff #79845)

There's no need for a constructor on this class; you can let Counts be default initialized and use in-class initializers for the other members.

lib/Support/CMakeLists.txt
97–98 ↗(On Diff #79845)

Sort alphabetically

lib/Support/SpecialCaseList.cpp
51

This pertained to the comment "// Index holds a list of rules indices for each trigram.". It doesn't look like you've changed this. Note that you can click the double left arrow in the top left to see comments in their original context.

66

To be clear, I think it's fine if the implementation isn't perfect; this is certainly a strict improvement over the status quo. I just wanted to point out one somewhat realistic case where this might fail.

lib/Support/TrigramIndex.cpp
1–15 ↗(On Diff #79845)

This comment needs to be updated.

17–22 ↗(On Diff #79845)

LLVM headers come before system headers.

24 ↗(On Diff #79845)

We normally write using namespace llvm; instead of enclosing the entire file in a namespace.

unittests/Support/TrigramIndexTest.cpp
1 ↗(On Diff #79845)

80 cols

10–15 ↗(On Diff #79845)

LLVM headers come before system headers

This revision is now accepted and ready to land.Nov 30 2016, 5:59 PM
krasin updated this revision to Diff 79851.Nov 30 2016, 6:31 PM
krasin edited edge metadata.

Address final nits. Also sync.

krasin marked 13 inline comments as done.Nov 30 2016, 6:33 PM
pcc added a comment.Nov 30 2016, 6:39 PM

Still LGTM

include/llvm/Support/TrigramIndex.h
65–66 ↗(On Diff #79851)

Does std::unordered_map<unsigned, SmallVector<size_t, 4>> Index{256}; work?

lib/Support/SpecialCaseList.cpp
192

auto &Entry

krasin updated this revision to Diff 79853.Nov 30 2016, 6:47 PM

Shorter in-class initialization (thx Peter!)

krasin updated this revision to Diff 79854.Nov 30 2016, 6:49 PM
krasin marked an inline comment as done.

Format nits.

krasin marked an inline comment as done.Nov 30 2016, 6:50 PM
krasin added inline comments.
include/llvm/Support/TrigramIndex.h
65–66 ↗(On Diff #79851)

Yes, it does. Thank you!

krasin closed this revision.Nov 30 2016, 7:04 PM