Page MenuHomePhabricator

Refactor Suffix Tree from MachineOutliner to LibSupport
ClosedPublic

Authored by AndrewLitteken on May 26 2020, 1:17 PM.

Details

Summary

For future projects that use the Suffix Tree for finding substrings, including a potential LLVM IR Outliner, this patch is a refactor that moves the Suffix Tree data structure from code only used by the Machine Outliner to Support.

Unit tests have also been added for the Suffix Tree.

Diff Detail

Event Timeline

AndrewLitteken created this revision.May 26 2020, 1:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 26 2020, 1:17 PM

Are there any changes that need to be called out beyond moving the code around?

llvm/include/llvm/Support/SuffixTree.h
18

LLVM headers shouldn't using namespace llvm;

Can you make the testcases more explicit?

"This tests that the suffix tree finds (stuff)"

"This tests that the suffix tree does not find (stuff)"

llvm/unittests/Support/SuffixTreeTest.cpp
20

Comment explaining what this testcase does?

28

static_cast for casting?

33–36

It would be better if these comments described exactly what is being tested instead of "what it is currently able to do".

Also, it would be good to call out explicitly what should not be found in this comment.

44

Use reference?

for (const SuffixTree::RepeatedSubstring &RS : SubStrings)  {
 ...
45

Can we check the specific substrings which were found?

49–52

Same here, can this comment describe what the test below is actually testing?

65

Comment for below testcase?

Are there any changes that need to be called out beyond moving the code around?

No there are not, only the MachineOutliner previously depended on this code, and there were no major changes that needed to occur to adapt it.

Fixing a few clang-format issues

Adding more descriptive comments to the tests and as well as better descriptions for how the test may change in the future.
Tests now also check the contents of the found substrings.

efriedma added inline comments.May 26 2020, 6:08 PM
llvm/include/llvm/Support/SuffixTree.h
18

Also, I guess I didn't mention it, but classes in LLVM headers should be defined in the "llvm" namespace.

Adding namespace and removing references from unsigned variables in for loops.

Some more testcase nits.

llvm/unittests/Support/SuffixTreeTest.cpp
27–29

remove braces

27–29

Style nit: remove braces around for?

30

static_cast?

30–32

Replace C-style casts with static_casts? (If anything, it's just easier to search for static_cast than a C-style cast)

Actually, now that I think about it, why not just use, 1u?

41

Put this as a FIXME, so people can find it when they grep for FIXMEs?

41–44

I would put this portion of the comment in a FIXME:

// Right now .... smaller.

Then if anyone runs into this, they can find it documented somewhere easily. Also, people looking for something to work on can find it easily.

53–61

In this case, we should probably only find repeats of length 2 and 3, correct?

Can we check that as well? We should never get a repeat of length, say, 4, or 0, or, whatever.

E.g, something like this?

unsigned Len;
for (SuffixTree::RepeatedSubstring &RS : SubStrings) {
  // The only strings we expect are {2, 3} and {1, 2, 3}.
  // Check that the length of the string is something we expect before checking its contents.
  Len = RS.Length;
  bool IsExpectedLen = (Len == 3 || Len == 2);
  ASSERT_TRUE(IsExpectedLen);

  // Check the contents of the string.
  if (Len == 3) {
   ...
  } else {
   ...
  }
}
54

Test that the start indices are 0 and 4?

66

What happens if it's not 3 or 2? Should probably fail the test.

73

Maybe a FIXME or TODO would be good here?

// FIXME: Add support for detecting {1, 1} and {1, 1, 1}
84

Test that the start index is 0 or 1?

85–87

Might be able to use llvm::all_of here?

Then you'd be able to write this like:

ASSERT_TRUE(all_of(std::make_range(...), [ ](...){ return Elt == 1; }));
92

Most people don't know what a tandem repeat is.

Maybe something like this?

// The suffix tree cannot currently find repeated substrings in strings of the form
// {1, 2, 3, 1, 2, 3}, because the two {1, 2, 3}s are adjacent ("tandem repeats")
//
// FIXME: Teach the SuffixTree to recognize these cases.
117–119

Remove braces?

Or, even better than using a for loop, try using llvm::all_of?

EXPECT_TRUE(all_of(std::make_range(...), [](...){ return Elt == 1; }))
This revision is now accepted and ready to land.Jun 8 2020, 10:16 AM
This revision was automatically updated to reflect the committed changes.