This is an archive of the discontinued LLVM Phabricator instance.

Remove the user-count threshold when analyzing read attributes
ClosedPublic

Authored by wengxt on Jul 17 2015, 1:37 PM.

Download Raw Diff

Details

Reviewers

nicholas
nlewycky
jingyue

Commits

rGabb05aa3c697: Remove the user-count threshold when analyzing read attributes
rL243141: Remove the user-count threshold when analyzing read attributes

Summary

This threshold limited FunctionAttrs ability to prove arguments to be read-only.
In NVPTX, a specialized instruction ld.global.nc can be used to load memory
with non-coherent texture cache. We notice that in SHOC [1] benchmark, some
function arguments are not marked with readonly because FunctionAttrs reaches
a hardcoded threshold when analysis uses.

Removing this threshold won't cause significant regression in compilation time, because the worst-case time complexity of the algorithm is still O(# of instructions) for each parameter.

Patched by Xuetian Weng.

[1] https://github.com/vetter/shoc

Diff Detail

Event Timeline

wengxt updated this revision to Diff 30032.Jul 17 2015, 1:37 PM

wengxt retitled this revision from to Replace hardcoded threshold with an option when analyze read attributes.

wengxt updated this object.

wengxt added reviewers: nlewycky, jingyue.

wengxt added a subscriber: llvm-commits.

LGTM

This revision is now accepted and ready to land.Jul 17 2015, 2:09 PM

update test plan info

Xuetian Weng wrote:

wengxt created this revision.
wengxt added reviewers: nlewycky, jingyue.
wengxt added a subscriber: llvm-commits.

In NVPTX, a specialized instruction ld.global.nc can be used to load memory
with non-coherent texture cache. We notice that in SHOC [1] benchmark, some
function arguments are not marked with readonly because FunctionAttrs reaches
a hardcoded threshold when analysis uses.

This patch changes the hardcoded threshold into an option to enable us to
change the threshold at runtime.

Question one, what does this have to do with ld.global.nc? It looks like
you've got a load instruction (wrapped in an intrinsic) and it in turn
has many uses. The fact it's ld.global.nc doesn't seem to be relevant?

In theory, if you replaced the ld.global.nc with a LoadInst, would you
have the same problem? Or is there some other optimization that occurs
on those uses of a LoadInst that doesn't occur on the ld.global.nc?

Question two, can you think of a way to design determinePointerReadAttrs
so that it doesn't need a use count but doesn't have a bad worst case
performance problem? For that matter, why isn't this code already linear
in the number of instructions?

Nick

[1] https://github.com/vetter/shoc

http://reviews.llvm.org/D11311

Files:
lib/Transforms/IPO/FunctionAttrs.cpp
test/Transforms/FunctionAttrs/readattrs-threshold.ll
Index: test/Transforms/FunctionAttrs/readattrs-threshold.ll

/dev/null

+++ test/Transforms/FunctionAttrs/readattrs-threshold.ll
@@ -0,0 +1,13 @@
+; RUN: opt< %s -functionattrs -S | FileCheck %s --check-prefix=CHECK
+; RUN: opt< %s -functionattrs -readattrs-analysis-uses-threshold=1 -S | FileCheck %s --check-prefix=CHECK-NO-READATTR
+
+; %p has two uses
+; CHECK: define i32 @test(i32* nocapture readonly %p)
+; CHECK-NO-READATTR: define i32 @test(i32* nocapture %p)
+define i32 @test(i32* %p) {
+ %1 = load i32, i32* %p
+ %2 = getelementptr i32, i32* %p, i32 1
+ %3 = load i32, i32* %2
+ %4 = add i32 %1, %3
+ ret i32 %4
+}

Index: lib/Transforms/IPO/FunctionAttrs.cpp

lib/Transforms/IPO/FunctionAttrs.cpp

+++ lib/Transforms/IPO/FunctionAttrs.cpp
@@ -44,6 +44,12 @@
STATISTIC(NumNoAlias, "Number of function returns marked noalias");
STATISTIC(NumAnnotated, "Number of attributes added to library functions");
+static cl::opt<int> ReadAttrsAnalysisUsesThreshold(
+ "readattrs-analysis-uses-threshold", cl::Hidden, cl::init(20),
+ cl::ZeroOrMore,
+ cl::desc("Control the maximum number of uses to analyze for "
+ "readonly/readnone attribute (default = 20)"));
+
namespace {
  struct FunctionAttrs : public CallGraphSCCPass {
    static char ID; // Pass identification, replacement for typeid
@@ -425,7 +431,7 @@
// We don't need to track IsWritten. If A is written to, return immediately.

for (Use&U : A->uses()) {
if (Count++>= 20)

+ if (Count++>= ReadAttrsAnalysisUsesThreshold)
  return Attribute::None;

Visited.insert(&U);
llvm-commits mailing list
llvm-commits@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

Question one, what does this have to do with ld.global.nc? It looks like
you've got a load instruction (wrapped in an intrinsic) and it in turn
has many uses. The fact it's ld.global.nc doesn't seem to be relevant?

In theory, if you replaced the ld.global.nc with a LoadInst, would you
have the same problem? Or is there some other optimization that occurs
on those uses of a LoadInst that doesn't occur on the ld.global.nc?

In our case, we intend to transform LoadInst into ld.global.nc during Instruction
selection, not the other direction. It's only safe to use ld.global.nc for LoadInst
if the pointer being accessed is readonly (and noalias, which is irreverent to this
patch).

Question two, can you think of a way to design determinePointerReadAttrs
so that it doesn't need a use count but doesn't have a bad worst case
performance problem? For that matter, why isn't this code already linear
in the number of instructions?

Yeah, I also think this is linear in the number of instructions, and I don't think
there can be multiple call on this function that make it quadratic or something.
The complexity should be O(num_of_arguments * num_of_instructions)
Maybe simply remove this threshold is good enough. Is there any other reason
to introduce this threshold?

Nick
[1] https://github.com/vetter/shoc

http://reviews.llvm.org/D11311

Files:
lib/Transforms/IPO/FunctionAttrs.cpp
test/Transforms/FunctionAttrs/readattrs-threshold.ll
Index: test/Transforms/FunctionAttrs/readattrs-threshold.ll

/dev/null

+++ test/Transforms/FunctionAttrs/readattrs-threshold.ll
@@ -0,0 +1,13 @@
+; RUN: opt< %s -functionattrs -S | FileCheck %s --check-prefix=CHECK
+; RUN: opt< %s -functionattrs -readattrs-analysis-uses-threshold=1 -S | FileCheck %s --check-prefix=CHECK-NO-READATTR
+
+; %p has two uses
+; CHECK: define i32 @test(i32* nocapture readonly %p)
+; CHECK-NO-READATTR: define i32 @test(i32* nocapture %p)
+define i32 @test(i32* %p) {
+ %1 = load i32, i32* %p
+ %2 = getelementptr i32, i32* %p, i32 1
+ %3 = load i32, i32* %2
+ %4 = add i32 %1, %3
+ ret i32 %4
+}

Index: lib/Transforms/IPO/FunctionAttrs.cpp

lib/Transforms/IPO/FunctionAttrs.cpp

+++ lib/Transforms/IPO/FunctionAttrs.cpp
@@ -44,6 +44,12 @@
STATISTIC(NumNoAlias, "Number of function returns marked noalias");
STATISTIC(NumAnnotated, "Number of attributes added to library functions");
+static cl::opt<int> ReadAttrsAnalysisUsesThreshold(
+ "readattrs-analysis-uses-threshold", cl::Hidden, cl::init(20),
+ cl::ZeroOrMore,
+ cl::desc("Control the maximum number of uses to analyze for "
+ "readonly/readnone attribute (default = 20)"));
+
namespace {
  struct FunctionAttrs : public CallGraphSCCPass {
    static char ID; // Pass identification, replacement for typeid
@@ -425,7 +431,7 @@
// We don't need to track IsWritten. If A is written to, return immediately.

for (Use&U : A->uses()) {
if (Count++>= 20)

+ if (Count++>= ReadAttrsAnalysisUsesThreshold)
  return Attribute::None;

Visited.insert(&U);
llvm-commits mailing list
llvm-commits@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

Xuetian Weng wrote:

wengxt added a comment.

Question one, what does this have to do with ld.global.nc? It looks like
you've got a load instruction (wrapped in an intrinsic) and it in turn
has many uses. The fact it's ld.global.nc doesn't seem to be relevant?
In theory, if you replaced the ld.global.nc with a LoadInst, would you
have the same problem? Or is there some other optimization that occurs
on those uses of a LoadInst that doesn't occur on the ld.global.nc?
In our case, we intend to transform LoadInst into ld.global.nc during Instruction
selection, not the other direction. It's only safe to use ld.global.nc for LoadInst
if the pointer being accessed is readonly (and noalias, which is irreverent to this
patch).

I realize that, but my question stands. Is there an optimization that
should be happening to these ld.global.nc before we get here?

Question two, can you think of a way to design determinePointerReadAttrs
so that it doesn't need a use count but doesn't have a bad worst case
performance problem? For that matter, why isn't this code already linear
in the number of instructions?
Yeah, I also think this is linear in the number of instructions, and I don't think
there can be multiple call on this function that make it quadratic or something.
The complexity should be O(num_of_arguments * num_of_instructions)
Maybe simply remove this threshold is good enough. Is there any other reason
to introduce this threshold?

Nope, I can't think of any reason. I think this threshold is safe to remove!

Nick

[1] https://github.com/vetter/shoc

http://reviews.llvm.org/D11311

Files:
lib/Transforms/IPO/FunctionAttrs.cpp
test/Transforms/FunctionAttrs/readattrs-threshold.ll
Index: test/Transforms/FunctionAttrs/readattrs-threshold.ll
===================================================================
/dev/null +++ test/Transforms/FunctionAttrs/readattrs-threshold.ll @@ -0,0 +1,13 @@ +; RUN: opt< %s -functionattrs -S | FileCheck %s --check-prefix=CHECK +; RUN: opt< %s -functionattrs -readattrs-analysis-uses-threshold=1 -S | FileCheck %s --check-prefix=CHECK-NO-READATTR + +; %p has two uses +; CHECK: define i32 @test(i32* nocapture readonly %p) +; CHECK-NO-READATTR: define i32 @test(i32* nocapture %p) +define i32 @test(i32* %p) { + %1 = load i32, i32* %p + %2 = getelementptr i32, i32* %p, i32 1 + %3 = load i32, i32* %2 + %4 = add i32 %1, %3 + ret i32 %4 +} Index: lib/Transforms/IPO/FunctionAttrs.cpp ===================================================================

lib/Transforms/IPO/FunctionAttrs.cpp +++ lib/Transforms/IPO/FunctionAttrs.cpp @@ -44,6 +44,12 @@ STATISTIC(NumNoAlias, "Number of function returns marked noalias"); STATISTIC(NumAnnotated, "Number of attributes added to library functions");
+static cl::opt<int>   ReadAttrsAnalysisUsesThreshold( +    "readattrs-analysis-uses-threshold", cl::Hidden, cl::init(20), +    cl::ZeroOrMore, +    cl::desc("Control the maximum number of uses to analyze for " +             "readonly/readnone attribute (default = 20)")); + namespace { struct FunctionAttrs : public CallGraphSCCPass { static char ID; // Pass identification, replacement for typeid @@ -425,7 +431,7 @@ // We don't need to track IsWritten. If A is written to, return immediately.
for (Use&U : A->uses()) {
if (Count++>= 20) + if (Count++>= ReadAttrsAnalysisUsesThreshold) return Attribute::None;
Visited.insert(&U);
llvm-commits mailing list
llvm-commits@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
http://reviews.llvm.org/D11311

In D11311#208782, @nicholas wrote:

Xuetian Weng wrote:

wengxt added a comment.

Question one, what does this have to do with ld.global.nc? It looks like
you've got a load instruction (wrapped in an intrinsic) and it in turn
has many uses. The fact it's ld.global.nc doesn't seem to be relevant?

In theory, if you replaced the ld.global.nc with a LoadInst, would you
have the same problem? Or is there some other optimization that occurs
on those uses of a LoadInst that doesn't occur on the ld.global.nc?

In our case, we intend to transform LoadInst into ld.global.nc during Instruction
selection, not the other direction. It's only safe to use ld.global.nc for LoadInst
if the pointer being accessed is readonly (and noalias, which is irreverent to this
patch).

I realize that, but my question stands. Is there an optimization that
should be happening to these ld.global.nc before we get here?

My commit message might be confusing. I don't think this change has anything to do with ld.global.nc. The LoadInst to ld.global.nc optimization is just one of the examples which can benefit from this change. And we happened to notice that when we are working on our cuda compiler.

remove the threshold

wengxt retitled this revision from Replace hardcoded threshold with an option when analyze read attributes to Replace hardcoded threshold when analyze read attributes.Jul 23 2015, 3:47 PM

wengxt retitled this revision from Replace hardcoded threshold when analyze read attributes to Replace the hardcoded threshold when analyze read attributes.

LGTM

jingyue retitled this revision from Replace the hardcoded threshold when analyze read attributes to Remove the user-count threshold when analyzing read attributes.Jul 24 2015, 12:05 PM

jingyue updated this object.

jingyue edited edge metadata.

jingyue closed this revision.Jul 24 2015, 12:06 PM

This patch caused an unused warning/error, which was fixed in r243145. Thanks dnovillo@!

Revision Contents

Path

Size

lib/

Transforms/

IPO/

FunctionAttrs.cpp

3 lines

Diff 30532

lib/Transforms/IPO/FunctionAttrs.cpp

Show First 20 Lines • Show All 419 Lines • ▼ Show 20 Lines	determinePointerReadAttrs(Argument *A,
// inalloca arguments are always clobbered by the call.		// inalloca arguments are always clobbered by the call.
if (A->hasInAllocaAttr())		if (A->hasInAllocaAttr())
return Attribute::None;		return Attribute::None;

bool IsRead = false;		bool IsRead = false;
// We don't need to track IsWritten. If A is written to, return immediately.		// We don't need to track IsWritten. If A is written to, return immediately.

for (Use &U : A->uses()) {		for (Use &U : A->uses()) {
if (Count++ >= 20)
return Attribute::None;

Visited.insert(&U);		Visited.insert(&U);
Worklist.push_back(&U);		Worklist.push_back(&U);
}		}

while (!Worklist.empty()) {		while (!Worklist.empty()) {
Use *U = Worklist.pop_back_val();		Use *U = Worklist.pop_back_val();
Instruction *I = cast<Instruction>(U->getUser());		Instruction *I = cast<Instruction>(U->getUser());
Value *V = U->get();		Value *V = U->get();
▲ Show 20 Lines • Show All 1,277 Lines • Show Last 20 Lines