Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Avoid migrating existing patches. Phabricator shutdown timeline

Annotated-source optimization reports (a.k.a. "listing" files)
AbandonedPublic

Authored by hfinkel on Apr 28 2016, 12:29 PM.

Details

Summary

This patch implements support for annotated-source optimization reports (a.k.a. "listing" files). Aside from optimizer improvements, this is a top feature request from my performance-engineering team. Most HPC-relevant compilers have some kind of capability along these lines. The DiagnosticInfo infrastructure at the IR level was designed specifically to support the development of this kind of feature, by allowing diagnostic messages to be subclass carrying arbitrary additional payload, although in terms of optimizer feedback, we currently only use this with -Rpass and friends. -Rpass and related options are very useful, but they can generate a lot of output, and that output lacks significant context, making it hard to see if the compiler is really doing what the user expects.

For this optimizer report, I focused on making the output as succinct as possible while providing information on inlining and loop transformations. The goal here is that the source code should still be easily readable in the report. My primary inspiration here is the reports generated by Cray's tools (http://docs.cray.com/books/S-2496-4101/html-S-2496-4101/z1112823641oswald.html). These reports are highly regarded within the HPC community. Intel's compiler, for example, also has an optimization-report capability (https://software.intel.com/sites/default/files/managed/55/b1/new-compiler-optimization-reports.pdf).

$ cat /tmp/v.c
void bar();
void foo() { bar(); }

void Test(int *res, int *c, int *d, int *p, int n) {
  int i;

#pragma clang loop vectorize(assume_safety)
  for (i = 0; i < 1600; i++) {
    res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
  }

  for (i = 0; i < 16; i++) {
    res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
  }

  foo();

  foo(); bar(); foo();
}

The patch -flisting and -flisting=filename. For the first form, where the file name is not explicitly specified, the file name is computed automatically just as we do for split-debug output files.

$ clang -O3 -o /tmp/v.o -c /tmp/v.c -flisting
$ cat /tmp/v.lst

< /tmp/v.c
 1     | void bar();
 2     | void foo() { bar(); }
 3     | 
 4     | void Test(int *res, int *c, int *d, int *p, int n) {
 5     |   int i;
 6     | 
 7     | #pragma clang loop vectorize(assume_safety)
 8   V |   for (i = 0; i < 1600; i++) {
 9     |     res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
10     |   }
11     | 
12     |   for (i = 0; i < 16; i++) {
13  U  |     res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
14     |   }
15     | 
16 I   |   foo();
17     | 
18     |   foo(); bar(); foo();
   I   |   ^
   I   |                 ^
19     | }
20     |

Each source line gets a prefix giving the line number, and a few columns for important optimizations: inlining, loop unrolling and loop vectorization. An 'I' is printed next to a line where a function was inlined, a 'U' next to an unrolled loop, and 'V' next to a vectorized loop. These are printing on the relevant code line when that seems unambiguous, or on subsequent lines when multiple potential options exist (messages, both positive and negative, from the same optimization with different column numbers are taken to indicate potential ambiguity). When on subsequent lines, a '^' is output in the relevant column. The fact that the 'U' is on the wrong line is also a problem with -Rpass=loop-unroll and may be something we can fix in the backend.

Annotated source for all relevant input files are put into the listing file (each starting with '<' and then the file name).

To see what this looks like for C++ code, here's a small excerpt from CodeGenAction.cpp:

340     |   // If the SMDiagnostic has an inline asm source location, translate it.
341 I   |   FullSourceLoc Loc;
342     |   if (D.getLoc() != SMLoc())
    I   |       ^
    I   |                  ^
    I   |                     ^
343     |     Loc = ConvertBackendLocation(D, Context->getSourceManager());
    I   |           ^
    I   |                                     ^
344     | 
345     |   unsigned DiagID;
346 I   |   switch (D.getKind()) {

There's obvious bikeshedding to do here, and I'm quite open to suggestions. My engineering team often calls these things "listing files", and other tools often name this files with lst as an extension, thus the naming in the patch. Intel's option is -opt-report-file=filename.

After some backend enhancements (to turn the relevant remark types into proper subclasses), I'd like to extend this to also print the vectorization factor, interleaving factor and unrolling factor when relevant. After these enhancements, I'd l imagine the loop annotations might look like V4,2U4 for a loop vectorized with VF == 4 and interleaving by 2, and then partially unrolled by a factor of 4.

Please review.

Diff Detail

Event Timeline

hfinkel updated this revision to Diff 55446.Apr 28 2016, 12:29 PM
hfinkel retitled this revision from to Annotated-source optimization reports (a.k.a. "listing" files).
hfinkel updated this object.
hfinkel added a subscriber: cfe-commits.
rjmccall edited edge metadata.Apr 28 2016, 12:52 PM

I see what you're going for with "listing file", but I like ICC's option name much better, or at least something along those lines.

rsmith edited edge metadata.Apr 28 2016, 1:15 PM

You give this example:

343     |     Loc = ConvertBackendLocation(D, Context->getSourceManager());
    I   |           ^
    I   |                                     ^

How does this look for a case like p->Foo()->Bar() (where one or both of the calls are inlined)? Can we get the source location to point at the function name instead of the start of the expression to reduce the scope for ambiguity?

lib/CodeGen/CodeGenAction.cpp
637–733

I'd like this to be factored out and moved somewhere more appropriate (such as Frontend). It seems appropriate for CodeGen to generate the data structure here, but it should not be deciding how to format the report nor doing file IO to put it somewhere.

I would hope that we can combine this report information with the static analyzer's existing support for generating syntax-highlighted, annotated source code as HTML as a future extension.

You give this example:

343     |     Loc = ConvertBackendLocation(D, Context->getSourceManager());
    I   |           ^
    I   |                                     ^

How does this look for a case like p->Foo()->Bar() (where one or both of the calls are inlined)? Can we get the source location to point at the function name instead of the start of the expression to reduce the scope for ambiguity?

That does not currently work very well (I assume this needs a backend fix, but I'll check).

$ cat /tmp/i.cpp
void ext();

struct Bar {
  void bar() { ext(); }
};

struct Foo {
  Bar *b;

  Bar *foo() { return b; }
};

void test(Foo *f) {
  f->foo()->bar();
}

And we get:

14 I   |   f->foo()->bar();

because both inlining remarks come from the backend with the same column number:

$ clang -O3 -c -o /tmp/i.o /tmp/i.cpp -flisting -Rpass=inline
/tmp/i.cpp:14:3: remark: _ZN3Foo3fooEv inlined into _Z4testP3Foo [-Rpass=inline]
  f->foo()->bar();
  ^
/tmp/i.cpp:14:3: remark: _ZN3Bar3barEv inlined into _Z4testP3Foo [-Rpass=inline]
lib/CodeGen/CodeGenAction.cpp
637–733

I'd like this to be factored out and moved somewhere more appropriate (such as Frontend). It seems appropriate for CodeGen to generate the data structure here, but it should not be deciding how to format the report nor doing file IO to put it somewhere.

Makes sense.

I would hope that we can combine this report information with the static analyzer's existing support for generating syntax-highlighted, annotated source code as HTML as a future extension.

I like this idea.

I see what you're going for with "listing file", but I like ICC's option name much better, or at least something along those lines.

Sounds good to me. Do you have a preference on -fopt-report vs. -foptimization-report vs. something else? Do you have an opinion on the default file-name extension for the report? Maybe I should name it .opt-report (or something like that)?

I see what you're going for with "listing file", but I like ICC's option name much better, or at least something along those lines.

Sounds good to me. Do you have a preference on -fopt-report vs. -foptimization-report vs. something else? Do you have an opinion on the default file-name extension for the report? Maybe I should name it .opt-report (or something like that)?

I don't think we have a consistent abbreviation for that anywhere else in the options (other than -O, I guess), so my inclination would be to spell it out as -foptimization-report.

The extension is just appended to the original filename, so that it ends up something like foo.cpp.opt-report? I don't have an objection to ".opt-report" or even ".lst".

rcox2 edited edge metadata.Apr 28 2016, 3:19 PM

Actually, the Intel compiler distinguishes between an optimization report (-qopt-report) and an annotated listing (-qopt-report-annotate). The optimization report lists the info for optimizations in a hierarchical fashion. To use you example,

icc -c -O3 -qopt-report=1 -qopt-report-file=stderr v.c

yields:

Report from: Interprocedural optimizations [ipo]

INLINING OPTION VALUES:

-inline-factor: 100
-inline-min-size: 20
-inline-max-size: 230
-inline-max-total-size: 2000
-inline-max-per-routine: 10000
-inline-max-per-compile: 500000

Begin optimization report for: foo()

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (foo()) [1] v.c(2,12)

Report from: Code generation optimizations [cg]

v.c(2,12):remark #34051: REGISTER ALLOCATION : [foo] v.c:2

Hardware registers
    Reserved     :    1[ esp]
    Available    :   23[ eax edx ecx ebx ebp esi edi mm0-mm7 zmm0-zmm7]
    Callee-save  :    4[ ebx ebp esi edi]
    Assigned     :    0[ reg_null]

Routine temporaries
    Total         :       4
        Global    :       0
        Local     :       4
    Regenerable   :       0
    Spilled       :       0

Routine stack
    Variables     :       0 bytes*
        Reads     :       0 [0.00e+00 ~ 0.0%]
        Writes    :       0 [0.00e+00 ~ 0.0%]
    Spills        :       0 bytes*
        Reads     :       0 [0.00e+00 ~ 0.0%]
        Writes    :       0 [0.00e+00 ~ 0.0%]

Notes

    *Non-overlapping variables and spills may share stack space,
     so the total stack size might be less than this.

Begin optimization report for: Test(int *, int *, int *, int *, int)

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (Test(int *, int *, int *, int *, int)) [2] v.c(4,52)

-> INLINE: (16,3) foo()
-> INLINE: (18,3) foo()
-> INLINE: (18,17) foo()


  Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at v.c(8,8)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at v.c(8,8)

remark #15301: SIMD LOOP WAS VECTORIZED

LOOP END

LOOP BEGIN at v.c(8,8)
<Alternate Alignment Vectorized Loop>
LOOP END

LOOP BEGIN at v.c(8,8)
<Remainder loop for vectorization>

remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override

LOOP END

LOOP BEGIN at v.c(12,3)

remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
remark #15346: vector dependence: assumed FLOW dependence between res[i] (13:5) and d[i] (13:5)
remark #25436: completely unrolled by 16

LOOP END

Report from: Code generation optimizations [cg]

v.c(4,52):remark #34051: REGISTER ALLOCATION : [Test] v.c:4

Hardware registers
    Reserved     :    1[ esp]
    Available    :   23[ eax edx ecx ebx ebp esi edi mm0-mm7 zmm0-zmm7]
    Callee-save  :    4[ ebx ebp esi edi]
    Assigned     :   15[ eax edx ecx ebx ebp esi edi zmm0-zmm7]

Routine temporaries
    Total         :     123
        Global    :      47
        Local     :      76
    Regenerable   :       5
    Spilled       :       6

Routine stack
    Variables     :       0 bytes*
        Reads     :       0 [0.00e+00 ~ 0.0%]
        Writes    :       0 [0.00e+00 ~ 0.0%]
    Spills        :       8 bytes*
        Reads     :       5 [1.41e+01 ~ 1.4%]
        Writes    :       3 [3.00e+00 ~ 0.3%]

Notes

    *Non-overlapping variables and spills may share stack space,
     so the total stack size might be less than this.

while the annotated listing looks like:


------- Annotated listing with optimization reports for "/export/iusers/rcox2/rgHF/v.c" -------

INLINING OPTION VALUES:
-inline-factor: 100
-inline-min-size: 20
-inline-max-size: 230
-inline-max-total-size: 2000
-inline-max-per-routine: 10000
-inline-max-per-compile: 500000

1 void bar();
2 void foo() { bar(); }
INLINE REPORT: (foo()) [1] /export/iusers/rcox2/rgHF/v.c(2,12)

/export/iusers/rcox2/rgHF/v.c(2,12):remark #34051: REGISTER ALLOCATION : [foo] /export/iusers/rcox2/rgHF/v.c:2

Hardware registers
Reserved : 1[ esp]
Available : 23[ eax edx ecx ebx ebp esi edi mm0-mm7 zmm0-zmm7]
Callee-save : 4[ ebx ebp esi edi]
Assigned : 0[ reg_null]

Routine temporaries
Total : 4
Global : 0
Local : 4
Regenerable : 0
Spilled : 0

Routine stack
Variables : 0 bytes*
Reads : 0 [0.00e+00 ~ 0.0%]
Writes : 0 [0.00e+00 ~ 0.0%]
Spills : 0 bytes*
Reads : 0 [0.00e+00 ~ 0.0%]
Writes : 0 [0.00e+00 ~ 0.0%]

Notes

*Non-overlapping variables and spills may share stack space,
so the total stack size might be less than this.


3
4 void Test(int *res, int *c, int *d, int *p, int n) {
INLINE REPORT: (Test(int *, int *, int *, int *, int)) [2] /export/iusers/rcox2/rgHF/v.c(4,52)
-> INLINE: (16,3) foo()
-> INLINE: (18,3) foo()
-> INLINE: (18,17) foo()

/export/iusers/rcox2/rgHF/v.c(4,52):remark #34051: REGISTER ALLOCATION : [Test] /export/iusers/rcox2/rgHF/v.c:4

Hardware registers
Reserved : 1[ esp]
Available : 23[ eax edx ecx ebx ebp esi edi mm0-mm7 zmm0-zmm7]
Callee-save : 4[ ebx ebp esi edi]
Assigned : 15[ eax edx ecx ebx ebp esi edi zmm0-zmm7]

Routine temporaries
Total : 123
Global : 47
Local : 76
Regenerable : 5
Spilled : 6

Routine stack
Variables : 0 bytes*
Reads : 0 [0.00e+00 ~ 0.0%]
Writes : 0 [0.00e+00 ~ 0.0%]
Spills : 8 bytes*
Reads : 5 [1.41e+01 ~ 1.4%]
Writes : 3 [3.00e+00 ~ 0.3%]

Notes

*Non-overlapping variables and spills may share stack space,
so the total stack size might be less than this.


5 int i;
6
7 #pragma simd
8 for (i = 0; i < 1600; i++) {

LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(8,8)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(8,8)
remark #15301: SIMD LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(8,8)
<Alternate Alignment Vectorized Loop>
LOOP END

LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(8,8)
<Remainder loop for vectorization>
remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
LOOP END
9 res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
10 }
11
12 for (i = 0; i < 16; i++) {

LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(12,3)
remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
remark #15346: vector dependence: assumed FLOW dependence between res[i] (13:5) and d[i] (13:5)
remark #25436: completely unrolled by 16
//LOOP END
13 res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
14 }
15
16 foo();
17
18 foo(); bar(); foo();
19 }

essentially, various parts of the optimization report are inserted into a listing at the appropriate line numbers.

(Note that this is just the default level. More detail can be obtained with -qopt-report=X where X>1 (up to 5 is supported)).

I believe what Hal is proposing in this patch is a very useful light-weight annotation of the source with key information. But I also believe that there is value for a stand-alone opt report with the kind of detailed information I presented in D19397 and the two follow up patches. In general, while this info can be interspersed in the source listing, I believe that for most purposes it is a bit too "busy" in text form. (The Intel compiler also supports annotated html and functionality that feeds into Visual Studio that has received great reviews.)

Actually, the Intel compiler distinguishes between an optimization report (-qopt-report) and an annotated listing (-qopt-report-annotate).

Interesting; thanks for pointing this out (and for the example).

The optimization report lists the info for optimizations in a hierarchical fashion. To use you example,

icc -c -O3 -qopt-report=1 -qopt-report-file=stderr v.c

yields:

Report from: Interprocedural optimizations [ipo]

INLINING OPTION VALUES:

-inline-factor: 100
-inline-min-size: 20
-inline-max-size: 230
-inline-max-total-size: 2000
-inline-max-per-routine: 10000
-inline-max-per-compile: 500000

Begin optimization report for: foo()

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (foo()) [1] v.c(2,12)

Report from: Code generation optimizations [cg]

v.c(2,12):remark #34051: REGISTER ALLOCATION : [foo] v.c:2

Hardware registers
    Reserved     :    1[ esp]
    Available    :   23[ eax edx ecx ebx ebp esi edi mm0-mm7 zmm0-zmm7]
    Callee-save  :    4[ ebx ebp esi edi]
    Assigned     :    0[ reg_null]

Routine temporaries
    Total         :       4
        Global    :       0
        Local     :       4
    Regenerable   :       0
    Spilled       :       0

Routine stack
    Variables     :       0 bytes*
        Reads     :       0 [0.00e+00 ~ 0.0%]
        Writes    :       0 [0.00e+00 ~ 0.0%]
    Spills        :       0 bytes*
        Reads     :       0 [0.00e+00 ~ 0.0%]
        Writes    :       0 [0.00e+00 ~ 0.0%]

Notes

    *Non-overlapping variables and spills may share stack space,
     so the total stack size might be less than this.

Begin optimization report for: Test(int *, int *, int *, int *, int)

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (Test(int *, int *, int *, int *, int)) [2] v.c(4,52)

-> INLINE: (16,3) foo()
-> INLINE: (18,3) foo()
-> INLINE: (18,17) foo()


  Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at v.c(8,8)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at v.c(8,8)

remark #15301: SIMD LOOP WAS VECTORIZED

LOOP END

LOOP BEGIN at v.c(8,8)
<Alternate Alignment Vectorized Loop>
LOOP END

LOOP BEGIN at v.c(8,8)
<Remainder loop for vectorization>

remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override

LOOP END

LOOP BEGIN at v.c(12,3)

remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
remark #15346: vector dependence: assumed FLOW dependence between res[i] (13:5) and d[i] (13:5)
remark #25436: completely unrolled by 16

LOOP END

Report from: Code generation optimizations [cg]

v.c(4,52):remark #34051: REGISTER ALLOCATION : [Test] v.c:4

Hardware registers
    Reserved     :    1[ esp]
    Available    :   23[ eax edx ecx ebx ebp esi edi mm0-mm7 zmm0-zmm7]
    Callee-save  :    4[ ebx ebp esi edi]
    Assigned     :   15[ eax edx ecx ebx ebp esi edi zmm0-zmm7]

Routine temporaries
    Total         :     123
        Global    :      47
        Local     :      76
    Regenerable   :       5
    Spilled       :       6

Routine stack
    Variables     :       0 bytes*
        Reads     :       0 [0.00e+00 ~ 0.0%]
        Writes    :       0 [0.00e+00 ~ 0.0%]
    Spills        :       8 bytes*
        Reads     :       5 [1.41e+01 ~ 1.4%]
        Writes    :       3 [3.00e+00 ~ 0.3%]

Notes

    *Non-overlapping variables and spills may share stack space,
     so the total stack size might be less than this.

while the annotated listing looks like:


------- Annotated listing with optimization reports for "/export/iusers/rcox2/rgHF/v.c" -------

INLINING OPTION VALUES:
-inline-factor: 100
-inline-min-size: 20
-inline-max-size: 230
-inline-max-total-size: 2000
-inline-max-per-routine: 10000
-inline-max-per-compile: 500000

1 void bar();
2 void foo() { bar(); }
INLINE REPORT: (foo()) [1] /export/iusers/rcox2/rgHF/v.c(2,12)

/export/iusers/rcox2/rgHF/v.c(2,12):remark #34051: REGISTER ALLOCATION : [foo] /export/iusers/rcox2/rgHF/v.c:2

Hardware registers
Reserved : 1[ esp]
Available : 23[ eax edx ecx ebx ebp esi edi mm0-mm7 zmm0-zmm7]
Callee-save : 4[ ebx ebp esi edi]
Assigned : 0[ reg_null]

Routine temporaries
Total : 4
Global : 0
Local : 4
Regenerable : 0
Spilled : 0

Routine stack
Variables : 0 bytes*
Reads : 0 [0.00e+00 ~ 0.0%]
Writes : 0 [0.00e+00 ~ 0.0%]
Spills : 0 bytes*
Reads : 0 [0.00e+00 ~ 0.0%]
Writes : 0 [0.00e+00 ~ 0.0%]

Notes

*Non-overlapping variables and spills may share stack space,
so the total stack size might be less than this.


3
4 void Test(int *res, int *c, int *d, int *p, int n) {
INLINE REPORT: (Test(int *, int *, int *, int *, int)) [2] /export/iusers/rcox2/rgHF/v.c(4,52)
-> INLINE: (16,3) foo()
-> INLINE: (18,3) foo()
-> INLINE: (18,17) foo()

/export/iusers/rcox2/rgHF/v.c(4,52):remark #34051: REGISTER ALLOCATION : [Test] /export/iusers/rcox2/rgHF/v.c:4

Hardware registers
Reserved : 1[ esp]
Available : 23[ eax edx ecx ebx ebp esi edi mm0-mm7 zmm0-zmm7]
Callee-save : 4[ ebx ebp esi edi]
Assigned : 15[ eax edx ecx ebx ebp esi edi zmm0-zmm7]

Routine temporaries
Total : 123
Global : 47
Local : 76
Regenerable : 5
Spilled : 6

Routine stack
Variables : 0 bytes*
Reads : 0 [0.00e+00 ~ 0.0%]
Writes : 0 [0.00e+00 ~ 0.0%]
Spills : 8 bytes*
Reads : 5 [1.41e+01 ~ 1.4%]
Writes : 3 [3.00e+00 ~ 0.3%]

Notes

*Non-overlapping variables and spills may share stack space,
so the total stack size might be less than this.


5 int i;
6
7 #pragma simd
8 for (i = 0; i < 1600; i++) {

LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(8,8)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(8,8)
remark #15301: SIMD LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(8,8)
<Alternate Alignment Vectorized Loop>
LOOP END

LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(8,8)
<Remainder loop for vectorization>
remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
LOOP END
9 res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
10 }
11
12 for (i = 0; i < 16; i++) {

LOOP BEGIN at /export/iusers/rcox2/rgHF/v.c(12,3)
remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
remark #15346: vector dependence: assumed FLOW dependence between res[i] (13:5) and d[i] (13:5)
remark #25436: completely unrolled by 16
//LOOP END
13 res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
14 }
15
16 foo();
17
18 foo(); bar(); foo();
19 }

essentially, various parts of the optimization report are inserted into a listing at the appropriate line numbers.

(Note that this is just the default level. More detail can be obtained with -qopt-report=X where X>1 (up to 5 is supported)).

I believe what Hal is proposing in this patch is a very useful light-weight annotation of the source with key information. But I also believe that there is value for a stand-alone opt report with the kind of detailed information I presented in D19397 and the two follow up patches.

To be clear, I agree. I'd like to have both.

In general, while this info can be interspersed in the source listing, I believe that for most purposes it is a bit too "busy" in text form. (The Intel compiler also supports annotated html and functionality that feeds into Visual Studio that has received great reviews.)

I think this piggybacks on Richard's suggestion regarding later integration with the static analyzer's output capabilities. We should definitely explore how this might be done.

anemet added inline comments.Apr 28 2016, 5:20 PM
lib/CodeGen/CodeGenAction.cpp
706–709

Should the abbreviation be somehow part of the optimization remark API and passed in just like the pass name?

It would be nice if someone added optimization remark for a new opt, it would show up here automatically. I could see how that could make the output too busy but at least have the option?

You give this example:

343     |     Loc = ConvertBackendLocation(D, Context->getSourceManager());
    I   |           ^
    I   |                                     ^

How does this look for a case like p->Foo()->Bar() (where one or both of the calls are inlined)? Can we get the source location to point at the function name instead of the start of the expression to reduce the scope for ambiguity?

That does not currently work very well (I assume this needs a backend fix, but I'll check).

Actually, it's a Clang problem. https://llvm.org/bugs/show_bug.cgi?id=27567

hfinkel added inline comments.Apr 28 2016, 10:47 PM
lib/CodeGen/CodeGenAction.cpp
706–709

So long as we're careful in the backend to respect the limited visual real estate and namespace in this kind of report, we could have the optimizations themselves provide a letter. I'm undecided.

hfinkel updated this revision to Diff 55907.May 2 2016, 3:59 PM
hfinkel edited edge metadata.

Renamed the option from -flisting to -foptimization-report as suggested. Moved I/O-related and formatting-related code into Frontend.

Actually, the Intel compiler distinguishes between an optimization report (-qopt-report) and an annotated listing (-qopt-report-annotate). The optimization report lists the info for optimizations in a hierarchical fashion. To use you example,

icc -c -O3 -qopt-report=1 -qopt-report-file=stderr v.c

yields:

Robert, John, (et al.), do you think I should change this to have an -foptimization-report-file=<filename> and -foptimization-report, instead of -foptimization-report=<filename>? In the future, when we have multiple kinds of reports (a detailed inlining report, for example), maybe we want to use -foptimization-report=inlining,somethingelse,andmore?

rcox2 added a comment.May 2 2016, 5:14 PM

Of course, it would be my preference to mirror the functionality of what is available in the "new" hierarchical form of optimization report Intel compiler. So, I would like to distinguish between what Hal is proposing (which we call an "annotated listing") and what I am proposing, which we call an "optimization report".

If Hal wants to call what he is proposing the "optimization report", then we need to come up with another name for what I am proposing.

To summarize what the Intel compiler has

-qopt-report[=N]     where the default is 2 and the range is 1-5, with 1 having the least detail and 5 having the most detail 
-qopt-report-file=F   where F is a file name or stdout or stderr 
-qopt-report-phase=P where P is a sequence of phases (like ipo,cg, etc.) and only those phases are printed 
-qopt-report-filter=X  where X allows you to filter opt reports only for certain routines or parts of routines

Use of ANY of these implies the generation of an opt report, so you don't need to say:

-qopt-report -qopt-report-file=stderr

since

-qopt-report-file=stderr

is sufficient.

On a slightly different topic ....

One key question I have about Hal's proposal is whether there is any annotation associated with code that is inlined, beyond noting the call site that is inlined.

For example, if we have:

 int foo() { 
     ... 
     loop 
     ....
} 
int main() { 
    ...
    foo(); 
    ...
 }

and foo gets inlined, we have two loops of interest, the loop in foo() and the loop inlined into main(). Each of these could be vectorized, unrolled, etc. and it isn't always the case that both loops would have the same properties. So, does Hal's report indicate info only about the loop in foo(), or are the properties of the two loops ANDed or ORed together and reported next to the loop in foo, or something else?

In general, you want info about both loops, and you can that with a classic optimization report. But it's not clear how to effectively represent this on the lightweight annotated listing. And it is often the case that the inlined loop is actually the more executed one, and therefore more important.

Of course, it would be my preference to mirror the functionality of what is available in the "new" hierarchical form of optimization report Intel compiler. So, I would like to distinguish between what Hal is proposing (which we call an "annotated listing") and what I am proposing, which we call an "optimization report".

If Hal wants to call what he is proposing the "optimization report", then we need to come up with another name for what I am proposing.

To be clear, I'm fine with calling this something else. How about "optimization summary" or "annotated optimization summary"? I could name the option -fannotated-optimization-summary, for example.

To summarize what the Intel compiler has

-qopt-report[=N]     where the default is 2 and the range is 1-5, with 1 having the least detail and 5 having the most detail 
-qopt-report-file=F   where F is a file name or stdout or stderr 
-qopt-report-phase=P where P is a sequence of phases (like ipo,cg, etc.) and only those phases are printed 
-qopt-report-filter=X  where X allows you to filter opt reports only for certain routines or parts of routines

Use of ANY of these implies the generation of an opt report, so you don't need to say:

-qopt-report -qopt-report-file=stderr

since

-qopt-report-file=stderr

is sufficient.

On a slightly different topic ....

One key question I have about Hal's proposal is whether there is any annotation associated with code that is inlined, beyond noting the call site that is inlined.

For example, if we have:

 int foo() { 
     ... 
     loop 
     ....
} 
int main() { 
    ...
    foo(); 
    ...
 }

and foo gets inlined, we have two loops of interest, the loop in foo() and the loop inlined into main(). Each of these could be vectorized, unrolled, etc. and it isn't always the case that both loops would have the same properties. So, does Hal's report indicate info only about the loop in foo(), or are the properties of the two loops ANDed or ORed together and reported next to the loop in foo, or something else?

In general, you want info about both loops, and you can that with a classic optimization report. But it's not clear how to effectively represent this on the lightweight annotated listing. And it is often the case that the inlined loop is actually the more executed one, and therefore more important.

Currently, the information is ORed together. This has exactly the problem that you indicate, although the problem can be even worse than that: Functions are often transitively inlined multiple times into the same function, and users often want to know if the loops in those functions (or the inlining decisions themselves) differed each time. In the future, I'd like to detect this situation and present this information to the user. The common case (same behavior everywhere) should carry a succinct annotation, but otherwise we might want to do something like this:

1234      |    for (...)
       V  |    ^
          |    * When inlined into foo (file.c:789), doit (d.c:45) -> bar (bar.c:234), not when inlined into work (work.c:345), doit (d.c:45) -> bar (bar.c:254)

Or we might just want to indicate that the annotation applies only to some places where the code was inlined and the user can generate a more-detailed optimization report for more information. I'm certainly open to suggestions, but I'd like to handle this in follow-up because it will likely require backend enhancements as well.

A related issue comes up with templated code; we might want to indicate the types when the optimizations end up being type dependent.

This discussion of the command line interface makes me think that we should be taking Richard's suggestion one step further. Why is Clang's involvement here more than just handing down specific requests for optimization data to LLVM and packaging that information back into some reasonable format? The actual presentation of that data seems like it belongs in a separate library / tool, which can have a rich set of visualization options. This is also a more rigorously testable design, since the tool has a well-defined input format that's not just an incidental by-product of the compiler.

If we did that, then Clang just needs (1) an output filename, (2) an optional list of passes to collect data from, and (3) maybe some stringly-typed configuration data for each.

This discussion of the command line interface makes me think that we should be taking Richard's suggestion one step further. Why is Clang's involvement here more than just handing down specific requests for optimization data to LLVM and packaging that information back into some reasonable format?

The static analyzer supports outputting its data in plist format (with which I'm not familiar in detail, but it looks like a fairly-simple xml format). Is that close to what you had in mind? Maybe YAML would be better (since LLVM actually has a parser for that)?

The actual presentation of that data seems like it belongs in a separate library / tool, which can have a rich set of visualization options. This is also a more rigorously testable design, since the tool has a well-defined input format that's not just an incidental by-product of the compiler.

I think this makes sense.

If we did that, then Clang just needs (1) an output filename, (2) an optional list of passes to collect data from, and (3) maybe some stringly-typed configuration data for each.

Sure. Although I'd prefer to leave the filtering to the tool unless the I/O requirements become unmanageable. Users don't know, and shouldn't know, what passes do what.

This discussion of the command line interface makes me think that we should be taking Richard's suggestion one step further. Why is Clang's involvement here more than just handing down specific requests for optimization data to LLVM and packaging that information back into some reasonable format?

The static analyzer supports outputting its data in plist format (with which I'm not familiar in detail, but it looks like a fairly-simple xml format). Is that close to what you had in mind? Maybe YAML would be better (since LLVM actually has a parser for that)?

YAML makes a lot of sense to me.

If we did that, then Clang just needs (1) an output filename, (2) an optional list of passes to collect data from, and (3) maybe some stringly-typed configuration data for each.

Sure. Although I'd prefer to leave the filtering to the tool unless the I/O requirements become unmanageable. Users don't know, and shouldn't know, what passes do what.

Sure, seems reasonable. So we can start with just some way to turn the feature on and specify a filename.

fhahn added a subscriber: fhahn.Sep 13 2016, 4:44 AM