
[XRay][tools] Pack XRayRecord - reduce memory footprint by a third. (RFC)
Changes Planned · Public

Authored by lebedev.ri on Feb 25 2019, 7:20 AM.

Details

Reviewers
dberris
kpw
Summary

This is an RFC because of the uint8_t CPU change.
That change needs discussing.

In "basic log mode", we indeed only ever read 8 bits into that field.
In FDR mode, the CPU field in the log is 16 bits.
But if you look at the compiler-rt part, as far as I can tell, the CPU id is always
(in both modes, basic and FDR) received from uint64_t __xray::readTSC(uint8_t &CPU).
So in practice the CPU id is always only 8 bits, and in FDR mode the extra 8 bits are just padding.

Please don't take my word for it, do recheck!

Thus, I do not believe we need uint16_t for CPU: with the rest of the current code
we can never get more than a uint8_t value there, so we save 1 byte.

The rest of the patch is trivial.
By specifying the underlying type of RecordTypes we save 3 bytes.

Using llvm::SmallVector<>/llvm::SmallString costs only 16 bytes each, as opposed to 24/32 bytes for std::vector/std::string.

Thus, in total, the old sizeof(XRayRecord) was 88 bytes, and the new one is 56 bytes.
There is no padding between the fields of XRayRecord, and XRayRecord itself isn't
padded when stored into a vector. Thus the footprint of XRayRecord is now optimal.

This is important because XRayRecord has the biggest memory footprint,
and contributes most to the peak heap memory usage, at least of llvm-xray convert.

Some numbers:

xray-log.llvm-exegesis.FswRtO was acquired from llvm-exegesis
(compiled with -fxray-instruction-threshold=128) running in analysis
mode over a -benchmarks-file with 10099 points (one full
latency measurement set), with a normal runtime of 0.387s.

Time old:

$ perf stat -r9 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-old.yml xray-log.llvm-exegesis.FswRtO 

 Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-old.yml xray-log.llvm-exegesis.FswRtO' (9 runs):

           7607.69 msec task-clock                #    0.999 CPUs utilized            ( +-  0.48% )
               522      context-switches          #   68.635 M/sec                    ( +- 39.85% )
                 1      cpu-migrations            #    0.073 M/sec                    ( +- 60.83% )
             77905      page-faults               # 10241.090 M/sec                   ( +-  3.13% )
       30471867671      cycles                    # 4005708.241 GHz                   ( +-  0.48% )  (83.32%)
        2424264020      stalled-cycles-frontend   #    7.96% frontend cycles idle     ( +-  1.84% )  (83.30%)
       11097550400      stalled-cycles-backend    #   36.42% backend cycles idle      ( +-  0.35% )  (33.38%)
       36899274774      instructions              #    1.21  insn per cycle         
                                                  #    0.30  stalled cycles per insn  ( +-  0.07% )  (50.04%)
        6538597488      branches                  # 859537529.125 M/sec               ( +-  0.07% )  (66.70%)
          79769896      branch-misses             #    1.22% of all branches          ( +-  0.67% )  (83.35%)

            7.6143 +- 0.0371 seconds time elapsed  ( +-  0.49% )

Time new:

$ perf stat -r9 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-new.yml xray-log.llvm-exegesis.FswRtO 

 Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-new.yml xray-log.llvm-exegesis.FswRtO' (9 runs):

           7207.49 msec task-clock                #    1.000 CPUs utilized            ( +-  0.46% )
               174      context-switches          #   24.159 M/sec                    ( +- 30.10% )
                 0      cpu-migrations            #    0.062 M/sec                    ( +- 39.53% )
             52126      page-faults               # 7232.740 M/sec                    ( +-  0.69% )
       28876446408      cycles                    # 4006783.905 GHz                   ( +-  0.46% )  (83.31%)
        2352902586      stalled-cycles-frontend   #    8.15% frontend cycles idle     ( +-  2.08% )  (83.33%)
        8986901047      stalled-cycles-backend    #   31.12% backend cycles idle      ( +-  1.00% )  (33.36%)
       38630170181      instructions              #    1.34  insn per cycle         
                                                  #    0.23  stalled cycles per insn  ( +-  0.04% )  (50.02%)
        7016819734      branches                  # 973626739.925 M/sec               ( +-  0.04% )  (66.68%)
          86887572      branch-misses             #    1.24% of all branches          ( +-  0.39% )  (83.33%)

            7.2099 +- 0.0330 seconds time elapsed  ( +-  0.46% )

(Nice, an accidental ~5% speedup.)

Memory old:

$ heaptrack_print heaptrack.llvm-xray.3976.gz | tail -n 7
total runtime: 18.16s.
bytes allocated in total (ignoring deallocations): 5.25GB (289.03MB/s)
calls to allocation functions: 21840309 (1202792/s)
temporary memory allocations: 228301 (12573/s)
peak heap memory consumption: 354.62MB
peak RSS (including heaptrack overhead): 4.30GB
total memory leaked: 87.42KB

Memory new:

$ heaptrack_print heaptrack.llvm-xray.5234.gz | tail -n 7
total runtime: 17.93s.
bytes allocated in total (ignoring deallocations): 5.05GB (281.73MB/s)
calls to allocation functions: 21840309 (1217747/s)
temporary memory allocations: 228301 (12729/s)
peak heap memory consumption: 267.77MB
peak RSS (including heaptrack overhead): 2.16GB
total memory leaked: 83.50KB

Memory diff:

$ heaptrack_print -d heaptrack.llvm-xray.3976.gz heaptrack.llvm-xray.5234.gz | tail -n 7
total runtime: -0.22s.
bytes allocated in total (ignoring deallocations): -195.36MB (876.07MB/s)
calls to allocation functions: 0 (0/s)
temporary memory allocations: 0 (0/s)
peak heap memory consumption: -86.86MB
peak RSS (including heaptrack overhead): 0B
total memory leaked: -3.92KB

So we did reduce peak memory usage, by ~25%.
Not by a third, since something else is now the top contributor to the peak.

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision. Feb 25 2019, 7:20 AM

> This is an RFC because of the uint8_t CPU change.
> That change needs discussing.

So, this is an accident of history, which should be changed, but in the other direction. I learned some time ago that there are platforms with more CPUs than a uint8_t can represent. For future-proofing, we really should make this larger (uint16_t) and change basic mode to store 16-bit CPU IDs. The other parts seem fine to me, except for the potential churn on the user side (this is an ABI change).

I think that's fine for the C++ APIs, but it will need some release notes (in case someone has been using the Trace API).

lebedev.ri planned changes to this revision. Feb 25 2019, 1:24 PM

>> This is an RFC because of the uint8_t CPU change.
>> That change needs discussing.

> So, this is an accident of history, which should be changed, but in the other direction. I learned some time ago that there are platforms with more CPUs than a uint8_t can represent. For future-proofing, we really should make this larger (uint16_t) and change basic mode to store 16-bit CPU IDs.

Boo :)
Unfortunately that won't cost just that one extra byte; it will have a ripple effect on the padding in this struct.
I'm not sure about the exact numbers.

> The other parts seem fine to me, except for the potential churn on the user side (this is an ABI change).

> I think that's fine for the C++ APIs, but it will need some release notes (in case someone has been using the Trace API).

>>> This is an RFC because of the uint8_t CPU change.
>>> That change needs discussing.

>> So, this is an accident of history, which should be changed, but in the other direction. I learned some time ago that there are platforms with more CPUs than a uint8_t can represent. For future-proofing, we really should make this larger (uint16_t) and change basic mode to store 16-bit CPU IDs.

> Boo :)

Boo indeed. :)

> Unfortunately that won't cost just that one extra byte; it will have a ripple effect on the padding in this struct.
> I'm not sure about the exact numbers.

I like the idea of reducing top-line memory requirements, but it shouldn't come at the cost of functionality. The current state, where we only use 8 bits for the CPU ID, is a bug.

Now, an alternative here is to migrate the basic mode implementation to use a more compact log record (i.e. the FDR mode format), and use a different converter approach, one that doesn't require reconstituting the whole Trace of XRayRecord instances. The FDR log loading libraries/framework allow us to do this now (see llvm-xray fdr-dump). This is a more intensive project, but one that isn't terribly hard to accomplish. If you'd like to take that on, I'd be happy to review patches going in that direction instead (really, the only difference between the current basic mode implementation and the FDR mode implementation is that in basic mode, threads write out a buffer's contents before returning the buffer to the central buffer queue). In that process we can also migrate FDR mode to use 16-bit CPU IDs.

>>>> This is an RFC because of the uint8_t CPU change.
>>>> That change needs discussing.

>>> So, this is an accident of history, which should be changed, but in the other direction. I learned some time ago that there are platforms with more CPUs than a uint8_t can represent. For future-proofing, we really should make this larger (uint16_t) and change basic mode to store 16-bit CPU IDs.

>> Boo :)

> Boo indeed. :)

>> Unfortunately that won't cost just that one extra byte; it will have a ripple effect on the padding in this struct.
>> I'm not sure about the exact numbers.

> I like the idea of reducing top-line memory requirements, but it shouldn't come at the cost of functionality. The current state, where we only use 8 bits for the CPU ID, is a bug.

No, I totally understand.
That is why I said this is an RFC, valid only if the CPU id really
is only ever 8 bits (which it is, but only due to the bug elsewhere).
I think this is still worth doing even with a 16-bit CPU; I'll take a look.

> Now, an alternative here is to migrate the basic mode implementation to use a more compact log record (i.e. the FDR mode format), and use a different converter approach, one that doesn't require reconstituting the whole Trace of XRayRecord instances. The FDR log loading libraries/framework allow us to do this now (see llvm-xray fdr-dump). This is a more intensive project, but one that isn't terribly hard to accomplish. If you'd like to take that on, I'd be happy to review patches going in that direction instead (really, the only difference between the current basic mode implementation and the FDR mode implementation is that in basic mode, threads write out a buffer's contents before returning the buffer to the central buffer queue). In that process we can also migrate FDR mode to use 16-bit CPU IDs.

So it's basically three co-dependent steps:

  1. Teach llvm-xray convert to also work on FDR input.
  2. Then switch the compiler-rt XRay basic log to output the FDR log format.
  3. Finally, fix the truncation of the CPU id in the compiler-rt XRay code.

I'm presently no longer constrained by the XRayRecord memory footprint,
so I'm not sure how much effort I want to spend here.

>>>>> This is an RFC because of the uint8_t CPU change.
>>>>> That change needs discussing.

>>>> So, this is an accident of history, which should be changed, but in the other direction. I learned some time ago that there are platforms with more CPUs than a uint8_t can represent. For future-proofing, we really should make this larger (uint16_t) and change basic mode to store 16-bit CPU IDs.

>>> Boo :)

>> Boo indeed. :)

>>> Unfortunately that won't cost just that one extra byte; it will have a ripple effect on the padding in this struct.
>>> I'm not sure about the exact numbers.

>> I like the idea of reducing top-line memory requirements, but it shouldn't come at the cost of functionality. The current state, where we only use 8 bits for the CPU ID, is a bug.

> No, I totally understand.
> That is why I said this is an RFC, valid only if the CPU id really is only ever 8 bits (which it is, but only due to the bug elsewhere).
> I think this is still worth doing even with a 16-bit CPU; I'll take a look.

>> Now, an alternative here is to migrate the basic mode implementation to use a more compact log record (i.e. the FDR mode format), and use a different converter approach, one that doesn't require reconstituting the whole Trace of XRayRecord instances. The FDR log loading libraries/framework allow us to do this now (see llvm-xray fdr-dump). This is a more intensive project, but one that isn't terribly hard to accomplish. If you'd like to take that on, I'd be happy to review patches going in that direction instead (really, the only difference between the current basic mode implementation and the FDR mode implementation is that in basic mode, threads write out a buffer's contents before returning the buffer to the central buffer queue). In that process we can also migrate FDR mode to use 16-bit CPU IDs.

> So it's basically three co-dependent steps:
>
>   1. Teach llvm-xray convert to also work on FDR input.
>   2. Then switch the compiler-rt XRay basic log to output the FDR log format.
>   3. Finally, fix the truncation of the CPU id in the compiler-rt XRay code.

Step 1 is not strictly necessary; we already somewhat do this, although indirectly (through the Trace type), in llvm-xray convert. We can later switch to stream-processing the FDR mode logs when converting, which will be very similar to what the fdr-dump subcommand already does.

Steps 2 and 3 can happen in a single change.

> I'm presently no longer constrained by the XRayRecord memory footprint,
> so I'm not sure how much effort I want to spend here.

This I understand too. :)