This is an archive of the discontinued LLVM Phabricator instance.

Differential D111554

[libc] automemcpy README and main include file
ClosedPublic

Authored by gchatelet on Oct 11 2021, 8:40 AM.


Details

Reviewers

courbet

Summary

"automemcpy: A framework for automatic generation of fundamental memory operations"
https://research.google/pubs/pub50338/

This patch implements the concepts presented in the paper. The overall approach is the following:

Use constraint programming to model the implementation of a memory function (memcpy, memset, memcmp, bzero, bcmp).
Generate the code for all valid implementations.
Compile the implementations and benchmark them on a set of machines. The benchmark uses representative distributions for the function's arguments.
Analyze the results and pick "the best" performing function for the specific environment.
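
To make the first two steps more concrete, here is a minimal sketch of what "enumerate all valid implementations under constraints" can mean. It is an assumption-laden illustration: a brute-force loop stands in for the Z3 solver, and the `Candidate` struct, the `isValid` constraints, and `enumerateCandidates` are hypothetical names and rules, not the ones used by the patch or the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of an implementation: two thresholds split
// the sizes into three regions, each handled by a different strategy.
//   [0, OverlapBegin)         -> one branch per size
//   [OverlapBegin, LoopBegin) -> overlapping operations
//   [LoopBegin, ...)          -> loop handling BlockSize bytes at a time
struct Candidate {
  std::size_t OverlapBegin;
  std::size_t LoopBegin;
  std::size_t BlockSize;
};

inline bool isPowerOfTwo(std::size_t V) { return V != 0 && (V & (V - 1)) == 0; }

// Illustrative constraints only (assumptions, not taken from the paper).
inline bool isValid(const Candidate &C) {
  return isPowerOfTwo(C.OverlapBegin) && isPowerOfTwo(C.LoopBegin) &&
         isPowerOfTwo(C.BlockSize) && C.OverlapBegin < C.LoopBegin &&
         C.BlockSize <= C.LoopBegin;
}

// Brute-force stand-in for the solver: try every power-of-two combination up
// to MaxThreshold and keep the ones satisfying the constraints.
inline std::vector<Candidate> enumerateCandidates(std::size_t MaxThreshold) {
  std::vector<Candidate> Result;
  for (std::size_t O = 1; O <= MaxThreshold; O *= 2)
    for (std::size_t L = 1; L <= MaxThreshold; L *= 2)
      for (std::size_t B = 1; B <= MaxThreshold; B *= 2) {
        const Candidate C{O, L, B};
        if (isValid(C))
          Result.push_back(C);
      }
  return Result;
}
```

Each surviving candidate would then be turned into source code, compiled, and benchmarked. A constraint solver becomes worthwhile once the search space is too large or the constraints too intertwined for nested loops like these.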

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

gchatelet created this revision. Oct 11 2021, 8:40 AM

Herald added a project: Restricted Project. Oct 11 2021, 8:40 AM

Herald added subscribers: libc-commits, ecnelises, tschuett.

gchatelet requested review of this revision. Oct 11 2021, 8:40 AM

Harbormaster completed remote builds in B128121: Diff 378686. Oct 11 2021, 9:18 AM

courbet added inline comments. Oct 12 2021, 2:50 AM

libc/benchmarks/automemcpy/FunctionDescriptor.h
1 ↗	(On Diff #378686)	This is supposed to include the filename: https://llvm.org/docs/CodingStandards.html#file-headers
23 ↗	(On Diff #378686)	Do you also want to add a static_assert for PODness ?
26 ↗	(On Diff #378686)	Can you comment on what the comparison is used for ?
33 ↗	(On Diff #378686)	doc ?
35 ↗	(On Diff #378686)	doc ?
52 ↗	(On Diff #378686)	Should the naming be `ContiguousStrategy`, `OverlapStrategy`, ... ? I'm seeing this as describing a strategy to be applied to a given size range.
58–59 ↗	(On Diff #378686)	"an overlapping strategy" ?
60 ↗	(On Diff #378686)	"The span"
80 ↗	(On Diff #378686)	How ? (I mean which strategy)
117–118 ↗	(On Diff #378686)	Should that be an error ? The Z3 model should not be generating these given the constraints, right ?
libc/benchmarks/automemcpy/README.md
3	Maybe make it clear that this is not built by default: This is not enabled by default, as it is mostly useful when working on tuning the library implementation. To build it, use `LIBC_BUILD_AUTOMEMCPY=ON`.
4	Did you mean to give a real link here ?
42	can you explain more about this ? (in particular, that the way to express this depends on the target)
73	information
77	is
84	picks
87	`A`
89	`A`, `M`
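
The `static_assert` for PODness suggested in the comment on line 23 of FunctionDescriptor.h could look roughly like the sketch below. The struct names are hypothetical stand-ins for the types of the patch, `std::optional` is used in place of `llvm::Optional`, and C++17 is assumed. The sketch also shows two ways a type stops being trivial: default member initializers and optional members.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <type_traits>

// Stand-in for a span without default member initializers: fully trivial.
struct PlainSpan {
  std::size_t Begin;
  std::size_t End;
};
static_assert(std::is_trivial_v<PlainSpan>, "PlainSpan should be trivial");
static_assert(std::is_standard_layout_v<PlainSpan>,
              "PlainSpan should be standard layout");

// Default member initializers make the default constructor non-trivial, but
// the type remains trivially copyable.
struct SpanWithDefaults {
  std::size_t Begin = 0;
  std::size_t End = 0;
};
static_assert(!std::is_trivial_v<SpanWithDefaults>,
              "default member initializers prevent triviality");
static_assert(std::is_trivially_copyable_v<SpanWithDefaults>,
              "still cheap to copy");

// Hypothetical descriptor with an optional member: the optional has a
// user-provided default constructor, so the aggregate is not trivial.
struct DescriptorWithOptional {
  std::optional<PlainSpan> Span;
};
static_assert(!std::is_trivial_v<DescriptorWithOptional>,
              "optional members prevent triviality");
```

If the goal is only cheap copies, asserting `std::is_trivially_copyable` is a weaker check that may be easier to satisfy than full triviality.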

Address comments

libc/benchmarks/automemcpy/FunctionDescriptor.h
1 ↗	(On Diff #378686)	The whole libc project would need to be fixed :-/ Can we fix it separately?
23 ↗	(On Diff #378686)	Actually I'd only need them to be trivial (`std::is_trivial`), but I just figured out that the `Optional` fields in `FunctionDescriptor` are getting in the way... The generated code is still pretty efficient: https://godbolt.org/z/zdoh5vWMj
52 ↗	(On Diff #378686)	In theory yes, but in practice the type is serialized in the autogenerated C++ file, and adding a trailing "Strategy" to every type would really hinder readability. For example, here is one of the several thousand lines of serialized `NamedFunctionDescriptor`: `{"memcpy_0xE01C197FDF1FC6D3",{FunctionType::MEMCPY,Contiguous{{0,2}},Overlap{{2,64}},Loop{{64,128},16},AlignedLoop{Loop{{128,256},32},16,AlignArg::_1},Accelerator{{256,kMaxSize}},ElementTypeClass::NATIVE}},` Note: because the fields are `Optional`, we have to write the type name so the compiler can distinguish between `llvm::None` construction and `T` construction; it also helps readability. Contrary to most TableGen files in LLVM, the autogenerated file will eventually be read to compare implementations.
80 ↗	(On Diff #378686)	I've reworded it, let me know what you think.
117–118 ↗	(On Diff #378686)	I wrote this to allow having an individual size in the middle of an overlap, e.g. `... if(size == 24) return Op<24>(); if(size < 16) return Op<Overlap<8>>(); if(size < 32) return Op<Overlap<16>>();` In this example the range 8-31 is handled by an overlapping strategy, except for the size 24, which could be particularly hot and worth optimizing for. It is not in the automemcpy paper, but I thought it would be a nice addition. I'll remove it for now since it's not implemented and probably confusing.
libc/benchmarks/automemcpy/README.md
4	It really is the parent directory. It has to be a relative path so it works on GitHub.
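
The overlapping strategy discussed in the reply on lines 117–118 can be illustrated with a small copy routine. This is a sketch under assumptions: `copyBlock`, `copyOverlap`, and `copySmall` are hypothetical names, `std::memcpy` stands in for the fixed-size primitives of the library, and source and destination buffers are assumed not to alias.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies exactly N bytes.
template <std::size_t N> void copyBlock(char *Dst, const char *Src) {
  std::memcpy(Dst, Src, N);
}

// Handles any Size in [N, 2N] with two N-byte copies: one for the first N
// bytes and one for the last N bytes. The two copies overlap when Size < 2N.
template <std::size_t N>
void copyOverlap(char *Dst, const char *Src, std::size_t Size) {
  copyBlock<N>(Dst, Src);
  copyBlock<N>(Dst + Size - N, Src + Size - N);
}

// Hypothetical dispatcher for sizes in [8, 32) with a dedicated branch for the
// (supposedly hot) size 24. Returns false when the size is not handled.
inline bool copySmall(char *Dst, const char *Src, std::size_t Size) {
  if (Size == 24) {
    copyBlock<24>(Dst, Src);
    return true;
  }
  if (Size >= 8 && Size < 16) {
    copyOverlap<8>(Dst, Src, Size);
    return true;
  }
  if (Size >= 16 && Size < 32) {
    copyOverlap<16>(Dst, Src, Size);
    return true;
  }
  return false;
}
```

The appeal of the overlapping form is that a whole range of sizes is served by two fixed-size operations and no loop, at the cost of writing some bytes twice.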

Harbormaster completed remote builds in B128558: Diff 379304. Oct 13 2021, 1:50 AM

courbet accepted this revision. Oct 13 2021, 2:29 AM

courbet added inline comments.

libc/benchmarks/automemcpy/FunctionDescriptor.h
127 ↗	(On Diff #379304)	detail*

This revision is now accepted and ready to land. Oct 13 2021, 2:29 AM

Fix typo

Harbormaster completed remote builds in B128598: Diff 379358. Oct 13 2021, 6:22 AM

Fix typo in README
move FunctionDescriptor in include folder

Harbormaster completed remote builds in B128841: Diff 379681. Oct 14 2021, 5:54 AM

gchatelet mentioned this in D111801: [libc] automemcpy. Oct 14 2021, 6:19 AM

Submitted within D111801.

Revision Contents

Path	Size
libc/benchmarks/automemcpy/README.md	111 lines
libc/benchmarks/automemcpy/include/automemcpy/FunctionDescriptor.h	159 lines

Diff 379681

libc/benchmarks/automemcpy/README.md

This file was added.

				This folder contains an implementation of [automemcpy: A framework for automatic generation of fundamental memory operations](https://research.google/pubs/pub50338/).

				It uses the [Z3 theorem prover](https://github.com/Z3Prover/z3) to enumerate a subset of valid memory function implementations. These implementations are then materialized as C++ code and can be [benchmarked](../) against various [size distributions](../distributions). This process helps design efficient implementations for a particular environment (size distribution, processor, or custom compilation options).

				This is not enabled by default, as it is mostly useful when working on tuning the library implementation. To build it, use `LIBC_BUILD_AUTOMEMCPY=ON` (see below).

				## Prerequisites

				You may need to install `Z3` from source if it's not available on your system.
				Here we show instructions to install it into `<Z3_INSTALL_DIR>`.
				You may need to `sudo` to `make install`.

				```shell
				mkdir -p ~/git
				cd ~/git
				git clone https://github.com/Z3Prover/z3.git
				cd z3
				python scripts/mk_make.py --prefix=<Z3_INSTALL_DIR>
				cd build
				make -j
				make install
				```

				## Configuration

				```shell
				mkdir -p <BUILD_DIR>
				cd <LLVM_PROJECT_DIR>/llvm
				cmake -DCMAKE_C_COMPILER=/usr/bin/clang \
				-DCMAKE_CXX_COMPILER=/usr/bin/clang++ \
				-DLLVM_ENABLE_PROJECTS="libc" \
				-DLLVM_ENABLE_Z3_SOLVER=ON \
				-DLLVM_Z3_INSTALL_DIR=<Z3_INSTALL_DIR> \
				-DLIBC_BUILD_AUTOMEMCPY=ON \
				-DCMAKE_BUILD_TYPE=Release \
				-B<BUILD_DIR>
				```

				## Targets and compilation

				There are three main CMake targets:
				1. `automemcpy_implementations`
				- runs `Z3` and materializes valid memory functions as C++ code; a message displays its on-disk location.
				- the source code is then compiled using the native host optimizations (i.e. `-march=native` or `-mcpu=native` depending on the architecture).
				2. `automemcpy`
				- the binary that benchmarks the autogenerated implementations.
				3. `automemcpy_result_analyzer`
				- the binary that analyses the benchmark results.

				You only need to build the two binaries, as both pull in the autogenerated code as a dependency.

				```shell
				make -C <BUILD_DIR> -j automemcpy automemcpy_result_analyzer
				```

				## Running the benchmarks

				Make sure to save the results of the benchmark as a json file.

				```shell
				<BUILD_DIR>/bin/automemcpy --benchmark_out_format=json --benchmark_out=<RESULTS_DIR>/results.json
				```

				### Additional useful options


				- `--benchmark_min_time=.2`

				By default, each function is benchmarked for at least one second; here we lower it to 200ms.

				- `--benchmark_filter="BM_Memset|BM_Bzero"`

				By default, all functions are benchmarked; here we restrict them to `memset` and `bzero`.

				Other options might be useful, use `--help` for more information.

				## Analyzing the benchmarks

				Analysis is performed by running `automemcpy_result_analyzer` on one or more json result files.

				```shell
				<BUILD_DIR>/bin/automemcpy_result_analyzer <RESULTS_DIR>/results.json
				```

				What it does:
				1. Gathers all throughput values for each function / distribution pair and picks the median one.\
				This allows picking a representative value over many runs of the benchmark. Please make sure all the runs happen under similar circumstances.

				2. For each distribution, looks at the span of throughputs for functions of the same type (e.g. for distribution `A`, memcpy throughput spans from 2GiB/s to 5GiB/s).

				3. For each distribution, gives a normalized score to each function (e.g. for distribution `A`, function `M` scores 0.65).\
				This score is then turned into a grade (`EXCELLENT`, `VERY_GOOD`, `GOOD`, `PASSABLE`, `INADEQUATE`, `MEDIOCRE`, `BAD`) so that each distribution categorizes how well functions perform on it.

				4. A [Majority Judgement](https://en.wikipedia.org/wiki/Majority_judgment) process is then used to categorize each function. This enables finer analysis of how distributions agree on which function is better. In the following example, `Function_1` and `Function_2` are rated `EXCELLENT` but looking at the grade's distribution might help decide which is best.

				|            | EXCELLENT | VERY_GOOD | GOOD | PASSABLE | INADEQUATE | MEDIOCRE | BAD |
				|------------|:---------:|:---------:|:----:|:--------:|:----------:|:--------:|:---:|
				| Function_1 |     7     |     1     |   2  |          |            |          |     |
				| Function_2 |     6     |     4     |      |          |            |          |     |

				The tool outputs the histogram of grades for each function. In case of tie, other dimensions might help decide (e.g. code size, performance on other microarchitectures).

				```
				EXCELLENT |█▁▂ | Function_0
				EXCELLENT |█▅ | Function_1
				VERY_GOOD |▂█▁ ▁ | Function_2
				GOOD | ▁█▄ | Function_3
				PASSABLE | ▂▆▄█ | Function_4
				INADEQUATE | ▃▃█▁ | Function_5
				MEDIOCRE | █▆▁| Function_6
				BAD | ▁▁█| Function_7
				```
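
The four analysis steps above can be sketched in a few lines of C++. This is a simplified illustration, not the code of `automemcpy_result_analyzer`: the function names are hypothetical, min-max normalization is assumed for the score, the grade thresholds are assumed to be evenly spaced over [0, 1], and the majority grade is taken as the median grade (the worse of the two middle grades when the count is even).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Grades ordered from best to worst, in the order listed above.
enum class Grade { EXCELLENT, VERY_GOOD, GOOD, PASSABLE, INADEQUATE, MEDIOCRE, BAD };

// Step 1: median throughput over several runs (upper median for even counts).
inline double median(std::vector<double> Samples) {
  assert(!Samples.empty());
  const std::size_t Mid = Samples.size() / 2;
  std::nth_element(Samples.begin(), Samples.begin() + Mid, Samples.end());
  return Samples[Mid];
}

// Steps 2 and 3: score in [0, 1] relative to the span [Min, Max] of
// throughputs observed for one function type on one distribution.
inline double normalize(double Value, double Min, double Max) {
  return Max > Min ? (Value - Min) / (Max - Min) : 1.0;
}

// Step 3: map a score to a grade using evenly spaced buckets (assumption).
inline Grade toGrade(double Score) {
  const int NumGrades = 7;
  int Bucket = static_cast<int>((1.0 - Score) * NumGrades);
  Bucket = std::clamp(Bucket, 0, NumGrades - 1);
  return static_cast<Grade>(Bucket);
}

// Step 4: majority grade of a function, one grade per distribution.
inline Grade majorityGrade(std::vector<Grade> Grades) {
  assert(!Grades.empty());
  std::sort(Grades.begin(), Grades.end()); // Best grades first.
  return Grades[Grades.size() / 2];
}
```

With the grade counts from the table above, both `Function_1` (7, 1, 2) and `Function_2` (6, 4) get a majority grade of `EXCELLENT` under this definition, which is why the full histogram is needed to tell them apart.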

libc/benchmarks/automemcpy/include/automemcpy/FunctionDescriptor.h

This file was added.

				//===-- Pod structs to describe a memory function ---------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_BENCHMARKS_AUTOMEMCPY_COMMON_H
				#define LLVM_LIBC_BENCHMARKS_AUTOMEMCPY_COMMON_H

				#include <climits>
				#include <cstddef>
				#include <llvm/ADT/ArrayRef.h>
				#include <llvm/ADT/Hashing.h>
				#include <llvm/ADT/Optional.h>
				#include <llvm/ADT/StringRef.h>
				#include <tuple>

				namespace llvm {
				namespace automemcpy {

				// Boilerplate code to be able to sort and hash types.
				#define COMPARABLE_AND_HASHABLE(T, ...) \
				  inline auto asTuple() const { return std::tie(__VA_ARGS__); } \
				  bool operator==(const T &O) const { return asTuple() == O.asTuple(); } \
				  bool operator<(const T &O) const { return asTuple() < O.asTuple(); } \
				  struct Hasher { \
				    std::size_t operator()(const T &K) const { \
				      return llvm::hash_value(K.asTuple()); \
				    } \
				  };

				// Represents the maximum value for the size parameter of a memory function.
				// This is an `int` so we can use it as an expression in Z3.
				// It also allows for a more readable and compact representation when storing
				// the SizeSpan in the autogenerated C++ file.
				static constexpr int kMaxSize = INT_MAX;

				// This mimics the `Arg` type in libc/src/string/memory_utils/elements.h without
				// having to depend on it.
				enum class AlignArg { _1, _2, ARRAY_SIZE };

				// Describes a range of sizes.
				// We use the begin/end representation instead of first/last to allow for empty
				// range (i.e. Begin == End)
				struct SizeSpan {
				  size_t Begin = 0;
				  size_t End = 0;

				  COMPARABLE_AND_HASHABLE(SizeSpan, Begin, End)
				};

				// Describes a contiguous region.
				// In such a region all sizes are handled individually.
				// e.g. with Span = {0, 2};
				//   if(size == 0) return Handle<0>();
				//   if(size == 1) return Handle<1>();
				struct Contiguous {
				  SizeSpan Span;

				  COMPARABLE_AND_HASHABLE(Contiguous, Span)
				};

				// This struct represents a range of sizes over which to use an overlapping
				// strategy. An overlapping strategy of size N handles all sizes from N to 2xN.
				// The span may represent several contiguous overlaps.
				// e.g. with Span = {16, 128};
				//   if(size >= 16 and size < 32) return Handle<Overlap<16>>();
				//   if(size >= 32 and size < 64) return Handle<Overlap<32>>();
				//   if(size >= 64 and size < 128) return Handle<Overlap<64>>();
				struct Overlap {
				  SizeSpan Span;

				  COMPARABLE_AND_HASHABLE(Overlap, Span)
				};

				// Describes a region using a loop handling BlockSize bytes at a time. The
				// remaining bytes of the loop are handled with an overlapping operation.
				struct Loop {
				  SizeSpan Span;
				  size_t BlockSize = 0;

				  COMPARABLE_AND_HASHABLE(Loop, Span, BlockSize)
				};

				// Same as `Loop` but starts by aligning a buffer on `Alignment` bytes.
				// A first operation handling `Alignment` bytes is performed, followed by a
				// sequence of Loop.BlockSize-byte operations. The Loop starts processing from
				// the next aligned byte in the chosen buffer. The remaining bytes of the loop
				// are handled with an overlapping operation.
				struct AlignedLoop {
				  Loop Loop;
				  size_t Alignment = 0;            // Size of the alignment.
				  AlignArg AlignTo = AlignArg::_1; // Which buffer to align.

				  COMPARABLE_AND_HASHABLE(AlignedLoop, Loop, Alignment, AlignTo)
				};

				// Some processors offer special instructions to handle the memory function
				// completely; we refer to such instructions as accelerators.
				struct Accelerator {
				  SizeSpan Span;

				  COMPARABLE_AND_HASHABLE(Accelerator, Span)
				};

				// The memory functions are assembled out of primitives that can be implemented
				// with regular scalar operations (SCALAR), with the help of vector or bitcount
				// instructions (NATIVE) or by deferring it to the compiler (BUILTIN).
				enum class ElementTypeClass {
				  SCALAR,
				  NATIVE,
				  BUILTIN,
				};

				// A simple enum to categorize which function is being implemented.
				enum class FunctionType {
				  MEMCPY,
				  MEMCMP,
				  BCMP,
				  MEMSET,
				  BZERO,
				};

				// This struct describes the skeleton of the implementation, it does not go into
				// every detail but is enough to uniquely identify the implementation.
				struct FunctionDescriptor {
				  FunctionType Type;
				  Optional<Contiguous> Contiguous;
				  Optional<Overlap> Overlap;
				  Optional<Loop> Loop;
				  Optional<AlignedLoop> AlignedLoop;
				  Optional<Accelerator> Accelerator;
				  ElementTypeClass ElementClass;

				  COMPARABLE_AND_HASHABLE(FunctionDescriptor, Type, Contiguous, Overlap, Loop,
				                          AlignedLoop, Accelerator, ElementClass)

				  inline size_t id() const { return llvm::hash_value(asTuple()); }
				};

				// Same as above but with the function name.
				struct NamedFunctionDescriptor {
				  StringRef Name;
				  FunctionDescriptor Desc;
				};

				template <typename T> llvm::hash_code hash_value(const ArrayRef<T> &V) {
				  return llvm::hash_combine_range(V.begin(), V.end());
				}
				template <typename T> llvm::hash_code hash_value(const T &O) {
				  return llvm::hash_value(O.asTuple());
				}

				} // namespace automemcpy
				} // namespace llvm

				#endif /* LLVM_LIBC_BENCHMARKS_AUTOMEMCPY_COMMON_H */
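
The `COMPARABLE_AND_HASHABLE` macro in the header above builds equality, ordering, and hashing from a single `std::tie` of the fields. The sketch below shows the same pattern written out by hand for one struct, using only the standard library. The `Span` type and the hash-combining formula are illustrative assumptions that stand in for `SizeSpan` and `llvm::hash_value`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <tuple>
#include <unordered_set>

// Hand-expanded version of the pattern for a two-field struct.
struct Span {
  std::size_t Begin = 0;
  std::size_t End = 0;

  // A tuple of references: comparisons are lexicographic over the fields.
  auto asTuple() const { return std::tie(Begin, End); }
  bool operator==(const Span &O) const { return asTuple() == O.asTuple(); }
  bool operator<(const Span &O) const { return asTuple() < O.asTuple(); }

  // Hasher usable with unordered containers. The combining formula is an
  // assumption made for this sketch, not what llvm::hash_value computes.
  struct Hasher {
    std::size_t operator()(const Span &K) const {
      const std::size_t H1 = std::hash<std::size_t>{}(K.Begin);
      const std::size_t H2 = std::hash<std::size_t>{}(K.End);
      return H1 ^ (H2 + 0x9e3779b9 + (H1 << 6) + (H1 >> 2));
    }
  };
};
```

With these members in place, `std::set<Span>` and `std::unordered_set<Span, Span::Hasher>` both deduplicate equal spans. This is presumably what makes it convenient to sort and deduplicate generated descriptors.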