This is an archive of the discontinued LLVM Phabricator instance.

[PR24643] Speeding up FoldingSetNodeID::AddPointer
AbandonedPublic

Authored by bcraig on Sep 1 2015, 9:59 AM.

Details

Summary

The old code resulted in placement new calls on Linux64, and memmove
calls on Windows. The new code gets to use memcpy instead. This
change makes an analysis of a large .c file on Linux64 go from
6m57.730s to 6m44.317s.

This effectively reverts r135364.
The commit log for that change is...

Simplify & microoptimize code. No intended functionality change.

It doesn't include any rationale or metrics regarding the speedup.

Diff Detail

Event Timeline

bcraig updated this revision to Diff 33706.Sep 1 2015, 9:59 AM
bcraig retitled this revision from to [PR24643] Speeding up FoldingSetNodeID::AddPointer.
bcraig updated this object.
bcraig added reviewers: chandlerc, krememek.
bcraig added a subscriber: llvm-commits.
bkramer edited edge metadata.Sep 16 2015, 8:24 AM

This is very strange. What compiler are you using to compile LLVM?

In trunk, SmallVector<unsigned>::append will always use memcpy, never memmove. It also has a constant size so the copy should be lowered into a single movq on x86_64. I verified this with clang -O3 and gcc 5.1 -O3, both generate optimal code for append and messy code for AddInteger.

What's possible is that all your pointers fit into 32 bits so the overall footprint of the FoldingSetNodeID is reduced (AddInteger truncates to 32 bits if the integer fits). Can you confirm that?

I am building with Visual Studio 2013, targeting x86 (not x64).
I am also building with clang 3.4, targeting Linux x64.

The memmove call was only on the Win32 builds, and it was definitely a result of the append call. It was the top 'inclusive time' function for the toy program that I was profiling. MSVC's std::uninitialized_copy eventually turns into memmove.

I will dig in a bit more and see if my typical pointers are small.

I can confirm that on my machine, the addresses that were typically passed to AddPointer were "small". They tended to fit in 32-bits. Stack addresses tended to be large.

I will also say that the generated assembly on Linux for my patch is significantly larger / more complicated than the original code. Despite the increased complexity, it does seem to run faster. There's also the substantial benefits that I get on Windows by avoiding the memmove.

Ping. Hoping to get another round of review on this since I've added my investigation notes.

bcraig added a comment.Oct 1 2015, 6:22 AM

Is there any further change or investigation necessary for this patch?

bcraig abandoned this revision.Jul 5 2016, 10:43 AM