Improve shared_ptr dtor performance
When the last owner is destroyed without contention, skip the atomic
store to __shared_weak_owners_. This shifts some cost from ordinary
shared_ptr use to weak_ptr use.
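
A minimal sketch of the idea, not the actual libc++ source. ControlBlock,
weak_owners, add_weak, release_weak, and the destroyed flag are hypothetical
stand-ins, and the sketch assumes the counter is stored biased by one, so that
a value of 0 means exactly one owner remains:

```cpp
#include <atomic>

// Hypothetical stand-in for the shared_ptr control block.
struct ControlBlock {
    // Assumption: biased by one, so 0 means a single remaining owner.
    std::atomic<long> weak_owners{0};
    // Stands in for destroying and freeing the control block.
    bool destroyed = false;

    void add_weak() {
        weak_owners.fetch_add(1, std::memory_order_relaxed);
    }

    void release_weak() {
        if (weak_owners.load(std::memory_order_acquire) == 0) {
            // Fast path: we are the last owner, so no other thread may
            // legally touch the counter. Skip the atomic write entirely.
            destroyed = true;
        } else if (weak_owners.fetch_sub(1, std::memory_order_acq_rel) == 0) {
            // Slow path: there may be contention, so use an atomic
            // read-modify-write. A previous value of 0 means we were last.
            destroyed = true;
        }
    }
};
```

In this sketch, a release that is not the last one pays for the extra load
and branch before the atomic decrement, which is consistent with the
weak_ptr slowdown reported in this message.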
On x86_64, this improves shared_ptr ctor+dtor performance by about 8%.
Old benchmarks/shared_ptr_create_destroy.cpp: 26.8638 seconds
New benchmarks/shared_ptr_create_destroy.cpp: 24.6019 seconds
weak_ptr increment / decrement is now about 29% slower.
Old benchmarks/weak_ptr_inc_dec_ref.cpp: 11.2892 seconds
New benchmarks/weak_ptr_inc_dec_ref.cpp: 14.5522 seconds
The shared_ptr increment / decrement code path did not degrade on x86_64.
Old benchmarks/shared_ptr_inc_dec_ref.cpp: 13.0896 seconds
New benchmarks/shared_ptr_inc_dec_ref.cpp: 13.0463 seconds