Message ID: f4b5d106-8176-b7bd-709b-d435188783b0@acm.org
State: New
Series: [PR,tree-optimization/83022] malloc/memset->calloc too aggressive
On 11/17/2017 09:07 AM, Nathan Sidwell wrote:
> We currently optimize a malloc/memset pair into a calloc call (when the
> values match, of course).  This turns out to be a pessimization for
> mysql 5.6, where the allocator looks like:
>
>   void *ptr = malloc (size);
>   if (ptr && other_condition)
>     memset (ptr, 0, size);
>
> other_condition is false sufficiently often for this to be a noticeable
> performance degradation.
>
> mysql 5.7 moved the above code around to explicitly call calloc or
> malloc.  Maybe to avoid this optimization?
>
> This patch restricts the optimization to require
>
> a) malloc and memset in the same bb (of course, this is UB if malloc
> returns NULL), or,
>
> b) memset in a single-pred block whose predecessor contains the malloc
> call, AND the condition being a comparison of the malloc result against
> NULL.
>
> For #b I only check for NE/EQ comparison, and don't bother determining
> whether the test is the correct sense.  If the user's written it the
> wrong way round, we're in UB territory anyway.
>
> The existing tests for this optimization do not regress.  The new test
> shows the optimization being repressed, as expected.
>
> ok?
>
> nathan
>
> --
> Nathan Sidwell
>
> 83022.diff
>
> 2017-11-17  Nathan Sidwell  <nathan@acm.org>
>
> 	PR tree-optimization/83022
> 	* tree-ssa-strlen.c (handle_builtin_memset): Don't fold malloc &
> 	memset if the condition is complicated.
>
> 	PR tree-optimization/83022
> 	* gcc.dg/tree-ssa/pr83022.c: New.

ISTM the better way to drive this is to query the branch probabilities.
It'd probably be simpler too.  Is there some reason that's not a good
solution?

Jeff
On 11/17/2017 01:37 PM, Jeff Law wrote:
> ISTM the better way to drive this is to query the branch probabilities.
> It'd probably be simpler too.  Is there some reason that's not a good
> solution?

(a) I'd have to learn how to do that

(b) in the case where the condition is just a null check,
ma.cc.046t.profile_estimate considers the memset reachable 53.47% of
the time (see defect 83023)

when the condition is 'ptr && some_bool' we think it reachable 33% of
the time.

It's not clear to me what a sensible threshold might be.  I suppose more
realistic probabilities are 99.99% in the first case and 50% in the
second case?

(c) the performance skew is presumably proportional to the size
parameter.  A small size is probably swamped by the allocation effort
itself; at a large size, the memset cost might start to dominate.
Profiling shows that it is the kernel burning time flushing the TLB
during a syscall.

My guess is that the usage pattern repeatedly allocates and frees a
large chunk of uninitialized memory.  That ends up not being syscally at
all.  With the change to use calloc, each of those allocations turns out
to be a large TLB churn getting read-as-zero anonymous pages.  And
possibly similar churn returning freed pages to the OS.

nathan
On 11/17/2017 11:57 AM, Nathan Sidwell wrote:
> On 11/17/2017 01:37 PM, Jeff Law wrote:
>
>> ISTM the better way to drive this is to query the branch probabilities.
>> It'd probably be simpler too.  Is there some reason that's not a good
>> solution?
>
> (a) I'd have to learn how to do that
Yea, but that's easy :-)  I think you just need to look at the edge

>
> (b) in the case where the condition is just a null check,
> ma.cc.046t.profile_estimate considers the memset reachable 53.47% of
> the time (see defect 83023)
Which seems like an obvious goof to me.

>
> when the condition is 'ptr && some_bool' we think it reachable 33% of
> the time.
>
> It's not clear to me what a sensible threshold might be.  I suppose more
> realistic probabilities are 99.99% in the first case and 50% in the
> second case?
Yea, in isolation that's what I'd expect.  Jan or Martin L might be able
to provide guidance on how to fix this.  Odds are we just need a
predictor that says allocators almost always return non-null.

>
> (c) the performance skew is presumably proportional to the size
> parameter.  A small size is probably swamped by the allocation effort
> itself; at a large size, the memset cost might start to dominate.
> Profiling shows that it is the kernel burning time flushing the TLB
> during a syscall.
Agreed.  I'd come to the same conclusion as well -- it'd have to be a
combination of the branch probability and the size.  Given we don't have
any good data, I'd be willing to go with using the benchmark to guide
selection.

> My guess is that the usage pattern repeatedly allocates and frees a
> large chunk of uninitialized memory.  That ends up not being syscally at
> all.  With the change to use calloc, each of those allocations turns out
> to be a large TLB churn getting read-as-zero anonymous pages.  And
> possibly similar churn returning freed pages to the OS.
Right.  As the request gets large glibc will switch you to just getting
one or more mmap'd pages which are likely backed anonymously.
So it's relatively cheap.  But once the memset touches all of them,
well, you lose.

I guess ultimately I'm not rejecting the patch, I just think that
there's a better solution that (in theory) isn't a ton of work.  We
have to fix the predictor, query the probabilities and plug the result
into the heuristic.

If you could poke a little here it'd be appreciated.  If it gets ugly,
then I'm comfortable returning to your patch.

jeff
2017-11-17  Nathan Sidwell  <nathan@acm.org>

	PR tree-optimization/83022
	* tree-ssa-strlen.c (handle_builtin_memset): Don't fold malloc &
	memset if the condition is complicated.

	PR tree-optimization/83022
	* gcc.dg/tree-ssa/pr83022.c: New.

Index: tree-ssa-strlen.c
===================================================================
--- tree-ssa-strlen.c	(revision 254841)
+++ tree-ssa-strlen.c	(working copy)
@@ -2394,7 +2394,13 @@ handle_builtin_malloc (enum built_in_fun
 
 /* Handle a call to memset.
    After a call to calloc, memset(,0,) is unnecessary.
-   memset(malloc(n),0,n) is calloc(n,1).  */
+   calloc is also:
+     p = malloc (n); if (p) memset (p,0,n);
+   We don't want to optimize:
+     p = malloc (n); if (p && other_cond) memset (p,0,n);
+   because other_cond may be false sufficiently often for the memset
+   to be a noticeable performance hit.  */
 
 static bool
 handle_builtin_memset (gimple_stmt_iterator *gsi)
@@ -2422,7 +2428,42 @@ handle_builtin_memset (gimple_stmt_itera
   else if (code1 == BUILT_IN_MALLOC
 	   && operand_equal_p (gimple_call_arg (stmt1, 0), size, 0))
     {
+      /* Optimize if malloc is in same block, or it is in immediate
+	 predecessor block AND condition is simple test of
+	 non-nullness.  */
       gimple_stmt_iterator gsi1 = gsi_for_stmt (stmt1);
+      if (gsi1.bb != gsi->bb)
+	{
+	  if (!single_pred_p (gsi->bb))
+	    /* memset not in a single predecessor bb.  */
+	    return true;
+
+	  edge e = EDGE_PRED (gsi->bb, 0);
+	  if (e->src != gsi1.bb)
+	    /* Single predecessor bb does not contain malloc.  */
+	    return true;
+
+	  gcond *cond = as_a <gcond *>
+	    (gimple_seq_last (*bb_seq_addr (gsi1.bb)));
+	  if (!integer_zerop (gimple_cond_rhs (cond)))
+	    /* Not comparing against zero.  */
+	    return true;
+	  if (!operand_equal_p (gimple_cond_lhs (cond), ptr, 0))
+	    /* Not comparing malloc'd pointer.  */
+	    return true;
+
+	  tree_code code = gimple_cond_code (cond);
+	  if (code != NE_EXPR && code != EQ_EXPR)
+	    /* Comparison is not == or !=.  Something very strange,
+	       but implementation-defined, is happening.  */
+	    return true;
+
+	  /* We don't have to check the edge truthiness.  If it's the
+	     wrong way round, we're calling memset with a NULL
+	     pointer, which is undefined.  */
+	}
+
+      /* Turn the malloc into calloc.  */
       update_gimple_call (&gsi1, builtin_decl_implicit (BUILT_IN_CALLOC),
			  2, size, build_one_cst (size_type_node));
       si1->nonzero_chars = build_int_cst (size_type_node, 0);
@@ -2431,6 +2472,8 @@ handle_builtin_memset (gimple_stmt_itera
     }
   else
     return true;
+
+  /* Delete the memset.  */
   tree lhs = gimple_call_lhs (stmt2);
   unlink_stmt_vdef (stmt2);
   if (lhs)
Index: testsuite/gcc.dg/tree-ssa/pr83022.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/pr83022.c	(revision 0)
+++ testsuite/gcc.dg/tree-ssa/pr83022.c	(working copy)
@@ -0,0 +1,22 @@
+/* PR tree-optimization/83022 malloc followed by conditional memset
+   optimized to calloc, which pessimizes code when condition is
+   sufficiently false.  */
+
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+typedef __SIZE_TYPE__ size_t;
+extern void *malloc (size_t) __attribute__((__nothrow__,__malloc__));
+extern void *memset (void *, int, size_t) __attribute__ ((__nothrow__,__nonnull__ (1)));
+
+
+void *m (size_t s, int c)
+{
+  void *r = malloc (s);
+  if (r && c)
+    memset (r, 0, s);
+  return r;
+}
+
+/* { dg-final { scan-tree-dump "memset" "optimized" } }
+   { dg-final { scan-tree-dump "malloc" "optimized" } }
+   { dg-final { scan-tree-dump-not "calloc" "optimized" } } */