Message ID: alpine.DEB.2.10.1311100140300.4153@laptop-mg.saclay.inria.fr
On Sun, Nov 10, 2013 at 04:27:00PM +0100, Marc Glisse wrote:
> Hello,
>
> I am posting this patch to get some feedback on the approach. The
> goal is to replace malloc+free with a stack allocation (a decl
> actually) when the size is a small constant.
>
Why constrain yourself to small sizes? The benefits of stack allocation
are speed and lower memory consumption, due to the lack of
fragmentation. A possible way is to keep thread-local bounds on a stack
and call a function with custom logic when an allocation falls outside
those bounds. Below is a simple implementation which creates a separate
stack for that (for simplicity, and because it does not need to find the
bounds of the thread stack). With a bit more work it could do
allocations in a way similar to split stacks.

> For testing, I hijacked the "leaf" attribute, but it isn't right,
> I'll remove it from the list (not sure what I'll do for the
> testcases then). What I'd want instead is a "returns" attribute that
> means the function will return (either by "return" or an exception),
> as opposed to having an infinite loop, calling exit or longjmp, etc
> (sched-deps.c has a related call_may_noreturn_p). The situation I am
> trying to avoid is:
> p=malloc(12);
> f(p)
> free(p)
>
> where f contains somewhere:
> free(p); exit(42);
> (or longjmp or something else that takes the regular call to free
> out of the execution flow).
>
One of the plans for extending malloc is to add a custom free handler;
the interface would be something like dalloc(amount, destructor), which
would invoke the destructor on free. The main motivation is a memory
pool that can be returned by free. With that extension it would be
possible to mark a pointer so that its free would be a no-op.

> The size above which the malloc->stack transformation is not applied
> should depend on a parameter, I don't know if it should get its own
> or depend on an existing one. In any case, I'd like to avoid ending
> up with a ridiculously low threshold (my programs use GMP, which
> internally uses alloca up to 65536 bytes (even in recursive
> functions that have a dozen allocations), so I don't want gcc to
> tell me that 50 bytes are too much).
>
> A program with a double-free may, with this patch, end up crashing
> on the first free instead of the second, but that's an invalid
> program anyway.
>
#include <pthread.h>
#include <stdlib.h>

__thread void *__stack_from;
__thread void *__stack_cur;
__thread void *__stack_to;

void *__stack_alloc (size_t);
void __stack_free (void *);

/* Bump-allocate from the custom stack; fall back to __stack_alloc
   when the request does not fit.  */
#define STACK_ALLOC(size) ({				\
  void *__stack_new = __stack_cur + (size);		\
  void *__s;						\
  if (__stack_new < __stack_cur || __stack_new > __stack_to) \
    __s = __stack_alloc (size);				\
  else							\
    {							\
      __s = __stack_cur;				\
      __stack_cur = __stack_new;			\
    }							\
  __s;							\
})

/* Pointers inside the custom stack are freed by rewinding the cursor;
   anything else goes to __stack_free.  */
#define STACK_FREE(p) ({				\
  if ((void *) (p) < __stack_from || (void *) (p) > __stack_to) \
    __stack_free (p);					\
  else							\
    __stack_cur = (p);					\
})

static pthread_key_t key;

void
__stack_destroy (void *x)
{
  free (__stack_from);
}

void *
__stack_alloc (size_t size)
{
  if (!__stack_from)
    {
      __stack_from = malloc (1 << 18);
      __stack_to = __stack_from + (1 << 18);
      __stack_cur = __stack_from;
      pthread_key_create (&key, __stack_destroy);
      pthread_setspecific (key, &key);
    }
  return malloc (size);
}

void
__stack_free (void *p)
{
  free (p);
}
On Mon, Nov 11, 2013 at 11:08:14AM +0100, Ondřej Bílka wrote:
> On Sun, Nov 10, 2013 at 04:27:00PM +0100, Marc Glisse wrote:
> > I am posting this patch to get some feedback on the approach. The
> > goal is to replace malloc+free with a stack allocation (a decl
> > actually) when the size is a small constant.
>
> Why constrain yourself to small sizes? The benefits of stack
> allocation are speed and lower memory consumption due to the lack of
> fragmentation.

Because you can hardly predict what stack requirements the program will
have in the functions you call. In leaf functions, sure, you only need
to take care not to create allocations so large that they go over the
stack size; but if you call other functions, you usually can't know how
much stack they will really need (at least without sufficient IPA
analysis, and that is really hard because this happens too early):
fixed requirements for non-VLA variables on the stack, spill space,
function call arguments and other overhead, VLAs, and also other
mallocs turned into stack allocations.

So if you, say, have a malloc with a corresponding free shortly
afterwards but some call in between, and decide that it is fine to
change it into a stack allocation because it is half the size of the
remaining stack space, but then two frames down there is some non-VLA
variable that needs 3/4 of the old remaining stack space, you've turned
a correct program into a broken one.

	Jakub
On Mon, Nov 11, 2013 at 11:19:05AM +0100, Jakub Jelinek wrote:
> On Mon, Nov 11, 2013 at 11:08:14AM +0100, Ondřej Bílka wrote:
> > On Sun, Nov 10, 2013 at 04:27:00PM +0100, Marc Glisse wrote:
> > > I am posting this patch to get some feedback on the approach. The
> > > goal is to replace malloc+free with a stack allocation (a decl
> > > actually) when the size is a small constant.
> >
> > Why constrain yourself to small sizes? The benefits of stack
> > allocation are speed and lower memory consumption due to the lack
> > of fragmentation.
>
> Because you can hardly predict what the program will have as stack
> requirements in functions you call? In leaf functions sure, you only
> care not to create too large allocations that would go over the
> stack size, but if you call other functions, you usually can't know
> (at least without sufficient IPA analysis, but that is really hard
> because it is too early)

Which is completely irrelevant: checking it at run time is easy, as in
my prototype, which uses an alternative stack so the added allocations
do not count against the thread stack.

> how much stack will it really need (both fixed requirements for
> non-VLA vars on the stack, spill space, function call arguments and
> other overhead and VLAs and also these malloc turned into stack
> allocation).

This is an argument for turning alloca and VLAs into variants that
check whether there is enough stack and get space by other means when
there is not. That can use the same logic as described above.

> So, if you say have a malloc with corresponding free shortly
> afterwards but some call in between and decide that it is fine to
> change it into stack allocation when it is half the size of the
> remaining stack space, but then

Half of the total stack space; also, if we use the main stack, the
user should enlarge the stacks accordingly.

> two frames down there will be some non-VLA var that needs 3/4 of the
> old remaining stack space, you've turned a correct program into a
> broken one.
>
> 	Jakub
On Sun, Nov 10, 2013 at 4:27 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
> Hello,
>
> I am posting this patch to get some feedback on the approach. The goal
> is to replace malloc+free with a stack allocation (a decl actually)
> when the size is a small constant.
>
> For testing, I hijacked the "leaf" attribute, but it isn't right, I'll
> remove it from the list (not sure what I'll do for the testcases
> then). What I'd want instead is a "returns" attribute that means the
> function will return (either by "return" or an exception), as opposed
> to having an infinite loop, calling exit or longjmp, etc (sched-deps.c
> has a related call_may_noreturn_p). The situation I am trying to avoid
> is:
> p=malloc(12);
> f(p)
> free(p)
>
> where f contains somewhere:
> free(p); exit(42);
> (or longjmp or something else that takes the regular call to free out
> of the execution flow).

Instead of a new attribute, IPA should be used to compute this
call-graph property. An infinite loop wouldn't be an issue; the real
issue is that the callee must not free the object, which is only valid
if the free after the call isn't reached. IPA pure-const is used to
compute similar properties (may loop, may throw). You also have to make
sure to properly align the replacement.

> It passes most of the testsuite, but breaks a couple of
> __builtin_object_size tests:
>
> struct A
> {
>   char a[10];
>   int b;
>   char c[10];
> };
> int main(){
>   struct A *p = malloc (2 * sizeof (struct A));
>   assert (__builtin_object_size (&p->a, 1) == sizeof (p->a));
>   free (p);
> }
>
> __builtin_object_size now returns 56 instead of 10. I am not sure what
> to do about that.
>
> The size above which the malloc->stack transformation is not applied
> should depend on a parameter, I don't know if it should get its own or
> depend on an existing one. In any case, I'd like to avoid ending up
> with a ridiculously low threshold (my programs use GMP, which
> internally uses alloca up to 65536 bytes (even in recursive functions
> that have a dozen allocations), so I don't want gcc to tell me that 50
> bytes are too much).
>
> A program with a double-free may, with this patch, end up crashing on
> the first free instead of the second, but that's an invalid program
> anyway.
>
> I don't know if tree-ssa-forwprop is a good place for it, but it was
> convenient for a prototype. I like the idea of running it several
> times: replacing malloc with a decl early can help other
> optimizations, and other optimizations can make this one possible
> later.

We have a similar transform in CCP (fold_builtin_alloca_with_align),
which is there because CCP is a good place where size arguments may
become constant. That is the case you handle; you don't seem to handle
the variable malloc -> alloca transform, which would be possible if,
for example, VRP figured out an acceptable upper bound for the size.

Richard.

> The walk could be a bit expensive, but we only do it if we detected a
> malloc of a small constant and at least one matching free. I guess I
> should mark the malloc somehow to avoid performing the walk twice if
> there are several (but not enough) matching frees.
>
> stack_vec is nice, it would be convenient if bitmaps also had a
> version with a destructor so we don't need to explicitly deallocate
> them.
>
> --
> Marc Glisse
>
> Index: gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C
> ===================================================================
> --- gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C (revision 0)
> +++ gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C (working copy)
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +struct A {
> +  void*p;
> +  A(void*q) : p(q) {}
> +  ~A(){ __builtin_free(p); }
> +};
> +void f(void*)__attribute__((__leaf__));
> +void h(void*)__attribute__((__leaf__,__nothrow__));
> +void g(){
> +  void*p=__builtin_malloc(12);
> +  A a(p);
> +  f(p);
> +}
> +
> +void i(){
> +  void*p=__builtin_malloc(12);
> +  h(p);
> +  __builtin_free(p);
> +}
> +
> +/* { dg-final { scan-tree-dump-not "malloc" "optimized" } } */
> +/* { dg-final { cleanup-tree-dump "optimized" } } */
>
> Property changes on: gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C
> ___________________________________________________________________
> Added: svn:keywords
> ## -0,0 +1 ##
> +Author Date Id Revision URL
> \ No newline at end of property
> Added: svn:eol-style
> ## -0,0 +1 ##
> +native
> \ No newline at end of property
>
> Index: gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C
> ===================================================================
> --- gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C (revision 0)
> +++ gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C (working copy)
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +void f(void*)__attribute__((__leaf__));
> +void g(){
> +  void*p=__builtin_malloc(12);
> +  f(p);
> +  __builtin_free(p);
> +}
> +
> +/* { dg-final { scan-tree-dump-times "malloc" 1 "optimized" } } */
> +/* { dg-final { cleanup-tree-dump "optimized" } } */
>
> Property changes on: gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C
> ___________________________________________________________________
> Added: svn:eol-style
> ## -0,0 +1 ##
> +native
> \ No newline at end of property
> Added: svn:keywords
> ## -0,0 +1 ##
> +Author Date Id Revision URL
> \ No newline at end of property
>
> Index: gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c (revision 0)
> +++ gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c (working copy)
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +void f(void*)__attribute__((__leaf__));
> +void g(int m,int n){
> +  int i;
> +  void*p=__builtin_malloc(12);
> +  switch(n){
> +    case 1:
> +      for (i=0; i<m; ++i)
> +        f(p);
> +      break;
> +    case 2:
> +      __builtin_free(p);
> +      __builtin_exit(42);
> +    default:;
> +  }
> +  __builtin_free(p);
> +}
> +
> +/* { dg-final { scan-tree-dump-not "malloc" "optimized" } } */
> +/* { dg-final { cleanup-tree-dump "optimized" } } */
>
> Property changes on: gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c
> ___________________________________________________________________
> Added: svn:keywords
> ## -0,0 +1 ##
> +Author Date Id Revision URL
> \ No newline at end of property
> Added: svn:eol-style
> ## -0,0 +1 ##
> +native
> \ No newline at end of property
>
> Index: gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c (revision 0)
> +++ gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c (working copy)
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +void f(void*);
> +void g(){
> +  void*p=__builtin_malloc(12);
> +  f(p);
> +  __builtin_free(p);
> +}
> +
> +/* { dg-final { scan-tree-dump-times "malloc" 1 "optimized" } } */
> +/* { dg-final { cleanup-tree-dump "optimized" } } */
>
> Property changes on: gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c
> ___________________________________________________________________
> Added: svn:keywords
> ## -0,0 +1 ##
> +Author Date Id Revision URL
> \ No newline at end of property
> Added: svn:eol-style
> ## -0,0 +1 ##
> +native
> \ No newline at end of property
>
> Index: gcc/testsuite/gcc.dg/tree-ssa/pr19831-3.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/tree-ssa/pr19831-3.c (revision 204620)
> +++ gcc/testsuite/gcc.dg/tree-ssa/pr19831-3.c (working copy)
> @@ -1,31 +1,31 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -fdump-tree-optimized" } */
>
> -void test2(void)
> +void test2(int n)
>  {
> -  int *p = __builtin_malloc (sizeof (int) * 4);
> +  int *p = __builtin_malloc (sizeof (int) * n);
>    if (p == (void *)0)
>      __builtin_abort ();
>    __builtin_free (p);
>  }
>
> -void test5(int b)
> +void test5(int n)
>  {
> -  int *p = __builtin_malloc (sizeof (int) * 4);
> +  int *p = __builtin_malloc (sizeof (int) * n);
>    if (p)
>      __builtin_free (p);
>  }
>
> -void test6(void)
> +void test6(int n)
>  {
> -  int *p = __builtin_malloc (sizeof (int) * 4);
> +  int *p = __builtin_malloc (sizeof (int) * n);
>    if (p == (void *)0)
>      __builtin_abort ();
>    if (p)
>      __builtin_free (p);
>  }
>
>  /* We should be able to remove all malloc/free pairs with CDDCE.
>     Assume p was non-NULL for test2.
>     For test5, it doesn't matter if p is NULL or non-NULL.  */
>
> Index: gcc/tree-ssa-forwprop.c
> ===================================================================
> --- gcc/tree-ssa-forwprop.c (revision 204620)
> +++ gcc/tree-ssa-forwprop.c (working copy)
> @@ -33,20 +33,21 @@ along with GCC; see the file COPYING3.
>  #include "tree-ssanames.h"
>  #include "tree-dfa.h"
>  #include "tree-pass.h"
>  #include "langhooks.h"
>  #include "flags.h"
>  #include "expr.h"
>  #include "cfgloop.h"
>  #include "optabs.h"
>  #include "tree-ssa-propagate.h"
>  #include "tree-ssa-dom.h"
> +#include "params.h"
>
>  /* This pass propagates the RHS of assignment statements into use
>     sites of the LHS of the assignment.  It's basically a specialized
>     form of tree combination.  It is hoped all of this can disappear
>     when we have a generalized tree combiner.
>
>     One class of common cases we handle is forward propagating a single use
>     variable into a COND_EXPR.
>
>       bb0:
> @@ -1478,20 +1479,113 @@ constant_pointer_difference (tree p1, tr
>        }
>
>    for (i = 0; i < cnt[0]; i++)
>      for (j = 0; j < cnt[1]; j++)
>        if (exps[0][i] == exps[1][j])
>          return size_binop (MINUS_EXPR, offs[0][i], offs[1][j]);
>
>    return NULL_TREE;
>  }
>
> +/* We want to be sure that free (VAR) can't be called in another function, and
> +   the easiest way to ensure that is to prove that it is called in this
> +   function.  The hardest part is avoiding some call to a function that may not
> +   return because of exit, an infinite loop, setjmp, etc, where it might call
> +   free on VAR.  */
> +static bool
> +malloc_safe_on_stack (gimple stmt, vec<gimple, va_heap> *list_of_frees)
> +{
> +  tree var = gimple_call_lhs (stmt);
> +  basic_block bb = gimple_bb (stmt);
> +  stack_vec<basic_block, 20> bb_to_visit;
> +  bitmap bb_visited = BITMAP_ALLOC (NULL);
> +  bitmap_set_bit (bb_visited, bb->index);
> +  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
> +
> +next_stmt:
> +  gsi_next_nondebug (&gsi);
> +
> +handle_stmt:
> +  if (gsi_end_p (gsi))
> +    {
> +      edge e;
> +      edge_iterator ei;
> +      FOR_EACH_EDGE (e, ei, bb->succs)
> +        {
> +          bb_to_visit.safe_push (e->dest);
> +        }
> +      goto next_bb;
> +    }
> +  stmt = gsi_stmt (gsi);
> +  if (stmt_can_throw_external (stmt))
> +    /* We could give up only if VAR has escaped (same for return), but that
> +       means there is a memory leak, so don't bother.  */
> +    goto unsafe;
> +
> +  switch (gimple_code (stmt))
> +    // TODO: GIMPLE_ASM, EH related gimples?
> +    {
> +    /* We leave the function without calling free.  */
> +    case GIMPLE_RETURN:
> +      goto unsafe;
> +
> +    /* Statements that are irrelevant.  */
> +    case GIMPLE_ASSIGN:
> +    case GIMPLE_LABEL:
> +    case GIMPLE_NOP:
> +    /* Last stmt of the bb, handled by looking at the outgoing edges.  */
> +    case GIMPLE_GOTO:
> +    case GIMPLE_COND:
> +    // TODO: special case: if (VAR == 0) ...
> +    case GIMPLE_SWITCH:
> +      goto next_stmt;
> +
> +    case GIMPLE_CALL:
> +      {
> +        // TODO: __builtin_(abort|trap|exit|unreachable)
> +        // should be safe as well.
> +        if (gimple_call_builtin_p (stmt, BUILT_IN_FREE)
> +            && gimple_call_arg (stmt, 0) == var)
> +          {
> +            /* Nothing could throw an exception earlier in the block,
> +               so we don't need to check the EH edges.  */
> +            list_of_frees->safe_push (stmt);
> +            goto next_bb;
> +          }
> +        int flags = gimple_call_flags (stmt);
> +        // FIXME: leaf is actually not safe, we need some new ECF_* flags.
> +        if (flags & (ECF_CONST|ECF_PURE|ECF_LEAF|ECF_NOVOPS))
> +          goto next_stmt;
> +        goto unsafe;
> +      }
> +    default:
> +      goto unsafe;
> +    }
> +
> +next_bb:
> +  if (bb_to_visit.is_empty ())
> +    goto safe;
> +  bb = bb_to_visit.last ();
> +  bb_to_visit.pop ();
> +  if (!bitmap_set_bit (bb_visited, bb->index))
> +    goto next_bb;
> +  gsi = gsi_start_bb (bb);
> +  goto handle_stmt;
> +
> +unsafe:
> +  BITMAP_FREE (bb_visited);
> +  return false;
> +safe:
> +  BITMAP_FREE (bb_visited);
> +  return true;
> +}
> +
>  /* *GSI_P is a GIMPLE_CALL to a builtin function.
>     Optimize
>     memcpy (p, "abcd", 4);
>     memset (p + 4, ' ', 3);
>     into
>     memcpy (p, "abcd   ", 7);
>     call if the latter can be stored by pieces during expansion.  */
>
>  static bool
>  simplify_builtin_call (gimple_stmt_iterator *gsi_p, tree callee2)
> @@ -1681,20 +1775,64 @@ simplify_builtin_call (gimple_stmt_itera
>               gimple_call_set_arg (stmt2, 2,
>                                    build_int_cst (TREE_TYPE (len2), src_len));
>               unlink_stmt_vdef (stmt1);
>               gsi_remove (&gsi, true);
>               release_defs (stmt1);
>               update_stmt (stmt2);
>               return false;
>             }
>         }
>       break;
> +
> +    case BUILT_IN_FREE:
> +      {
> +        tree size;
> +        tree ptr = gimple_call_arg (stmt2, 0);
> +        if (TREE_CODE (ptr) != SSA_NAME)
> +          return false;
> +        stmt1 = SSA_NAME_DEF_STMT (ptr);
> +        /* TODO: handle calloc.  */
> +        if (gimple_call_builtin_p (stmt1, BUILT_IN_MALLOC)
> +            && host_integerp ((size = gimple_call_arg (stmt1, 0)), 1)
> +            /* param_value(...) / 10?  Some new parameter?  */
> +            && TREE_INT_CST_LOW (size)
> +               <= (unsigned HOST_WIDE_INT) PARAM_VALUE (PARAM_LARGE_STACK_FRAME))
> +          {
> +            gcc_checking_assert (gimple_call_lhs (stmt1) == ptr);
> +            stack_vec<gimple, 20> list_of_frees;
> +            if (!malloc_safe_on_stack (stmt1, &list_of_frees))
> +              return false;
> +            /* Declare array.  */
> +            tree elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
> +            tree array_type = build_array_type_nelts (elem_type,
> +                                                      TREE_INT_CST_LOW (size));
> +            tree var = create_tmp_var (array_type, NULL);
> +            tree p = fold_convert (TREE_TYPE (ptr), build_fold_addr_expr (var));
> +            /* Replace free with a clobber.  */
> +            int i;
> +            gimple free_stmt;
> +            FOR_EACH_VEC_ELT (list_of_frees, i, free_stmt)
> +              {
> +                tree clobber = build_constructor (array_type, NULL);
> +                TREE_THIS_VOLATILE (clobber) = 1;
> +                gimple clobber_stmt = gimple_build_assign (var, clobber);
> +                gimple_stmt_iterator gsi_free = gsi_for_stmt (free_stmt);
> +                gsi_replace (&gsi_free, clobber_stmt, false);
> +              }
> +            /* Replace malloc with the address of the variable.  */
> +            gimple_stmt_iterator gsi1 = gsi_for_stmt (stmt1);
> +            update_call_from_tree (&gsi1, p);
> +            update_stmt (gsi_stmt (gsi1));
> +            return true;
> +          }
> +      }
>      default:
>        break;
>      }
>    return false;
>  }
>
>  /* Checks if expression has type of one-bit precision, or is a known
>     truth-valued expression.  */
>  static bool
>  truth_valued_ssa_name (tree name)
On Mon, 11 Nov 2013, Ondřej Bílka wrote:

> On Sun, Nov 10, 2013 at 04:27:00PM +0100, Marc Glisse wrote:
>> Hello,
>>
>> I am posting this patch to get some feedback on the approach. The
>> goal is to replace malloc+free with a stack allocation (a decl
>> actually) when the size is a small constant.
>
> Why constrain yourself to small sizes?

I am trying to get something to actually work and be accepted in gcc.
That may mean being conservative.

> Below is a simple implementation which creates a separate stack for
> that (for simplicity and because it does not need to find bounds on
> the thread stack.)

One difficulty with using a stack is that lifetimes cannot partially
overlap, whereas with malloc+free they can. Using the main stack has
the advantage that I don't have to think about deallocation, it happens
automatically. And using a decl instead of alloca means that even if
malloc+free was in a loop, not deallocating won't make me grow the
stack use linearly in the number of iterations of the loop.

> With that extension it would be possible to mark a pointer so that
> its free would be a no-op.

That would help indeed, but it does require a libc that I haven't seen
yet, and it may cause trouble if people try to LD_PRELOAD a different
malloc implementation.
On Tue, Nov 12, 2013 at 01:16:14AM +0100, Marc Glisse wrote:
> On Mon, 11 Nov 2013, Ondřej Bílka wrote:
>
> > On Sun, Nov 10, 2013 at 04:27:00PM +0100, Marc Glisse wrote:
> > > I am posting this patch to get some feedback on the approach. The
> > > goal is to replace malloc+free with a stack allocation (a decl
> > > actually) when the size is a small constant.
> >
> > Why constrain yourself to small sizes?
>
> I am trying to get something to actually work and be accepted in
> gcc. That may mean being conservative.

That also may mean that you will only cover cases where it is not
needed. A malloc implementation will have a small per-thread cache for
small requests that does not need any locking, so the performance
difference will be quite small, and there may be a define which causes
constant-size mallocs to be inlined. Sizes from 256 bytes up are the
interesting case.

> > Below is a simple implementation which creates a separate stack for
> > that (for simplicity and because it does not need to find bounds on
> > the thread stack.)
>
> One difficulty with using a stack is that lifetimes cannot partially
> overlap, whereas with malloc+free they can.

Which you need to solve anyway if you want to do the conversion. You
need to pick a properly overlapping malloc+free area that is best
according to some criterion (like maximal expected memory usage.)

> Using the main stack has the advantage that I don't have to think
> about deallocation, it happens automatically.

And what is the logic of limiting sizes? Note that a leaf function have
higher priority for stack that nonleaf ones as when you do stack
allocation early it may kill lot of leaf allocations because of stack
concerns.

> And using a decl instead of alloca means that even if malloc+free was
> in a loop, not deallocating won't make me grow the stack use linearly
> in the number of iterations of the loop.

Actually this looks like an orthogonal optimization of memory reuse;
you can transform

  x = malloc(42);
  free(x);
  y = malloc(42);

to

  x = malloc(42);
  y = x;

It might be feasible to teach PRE to transform loops with repeated
allocation into an initial allocation and reuse.

> > With that extension it would be possible to mark a pointer so that
> > its free would be a no-op.
>
> That would help indeed, but it does require a libc that I haven't
> seen yet, and it may cause trouble if people try to LD_PRELOAD a
> different malloc implementation.

This depends on whether we could get the information that malloc did a
stack allocation (or a constant-size allocation, which could be
transformed into a different call.) As far as custom free is concerned
there are more usage cases. If it's worth the complications, there is a
possibility of prepending LD_PRELOAD with custom free logic that works
regardless of the allocator used.
On Tue, 12 Nov 2013, Ondřej Bílka wrote:

>> I am trying to get something to actually work and be accepted in
>> gcc. That may mean being conservative.
>
> That also may mean that you will only cover cases where it is not
> needed.
>
> A malloc will have a small per-thread cache for small requests that
> does not need any locking. A performance difference will be quite
> small and there may be a define which causes inlining of constant
> size mallocs.
>
> Sizes from 256 bytes up are the interesting case.

I have to disagree here. When the allocated size is large enough, the
cost of malloc+free often becomes small compared to whatever work you
are doing in that array. It is when the size is very small that
speeding up malloc+free is essential. And you are underestimating the
cost of those small allocations.

I started on this because of an application that spends more than half
of its time in malloc+free and where (almost) no allocation is larger
than 100 bytes. Changing the code not to use malloc/free but other
allocation strategies is very complicated because it would break
abstraction layers. I used various workarounds that proved rather
effective, but I would have loved for that to be unnecessary.

>> One difficulty with using a stack is that lifetimes cannot partially
>> overlap, whereas with malloc+free they can.
>
> Which you need to solve anyway if you want to do the conversion. You
> need to pick a properly overlapping malloc+free area that is best
> according to some criterion (like maximal expected memory usage.)

Here I am extending all lifetimes to the whole function (and implicitly
reusing storage in loops), simplistic I know.

>> Using the main stack has the advantage that I don't have to think
>> about deallocation, it happens automatically.
>
> And what is the logic of limiting sizes?

The main stack is rather small by default. And it is nice if other
variables on the stack are not spread over too many memory pages.

> Note that a leaf function have higher priority for stack that nonleaf
> ones as when you do stack allocation early it may kill lot of leaf
> allocations because of stack concerns.

Sorry, I have trouble understanding your sentence. Do you just mean
that if there is a threshold it should be higher in leaf functions?

>> And using a decl instead of alloca means that even if malloc+free
>> was in a loop, not deallocating won't make me grow the stack use
>> linearly in the number of iterations of the loop.
>
> Actually this looks like an orthogonal optimization of memory reuse,

Replacing malloc+free with alloca (without save/restore of the stack)
would be a pessimization.

> you can transform
>
>   x = malloc(42);
>   free(x);
>   y = malloc(42);
>
> to
>
>   x = malloc(42);
>   y = x;
>
> It might be feasible to teach PRE to transform loops with repeated
> allocation into an initial allocation and reuse.

There are already several PRs in bugzilla. For instance, PR 38318 is
about transforming:

  for(...){ malloc ... free }

into:

  malloc
  for(...){ ... }
  free

I don't remember if your example of reusing a previous allocation is
listed anywhere, you may want to add it if not. Those would all be very
useful, please try to implement one of them. You may also be interested
in optimizing the C++ dynarray class once a basic implementation is
committed.

I understand the topic of optimizing allocations is very large, and
powerful techniques could be developed for it. However, those bugs have
been open for years and nobody has bothered going further than removing
free(malloc(n)). So I think we should go with the small case that has
known usecases. And when you come up with a more general framework that
makes this small case irrelevant, I will gladly welcome it and we can
easily remove the small case.
On Tue, Nov 12, 2013 at 01:41:24PM +0100, Marc Glisse wrote: > On Tue, 12 Nov 2013, Ondřej Bílka wrote: > > >>I am trying to get something to actually work and be accepted in > >>gcc. That may mean being conservative. > > > >That also may mean that you will cover only cases where it is not needed. > > > >A malloc will have a small per-thread cache for small requests that does > >not need any locking. A performance difference will be quite small and > >there may be a define which causes inlining constant size mallocs. > > > >Sizes from 256 bytes are interesting case. > > I have to disagree here. When the allocated size is large enough, > the cost of malloc+free often becomes small compared to whatever > work you are doing in that array. It is when the size is very small > that speeding up malloc+free is essential. And you are > underestimating the cost of those small allocations. > No, just aware that these are important and there will be optimizations that convert these. For example: #define malloc (s) ({ \ static pool p; \ if (__builtin_constant_p (s) { \ alloc_from_pool(&p); \ else \ malloc (s); \ }) How will you find small constant allocations with this in place? Also you malloc has tradeoff between time and memory usage and where larger allocations needlessly cause bigger fragmentation. > I started on this because of an application that spends more than > half of its time in malloc+free and where (almost) no allocation is > larger than 100 bytes. Changing the code to not use malloc/free but > other allocation strategies is very complicated because it would > break abstraction layers. I used various workarounds that proved > rather effective, but I would have loved for that to be unnecessary. > See my memory pool that uses custom free functionality where you need only change malloc, free is handled automaticaly. > > > >And what is logic of limiting sizes? > > Main stack is rather small by default. 
And it is nice if other > variables on the stack are not spread over too many memory pages. No, question was what is logic that you will use to decide if allocation is stack or heap? Also if we take not spreading stack argument to logical conclusion a best course of action in adding a separate stack and placing arrays/alloca over 1024 bytes there. This is theory, I tried few benchmarks how using large alloca on different stack affects performance but I did not measure a difference. > > >Note that a leaf function have higher priority for stack that > >nonleaf ones as when you do stack allocation early it may kill lot > >of leaf allocations because of stack concerns. > > Sorry, I have trouble understanding your sentence. Do you just mean > that if there is a threshold it should be higher in leaf functions? > Yes, functions closer to leaf function should have higher tresholds. Generally in leaf functions you can safely use 65536 bytes as most libc function assume this stack so caller would get same problems by calling that function. > >>And using a decl instead of alloca means that even if > >>malloc+free was in a loop, not deallocating won't make me grow the > >>stack use linearly in the number of iterations of the loop. > >> > >Actually this looks like a orthogonal optimalization of memory reuse, > > Replacing malloc+free with alloca (without save/restore of the > stack) would be a pessimization. > I did not mentioned alloca. > I understand the topic of optimizing allocations is very large, and > powerful techniques could be developed for it. However, those bugs > have been open for years and nobody has bothered going further that > removing free(malloc(n)). So I think we should go with the small > case that has known usecases. And when you come up with a more > general framework that makes this small case irrelevant, I will > gladly welcome it and we can easily remove the small case. > As this is hard in gcc then doing these without gcc should be easier. 
In several cases (constant size) we do not need gcc involvement, as we could get the information ourselves. In other cases the best behaviour could be determined only by running a custom benchmarking program that tells which variant should go with which malloc. Then there are parts where coordination is necessary; one is determining if stack allocation is possible. A possible way would be to first turn eligible malloc calls into malloc_stack(size, color) as a hint to the allocator. I added a color parameter to handle partial overlap: if you do a coloring with an edge wherever allocations partially overlap, then you can assign to each color class a stack and proceed as normal.
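Ondřej's per-thread cache idea above can be made concrete. The sketch below is illustrative only: the names (`pool_node`, `pool_alloc`, `pool_free`) and the single fixed block size are my assumptions, not from any posted patch. A thread-local free list serves one constant size with no locking, falling back to malloc when empty.

```c
#include <stdlib.h>

#define BLOCK_SIZE 64  /* the one constant size this pool serves */

struct pool_node { struct pool_node *next; };

/* Thread-local head of the free list: no locking needed.  */
static __thread struct pool_node *pool_head;

static void *
pool_alloc (void)
{
  if (pool_head)
    {
      /* Fast path: pop a previously freed block.  */
      void *p = pool_head;
      pool_head = pool_head->next;
      return p;
    }
  /* Slow path: refill from the real allocator.  */
  return malloc (BLOCK_SIZE);
}

static void
pool_free (void *p)
{
  /* O(1) push; the freed block itself stores the link.  */
  struct pool_node *n = p;
  n->next = pool_head;
  pool_head = n;
}
```

In this sketch freed blocks are cached per thread forever; a real pool would bound the list, and BLOCK_SIZE must be at least sizeof (struct pool_node).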
On Tue, 12 Nov 2013, Ondřej Bílka wrote: > On Tue, Nov 12, 2013 at 01:41:24PM +0100, Marc Glisse wrote: >> On Tue, 12 Nov 2013, Ondřej Bílka wrote: >> >>>> I am trying to get something to actually work and be accepted in >>>> gcc. That may mean being conservative. >>> >>> That also may mean that you will cover only cases where it is not needed. >>> >>> A malloc will have a small per-thread cache for small requests that does >>> not need any locking. A performance difference will be quite small and >>> there may be a define which causes inlining constant size mallocs. >>> >>> Sizes from 256 bytes are interesting case. >> >> I have to disagree here. When the allocated size is large enough, >> the cost of malloc+free often becomes small compared to whatever >> work you are doing in that array. It is when the size is very small >> that speeding up malloc+free is essential. And you are >> underestimating the cost of those small allocations. >> > No, I am just aware that these are important and that there will be optimizations > that convert them. For example: > > #define malloc (s) ({ \ > static pool p; \ > if (__builtin_constant_p (s) { \ > alloc_from_pool(&p); \ > else \ > malloc (s); \ > }) Seems to be missing some bits. > How will you find small constant allocations with this in place? I won't. If your code is already optimized, the compiler has nothing left to do, that's fine. (not that I am convinced your optimization works that well) >> I started on this because of an application that spends more than >> half of its time in malloc+free and where (almost) no allocation is >> larger than 100 bytes. Changing the code to not use malloc/free but >> other allocation strategies is very complicated because it would >> break abstraction layers. I used various workarounds that proved >> rather effective, but I would have loved for that to be unnecessary. > > See my memory pool that uses the custom free functionality, where you need > only change malloc; free is handled automatically. 
Do you mean the incomplete macro above, or your STACK_ALLOC macro from the other post? (don't know how that one works either, "size" appears out of nowhere in STACK_FREE) As I already said, I know how to write efficient code, but that's hard on the abstraction layers (before inlining, you have to go at least 20 layers up in the CFG to find a common ancestor for malloc and free), and I'd be happy if the compiler could help a bit in easy cases. >>> And what is logic of limiting sizes? >> >> Main stack is rather small by default. And it is nice if other >> variables on the stack are not spread over too many memory pages. > > No, the question was: what logic will you use to decide whether an allocation > goes on the stack or the heap? > > Also, if we take the argument about not spreading the stack to its logical > conclusion, the best course of action is adding a separate stack and placing > arrays/allocas over 1024 bytes there. Then you also need to handle deallocations in there instead of relying on the automatic ones in the main stack. But yes, that's possible. > As this is hard in gcc, doing it without gcc should be easier. Yeah, we shouldn't bother writing an optimizing compiler, what were we thinking ;-) > In several cases (constant size) we do not need gcc involvement, as we > could get the information ourselves. In several cases, we don't need to bother manually optimizing our code; gcc could notice that things are simple and optimize them itself. > Then there are parts where coordination is necessary; one is determining > if stack allocation is possible. A possible way would be to first turn > eligible malloc calls into > > malloc_stack(size, color) > > as a hint to the allocator. I added a color parameter to handle partial > overlap: if you do a coloring with an edge wherever allocations partially > overlap, then you can assign to each color class a stack and proceed as > normal. That would be great, yes. I'll be looking forward to your patches. 
(note that the limits of alias analysis mean that gcc often has no idea which free goes with which malloc).
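Marc's caveat about alias analysis can be illustrated with a tiny example (hypothetical code, not from the patch): pairing a free with its malloc is easy while the pointer stays in one SSA name, and invisible to local analysis once the pointer escapes.

```c
#include <stdlib.h>

/* Easy case: the pointer never escapes, so a compiler can see that this
   free matches this malloc on every path through the function.  */
void
easy (void)
{
  char *p = malloc (12);
  if (!p)
    return;
  p[0] = 'x';
  free (p);
}

/* Hard case: the pointer is stored through another pointer.  After the
   store, any free anywhere in the program might release the block, so
   the malloc/free pairing is no longer visible here.  */
void
hard (char **out)
{
  *out = malloc (12);
}
```

In the second case the matching free typically sits many abstraction layers away, which is exactly the situation described above.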
On Tue, Nov 12, 2013 at 05:01:31PM +0100, Marc Glisse wrote: > On Tue, 12 Nov 2013, Ondřej Bílka wrote: > > >On Tue, Nov 12, 2013 at 01:41:24PM +0100, Marc Glisse wrote: > >>On Tue, 12 Nov 2013, Ondřej Bílka wrote: > >> > >>>>I am trying to get something to actually work and be accepted in > >>>>gcc. That may mean being conservative. > >>> > >>>That also may mean that you will cover only cases where it is not needed. > >>> > >>>A malloc will have a small per-thread cache for small requests that does > >>>not need any locking. A performance difference will be quite small and > >>>there may be a define which causes inlining constant size mallocs. > >>> > >>>Sizes from 256 bytes are interesting case. > >> > >>I have to disagree here. When the allocated size is large enough, > >>the cost of malloc+free often becomes small compared to whatever > >>work you are doing in that array. It is when the size is very small > >>that speeding up malloc+free is essential. And you are > >>underestimating the cost of those small allocations. > >> > >No, I am just aware that these are important and that there will be optimizations > >that convert them. For example: > > > >#define malloc (s) ({ \ > > static pool p; \ > > if (__builtin_constant_p (s) { \ > > alloc_from_pool(&p); \ > > else \ > > malloc (s); \ > >}) > > Seems to be missing some bits. > An example; its purpose is to show an idea, not to be complete. > >How will you find small constant allocations with this in place? > > I won't. If your code is already optimized, the compiler has nothing > left to do, that's fine. (not that I am convinced your optimization > works that well) > What if it decreases the running time of all constant allocations by 6%? Converting to stack allocation would eliminate the overhead, but the eliminated sites contributed 5% of the runtime. > >>I started on this because of an application that spends more than > >>half of its time in malloc+free and where (almost) no allocation is > >>larger than 100 bytes. 
Changing the code to not use malloc/free but > >>other allocation strategies is very complicated because it would > >>break abstraction layers. I used various workarounds that proved > >>rather effective, but I would have loved for that to be unnecessary. > > > >See my memory pool that uses the custom free functionality, where you need > >only change malloc; free is handled automatically. > > Do you mean the incomplete macro above, or your STACK_ALLOC macro > from the other post? (don't know how that one works either, "size" > appears out of nowhere in STACK_FREE) > Also an example where the actual logic could be supplied later; it should be __stack_new instead of size. I am not talking about stack conversion but about a memory pool; a proof-of-concept is here: https://www.sourceware.org/ml/libc-alpha/2013-11/msg00258.html > As I already said, I know how to write efficient code, but that's > hard on the abstraction layers (before inlining, you have to go at > least 20 layers up in the CFG to find a common ancestor for malloc > and free), and I'd be happy if the compiler could help a bit in easy > cases. > This is more about using allocation libraries that are flexible enough. > >Then there are parts where coordination is necessary; one is determining > >if stack allocation is possible. A possible way would be to first turn > >eligible malloc calls into > > > >malloc_stack(size, color) > > > >as a hint to the allocator. I added a color parameter to handle partial > >overlap: if you do a coloring with an edge wherever allocations partially > >overlap, then you can assign to each color class a stack and proceed as > >normal. > > That would be great, yes. I'll be looking forward to your patches. > > (note that the limits of alias analysis mean that gcc often has no > idea which free goes with which malloc). > Wait, you have a free with the same SSA_NAME as the malloc, and alias analysis cannot tell which malloc corresponds to that free? > -- > Marc Glisse
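With Ondřej's correction applied (testing the pointer rather than the undefined "size"), the separate-stack idea from the earlier post can be sketched in a self-contained form. This is deliberately simplified: single-threaded, strictly LIFO, no alignment rounding; the names `stack_alloc`/`stack_free` and the malloc fallback policy are my assumptions, not the posted code.

```c
#include <stdlib.h>
#include <stddef.h>

#define STACK_SIZE (1 << 18)

/* Bounds of the separate stack; a real version would make these
   thread-local and round sizes up for alignment.  */
static char *stack_from, *stack_cur, *stack_to;

static void *
stack_alloc (size_t size)
{
  if (!stack_from)
    {
      /* Lazily create the separate stack.  */
      stack_from = malloc (STACK_SIZE);
      stack_cur = stack_from;
      stack_to = stack_from + STACK_SIZE;
    }
  if (stack_to - stack_cur < (ptrdiff_t) size)
    return malloc (size);      /* does not fit: fall back to malloc */
  void *p = stack_cur;         /* bump allocation */
  stack_cur += size;
  return p;
}

static void
stack_free (void *p)
{
  char *c = p;
  if (c >= stack_from && c < stack_to)
    stack_cur = c;             /* LIFO pop back to this allocation */
  else
    free (p);                  /* came from the malloc fallback */
}
```

The in-range test on the pointer is what makes out-of-bounds (fallback) allocations safe to free, which is the custom-logic-outside-of-bounds behaviour described in the first message.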
On Tue, 12 Nov 2013, Ondřej Bílka wrote: >> Seems to be missing some bits. > > An example; its purpose is to show an idea, not to be complete. I agree, but when too many bits are missing or wrong I fail to get the idea :-( >>> How will you find small constant allocations with this in place? >> >> I won't. If your code is already optimized, the compiler has nothing >> left to do, that's fine. (not that I am convinced your optimization >> works that well) >> > What if it decreases the running time of all constant allocations by 6%? > Converting to stack allocation would eliminate the overhead, but the > eliminated sites contributed 5% of the runtime. Then you may keep using your code; the optimization will be useless on it, but it won't break it. >>>> I started on this because of an application that spends more than >>>> half of its time in malloc+free and where (almost) no allocation is >>>> larger than 100 bytes. Changing the code to not use malloc/free but >>>> other allocation strategies is very complicated because it would >>>> break abstraction layers. I used various workarounds that proved >>>> rather effective, but I would have loved for that to be unnecessary. >>> >>> See my memory pool that uses the custom free functionality, where you need >>> only change malloc; free is handled automatically. >> >> Do you mean the incomplete macro above, or your STACK_ALLOC macro >> from the other post? (don't know how that one works either, "size" >> appears out of nowhere in STACK_FREE) >> > Also an example where the actual logic could be supplied later; it should be > __stack_new instead of size. Ok, then that relies on the user calling STACK_ALLOC and STACK_FREE in a suitably nested order. I clearly don't know enough, from a C++ constructor, to use it. > This is more about using allocation libraries that are flexible > enough. We have different targets. I am concerned with very small allocations. The overhead of 2 function calls is already noticeable. I don't want to pay for flexibility. 
I believe the two approaches are really complementary and shouldn't hinder each other. >> (note that the limits of alias analysis mean that gcc often has no >> idea which free goes with which malloc). > > Wait, you have a free with the same SSA_NAME as the malloc, and alias analysis > cannot tell which malloc corresponds to that free? Here it does, but I wanted to warn you that it isn't such a common case. In addition to this patch, I'll also need some alias analysis changes for the code I started with to be optimized.
On 11/12/13 05:41, Marc Glisse wrote: > On Tue, 12 Nov 2013, Ondřej Bílka wrote: > >>> I am trying to get something to actually work and be accepted in >>> gcc. That may mean being conservative. >> >> That also may mean that you will cover only cases where it is not needed. >> >> A malloc will have a small per-thread cache for small requests that does >> not need any locking. A performance difference will be quite small and >> there may be a define which causes inlining constant size mallocs. >> >> Sizes from 256 bytes are interesting case. > > I have to disagree here. When the allocated size is large enough, the > cost of malloc+free often becomes small compared to whatever work you > are doing in that array. It is when the size is very small that speeding > up malloc+free is essential. And you are underestimating the cost of > those small allocations. I have to agree with Marc here. Start with something small and we can always look to extend it to handle more cases over time. The whole point is to avoid going into the allocator which can be damn expensive. Allocating on the stack is orders of magnitude cheaper. If you can collapse down to a _DECL, then, well, that's the holy grail. [soapbox] And doing this kind of thing intelligently is something I strongly would encourage. I can't express my disdain for folks directly using alloca. Programmers have consistently shown they are unable to do so in a safe manner. The amount of time I have personally had to dedicate to dealing with the fallout from such practices is absurd. Banning alloca in favor of malloc is something every project should seriously consider. Having the ability for the compiler to intelligently and safely take a malloc/free pair and turn that into alloca or DECL reference removes one more reason for programmers to directly use alloca. [/soapbox] Jeff
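Jeff's point about raw alloca can be illustrated with a pair of hypothetical helpers (not from the patch): the unchecked form silently turns a large input into a stack overrun, while a bounded variant keeps the fast path and falls back to malloc.

```c
#include <alloca.h>
#include <stdlib.h>
#include <string.h>

/* The classic hazard: alloca with an unchecked size can move the stack
   pointer past the guard page, so a large input becomes memory
   corruption instead of a clean failure.  */
char
first_char_bad (const char *s)
{
  char *copy = alloca (strlen (s) + 1);  /* no bound: unsafe */
  strcpy (copy, s);
  return copy[0];
}

/* A bounded variant: small sizes stay on the stack, anything larger
   goes through malloc/free.  */
char
first_char_ok (const char *s)
{
  size_t n = strlen (s) + 1;
  char *copy;
  if (n <= 512)
    copy = alloca (n);
  else
    copy = malloc (n);
  if (!copy)
    return 0;
  memcpy (copy, s, n);
  char r = copy[0];
  if (n > 512)
    free (copy);
  return r;
}
```

The point of the proposed optimization is that the compiler can prove the bound (a small constant size) and the pairing itself, so programmers never have to write either form by hand.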
Index: gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C =================================================================== --- gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C (revision 0) +++ gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C (working copy) @@ -0,0 +1,24 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ + +struct A { + void*p; + A(void*q) : p(q) {} + ~A(){ __builtin_free(p); } +}; +void f(void*)__attribute__((__leaf__)); +void h(void*)__attribute__((__leaf__,__nothrow__)); +void g(){ + void*p=__builtin_malloc(12); + A a(p); + f(p); +} + +void i(){ + void*p=__builtin_malloc(12); + h(p); + __builtin_free(p); +} + +/* { dg-final { scan-tree-dump-not "malloc" "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Property changes on: gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C ___________________________________________________________________ Added: svn:keywords ## -0,0 +1 ## +Author Date Id Revision URL \ No newline at end of property Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Index: gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C =================================================================== --- gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C (revision 0) +++ gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C (working copy) @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ + +void f(void*)__attribute__((__leaf__)); +void g(){ + void*p=__builtin_malloc(12); + f(p); + __builtin_free(p); +} + +/* { dg-final { scan-tree-dump-times "malloc" 1 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Property changes on: gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +Author Date Id Revision URL \ No newline at end of property Index: gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c 
=================================================================== --- gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c (working copy) @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ + +void f(void*)__attribute__((__leaf__)); +void g(int m,int n){ + int i; + void*p=__builtin_malloc(12); + switch(n){ + case 1: + for (i=0; i<m; ++i) + f(p); + break; + case 2: + __builtin_free(p); + __builtin_exit(42); + default:; + } + __builtin_free(p); +} + +/* { dg-final { scan-tree-dump-not "malloc" "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Property changes on: gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c ___________________________________________________________________ Added: svn:keywords ## -0,0 +1 ## +Author Date Id Revision URL \ No newline at end of property Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Index: gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c =================================================================== --- gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c (working copy) @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ + +void f(void*); +void g(){ + void*p=__builtin_malloc(12); + f(p); + __builtin_free(p); +} + +/* { dg-final { scan-tree-dump-times "malloc" 1 "optimized" } } */ +/* { dg-final { cleanup-tree-dump "optimized" } } */ Property changes on: gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c ___________________________________________________________________ Added: svn:keywords ## -0,0 +1 ## +Author Date Id Revision URL \ No newline at end of property Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Index: gcc/testsuite/gcc.dg/tree-ssa/pr19831-3.c =================================================================== --- gcc/testsuite/gcc.dg/tree-ssa/pr19831-3.c (revision 
204620) +++ gcc/testsuite/gcc.dg/tree-ssa/pr19831-3.c (working copy) @@ -1,31 +1,31 @@ /* { dg-do compile } */ /* { dg-options "-O2 -fdump-tree-optimized" } */ -void test2(void) +void test2(int n) { - int *p = __builtin_malloc (sizeof (int) * 4); + int *p = __builtin_malloc (sizeof (int) * n); if (p == (void *)0) __builtin_abort (); __builtin_free (p); } -void test5(int b) +void test5(int n) { - int *p = __builtin_malloc (sizeof (int) * 4); + int *p = __builtin_malloc (sizeof (int) * n); if (p) __builtin_free (p); } -void test6(void) +void test6(int n) { - int *p = __builtin_malloc (sizeof (int) * 4); + int *p = __builtin_malloc (sizeof (int) * n); if (p == (void *)0) __builtin_abort (); if (p) __builtin_free (p); } /* We should be able to remove all malloc/free pairs with CDDCE. Assume p was non-NULL for test2. For test5, it doesn't matter if p is NULL or non-NULL. */ Index: gcc/tree-ssa-forwprop.c =================================================================== --- gcc/tree-ssa-forwprop.c (revision 204620) +++ gcc/tree-ssa-forwprop.c (working copy) @@ -33,20 +33,21 @@ along with GCC; see the file COPYING3. #include "tree-ssanames.h" #include "tree-dfa.h" #include "tree-pass.h" #include "langhooks.h" #include "flags.h" #include "expr.h" #include "cfgloop.h" #include "optabs.h" #include "tree-ssa-propagate.h" #include "tree-ssa-dom.h" +#include "params.h" /* This pass propagates the RHS of assignment statements into use sites of the LHS of the assignment. It's basically a specialized form of tree combination. It is hoped all of this can disappear when we have a generalized tree combiner. One class of common cases we handle is forward propagating a single use variable into a COND_EXPR. 
bb0: @@ -1478,20 +1479,113 @@ constant_pointer_difference (tree p1, tr } for (i = 0; i < cnt[0]; i++) for (j = 0; j < cnt[1]; j++) if (exps[0][i] == exps[1][j]) return size_binop (MINUS_EXPR, offs[0][i], offs[1][j]); return NULL_TREE; } +/* We want to be sure that free (VAR) can't be called in another function, and + the easiest way to ensure that is to prove that it is called in this + function. The hardest part is avoiding some call to a function that may not + return because of exit, an infinite loop, setjmp, etc, where it might call + free on VAR. */ +static bool +malloc_safe_on_stack (gimple stmt, vec<gimple, va_heap> *list_of_frees) +{ + tree var = gimple_call_lhs (stmt); + basic_block bb = gimple_bb (stmt); + stack_vec<basic_block, 20> bb_to_visit; + bitmap bb_visited = BITMAP_ALLOC (NULL); + bitmap_set_bit (bb_visited, bb->index); + gimple_stmt_iterator gsi = gsi_for_stmt (stmt); + +next_stmt: + gsi_next_nondebug (&gsi); + +handle_stmt: + if (gsi_end_p (gsi)) + { + edge e; + edge_iterator ei; + FOR_EACH_EDGE (e, ei, bb->succs) + { + bb_to_visit.safe_push (e->dest); + } + goto next_bb; + } + stmt = gsi_stmt (gsi); + if (stmt_can_throw_external (stmt)) + /* We could give up only if VAR has escaped (same for return), but that + means there is a memory leak, so don't bother. */ + goto unsafe; + + switch (gimple_code (stmt)) + // TODO: GIMPLE_ASM, EH related gimples? + { + /* We leave the function without calling free. */ + case GIMPLE_RETURN: + goto unsafe; + + /* Statements that are irrelevant. */ + case GIMPLE_ASSIGN: + case GIMPLE_LABEL: + case GIMPLE_NOP: + /* Last stmt of the bb, handled by looking at the outgoing edges. */ + case GIMPLE_GOTO: + case GIMPLE_COND: + // TODO: special case: if (VAR == 0) ... + case GIMPLE_SWITCH: + goto next_stmt; + + case GIMPLE_CALL: + { + // TODO: __builtin_(abort|trap|exit|unreachable) + // should be safe as well. 
+ if (gimple_call_builtin_p (stmt, BUILT_IN_FREE) + && gimple_call_arg (stmt, 0) == var) + { + /* Nothing could throw an exception earlier in the block, + so we don't need to check the EH edges. */ + list_of_frees->safe_push (stmt); + goto next_bb; + } + int flags = gimple_call_flags (stmt); + // FIXME: leaf is actually not safe, we need some new ECF_* flags. + if (flags & (ECF_CONST|ECF_PURE|ECF_LEAF|ECF_NOVOPS)) + goto next_stmt; + goto unsafe; + } + default: + goto unsafe; + } + +next_bb: + if (bb_to_visit.is_empty ()) + goto safe; + bb = bb_to_visit.last (); + bb_to_visit.pop (); + if (!bitmap_set_bit (bb_visited, bb->index)) + goto next_bb; + gsi = gsi_start_bb (bb); + goto handle_stmt; + +unsafe: + BITMAP_FREE (bb_visited); + return false; +safe: + BITMAP_FREE (bb_visited); + return true; +} + /* *GSI_P is a GIMPLE_CALL to a builtin function. Optimize memcpy (p, "abcd", 4); memset (p + 4, ' ', 3); into memcpy (p, "abcd ", 7); call if the latter can be stored by pieces during expansion. */ static bool simplify_builtin_call (gimple_stmt_iterator *gsi_p, tree callee2) @@ -1681,20 +1775,64 @@ simplify_builtin_call (gimple_stmt_itera gimple_call_set_arg (stmt2, 2, build_int_cst (TREE_TYPE (len2), src_len)); unlink_stmt_vdef (stmt1); gsi_remove (&gsi, true); release_defs (stmt1); update_stmt (stmt2); return false; } } break; + + case BUILT_IN_FREE: + { + tree size; + tree ptr = gimple_call_arg (stmt2, 0); + if (TREE_CODE (ptr) != SSA_NAME) + return false; + stmt1 = SSA_NAME_DEF_STMT (ptr); + /* TODO: handle calloc. */ + if (gimple_call_builtin_p (stmt1, BUILT_IN_MALLOC) + && host_integerp ((size = gimple_call_arg (stmt1, 0)), 1) + /* param_value(...) / 10? Some new parameter? */ + && TREE_INT_CST_LOW (size) + <= (unsigned HOST_WIDE_INT)PARAM_VALUE (PARAM_LARGE_STACK_FRAME)) + { + gcc_checking_assert (gimple_call_lhs (stmt1) == ptr); + stack_vec<gimple, 20> list_of_frees; + if (!malloc_safe_on_stack (stmt1, &list_of_frees)) + return false; + /* Declare array. 
*/ + tree elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1); + tree array_type = build_array_type_nelts (elem_type, + TREE_INT_CST_LOW (size)); + tree var = create_tmp_var (array_type, NULL); + tree p = fold_convert (TREE_TYPE (ptr), build_fold_addr_expr (var)); + /* Replace free with a clobber. */ + int i; + gimple free_stmt; + FOR_EACH_VEC_ELT (list_of_frees, i, free_stmt) + { + tree clobber = build_constructor (array_type, NULL); + TREE_THIS_VOLATILE (clobber) = 1; + gimple clobber_stmt = gimple_build_assign (var, clobber); + gimple_stmt_iterator gsi_free = gsi_for_stmt (free_stmt); + gsi_replace (&gsi_free, clobber_stmt, false); + } + /* Replace malloc with the address of the variable. */ + gimple_stmt_iterator gsi1 = gsi_for_stmt (stmt1); + update_call_from_tree (&gsi1, p); + update_stmt (gsi_stmt (gsi1)); + return true; + } + + } default: break; } return false; } /* Checks if expression has type of one-bit precision, or is a known truth-valued expression. */ static bool truth_valued_ssa_name (tree name)
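For readers skimming the diff, the new BUILT_IN_FREE case in simplify_builtin_call has roughly this source-level effect (a sketch only: the real rewrite happens on GIMPLE, the clobber has no C spelling, and `before`/`after` are illustrative names):

```c
#include <stdlib.h>

/* What the user wrote: a small constant-size malloc whose free is
   reached on every path (the condition malloc_safe_on_stack checks).  */
int
before (void)
{
  unsigned char *p = malloc (12);
  p[0] = 42;
  int r = p[0];
  free (p);
  return r;
}

/* Roughly what the pass produces: the malloc becomes the address of a
   local array DECL, and each free becomes a clobber of that array
   marking the end of its lifetime; no allocator call remains.  */
int
after (void)
{
  unsigned char buf[12];       /* the created array DECL */
  unsigned char *p = buf;      /* malloc replaced by &buf */
  p[0] = 42;
  int r = p[0];
  /* free (p) -> clobber of buf */
  return r;
}
```

Both versions compute the same value; the second simply never touches the heap, which is the whole point of the transformation.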