| Message ID | 4A139FC4.7030309@cosmosbay.com |
|---|---|
| State | Accepted, archived |
| Delegated to: | David Miller |
On Wed, May 20, 2009 at 08:14:28AM +0200, Eric Dumazet wrote:
> Eric Dumazet wrote:
> > David Miller wrote:
> >> From: Neil Horman <nhorman@tuxdriver.com>
> >> Date: Tue, 19 May 2009 15:24:50 -0400
> >>
> >>>> Moving whole group in front would defeat the purpose of move, actually,
> >>>> since rank in chain is used to decay the timeout in garbage collector.
> >>>> (search for tmo >>= 1; )
> >>>>
> >>> Argh, so the list is implicitly ordered by expiration time. That
> >>> really defeats the entire purpose of doing grouping in the list at
> >>> all. If that's the case, then I agree, it's probably better to
> >>> take the additional visitation hit in check_expire above than to
> >>> try and preserve ordering.
> >> Yes, this seems best.
> >>
> >> I was worried that somehow the ordering also influences lookups,
> >> because the TOS bits don't go into the hash, so I worried that it would
> >> be important that explicit TOS values appear before wildcard ones.
> >> But it doesn't appear that this is an issue: we don't have wildcard
> >> TOSs in the rtable entries, they are always explicit.
> >>
> >> So I would like to see an explicit final patch from Eric so we can get
> >> this fixed now.
> >>
> >
> > I would like to split the patches because we have two bugs indeed, and
> > I prefer to get attention for both problems; I don't remember Neil
> > acknowledging the length computation problem.
> >
> > First, a small patch, candidate for net-2.6 and -stable (for 2.6.29):
> >
> > Then here is the patch on top of the previous one.
>
> Thanks to all
>
> [PATCH] net: fix rtable leak in net/ipv4/route.c
>
> Alexander V. Lukyanov found a regression in 2.6.29 and made a complete
> analysis, found at http://bugzilla.kernel.org/show_bug.cgi?id=13339
> Quoted here because it is a perfect one:
>
> begin_of_quotation
>  2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately
>  the patch has at least one critical flaw, and another problem.
>
>  rt_intern_hash calculates rthi pointer, which is later used for new entry
>  insertion. The same loop calculates cand pointer which is used to clean the
>  list. If the pointers are the same, rtable leak occurs, as first the cand
>  is removed then the new entry is appended to it.
>
>  This leak leads to unregister_netdevice problem (usage count > 0).
>
>  Another problem of the patch is that it tries to insert the entries in
>  certain order, to facilitate counting of entries distinct by all but QoS
>  parameters. Unfortunately, referencing an existing rtable entry moves it
>  to list beginning, to speed up further lookups, so the carefully built
>  order is destroyed.
>
>  For the first problem the simplest patch is to set rthi=0 when rthi==cand,
>  but it will also destroy the ordering.
> end_of_quotation
>
> Problematic commit is 1080d709fb9d8cd4392f93476ee46a9d6ea05a5b
> (net: implement emergency route cache rebuilds when gc_elasticity is exceeded)
>
> Trying to keep dst_entries ordered is too complex and breaks the rule that
> order should depend on the frequency of use for garbage collection.
>
> A possible fix is to make rt_intern_hash() simpler, and only make
> rt_check_expire() a little bit smarter, able to cope with an arbitrary
> entries order. The added loop runs on cache-hot data, while the cpu
> is prefetching the next object, so it should go unnoticed.
>
> Reported-and-analyzed-by: Alexander V. Lukyanov <lav@yar.ru>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  net/ipv4/route.c |   55 +++++++++++++--------------------------------
>  1 files changed, 17 insertions(+), 38 deletions(-)
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 869cf1c..28205e5 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -784,7 +784,7 @@ static void rt_check_expire(void)
>  {
>  	static unsigned int rover;
>  	unsigned int i = rover, goal;
> -	struct rtable *rth, **rthp;
> +	struct rtable *rth, *aux, **rthp;
>  	unsigned long samples = 0;
>  	unsigned long sum = 0, sum2 = 0;
>  	u64 mult;
> @@ -812,6 +812,7 @@ static void rt_check_expire(void)
>  		length = 0;
>  		spin_lock_bh(rt_hash_lock_addr(i));
>  		while ((rth = *rthp) != NULL) {
> +			prefetch(rth->u.dst.rt_next);
>  			if (rt_is_expired(rth)) {
>  				*rthp = rth->u.dst.rt_next;
>  				rt_free(rth);
> @@ -820,33 +821,30 @@ static void rt_check_expire(void)
>  			if (rth->u.dst.expires) {
>  				/* Entry is expired even if it is in use */
>  				if (time_before_eq(jiffies, rth->u.dst.expires)) {
> +nofree:
>  					tmo >>= 1;
>  					rthp = &rth->u.dst.rt_next;
>  					/*
> -					 * Only bump our length if the hash
> -					 * inputs on entries n and n+1 are not
> -					 * the same, we only count entries on
> +					 * We only count entries on
>  					 * a chain with equal hash inputs once
>  					 * so that entries for different QOS
>  					 * levels, and other non-hash input
>  					 * attributes don't unfairly skew
>  					 * the length computation
>  					 */
> -					if ((*rthp == NULL) ||
> -					    !compare_hash_inputs(&(*rthp)->fl,
> -								 &rth->fl))
> -						length += ONE;
> +					for (aux = rt_hash_table[i].chain;;) {
> +						if (aux == rth) {
> +							length += ONE;
> +							break;
> +						}
> +						if (compare_hash_inputs(&aux->fl, &rth->fl))
> +							break;
> +						aux = aux->u.dst.rt_next;
> +					}

Very "interesting" for() usage, but isn't it more readable like this?:

	aux = rt_hash_table[i].chain;
	while (aux != rth) {
		if (compare_hash_inputs(&aux->fl, &rth->fl))
			break;
		aux = aux->u.dst.rt_next;
	}

	if (aux == rth)
		length += ONE;

Jarek P.

>  					continue;
>  				}
> -			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
> -				tmo >>= 1;
> -				rthp = &rth->u.dst.rt_next;
> -				if ((*rthp == NULL) ||
> -				    !compare_hash_inputs(&(*rthp)->fl,
> -							 &rth->fl))
> -					length += ONE;
> -				continue;
> -			}
> +			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout))
> +				goto nofree;
>
>  			/* Cleanup aged off entries. */
>  			*rthp = rth->u.dst.rt_next;
> @@ -1069,7 +1067,6 @@ out:	return 0;
>  static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
>  {
>  	struct rtable	*rth, **rthp;
> -	struct rtable	*rthi;
>  	unsigned long	now;
>  	struct rtable *cand, **candp;
>  	u32 		min_score;
> @@ -1089,7 +1086,6 @@ restart:
>  	}
>
>  	rthp = &rt_hash_table[hash].chain;
> -	rthi = NULL;
>
>  	spin_lock_bh(rt_hash_lock_addr(hash));
>  	while ((rth = *rthp) != NULL) {
> @@ -1135,17 +1131,6 @@ restart:
>  		chain_length++;
>
>  		rthp = &rth->u.dst.rt_next;
> -
> -		/*
> -		 * check to see if the next entry in the chain
> -		 * contains the same hash input values as rt. If it does
> -		 * This is where we will insert into the list, instead of
> -		 * at the head. This groups entries that differ by aspects not
> -		 * relvant to the hash function together, which we use to adjust
> -		 * our chain length
> -		 */
> -		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
> -			rthi = rth;
>  	}
>
>  	if (cand) {
> @@ -1206,10 +1191,7 @@ restart:
>  		}
>  	}
>
> -	if (rthi)
> -		rt->u.dst.rt_next = rthi->u.dst.rt_next;
> -	else
> -		rt->u.dst.rt_next = rt_hash_table[hash].chain;
> +	rt->u.dst.rt_next = rt_hash_table[hash].chain;
>
>  #if RT_CACHE_DEBUG >= 2
>  	if (rt->u.dst.rt_next) {
> @@ -1225,10 +1207,7 @@ restart:
>  	 * previous writes to rt are comitted to memory
>  	 * before making rt visible to other CPUS.
>  	 */
> -	if (rthi)
> -		rcu_assign_pointer(rthi->u.dst.rt_next, rt);
> -	else
> -		rcu_assign_pointer(rt_hash_table[hash].chain, rt);
> +	rcu_assign_pointer(rt_hash_table[hash].chain, rt);
>
>  	spin_unlock_bh(rt_hash_lock_addr(hash));
>  	*rp = rt;

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
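The rthi == cand leak analyzed above is easy to model outside the kernel. Below is a minimal, hypothetical C sketch (the names mirror rt_intern_hash(), but the structures are toy stand-ins): unlinking cand and then inserting the new entry after the same, now dead, node leaves the new entry unreachable from the chain head.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the leak: a singly linked chain where the GC candidate
 * (cand) and the remembered insertion point (rthi) are the same node. */
struct node {
	int id;
	struct node *next;
};

/* Number of nodes reachable from the chain head. */
int chain_len(const struct node *head)
{
	int n = 0;

	for (const struct node *p = head; p; p = p->next)
		n++;
	return n;
}

/* The buggy sequence: first unlink cand from the chain, then append the
 * new entry after rthi.  When rthi == cand, the new entry gets linked
 * into a node that is no longer on the chain, so it never becomes
 * reachable -- the leak Alexander describes. */
void buggy_insert(struct node **headp, struct node *cand,
		  struct node *rthi, struct node *entry)
{
	for (struct node **pp = headp; *pp; pp = &(*pp)->next) {
		if (*pp == cand) {
			*pp = cand->next;	/* unlink cand */
			break;
		}
	}
	entry->next = rthi->next;	/* rthi is already off the chain... */
	rthi->next = entry;		/* ...so entry is leaked */
}
```

Eric's fix sidesteps the whole class of bug: with rthi gone, insertion always happens at the chain head, so no stale insertion point can exist.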
On Wed, May 20, 2009 at 08:14:28AM +0200, Eric Dumazet wrote:
> [PATCH] net: fix rtable leak in net/ipv4/route.c
>
> Alexander V. Lukyanov found a regression in 2.6.29 and made a complete
> analysis, found at http://bugzilla.kernel.org/show_bug.cgi?id=13339
>
> [...]
>
> Reported-and-analyzed-by: Alexander V. Lukyanov <lav@yar.ru>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Acked-by: Neil Horman <nhorman@tuxdriver.com>
Jarek Poplawski wrote:
> On Wed, May 20, 2009 at 08:14:28AM +0200, Eric Dumazet wrote:
>> +					for (aux = rt_hash_table[i].chain;;) {
>> +						if (aux == rth) {
>> +							length += ONE;
>> +							break;
>> +						}
>> +						if (compare_hash_inputs(&aux->fl, &rth->fl))
>> +							break;
>> +						aux = aux->u.dst.rt_next;
>> +					}
>
> Very "interesting" for() usage, but isn't it more readable like this?:
>
> 	aux = rt_hash_table[i].chain;
> 	while (aux != rth) {
> 		if (compare_hash_inputs(&aux->fl, &rth->fl))
> 			break;
> 		aux = aux->u.dst.rt_next;
> 	}

Well, this test is done two times, this is the difference...

>
> 	if (aux == rth)
> 		length += ONE;
>
> Jarek P.

I first wrote:

	for (aux = rt_hash_table[i].chain; ; aux = aux->u.dst.rt_next) {
		if (aux == rth) {
			length += ONE;
			break;
		}
		if (compare_hash_inputs(&aux->fl, &rth->fl))
			break;
	}

but had to split the too-long line, so I ended up with the form in the
patch :)

Thank you
On Wed, May 20, 2009 at 01:13:41PM +0200, Eric Dumazet wrote:
> Jarek Poplawski wrote:
> > Very "interesting" for() usage, but isn't it more readable like this?:
> > [...]
>
> Well, this test is done two times, this is the difference...

I know, but I guess this is used quite often. And probably it's not a
very hard optimization for a compiler (while - else). As a matter of
fact, even this would confuse me less here:

	aux = rt_hash_table[i].chain;
	for (;;) {

But of course, it's a matter of taste, so no big deal.

Jarek P.
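For what it's worth, the two shapes being debated are easy to check for equivalence on a toy chain. This is a hypothetical sketch (an int key stands in for the flow/hash inputs, same_inputs() for compare_hash_inputs()); both helpers answer the same question: is rth the first entry of its hash-input group?

```c
#include <stddef.h>

struct ent {
	int key;		/* stands in for the hash inputs */
	struct ent *next;
};

int same_inputs(const struct ent *a, const struct ent *b)
{
	return a->key == b->key;
}

/* Eric's shape: one loop, the aux == rth test decides counting. */
int counts_for(struct ent *chain, struct ent *rth)
{
	struct ent *aux;

	for (aux = chain; ; aux = aux->next) {
		if (aux == rth)
			return 1;	/* reached rth first: count it */
		if (same_inputs(aux, rth))
			return 0;	/* an earlier twin exists: skip */
	}
}

/* Jarek's shape: walk until rth, then test how the loop ended.
 * The aux == rth test is indeed written twice, as Eric notes. */
int counts_while(struct ent *chain, struct ent *rth)
{
	struct ent *aux = chain;

	while (aux != rth) {
		if (same_inputs(aux, rth))
			break;
		aux = aux->next;
	}
	return aux == rth;
}
```

Both compile to the same decision; the disagreement is purely about whether the duplicated termination test reads better than the unconventional for().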
From: Neil Horman <nhorman@tuxdriver.com>
Date: Wed, 20 May 2009 06:48:38 -0400

> On Wed, May 20, 2009 at 08:14:28AM +0200, Eric Dumazet wrote:
>> [PATCH] net: fix rtable leak in net/ipv4/route.c
 ...
>> Reported-and-analyzed-by: Alexander V. Lukyanov <lav@yar.ru>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> Acked-by: Neil Horman <nhorman@tuxdriver.com>

Also applied and queued up for -stable, thanks!
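The reason the patched rt_check_expire() scans each chain from its head is that the old "compare with the next entry" test is only correct while equal-input entries stay adjacent, and lookups destroy that adjacency by moving hot entries to the front. A small hypothetical sketch over arrays of keys (standing in for hash inputs; plain 1 instead of the kernel's fixed-point ONE) shows the difference:

```c
/* Old logic: count an entry unless the *next* entry has the same hash
 * inputs.  Correct only while equal-input entries are adjacent. */
int length_adjacent(const int *keys, int n)
{
	int len = 0;

	for (int i = 0; i < n; i++)
		if (i + 1 == n || keys[i + 1] != keys[i])
			len++;
	return len;
}

/* New logic (what the added aux loop does): count an entry only if no
 * earlier entry in the chain has the same hash inputs, so the result
 * is independent of entry order. */
int length_first_seen(const int *keys, int n)
{
	int len = 0;

	for (int i = 0; i < n; i++) {
		int first = 1;

		for (int j = 0; j < i; j++)
			if (keys[j] == keys[i])
				first = 0;
		if (first)
			len++;
	}
	return len;
}
```

On a grouped chain the two agree; once a lookup splits a group apart, the adjacency test overcounts while the scan-from-head test keeps reporting one count per distinct group, which is exactly what the emergency-rebuild heuristic needs.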