Message ID: 1432085113-25063-1-git-send-email-ying.xue@windriver.com
State: Rejected, archived
Delegated to: David Miller
On Wed, 2015-05-20 at 09:25 +0800, Ying Xue wrote:
> Calling __ipv4_neigh_lookup_noref() inside rcu_read_lock_bh() guarantees
> that the neighbour entry it finds is not freed before the RCU grace
> period ends, but it cannot ensure that the entry is not already being
> destroyed. More precisely, it cannot prevent neigh_destroy() from
> running in another context at the same time. For example, if
> ip_finish_output2() continues to deliver an skb with a neighbour entry
> whose refcount is zero, neigh_add_timer() may subsequently be called
> from neigh_resolve_output(). As a result, neigh_add_timer() takes a
> refcount on a neighbour whose refcount had already reached zero. When
> that refcount is put before the timer handler exits, neigh_destroy() is
> called a second time, and the kernel crashes at that moment.
>
> To prevent the issue, we must check whether the refcount of the
> neighbour found by __ipv4_neigh_lookup_noref() has already dropped to
> zero. If it is zero, we should create a new entry instead.
>
> However, reading the neigh's refcount with atomic_read() is unsafe, as
> it implies no memory barrier, and enforcing a proper implicit or
> explicit memory barrier would be too expensive. Therefore an additional
> check of the neigh's dead flag is added to __neigh_event_send() to
> further prevent neigh_add_timer() from taking a refcount on a neigh
> whose refcount has already hit zero, so that the issue cannot happen at
> all.
>
> Reported-by: Joern Engel <joern@logfs.org>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> ---
> v2:
>  - As Eric pointed out, checking whether a neigh's refcnt is zero with
>    atomic_read() is unsafe, so an additional check of the neigh's dead
>    flag is added to __neigh_event_send() to further prevent
>    neigh_add_timer() from taking a refcnt that has already hit zero.
>  - The patch is now based on the "net" tree, considering this is a
>    very serious issue.
>
> net/core/neighbour.c | 3 +++
> net/ipv4/ip_output.c | 2 +-
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 3de6542..c7a675c 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -958,6 +958,9 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>  	if (neigh->nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE))
>  		goto out_unlock_bh;
>
> +	if (neigh->dead)
> +		goto out_unlock_bh;
> +
>  	if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {
>  		if (NEIGH_VAR(neigh->parms, MCAST_PROBES) +
>  		    NEIGH_VAR(neigh->parms, APP_PROBES)) {
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c65b93a..5889774 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -200,7 +200,7 @@ static inline int ip_finish_output2(struct sock *sk, struct sk_buff *skb)
>  	rcu_read_lock_bh();
>  	nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
>  	neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
> -	if (unlikely(!neigh))
> +	if (unlikely(!neigh || !atomic_read(&neigh->refcnt)))
>  		neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
>  	if (!IS_ERR(neigh)) {
>  		int res = dst_neigh_output(dst, neigh, skb);

Sorry, this atomic_read() makes no sense to me.

When RCU is used, it is perfectly fine to use an object whose refcount is 0. If you believe the opposite, then point me to the relevant Documentation/RCU/ file.

A refcount should only protect the object itself, not its use by an RCU reader. This has nothing to do with RCU; it is standard refcounting of objects.

The only forbidden thing would be to take a reference with atomic_inc() instead of the more careful atomic_inc_not_zero(), if the caller needs to exit the rcu_read_lock() section safely (as explained in Documentation/RCU/rcuref.txt around line 52).

So far, dst_neigh_output() does not try to take a refcnt.
By the time you check atomic_read(), the result is meaningless if another CPU decrements the refcount to 0. So what is the point of trying to reduce a race window without properly removing it?

I repeat: using atomic_read() in an RCU lookup is absolutely useless and is a clear sign you missed a fundamental issue.

So now we are back to the patch I initially sent, but you did not address Julian Anastasov's feedback. Really, I am not very happy...

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
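[Editor's note] The rule Eric describes — an RCU reader may *use* an object whose refcount is 0, but may only *take* a reference with atomic_inc_not_zero(), never a plain increment — can be modeled in userspace with C11 atomics. This is an illustrative sketch, not kernel code; `struct obj` and `obj_get_not_zero()` are hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_int refcnt;
};

/* Userspace model of atomic_inc_not_zero(): take a reference only if
 * the refcount has not already dropped to zero. */
static bool obj_get_not_zero(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value,
		 * so a concurrent drop to zero makes the loop exit. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;	/* reference taken */
	}
	return false;	/* object is being destroyed: do not touch refcnt */
}
```

A caller that must leave the RCU read-side section with the object pinned acts on the boolean result: on false it falls back to creating a fresh entry. A plain increment at this point is exactly the bug the WARN_ON_ONCE debug patch quoted later in this thread catches.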
On 05/20/2015 01:27 PM, Eric Dumazet wrote:
> Sorry, this atomic_read() makes no sense to me.
>
> When RCU is used, it is perfectly fine to use an object whose refcount
> is 0. If you believe the opposite, then point me to the relevant
> Documentation/RCU/ file.

However, the reality for us is that rcu_read_lock() can guarantee that a neigh object is not freed within the region covered by the RCU read lock, but it cannot prevent neigh_destroy() from being executed in another context at the same time.

> A refcount should only protect the object itself, not its use by an
> RCU reader. This has nothing to do with RCU; it is standard
> refcounting of objects.
>
> The only forbidden thing would be to take a reference with
> atomic_inc() instead of the more careful atomic_inc_not_zero(), if the
> caller needs to exit the rcu_read_lock() section safely (as explained
> in Documentation/RCU/rcuref.txt around line 52).
>
> So far, dst_neigh_output() does not try to take a refcnt.

Please take a look at the following log, generated by the debug patch Joern Engel wrote:

[ 4974.816012] [<ffffffff8163baae>] dump_stack+0x19/0x1b
[ 4974.816017] [<ffffffff8103f141>] warn_slowpath_common+0x61/0x80
[ 4974.816019] [<ffffffff8103f21a>] warn_slowpath_null+0x1a/0x20
[ 4974.816021] [<ffffffff8163de95>] neigh_hold.part.26+0x1e/0x27
[ 4974.816027] [<ffffffff8155f08c>] neigh_add_timer+0x3c/0x60
[ 4974.816029] [<ffffffff8155f1ab>] __neigh_event_send+0xfb/0x220
[ 4974.816031] [<ffffffff8155f40b>] neigh_resolve_output+0x13b/0x220
[ 4974.816038] [<ffffffff8158c950>] ip_finish_output+0x1b0/0x3b0
[ 4974.816040] [<ffffffff8158ddd8>] ip_output+0x58/0x90
[ 4974.816042] [<ffffffff8158d5a5>] ip_local_out+0x25/0x30
[ 4974.816045] [<ffffffff8158d8f7>] ip_queue_xmit+0x137/0x380
[ 4974.816051] [<ffffffff815a47a5>] tcp_transmit_skb+0x445/0x8a0
[ 4974.816054] [<ffffffff815a4d40>] tcp_write_xmit+0x140/0xb00
[ 4974.816058] [<ffffffff815a59ae>] __tcp_push_pending_frames+0x2e/0xc0
[ 4974.816062] [<ffffffff81597198>] tcp_sendmsg+0x118/0xd90
[ 4974.816070] [<ffffffff81278b55>] ? debug_object_deactivate+0x115/0x170
[ 4974.816076] [<ffffffff815bf434>] inet_sendmsg+0x64/0xb0
[ 4974.816080] [<ffffffff8153da56>] sock_sendmsg+0x76/0x90
[ 4974.816086] [<ffffffff81046e89>] ? local_bh_enable_ip+0x89/0xf0

The stack trace above shows that the following call chain happens:

ip_finish_output()
  ->ip_finish_output2()
    ->dst_neigh_output()
      ->neigh_resolve_output()
        ->__neigh_event_send()
          ->neigh_add_timer()
            ->neigh_hold()

Moreover, the debug patch Joern Engel wrote adds the following code:

+static inline void neigh_hold(struct neighbour *neigh)
+{
+	WARN_ON_ONCE(atomic_inc_return(&neigh->refcnt) == 1);
+}

So the stack trace above was printed by WARN_ON_ONCE(), which proves not only that dst_neigh_output() tries to take a refcnt on a neigh, but also that the refcnt of the neigh entry it touched had already been decremented to zero.

Please consider the scenario below, which is very similar to the case Joern Engel hit:

Time 1, CPU 0:
  neigh_release()
    ->neigh_destroy()         // called because the neigh's refcnt is now 0
      ->kfree_rcu(neigh, rcu); // freeing the neigh object is delayed until
                               // after an RCU grace period

Time 2, CPU 1:
  ip_finish_output2()
    rcu_read_lock_bh()
    dst_neigh_output()
      __neigh_event_send()
        neigh_add_timer()
          neigh_hold()         // refcnt of the neigh is increased from 0 to 1
    rcu_read_unlock_bh()

Time 3, CPU 0, in RCU_SOFTIRQ context:
  rcu_process_callbacks()      // the neigh object is really freed!

Time 4, CPU 0:
  Someone calls neigh_release() on the neigh whose refcnt was increased
  from 0 to 1. As the neigh is already freed, a panic happens!!!

This is why I initially added the check in ip_finish_output2() that uses atomic_read() to test whether the refcnt of the neigh returned by __ipv4_neigh_lookup_noref() is zero.
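[Editor's note] The Time 1..4 sequence above can be replayed deterministically in a single thread, because the failure is purely arithmetic: a plain increment resurrects a refcount that already hit zero, so the destroy path runs twice. All names below are illustrative stand-ins, not the kernel's API:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int refcnt;
static int destroy_calls;

/* Model of neigh_release(): destruction runs when the count hits zero. */
static void obj_release(void)
{
	if (atomic_fetch_sub(&refcnt, 1) == 1)
		destroy_calls++;	/* stands in for neigh_destroy()/kfree_rcu() */
}

/* Model of the buggy neigh_hold(): a plain increment, no zero check. */
static void obj_hold_buggy(void)
{
	atomic_fetch_add(&refcnt, 1);	/* can resurrect 0 -> 1 */
}

static int replay_race(void)
{
	atomic_store(&refcnt, 1);
	destroy_calls = 0;

	obj_release();		/* Time 1: refcnt 1 -> 0, destroy scheduled */
	obj_hold_buggy();	/* Time 2: buggy hold, refcnt 0 -> 1 */
	obj_release();		/* Time 4: refcnt 1 -> 0, destroy runs AGAIN */
	return destroy_calls;	/* 2 here models the double free / panic */
}
```

With an inc-not-zero hold at Time 2, the second release would never happen and destroy would run exactly once.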
But as you pointed out, atomic_read() is not safe for us. We cannot use the other atomic routines with implicit memory barriers, because they all increase or decrease the refcnt while checking whether it is 0; meanwhile, we cannot use an explicit memory barrier, as it seems too expensive for the hot path. So I tried to add another condition to prevent neigh_add_timer(), called from __neigh_event_send(), from taking a refcnt that is already zero.

> By the time you check atomic_read(), the result is meaningless if
> another CPU decrements the refcount to 0.

If you think checking whether the refcnt is zero in ip_finish_output2() is meaningless, why does __ipv4_neigh_lookup() use atomic_inc_not_zero() instead of directly using neigh_hold()?

Regards,
Ying

> So what is the point of trying to reduce a race window without
> properly removing it?
>
> I repeat: using atomic_read() in an RCU lookup is absolutely useless
> and is a clear sign you missed a fundamental issue.
>
> So now we are back to the patch I initially sent, but you did not
> address Julian Anastasov's feedback.
Hello,

On Wed, 20 May 2015, Ying Xue wrote:

> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -958,6 +958,9 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>  	if (neigh->nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE))
>  		goto out_unlock_bh;
>
> +	if (neigh->dead)
> +		goto out_unlock_bh;
> +

Returning 0 in all cases is wrong. Maybe you can goto another new label where the nud_state check can allow a valid address to be used. See my idea:

http://marc.info/?l=linux-netdev&m=142816363503402&w=2

I.e. return 0 for NUD_STALE, drop the skb and return 1 otherwise.

>  	if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {
>  		if (NEIGH_VAR(neigh->parms, MCAST_PROBES) +
>  		    NEIGH_VAR(neigh->parms, APP_PROBES)) {
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c65b93a..5889774 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -200,7 +200,7 @@ static inline int ip_finish_output2(struct sock *sk, struct sk_buff *skb)
>  	rcu_read_lock_bh();
>  	nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
>  	neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
> -	if (unlikely(!neigh))
> +	if (unlikely(!neigh || !atomic_read(&neigh->refcnt)))

You should forget about refcnt. kfree_rcu in neigh_destroy will not free the neigh while RCU readers are present. Still, neigh_destroy runs in parallel with readers, and I hope they can use the stored address safely. I mean, neigh_event_send allowing neigh_resolve_output to use the address in NUD_STALE state when dead=1, by returning 0 and reaching the dev_queue_xmit call.

Still, another inefficiency remains: how can __neigh_event_send indicate to ip_finish_output2 that a dead=1 entry is [to be] removed and a new entry needs to be created? The ->output method returns a code, but the skb is freed in all cases. The result is that we drop a single skb in such a condition. But such an optimization should not be part of this bugfix.
Regards

--
Julian Anastasov <ja@ssi.bg>
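[Editor's note] Julian's suggested return-code policy — return 0 for NUD_STALE so the stored address can still be used, drop the skb and return 1 otherwise — can be sketched as a tiny decision helper. The flag values below mirror the kernel's NUD_* defines, but the helper itself is a hypothetical illustration, not the actual kernel change:

```c
#include <assert.h>

#define NUD_INCOMPLETE	0x01
#define NUD_STALE	0x04

/* What __neigh_event_send() could return when it meets a dead entry:
 * 0 means "address still usable, transmit"; 1 means "drop the skb". */
static int event_send_on_dead(int nud_state)
{
	/* NUD_STALE still carries a previously resolved hardware address,
	 * so transmission with the stored address remains possible. */
	return (nud_state & NUD_STALE) ? 0 : 1;
}
```

Anything without a resolved address (e.g. NUD_INCOMPLETE) gets 1, matching "drop skb and return 1 otherwise".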
Hello,

On Wed, 20 May 2015, Ying Xue wrote:

> On 05/20/2015 01:27 PM, Eric Dumazet wrote:
> > Sorry, this atomic_read() makes no sense to me.
> >
> > When RCU is used, it is perfectly fine to use an object whose
> > refcount is 0. If you believe the opposite, then point me to the
> > relevant Documentation/RCU/ file.
>
> However, the reality for us is that rcu_read_lock() can guarantee that
> a neigh object is not freed within the region covered by the RCU read
> lock, but it cannot prevent neigh_destroy() from being executed in
> another context at the same time.

The situation is that one writer has decided that the entry is to be removed. A reader comes and tries to become a second writer. It should check refcnt==0 or dead==1 as in your last patch, always under the write_lock. Second and subsequent writers should not try to change state, timers, etc. Such writers are possible only if they were readers, because only readers can find an entry that has been unlinked by another writer.

And we want to keep the readers free of any memory barriers, as they can cost hundreds of clocks. We are lucky that the neigh states allow RCU readers to run without any atomic_inc_not_zero calls, because memory barriers are not cheap.

Regards

--
Julian Anastasov <ja@ssi.bg>
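[Editor's note] Julian's rule can be modeled in userspace: the writer that removes an entry sets its dead flag under the write lock, and a reader that later tries to become a second writer re-checks the flag under the same lock and backs off instead of re-arming timers. This is an illustrative sketch with a pthread mutex standing in for the neigh write_lock; all names are hypothetical:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct entry {
	pthread_mutex_t lock;	/* stand-in for the neigh write_lock */
	bool dead;
	bool timer_armed;	/* stand-in for neigh_add_timer() side effects */
};

/* First writer: unlink the entry and mark it dead, under the lock. */
static void entry_teardown(struct entry *e)
{
	pthread_mutex_lock(&e->lock);
	e->dead = true;
	pthread_mutex_unlock(&e->lock);
}

/* Would-be second writer (the __neigh_event_send() analogue):
 * re-check 'dead' under the lock and back off if set. */
static bool entry_event_send(struct entry *e)
{
	bool ok;

	pthread_mutex_lock(&e->lock);
	ok = !e->dead;
	if (ok)
		e->timer_armed = true;	/* safe: entry still live */
	pthread_mutex_unlock(&e->lock);
	return ok;
}
```

Because the flag is only read and written under the lock, no extra memory barriers are needed on the lockless read side, which is the point of putting the check in __neigh_event_send() rather than in ip_finish_output2().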
On 05/20/2015 04:07 PM, Julian Anastasov wrote:
>
> Hello,
>
> On Wed, 20 May 2015, Ying Xue wrote:
>
>> On 05/20/2015 01:27 PM, Eric Dumazet wrote:
>>> Sorry, this atomic_read() makes no sense to me.
>>>
>>> When RCU is used, it is perfectly fine to use an object whose
>>> refcount is 0. If you believe the opposite, then point me to the
>>> relevant Documentation/RCU/ file.
>>
>> However, the reality for us is that rcu_read_lock() can guarantee
>> that a neigh object is not freed within the region covered by the RCU
>> read lock, but it cannot prevent neigh_destroy() from being executed
>> in another context at the same time.
>
> The situation is that one writer has decided that the entry is to be
> removed. A reader comes and tries to become a second writer. It should
> check refcnt==0 or dead==1 as in your last patch, always under the
> write_lock.

Yes, this way is absolutely safe for us! But, for example, if we check refcnt==0 under write_lock protection, the check comes a bit late. Instead, if the check is moved up into ip_finish_output2(), we can find out _early_ that we cannot use the neigh entry looked up by __ipv4_neigh_lookup_noref(). Of course, doing the check with atomic_read() is _really_ unsafe, but once atomic_read() tells us the neigh's refcnt is zero, the result is definitely true, because the refcnt is only ever decremented from a value bigger than 0 down to 0. So if atomic_read() tells us a neigh's refcnt is 0, we should definitely create a new one; on the contrary, if it tells us the refcnt is not zero, that does not mean the refcnt is really nonzero, because atomic_read() might have read a stale value. In that situation the condition !atomic_read(&neigh->refcnt) is useless for us. This is why, in version 2, I added the additional check of dead==1 to prevent the race. And since the check of dead==1 is done under the write lock, it is absolutely safe for us.
> Second and subsequent writers should not try to change state, timers,
> etc. Such writers are possible only if they were readers, because only
> readers can find an entry that has been unlinked by another writer.
>
> And we want to keep the readers free of any memory barriers, as they
> can cost hundreds of clocks. We are lucky that the neigh states allow
> RCU readers to run without any atomic_inc_not_zero calls, because
> memory barriers are not cheap.

Yes, I agree with you.

Regards,
Ying

> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>
On 05/20/2015 03:35 PM, Julian Anastasov wrote:
>
> Hello,
>
> On Wed, 20 May 2015, Ying Xue wrote:
>
>> --- a/net/core/neighbour.c
>> +++ b/net/core/neighbour.c
>> @@ -958,6 +958,9 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>>  	if (neigh->nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE))
>>  		goto out_unlock_bh;
>>
>> +	if (neigh->dead)
>> +		goto out_unlock_bh;
>> +
>
> Returning 0 in all cases is wrong.

Good catch! Yes, you are right. We should return 1, especially when neigh->dead == 1.

> Maybe you can goto another new label where the nud_state check can
> allow a valid address to be used. See my idea:
>
> http://marc.info/?l=linux-netdev&m=142816363503402&w=2
>
> I.e. return 0 for NUD_STALE, drop the skb and return 1 otherwise.
>
>>  	if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {
>>  		if (NEIGH_VAR(neigh->parms, MCAST_PROBES) +
>>  		    NEIGH_VAR(neigh->parms, APP_PROBES)) {
>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> index c65b93a..5889774 100644
>> --- a/net/ipv4/ip_output.c
>> +++ b/net/ipv4/ip_output.c
>> @@ -200,7 +200,7 @@ static inline int ip_finish_output2(struct sock *sk, struct sk_buff *skb)
>>  	rcu_read_lock_bh();
>>  	nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
>>  	neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
>> -	if (unlikely(!neigh))
>> +	if (unlikely(!neigh || !atomic_read(&neigh->refcnt)))
>
> You should forget about refcnt. kfree_rcu in neigh_destroy will not
> free the neigh while RCU readers are present. Still, neigh_destroy
> runs in parallel with readers, and I hope they can use the stored
> address safely.

I know that touching a neigh entry in ip_finish_output2() under the RCU lock, without checking whether atomic_read(&neigh->refcnt)==0, is absolutely safe for us. But to some extent the issue is not closely related to the RCU lock itself. The key problem for us is how to efficiently prevent the neigh refcnt from being increased from 0 to 1 on the dst_neigh_output() path.
Regards,
Ying

> I mean, neigh_event_send allowing neigh_resolve_output to use the
> address in NUD_STALE state when dead=1, by returning 0 and reaching
> the dev_queue_xmit call.
>
> Still, another inefficiency remains: how can __neigh_event_send
> indicate to ip_finish_output2 that a dead=1 entry is [to be] removed
> and a new entry needs to be created? The ->output method returns a
> code, but the skb is freed in all cases. The result is that we drop a
> single skb in such a condition. But such an optimization should not be
> part of this bugfix.
>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>
On Wed, 2015-05-20 at 15:01 +0800, Ying Xue wrote:
> Time 4, CPU 0:
>   Someone calls neigh_release() on the neigh whose refcnt was
>   increased from 0 to 1. As the neigh is already freed, a panic
>   happens!!!

There is a bug for sure, although I have never seen it happen. I am only saying your 'fix' is not appropriate. You focus on refcnt, which is the wrong thing to test in this context. Even if we decided to remove RCU locking and go back to the rwlock, refcnt would not be the thing to test.
Hello,

On Wed, 20 May 2015, Ying Xue wrote:

> Yes, this way is absolutely safe for us! But, for example, if we check
> refcnt==0 under write_lock protection, the check comes a bit late.
> Instead, if the check is moved up into ip_finish_output2(), we can
> find out _early_ that we cannot use the neigh entry looked up by
> __ipv4_neigh_lookup_noref(). Of course, doing the check with
> atomic_read() is _really_ unsafe, but once atomic_read() tells us the
> neigh's refcnt is zero, the result is definitely true.

The problem is that this atomic_read is just extra code that works only in rare cases. Modern CPUs have 128-byte cache lines, and 'dead', 'refcnt' and 'next' can be in the same cache line. If you take into account the write-back policy used by CPU caches and the cache locking effects, in most cases when such a reader finds the entry in the list, the refcnt will be > 0 and dead will be 0. The reason: the write-back policy makes the whole cache line visible to other CPUs at once. So, without explicit barriers, other CPUs can see the data after a delay, and the order of stores within the same cache line is not guaranteed to be observed. As a result, a pure reader may never see 0 in refcnt, even if you put more atomic_read calls at this place. atomic_read contains only hints to the compiler.

So, we must ensure that if a writer decides to remove an entry, other writers do not change things. More observations about the writers:

- neigh_lookup is initially a reader, but it can increase refcnt at any time, even while another CPU holds the write_lock. This makes all the refcnt checks in neigh_forced_gc, neigh_flush_dev and neigh_periodic_work racy, because neigh_cleanup_and_release is called after write_unlock. If a writer decides to remove the entry, other write operations may still try to modify it, because they rely on their own reference. What can happen is that refcnt is 1 when it is decided to remove the entry, but the refcnt can then be increased by concurrent writers such as neigh_lookup callers.
  As a result, neigh_lookup followed by neigh_update may work on a dead=1 entry. Maybe the fix is to add a check of the dead flag in neigh_update, after both write_lock_bh calls, to avoid modifications to a removed entry. This again happens too late, i.e. __neigh_lookup calls neigh_lookup (which does not notice the dead flag early), we miss the chance to call neigh_create, and then neigh_update fails on dead==1.

- neigh_timer_handler looks safe; it may not need a check of the dead flag, because if neigh_flush_dev calls neigh_del_timer and the del_timer inside neigh_del_timer fails with 0 (due to another CPU stuck at write_lock in neigh_timer_handler), the 'if (!(state & NUD_IN_TIMER))' check in neigh_timer_handler should succeed, because neigh_flush_dev changes nud_state on success of the 'if (atomic_read(&n->refcnt) != 1)' check.

- __neigh_event_send: needs a check of the dead flag, as already discussed.

Regards

--
Julian Anastasov <ja@ssi.bg>
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 3de6542..c7a675c 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -958,6 +958,9 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
 	if (neigh->nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE))
 		goto out_unlock_bh;
 
+	if (neigh->dead)
+		goto out_unlock_bh;
+
 	if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {
 		if (NEIGH_VAR(neigh->parms, MCAST_PROBES) +
 		    NEIGH_VAR(neigh->parms, APP_PROBES)) {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c65b93a..5889774 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -200,7 +200,7 @@ static inline int ip_finish_output2(struct sock *sk, struct sk_buff *skb)
 	rcu_read_lock_bh();
 	nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
 	neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
-	if (unlikely(!neigh))
+	if (unlikely(!neigh || !atomic_read(&neigh->refcnt)))
 		neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
 	if (!IS_ERR(neigh)) {
 		int res = dst_neigh_output(dst, neigh, skb);
Calling __ipv4_neigh_lookup_noref() inside rcu_read_lock_bh() guarantees that the neighbour entry it finds is not freed before the RCU grace period ends, but it cannot ensure that the entry is not already being destroyed. More precisely, it cannot prevent neigh_destroy() from running in another context at the same time. For example, if ip_finish_output2() continues to deliver an skb with a neighbour entry whose refcount is zero, neigh_add_timer() may subsequently be called from neigh_resolve_output(). As a result, neigh_add_timer() takes a refcount on a neighbour whose refcount had already reached zero. When that refcount is put before the timer handler exits, neigh_destroy() is called a second time, and the kernel crashes at that moment.

To prevent the issue, we must check whether the refcount of the neighbour found by __ipv4_neigh_lookup_noref() has already dropped to zero. If it is zero, we should create a new entry instead.

However, reading the neigh's refcount with atomic_read() is unsafe, as it implies no memory barrier, and enforcing a proper implicit or explicit memory barrier would be too expensive. Therefore an additional check of the neigh's dead flag is added to __neigh_event_send() to further prevent neigh_add_timer() from taking a refcount on a neigh whose refcount has already hit zero, so that the issue cannot happen at all.

Reported-by: Joern Engel <joern@logfs.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
v2:
 - As Eric pointed out, checking whether a neigh's refcnt is zero with
   atomic_read() is unsafe, so an additional check of the neigh's dead
   flag is added to __neigh_event_send() to further prevent
   neigh_add_timer() from taking a refcnt that has already hit zero.
 - The patch is now based on the "net" tree, considering this is a very
   serious issue.
net/core/neighbour.c | 3 +++
net/ipv4/ip_output.c | 2 +-
2 files changed, 4 insertions(+), 1 deletion(-)