diff mbox

[3/4] xfrm: remove policy lock when accessing policy->walk.dead

Message ID 1270030626-16687-5-git-send-email-timo.teras@iki.fi
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Timo Teras March 31, 2010, 10:17 a.m. UTC
All of the code considers ->dead as a hint that the cached policy
needs to get refreshed. The read side can just drop the read lock
without any side effects.

The write side needs to make sure that it's written only exactly
once. Only possible race is at xfrm_policy_kill(). This is fixed
by checking result of __xfrm_policy_unlink() when needed. It will
always succeed if the policy object is looked up from the hash
list (so some checks are removed), but it needs to be checked if
we are trying to unlink policy via a reference (appropriate
checks added).

Since policy->walk.dead is written exactly once, it no longer
needs to be protected with a write lock.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
 net/xfrm/xfrm_policy.c |   31 +++++++++----------------------
 net/xfrm/xfrm_user.c   |    6 +-----
 2 files changed, 10 insertions(+), 27 deletions(-)

Comments

Herbert Xu March 31, 2010, 11:03 a.m. UTC | #1
On Wed, Mar 31, 2010 at 01:17:05PM +0300, Timo Teras wrote:
> All of the code considers ->dead as a hint that the cached policy
> needs to get refreshed. The read side can just drop the read lock
> without any side effects.
> 
> The write side needs to make sure that it's written only exactly
> once. Only possible race is at xfrm_policy_kill(). This is fixed
> by checking result of __xfrm_policy_unlink() when needed. It will
> always succeed if the policy object is looked up from the hash
> list (so some checks are removed), but it needs to be checked if
> we are trying to unlink policy via a reference (appropriate
> checks added).
> 
> Since policy->walk.dead is written exactly once, it no longer
> needs to be protected with a write lock.
> 
> Signed-off-by: Timo Teras <timo.teras@iki.fi>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

> @@ -794,10 +783,9 @@ int xfrm_policy_flush(struct net *net, u8 type, struct xfrm_audit *audit_info)
>  				     &net->xfrm.policy_inexact[dir], bydst) {
>  			if (pol->type != type)
>  				continue;
> -			dp = __xfrm_policy_unlink(pol, dir);
> +			__xfrm_policy_unlink(pol, dir);
>  			write_unlock_bh(&xfrm_policy_lock);
> -			if (dp)
> -				cnt++;
> +			cnt++;
>  
>  			xfrm_audit_policy_delete(pol, 1, audit_info->loginuid,
>  						 audit_info->sessionid,

I was intrigued to find that my local source never had this dp
stuff.

It seems that this was added recently:

commit 2f1eb65f366b81aa3c22c31e6e8db26168777ec5
Author: Jamal Hadi Salim <hadi@cyberus.ca>
Date:   Fri Feb 19 02:00:42 2010 +0000

    xfrm: Flushing empty SPD generates false events

    To see the effect make sure you have an empty SPD.
    On window1 "ip xfrm mon" and on window2 issue "ip xfrm policy flush"
    You get prompt back in window2 and you see the flush event on window1.
    With this fix, you still get prompt on window1 but no event on window2.

    Thanks to Alexey Dobriyan for finding a bug in earlier version
    when using pfkey to do the flushing.

    Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
    Signed-off-by: David S. Miller <davem@davemloft.net>

This seems to be bogus to me.  Just because the DB was empty
before the flush doesn't mean that the flush didn't happen.

We should revert this patch in its entirety.

Thanks,
jamal March 31, 2010, 1:06 p.m. UTC | #2
On Wed, 2010-03-31 at 19:03 +0800, Herbert Xu wrote:

> This seems to be bogus to me.  Just because the DB was empty
> before the flush doesn't mean that the flush didn't happen.

Herbert, If a tree falls in a forest and no one is around to hear it,
does it make a sound? ;->
A flush event is meant to be a signal to user space that what
was once a non-empty table is now empty. 
This is a consistent definition of the semantics everywhere tables
are flushed (not just in Linux)..

What makes the SPD and SAD speacial?

cheers,
jamal



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 31, 2010, 1:11 p.m. UTC | #3
On Wed, Mar 31, 2010 at 09:06:13AM -0400, jamal wrote:
> On Wed, 2010-03-31 at 19:03 +0800, Herbert Xu wrote:
> 
> > This seems to be bogus to me.  Just because the DB was empty
> > before the flush doesn't mean that the flush didn't happen.
> 
> Herbert, If a tree falls in a forest and no one is around to hear it,
> does it make a sound? ;->
> A flush event is meant to be a signal to user space that what
> was once a non-empty table is now empty. 

I disagree.  A flush event is a signal that someone has sent a
flush command.  In any case we've had this semantics for years
and I haven't heard a good reason why this should be changed.

> This is a consistent definition of the semantics everywhere tables
> are flushed (not just in Linux)..

Please give specific examples in the kernel.

Cheers,
Herbert Xu March 31, 2010, 1:26 p.m. UTC | #4
On Wed, Mar 31, 2010 at 09:11:31PM +0800, Herbert Xu wrote:
> 
> > This is a consistent definition of the semantics everywhere tables
> > are flushed (not just in Linux)..
> 
> Please give specific examples in the kernel.

In fact the previous behaviour is also consistent with RFC2367:

3.1.9 SADB_FLUSH

   The SADB_FLUSH message causes the kernel to delete all entries in its
   key table for a certain sadb_msg_satype.  Only the base header is
   required for a flush message.  If sadb_msg_satype is filled in with a
   specific value, only associations of that type are deleted.  If it is
   filled in with SADB_SATYPE_UNSPEC, ALL associations are deleted.

     The messaging behavior for SADB_FLUSH is:

           Send an SADB_FLUSH message from a user process to the kernel.

           <base>

           The kernel will return an SADB_FLUSH message to all listening
           sockets.

           <base>

           The reply message happens only after the actual flushing
           of security associations has been attempted.

There is no special treatment for an empty DB.

Please send a revert.

Thanks,
jamal March 31, 2010, 1:28 p.m. UTC | #5
On Wed, 2010-03-31 at 21:11 +0800, Herbert Xu wrote:

> I disagree.  A flush event is a signal that someone has sent a
> flush command.  

Sorry - I respectfully disagree.

> In any case we've had this semantics for years
> and I haven't heard a good reason why this should be changed.

It generates unnecessary noise and it is a deviation like i mentioned.

> > This is a consistent definition of the semantics everywhere tables
> > are flushed (not just in Linux)..
> 
> Please give specific examples in the kernel.

Something i can do safely right now without messing my connection; 
Issue iproute commands in one window, observe events in another

-sudo ip route add 192.168.11.100 dev eth0 table 15
generates an event
-sudo ip route flush table 15
generates an event
-sudo ip route flush table 15
No event

But pick anything else in the other netlink knowledgeable subsystem
and youd see similar behavior.

If there was an app depending on this behavior - thats a separate reason
(but thats not the arguement you are making).

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
jamal March 31, 2010, 1:32 p.m. UTC | #6
On Wed, 2010-03-31 at 21:26 +0800, Herbert Xu wrote:


> In fact the previous behaviour is also consistent with RFC2367:

I did not touch pfkey. That behavior remains there. 

cheers,
jamal


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
jamal March 31, 2010, 1:39 p.m. UTC | #7
On Wed, 2010-03-31 at 09:32 -0400, jamal wrote:

> I did not touch pfkey. That behavior remains there. 

Sorry - I lied. I did touch pfkey. Here was my reasoning.

RFC 2367 says flushing behavior should be:
1) user space -> kernel: flush
2) kernel: flush
3) kernel -> user space: flush event to ALL listeners

This is not realistic today in the presence of selinux policies
which may reject the flush etc. So we make the sequence become:
1) user space -> kernel: flush
2) kernel: flush
3) kernel -> user space: flush response to originater from #1
4) if there were no errors then:
kernel -> user space: flush event to ALL listeners

This was in the logs.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
jamal March 31, 2010, 1:41 p.m. UTC | #8
On Wed, 2010-03-31 at 09:39 -0400, jamal wrote:
> 
> On Wed, 2010-03-31 at 09:32 -0400, jamal wrote:
> 
> > I did not touch pfkey. That behavior remains there. 
> 
> Sorry - I lied. I did touch pfkey. Here was my reasoning.

And what I meant by not touching is that "the behavior there
remains as it was before"

cheers,
jamal

> RFC 2367 says flushing behavior should be:
> 1) user space -> kernel: flush
> 2) kernel: flush
> 3) kernel -> user space: flush event to ALL listeners
> 
> This is not realistic today in the presence of selinux policies
> which may reject the flush etc. So we make the sequence become:
> 1) user space -> kernel: flush
> 2) kernel: flush
> 3) kernel -> user space: flush response to originater from #1
> 4) if there were no errors then:
> kernel -> user space: flush event to ALL listeners
> 
> This was in the logs.
> 
> cheers,
> jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 31, 2010, 1:53 p.m. UTC | #9
On Wed, Mar 31, 2010 at 09:28:12AM -0400, jamal wrote:
> 
> -sudo ip route add 192.168.11.100 dev eth0 table 15
> generates an event
> -sudo ip route flush table 15
> generates an event
> -sudo ip route flush table 15
> No event

That's completely different.  We don't have a route flush event,
instead we're sending route delete events.  That's why when the
table is empty you get no events.

If we had a route flush event then it would behave exactly the
same.

BTW you've also made xfrm_state_flush inconsistent with respect
to xfrm_policy_flush.

Dave, please revert this patch.

Thanks,
Herbert Xu March 31, 2010, 1:55 p.m. UTC | #10
On Wed, Mar 31, 2010 at 09:39:55AM -0400, jamal wrote:
> 
> This is not realistic today in the presence of selinux policies
> which may reject the flush etc. So we make the sequence become:
> 1) user space -> kernel: flush
> 2) kernel: flush
> 3) kernel -> user space: flush response to originater from #1
> 4) if there were no errors then:
> kernel -> user space: flush event to ALL listeners

Eliding the notification if SELinux says so is fine, but eliding
it because the table is empty is wrong.

The flush did not fail just because the table was empty to begin
with.

Cheers,
Herbert Xu March 31, 2010, 1:56 p.m. UTC | #11
On Wed, Mar 31, 2010 at 09:41:23AM -0400, jamal wrote:
> 
> On Wed, 2010-03-31 at 09:39 -0400, jamal wrote:
> > 
> > On Wed, 2010-03-31 at 09:32 -0400, jamal wrote:
> > 
> > > I did not touch pfkey. That behavior remains there. 
> > 
> > Sorry - I lied. I did touch pfkey. Here was my reasoning.
> 
> And what I meant by not touching is that "the behavior there
> remains as it was before"

No you've changed it.  PF_KEY will no longer notify if the policy
table is empty.

This is inconsistent with the behaviour of SADB flushes, and the
spirit of RFC2367.

Cheers,
jamal March 31, 2010, 2:12 p.m. UTC | #12
On Wed, 2010-03-31 at 21:55 +0800, Herbert Xu wrote:

> Eliding the notification if SELinux says so is fine, but eliding
> it because the table is empty is wrong.
> 
> The flush did not fail just because the table was empty to begin
> with.

Like i said i didnt touch the behavior except for the selinux case
(which sounds very reasonable). I believe there maybe historical legacy
reasons for that semantic in pfkey. 
Can you point to something in the kernel (or anywhere else) that behaves
like this on table flushing? Actually if there was an app that depended
on netlink flush being exposed on empty table - then i think theres
reason for a revert.
Other than that i will say again: i respectfully disagree.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
jamal March 31, 2010, 2:15 p.m. UTC | #13
On Wed, 2010-03-31 at 21:56 +0800, Herbert Xu wrote:

> 
> No you've changed it.  PF_KEY will no longer notify if the policy
> table is empty.
> 
> This is inconsistent with the behaviour of SADB flushes, and the
> spirit of RFC2367.

I did not mean to change it for pfkey. I do believe there are apps that
need it.
I will run some tests and if it breaks - I will send a patch.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 31, 2010, 2:15 p.m. UTC | #14
On Wed, Mar 31, 2010 at 10:12:48AM -0400, jamal wrote:
> On Wed, 2010-03-31 at 21:55 +0800, Herbert Xu wrote:
> 
> > Eliding the notification if SELinux says so is fine, but eliding
> > it because the table is empty is wrong.
> > 
> > The flush did not fail just because the table was empty to begin
> > with.
> 
> Like i said i didnt touch the behavior except for the selinux case
> (which sounds very reasonable). I believe there maybe historical legacy
> reasons for that semantic in pfkey. 
> Can you point to something in the kernel (or anywhere else) that behaves
> like this on table flushing? Actually if there was an app that depended
> on netlink flush being exposed on empty table - then i think theres
> reason for a revert.
> Other than that i will say again: i respectfully disagree.

OK I give up.

Dave can keep or revert this as he likes.

Cheers,
jamal March 31, 2010, 2:24 p.m. UTC | #15
On Wed, 2010-03-31 at 22:15 +0800, Herbert Xu wrote:

> 
> OK I give up.
> 
> Dave can keep or revert this as he likes.

Ok - I will await to see what Dave thinks before i bother doing whatever
setup. What we have here is a philosophical difference.
Herbert note: i realize this is unfair for me to say given you were
likely very busy: I respect your opinion and knowledge and did CC you on
all the many mails when i was RFCing this and when i sent the patches. 

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 31, 2010, 2:29 p.m. UTC | #16
On Wed, Mar 31, 2010 at 10:24:38AM -0400, jamal wrote:
> On Wed, 2010-03-31 at 22:15 +0800, Herbert Xu wrote:
> 
> > 
> > OK I give up.
> > 
> > Dave can keep or revert this as he likes.
> 
> Ok - I will await to see what Dave thinks before i bother doing whatever
> setup. What we have here is a philosophical difference.
> Herbert note: i realize this is unfair for me to say given you were
> likely very busy: I respect your opinion and knowledge and did CC you on
> all the many mails when i was RFCing this and when i sent the patches. 

Yes you did cc me.  Unfortunately I was rather busy with some other
stuff at the time.  Had I read your patch back then I would've said
something :)

Cheers,
jamal March 31, 2010, 2:38 p.m. UTC | #17
On Wed, 2010-03-31 at 22:29 +0800, Herbert Xu wrote:

> Yes you did cc me.  Unfortunately I was rather busy with some other
> stuff at the time.  Had I read your patch back then I would've said
> something :)

makes sense ;-> And it wouldnt have been like the first (or the last)
time you and i have had this heated discussions;->
Lets just wait to hear what Dave says - I dont want make his life more
difficult than it is; so no hard feeling if he decides to revert.


cheers,
jamal

PS:- since you woke me up and i know you are awake, 
i am going to get rid of a jetlag by sending
probably another controversial patch. Hmmm.. where to start.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrick McHardy March 31, 2010, 4:41 p.m. UTC | #18
Herbert Xu wrote:
> On Wed, Mar 31, 2010 at 09:28:12AM -0400, jamal wrote:
>> -sudo ip route add 192.168.11.100 dev eth0 table 15
>> generates an event
>> -sudo ip route flush table 15
>> generates an event
>> -sudo ip route flush table 15
>> No event
> 
> That's completely different.  We don't have a route flush event,
> instead we're sending route delete events.  That's why when the
> table is empty you get no events.
> 
> If we had a route flush event then it would behave exactly the
> same.

I was just about to say the exact same thing when I noticed your
email. I agree with Herbert, the flush notification indicates that
the table is now empty, independant of its previous state.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
jamal March 31, 2010, 5:47 p.m. UTC | #19
On Wed, 2010-03-31 at 18:41 +0200, Patrick McHardy wrote:

> I agree with Herbert, the flush notification indicates that
> the table is now empty, independant of its previous state.

What purpose does it serve?

As an example, if i delete an entry, the fact i deleted an entry is 
of interest to some manager in user space for the purpose of syncing.
If i brought a link down, same thing. If i brought a link down
that was already down - why would that be of interest to generate
as an event? etc.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 31, 2010, 8:52 p.m. UTC | #20
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 31 Mar 2010 21:53:31 +0800

> Dave, please revert this patch.

Herbert I actually don't agree with you on this one, sorry :-)

If no semantic change occurred to the policy or state databases,
there is zero reason to report the flush event, it's noise.

Like Jamal, I too am pretty frustrated with how noisy and unusable the
netlink messages are to watch when trying to debug something and this
behavior change should break no applications whatsoever.

When you have evidence of a real existing case failing because of
Jamal's changes (not some case you crafted specifically to show that
an empty flush doesn't get reported), I'll consider a revert or ask
Jamal to fix that case up.

Until then, Jamal's change is staying in the tree.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 31, 2010, 8:54 p.m. UTC | #21
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 31 Mar 2010 21:56:29 +0800

> This is inconsistent with the behaviour of SADB flushes, and the
> spirit of RFC2367.

Want to have an even larger list of areas where we're not compliant
with that RFC2367 that everyone has to extend and makes changes to in
order to be useful? :-)

Herbert, you have to show that something breaks because of this,
because the new behavior helps in diagnosis and efficiency and the old
behavior has been a serious pet peeve of mine just as it was for
Jamal.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 31, 2010, 8:57 p.m. UTC | #22
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 31 Mar 2010 22:15:25 +0800

> Dave can keep or revert this as he likes.

Herbert, if it doesn't break anything, I feel the gains are very
worthwhile.

When we have a breakage report, I will undo this or fix it
immediately.  Until then, we have to be cognizant of what positives we
get out of this.

Have you actually tried to monitor the IPSEC netlink socket events
when a daemon is running?  It's way too painful before Jamal's
changes, and we already emit way too many semantically empty events in
netlink.

Minimization of the noise is a good thing, if it can legally be done,
which I believe is the case here.  And this applies to pfkey events
too, because processing those is even more expensive on the
application side.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu April 1, 2010, 12:22 a.m. UTC | #23
On Wed, Mar 31, 2010 at 01:57:35PM -0700, David Miller wrote:
> 
> Herbert, if it doesn't break anything, I feel the gains are very
> worthwhile.

OK, in that case please change xfrm_state_flush too.  Right now
it also notifies on empty.

> Have you actually tried to monitor the IPSEC netlink socket events
> when a daemon is running?  It's way too painful before Jamal's
> changes, and we already emit way too many semantically empty events in
> netlink.

Honestly I can't see how eliminating an empty flush event helps.
Flushes are very rare.

Cheers,
jamal April 1, 2010, 2:19 a.m. UTC | #24
On Thu, 2010-04-01 at 08:22 +0800, Herbert Xu wrote:

> OK, in that case please change xfrm_state_flush too.  Right now
> it also notifies on empty.

It shouldnt. I will double check it tomorrow..

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
jamal April 1, 2010, 10:53 a.m. UTC | #25
Ive tried it with net-next which is derived of 2.6.34-rc1
and it seems to work as expected
i.e. only the flush on a non-empty SAD is event-ed.
Probably older kernel you are running?

cheers,
jamal

On Wed, 2010-03-31 at 22:19 -0400, jamal wrote:
> On Thu, 2010-04-01 at 08:22 +0800, Herbert Xu wrote:
> 
> > OK, in that case please change xfrm_state_flush too.  Right now
> > it also notifies on empty.
> 
> It shouldnt. I will double check it tomorrow..
> 
> cheers,
> jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrick McHardy April 1, 2010, 11:24 a.m. UTC | #26
jamal wrote:
> On Wed, 2010-03-31 at 18:41 +0200, Patrick McHardy wrote:
> 
>> I agree with Herbert, the flush notification indicates that
>> the table is now empty, independant of its previous state.
> 
> What purpose does it serve?
> 
> As an example, if i delete an entry, the fact i deleted an entry is 
> of interest to some manager in user space for the purpose of syncing.
> If i brought a link down, same thing. If i brought a link down
> that was already down - why would that be of interest to generate
> as an event? etc.

That's true. Since both pfkey and xfrm process messages synchronously,
there shouldn't be any need for this. In fact I couldn't even find
a single keying daemon that cares about this message.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 843e066..82789cf 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -156,7 +156,7 @@  static void xfrm_policy_timer(unsigned long data)
 
 	read_lock(&xp->lock);
 
-	if (xp->walk.dead)
+	if (unlikely(xp->walk.dead))
 		goto out;
 
 	dir = xfrm_policy_id2dir(xp->index);
@@ -297,17 +297,7 @@  static DECLARE_WORK(xfrm_policy_gc_work, xfrm_policy_gc_task);
 
 static void xfrm_policy_kill(struct xfrm_policy *policy)
 {
-	int dead;
-
-	write_lock_bh(&policy->lock);
-	dead = policy->walk.dead;
 	policy->walk.dead = 1;
-	write_unlock_bh(&policy->lock);
-
-	if (unlikely(dead)) {
-		WARN_ON(1);
-		return;
-	}
 
 	spin_lock_bh(&xfrm_policy_gc_lock);
 	hlist_add_head(&policy->bydst, &xfrm_policy_gc_list);
@@ -776,7 +766,6 @@  xfrm_policy_flush_secctx_check(struct net *net, u8 type, struct xfrm_audit *audi
 int xfrm_policy_flush(struct net *net, u8 type, struct xfrm_audit *audit_info)
 {
 	int dir, err = 0, cnt = 0;
-	struct xfrm_policy *dp;
 
 	write_lock_bh(&xfrm_policy_lock);
 
@@ -794,10 +783,9 @@  int xfrm_policy_flush(struct net *net, u8 type, struct xfrm_audit *audit_info)
 				     &net->xfrm.policy_inexact[dir], bydst) {
 			if (pol->type != type)
 				continue;
-			dp = __xfrm_policy_unlink(pol, dir);
+			__xfrm_policy_unlink(pol, dir);
 			write_unlock_bh(&xfrm_policy_lock);
-			if (dp)
-				cnt++;
+			cnt++;
 
 			xfrm_audit_policy_delete(pol, 1, audit_info->loginuid,
 						 audit_info->sessionid,
@@ -816,10 +804,9 @@  int xfrm_policy_flush(struct net *net, u8 type, struct xfrm_audit *audit_info)
 					     bydst) {
 				if (pol->type != type)
 					continue;
-				dp = __xfrm_policy_unlink(pol, dir);
+				__xfrm_policy_unlink(pol, dir);
 				write_unlock_bh(&xfrm_policy_lock);
-				if (dp)
-					cnt++;
+				cnt++;
 
 				xfrm_audit_policy_delete(pol, 1,
 							 audit_info->loginuid,
@@ -1132,6 +1119,9 @@  int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol)
 		__xfrm_policy_link(pol, XFRM_POLICY_MAX+dir);
 	}
 	if (old_pol)
+		/* Unlinking succeeds always. This is the only function
+		 * allowed to delete or replace socket policy.
+		 */
 		__xfrm_policy_unlink(old_pol, XFRM_POLICY_MAX+dir);
 	write_unlock_bh(&xfrm_policy_lock);
 
@@ -1737,11 +1727,8 @@  restart:
 			goto error;
 		}
 
-		for (pi = 0; pi < npols; pi++) {
-			read_lock_bh(&pols[pi]->lock);
+		for (pi = 0; pi < npols; pi++)
 			pol_dead |= pols[pi]->walk.dead;
-			read_unlock_bh(&pols[pi]->lock);
-		}
 
 		write_lock_bh(&policy->lock);
 		if (unlikely(pol_dead || stale_bundle(dst))) {
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index da5ba86..a267fbd 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1770,13 +1770,9 @@  static int xfrm_add_pol_expire(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (xp == NULL)
 		return -ENOENT;
 
-	read_lock(&xp->lock);
-	if (xp->walk.dead) {
-		read_unlock(&xp->lock);
+	if (unlikely(xp->walk.dead))
 		goto out;
-	}
 
-	read_unlock(&xp->lock);
 	err = 0;
 	if (up->hard) {
 		uid_t loginuid = NETLINK_CB(skb).loginuid;