
netfilter: use per-CPU recursive lock {XV}

Message ID 49F4A6E3.7080102@cosmosbay.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Eric Dumazet April 26, 2009, 6:24 p.m. UTC
From: Stephen Hemminger <shemminger@vyatta.com>

> Epilogue due to master Jarek. Lockdep carest not about the locking
> doth bestowed. Therefore no keys are needed.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

So far, so good, should be ready for inclusion now, nobody complained :)

I include the final patch, merge of your last two patches.

David, could you please review it once again and apply it if it's OK ?

Thanks to all for your help and patience

[PATCH] netfilter: use per-CPU recursive lock {XV}

In days of old in 2.6.29, netfilter did locketh using a 
lock of the reader kind when doing its table business, and do
a writer when with pen in hand like a overworked accountant
did replace the tables. This sucketh and caused the single
lock to fly back and forth like a poor errant boy.

But then netfilter was blessed with RCU and the performance
was divine, but alas there were those that suffered for
trying to replace their many rules one at a time.

So now RCU must be vanquished from the scene, and better
chastity belts be placed upon this valuable asset most dear.
The locks that were but one are now replaced by one per suitor.

The repair was made after much discussion involving
Eric the wise, and Linus the foul. With flowers springing
up amid the thorns some peace has finally prevailed and
all is soothed. This patch and purple prose was penned
in honor of "Talk like Shakespeare" day.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
What hath changed over the last two setting suns:

  * more words, mostly correct...

  * no need to locketh for writeh on current cpu tis 
    always so

  * the locking of all cpu's on replace is always done as
    part of the get_counters cycle, so the synchronize swip
    in replace tables is gone with only a comment remaining

  * Epilogue due to master Jarek. Lockdep carest not about
    the locking doth bestowed. Therefore no keys are needed.

 include/linux/netfilter/x_tables.h |   55 ++++++++++-
 net/ipv4/netfilter/arp_tables.c    |  125 +++++++-------------------
 net/ipv4/netfilter/ip_tables.c     |  126 +++++++--------------------
 net/ipv6/netfilter/ip6_tables.c    |  123 +++++++-------------------
 net/netfilter/x_tables.c           |   50 +++++-----
 5 files changed, 183 insertions(+), 296 deletions(-)


Comments

Mathieu Desnoyers April 26, 2009, 6:56 p.m. UTC | #1
* Eric Dumazet (dada1@cosmosbay.com) wrote:
> From: Stephen Hemminger <shemminger@vyatta.com>
> 
> > Epilogue due to master Jarek. Lockdep carest not about the locking
> > doth bestowed. Therefore no keys are needed.
> > 
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> So far, so good, should be ready for inclusion now, nobody complained :)
> 
> I include the final patch, merge of your last two patches.
> 
> David, could you please review it once again and apply it if it's OK ?
> 
> Thanks to all for your help and patience
> 
> [PATCH] netfilter: use per-CPU recursive lock {XV}

Hi Eric,

Please... could you rename this patch according to Linus' comments?

Suitable name would probably be :

[PATCH] netfilter: use bh disabling with per-cpu read-write lock

> 
> In days of old in 2.6.29, netfilter did locketh using a 
> lock of the reader kind when doing its table business, and do
> a writer when with pen in hand like a overworked accountant
> did replace the tables. This sucketh and caused the single
> lock to fly back and forth like a poor errant boy.
> 
> But then netfilter was blessed with RCU and the performance
> was divine, but alas there were those that suffered for
> trying to replace their many rules one at a time.
> 
> So now RCU must be vanquished from the scene, and better
> chastity belts be placed upon this valuable asset most dear.
> The locks that were but one are now replaced by one per suitor.
> 
> The repair was made after much discussion involving
> Eric the wise, and Linus the foul. With flowers springing
> up amid the thorns some peace has finally prevailed and
> all is soothed. This patch and purple prose was penned by
> in honor of "Talk like Shakespeare" day.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
> What hath changed over the last two setting suns:
> 
>   * more words, mostly correct...
> 
>   * no need to locketh for writeh on current cpu tis 
>     always so
> 
>   * the locking of all cpu's on replace is always done as
>     part of the get_counters cycle, so the sychronize swip
>     in replace tables is gone with only a comment remaing
> 
>   * Epilogue due to master Jarek. Lockdep carest not about
>     the locking doth bestowed. Therefore no keys are needed.
> 
>  include/linux/netfilter/x_tables.h |   55 ++++++++++-
>  net/ipv4/netfilter/arp_tables.c    |  125 +++++++-------------------
>  net/ipv4/netfilter/ip_tables.c     |  126 +++++++--------------------
>  net/ipv6/netfilter/ip6_tables.c    |  123 +++++++-------------------
>  net/netfilter/x_tables.c           |   50 +++++-----
>  5 files changed, 183 insertions(+), 296 deletions(-)
> 
> diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
> index 7b1a652..511debb 100644
> --- a/include/linux/netfilter/x_tables.h
> +++ b/include/linux/netfilter/x_tables.h
> @@ -354,9 +354,6 @@ struct xt_table
>  	/* What hooks you will enter on */
>  	unsigned int valid_hooks;
>  
> -	/* Lock for the curtain */
> -	struct mutex lock;
> -
>  	/* Man behind the curtain... */
>  	struct xt_table_info *private;
>  
> @@ -434,8 +431,56 @@ extern void xt_proto_fini(struct net *net, u_int8_t af);
>  
>  extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
>  extern void xt_free_table_info(struct xt_table_info *info);
> -extern void xt_table_entry_swap_rcu(struct xt_table_info *old,
> -				    struct xt_table_info *new);
> +
> +/*
> + * Per-CPU read/write lock associated with per-cpu table entries.
> + * This is not a general solution but makes reader locking fast since
> + * there is no shared variable to cause cache ping-pong; but adds an
> + * additional write-side penalty since update must lock all
> + * possible CPU's.
> + *
> + * Read lock is used by ip/arp/ip6 tables rule processing which runs per-cpu.
> + * It needs to ensure that the rules are not being changed while packet
> + * is being processed. In some cases, the read lock will be acquired
> + * twice on the same CPU; this is okay because read locks handle nesting.
> + *
> + * Write lock is used in two cases:
> + *    1. reading counter values
> + *       all readers need to be stopped and the per-CPU values are summed.
> + *
> + *    2. replacing tables
> + *       any readers that are using the old tables have to complete
> + *       before freeing the old table. This is handled by reading
> + *	  as a side effect of reading counters

Stating that the write lock must _always_ be taken with bh disabled
might not hurt here.

> + */
> +DECLARE_PER_CPU(rwlock_t, xt_info_locks);
> +
> +static inline void xt_info_rdlock_bh(void)
> +{
> +	/*
> +	 * Note: can not use read_lock_bh(&__get_cpu_var(xt_info_locks))
> +	 * because need to ensure that preemption is disable before
> +	 * acquiring per-cpu-variable, so do it as a two step process
> +	 */
> +	local_bh_disable();
> +	read_lock(&__get_cpu_var(xt_info_locks));
> +}
> +
> +static inline void xt_info_rdunlock_bh(void)
> +{
> +	read_unlock_bh(&__get_cpu_var(xt_info_locks));
> +}
> +
> +static inline void xt_info_wrlock(unsigned int cpu)
> +{
> +	write_lock(&per_cpu(xt_info_locks, cpu));
> +}
> +
> +static inline void xt_info_wrunlock(unsigned int cpu)
> +{
> +
> +	write_unlock(&per_cpu(xt_info_locks, cpu));
> +}
>  
>  /*
>   * This helper is performance critical and must be inlined
> diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
> index 5ba533d..831fe18 100644
> --- a/net/ipv4/netfilter/arp_tables.c
> +++ b/net/ipv4/netfilter/arp_tables.c
> @@ -253,9 +253,9 @@ unsigned int arpt_do_table(struct sk_buff *skb,
>  	indev = in ? in->name : nulldevname;
>  	outdev = out ? out->name : nulldevname;
>  
> -	rcu_read_lock_bh();
> -	private = rcu_dereference(table->private);
> -	table_base = rcu_dereference(private->entries[smp_processor_id()]);
> +	xt_info_rdlock_bh();
> +	private = table->private;
> +	table_base = private->entries[smp_processor_id()];
>  
>  	e = get_entry(table_base, private->hook_entry[hook]);
>  	back = get_entry(table_base, private->underflow[hook]);
> @@ -273,6 +273,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
>  
>  			hdr_len = sizeof(*arp) + (2 * sizeof(struct in_addr)) +
>  				(2 * skb->dev->addr_len);
> +

This is not a whitespace cleanup patch.

>  			ADD_COUNTER(e->counters, hdr_len, 1);
>  
>  			t = arpt_get_target(e);
> @@ -328,8 +329,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
>  			e = (void *)e + e->next_offset;
>  		}
>  	} while (!hotdrop);
> -

Whitespace cleanup ?

> -	rcu_read_unlock_bh();
> +	xt_info_rdunlock_bh();
>  
>  	if (hotdrop)
>  		return NF_DROP;
> @@ -711,9 +711,12 @@ static void get_counters(const struct xt_table_info *t,
>  	/* Instead of clearing (by a previous call to memset())
>  	 * the counters and using adds, we set the counters
>  	 * with data used by 'current' CPU
> -	 * We dont care about preemption here.
> +	 *
> +	 * Bottom half has to be disabled to prevent deadlock
> +	 * if new softirq were to run and call ipt_do_table
>  	 */
> -	curcpu = raw_smp_processor_id();
> +	local_bh_disable();
> +	curcpu = smp_processor_id();
>  
>  	i = 0;
>  	ARPT_ENTRY_ITERATE(t->entries[curcpu],
> @@ -726,73 +729,22 @@ static void get_counters(const struct xt_table_info *t,
>  		if (cpu == curcpu)
>  			continue;
>  		i = 0;
> +		xt_info_wrlock(cpu);
>  		ARPT_ENTRY_ITERATE(t->entries[cpu],
>  				   t->size,
>  				   add_entry_to_counter,
>  				   counters,
>  				   &i);
> +		xt_info_wrunlock(cpu);
>  	}
> -}
> -
> -
> -/* We're lazy, and add to the first CPU; overflow works its fey magic
> - * and everything is OK. */
> -static int
> -add_counter_to_entry(struct arpt_entry *e,
> -		     const struct xt_counters addme[],
> -		     unsigned int *i)
> -{
> -	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
> -
> -	(*i)++;
> -	return 0;
> -}
> -
> -/* Take values from counters and add them back onto the current cpu */
> -static void put_counters(struct xt_table_info *t,
> -			 const struct xt_counters counters[])
> -{
> -	unsigned int i, cpu;
> -
> -	local_bh_disable();
> -	cpu = smp_processor_id();
> -	i = 0;
> -	ARPT_ENTRY_ITERATE(t->entries[cpu],
> -			  t->size,
> -			  add_counter_to_entry,
> -			  counters,
> -			  &i);
>  	local_bh_enable();
>  }

Did you really need to move add_counter_to_entry and put_counters in
this patch ? This also seems more like a cleanup to me, if it is even
one. It does make the patch harder to follow though.

>  
> -static inline int
> -zero_entry_counter(struct arpt_entry *e, void *arg)
> -{
> -	e->counters.bcnt = 0;
> -	e->counters.pcnt = 0;
> -	return 0;
> -}
> -
> -static void
> -clone_counters(struct xt_table_info *newinfo, const struct xt_table_info *info)
> -{
> -	unsigned int cpu;
> -	const void *loc_cpu_entry = info->entries[raw_smp_processor_id()];
> -
> -	memcpy(newinfo, info, offsetof(struct xt_table_info, entries));
> -	for_each_possible_cpu(cpu) {
> -		memcpy(newinfo->entries[cpu], loc_cpu_entry, info->size);
> -		ARPT_ENTRY_ITERATE(newinfo->entries[cpu], newinfo->size,
> -				  zero_entry_counter, NULL);
> -	}
> -}
> -
>  static struct xt_counters *alloc_counters(struct xt_table *table)
>  {
>  	unsigned int countersize;
>  	struct xt_counters *counters;
>  	struct xt_table_info *private = table->private;
> -	struct xt_table_info *info;
>  
>  	/* We need atomic snapshot of counters: rest doesn't change
>  	 * (other than comefrom, which userspace doesn't care
> @@ -802,30 +754,11 @@ static struct xt_counters *alloc_counters(struct xt_table *table)
>  	counters = vmalloc_node(countersize, numa_node_id());
>  
>  	if (counters == NULL)
> -		goto nomem;
> -
> -	info = xt_alloc_table_info(private->size);
> -	if (!info)
> -		goto free_counters;
> -
> -	clone_counters(info, private);
> -
> -	mutex_lock(&table->lock);
> -	xt_table_entry_swap_rcu(private, info);
> -	synchronize_net();	/* Wait until smoke has cleared */
> +		return ERR_PTR(-ENOMEM);
>  
> -	get_counters(info, counters);
> -	put_counters(private, counters);
> -	mutex_unlock(&table->lock);
> -
> -	xt_free_table_info(info);
> +	get_counters(private, counters);
>  
>  	return counters;
> -
> - free_counters:
> -	vfree(counters);
> - nomem:
> -	return ERR_PTR(-ENOMEM);
>  }
>  
>  static int copy_entries_to_user(unsigned int total_size,
> @@ -1094,8 +1027,9 @@ static int __do_replace(struct net *net, const char *name,
>  	    (newinfo->number <= oldinfo->initial_entries))
>  		module_put(t->me);
>  
> -	/* Get the old counters. */
> +	/* Get the old counters, and synchronize with replace */
>  	get_counters(oldinfo, counters);
> +
>  	/* Decrease module usage counts and free resource */
>  	loc_cpu_old_entry = oldinfo->entries[raw_smp_processor_id()];
>  	ARPT_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,
> @@ -1165,10 +1099,23 @@ static int do_replace(struct net *net, void __user *user, unsigned int len)
>  	return ret;
>  }
>  
> +/* We're lazy, and add to the first CPU; overflow works its fey magic
> + * and everything is OK. */
> +static int
> +add_counter_to_entry(struct arpt_entry *e,
> +		     const struct xt_counters addme[],
> +		     unsigned int *i)
> +{
> +	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
> +
> +	(*i)++;
> +	return 0;
> +}
> +
>  static int do_add_counters(struct net *net, void __user *user, unsigned int len,
>  			   int compat)
>  {
> -	unsigned int i;
> +	unsigned int i, curcpu;
>  	struct xt_counters_info tmp;
>  	struct xt_counters *paddc;
>  	unsigned int num_counters;
> @@ -1224,26 +1171,26 @@ static int do_add_counters(struct net *net, void __user *user, unsigned int len,
>  		goto free;
>  	}
>  
> -	mutex_lock(&t->lock);
> +	local_bh_disable();
>  	private = t->private;
>  	if (private->number != num_counters) {
>  		ret = -EINVAL;
>  		goto unlock_up_free;
>  	}
>  
> -	preempt_disable();
>  	i = 0;
>  	/* Choose the copy that is on our node */
> -	loc_cpu_entry = private->entries[smp_processor_id()];
> +	curcpu = smp_processor_id();
> +	loc_cpu_entry = private->entries[curcpu];
> +	xt_info_wrlock(curcpu);
>  	ARPT_ENTRY_ITERATE(loc_cpu_entry,
>  			   private->size,
>  			   add_counter_to_entry,
>  			   paddc,
>  			   &i);
> -	preempt_enable();
> +	xt_info_wrunlock(curcpu);
>   unlock_up_free:
> -	mutex_unlock(&t->lock);
> -
> +	local_bh_enable();
>  	xt_table_unlock(t);
>  	module_put(t->me);
>   free:
> diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
> index 810c0b6..2ec8d72 100644
> --- a/net/ipv4/netfilter/ip_tables.c
> +++ b/net/ipv4/netfilter/ip_tables.c
> @@ -338,10 +338,9 @@ ipt_do_table(struct sk_buff *skb,
>  	tgpar.hooknum = hook;
>  
>  	IP_NF_ASSERT(table->valid_hooks & (1 << hook));
> -
> -	rcu_read_lock_bh();
> -	private = rcu_dereference(table->private);
> -	table_base = rcu_dereference(private->entries[smp_processor_id()]);
> +	xt_info_rdlock_bh();
> +	private = table->private;
> +	table_base = private->entries[smp_processor_id()];
>  
>  	e = get_entry(table_base, private->hook_entry[hook]);
>  
> @@ -436,8 +435,7 @@ ipt_do_table(struct sk_buff *skb,
>  			e = (void *)e + e->next_offset;
>  		}
>  	} while (!hotdrop);
> -
> -	rcu_read_unlock_bh();
> +	xt_info_rdunlock_bh();
>  
>  #ifdef DEBUG_ALLOW_ALL
>  	return NF_ACCEPT;
> @@ -896,10 +894,13 @@ get_counters(const struct xt_table_info *t,
>  
>  	/* Instead of clearing (by a previous call to memset())
>  	 * the counters and using adds, we set the counters
> -	 * with data used by 'current' CPU
> -	 * We dont care about preemption here.
> +	 * with data used by 'current' CPU.
> +	 *
> +	 * Bottom half has to be disabled to prevent deadlock
> +	 * if new softirq were to run and call ipt_do_table
>  	 */
> -	curcpu = raw_smp_processor_id();
> +	local_bh_disable();
> +	curcpu = smp_processor_id();
>  
>  	i = 0;
>  	IPT_ENTRY_ITERATE(t->entries[curcpu],
> @@ -912,74 +913,22 @@ get_counters(const struct xt_table_info *t,
>  		if (cpu == curcpu)
>  			continue;
>  		i = 0;
> +		xt_info_wrlock(cpu);
>  		IPT_ENTRY_ITERATE(t->entries[cpu],
>  				  t->size,
>  				  add_entry_to_counter,
>  				  counters,
>  				  &i);
> +		xt_info_wrunlock(cpu);
>  	}
> -
> -}
> -
> -/* We're lazy, and add to the first CPU; overflow works its fey magic
> - * and everything is OK. */
> -static int
> -add_counter_to_entry(struct ipt_entry *e,
> -		     const struct xt_counters addme[],
> -		     unsigned int *i)
> -{
> -	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
> -
> -	(*i)++;
> -	return 0;
> -}
> -
> -/* Take values from counters and add them back onto the current cpu */
> -static void put_counters(struct xt_table_info *t,
> -			 const struct xt_counters counters[])
> -{
> -	unsigned int i, cpu;
> -
> -	local_bh_disable();
> -	cpu = smp_processor_id();
> -	i = 0;
> -	IPT_ENTRY_ITERATE(t->entries[cpu],
> -			  t->size,
> -			  add_counter_to_entry,
> -			  counters,
> -			  &i);
>  	local_bh_enable();
>  }
>  
> -
> -static inline int
> -zero_entry_counter(struct ipt_entry *e, void *arg)
> -{
> -	e->counters.bcnt = 0;
> -	e->counters.pcnt = 0;
> -	return 0;
> -}
> -
> -static void
> -clone_counters(struct xt_table_info *newinfo, const struct xt_table_info *info)
> -{
> -	unsigned int cpu;
> -	const void *loc_cpu_entry = info->entries[raw_smp_processor_id()];
> -
> -	memcpy(newinfo, info, offsetof(struct xt_table_info, entries));
> -	for_each_possible_cpu(cpu) {
> -		memcpy(newinfo->entries[cpu], loc_cpu_entry, info->size);
> -		IPT_ENTRY_ITERATE(newinfo->entries[cpu], newinfo->size,
> -				  zero_entry_counter, NULL);
> -	}
> -}
> -
>  static struct xt_counters * alloc_counters(struct xt_table *table)
>  {
>  	unsigned int countersize;
>  	struct xt_counters *counters;
>  	struct xt_table_info *private = table->private;
> -	struct xt_table_info *info;
>  
>  	/* We need atomic snapshot of counters: rest doesn't change
>  	   (other than comefrom, which userspace doesn't care
> @@ -988,30 +937,11 @@ static struct xt_counters * alloc_counters(struct xt_table *table)
>  	counters = vmalloc_node(countersize, numa_node_id());
>  
>  	if (counters == NULL)
> -		goto nomem;
> +		return ERR_PTR(-ENOMEM);
>  
> -	info = xt_alloc_table_info(private->size);
> -	if (!info)
> -		goto free_counters;
> -
> -	clone_counters(info, private);
> -
> -	mutex_lock(&table->lock);
> -	xt_table_entry_swap_rcu(private, info);
> -	synchronize_net();	/* Wait until smoke has cleared */
> -
> -	get_counters(info, counters);
> -	put_counters(private, counters);
> -	mutex_unlock(&table->lock);
> -
> -	xt_free_table_info(info);
> +	get_counters(private, counters);
>  
>  	return counters;
> -
> - free_counters:
> -	vfree(counters);
> - nomem:
> -	return ERR_PTR(-ENOMEM);
>  }
>  
>  static int
> @@ -1306,8 +1236,9 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks,
>  	    (newinfo->number <= oldinfo->initial_entries))
>  		module_put(t->me);
>  
> -	/* Get the old counters. */
> +	/* Get the old counters, and synchronize with replace */
>  	get_counters(oldinfo, counters);
> +
>  	/* Decrease module usage counts and free resource */
>  	loc_cpu_old_entry = oldinfo->entries[raw_smp_processor_id()];
>  	IPT_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,
> @@ -1377,11 +1308,23 @@ do_replace(struct net *net, void __user *user, unsigned int len)
>  	return ret;
>  }
>  
> +/* We're lazy, and add to the first CPU; overflow works its fey magic
> + * and everything is OK. */
> +static int
> +add_counter_to_entry(struct ipt_entry *e,
> +		     const struct xt_counters addme[],
> +		     unsigned int *i)
> +{
> +	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
> +
> +	(*i)++;
> +	return 0;
> +}
>  
>  static int
>  do_add_counters(struct net *net, void __user *user, unsigned int len, int compat)
>  {
> -	unsigned int i;
> +	unsigned int i, curcpu;
>  	struct xt_counters_info tmp;
>  	struct xt_counters *paddc;
>  	unsigned int num_counters;
> @@ -1437,25 +1380,26 @@ do_add_counters(struct net *net, void __user *user, unsigned int len, int compat
>  		goto free;
>  	}
>  
> -	mutex_lock(&t->lock);
> +	local_bh_disable();
>  	private = t->private;
>  	if (private->number != num_counters) {
>  		ret = -EINVAL;
>  		goto unlock_up_free;
>  	}
>  
> -	preempt_disable();
>  	i = 0;
>  	/* Choose the copy that is on our node */
> -	loc_cpu_entry = private->entries[raw_smp_processor_id()];
> +	curcpu = smp_processor_id();
> +	loc_cpu_entry = private->entries[curcpu];
> +	xt_info_wrlock(curcpu);
>  	IPT_ENTRY_ITERATE(loc_cpu_entry,
>  			  private->size,
>  			  add_counter_to_entry,
>  			  paddc,
>  			  &i);
> -	preempt_enable();
> +	xt_info_wrunlock(curcpu);
>   unlock_up_free:
> -	mutex_unlock(&t->lock);
> +	local_bh_enable();
>  	xt_table_unlock(t);
>  	module_put(t->me);
>   free:
> diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
> index 800ae85..219e165 100644
> --- a/net/ipv6/netfilter/ip6_tables.c
> +++ b/net/ipv6/netfilter/ip6_tables.c
> @@ -365,9 +365,9 @@ ip6t_do_table(struct sk_buff *skb,
>  
>  	IP_NF_ASSERT(table->valid_hooks & (1 << hook));
>  
> -	rcu_read_lock_bh();
> -	private = rcu_dereference(table->private);
> -	table_base = rcu_dereference(private->entries[smp_processor_id()]);
> +	xt_info_rdlock_bh();
> +	private = table->private;
> +	table_base = private->entries[smp_processor_id()];
>  
>  	e = get_entry(table_base, private->hook_entry[hook]);
>  
> @@ -466,7 +466,7 @@ ip6t_do_table(struct sk_buff *skb,
>  #ifdef CONFIG_NETFILTER_DEBUG
>  	((struct ip6t_entry *)table_base)->comefrom = NETFILTER_LINK_POISON;
>  #endif
> -	rcu_read_unlock_bh();
> +	xt_info_rdunlock_bh();
>  
>  #ifdef DEBUG_ALLOW_ALL
>  	return NF_ACCEPT;
> @@ -926,9 +926,12 @@ get_counters(const struct xt_table_info *t,
>  	/* Instead of clearing (by a previous call to memset())
>  	 * the counters and using adds, we set the counters
>  	 * with data used by 'current' CPU
> -	 * We dont care about preemption here.
> +	 *
> +	 * Bottom half has to be disabled to prevent deadlock
> +	 * if new softirq were to run and call ipt_do_table
>  	 */
> -	curcpu = raw_smp_processor_id();
> +	local_bh_disable();
> +	curcpu = smp_processor_id();
>  
>  	i = 0;
>  	IP6T_ENTRY_ITERATE(t->entries[curcpu],
> @@ -941,72 +944,22 @@ get_counters(const struct xt_table_info *t,
>  		if (cpu == curcpu)
>  			continue;
>  		i = 0;
> +		xt_info_wrlock(cpu);
>  		IP6T_ENTRY_ITERATE(t->entries[cpu],
>  				  t->size,
>  				  add_entry_to_counter,
>  				  counters,
>  				  &i);
> +		xt_info_wrunlock(cpu);
>  	}
> -}
> -
> -/* We're lazy, and add to the first CPU; overflow works its fey magic
> - * and everything is OK. */
> -static int
> -add_counter_to_entry(struct ip6t_entry *e,
> -		     const struct xt_counters addme[],
> -		     unsigned int *i)
> -{
> -	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
> -
> -	(*i)++;
> -	return 0;
> -}
> -
> -/* Take values from counters and add them back onto the current cpu */
> -static void put_counters(struct xt_table_info *t,
> -			 const struct xt_counters counters[])
> -{
> -	unsigned int i, cpu;
> -
> -	local_bh_disable();
> -	cpu = smp_processor_id();
> -	i = 0;
> -	IP6T_ENTRY_ITERATE(t->entries[cpu],
> -			   t->size,
> -			   add_counter_to_entry,
> -			   counters,
> -			   &i);
>  	local_bh_enable();
>  }
>  
> -static inline int
> -zero_entry_counter(struct ip6t_entry *e, void *arg)
> -{
> -	e->counters.bcnt = 0;
> -	e->counters.pcnt = 0;
> -	return 0;
> -}
> -
> -static void
> -clone_counters(struct xt_table_info *newinfo, const struct xt_table_info *info)
> -{
> -	unsigned int cpu;
> -	const void *loc_cpu_entry = info->entries[raw_smp_processor_id()];
> -
> -	memcpy(newinfo, info, offsetof(struct xt_table_info, entries));
> -	for_each_possible_cpu(cpu) {
> -		memcpy(newinfo->entries[cpu], loc_cpu_entry, info->size);
> -		IP6T_ENTRY_ITERATE(newinfo->entries[cpu], newinfo->size,
> -				   zero_entry_counter, NULL);
> -	}
> -}
> -
>  static struct xt_counters *alloc_counters(struct xt_table *table)
>  {
>  	unsigned int countersize;
>  	struct xt_counters *counters;
>  	struct xt_table_info *private = table->private;
> -	struct xt_table_info *info;
>  
>  	/* We need atomic snapshot of counters: rest doesn't change
>  	   (other than comefrom, which userspace doesn't care
> @@ -1015,30 +968,11 @@ static struct xt_counters *alloc_counters(struct xt_table *table)
>  	counters = vmalloc_node(countersize, numa_node_id());
>  
>  	if (counters == NULL)
> -		goto nomem;
> +		return ERR_PTR(-ENOMEM);
>  
> -	info = xt_alloc_table_info(private->size);
> -	if (!info)
> -		goto free_counters;
> -
> -	clone_counters(info, private);
> -
> -	mutex_lock(&table->lock);
> -	xt_table_entry_swap_rcu(private, info);
> -	synchronize_net();	/* Wait until smoke has cleared */
> -
> -	get_counters(info, counters);
> -	put_counters(private, counters);
> -	mutex_unlock(&table->lock);
> -
> -	xt_free_table_info(info);
> +	get_counters(private, counters);
>  
>  	return counters;
> -
> - free_counters:
> -	vfree(counters);
> - nomem:
> -	return ERR_PTR(-ENOMEM);
>  }
>  
>  static int
> @@ -1334,8 +1268,9 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks,
>  	    (newinfo->number <= oldinfo->initial_entries))
>  		module_put(t->me);
>  
> -	/* Get the old counters. */
> +	/* Get the old counters, and synchronize with replace */
>  	get_counters(oldinfo, counters);
> +
>  	/* Decrease module usage counts and free resource */
>  	loc_cpu_old_entry = oldinfo->entries[raw_smp_processor_id()];
>  	IP6T_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,
> @@ -1405,11 +1340,24 @@ do_replace(struct net *net, void __user *user, unsigned int len)
>  	return ret;
>  }
>  
> +/* We're lazy, and add to the first CPU; overflow works its fey magic
> + * and everything is OK. */
> +static int
> +add_counter_to_entry(struct ip6t_entry *e,
> +		     const struct xt_counters addme[],
> +		     unsigned int *i)
> +{
> +	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
> +
> +	(*i)++;
> +	return 0;
> +}
> +
>  static int
>  do_add_counters(struct net *net, void __user *user, unsigned int len,
>  		int compat)
>  {
> -	unsigned int i;
> +	unsigned int i, curcpu;
>  	struct xt_counters_info tmp;
>  	struct xt_counters *paddc;
>  	unsigned int num_counters;
> @@ -1465,25 +1413,28 @@ do_add_counters(struct net *net, void __user *user, unsigned int len,
>  		goto free;
>  	}
>  
> -	mutex_lock(&t->lock);
> +

Incorrect whiteline added.

> +	local_bh_disable();
>  	private = t->private;
>  	if (private->number != num_counters) {
>  		ret = -EINVAL;
>  		goto unlock_up_free;
>  	}
>  
> -	preempt_disable();
>  	i = 0;
>  	/* Choose the copy that is on our node */
> -	loc_cpu_entry = private->entries[raw_smp_processor_id()];
> +	curcpu = smp_processor_id();
> +	xt_info_wrlock(curcpu);
> +	loc_cpu_entry = private->entries[curcpu];
>  	IP6T_ENTRY_ITERATE(loc_cpu_entry,
>  			  private->size,
>  			  add_counter_to_entry,
>  			  paddc,
>  			  &i);
> -	preempt_enable();
> +	xt_info_wrunlock(curcpu);
> +

Inconsistent whiteline.

>   unlock_up_free:
> -	mutex_unlock(&t->lock);
> +	local_bh_enable();
>  	xt_table_unlock(t);
>  	module_put(t->me);
>   free:
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index 509a956..5807a4d 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -625,20 +625,6 @@ void xt_free_table_info(struct xt_table_info *info)
>  }
>  EXPORT_SYMBOL(xt_free_table_info);
>  
> -void xt_table_entry_swap_rcu(struct xt_table_info *oldinfo,
> -			     struct xt_table_info *newinfo)
> -{
> -	unsigned int cpu;
> -
> -	for_each_possible_cpu(cpu) {
> -		void *p = oldinfo->entries[cpu];
> -		rcu_assign_pointer(oldinfo->entries[cpu], newinfo->entries[cpu]);
> -		newinfo->entries[cpu] = p;
> -	}
> -
> -}
> -EXPORT_SYMBOL_GPL(xt_table_entry_swap_rcu);
> -
>  /* Find table by name, grabs mutex & ref.  Returns ERR_PTR() on error. */
>  struct xt_table *xt_find_table_lock(struct net *net, u_int8_t af,
>  				    const char *name)
> @@ -676,32 +662,43 @@ void xt_compat_unlock(u_int8_t af)
>  EXPORT_SYMBOL_GPL(xt_compat_unlock);
>  #endif
>  
> +DEFINE_PER_CPU(rwlock_t, xt_info_locks);
> +EXPORT_PER_CPU_SYMBOL_GPL(xt_info_locks);
> +
> +
>  struct xt_table_info *
>  xt_replace_table(struct xt_table *table,
>  	      unsigned int num_counters,
>  	      struct xt_table_info *newinfo,
>  	      int *error)
>  {
> -	struct xt_table_info *oldinfo, *private;
> +	struct xt_table_info *private;
>  
>  	/* Do the substitution. */
> -	mutex_lock(&table->lock);
> +	local_bh_disable();
>  	private = table->private;
> +
>  	/* Check inside lock: is the old number correct? */
>  	if (num_counters != private->number) {
>  		duprintf("num_counters != table->private->number (%u/%u)\n",
>  			 num_counters, private->number);
> -		mutex_unlock(&table->lock);
> +		local_bh_enable();
>  		*error = -EAGAIN;
>  		return NULL;
>  	}
> -	oldinfo = private;
> -	rcu_assign_pointer(table->private, newinfo);
> -	newinfo->initial_entries = oldinfo->initial_entries;
> -	mutex_unlock(&table->lock);
>  

Whiteline.....

Mathieu

> -	synchronize_net();
> -	return oldinfo;
> +	table->private = newinfo;
> +	newinfo->initial_entries = private->initial_entries;
> +
> +	/*
> +	 * Even though table entries have now been swapped, other CPU's
> +	 * may still be using the old entries. This is okay, because
> +	 * resynchronization happens because of the locking done
> +	 * during the get_counters() routine.
> +	 */
> +	local_bh_enable();
> +
> +	return private;
>  }
>  EXPORT_SYMBOL_GPL(xt_replace_table);
>  
> @@ -734,7 +731,6 @@ struct xt_table *xt_register_table(struct net *net, struct xt_table *table,
>  
>  	/* Simplifies replace_table code. */
>  	table->private = bootstrap;
> -	mutex_init(&table->lock);
>  
>  	if (!xt_replace_table(table, 0, newinfo, &ret))
>  		goto unlock;
> @@ -1147,7 +1143,11 @@ static struct pernet_operations xt_net_ops = {
>  
>  static int __init xt_init(void)
>  {
> -	int i, rv;
> +	unsigned int i;
> +	int rv;
> +
> +	for_each_possible_cpu(i)
> +		rwlock_init(&per_cpu(xt_info_locks, i));
>  
>  	xt = kmalloc(sizeof(struct xt_af) * NFPROTO_NUMPROTO, GFP_KERNEL);
>  	if (!xt)
Mathieu Desnoyers April 26, 2009, 7:31 p.m. UTC | #2
* Eric Dumazet (dada1@cosmosbay.com) wrote:
> From: Stephen Hemminger <shemminger@vyatta.com>
> 
> > Epilogue due to master Jarek. Lockdep carest not about the locking
> > doth bestowed. Therefore no keys are needed.
> > 
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> So far, so good, should be ready for inclusion now, nobody complained :)
> 
> I include the final patch, merge of your last two patches.
> 
> David, could you please review it once again and apply it if it's OK ?
> 
[...]
> +/*
> + * Per-CPU read/write lock associated with per-cpu table entries.
> + * This is not a general solution but makes reader locking fast since
> + * there is no shared variable to cause cache ping-pong; but adds an
> + * additional write-side penalty since update must lock all
> + * possible CPU's.
> + *
> + * Read lock is used by ip/arp/ip6 tables rule processing which runs per-cpu.
> + * It needs to ensure that the rules are not being changed while packet
> + * is being processed. In some cases, the read lock will be acquired
> + * twice on the same CPU; this is okay because read locks handle nesting.
> + *
> + * Write lock is used in two cases:
> + *    1. reading counter values
> + *       all readers need to be stopped and the per-CPU values are summed.
> + *
> + *    2. replacing tables
> + *       any readers that are using the old tables have to complete
> + *       before freeing the old table. This is handled by reading
> + *	  as a side effect of reading counters
> + */
> +DECLARE_PER_CPU(rwlock_t, xt_info_locks);
> +
> +static inline void xt_info_rdlock_bh(void)
> +{
> +	/*
> +	 * Note: can not use read_lock_bh(&__get_cpu_var(xt_info_locks))
> +	 * because need to ensure that preemption is disable before
> +	 * acquiring per-cpu-variable, so do it as a two step process
> +	 */
> +	local_bh_disable();

Why do you need to disable bottom halves on the read-side ? You could
probably just disable preemption, given this lock is nestable on the
read-side anyway. Or I'm missing something obvious ?

> +	read_lock(&__get_cpu_var(xt_info_locks));
> +}
> +
> +static inline void xt_info_rdunlock_bh(void)
> +{
> +	read_unlock_bh(&__get_cpu_var(xt_info_locks));
> +}
> +
> +static inline void xt_info_wrlock(unsigned int cpu)
> +{
> +	write_lock(&per_cpu(xt_info_locks, cpu));
> +}
> +
> +static inline void xt_info_wrunlock(unsigned int cpu)
> +{
> +
> +	write_unlock(&per_cpu(xt_info_locks, cpu));
> +}
>  

[...]

Mathieu
Eric Dumazet April 26, 2009, 8:55 p.m. UTC | #3
Mathieu Desnoyers wrote:
> * Eric Dumazet (dada1@cosmosbay.com) wrote:
>> From: Stephen Hemminger <shemminger@vyatta.com>
>>
>>> Epilogue due to master Jarek. Lockdep carest not about the locking
>>> doth bestowed. Therefore no keys are needed.
>>>
>>> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>> So far, so good, should be ready for inclusion now, nobody complained :)
>>
>> I include the final patch, merge of your last two patches.
>>
>> David, could you please review it once again and apply it if it's OK ?
>>
> [...]
>> +/*
>> + * Per-CPU read/write lock associated with per-cpu table entries.
>> + * This is not a general solution but makes reader locking fast since
>> + * there is no shared variable to cause cache ping-pong; but adds an
>> + * additional write-side penalty since update must lock all
>> + * possible CPU's.
>> + *
>> + * Read lock is used by ip/arp/ip6 tables rule processing which runs per-cpu.
>> + * It needs to ensure that the rules are not being changed while packet
>> + * is being processed. In some cases, the read lock will be acquired
>> + * twice on the same CPU; this is okay because read locks handle nesting.
>> + *
>> + * Write lock is used in two cases:
>> + *    1. reading counter values
>> + *       all readers need to be stopped and the per-CPU values are summed.
>> + *
>> + *    2. replacing tables
>> + *       any readers that are using the old tables have to complete
>> + *       before freeing the old table. This is handled by reading
>> + *	  as a side effect of reading counters
>> + */
>> +DECLARE_PER_CPU(rwlock_t, xt_info_locks);
>> +
>> +static inline void xt_info_rdlock_bh(void)
>> +{
>> +	/*
>> +	 * Note: can not use read_lock_bh(&__get_cpu_var(xt_info_locks))
>> +	 * because need to ensure that preemption is disable before
>> +	 * acquiring per-cpu-variable, so do it as a two step process
>> +	 */
>> +	local_bh_disable();
> 
> Why do you need to disable bottom halves on the read-side ? You could
> probably just disable preemption, given this lock is nestable on the
> read-side anyway. Or I'm missing something obvious ?

It may not be obvious, but this subject was already raised on this list, so I'll
try to be as precise as possible (but I may be wrong on some points; I'll
let Patrick correct me if necessary).

ipt_do_table() is not a readonly function returning a verdict.

1) It handles a stack (check how next->comefrom is used) that seems to
be stored on the rules themselves (this is how I understand this code).
This is safe as each cpu has its own copy of rules/counters, and it is BH protected.

2) It also updates two 64 bit counters (bytes/packets) on each matched rule.

3) Some netfilter matches/targets probably rely on the fact their handlers
are run with BH disabled by their caller (ipt_do_table()/arp/ip6...)

These must be BH protected (and preempt disabled too), or else :

1) A softirq could interrupt a process in the middle of ipt_do_table()
and corrupt its "stack".

2) A softirq could interrupt a process in ipt_do_table() in the middle
 of the ADD_COUNTER(). Some counters could be corrupted.

3) Some netfilter extensions would break.

Previous Linux versions already used a read_lock_bh() here, on a single
shared rwlock; there is nothing new about this BH locking AFAIK.
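
Putting those three points together, the read side this patch installs
(assembled from the ipt_do_table()/arp_tables/ip6_tables hunks above) looks
roughly like the sketch below. walk_rules() is only a stand-in for the real
traversal loop; everything else mirrors the patch:

/* Sketch of the post-patch read side, not a literal copy.
 * walk_rules() is a placeholder for the rule traversal that uses
 * ->comefrom as a stack and does the ADD_COUNTER() updates.
 */
static unsigned int do_table_sketch(struct sk_buff *skb,
				    struct xt_table *table)
{
	struct xt_table_info *private;
	void *table_base;
	unsigned int verdict;

	xt_info_rdlock_bh();	/* local_bh_disable() + per-cpu read_lock() */
	private    = table->private;
	table_base = private->entries[smp_processor_id()];

	/* must not be re-entered by a softirq on this CPU, hence BH off */
	verdict = walk_rules(table_base, skb);

	xt_info_rdunlock_bh();	/* read_unlock_bh(): drops lock, re-enables BH */
	return verdict;
}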

Thank you

Mathieu Desnoyers April 26, 2009, 9:39 p.m. UTC | #4
* Eric Dumazet (dada1@cosmosbay.com) wrote:
Mathieu Desnoyers wrote:
> > * Eric Dumazet (dada1@cosmosbay.com) wrote:
> >> From: Stephen Hemminger <shemminger@vyatta.com>
> >>
> >>> Epilogue due to master Jarek. Lockdep carest not about the locking
> >>> doth bestowed. Therefore no keys are needed.
> >>>
> >>> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> >> So far, so good, should be ready for inclusion now, nobody complained :)
> >>
> >> I include the final patch, merge of your last two patches.
> >>
> >> David, could you please review it once again and apply it if it's OK ?
> >>
> > [...]
> >> +/*
> >> + * Per-CPU read/write lock associated with per-cpu table entries.
> >> + * This is not a general solution but makes reader locking fast since
> >> + * there is no shared variable to cause cache ping-pong; but adds an
> >> + * additional write-side penalty since update must lock all
> >> + * possible CPU's.
> >> + *
> >> + * Read lock is used by ip/arp/ip6 tables rule processing which runs per-cpu.
> >> + * It needs to ensure that the rules are not being changed while packet
> >> + * is being processed. In some cases, the read lock will be acquired
> >> + * twice on the same CPU; this is okay because read locks handle nesting.
> >> + *
> >> + * Write lock is used in two cases:
> >> + *    1. reading counter values
> >> + *       all readers need to be stopped and the per-CPU values are summed.
> >> + *
> >> + *    2. replacing tables
> >> + *       any readers that are using the old tables have to complete
> >> + *       before freeing the old table. This is handled by reading
> >> + *	  as a side effect of reading counters
> >> + */
> >> +DECLARE_PER_CPU(rwlock_t, xt_info_locks);
> >> +
> >> +static inline void xt_info_rdlock_bh(void)
> >> +{
> >> +	/*
> >> +	 * Note: can not use read_lock_bh(&__get_cpu_var(xt_info_locks))
> >> +	 * because need to ensure that preemption is disable before
> >> +	 * acquiring per-cpu-variable, so do it as a two step process
> >> +	 */
> >> +	local_bh_disable();
> > 
> > Why do you need to disable bottom halves on the read-side ? You could
> > probably just disable preemption, given this lock is nestable on the
> > read-side anyway. Or I'm missing something obvious ?
> 
> It may not be obvious, but subject already raised on this list, so I'll
> try to be as precise as possible (But may be wrong on some points, I'll
> let Patrick correct me if necessary)
> 
> ipt_do_table() is not a readonly function returning a verdict.
> 
> 1) It handles a stack (check how is used next->comefrom) that seems to
> be stored on rules themselves. (This is how I understand this code)
> This is safe as each cpu has its own copy of rules/counters, and BH protected.
> 
> 2) It also updates two 64 bit counters (bytes/packets) on each matched rule.
> 
> 3) Some netfilter matches/targets probably rely on the fact their handlers
> are run with BH disabled by their caller (ipt_do_table()/arp/ip6...)
> 
> These must be BH protected (and preempt disabled too), or else :
> 
> 1) A softirq could interrupt a process in the middle of ipt_do_table()
> and corrupt its "stack".
> 
> 2) A softirq could interrupt a process in ipt_do_table() in the middle
>  of the ADD_COUNTER(). Some counters could be corrupted.
> 
> 3) Some netfiler extensions would break.
> 
> Previous linux versions already used a read_lock_bh() here, on a single
> and shared rwlock, there is nothing new on this BH locking AFAIK.
> 
> Thank you

Thanks for the explanation. It might help to document the role of bh
disabling for the reader in a supplementary code comment; otherwise one
might think it's been put there to match the bottom half disabling used
on the write-side, which has the additional role of making sure bh
will not deadlock (and that precise behavior is usually not needed on
the read-side).

One more point :

 *    1. reading counter values
 *       all readers need to be stopped and the per-CPU values are summed.

Maybe it's just me, but this sentence does not seem to clearly indicate
that we have :

for_each_cpu()
  write lock()
  read data
  write unlock()

One might interpret it as :

for_each_cpu()
  write lock()

read data

for_each_cpu()
  write unlock()

Or maybe it's just my understanding of English that's not perfect.
Anyhow, rewording this sentence might not hurt. Something along the
lines of :

"reading counter values
 all readers are iteratively stopped to have their per-CPU values
 summed"

This is an important difference, as this behaves more like an RCU-based
mechanism than a global per-cpu read/write lock where all the write
locks would be taken at once.
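
In code, the loop in question is the one get_counters() ends up with in the
patch; ENTRY_ITERATE stands in for the ARPT/IPT/IP6T_ENTRY_ITERATE variants,
the rest is how the function reads after the patch:

/* One CPU at a time: take that CPU's write lock, fold its counters
 * into the result, release it, move on.  The locks are never all
 * held at once.
 */
local_bh_disable();		/* no readers can run on this CPU now */
curcpu = smp_processor_id();
i = 0;
ENTRY_ITERATE(t->entries[curcpu], t->size,
	      set_entry_to_counter, counters, &i);

for_each_possible_cpu(cpu) {
	if (cpu == curcpu)
		continue;
	i = 0;
	xt_info_wrlock(cpu);	/* stop that CPU's readers only */
	ENTRY_ITERATE(t->entries[cpu], t->size,
		      add_entry_to_counter, counters, &i);
	xt_info_wrunlock(cpu);	/* released before the next CPU is locked */
}
local_bh_enable();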

Mathieu
stephen hemminger April 26, 2009, 9:57 p.m. UTC | #5
On Sun, 26 Apr 2009 14:56:46 -0400
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> * Eric Dumazet (dada1@cosmosbay.com) wrote:
> > From: Stephen Hemminger <shemminger@vyatta.com>
> > 
> > > Epilogue due to master Jarek. Lockdep carest not about the locking
> > > doth bestowed. Therefore no keys are needed.
> > > 
> > > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> > 
> > So far, so good, should be ready for inclusion now, nobody complained :)
> > 
> > I include the final patch, merge of your last two patches.
> > 
> > David, could you please review it once again and apply it if it's OK ?
> 
> > Thanks to all for your help and patience
> > 
> > [PATCH] netfilter: use per-CPU recursive lock {XV}
> 
> Hi Eric, 
> 
> Suitable name would probably be :
> 

But Linus is trying to delude himself.

This usage is recursive even if he doesn't like the terminology.
The same CPU has to be able to reacquire the read lock without deadlocking.
If reader/writer locks were implemented in a pure writer-gets-priority
method, then this code would break!  So yes, read locks can be used recursively
now in Linux, but if they were implemented differently then this code
would break.  For example, the -rt kernel turns all read/write locks into
mutexes, so the -rt kernel developers will have to address this.
Mathieu Desnoyers April 26, 2009, 10:32 p.m. UTC | #6
* Stephen Hemminger (shemminger@vyatta.com) wrote:
> On Sun, 26 Apr 2009 14:56:46 -0400
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > * Eric Dumazet (dada1@cosmosbay.com) wrote:
> > > From: Stephen Hemminger <shemminger@vyatta.com>
> > > 
> > > > Epilogue due to master Jarek. Lockdep carest not about the locking
> > > > doth bestowed. Therefore no keys are needed.
> > > > 
> > > > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> > > 
> > > So far, so good, should be ready for inclusion now, nobody complained :)
> > > 
> > > I include the final patch, merge of your last two patches.
> > > 
> > > David, could you please review it once again and apply it if it's OK ?
> > 
> > > Thanks to all for your help and patience
> > > 
> > > [PATCH] netfilter: use per-CPU recursive lock {XV}
> > 
> > Hi Eric, 
> > 
> > Suitable name would probably be :
> > 

Hi Stephen,

[I see that you have cut my name proposal from the original email,
which might make it difficult for others to follow. I will assume you
did it by mistake.]

(re-added)
[PATCH] netfilter: use bh disabling with per-cpu read-write lock

> 
> But Linus is trying to delude himself.
> 
> This usage is recursive even if he doesn't like the terminology.
> The same CPU has to be able to reacquire the read lock without deadlocking.
> If reader/writer locks were implemented in a pure writer gets priority
> method, then this code would break!  So yes read locks can be used recursively
> now in Linux, but if the were implemented differently then this code
> would break.  For example, the -rt kernel turns all read/write locks into
> mutexs, so the -rt kernel developers will have to address this.

Reading Documentation/spinlocks.txt, which states the lock usage
guidelines :

"Note that you can be clever with read-write locks and interrupts. For
example, if you know that the interrupt only ever gets a read-lock, then
you can use a non-irq version of read locks everywhere - because they
don't block on each other (and thus there is no dead-lock wrt interrupts.
But when you do the write-lock, you have to use the irq-safe version."
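
For reference, the usage that paragraph permits looks like this (illustrative
only, not code from the patch):

static DEFINE_RWLOCK(tbl_lock);

/* reader in interrupt context */
void lookup_from_irq(void)
{
	read_lock(&tbl_lock);
	/* ... read-only lookup ... */
	read_unlock(&tbl_lock);
}

/* reader in process context: the non-irq variant is enough, because an
 * interrupt taking the read lock on top of us simply nests; it cannot
 * deadlock against another reader */
void lookup_from_process(void)
{
	read_lock(&tbl_lock);
	/* ... read-only lookup ... */
	read_unlock(&tbl_lock);
}

/* only the writer must be irq-safe */
void update(void)
{
	unsigned long flags;

	write_lock_irqsave(&tbl_lock, flags);
	/* ... modify ... */
	write_unlock_irqrestore(&tbl_lock, flags);
}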

So it's assumed in the kernel-wide read lock usage that nested read
locks are OK. I'm adding Thomas and Steven in CC, but I'm quite sure
they must have dealt with nested read-lock transformation into mutexes
by detecting nesting in some way in -rt. But I'll let them confirm this.

So I don't see why you are dreaming about a different semantic than the
one of the primitives you are using. I guess I'll leave the semantics to
you. I just find it astonishing that you persist in saying that everybody is
wrong on a topic like semantics, which is in the end a means to
communicate ideas clearly within the overall community you disagree
with.

Good luck !

Mathieu
Peter Zijlstra April 27, 2009, 5:44 p.m. UTC | #7
On Sun, 2009-04-26 at 14:57 -0700, Stephen Hemminger wrote:
> On Sun, 26 Apr 2009 14:56:46 -0400
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > * Eric Dumazet (dada1@cosmosbay.com) wrote:
> > > From: Stephen Hemminger <shemminger@vyatta.com>
> > > 
> > > > Epilogue due to master Jarek. Lockdep carest not about the locking
> > > > doth bestowed. Therefore no keys are needed.
> > > > 
> > > > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> > > 
> > > So far, so good, should be ready for inclusion now, nobody complained :)
> > > 
> > > I include the final patch, merge of your last two patches.
> > > 
> > > David, could you please review it once again and apply it if it's OK ?
> > 
> > > Thanks to all for your help and patience
> > > 
> > > [PATCH] netfilter: use per-CPU recursive lock {XV}
> > 
> > Hi Eric, 
> > 
> > Suitable name would probably be :
> > 
> 
> But Linus is trying to delude himself.
> 
> This usage is recursive even if he doesn't like the terminology.
> The same CPU has to be able to reacquire the read lock without deadlocking.
> If reader/writer locks were implemented in a pure writer gets priority
> method, then this code would break!  So yes read locks can be used recursively
> now in Linux, but if the were implemented differently then this code
> would break.  For example, the -rt kernel turns all read/write locks into
> mutexs, so the -rt kernel developers will have to address this.

A recursive lock has the property:

lock()
{
  if (lock->owner == current) {
    lock->depth++;
    return;
  }

  /* regular lock stuff */
}

unlock()
{
  if (!--lock->depth)
    /* regular unlock */
}

none of the Linux kernel locking primitives have this -- with the
possible exception of the cpu-hotplug lock.

What rwlock_t has, is reader bias to the point where you can utterly
starve writers, with the side effect that you can obtain multiple read
ownerships without causing a deadlock.

This is not what is called a recursive lock. A recursive lock would have
each owner only once; this rwlock_t thing is simply so unfair that it
can have unlimited owners, including multiple copies of the same one.

rwsem has FIFO fairness, and therefore can deadlock in this scenario:
suppose thread A does a read, thread B tries a write and blocks, then
thread A recurses and tries to obtain another read ownership --
deadlock, as the FIFO fairness demands that the second read ownership
wait on the pending writer, which waits on the outstanding read
owner.

Now if rwsem were a fifo-fair recursive lock, the above would not
deadlock, since it would detect that the task already had (read)
ownership and simply increment the depth, instead of trying to acquire a
second ownership.
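
(For a userspace point of comparison only, nothing kernel-specific: POSIX
exposes exactly this owner-plus-depth behaviour as a mutex attribute, which
is what "recursive lock" conventionally means.)

#include <pthread.h>

static pthread_mutex_t m;

static void init_recursive_mutex(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/* the owning thread may re-lock m and must unlock it once per
	 * lock; other threads block normally -- owner + depth, no more */
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
	pthread_mutex_init(&m, &attr);
	pthread_mutexattr_destroy(&attr);
}

Nested read_lock() on rwlock_t gives the same "no deadlock" outcome, but only
as a by-product of reader bias, not because any owner or depth is tracked.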

This is all common and well understood terminology, not something Linus
invented just to harass you with.

Generally speaking we do not condone recursive locking strategies -- and
afaik reiserfs (as per the BKL) and the network code (as per abusing
rwlock_t unfairness) are the only offenders.

Like Linus stated, recursive locking is generally poor taste and
indicates you basically gave up on trying to find a proper locking
scheme. We should very much work towards getting rid of these
aberrations instead of adding new ones.

Linus is very much right in what he said, and you calling him delusional
only highlights your ignorance on the issue.

[ PS. -rt implements rwlock_t as a proper recursive lock (either a mutex
  or a full multi-owner reader-writer lock with PI fairness) so if
  anybody abuses rwlock_t unfairness in a way that is not strict owner
  recursive we have a problem. ]
stephen hemminger April 27, 2009, 6:30 p.m. UTC | #8
On Mon, 27 Apr 2009 19:44:57 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, 2009-04-26 at 14:57 -0700, Stephen Hemminger wrote:
> > On Sun, 26 Apr 2009 14:56:46 -0400
> > Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > 
> > > * Eric Dumazet (dada1@cosmosbay.com) wrote:
> > > > From: Stephen Hemminger <shemminger@vyatta.com>
> > > > 
> > > > > Epilogue due to master Jarek. Lockdep carest not about the locking
> > > > > doth bestowed. Therefore no keys are needed.
> > > > > 
> > > > > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> > > > 
> > > > So far, so good, should be ready for inclusion now, nobody complained :)
> > > > 
> > > > I include the final patch, merge of your last two patches.
> > > > 
> > > > David, could you please review it once again and apply it if it's OK ?
> > > 
> > > > Thanks to all for your help and patience
> > > > 
> > > > [PATCH] netfilter: use per-CPU recursive lock {XV}
> > > 
> > > Hi Eric, 
> > > 
> > > Suitable name would probably be :
> > > 
> > 
> > But Linus is trying to delude himself.
> > 
> > This usage is recursive even if he doesn't like the terminology.
> > The same CPU has to be able to reacquire the read lock without deadlocking.
> > If reader/writer locks were implemented in a pure writer gets priority
> > method, then this code would break!  So yes read locks can be used recursively
> > now in Linux, but if the were implemented differently then this code
> > would break.  For example, the -rt kernel turns all read/write locks into
> > mutexs, so the -rt kernel developers will have to address this.
> 
> A recursive lock has the property:
> 
> lock()
> {
>   if (lock->owner == current) {
>     lock->depth++;
>     return;
>   }
> 
>   /* regular lock stuff */
> }
> 
> unlock()
> {
>   if (!--lock->depth)
>     /* regular unlock */
> }

Only on Linux, and only because you look at locking from
the point of view of the magic "current" process variable.

> non of the linux kernel locking primitives have this -- with the
> possible exception of the cpu-hotplug lock.
> 
> What rwlock_t has, is reader bias to the point where you can utterly
> starve writers, with the side effect that you can obtain multiple read
> ownerships without causing a deadlock.

But what happens when this side effect disappears?
 
> This is not what is called a recursive lock. A recursive lock would have
> each owner only once, this rwlock_t thing is simply so unfair that it
> can have unlimited owners, including multiple copies of the same one.
> 
> rwsem has fifo fairness, and therefore can deadlock in this scenario,
> suppose thread A does a read, thread B tries a write and blocks, then
> thread A recurses and tries to obtain another read ownership --
> deadlock, as the FIFO fairness will demand the second read ownership
> will wait on the pending writer, which will wait on the outstanding read
> owner.
> 
> Now if rwsem were a fifo-fair recursive lock, the above would not
> deadlock, since it would detect that the task already had (read)
> ownership and simply increment the depth, instead of trying to acquire a
> second ownership.
> 
> This is all common and well understood terminology, not something Linus
> invented just to harras you with.

In Documentation/? Online? Where is the definition? The only reference
I see is indirectly in DocBook/kernel-locking.tmpl.

> Generally speaking we do not condone recursive locking strategies -- and
> afaik reiserfs (as per the BKL) and the network code (as per abusing
> rwlock_t unfairness) are the only offenders.
> 
> Like Linus stated, recursive locking is generally poor taste and
> indicates you basically gave up on trying to find a proper locking
> scheme. We should very much work towards getting rid of these
> abberations instead of adding new ones.

The people complaining about naming never seem to be the ones providing
workable suggestions or patches.

> Linus is very much right on what he said, and you calling him delusional
> only high-lights your ignorance on the issue.
> 
> [ PS. -rt implements rwlock_t as a proper recursive lock (either a mutex
>   or a full multi-owner reader-writer lock with PI fairness) so if
>   anybody abuses rwlock_t unfairness in a way that is not strict owner
>   recursive we have a problem. ]

Name it "dog's breath locking" for all I care. I am not bothering
with arguments over names; there is real work to do elsewhere.
Ingo Molnar April 27, 2009, 6:54 p.m. UTC | #9
* Stephen Hemminger <shemminger@vyatta.com> wrote:

> > non of the linux kernel locking primitives have this -- with the 
> > possible exception of the cpu-hotplug lock.
> > 
> > What rwlock_t has, is reader bias to the point where you can 
> > utterly starve writers, with the side effect that you can obtain 
> > multiple read ownerships without causing a deadlock.
> 
> But what happens when this side effect disappears?

Then well written code works, badly written code breaks.

> > [...]
> >
> > This is all common and well understood terminology, not 
> > something Linus invented just to harras you with.
> 
> In Documentation/ ?  online ?  Where is the definition? The only 
> reference I se is indirectly in DocBook/kernel-locking.tmpl.

Sure, see:

    http://tinyurl.com/c6fakc

> > Generally speaking we do not condone recursive locking 
> > strategies -- and afaik reiserfs (as per the BKL) and the 
> > network code (as per abusing rwlock_t unfairness) are the only 
> > offenders.
> > 
> > Like Linus stated, recursive locking is generally poor taste and 
> > indicates you basically gave up on trying to find a proper 
> > locking scheme. We should very much work towards getting rid of 
> > these abberations instead of adding new ones.
> 
> The people complaining about naming never seem to be the ones 
> providing workable suggestions or patches.

The thing is, while you now have named your locking primitive 
correctly, you are still abusing it by using it recursively.

So it's not 'just about naming'. You should not use read-locks as 
recursive locks. It's poor code.

	Ingo
stephen hemminger April 27, 2009, 7:06 p.m. UTC | #10
On Mon, 27 Apr 2009 20:54:23 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Stephen Hemminger <shemminger@vyatta.com> wrote:
> 
> > > none of the linux kernel locking primitives have this -- with the 
> > > possible exception of the cpu-hotplug lock.
> > > 
> > > What rwlock_t has, is reader bias to the point where you can 
> > > utterly starve writers, with the side effect that you can obtain 
> > > multiple read ownerships without causing a deadlock.
> > 
> > But what happens when this side effect disappears?
> 
> Then well written code works, badly written code breaks.
> 
> > > [...]
> > >
> > > This is all common and well understood terminology, not 
> > > something Linus invented just to harass you with.
> > 
> > In Documentation/ ?  online ?  Where is the definition? The only 
> > reference I see is indirectly in DocBook/kernel-locking.tmpl.
> 
> Sure, see:
> 
>     http://tinyurl.com/c6fakc

All those references support my argument that the lock is being
used recursively in this case. It is being acquired multiple
times by the same CPU. This is not new, it has always been
acquired multiple times, so my change does not break anything.

If other implementations of reader locks do not nest the same
way, then on those systems iptables can deadlock. Nothing was
changed by this.

> > > Generally speaking we do not condone recursive locking 
> > > strategies -- and afaik reiserfs (as per the BKL) and the 
> > > network code (as per abusing rwlock_t unfairness) are the only 
> > > offenders.
> > > 
> > > Like Linus stated, recursive locking is generally poor taste and 
> > > indicates you basically gave up on trying to find a proper 
> > > locking scheme. We should very much work towards getting rid of 
> > > these aberrations instead of adding new ones.
> > 
> > The people complaining about naming never seem to be the ones 
> > providing workable suggestions or patches.
> 
> The thing is, while you now have named your locking primitive 
> correctly, you are still abusing it by using it recursively.
> 
> So it's not 'just about naming'. You should not use read-locks as 
> recursive locks. It's poor code.

If you don't like the proposal, please think of a better alternative.
Not just pseudo code that is broken with handwaving arguments.
Linus Torvalds April 27, 2009, 7:46 p.m. UTC | #11
On Mon, 27 Apr 2009, Stephen Hemminger wrote:
>
> All those references support my argument that the lock is being
> used recursively in this case.

What's so hard about understanding the difference between "used 
recursively" and "recursive lock"?

THEY MEAN TWO TOTALLY DIFFERENT THINGS!

The fact that you don't seem to understand that is one of the things I've 
been complaining about all along.

So here's a final set of clue-bat:

Clue bat #1:

 - "can be used in a recursive context" is not the same as "is recursive".

   Analogy time: Look at "strtok_r()", and the difference between that and 
   "strtok()". "strtok_r()" can be used in a threaded environment. Does 
   that mean that 'strtok_r()' is threaded? No. When you call 
   'strtok_r()', it's still single-threaded - it's just that it _can_ be 
   called in a threaded environment and then still has sane semantics.

   Now, replace "threaded" with "recursive". Do you get it?

Clue bat #2:

 - a lock that can count does not mean that it is recursive. 

   Counting and recursion are TWO TOTALLY DIFFERENT ISSUES. The 
   "countingness" means that there can be multiple users inside of it, and 
   that, in turn, implies that maybe it can be used in a recursive 
   context. But again, counting and recursion are not about the same thing 
   at all.

   Put another way: a read-write lock is not a "recursive lock" despite 
   the fact that you can recursively take the lock for reading. It just 
   happens to count readers, and allow more than one (from _any_ context, 
   not just from the "same context").

Clue bat #3:

 - A recursive lock is very much _about_ recursion. It is explicitly about 
   the fact that the _same_ thread re-enters the lock, not about another 
   thread being in the locked region at the same time.

   See the difference? Big, big difference. A recursive lock will lock out 
   other thread contexts, even if it allows the current one to recurse. 
   Notice how the _only_ thing a recursive lock allows is that recursive 
   behavior, and nothing else.

   IOW, a "recursive lock" is explicitly designed for recursion. But that 
   doesn't mean that recursive algorithms cannot use functions that 
   aren't.

   Can you use "memcpy()" in a recursive algorithm? Yes. Does that mean 
   that "memcpy()" is suddenly a "recursive memory copy"? No.

   See the difference?

Clue bat #3:

 - if you do not understand the difference between these two things, don't 
   then try to claim that somebody _else_ who does understand it is 
   "deluding himself".

   Analogy time: Ethernet and a modem line can both get you on the
   internet. Now, let's say that Mr Peter Paste-Eater has heard of 
   ethernet, and knows you can get on the internet with an ethernet 
   connection, but he happens to use a modem line to do it.

   Now, Peter Paste-Eater talks to you, and tells you he is connecting to 
   the internet with ethernet, and proudly shows you his serial line and 
   modem, and tells you how he uses ethernet to get onto the internet. You 
   correct him, and tell him it's not ethernet.  He argues for several 
   days about how he gets on the internet, and that it must thus be 
   ethernet, and that you're obviously just "deluding yourself".

Now, can you see why people react badly to you talking about "recursive 
locks"? You're acting like Peter Paste-Eater calling his serial line 
ethernet. The fact that two _different_ things can be used for the same 
end result DOES NOT MAKE THEM THE SAME THING.

In other words:

 - "Recursive locks" are different from reader-writer locks.

 - The ability to count is different from recursion, although in a lock it 
   can obviously be _related_ to whether it can be used in a recursive 
   environment or not. If you don't have a counter, you probably cannot 
   recurse, but it's also not true that a counter implies that you always 
   can.

   A traditional counting lock is the old-fashioned 'semaphore' we have, 
   where the count allows for more than just simple mutual exclusion, and 
   is used when you might want to allow concurrency, but need to limit it
   to some number 'n' (although, almost always, n==1)

 - What the netfilter code wants is simply not a recursive lock. It wants 
   a form of locking that allows recursive _use_, but as mentioned, that 
   is totally and utterly irrelevant from what we call it.

   You _could_ use a recursive lock for it. BUT NONE OF THE PATCHES THAT 
   HAVE EVER BEEN POSTED HAVE BEEN RECURSIVE LOCKS! Nada. None. Zero. 
   Zilch.

So don't talk about recursive locks.

Get it? Finally? Or are you going to continue to be that Paste-Eater guy 
who refuses to understand that he is talking about something else than 
ethernet?

			Linus
Linus Torvalds April 27, 2009, 7:48 p.m. UTC | #12
On Mon, 27 Apr 2009, Linus Torvalds wrote:
> 
> Clue bat #1:
> 
> Clue bat #2:
> 
> Clue bat #3:
> 
> Clue bat #3:

Note to self: learn to count beyond three one of these days.

		Linus
Evgeniy Polyakov April 27, 2009, 8:36 p.m. UTC | #13
Hi.

On Mon, Apr 27, 2009 at 12:46:48PM -0700, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > All those references support my argument that the lock is being
> > used recursively in this case.
> 
> What's so hard about understanding the difference between "used 
> recursively" and "recursive lock"?
> 
> THEY MEAN TWO TOTALLY DIFFERENT THINGS!
> 
> The fact that you don't seem to understand that is one of the things I've 
> been complaining about all along.

Just to be sure readers will not lose the discussion topic: do you object
to the naming or to the realization? If it's about the former, does 'dog's
breath lock' proposed by Stephen fit?
Linus Torvalds April 27, 2009, 8:58 p.m. UTC | #14
On Tue, 28 Apr 2009, Evgeniy Polyakov wrote:
> 
> Just to be sure readers will not lose the discussion topic: do you object
> to the naming or to the realization?

I said _long_ ago that I thought the patch was fine. 

What I object to is people mis-representing the lock, and apparently 
having a really hard time admitting that having a good grounding in 
networking doesn't necessarily mean that you know everything about 
something as basic as locking.

> If it's about the former, does 'dog's breath lock' proposed by Stephen 
> fit?

Is that just an attempt to avoid admitting that they were wrong about lock 
naming? And then trying to trivialize it by calling things by a 
_different_ wrong name, but silly enough that they hope people won't call 
them on it?

Why not just use the correct name? I think it was Mathieu that just 
suggested:

	[PATCH] netfilter: use bh disabling with per-cpu read-write lock

or just call it "netfilter: use per-CPU read-write lock".

Why are people so against calling things by their correct names? Why do 
certain network people seem to force a stupid and incorrect description, 
when they have been told (a) that it's wrong and (b) why it's wrong 
several times?

What's so hard with just doing the TechnicallyRightThing(tm) and 
describing it as such?

And btw - don't get me wrong. There are _other_ ways to do that locking 
too. You don't have to use a rwlock. You can do it with explicit counting, 
the way Eric's original patch did. But it would be wrong to call that one 
"recursive lock" too.

Or you _could_ use a recursive lock, but nobody has actually posted such 
patches. It would work. No question about it. And if it actually _were_ a 
recursive lock, I wouldn't have objected about the description saying it 
is (although I would probably have objected to it being unnecessarily 
complex, when a simpler rwlock or the explicit count thing would be 
sufficient).

But since the current simple patch is using a rwlock, why not just say 
that? Why call it something incorrect ("recursive lock") or nonsensical 
("dog's breath lock").

As I tried to explain with an analogy, networking people would (quite 
correctly) object to me calling a serial line an "ethernet cable". Why is 
it so hard for netfilter people to then see why it's wrong to call a 
rwlock a "recursive lock".

I mean, guys, if you don't want to read up on decades of locking work, 
just google for it. Google for "recursive lock" (_with_ the quotes). At 
least for me, the very first hit gives a reasonable explanation for it, 
and it says:

  "POSIX allows mutexes to be recursive. That means the same thread can 
   lock the same mutex twice and won't deadlock."

and realize that the "same thread" part is very much a keyword, not just 
a random implementation detail (the first answer to the question is 
better than the question, but even the question at the top really does 
get at the basics).

And please do realize that neither rwlocks nor the counting locks from 
Dumazet's original patch do that. Never did. They simply aren't recursive 
locks.
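
To make the contrast concrete, here is a minimal userspace sketch of an
actually-recursive lock using the POSIX API quoted above (illustrative
only, not something anybody posted in this thread):

#include <pthread.h>

static pthread_mutex_t m;

static void init_recursive_mutex(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/* the defining property: the _same thread_ may relock without deadlock */
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
	pthread_mutex_init(&m, &attr);
	pthread_mutexattr_destroy(&attr);
}

static void recursive_use(void)
{
	pthread_mutex_lock(&m);
	pthread_mutex_lock(&m);		/* owner's count goes to 2, no deadlock */
	pthread_mutex_unlock(&m);
	pthread_mutex_unlock(&m);	/* released only when the count drops to 0 */
}

A second thread calling pthread_mutex_lock() here simply blocks until the
owning thread's count drops back to zero.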

So just don't call them that. But is "dog's breath" any better? Yes, maybe 
it's less _misleading_, but it sure as hell isn't any more descriptive.

			Linus
stephen hemminger April 27, 2009, 9:40 p.m. UTC | #15
On Mon, 27 Apr 2009 13:58:48 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Tue, 28 Apr 2009, Evgeniy Polyakov wrote:
> > 
> > Just to be sure readers will not lose the discussion topic: do you object
> > to the naming or to the realization?
> 
> I said _long_ ago that I thought the patch was fine. 
> 
> What I object to is people mis-representing the lock, and apparently 
> having a really hard time admitting that having a good grounding in 
> networking doesn't necessarily mean that you know everything about 
> something as basic as locking.
> 
> > If it's about the former, does 'dog's breath lock' proposed by Stephen 
> > fit?
> 
> Is that just an attempt to avoid admitting that they were wrong about lock 
> naming? And then trying to trivialize it by calling things by a 
> _different_ wrong name, but silly enough that they hope people won't call 
> them on it?

The part that concerns me is that the reader lock used in a nested manner on
the same CPU may still be broken on -rt. Other than that it is just language
lawyering; violent agreement that the lock gets used multiple times by the
same CPU. I never had the occasion to address this before
(and avoided such usage), but this legacy code exists and needs to
be dealt with.


> Why not just use the correct name? I think it was Mathieu that just 
> suggested:
> 
> 	[PATCH] netfilter: use bh disabling with per-cpu read-write lock
> 
> or just call it "netfilter: use per-CPU read-write lock".

[PATCH] netfilter: Ceci n'est pas une serrure récurrente

I don't care. I don't care. Don't you get the point yet?



> 
> Why are people so against calling things by their correct names? Why do 
> certain network people seem to force a stupid and incorrect description, 
> when they have been told (a) that it's wrong and (b) why it's wrong 
> several times?

Because meaning comes from context, and my meaning comes from a different
context. So we disagree on correct names. 

> What's so hard with just doing the TechnicallyRightThing(tm) and 
> describing it as such?
> 
> And btw - don't get me wrong. There are _other_ ways to do that locking 
> too. You don't have to use a rwlock. You can do it with explicit counting, 
> the way Eric's original patch did. But it would be wrong to call that one 
> "recursive lock" too.
> 
> Or you _could_ use a recursive lock, but nobody has actually posted such 
> patches. It would work. No question about it. And if it actually _were_ a 
> recursive lock, I wouldn't have objected about the description saying it 
> is (although I would probably have objected to it being unnecessarily 
> complex, when a simpler rwlock or the explicit count thing would be 
> sufficient).
> 
> But since the current simple patch is using a rwlock, why not just say 
> that? Why call it something incorrect ("recursive lock") or nonsensical 
> ("dog's breath lock").
> 
> As I tried to explain with an analogy, networking people would (quite 
> correctly) object to me calling a serial line an "ethernet cable". Why is 
> it so hard for netfilter people to then see why it's wrong to call a 
> rwlock a "recursive lock".
> 
> I mean, guys, if you don't want to read up on decades of locking work, 
> just google for it. Google for "recursive lock" (_with_ the quotes). At 
> least for me, the very first hit gives a reasonable explanation for it, 
> and it says:
> 
>   "POSIX allows mutexes to be recursive. That means the same thread can 
>    lock the same mutex twice and won't deadlock."
> 
> and realize that the "same thread" part is very much a keyword, not juat 
> a random implementation detail (the first answer to the question is 
> better than the question, but even the question at the top really does 
> get at the basics).
> 
> And please do realize that neither rwlocks nor the counting locks from 
> Dumazet's original patch do that. Never did. They simply aren't recursive 
> locks.
> 
> So just don't call them that. But is "dog's breath" any better? Yes, maybe 
> it's less _misleading_, but it sure as hell isn't any more descriptive.
> 
> 			Linus
Linus Torvalds April 27, 2009, 10:24 p.m. UTC | #16
On Mon, 27 Apr 2009, Stephen Hemminger wrote:
>
> The part that concerns me is that the reader lock used in a nested manner on
> the same CPU may still be broken on -rt.

I think that's a valid concern, and I don't actually object to not using a 
rwlock, but using the "explicit counting + spinlock" that we had at one 
point. It _might_ even be faster, since a spinlock can be faster than a 
rwlock, and the (rare) case where you recurse you can then avoid the 
atomic entirely.

But EVEN IF YOU DO THAT, it's still wrong to call it a "recursive lock". 
Because it still wouldn't be one.

That's kind of my point, and always has been. It was why I objected to the 
original patch description by Dumazet. It wasn't a recursive lock back 
then _either_. For all the reasons I tried to explain to you, and you seem 
to not care about.

Btw, if you do use the "explicit count" case, then please please please 
make sure it's documented and bug-free. And dammit, correct documentation 
in this case very much talks about how it is _not_ a "recursive lock", but 
a spinlock that then has an explicit counter that avoids taking the lock 
entirely in one very specific path (that happens to be recursive).

The thing is, using a rwlock kind of implicitly documents all of that, 
because you can tell somebody who doesn't even read the code that it's a 
"per-cpu rwlock", and they then know what the semantics are (given that 
they know the kernel semantics for locking in the first place).

But once you start doing your own locking rules, you really have to 
explain why it works, and what it does. And you do have to be very 
careful, because it's so easy to get locking wrong.

> I don't care. I don't care. Don't you get the point yet.

If you don't care about the naming, then use the right one. 

And if you don't care about the naming, why do you then say I'm deluding 
myself, when I'm _correct_, and I _do_ happen to care about correct 
naming.

Locking really is important.  Code that gets locking wrong breaks in
really subtle and nasty ways.  And it sadly tends to "work" in testing,
since the breakage cases require just the right timing.  So locking
should be robust and as "obviously correct" as possible.

And naming really is important.  Misnaming things makes people make
assumptions that aren't true.  If you talk about recursive locks, it
should be reasonable that people who know how recursive locks work would
then make assumptions about them.  If the code then doesn't actually
match those rules, that's bad. 

It's like having a comment in front of a piece of code that says something 
totally different than what the code actually does. It's _bad_. That's why 
naming matters so much - because naming is commentary.  If you mis-name 
things on purpose, it's simply a bug.

Do _you_ get the point?

You _do_ care about bugs, don't you?

			Linus
Linus Torvalds April 27, 2009, 11:01 p.m. UTC | #17
On Mon, 27 Apr 2009, Linus Torvalds wrote:
> 
> Btw, if you do use the "explicit count" case, then please please please 
> make sure it's documented and bug-free. And dammit, correct documentation 
> in this case very much talks about how it is _not_ a "recursive lock", but 
> a spinlock that then has an explicit counter that avoids taking the lock 
> entirely in one very specific path (that happens to be recursive).

So, just as an example, I would not object to the counter approach, as 
long as it really talks about the issues, and why it works (and as long as 
that doesn't call the locks "recursive locks").

So if you are just nervous about rwlocks, then something like this might
work (needs testing, and other people to sanity-check it).

I left the commentary about "readers" and "writers", because in many
ways it's correct, and what the code actually does is very much to
emulate a reader-writer lock.  I put quotes around the uses in the
comments to high-light that it largely _acts_ as a reader-writer lock. 

Now, it would actually be a really _bad_ reader-writer lock in general, 
exactly because it's not very "atomic" (ie this would be a truly sucky and 
broken lock if we didn't have the strict per-cpu usage rules).

So it just so happens that the per-cpu'ness of the lock, together with the 
very limited way in which it is used, make that sucky implementation 
possible - and indeed possibly faster than the standard (and generic) 
kernel rwlock implementation.

So it's basically a special-case rwlock that happens to work due to the
specific rules. 

And exactly because it really wouldn't work in the absense of those
rules, those rules really do need to have big comments on them so that 
people don't then later forget about the limitations.

BTW: THIS IS TOTALLY UNTESTED.  I just cut-and-pasted the existing
rwlock version from one of the later patches, used the counting approach
from one of the earlier ones, and basically just added what I think
would be minimal required comments for this special case.  All of it was
written inside the mail reader, none of it has been compiled or tested
in any other way. 

Because it's exactly the "lock semantics awareness" that I think is so
important here (and that's why all the naming and commentary is so
critical). 

Btw, the "counter" could probably be just a single bit, since apparently
the nesting level is always supposed to be just one.  I made it
"unsigned char" just because sometimes spinlocks can be small, and it
seemed the simplest type.  But perhaps it would be better as an
"unsigned int", since some architectures potentially could do that
faster (eg old alphas).

			Linus

---
/*
 * Per-CPU spinlock associated with per-cpu table entries, and
 * with a counter for the "reading" side that allows a recursive
 * reader to avoid taking the lock and deadlocking.
 *
 * "reading" is used by ip/arp/ip6 tables rule processing which runs per-cpu.
 * It needs to ensure that the rules are not being changed while the packet
 * is being processed. In some cases, the read lock will be acquired
 * twice on the same CPU; this is okay because of the count.
 *
 * The write lock is used in two cases:
 *    1. reading counter values
 *       all rule processing needs to be stopped and the per-CPU values are summed.
 *
 *    2. replacing tables
 *       any readers that are using the old tables have to complete
 *       before freeing the old table. This is handled by reading
 *       as a side effect of reading counters
 */
struct xt_info_lock {
	spinlock_t lock;
	unsigned char readers;
};
DECLARE_PER_CPU(struct xt_info_lock, xt_info_locks);

/*
 * Note: we need to ensure that preemption is disabled before acquiring
 * the per-cpu-variable, so we do it as a two step process rather than
 * using "spin_lock_bh()". 
 *
 * We _also_ need to disable bottom half processing before updating our
 * nesting count, to make sure that the only kind of re-entrancy is this
 * code being called by itself: since the count+lock is not an atomic
 * operation, we can allow no races. 
 *
 * _Only_ that special combination of being per-cpu and never getting
 * re-entered asynchronously means that the count is safe. 
 */
static inline void xt_info_rdlock_bh(void)
{
	struct xt_info_lock *lock;

	local_bh_disable();
	lock = &__get_cpu_var(xt_info_locks);
	if (!lock->readers++)
		spin_lock(&lock->lock);
}

static inline void xt_info_rdunlock_bh(void)
{
	struct xt_info_lock *lock;

	lock = &__get_cpu_var(xt_info_locks);
	if (!--lock->readers)
		spin_unlock(&lock->lock);
}

/*
 * The "writer" side needs to get exclusive access to the lock,
 * regardless of readers.  This must be called with bottom half
 * processing (and thus also preemption) disabled. 
 */
static inline void xt_info_wrlock(unsigned int cpu)
{
	spin_lock(&per_cpu(xt_info_locks, cpu).lock);
}

static inline void xt_info_wrunlock(unsigned int cpu)
{
	spin_unlock(&per_cpu(xt_info_locks, cpu).lock);
}
Linus Torvalds April 27, 2009, 11:03 p.m. UTC | #18
On Mon, 27 Apr 2009, Linus Torvalds wrote:
> 
> BTW: THIS IS TOTALLY UNTESTED.

Gaah. I should have read through it one more time before sending.

> static inline void xt_info_rdunlock_bh(void)
> {
> 	struct xt_info_lock *lock;
> 
> 	lock = &__get_cpu_var(xt_info_locks);
> 	if (!--lock->readers)
> 		spin_unlock(&lock->lock);
> }

This one was missing the "local_bh_enable()" at the end.

There may be other bugs, but that's the one I noticed immediately when 
reading what I sent out. Oops.
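
For clarity, here is that helper again with the missing local_bh_enable()
folded in (still the same untested sketch, just with the fix noted above):

static inline void xt_info_rdunlock_bh(void)
{
	struct xt_info_lock *lock;

	lock = &__get_cpu_var(xt_info_locks);
	if (!--lock->readers)
		spin_unlock(&lock->lock);
	local_bh_enable();
}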

			Linus
Linus Torvalds April 27, 2009, 11:32 p.m. UTC | #19
On Mon, 27 Apr 2009, Linus Torvalds wrote:
> 
> I left the commentary about "readers" and "writers", because in many
> ways it's correct, and what the code actually does is very much to
> emulate a reader-writer lock.  I put quotes around the uses in the
> comments to high-light that it largely _acts_ as a reader-writer lock. 

Btw, I think it was Paul who pointed out that technically it's probably 
better to call them "local" and "global" lockers instead of "readers" and 
"writers".

That also probably clarifies the rules on when you use one over the other 
(ie reading off all the statistics is a "global" operation, as is 
obviously replacing the tables).

Of course, "readers" and "writers" is something most Linux lock people are 
more used to. Or "brlock" for the old-timers, but that involves a heavy 
dose of bad taste. The new use is much nicer, especially since it never 
takes the global lock on _all_ cpu's (which was really a killer in so 
many ways).

			Linus
Eric Dumazet April 28, 2009, 6:58 a.m. UTC | #20
Linus Torvalds a écrit :
> 
> On Mon, 27 Apr 2009, Linus Torvalds wrote:
>> BTW: THIS IS TOTALLY UNTESTED.
> 
> Gaah. I should have read through it one more time before sending.
> 
>> static inline void xt_info_rdunlock_bh(void)
>> {
>> 	struct xt_info_lock *lock;
>>
>> 	lock = &__get_cpu_var(xt_info_locks);
>> 	if (!--lock->readers)
>> 		spin_unlock(&lock->lock);
>> }
> 
> This one was missing the "local_bh_enable()" at the end.
> 
> There may be other bugs, but that's the one I noticed immediately when 
> reading what I sent out. Oops.

I am not sure my day job will permit me to polish a patch mixing all
the bits and comments. But I am glad we eventually got back spinlocks
which are probably better than rwlocks for implementing this stuff.

Instead of submitting a full patch again, we could first submit a new
 include file containing all comments and inline functions ?

This include file could be local to netfilter, with a big stick on
it to forbid its use in other areas (no changes in Documentation/)

Then, as soon as we can go back to pure RCU solution, we can safely
delete this controversial-locking-nesting-per-cpu-thing ?


About the local/global names that Paul suggested: 'global' was about
locking all locks at the same time, which is not what we do any more,
so it is not the right name IMHO

Maybe something like local/remote or owner/foreigner ?

xt_info_owner_lock_bh(), xt_info_owner_unlock_bh()
xt_info_foreigner_lock(), xt_info_foreigner_unlock()

One comment about this comment you wrote :

/*
 * The "writer" side needs to get exclusive access to the lock,
 * regardless of readers.  This must be called with bottom half
 * processing (and thus also preemption) disabled. 
 */
static inline void xt_info_wrlock(unsigned int cpu)
{
	spin_lock(&per_cpu(xt_info_locks, cpu).lock);
}

static inline void xt_info_wrunlock(unsigned int cpu)
{
	spin_unlock(&per_cpu(xt_info_locks, cpu).lock);
}

It's true that BH should be disabled if caller runs
on the cpu it wants to lock. 
For other ones (true foreigners), there is
no requirement about BH (current cpu could be interrupted
by a softirq and packets could fly)

We could use following construct and not require disabling BH
more than a short period of time.
(But preemption disabled for the whole duration)

preempt_disable(); // could be cpu_migration_disable();

int curcpu = smp_processor_id();
/*
 * Gather stats for current cpu : must disable BH
 * before trying to lock.
 */
local_bh_disable();
xt_info_wrlock(curcpu);
// copy stats of this cpu on my private data (not shown here)
xt_info_wrunlock(curcpu);
local_bh_enable();

for_each_possible_cpu(cpu) {
	if (cpu == curcpu)
		continue;
	xt_info_wrlock(cpu);
	// fold stats of "cpu" on my private data (not shown here)
	xt_info_wrunlock(cpu);
}
preempt_enable(); // could be cpu_migration_enable();


So your initial comment could be changed to :

/*
 * The "writer" side needs to get exclusive access to the lock,
 * regardless of readers. If caller is about to lock its own lock,
 * he must have disabled BH before. For other cpus, no special
 * care but preemption disabled to guarantee no cpu migration.
 */

Back to work now :)

Peter Zijlstra April 28, 2009, 7:41 a.m. UTC | #21
On Mon, 2009-04-27 at 16:32 -0700, Linus Torvalds wrote:
> 
> On Mon, 27 Apr 2009, Linus Torvalds wrote:
> > 
> > I left the commentary about "readers" and "writers", because in many
> > ways it's correct, and what the code actually does is very much to
> > emulate a reader-writer lock.  I put quotes around the uses in the
> > comments to high-light that it largely _acts_ as a reader-writer lock. 
> 
> Btw, I think it was Paul who pointed out that technically it's probably 
> better to call them "local" and "global" lockers instead of "readers" and 
> "writers".

exclusive vs non-exclusive is what the literature would call them in
most cases I think.

> That also probably clarifies the rules on when you use one over the other 
> (ie reading off all the statistics is a "global" operation, as is 
> obviously replacing the tables).
> 
> Of course, "readers" and "writers" is something most Linux lock people are 
> more used to. Or "brlock" for the old-timers, but that involves a heavy 
> dose of bad taste. The new use is much nicer, especially since it never 
> takes the global lock on _all_ cpu's (which was really a killer in so 
> many ways).
> 
> 			Linus
Jan Engelhardt April 28, 2009, 7:42 a.m. UTC | #22
On Monday 2009-04-27 21:46, Linus Torvalds wrote:

>Clue bat #3 [sic #4]:
>
> - if you do not understand the difference between these two things, don't 
>   then try to claim that somebody _else_ who does understand it is 
>   "deluding himself".
>
>   Analogy time: Ethernet and a modem line can both get you on the
>   internet. Now, let's say that Mr Peter Paste-Eater has heard of 
>   ethernet, and knows you can get on the internet with an ethernet 
>   connection, but he happens to use a modem line to do it.
>
>   Now, Peter Paste-Eater talks to you, and tells you he is connecting to 
>   the internet with ethernet, and proudly shows you his serial line and 
>   modem, and tells you how he uses ethernet to get onto the internet. You 
>   correct him, and tell him it's not ethernet.  He argues for several 
>   days about how he gets on the internet, and that it must thus be 
>   ethernet, and that you're obviously just "deluding yourself".
>
>Now, can you see why people react badly to you talking about "recursive 
>locks"? You're acting like Peter Paste-Eater calling his serial line 
>ethernet.

It could be worse. He could be running Ethernet over serial, e.g. L2TP.
Or his serial line is a TP cable with RJ45 plugs - consumers like
to call that Ethernet (cable) too.
David Miller April 28, 2009, 11:53 a.m. UTC | #23
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Tue, 28 Apr 2009 08:58:05 +0200

> I am not sure my day job will permit me to polish a patch mixing all
> the bits and comments. But I am glad we eventually got back spinlocks
> which are probably better than rwlocks for implementing this stuff.
> 
> Instead of submitting a full patch again, we could first submit a new
>  include file containing all comments and inline functions ?
> 
> This include file could be local to netfilter, with a big stick on
> it to forbid its use in other areas (no changes in Documentation/)
> 
> Then, as soon as we can go back to pure RCU solution, we can safely
> delete this controversial-locking-nesting-per-cpu-thing ?

I say we merge Linus's locking idea into the XV patch, fixup the
commit message wording, and move on with life.

For something that's going to get deleted as soon as the faster grace
period RCU stuff is available, it has consumed an inordinate amount of
our time :-)

I might take a stab at this before hitting bed tonight, no promises :)
Ingo Molnar April 28, 2009, 12:40 p.m. UTC | #24
* David Miller <davem@davemloft.net> wrote:

> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Tue, 28 Apr 2009 08:58:05 +0200
> 
> > I am not sure my day job will permit me to polish a patch mixing all
> > the bits and comments. But I am glad we eventually got back spinlocks
> > which are probably better than rwlocks for implementing this stuff.
> > 
> > Instead of submitting a full patch again, we could first submit a new
> >  include file containing all comments and inline functions ?
> > 
> > This include file could be local to netfilter, with a big stick on
> > it to forbid its use in other areas (no changes in Documentation/)
> > 
> > Then, as soon as we can go back to pure RCU solution, we can 
> > safely delete this controversial-locking-nesting-per-cpu-thing ?
> 
> I say we merge Linus's locking idea into the XV patch, fixup the 
> commit message wording, and move on with life.
> 
> For something that's going to get deleted as soon as the faster 
> grace period RCU stuff is available, it has consumed an inordinate 
> amount of our time :-)

One more reason to factor out this code into general locking code.

The latest code looks a bit similar to the old big-reader-locks hack 
(which got dropped for good many eons ago and with which I deny any 
involvement, such as having authored it. [oh, did I say that 
out loud? crap.]), implemented cleanly and properly.

IMHO this locking construct should be considered for 
linux/local_lock.h and kernel/local_lock.c. Even if the netfilter 
code drops its use soon afterwards ;-)

[ The _only_ thing I am worried about is the apparent fact that
  there's so much confusion about recursion versus read-access.
  Recursion might be hard to factor out of the netfilter code, and
  maybe it's not even possible realistically (we fought years with
  the BKL and are still fighting it), but if its harms are not even
  _realized_, that difficult task turns into an impossible task ;-) ]

	Ingo
David Miller April 28, 2009, 1:43 p.m. UTC | #25
From: Ingo Molnar <mingo@elte.hu>
Date: Tue, 28 Apr 2009 14:40:33 +0200

> IMHO this locking construct should be considered for 
> linux/local_lock.h and kernel/local_lock.c. Even if the netfilter 
> code drops its use soon afterwards ;-)

If you can show me how to pass a per-cpu variable (the variable,
not a dereference of it) as an argument to an inline function,
I'll implement this :-)

It has to be dereferenced after local_bh_disable() for the
read side acquisition.
Mathieu Desnoyers April 28, 2009, 1:52 p.m. UTC | #26
* David Miller (davem@davemloft.net) wrote:
> From: Ingo Molnar <mingo@elte.hu>
> Date: Tue, 28 Apr 2009 14:40:33 +0200
> 
> > IMHO this locking construct should be considered for 
> > linux/local_lock.h and kernel/local_lock.c. Even if the netfilter 
> > code drops its use soon afterwards ;-)
> 
> If you can show me how to pass a per-cpu variable (the variable,
> not a dereference of it) as an argument to an inline function,
> I'll implement this :-)
> 
> It has to be dereferenced after local_bh_disable() for the
> read side acquisition.

The local_bh_disable() could be outside of the locking construct. This
would make it easier to adapt it to various users (irq disable, bh
disable, preempt disable) depending on the contexts from which they must
be protected.
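
As a rough sketch of that split (hypothetical helper names, reusing the
per-cpu xt_info_lock structure from the sketch earlier in this thread),
the primitive itself would not touch the context at all:

/* count + spinlock only, no knowledge of bh/irq/preempt context */
static inline void xt_info_local_lock(struct xt_info_lock *lock)
{
	if (!lock->readers++)
		spin_lock(&lock->lock);
}

/* the packet path would protect itself against softirq re-entry ... */
static inline void xt_info_local_lock_bh(void)
{
	local_bh_disable();
	xt_info_local_lock(&__get_cpu_var(xt_info_locks));
}

/* ... while a hypothetical user that only races with preemption could do */
static inline void xt_info_local_lock_preempt(void)
{
	preempt_disable();
	xt_info_local_lock(&__get_cpu_var(xt_info_locks));
}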

And if it still does not work for some reason, using a #define is
discouraged, but could work.

Mathieu
Paul E. McKenney April 28, 2009, 2:22 p.m. UTC | #27
On Tue, Apr 28, 2009 at 09:41:08AM +0200, Peter Zijlstra wrote:
> On Mon, 2009-04-27 at 16:32 -0700, Linus Torvalds wrote:
> > 
> > On Mon, 27 Apr 2009, Linus Torvalds wrote:
> > > 
> > > I left the commentary about "readers" and "writers", because in many
> > > ways it's correct, and what the code actually does is very much to
> > > emulate a reader-writer lock.  I put quotes around the uses in the
> > > comments to high-light that it largely _acts_ as a reader-writer lock. 
> > 
> > Btw, I think it was Paul who pointed out that technically it's probably 
> > better to call them "local" and "global" lockers instead of "readers" and 
> > "writers".
> 
> exclusive vs non-exclusive is what the literature would call them in
> most cases I think.

I would argue that the non-exclusive category includes both reader-writer
locking and local-global locking.  That said, we have an unusual variant
of local-global in this case, as the global processing acquires only one
of the locks at a time.

							Thanx, Paul
David Miller April 28, 2009, 2:37 p.m. UTC | #28
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Date: Tue, 28 Apr 2009 09:52:19 -0400

> The local_bh_disable() could be outside of the locking construct. This
> would make it easier to adapt it to various users (irq disable, bh
> disable, preempt disable) depending on the contexts from which they must
> be protected.
> 
> And if it still does not work for some reason, using a #define is
> discouraged, but could work.

That's what I was hoping to avoid, things like macros and having
the callers of this thing expand the two parts of the operation.

What's the point in making this generic if it ends up being ugly
as hell?
Mathieu Desnoyers April 28, 2009, 2:49 p.m. UTC | #29
* David Miller (davem@davemloft.net) wrote:
> From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Date: Tue, 28 Apr 2009 09:52:19 -0400
> 
> > The local_bh_disable() could be outside of the locking construct. This
> > would make it easier to adapt it to various users (irq disable, bh
> > disable, preempt disable) depending on the contexts from which they must
> > be protected.
> > 
> > And if it still does not work for some reason, using a #define is
> > discouraged, but could work.
> 
> That's what I was hoping to avoid, things like macros and having
> the callers of this thing expand the two parts of the operation.
> 
> What's the point in making this generic if it ends up being ugly
> as hell?

.. and what's the point in making it generic if it can be replaced
by a proper RCU implementation ? :-) I am not convinced of the added
value we get in making it a generic header this soon. I would wait for
other users to express similar needs, otherwise this could soon become
an orphaned piece of locking code.

Mathieu
Linus Torvalds April 28, 2009, 3:09 p.m. UTC | #30
On Tue, 28 Apr 2009, Eric Dumazet wrote:
> 
> Instead of submitting a full patch again, we could first submit a new
>  include file containing all comments and inline functions ?

Well, I actually already suggested to David that he should just merge the 
last patch I saw floating around (with the "recursive" -> "readwrite" fix 
in the comment ;), so that we can at least get the basic issue fixed, and 
then we can tweak it a bit with smaller patches flying around.

And at least right now, the difference between the rwlock and the 
"count+spinlock" should be basically almost unnoticeable, and a very small 
implementation issue. They're entirely interchangeable, after all.

> This include file could be local to netfilter, with a big stick on
> it to forbid its use in other areas (no changes in Documentation/)
>
> Then, as soon as we can go back to pure RCU solution, we can safely
> delete this controversial-locking-nesting-per-cpu-thing ?

I don't have any strong preferences, but I'd almost prefer to not abstract 
things out even that much. It's already pretty well hidden inside 
<netfilter/x_tables.h>, I'd hate to add a new file just for this. 

As to just adding more commenting that it must not be used anywhere else, 
I certainly agree with that.

> About the local/global names that Paul suggested: 'global' was about
> locking all locks at the same time, which is not what we do any more,
> so it is not the right name IMHO
> 
> Maybe something like local/remote or owner/foreigner ?

local/remote works for me, and yes, since we only take the remote side one 
CPU at a time, I guess "global" is misleading. But "owner/foreigner" 
sounds pretty odd.

> One comment about this comment you wrote :
> 
> /*
>  * The "writer" side needs to get exclusive access to the lock,
>  * regardless of readers.  This must be called with bottom half
>  * processing (and thus also preemption) disabled. 
>  */
> 
> It's true that BH should be disabled if caller runs
> on the cpu it wants to lock. 
> For other ones (true foreigners), there is
> no requirement about BH (current cpu could be interrupted
> by a softirq and packets could fly)

Yes. Other CPU's just require preemption protection. 

> We could use following construct and not require disabling BH
> more than a short period of time.
> (But preemption disabled for the whole duration)
> 
> preempt_disable(); // could be cpu_migration_disable();
> 
> int curcpu = smp_processor_id();
> /*
>  * Gather stats for current cpu : must disable BH
>  * before trying to lock.
>  */
> local_bh_disable();
> xt_info_wrlock(curcpu);
> // copy stats of this cpu on my private data (not shown here)
> xt_info_wrunlock(curcpu);
> local_bh_enable();
> 
> for_each_possible_cpu(cpu) {
> 	if (cpu == curcpu)
> 		continue;
> 	xt_info_wrlock(cpu);
> 	// fold stats of "cpu" on my private data (not shown here)
> 	xt_info_wrunlock(cpu);
> }
> preempt_enable(); // could be cpu_migration_enable();

Agreed. 

> So your initial comment could be changed to :
> 
> /*
>  * The "writer" side needs to get exclusive access to the lock,
>  * regardless of readers. If caller is about to lock its own lock,
>  * he must have disabled BH before. For other cpus, no special
>  * care but preemption disabled to guarantee no cpu migration.
>  */

Ack.

			Linus
Paul E. McKenney April 28, 2009, 3:42 p.m. UTC | #31
On Tue, Apr 28, 2009 at 06:43:40AM -0700, David Miller wrote:
> From: Ingo Molnar <mingo@elte.hu>
> Date: Tue, 28 Apr 2009 14:40:33 +0200
> 
> > IMHO this locking construct should be considered for 
> > linux/local_lock.h and kernel/local_lock.c. Even if the netfilter 
> > code drops its use soon afterwards ;-)
> 
> If you can show me how to pass a per-cpu variable (the variable,
> not a dereference of it) as an argument to an inline function,
> I'll implement this :-)
> 
> It has to be dereferenced after local_bh_disable() for the
> read side acquisition.

The way I did this in treercu.c was to create an array of references
to the per-CPU data in question.  Not necessarily recommended, but
one way of doing it.  That said, one could argue that we should wait
until we have at least three users before creating a generic primitive.
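
Roughly along these lines (hypothetical sketch reusing the xt_info_lock
structure from earlier in the thread, not the actual treercu.c code):

static DEFINE_PER_CPU(struct xt_info_lock, xt_info_locks);
static struct xt_info_lock *xt_info_lock_refs[NR_CPUS];

static void __init xt_info_lock_refs_init(void)
{
	int cpu;

	/* build an ordinary array of pointers to the per-CPU data ... */
	for_each_possible_cpu(cpu)
		xt_info_lock_refs[cpu] = &per_cpu(xt_info_locks, cpu);
}

/* ... a plain pointer from the array can then be passed to inline functions */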

And I just know that I am going to regret this deeply, but I cannot
resist posting the following URL:

http://en.wikipedia.org/wiki/Wikipedia:Avoid_Parkinson's_Bicycle_Shed_Effect

							Thanx, Paul
Christoph Lameter (Ampere) April 28, 2009, 5:35 p.m. UTC | #32
On Tue, 28 Apr 2009, Paul E. McKenney wrote:

> > If you can show me how to pass a per-cpu variable (the variable,
> > not a dereference of it) as an argument to an inline function,
> > I'll implement this :-)
> >
> > It has to be dereferenced after local_bh_disable() for the
> > read side acquisition.
>
> The way I did this in treercu.c was to create an array of references
> to the per-CPU data in question.  Not necessarily recommended, but
> one way of doing it.  That said, one could argue that we should wait
> until we have at least three users before creating a generic primitive.

The new percpu allocator allows you to create a per cpu pointer and pass
it to functions.

per_cpu_ptr(pointer,cpu) is used to select an instance.
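
For example, a rough sketch of how the xt_info locks could be allocated
and handed around that way (hypothetical, reusing the xt_info_lock
structure from earlier in the thread, not part of any posted patch):

static struct xt_info_lock *xt_info_locks;	/* from alloc_percpu() */

static int xt_info_locks_alloc(void)
{
	int cpu;

	xt_info_locks = alloc_percpu(struct xt_info_lock);
	if (!xt_info_locks)
		return -ENOMEM;

	for_each_possible_cpu(cpu)
		spin_lock_init(&per_cpu_ptr(xt_info_locks, cpu)->lock);
	return 0;
}

/* the per-cpu pointer itself can be passed to an inline function ... */
static inline void xt_info_wrlock_ptr(struct xt_info_lock *locks, unsigned int cpu)
{
	/* ... and per_cpu_ptr() selects one CPU's instance */
	spin_lock(&per_cpu_ptr(locks, cpu)->lock);
}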

Patch

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 7b1a652..511debb 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -354,9 +354,6 @@  struct xt_table
 	/* What hooks you will enter on */
 	unsigned int valid_hooks;
 
-	/* Lock for the curtain */
-	struct mutex lock;
-
 	/* Man behind the curtain... */
 	struct xt_table_info *private;
 
@@ -434,8 +431,56 @@  extern void xt_proto_fini(struct net *net, u_int8_t af);
 
 extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
 extern void xt_free_table_info(struct xt_table_info *info);
-extern void xt_table_entry_swap_rcu(struct xt_table_info *old,
-				    struct xt_table_info *new);
+
+/*
+ * Per-CPU read/write lock associated with per-cpu table entries.
+ * This is not a general solution but makes reader locking fast since
+ * there is no shared variable to cause cache ping-pong; but adds an
+ * additional write-side penalty since update must lock all
+ * possible CPU's.
+ *
+ * Read lock is used by ip/arp/ip6 tables rule processing which runs per-cpu.
+ * It needs to ensure that the rules are not being changed while packet
+ * is being processed. In some cases, the read lock will be acquired
+ * twice on the same CPU; this is okay because read locks handle nesting.
+ *
+ * Write lock is used in two cases:
+ *    1. reading counter values
+ *       all readers need to be stopped and the per-CPU values are summed.
+ *
+ *    2. replacing tables
+ *       any readers that are using the old tables have to complete
+ *       before freeing the old table. This is handled by reading
+ *	  as a side effect of reading counters
+ */
+DECLARE_PER_CPU(rwlock_t, xt_info_locks);
+
+static inline void xt_info_rdlock_bh(void)
+{
+	/*
+	 * Note: can not use read_lock_bh(&__get_cpu_var(xt_info_locks))
+	 * because need to ensure that preemption is disable before
+	 * acquiring per-cpu-variable, so do it as a two step process
+	 */
+	local_bh_disable();
+	read_lock(&__get_cpu_var(xt_info_locks));
+}
+
+static inline void xt_info_rdunlock_bh(void)
+{
+	read_unlock_bh(&__get_cpu_var(xt_info_locks));
+}
+
+static inline void xt_info_wrlock(unsigned int cpu)
+{
+	write_lock(&per_cpu(xt_info_locks, cpu));
+}
+
+static inline void xt_info_wrunlock(unsigned int cpu)
+{
+
+	write_unlock(&per_cpu(xt_info_locks, cpu));
+}
 
 /*
  * This helper is performance critical and must be inlined
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 5ba533d..831fe18 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -253,9 +253,9 @@  unsigned int arpt_do_table(struct sk_buff *skb,
 	indev = in ? in->name : nulldevname;
 	outdev = out ? out->name : nulldevname;
 
-	rcu_read_lock_bh();
-	private = rcu_dereference(table->private);
-	table_base = rcu_dereference(private->entries[smp_processor_id()]);
+	xt_info_rdlock_bh();
+	private = table->private;
+	table_base = private->entries[smp_processor_id()];
 
 	e = get_entry(table_base, private->hook_entry[hook]);
 	back = get_entry(table_base, private->underflow[hook]);
@@ -273,6 +273,7 @@  unsigned int arpt_do_table(struct sk_buff *skb,
 
 			hdr_len = sizeof(*arp) + (2 * sizeof(struct in_addr)) +
 				(2 * skb->dev->addr_len);
+
 			ADD_COUNTER(e->counters, hdr_len, 1);
 
 			t = arpt_get_target(e);
@@ -328,8 +329,7 @@  unsigned int arpt_do_table(struct sk_buff *skb,
 			e = (void *)e + e->next_offset;
 		}
 	} while (!hotdrop);
-
-	rcu_read_unlock_bh();
+	xt_info_rdunlock_bh();
 
 	if (hotdrop)
 		return NF_DROP;
@@ -711,9 +711,12 @@  static void get_counters(const struct xt_table_info *t,
 	/* Instead of clearing (by a previous call to memset())
 	 * the counters and using adds, we set the counters
 	 * with data used by 'current' CPU
-	 * We dont care about preemption here.
+	 *
+	 * Bottom half has to be disabled to prevent deadlock
+	 * if new softirq were to run and call ipt_do_table
 	 */
-	curcpu = raw_smp_processor_id();
+	local_bh_disable();
+	curcpu = smp_processor_id();
 
 	i = 0;
 	ARPT_ENTRY_ITERATE(t->entries[curcpu],
@@ -726,73 +729,22 @@  static void get_counters(const struct xt_table_info *t,
 		if (cpu == curcpu)
 			continue;
 		i = 0;
+		xt_info_wrlock(cpu);
 		ARPT_ENTRY_ITERATE(t->entries[cpu],
 				   t->size,
 				   add_entry_to_counter,
 				   counters,
 				   &i);
+		xt_info_wrunlock(cpu);
 	}
-}
-
-
-/* We're lazy, and add to the first CPU; overflow works its fey magic
- * and everything is OK. */
-static int
-add_counter_to_entry(struct arpt_entry *e,
-		     const struct xt_counters addme[],
-		     unsigned int *i)
-{
-	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
-
-	(*i)++;
-	return 0;
-}
-
-/* Take values from counters and add them back onto the current cpu */
-static void put_counters(struct xt_table_info *t,
-			 const struct xt_counters counters[])
-{
-	unsigned int i, cpu;
-
-	local_bh_disable();
-	cpu = smp_processor_id();
-	i = 0;
-	ARPT_ENTRY_ITERATE(t->entries[cpu],
-			  t->size,
-			  add_counter_to_entry,
-			  counters,
-			  &i);
 	local_bh_enable();
 }
 
-static inline int
-zero_entry_counter(struct arpt_entry *e, void *arg)
-{
-	e->counters.bcnt = 0;
-	e->counters.pcnt = 0;
-	return 0;
-}
-
-static void
-clone_counters(struct xt_table_info *newinfo, const struct xt_table_info *info)
-{
-	unsigned int cpu;
-	const void *loc_cpu_entry = info->entries[raw_smp_processor_id()];
-
-	memcpy(newinfo, info, offsetof(struct xt_table_info, entries));
-	for_each_possible_cpu(cpu) {
-		memcpy(newinfo->entries[cpu], loc_cpu_entry, info->size);
-		ARPT_ENTRY_ITERATE(newinfo->entries[cpu], newinfo->size,
-				  zero_entry_counter, NULL);
-	}
-}
-
 static struct xt_counters *alloc_counters(struct xt_table *table)
 {
 	unsigned int countersize;
 	struct xt_counters *counters;
 	struct xt_table_info *private = table->private;
-	struct xt_table_info *info;
 
 	/* We need atomic snapshot of counters: rest doesn't change
 	 * (other than comefrom, which userspace doesn't care
@@ -802,30 +754,11 @@  static struct xt_counters *alloc_counters(struct xt_table *table)
 	counters = vmalloc_node(countersize, numa_node_id());
 
 	if (counters == NULL)
-		goto nomem;
-
-	info = xt_alloc_table_info(private->size);
-	if (!info)
-		goto free_counters;
-
-	clone_counters(info, private);
-
-	mutex_lock(&table->lock);
-	xt_table_entry_swap_rcu(private, info);
-	synchronize_net();	/* Wait until smoke has cleared */
+		return ERR_PTR(-ENOMEM);
 
-	get_counters(info, counters);
-	put_counters(private, counters);
-	mutex_unlock(&table->lock);
-
-	xt_free_table_info(info);
+	get_counters(private, counters);
 
 	return counters;
-
- free_counters:
-	vfree(counters);
- nomem:
-	return ERR_PTR(-ENOMEM);
 }
 
 static int copy_entries_to_user(unsigned int total_size,
@@ -1094,8 +1027,9 @@  static int __do_replace(struct net *net, const char *name,
 	    (newinfo->number <= oldinfo->initial_entries))
 		module_put(t->me);
 
-	/* Get the old counters. */
+	/* Get the old counters, and synchronize with replace */
 	get_counters(oldinfo, counters);
+
 	/* Decrease module usage counts and free resource */
 	loc_cpu_old_entry = oldinfo->entries[raw_smp_processor_id()];
 	ARPT_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,
@@ -1165,10 +1099,23 @@  static int do_replace(struct net *net, void __user *user, unsigned int len)
 	return ret;
 }
 
+/* We're lazy, and add to the first CPU; overflow works its fey magic
+ * and everything is OK. */
+static int
+add_counter_to_entry(struct arpt_entry *e,
+		     const struct xt_counters addme[],
+		     unsigned int *i)
+{
+	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
+
+	(*i)++;
+	return 0;
+}
+
 static int do_add_counters(struct net *net, void __user *user, unsigned int len,
 			   int compat)
 {
-	unsigned int i;
+	unsigned int i, curcpu;
 	struct xt_counters_info tmp;
 	struct xt_counters *paddc;
 	unsigned int num_counters;
@@ -1224,26 +1171,26 @@  static int do_add_counters(struct net *net, void __user *user, unsigned int len,
 		goto free;
 	}
 
-	mutex_lock(&t->lock);
+	local_bh_disable();
 	private = t->private;
 	if (private->number != num_counters) {
 		ret = -EINVAL;
 		goto unlock_up_free;
 	}
 
-	preempt_disable();
 	i = 0;
 	/* Choose the copy that is on our node */
-	loc_cpu_entry = private->entries[smp_processor_id()];
+	curcpu = smp_processor_id();
+	loc_cpu_entry = private->entries[curcpu];
+	xt_info_wrlock(curcpu);
 	ARPT_ENTRY_ITERATE(loc_cpu_entry,
 			   private->size,
 			   add_counter_to_entry,
 			   paddc,
 			   &i);
-	preempt_enable();
+	xt_info_wrunlock(curcpu);
  unlock_up_free:
-	mutex_unlock(&t->lock);
-
+	local_bh_enable();
 	xt_table_unlock(t);
 	module_put(t->me);
  free:
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 810c0b6..2ec8d72 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -338,10 +338,9 @@  ipt_do_table(struct sk_buff *skb,
 	tgpar.hooknum = hook;
 
 	IP_NF_ASSERT(table->valid_hooks & (1 << hook));
-
-	rcu_read_lock_bh();
-	private = rcu_dereference(table->private);
-	table_base = rcu_dereference(private->entries[smp_processor_id()]);
+	xt_info_rdlock_bh();
+	private = table->private;
+	table_base = private->entries[smp_processor_id()];
 
 	e = get_entry(table_base, private->hook_entry[hook]);
 
@@ -436,8 +435,7 @@  ipt_do_table(struct sk_buff *skb,
 			e = (void *)e + e->next_offset;
 		}
 	} while (!hotdrop);
-
-	rcu_read_unlock_bh();
+	xt_info_rdunlock_bh();
 
 #ifdef DEBUG_ALLOW_ALL
 	return NF_ACCEPT;
@@ -896,10 +894,13 @@  get_counters(const struct xt_table_info *t,
 
 	/* Instead of clearing (by a previous call to memset())
 	 * the counters and using adds, we set the counters
-	 * with data used by 'current' CPU
-	 * We dont care about preemption here.
+	 * with data used by 'current' CPU.
+	 *
+	 * Bottom half has to be disabled to prevent deadlock
+	 * if a new softirq were to run and call ipt_do_table
 	 */
-	curcpu = raw_smp_processor_id();
+	local_bh_disable();
+	curcpu = smp_processor_id();
 
 	i = 0;
 	IPT_ENTRY_ITERATE(t->entries[curcpu],
@@ -912,74 +913,22 @@  get_counters(const struct xt_table_info *t,
 		if (cpu == curcpu)
 			continue;
 		i = 0;
+		xt_info_wrlock(cpu);
 		IPT_ENTRY_ITERATE(t->entries[cpu],
 				  t->size,
 				  add_entry_to_counter,
 				  counters,
 				  &i);
+		xt_info_wrunlock(cpu);
 	}
-
-}
-
-/* We're lazy, and add to the first CPU; overflow works its fey magic
- * and everything is OK. */
-static int
-add_counter_to_entry(struct ipt_entry *e,
-		     const struct xt_counters addme[],
-		     unsigned int *i)
-{
-	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
-
-	(*i)++;
-	return 0;
-}
-
-/* Take values from counters and add them back onto the current cpu */
-static void put_counters(struct xt_table_info *t,
-			 const struct xt_counters counters[])
-{
-	unsigned int i, cpu;
-
-	local_bh_disable();
-	cpu = smp_processor_id();
-	i = 0;
-	IPT_ENTRY_ITERATE(t->entries[cpu],
-			  t->size,
-			  add_counter_to_entry,
-			  counters,
-			  &i);
 	local_bh_enable();
 }
 
-
-static inline int
-zero_entry_counter(struct ipt_entry *e, void *arg)
-{
-	e->counters.bcnt = 0;
-	e->counters.pcnt = 0;
-	return 0;
-}
-
-static void
-clone_counters(struct xt_table_info *newinfo, const struct xt_table_info *info)
-{
-	unsigned int cpu;
-	const void *loc_cpu_entry = info->entries[raw_smp_processor_id()];
-
-	memcpy(newinfo, info, offsetof(struct xt_table_info, entries));
-	for_each_possible_cpu(cpu) {
-		memcpy(newinfo->entries[cpu], loc_cpu_entry, info->size);
-		IPT_ENTRY_ITERATE(newinfo->entries[cpu], newinfo->size,
-				  zero_entry_counter, NULL);
-	}
-}
-
 static struct xt_counters * alloc_counters(struct xt_table *table)
 {
 	unsigned int countersize;
 	struct xt_counters *counters;
 	struct xt_table_info *private = table->private;
-	struct xt_table_info *info;
 
 	/* We need atomic snapshot of counters: rest doesn't change
 	   (other than comefrom, which userspace doesn't care
@@ -988,30 +937,11 @@  static struct xt_counters * alloc_counters(struct xt_table *table)
 	counters = vmalloc_node(countersize, numa_node_id());
 
 	if (counters == NULL)
-		goto nomem;
+		return ERR_PTR(-ENOMEM);
 
-	info = xt_alloc_table_info(private->size);
-	if (!info)
-		goto free_counters;
-
-	clone_counters(info, private);
-
-	mutex_lock(&table->lock);
-	xt_table_entry_swap_rcu(private, info);
-	synchronize_net();	/* Wait until smoke has cleared */
-
-	get_counters(info, counters);
-	put_counters(private, counters);
-	mutex_unlock(&table->lock);
-
-	xt_free_table_info(info);
+	get_counters(private, counters);
 
 	return counters;
-
- free_counters:
-	vfree(counters);
- nomem:
-	return ERR_PTR(-ENOMEM);
 }
 
 static int
@@ -1306,8 +1236,9 @@  __do_replace(struct net *net, const char *name, unsigned int valid_hooks,
 	    (newinfo->number <= oldinfo->initial_entries))
 		module_put(t->me);
 
-	/* Get the old counters. */
+	/* Get the old counters, and synchronize with replace */
 	get_counters(oldinfo, counters);
+
 	/* Decrease module usage counts and free resource */
 	loc_cpu_old_entry = oldinfo->entries[raw_smp_processor_id()];
 	IPT_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,
@@ -1377,11 +1308,23 @@  do_replace(struct net *net, void __user *user, unsigned int len)
 	return ret;
 }
 
+/* We're lazy, and add to the first CPU; overflow works its fey magic
+ * and everything is OK. */
+static int
+add_counter_to_entry(struct ipt_entry *e,
+		     const struct xt_counters addme[],
+		     unsigned int *i)
+{
+	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
+
+	(*i)++;
+	return 0;
+}
 
 static int
 do_add_counters(struct net *net, void __user *user, unsigned int len, int compat)
 {
-	unsigned int i;
+	unsigned int i, curcpu;
 	struct xt_counters_info tmp;
 	struct xt_counters *paddc;
 	unsigned int num_counters;
@@ -1437,25 +1380,26 @@  do_add_counters(struct net *net, void __user *user, unsigned int len, int compat
 		goto free;
 	}
 
-	mutex_lock(&t->lock);
+	local_bh_disable();
 	private = t->private;
 	if (private->number != num_counters) {
 		ret = -EINVAL;
 		goto unlock_up_free;
 	}
 
-	preempt_disable();
 	i = 0;
 	/* Choose the copy that is on our node */
-	loc_cpu_entry = private->entries[raw_smp_processor_id()];
+	curcpu = smp_processor_id();
+	loc_cpu_entry = private->entries[curcpu];
+	xt_info_wrlock(curcpu);
 	IPT_ENTRY_ITERATE(loc_cpu_entry,
 			  private->size,
 			  add_counter_to_entry,
 			  paddc,
 			  &i);
-	preempt_enable();
+	xt_info_wrunlock(curcpu);
  unlock_up_free:
-	mutex_unlock(&t->lock);
+	local_bh_enable();
 	xt_table_unlock(t);
 	module_put(t->me);
  free:
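
Since the get_counters() change is spread over two hunks above, here is what the ip_tables.c function is assumed to look like once they are applied (the callback for the first iterate, presumably set_entry_to_counter, is existing code elided between the hunks); arp_tables.c and ip6_tables.c end up with the same shape.

/* Sketch of the resulting get_counters().  The local CPU is snapshotted
 * without xt_info_wrlock(): softirq packet processing is kept off this
 * CPU by local_bh_disable(), and other CPUs only ever write their own
 * table copy. */
static void
get_counters(const struct xt_table_info *t, struct xt_counters counters[])
{
	unsigned int cpu, curcpu, i;

	local_bh_disable();
	curcpu = smp_processor_id();

	i = 0;
	IPT_ENTRY_ITERATE(t->entries[curcpu], t->size,
			  set_entry_to_counter, counters, &i);

	for_each_possible_cpu(cpu) {
		if (cpu == curcpu)
			continue;
		i = 0;
		xt_info_wrlock(cpu);	/* exclude ipt_do_table() and counter
					 * updates running on that CPU */
		IPT_ENTRY_ITERATE(t->entries[cpu], t->size,
				  add_entry_to_counter, counters, &i);
		xt_info_wrunlock(cpu);
	}
	local_bh_enable();
}
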
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 800ae85..219e165 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -365,9 +365,9 @@  ip6t_do_table(struct sk_buff *skb,
 
 	IP_NF_ASSERT(table->valid_hooks & (1 << hook));
 
-	rcu_read_lock_bh();
-	private = rcu_dereference(table->private);
-	table_base = rcu_dereference(private->entries[smp_processor_id()]);
+	xt_info_rdlock_bh();
+	private = table->private;
+	table_base = private->entries[smp_processor_id()];
 
 	e = get_entry(table_base, private->hook_entry[hook]);
 
@@ -466,7 +466,7 @@  ip6t_do_table(struct sk_buff *skb,
 #ifdef CONFIG_NETFILTER_DEBUG
 	((struct ip6t_entry *)table_base)->comefrom = NETFILTER_LINK_POISON;
 #endif
-	rcu_read_unlock_bh();
+	xt_info_rdunlock_bh();
 
 #ifdef DEBUG_ALLOW_ALL
 	return NF_ACCEPT;
@@ -926,9 +926,12 @@  get_counters(const struct xt_table_info *t,
 	/* Instead of clearing (by a previous call to memset())
 	 * the counters and using adds, we set the counters
 	 * with data used by 'current' CPU
-	 * We dont care about preemption here.
+	 *
+	 * Bottom half has to be disabled to prevent deadlock
+	 * if a new softirq were to run and call ip6t_do_table
 	 */
-	curcpu = raw_smp_processor_id();
+	local_bh_disable();
+	curcpu = smp_processor_id();
 
 	i = 0;
 	IP6T_ENTRY_ITERATE(t->entries[curcpu],
@@ -941,72 +944,22 @@  get_counters(const struct xt_table_info *t,
 		if (cpu == curcpu)
 			continue;
 		i = 0;
+		xt_info_wrlock(cpu);
 		IP6T_ENTRY_ITERATE(t->entries[cpu],
 				  t->size,
 				  add_entry_to_counter,
 				  counters,
 				  &i);
+		xt_info_wrunlock(cpu);
 	}
-}
-
-/* We're lazy, and add to the first CPU; overflow works its fey magic
- * and everything is OK. */
-static int
-add_counter_to_entry(struct ip6t_entry *e,
-		     const struct xt_counters addme[],
-		     unsigned int *i)
-{
-	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
-
-	(*i)++;
-	return 0;
-}
-
-/* Take values from counters and add them back onto the current cpu */
-static void put_counters(struct xt_table_info *t,
-			 const struct xt_counters counters[])
-{
-	unsigned int i, cpu;
-
-	local_bh_disable();
-	cpu = smp_processor_id();
-	i = 0;
-	IP6T_ENTRY_ITERATE(t->entries[cpu],
-			   t->size,
-			   add_counter_to_entry,
-			   counters,
-			   &i);
 	local_bh_enable();
 }
 
-static inline int
-zero_entry_counter(struct ip6t_entry *e, void *arg)
-{
-	e->counters.bcnt = 0;
-	e->counters.pcnt = 0;
-	return 0;
-}
-
-static void
-clone_counters(struct xt_table_info *newinfo, const struct xt_table_info *info)
-{
-	unsigned int cpu;
-	const void *loc_cpu_entry = info->entries[raw_smp_processor_id()];
-
-	memcpy(newinfo, info, offsetof(struct xt_table_info, entries));
-	for_each_possible_cpu(cpu) {
-		memcpy(newinfo->entries[cpu], loc_cpu_entry, info->size);
-		IP6T_ENTRY_ITERATE(newinfo->entries[cpu], newinfo->size,
-				   zero_entry_counter, NULL);
-	}
-}
-
 static struct xt_counters *alloc_counters(struct xt_table *table)
 {
 	unsigned int countersize;
 	struct xt_counters *counters;
 	struct xt_table_info *private = table->private;
-	struct xt_table_info *info;
 
 	/* We need atomic snapshot of counters: rest doesn't change
 	   (other than comefrom, which userspace doesn't care
@@ -1015,30 +968,11 @@  static struct xt_counters *alloc_counters(struct xt_table *table)
 	counters = vmalloc_node(countersize, numa_node_id());
 
 	if (counters == NULL)
-		goto nomem;
+		return ERR_PTR(-ENOMEM);
 
-	info = xt_alloc_table_info(private->size);
-	if (!info)
-		goto free_counters;
-
-	clone_counters(info, private);
-
-	mutex_lock(&table->lock);
-	xt_table_entry_swap_rcu(private, info);
-	synchronize_net();	/* Wait until smoke has cleared */
-
-	get_counters(info, counters);
-	put_counters(private, counters);
-	mutex_unlock(&table->lock);
-
-	xt_free_table_info(info);
+	get_counters(private, counters);
 
 	return counters;
-
- free_counters:
-	vfree(counters);
- nomem:
-	return ERR_PTR(-ENOMEM);
 }
 
 static int
@@ -1334,8 +1268,9 @@  __do_replace(struct net *net, const char *name, unsigned int valid_hooks,
 	    (newinfo->number <= oldinfo->initial_entries))
 		module_put(t->me);
 
-	/* Get the old counters. */
+	/* Get the old counters, and synchronize with replace */
 	get_counters(oldinfo, counters);
+
 	/* Decrease module usage counts and free resource */
 	loc_cpu_old_entry = oldinfo->entries[raw_smp_processor_id()];
 	IP6T_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,
@@ -1405,11 +1340,24 @@  do_replace(struct net *net, void __user *user, unsigned int len)
 	return ret;
 }
 
+/* We're lazy, and add to the first CPU; overflow works its fey magic
+ * and everything is OK. */
+static int
+add_counter_to_entry(struct ip6t_entry *e,
+		     const struct xt_counters addme[],
+		     unsigned int *i)
+{
+	ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
+
+	(*i)++;
+	return 0;
+}
+
 static int
 do_add_counters(struct net *net, void __user *user, unsigned int len,
 		int compat)
 {
-	unsigned int i;
+	unsigned int i, curcpu;
 	struct xt_counters_info tmp;
 	struct xt_counters *paddc;
 	unsigned int num_counters;
@@ -1465,25 +1413,28 @@  do_add_counters(struct net *net, void __user *user, unsigned int len,
 		goto free;
 	}
 
-	mutex_lock(&t->lock);
+
+	local_bh_disable();
 	private = t->private;
 	if (private->number != num_counters) {
 		ret = -EINVAL;
 		goto unlock_up_free;
 	}
 
-	preempt_disable();
 	i = 0;
 	/* Choose the copy that is on our node */
-	loc_cpu_entry = private->entries[raw_smp_processor_id()];
+	curcpu = smp_processor_id();
+	xt_info_wrlock(curcpu);
+	loc_cpu_entry = private->entries[curcpu];
 	IP6T_ENTRY_ITERATE(loc_cpu_entry,
 			  private->size,
 			  add_counter_to_entry,
 			  paddc,
 			  &i);
-	preempt_enable();
+	xt_info_wrunlock(curcpu);
+
  unlock_up_free:
-	mutex_unlock(&t->lock);
+	local_bh_enable();
 	xt_table_unlock(t);
 	module_put(t->me);
  free:
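
The x_tables.c hunk below is where the last of the RCU machinery goes away, so it is worth spelling out the ordering the three protocol files now depend on. A sketch of the __do_replace() path after this patch, with error unwinding and module accounting trimmed:

	/* Publish the new table; done under local_bh_disable() inside
	 * xt_replace_table(), see below. */
	oldinfo = xt_replace_table(t, num_counters, newinfo, &ret);
	if (!oldinfo)
		goto out;	/* error unwinding trimmed in this sketch */

	/* Reading the old counters doubles as the synchronization point:
	 * taking each other CPU's write lock in turn waits out any packet
	 * walk that could still be using the old entries (this is the
	 * "resynchronization" the xt_replace_table() comment below refers
	 * to). */
	get_counters(oldinfo, counters);

	/* ... then IPT_ENTRY_ITERATE(..., cleanup_entry, ...) and
	 * xt_free_table_info(oldinfo), unchanged from the existing
	 * __do_replace(). */
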
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 509a956..5807a4d 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -625,20 +625,6 @@  void xt_free_table_info(struct xt_table_info *info)
 }
 EXPORT_SYMBOL(xt_free_table_info);
 
-void xt_table_entry_swap_rcu(struct xt_table_info *oldinfo,
-			     struct xt_table_info *newinfo)
-{
-	unsigned int cpu;
-
-	for_each_possible_cpu(cpu) {
-		void *p = oldinfo->entries[cpu];
-		rcu_assign_pointer(oldinfo->entries[cpu], newinfo->entries[cpu]);
-		newinfo->entries[cpu] = p;
-	}
-
-}
-EXPORT_SYMBOL_GPL(xt_table_entry_swap_rcu);
-
 /* Find table by name, grabs mutex & ref.  Returns ERR_PTR() on error. */
 struct xt_table *xt_find_table_lock(struct net *net, u_int8_t af,
 				    const char *name)
@@ -676,32 +662,43 @@  void xt_compat_unlock(u_int8_t af)
 EXPORT_SYMBOL_GPL(xt_compat_unlock);
 #endif
 
+DEFINE_PER_CPU(rwlock_t, xt_info_locks);
+EXPORT_PER_CPU_SYMBOL_GPL(xt_info_locks);
+
+
 struct xt_table_info *
 xt_replace_table(struct xt_table *table,
 	      unsigned int num_counters,
 	      struct xt_table_info *newinfo,
 	      int *error)
 {
-	struct xt_table_info *oldinfo, *private;
+	struct xt_table_info *private;
 
 	/* Do the substitution. */
-	mutex_lock(&table->lock);
+	local_bh_disable();
 	private = table->private;
+
 	/* Check inside lock: is the old number correct? */
 	if (num_counters != private->number) {
 		duprintf("num_counters != table->private->number (%u/%u)\n",
 			 num_counters, private->number);
-		mutex_unlock(&table->lock);
+		local_bh_enable();
 		*error = -EAGAIN;
 		return NULL;
 	}
-	oldinfo = private;
-	rcu_assign_pointer(table->private, newinfo);
-	newinfo->initial_entries = oldinfo->initial_entries;
-	mutex_unlock(&table->lock);
 
-	synchronize_net();
-	return oldinfo;
+	table->private = newinfo;
+	newinfo->initial_entries = private->initial_entries;
+
+	/*
+	 * Even though table entries have now been swapped, other CPUs
+	 * may still be using the old entries. This is okay, because
+	 * resynchronization is provided by the locking done during
+	 * the get_counters() routine.
+	 */
+	local_bh_enable();
+
+	return private;
 }
 EXPORT_SYMBOL_GPL(xt_replace_table);
 
@@ -734,7 +731,6 @@  struct xt_table *xt_register_table(struct net *net, struct xt_table *table,
 
 	/* Simplifies replace_table code. */
 	table->private = bootstrap;
-	mutex_init(&table->lock);
 
 	if (!xt_replace_table(table, 0, newinfo, &ret))
 		goto unlock;
@@ -1147,7 +1143,11 @@  static struct pernet_operations xt_net_ops = {
 
 static int __init xt_init(void)
 {
-	int i, rv;
+	unsigned int i;
+	int rv;
+
+	for_each_possible_cpu(i)
+		rwlock_init(&per_cpu(xt_info_locks, i));
 
 	xt = kmalloc(sizeof(struct xt_af) * NFPROTO_NUMPROTO, GFP_KERNEL);
 	if (!xt)