[v4,for-next,05/12] IB/cm: Share listening CM IDs

Message ID 1431841868-28063-6-git-send-email-haggaie@mellanox.com
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Haggai Eran May 17, 2015, 5:51 a.m. UTC
Enabling network namespaces for RDMA CM will allow processes in different
namespaces to listen on the same port. In order to leave namespace support
out of the CM layer, this requires that multiple RDMA CM IDs be able to
share a single ib_cm_id.

This patch adds infrastructure to retrieve an existing listening ib_cm_id,
based on its device and service ID, or create a new one if one does not
already exist. It also adds a reference count for such instances
(cm_id_private.sharecount), and prevents cm_destroy_id from destroying a CM
ID if it is still shared. See the relevant discussion [1].

[1] Re: [PATCH v3 for-next 05/13] IB/cm: Reference count ib_cm_ids
    http://www.spinics.net/lists/netdev/msg328860.html

Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
---
 drivers/infiniband/core/cm.c | 121 +++++++++++++++++++++++++++++++++++++++++--
 include/rdma/ib_cm.h         |   4 ++
 2 files changed, 120 insertions(+), 5 deletions(-)
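As background for the review that follows, the lookup-or-create behaviour the commit message describes can be modeled in user space. This is a hedged sketch only: a pthread mutex and a linked list stand in for cm.lock and the listen service rb-tree, and every name below is illustrative, not the patch's actual API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

struct listener {
	unsigned long long service_id;
	int sharecount;              /* rdma_cm_ids sharing this listener */
	struct listener *next;
};

struct listener *listeners;      /* stands in for cm.listen_service_table */

/* Return an existing listener for service_id, or create one, so that
 * e.g. the same port in two namespaces shares a single entry. */
struct listener *get_or_create_listener(unsigned long long service_id)
{
	struct listener *l;

	pthread_mutex_lock(&table_lock);
	for (l = listeners; l; l = l->next) {
		if (l->service_id == service_id) {
			l->sharecount++;     /* share the existing listener */
			pthread_mutex_unlock(&table_lock);
			return l;
		}
	}
	l = calloc(1, sizeof(*l));   /* error handling elided in this sketch */
	l->service_id = service_id;
	l->sharecount = 1;
	l->next = listeners;
	listeners = l;
	pthread_mutex_unlock(&table_lock);
	return l;
}

/* Two lookups of the same id share; a different id gets its own entry. */
int demo_sharing(void)
{
	struct listener *a = get_or_create_listener(0x42);
	struct listener *b = get_or_create_listener(0x42);
	struct listener *c = get_or_create_listener(0x07);

	return a == b && a->sharecount == 2 && c != a && c->sharecount == 1;
}
```

The review below is largely about how the real patch's counting around this lookup interacts with destruction.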

Comments

Jason Gunthorpe May 19, 2015, 6:35 p.m. UTC | #1
On Sun, May 17, 2015 at 08:51:01AM +0300, Haggai Eran wrote:
> @@ -212,6 +212,8 @@ struct cm_id_private {
>  	spinlock_t lock;	/* Do not acquire inside cm.lock */
>  	struct completion comp;
>  	atomic_t refcount;
> +	/* Number of clients sharing this ib_cm_id. Only valid for listeners. */
> +	atomic_t sharecount;

No need for this atomic, hold the lock

The use of the atomic looks racy:

> +	if (!atomic_dec_and_test(&cm_id_priv->sharecount)) {
> +		/* The id is still shared. */
> +		return;
> +	}

Might race with this:

> +		if (atomic_inc_return(&cm_id_priv->sharecount) == 1) {
> +			/* This ID is already being destroyed */
> +			atomic_dec(&cm_id_priv->sharecount);
> +			goto new_id;
> +		}
> +

Resulting in use-after-free of cm_id_priv->sharecount

Don't try and be clever with atomics, it is almost always wrong.

The share count should be 'listen_sharecount' because it *only* works
for listen.

The above test in cm_destroy_id should only be in the listen branch of
the if.
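The bad interleaving can be walked through in a single-threaded user-space model. Plain ints stand in for atomic_t, and every name here is illustrative, not the patch's code; the model just replays the steps in the racy order.

```c
#include <assert.h>

/* share_count models cm_id_priv->sharecount; id_freed models whether
 * cm_destroy_id has finished tearing the id down. */
int share_count = 1;   /* one listener currently holds the id */
int id_freed;

/* cm_destroy_id side: atomic_dec_and_test() */
int destroy_dec(void)
{
	return --share_count == 0;   /* 1 -> 0: destroyer commits to freeing */
}

/* lookup side: atomic_inc_return() racing with the above */
int lookup_inc(void)
{
	return ++share_count;        /* 0 -> 1: misread as "being destroyed" */
}

void lookup_undo(void)
{
	--share_count;               /* lookup backs off, touching it again */
}

void destroy_free(void)
{
	id_freed = 1;                /* after this, share_count is freed memory */
}

/* The interleaving dec(1->0), inc(0->1), dec(1->0), free(): the lookup
 * path manipulated share_count while the destroyer was already committed
 * to freeing it -- with real memory that is a use-after-free window. */
int race_demo(void)
{
	int destroying = destroy_dec();   /* destroyer: last share dropped */

	if (lookup_inc() == 1)            /* lookup: sees "being destroyed" */
		lookup_undo();
	destroy_free();
	return destroying && id_freed && share_count == 0;
}
```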

> + * Create a new listening ib_cm_id and listen on the given service ID.
> + *
> + * If there's an existing ID listening on that same device and service ID,
> + * return it.
> + *

.. Callers should call cm_destroy_id when done with the listen ..

Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe May 19, 2015, 10:35 p.m. UTC | #2
On Tue, May 19, 2015 at 12:35:45PM -0600, Jason Gunthorpe wrote:
> On Sun, May 17, 2015 at 08:51:01AM +0300, Haggai Eran wrote:
> > @@ -212,6 +212,8 @@ struct cm_id_private {
> >  	spinlock_t lock;	/* Do not acquire inside cm.lock */
> >  	struct completion comp;
> >  	atomic_t refcount;
> > +	/* Number of clients sharing this ib_cm_id. Only valid for listeners. */
> > +	atomic_t sharecount;
> 
> No need for this atomic, hold the lock
> 
> The use of the atomic looks racy:
> 
> > +	if (!atomic_dec_and_test(&cm_id_priv->sharecount)) {
> > +		/* The id is still shared. */
> > +		return;
> > +	}
> 
> Might race with this:
> 
> > +		if (atomic_inc_return(&cm_id_priv->sharecount) == 1) {
> > +			/* This ID is already being destroyed */
> > +			atomic_dec(&cm_id_priv->sharecount);
> > +			goto new_id;
> > +		}
> > +
> 
> Resulting in use-after-free of cm_id_priv->sharecount

Actually, there is something else odd here.. I mentioned the above
because there wasn't obvious ref'ing on the cm_id_priv. Looking closer
the cm.lock should prevent use-after-free, but there is still no ref.

The more I look at this, the more I think it is sketchy. Don't try and
merge sharecount and refcount together, after cm_find_listen is called
you have to increment the refcount before dropping cm.lock.

Decrement the refcount when destroying a shared listen.

I also don't see how the 'goto new_id' can work: if cm_find_listen
succeeds then __ib_cm_listen is guaranteed to fail.

Fix the locking to make that impossible, associate sharecount with the
cm.lock, and rework how cm_destroy_id grabs the cm_id_priv->lock spinlock:

	case IB_CM_LISTEN:
		spin_lock_irq(&cm.lock);
		if (cm_id_priv->sharecount != 0) {
		     cm_id_priv->sharecount--;
		     // paired with the inc in ib_cm_id_create_and_listen
		     atomic_dec(&cm_id_priv->refcount);
		     spin_unlock_irq(&cm.lock);
		     return;
		}
		rb_erase(&cm_id_priv->service_node, &cm.listen_service_table);
		spin_unlock_irq(&cm.lock);
	
		spin_lock_irq(&cm_id_priv->lock);
		cm_id->state = IB_CM_IDLE;
		spin_unlock_irq(&cm_id_priv->lock);
		break;

Now that condition is eliminated, the unneeded atomic is gone, and
refcount still acts like a proper kref should.
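This scheme can be sanity-checked with a small user-space model. A pthread mutex stands in for cm.lock, and the model_* names are mine, not kernel API; the point is only that every touch of the counts happens under the one lock.

```c
#include <assert.h>
#include <pthread.h>

pthread_mutex_t cm_lock = PTHREAD_MUTEX_INITIALIZER;  /* models cm.lock */

struct listen_id {
	int listen_sharecount;   /* sharers beyond the original listener */
	int refcount;            /* kref-style: number of pointer holders */
	int in_table;            /* models membership in the rb-tree */
};

/* Lookup: find under the lock and take both counts before unlocking. */
int model_share(struct listen_id *id)
{
	pthread_mutex_lock(&cm_lock);
	if (!id->in_table) {
		pthread_mutex_unlock(&cm_lock);
		return -1;       /* not listening: caller creates a new id */
	}
	id->listen_sharecount++;
	id->refcount++;          /* paired with the put in model_destroy() */
	pthread_mutex_unlock(&cm_lock);
	return 0;
}

/* The IB_CM_LISTEN branch of cm_destroy_id, per the sketch above. */
int model_destroy(struct listen_id *id)
{
	pthread_mutex_lock(&cm_lock);
	if (id->listen_sharecount != 0) {
		id->listen_sharecount--;
		id->refcount--;  /* drop the ref taken in model_share() */
		pthread_mutex_unlock(&cm_lock);
		return 0;        /* still shared: nothing is torn down */
	}
	id->in_table = 0;        /* rb_erase() in the real code */
	pthread_mutex_unlock(&cm_lock);
	return 1;                /* last user: proceed with teardown */
}

/* One lifetime walked end-to-end: create, share once, destroy twice. */
int model_lifetime_ok(void)
{
	struct listen_id id = { 0, 1, 1 };

	return model_share(&id) == 0 &&
	       model_destroy(&id) == 0 &&   /* shared: id survives */
	       model_destroy(&id) == 1 &&   /* last put: torn down  */
	       id.in_table == 0;
}
```

Because both paths serialize on the same lock, the dec/inc interleaving from the earlier mail cannot happen in this model.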

Jason
Haggai Eran May 21, 2015, 7:07 a.m. UTC | #3
On 19/05/2015 21:35, Jason Gunthorpe wrote:
...

> The share count should be 'listen_sharecount' because it *only* works
> for listen.
> 
> The above test in cm_destroy_id should only be in the listen branch of
> the if.

Okay.

> 
>> + * Create a new listening ib_cm_id and listen on the given service ID.
>> + *
>> + * If there's an existing ID listening on that same device and service ID,
>> + * return it.
>> + *
> 
> .. Callers should call cm_destroy_id when done with the listen ..

I'll add that (with ib_destroy_cm_id of course).

Haggai Eran May 21, 2015, 8:08 a.m. UTC | #4
On 20/05/2015 01:35, Jason Gunthorpe wrote:
> On Tue, May 19, 2015 at 12:35:45PM -0600, Jason Gunthorpe wrote:
>> On Sun, May 17, 2015 at 08:51:01AM +0300, Haggai Eran wrote:
>>> @@ -212,6 +212,8 @@ struct cm_id_private {
>>>  	spinlock_t lock;	/* Do not acquire inside cm.lock */
>>>  	struct completion comp;
>>>  	atomic_t refcount;
>>> +	/* Number of clients sharing this ib_cm_id. Only valid for listeners. */
>>> +	atomic_t sharecount;
>>
>> No need for this atomic, hold the lock
>>
>> The use of the atomic looks racy:
>>
>>> +	if (!atomic_dec_and_test(&cm_id_priv->sharecount)) {
>>> +		/* The id is still shared. */
>>> +		return;
>>> +	}
>>
>> Might race with this:
>>
>>> +		if (atomic_inc_return(&cm_id_priv->sharecount) == 1) {
>>> +			/* This ID is already being destroyed */
>>> +			atomic_dec(&cm_id_priv->sharecount);
>>> +			goto new_id;
>>> +		}
>>> +
>>
>> Resulting in use-after-free of cm_id_priv->sharecount
> 
> Actually, there is something else odd here.. I mentioned the above
> because there wasn't obvious ref'ing on the cm_id_priv. Looking closer
> the cm.lock should prevent use-after-free, but there is still no ref.
> 
> The more I look at this, the more I think it is sketchy. Don't try and
> merge sharecount and refcount together, 
I'm not sure what you mean here. The way I was thinking about it was
that sharecount = num of rdma_cm_ids sharing this listener, while
refcount = num of active internal uses of this ID (work items, timers,
etc.) Do you want refcount to also include the sharecount?

> after cm_find_listen is called
> you have to increment the refcount before dropping cm.lock.
> 
> Decrement the refcount when destroying a shared listen.
You mean to decrement even if listen_sharecount > 0, and the id isn't
destroyed, right? The code already calls cm_deref_id when
listen_sharecount = 0 of course.

> I also don't see how the 'goto new_id' can work, if cm_find_listen
> succeeds then __ib_cm_listen is guaranteed to fail.
> 
> Fix the locking to make that impossible, associate sharecount with the
> cm.lock, and rework how cm_destroy_id grabs the cm_id_priv->lock spinlock:
> 
> 	case IB_CM_LISTEN:
> 		spin_lock_irq(&cm.lock);
> 		if (cm_id_priv->sharecount != 0) {
> 		     cm_id_priv->sharecount--;
> 		     // paired with the inc in ib_cm_id_create_and_listen
> 		     atomic_dec(&cm_id_priv->refcount);
> 		     spin_unlock_irq(&cm.lock);
> 		     return;
> 		}
> 		rb_erase(&cm_id_priv->service_node, &cm.listen_service_table);
> 		spin_unlock_irq(&cm.lock);
> 	
> 		spin_lock_irq(&cm_id_priv->lock);
> 		cm_id->state = IB_CM_IDLE;
> 		spin_unlock_irq(&cm_id_priv->lock);
> 		break;
> 
> Now that condition is eliminated, the unneeded atomic is gone, and
> refcount still acts like a proper kref should.
Thanks, that looks like a better solution.

Haggai
Jason Gunthorpe May 21, 2015, 5:54 p.m. UTC | #5
On Thu, May 21, 2015 at 11:08:31AM +0300, Haggai Eran wrote:
> > The more I look at this, the more I think it is sketchy. Don't try and
> > merge sharecount and refcount together,

> I'm not sure what you mean here. The way I was thinking about it was
> that sharecount = num of rdma_cm_ids sharing this listener, while
> refcount = num of active internal uses of this ID (work items, timers,
> etc.) Do you want refcount to also include the sharecount?

If you hold on to the pointer, then you increment the refcount. The
refcount is the 'number of holders of the pointer'. Basic invariant.

When the pointer leaves the locked region after the lookup, the ref must
be incremented.

What you had was functionally correct because the sharecount was
implying a refcount, but it doesn't follow the standard kernel
refcounting pattern.
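That invariant reads, in a minimal user-space kref-style sketch (the obj_* names are mine, not kernel API):

```c
#include <assert.h>
#include <stdlib.h>

struct obj {
	int refcount;            /* == number of holders of the pointer */
};

/* Taking another copy of the pointer: bump the count first. */
struct obj *obj_get(struct obj *o)
{
	o->refcount++;
	return o;
}

/* Giving a pointer back: the last holder frees the object. */
int obj_put(struct obj *o)
{
	if (--o->refcount == 0) {
		free(o);
		return 1;
	}
	return 0;
}

/* Two holders, two puts: only the second put frees. */
int obj_demo(void)
{
	struct obj *o = malloc(sizeof(*o));
	struct obj *copy;
	int freed_early, freed_last;

	o->refcount = 1;              /* creator is the first holder */
	copy = obj_get(o);            /* second holder               */
	freed_early = obj_put(copy);  /* 2 -> 1: still alive         */
	freed_last = obj_put(o);      /* 1 -> 0: freed here          */
	return !freed_early && freed_last;
}
```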

> > after cm_find_listen is called
> > you have to increment the refcount before dropping cm.lock.
> > 
> > Decrement the refcount when destroying a shared listen.
> You mean to decrement even if listen_sharecount > 0, and the id isn't
> destroyed, right? The code already calls cm_deref_id when
> listen_sharecount = 0 of course.

Yes, because cm_destroy_id is the ref'd pair to cm_listen; after it
returns the caller must give up their pointer.

Jason

Patch

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 14423c20c55b..4f936932dd54 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -212,6 +212,8 @@  struct cm_id_private {
 	spinlock_t lock;	/* Do not acquire inside cm.lock */
 	struct completion comp;
 	atomic_t refcount;
+	/* Number of clients sharing this ib_cm_id. Only valid for listeners. */
+	atomic_t sharecount;
 
 	struct ib_mad_send_buf *msg;
 	struct cm_timewait_info *timewait_info;
@@ -720,6 +722,7 @@  struct ib_cm_id *ib_create_cm_id(struct ib_device *device,
 	INIT_LIST_HEAD(&cm_id_priv->work_list);
 	atomic_set(&cm_id_priv->work_count, -1);
 	atomic_set(&cm_id_priv->refcount, 1);
+	atomic_set(&cm_id_priv->sharecount, 1);
 	return &cm_id_priv->id;
 
 error:
@@ -841,6 +844,12 @@  static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
 	struct cm_work *work;
 
 	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
+
+	if (!atomic_dec_and_test(&cm_id_priv->sharecount)) {
+		/* The id is still shared. */
+		return;
+	}
+
 retest:
 	spin_lock_irq(&cm_id_priv->lock);
 	switch (cm_id->state) {
@@ -927,11 +936,32 @@  void ib_destroy_cm_id(struct ib_cm_id *cm_id)
 }
 EXPORT_SYMBOL(ib_destroy_cm_id);
 
-int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
-		 struct ib_cm_compare_data *compare_data)
+/**
+ * __ib_cm_listen - Initiates listening on the specified service ID for
+ *   connection and service ID resolution requests.
+ * @cm_id: Connection identifier associated with the listen request.
+ * @service_id: Service identifier matched against incoming connection
+ *   and service ID resolution requests.  The service ID should be specified
+ *   in network-byte order.  If set to IB_CM_ASSIGN_SERVICE_ID, the CM will
+ *   assign a service ID to the caller.
+ * @service_mask: Mask applied to service ID used to listen across a
+ *   range of service IDs.  If set to 0, the service ID is matched
+ *   exactly.  This parameter is ignored if %service_id is set to
+ *   IB_CM_ASSIGN_SERVICE_ID.
+ * @compare_data: This parameter is optional.  It specifies data that must
+ *   appear in the private data of a connection request for the specified
+ *   listen request.
+ * @lock: If set, lock the cm.lock spin-lock when adding the id to the
+ *   listener tree. When false, the caller must already hold the spin-lock,
+ *   and compare_data must be NULL.
+ */
+static int __ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id,
+			  __be64 service_mask,
+			  struct ib_cm_compare_data *compare_data,
+			  bool lock)
 {
 	struct cm_id_private *cm_id_priv, *cur_cm_id_priv;
-	unsigned long flags;
+	unsigned long flags = 0;
 	int ret = 0;
 
 	service_mask = service_mask ? service_mask : ~cpu_to_be64(0);
@@ -957,7 +987,8 @@  int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
 
 	cm_id->state = IB_CM_LISTEN;
 
-	spin_lock_irqsave(&cm.lock, flags);
+	if (lock)
+		spin_lock_irqsave(&cm.lock, flags);
 	if (service_id == IB_CM_ASSIGN_SERVICE_ID) {
 		cm_id->service_id = cpu_to_be64(cm.listen_service_id++);
 		cm_id->service_mask = ~cpu_to_be64(0);
@@ -966,7 +997,8 @@  int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
 		cm_id->service_mask = service_mask;
 	}
 	cur_cm_id_priv = cm_insert_listen(cm_id_priv);
-	spin_unlock_irqrestore(&cm.lock, flags);
+	if (lock)
+		spin_unlock_irqrestore(&cm.lock, flags);
 
 	if (cur_cm_id_priv) {
 		cm_id->state = IB_CM_IDLE;
@@ -976,8 +1008,87 @@  int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
 	}
 	return ret;
 }
+
+int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
+		 struct ib_cm_compare_data *compare_data)
+{
+	return __ib_cm_listen(cm_id, service_id, service_mask, compare_data,
+			      true);
+}
 EXPORT_SYMBOL(ib_cm_listen);
 
+/**
+ * Create a new listening ib_cm_id and listen on the given service ID.
+ *
+ * If there's an existing ID listening on that same device and service ID,
+ * return it.
+ *
+ * @device: Device associated with the cm_id.  All related communication will
+ * be associated with the specified device.
+ * @cm_handler: Callback invoked to notify the user of CM events.
+ * @service_id: Service identifier matched against incoming connection
+ *   and service ID resolution requests.  The service ID should be specified
+ *   in network-byte order.  If set to IB_CM_ASSIGN_SERVICE_ID, the CM will
+ *   assign a service ID to the caller.
+ * @service_mask: Mask applied to service ID used to listen across a
+ *   range of service IDs.  If set to 0, the service ID is matched
+ *   exactly.  This parameter is ignored if %service_id is set to
+ *   IB_CM_ASSIGN_SERVICE_ID.
+ */
+struct ib_cm_id *ib_cm_id_create_and_listen(
+		struct ib_device *device,
+		ib_cm_handler cm_handler,
+		__be64 service_id,
+		__be64 service_mask)
+{
+	struct cm_id_private *cm_id_priv;
+	struct ib_cm_id *cm_id;
+	unsigned long flags;
+	int err = 0;
+
+	/* Create an ID in advance, since the creation may sleep */
+	cm_id = ib_create_cm_id(device, cm_handler, NULL);
+	if (IS_ERR(cm_id))
+		return cm_id;
+
+	spin_lock_irqsave(&cm.lock, flags);
+
+	if (service_id == IB_CM_ASSIGN_SERVICE_ID)
+		goto new_id;
+
+	/* Find an existing ID */
+	cm_id_priv = cm_find_listen(device, service_id, NULL);
+	if (cm_id_priv) {
+		if (atomic_inc_return(&cm_id_priv->sharecount) == 1) {
+			/* This ID is already being destroyed */
+			atomic_dec(&cm_id_priv->sharecount);
+			goto new_id;
+		}
+
+		spin_unlock_irqrestore(&cm.lock, flags);
+		ib_destroy_cm_id(cm_id);
+		cm_id = &cm_id_priv->id;
+		if (cm_id->cm_handler != cm_handler || cm_id->context)
+			/* Sharing an ib_cm_id with different handlers is not
+			 * supported */
+			return ERR_PTR(-EINVAL);
+		return cm_id;
+	}
+
+new_id:
+	/* Use newly created ID */
+	err = __ib_cm_listen(cm_id, service_id, service_mask, NULL, false);
+
+	spin_unlock_irqrestore(&cm.lock, flags);
+
+	if (err) {
+		ib_destroy_cm_id(cm_id);
+		return ERR_PTR(err);
+	}
+	return cm_id;
+}
+EXPORT_SYMBOL(ib_cm_id_create_and_listen);
+
 static __be64 cm_form_tid(struct cm_id_private *cm_id_priv,
 			  enum cm_msg_sequence msg_seq)
 {
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index 39ed2d2fbd51..59f534b5f091 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -361,6 +361,10 @@  struct ib_cm_compare_data {
 int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
 		 struct ib_cm_compare_data *compare_data);
 
+struct ib_cm_id *ib_cm_id_create_and_listen(
+		struct ib_device *device, ib_cm_handler cm_handler,
+		__be64 service_id, __be64 service_mask);
+
 struct ib_cm_req_param {
 	struct ib_sa_path_rec	*primary_path;
 	struct ib_sa_path_rec	*alternate_path;