Message ID | BN6PR21MB0465043C08E519774EE73E99C0090@BN6PR21MB0465.namprd21.prod.outlook.com
---|---
State | Accepted
Delegated to | David Miller
Series | [v2] hv_sock: Add support for delayed close
> From: Sunil Muthuswamy <sunilmut@microsoft.com>
> Sent: Tuesday, May 14, 2019 5:56 PM
> ...
> +static bool hvs_close_lock_held(struct vsock_sock *vsk)
> {
> ...
> +	schedule_delayed_work(&vsk->close_work, HVS_CLOSE_TIMEOUT);

Reviewed-by: Dexuan Cui <decui@microsoft.com>

The patch looks good to me. Thank you, Sunil!

Next, we're going to remove the "channel->rescind" check in
vmbus_hvsock_device_unregister() -- when doing that, IMO we need to
fix a potential race revealed by the schedule_delayed_work() in this
patch:

When hvs_close_timeout() finishes, the "sk" struct has been freed, but
vmbus_onoffer_rescind() -> channel->chn_rescind_callback(), i.e.
hvs_close_connection(), may still be running and referencing the "chan"
and "sk" structs, which should no longer be referenced once
hvs_close_timeout() finishes, i.e. "get_per_channel_state(chan)" is no
longer safe. The problem is that currently there is no sync mechanism
between vmbus_onoffer_rescind() and hvs_close_timeout().

The race is a real issue only after we remove the "channel->rescind"
check in vmbus_hvsock_device_unregister().

I guess we need to introduce a new single-threaded workqueue in the
vmbus driver, and offload both vmbus_onoffer_rescind() and
hvs_close_timeout() onto the new workqueue.

Thanks,
-- Dexuan
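Dexuan's proposal above -- a single-threaded workqueue so that vmbus_onoffer_rescind() and hvs_close_timeout() can never run concurrently -- is essentially an ordered work queue. In the kernel this would presumably be alloc_ordered_workqueue() plus queue_work()/queue_delayed_work(); the rough userspace sketch below (all names invented for illustration, not vmbus code) shows why a single worker draining a FIFO gives the needed mutual exclusion:

```c
#include <pthread.h>

/* Hypothetical userspace analogue of an ordered (single-threaded)
 * workqueue: one worker thread drains a FIFO, so no two work items
 * ever run concurrently -- the property wanted for
 * vmbus_onoffer_rescind() vs. hvs_close_timeout().
 */
#define MAX_WORK 16

typedef void (*work_fn)(void *arg);

struct work_item { work_fn fn; void *arg; };

static struct work_item queue[MAX_WORK];
static int q_head, q_tail, q_done;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;

static void queue_work_item(work_fn fn, void *arg)
{
	pthread_mutex_lock(&q_lock);
	queue[q_tail % MAX_WORK] = (struct work_item){ fn, arg };
	q_tail++;
	pthread_cond_signal(&q_cond);
	pthread_mutex_unlock(&q_lock);
}

static void *worker(void *unused)
{
	(void)unused;
	for (;;) {
		pthread_mutex_lock(&q_lock);
		while (q_head == q_tail && !q_done)
			pthread_cond_wait(&q_cond, &q_lock);
		if (q_head == q_tail && q_done) {
			pthread_mutex_unlock(&q_lock);
			return NULL;
		}
		struct work_item w = queue[q_head % MAX_WORK];
		q_head++;
		pthread_mutex_unlock(&q_lock);
		w.fn(w.arg);	/* items run strictly one at a time, in order */
	}
}

/* Two "work items" standing in for the rescind and close-timeout paths. */
static int run_order[2], run_idx;

static void fake_rescind(void *arg)       { (void)arg; run_order[run_idx++] = 1; }
static void fake_close_timeout(void *arg) { (void)arg; run_order[run_idx++] = 2; }

int ordered_queue_demo(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	queue_work_item(fake_rescind, NULL);
	queue_work_item(fake_close_timeout, NULL);

	pthread_mutex_lock(&q_lock);
	q_done = 1;
	pthread_cond_signal(&q_cond);
	pthread_mutex_unlock(&q_lock);
	pthread_join(t, NULL);

	/* Serialized in submission order: rescind before close-timeout */
	return run_order[0] == 1 && run_order[1] == 2 && run_idx == 2;
}
```

Because the worker holds only one item at a time, a rescind handler queued before a close-timeout handler is guaranteed to have returned before the latter starts, which is exactly the sync mechanism the current code lacks.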
> From: linux-hyperv-owner@vger.kernel.org
> <linux-hyperv-owner@vger.kernel.org> On Behalf Of Dexuan Cui
> Sent: Wednesday, May 15, 2019 9:34 PM
> ...

Hi Sunil,

To make it clear, your patch itself is good, and I was just talking
about the next change we're going to make. Once we make the next
change, IMO we need a further patch to schedule hvs_close_timeout()
on the new single-threaded workqueue rather than the global
"system_wq".

> Next, we're going to remove the "channel->rescind" check in
> vmbus_hvsock_device_unregister() -- when doing that, IMO we need to
> fix a potential race revealed by the schedule_delayed_work() in this
> patch:
>
> When hvs_close_timeout() finishes, the "sk" struct has been freed,
> but vmbus_onoffer_rescind() -> channel->chn_rescind_callback(), i.e.
> hvs_close_connection(), may still be running and referencing the
> "chan" and "sk" structs, which should no longer be referenced once
> hvs_close_timeout() finishes, i.e. "get_per_channel_state(chan)" is
> no longer safe. The problem is that currently there is no sync
> mechanism between vmbus_onoffer_rescind() and hvs_close_timeout().
>
> The race is a real issue only after we remove the "channel->rescind"
> check in vmbus_hvsock_device_unregister().

A correction: IMO the race is real even for the current code, i.e.
without your patch: in vmbus_onoffer_rescind(), between the point
where we set channel->rescind and the point where we call
channel->chn_rescind_callback(), the channel may have been freed by
vmbus_hvsock_device_unregister().

This race window is small, and I guess that's why we never noticed it.

> I guess we need to introduce a new single-threaded workqueue in the
> vmbus driver, and offload both vmbus_onoffer_rescind() and
> hvs_close_timeout() onto the new workqueue.

Thanks,
-- Dexuan
> -----Original Message-----
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Thursday, May 16, 2019 10:17 AM
> To: Sunil Muthuswamy <sunilmut@microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
> Stephen Hemminger <sthemmin@microsoft.com>; Sasha Levin
> <sashal@kernel.org>; David S. Miller <davem@davemloft.net>;
> Michael Kelley <mikelley@microsoft.com>
> Cc: netdev@vger.kernel.org; linux-hyperv@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Subject: RE: [PATCH v2] hv_sock: Add support for delayed close
>
> > From: linux-hyperv-owner@vger.kernel.org
> > <linux-hyperv-owner@vger.kernel.org> On Behalf Of Dexuan Cui
> > Sent: Wednesday, May 15, 2019 9:34 PM
> > ...
>
> Hi Sunil,
> To make it clear, your patch itself is good, and I was just talking
> about the next change we're going to make. Once we make the next
> change, IMO we need a further patch to schedule hvs_close_timeout()
> on the new single-threaded workqueue rather than the global
> "system_wq".

Thanks for your review. Can you add a 'signed-off' from your side to
the patch?

> > Next, we're going to remove the "channel->rescind" check in
> > vmbus_hvsock_device_unregister() -- when doing that, IMO we need to
> > fix a potential race revealed by the schedule_delayed_work() in
> > this patch:
> >
> > When hvs_close_timeout() finishes, the "sk" struct has been freed,
> > but vmbus_onoffer_rescind() -> channel->chn_rescind_callback(),
> > i.e. hvs_close_connection(), may still be running and referencing
> > the "chan" and "sk" structs, which should no longer be referenced
> > once hvs_close_timeout() finishes, i.e.
> > "get_per_channel_state(chan)" is no longer safe. The problem is
> > that currently there is no sync mechanism between
> > vmbus_onoffer_rescind() and hvs_close_timeout().
> >
> > The race is a real issue only after we remove the
> > "channel->rescind" check in vmbus_hvsock_device_unregister().
>
> A correction: IMO the race is real even for the current code, i.e.
> without your patch: in vmbus_onoffer_rescind(), between the point
> where we set channel->rescind and the point where we call
> channel->chn_rescind_callback(), the channel may have been freed by
> vmbus_hvsock_device_unregister().
>
> This race window is small, and I guess that's why we never noticed
> it.
>
> > I guess we need to introduce a new single-threaded workqueue in the
> > vmbus driver, and offload both vmbus_onoffer_rescind() and
> > hvs_close_timeout() onto the new workqueue.

Something is amiss if the guest has to wait for the host to close the
channel prior to cleaning it up from its side. That's a waste of
resources; it doesn't matter whether you do it in a system thread, a
dedicated pool, or anywhere else. I think the right way to deal with
this is to unregister the rescind callback routine, wait for any
running rescind callback routine to finish, and then drop the last
reference to the socket, which should lead to all the cleanup. I
understand that some of the facility for unregistering the rescind
callback might not exist today.

> Thanks,
> -- Dexuan
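The "unregister and wait" approach Sunil describes is a revoke-and-drain pattern: revoke the callback pointer so no new invocation can start, then block until any in-flight invocation returns. No such facility exists for chn_rescind_callback today, as the mail notes; the sketch below is a hypothetical userspace illustration of the idea (all names invented), not vmbus code:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical "revoke and drain" for a callback slot: after
 * cb_unregister() returns, no invocation is running and none can
 * start, so the caller may safely free resources the callback uses.
 */
struct cb_slot {
	void (*fn)(void *arg);
	int in_flight;
	pthread_mutex_t lock;
	pthread_cond_t drained;
};

static void cb_invoke(struct cb_slot *s, void *arg)
{
	void (*fn)(void *arg);

	pthread_mutex_lock(&s->lock);
	fn = s->fn;			/* NULL once unregistered */
	if (fn)
		s->in_flight++;
	pthread_mutex_unlock(&s->lock);

	if (!fn)
		return;
	fn(arg);			/* run outside the lock */

	pthread_mutex_lock(&s->lock);
	if (--s->in_flight == 0)
		pthread_cond_broadcast(&s->drained);
	pthread_mutex_unlock(&s->lock);
}

static void cb_unregister(struct cb_slot *s)
{
	pthread_mutex_lock(&s->lock);
	s->fn = NULL;			/* no new invocations */
	while (s->in_flight > 0)	/* drain running ones */
		pthread_cond_wait(&s->drained, &s->lock);
	pthread_mutex_unlock(&s->lock);
}

static int demo_calls;
static void demo_cb(void *arg) { (void)arg; demo_calls++; }

int revoke_drain_demo(void)
{
	struct cb_slot s = {
		.fn = demo_cb,
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.drained = PTHREAD_COND_INITIALIZER,
	};

	cb_invoke(&s, NULL);	/* runs: callback still registered */
	cb_unregister(&s);
	cb_invoke(&s, NULL);	/* skipped: already unregistered */
	return demo_calls;
}
```

As Dexuan replies below, retrofitting this safely onto the existing concurrent rescind path may be harder than serializing everything on one workqueue.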
> From: Sunil Muthuswamy <sunilmut@microsoft.com>
> Sent: Thursday, May 16, 2019 11:11 AM
>
> > Hi Sunil,
> > To make it clear, your patch itself is good, and I was just talking
> > about the next change we're going to make. Once we make the next
> > change, IMO we need a further patch to schedule hvs_close_timeout()
> > on the new single-threaded workqueue rather than the global
> > "system_wq".
>
> Thanks for your review. Can you add a 'signed-off' from your side to
> the patch?

I have provided my Reviewed-by; I guess this should be enough. Of
course, David makes the final call. It would be great if the
maintainers of the Hyper-V drivers listed in the "To:" could provide
their Signed-off-by.

> > > Next, we're going to remove the "channel->rescind" check in
> > > vmbus_hvsock_device_unregister() -- when doing that, IMO we need
> > > to fix a potential race revealed by the schedule_delayed_work()
> > > in this patch:
> > >
> > > When hvs_close_timeout() finishes, the "sk" struct has been
> > > freed, but vmbus_onoffer_rescind() ->
> > > channel->chn_rescind_callback(), i.e. hvs_close_connection(),
> > > may still be running and referencing the "chan" and "sk" structs,
> > > which should no longer be referenced once hvs_close_timeout()
> > > finishes, i.e. "get_per_channel_state(chan)" is no longer safe.
> > > The problem is that currently there is no sync mechanism between
> > > vmbus_onoffer_rescind() and hvs_close_timeout().
> > >
> > > The race is a real issue only after we remove the
> > > "channel->rescind" check in vmbus_hvsock_device_unregister().
> >
> > A correction: IMO the race is real even for the current code, i.e.
> > without your patch: in vmbus_onoffer_rescind(), between the point
> > where we set channel->rescind and the point where we call
> > channel->chn_rescind_callback(), the channel may have been freed by
> > vmbus_hvsock_device_unregister().
> >
> > This race window is small, and I guess that's why we never noticed
> > it.
> >
> > > I guess we need to introduce a new single-threaded workqueue in
> > > the vmbus driver, and offload both vmbus_onoffer_rescind() and
> > > hvs_close_timeout() onto the new workqueue.
>
> Something is amiss if the guest has to wait for the host to close the
> channel prior to cleaning it up from its side. That's a waste of
> resources; it doesn't matter

I agree.

> whether you do it in a system thread, a dedicated pool, or anywhere
> else. I think the right way to deal with this is to unregister the
> rescind callback routine, wait for any running rescind callback
> routine to finish, and then drop the last reference to the socket,
> which should lead to all the cleanup. I understand that some of the
> facility for unregistering the rescind callback might not exist
> today.

Considering the concurrency, I'm not sure if it's easy or possible to
safely unregister the chn_rescind_callback. My hunch is that doing so
may be more difficult than adding a new single-threaded workqueue.

Thanks,
-- Dexuan
From: Sunil Muthuswamy <sunilmut@microsoft.com>
Date: Wed, 15 May 2019 00:56:05 +0000

> Currently, hvsock does not implement any delayed or background close
> logic. Whenever the hvsock socket is closed, a FIN is sent to the
> peer, and the last reference to the socket is dropped, which leads to
> a call to .destruct where the socket can hang indefinitely waiting
> for the peer to close its side. This can cause the user application
> to hang in the close() call.
>
> This change implements a proper STREAM (TCP) closing handshake
> mechanism by sending the FIN to the peer and then waiting for the
> peer's FIN to arrive for a given timeout. On timeout, it will try to
> terminate the connection (i.e. send a RST). This is in line with
> other socket providers such as virtio.
>
> This change does not address the hang in
> vmbus_hvsock_device_unregister(), where it waits indefinitely for
> the host to rescind the channel. That should be taken up as a
> separate fix.
>
> Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>

Applied, thanks.
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index a827547..982a8dc 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -35,6 +35,9 @@
 /* The MTU is 16KB per the host side's design */
 #define HVS_MTU_SIZE		(1024 * 16)
 
+/* How long to wait for graceful shutdown of a connection */
+#define HVS_CLOSE_TIMEOUT (8 * HZ)
+
 struct vmpipe_proto_header {
 	u32 pkt_type;
 	u32 data_size;
@@ -305,19 +308,32 @@ static void hvs_channel_cb(void *ctx)
 		sk->sk_write_space(sk);
 }
 
-static void hvs_close_connection(struct vmbus_channel *chan)
+static void hvs_do_close_lock_held(struct vsock_sock *vsk,
+				   bool cancel_timeout)
 {
-	struct sock *sk = get_per_channel_state(chan);
-	struct vsock_sock *vsk = vsock_sk(sk);
-
-	lock_sock(sk);
+	struct sock *sk = sk_vsock(vsk);
 
-	sk->sk_state = TCP_CLOSE;
 	sock_set_flag(sk, SOCK_DONE);
-	vsk->peer_shutdown |= SEND_SHUTDOWN | RCV_SHUTDOWN;
-
+	vsk->peer_shutdown = SHUTDOWN_MASK;
+	if (vsock_stream_has_data(vsk) <= 0)
+		sk->sk_state = TCP_CLOSING;
 	sk->sk_state_change(sk);
+	if (vsk->close_work_scheduled &&
+	    (!cancel_timeout || cancel_delayed_work(&vsk->close_work))) {
+		vsk->close_work_scheduled = false;
+		vsock_remove_sock(vsk);
+
+		/* Release the reference taken while scheduling the timeout */
+		sock_put(sk);
+	}
+}
+
+static void hvs_close_connection(struct vmbus_channel *chan)
+{
+	struct sock *sk = get_per_channel_state(chan);
+
+	lock_sock(sk);
+	hvs_do_close_lock_held(vsock_sk(sk), true);
 	release_sock(sk);
 }
 
@@ -452,50 +468,80 @@ static int hvs_connect(struct vsock_sock *vsk)
 	return vmbus_send_tl_connect_request(&h->vm_srv_id, &h->host_srv_id);
 }
 
+static void hvs_shutdown_lock_held(struct hvsock *hvs, int mode)
+{
+	struct vmpipe_proto_header hdr;
+
+	if (hvs->fin_sent || !hvs->chan)
+		return;
+
+	/* It can't fail: see hvs_channel_writable_bytes(). */
+	(void)hvs_send_data(hvs->chan, (struct hvs_send_buf *)&hdr, 0);
+	hvs->fin_sent = true;
+}
+
 static int hvs_shutdown(struct vsock_sock *vsk, int mode)
 {
 	struct sock *sk = sk_vsock(vsk);
-	struct vmpipe_proto_header hdr;
-	struct hvs_send_buf *send_buf;
-	struct hvsock *hvs;
 
 	if (!(mode & SEND_SHUTDOWN))
 		return 0;
 
 	lock_sock(sk);
+	hvs_shutdown_lock_held(vsk->trans, mode);
+	release_sock(sk);
+	return 0;
+}
 
-	hvs = vsk->trans;
-	if (hvs->fin_sent)
-		goto out;
-
-	send_buf = (struct hvs_send_buf *)&hdr;
+static void hvs_close_timeout(struct work_struct *work)
+{
+	struct vsock_sock *vsk =
+		container_of(work, struct vsock_sock, close_work.work);
+	struct sock *sk = sk_vsock(vsk);
 
-	/* It can't fail: see hvs_channel_writable_bytes(). */
-	(void)hvs_send_data(hvs->chan, send_buf, 0);
+	sock_hold(sk);
+	lock_sock(sk);
+	if (!sock_flag(sk, SOCK_DONE))
+		hvs_do_close_lock_held(vsk, false);
 
-	hvs->fin_sent = true;
-out:
+	vsk->close_work_scheduled = false;
 	release_sock(sk);
-	return 0;
+	sock_put(sk);
 }
 
-static void hvs_release(struct vsock_sock *vsk)
+/* Returns true, if it is safe to remove socket; false otherwise */
+static bool hvs_close_lock_held(struct vsock_sock *vsk)
 {
 	struct sock *sk = sk_vsock(vsk);
-	struct hvsock *hvs = vsk->trans;
-	struct vmbus_channel *chan;
 
-	lock_sock(sk);
+	if (!(sk->sk_state == TCP_ESTABLISHED ||
+	      sk->sk_state == TCP_CLOSING))
+		return true;
 
-	sk->sk_state = TCP_CLOSING;
-	vsock_remove_sock(vsk);
+	if ((sk->sk_shutdown & SHUTDOWN_MASK) != SHUTDOWN_MASK)
+		hvs_shutdown_lock_held(vsk->trans, SHUTDOWN_MASK);
 
-	release_sock(sk);
+	if (sock_flag(sk, SOCK_DONE))
+		return true;
 
-	chan = hvs->chan;
-	if (chan)
-		hvs_shutdown(vsk, RCV_SHUTDOWN | SEND_SHUTDOWN);
+	/* This reference will be dropped by the delayed close routine */
+	sock_hold(sk);
+	INIT_DELAYED_WORK(&vsk->close_work, hvs_close_timeout);
+	vsk->close_work_scheduled = true;
+	schedule_delayed_work(&vsk->close_work, HVS_CLOSE_TIMEOUT);
+	return false;
+}
+
+static void hvs_release(struct vsock_sock *vsk)
+{
+	struct sock *sk = sk_vsock(vsk);
+	bool remove_sock;
+
+	lock_sock(sk);
+	remove_sock = hvs_close_lock_held(vsk);
+	release_sock(sk);
+	if (remove_sock)
+		vsock_remove_sock(vsk);
 }
 
 static void hvs_destruct(struct vsock_sock *vsk)
Currently, hvsock does not implement any delayed or background close
logic. Whenever the hvsock socket is closed, a FIN is sent to the peer,
and the last reference to the socket is dropped, which leads to a call
to .destruct where the socket can hang indefinitely waiting for the
peer to close its side. This can cause the user application to hang in
the close() call.

This change implements a proper STREAM (TCP) closing handshake
mechanism by sending the FIN to the peer and then waiting for the
peer's FIN to arrive for a given timeout. On timeout, it will try to
terminate the connection (i.e. send a RST). This is in line with other
socket providers such as virtio.

This change does not address the hang in
vmbus_hvsock_device_unregister(), where it waits indefinitely for the
host to rescind the channel. That should be taken up as a separate fix.

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
---
Changes since v1:
- Updated the title and description to better reflect the change. The
  title was previously 'hv_sock: Fix data loss upon socket close'.
- Removed the sk_state_change call to keep the fix focused.
- Removed the 'inline' keyword from the .c file, letting the compiler
  decide.

 net/vmw_vsock/hyperv_transport.c | 108 ++++++++++++++++++++++++++++-----------
 1 file changed, 77 insertions(+), 31 deletions(-)
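The handshake the commit message describes -- send our FIN, wait a bounded time for the peer's FIN, otherwise tear the connection down -- has a direct userspace analogue. The sketch below only illustrates that sequence using an AF_UNIX socket pair and poll(2) (function names are invented; the patch itself implements this inside the vsock transport, not with this API):

```c
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Illustrative userspace analogue of the delayed-close handshake:
 * send our "FIN" with shutdown(SHUT_WR), then wait up to a timeout
 * for the peer's FIN (read() returning 0).  Returns 1 on a graceful
 * close, 0 if we gave up and would reset the connection instead.
 */
static int graceful_close(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	char buf[64];
	ssize_t n;

	shutdown(fd, SHUT_WR);		/* our "FIN" */

	for (;;) {
		if (poll(&pfd, 1, timeout_ms) <= 0) {
			close(fd);	/* timeout: give up ("RST" path) */
			return 0;
		}
		n = read(fd, buf, sizeof(buf));
		if (n <= 0) {		/* 0 == peer's FIN arrived */
			close(fd);
			return n == 0;
		}
		/* discard trailing data and keep waiting for the FIN */
	}
}

int close_handshake_demo(void)
{
	int sv[2], ret;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return -1;

	shutdown(sv[1], SHUT_WR);	/* peer's "FIN" arrives immediately */
	ret = graceful_close(sv[0], 1000);
	close(sv[1]);
	return ret;
}
```

The patch's HVS_CLOSE_TIMEOUT of 8 * HZ plays the role of the poll timeout here: the close() caller is never blocked longer than that bound waiting on an unresponsive peer.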