diff mbox series

[net-next] tcp: forbid direct reclaim if MSG_DONTWAIT is set in send path

Message ID 1539086718-4119-2-git-send-email-laoar.shao@gmail.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Yafang Shao Oct. 9, 2018, 12:05 p.m. UTC
By default, sk->sk_allocation is GFP_KERNEL, which means that when there is
not enough memory the allocation does both direct reclaim and background
reclaim. On systems with a large amount of memory, direct reclaim may cause
large latency spikes.

When MSG_DONTWAIT is set in a send syscall, we really don't want the call to
block, so we'd better clear __GFP_DIRECT_RECLAIM when allocating the skb in
the send path. The call then returns immediately if there is not enough
memory to allocate, and the application has a chance to do other work
instead of being blocked here.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 net/ipv4/tcp.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
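
To make the mask arithmetic in the description concrete, here is a minimal
userspace sketch. It assumes made-up bit values (the real ones are defined in
include/linux/gfp.h and vary between kernel versions); the MODEL_ names,
send_path_gfp() and send_path_gfp_demo() are hypothetical and only mirror the
expression the patch adds. It shows that clearing the direct-reclaim bit
leaves background (kswapd) reclaim enabled.

```c
#include <assert.h>
#include <sys/socket.h>   /* MSG_DONTWAIT */

typedef unsigned int gfp_t;

/* Made-up bit values for illustration only; not the kernel's real ones. */
#define MODEL_GFP_IO             0x01u
#define MODEL_GFP_FS             0x02u
#define MODEL_GFP_DIRECT_RECLAIM 0x04u  /* caller may reclaim synchronously */
#define MODEL_GFP_KSWAPD_RECLAIM 0x08u  /* caller may wake background reclaim */
#define MODEL_GFP_KERNEL (MODEL_GFP_IO | MODEL_GFP_FS | \
                          MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM)

/* Mirrors the ternary added by the patch: drop only the direct-reclaim bit
 * when the caller passed MSG_DONTWAIT, otherwise use the mask unchanged. */
static gfp_t send_path_gfp(gfp_t sk_allocation, int msg_flags)
{
    return (msg_flags & MSG_DONTWAIT)
        ? (sk_allocation & ~MODEL_GFP_DIRECT_RECLAIM)
        : sk_allocation;
}

/* Small self-check showing the intended effect on a GFP_KERNEL-like mask. */
static void send_path_gfp_demo(void)
{
    gfp_t nowait = send_path_gfp(MODEL_GFP_KERNEL, MSG_DONTWAIT);

    assert(!(nowait & MODEL_GFP_DIRECT_RECLAIM)); /* no synchronous reclaim */
    assert(nowait & MODEL_GFP_KSWAPD_RECLAIM);    /* kswapd wakeup is kept */
    assert(send_path_gfp(MODEL_GFP_KERNEL, 0) == MODEL_GFP_KERNEL);
}
```

Note that the sketch covers only the mask passed to the skb allocation, which
is the only call site the patch touches.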

Comments

Eric Dumazet Oct. 9, 2018, 2:12 p.m. UTC | #1
On Tue, Oct 9, 2018 at 5:05 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> By default, sk->sk_allocation is GFP_KERNEL, which means that when there is
> not enough memory the allocation does both direct reclaim and background
> reclaim. On systems with a large amount of memory, direct reclaim may cause
> large latency spikes.
>
> When MSG_DONTWAIT is set in a send syscall, we really don't want the call to
> block, so we'd better clear __GFP_DIRECT_RECLAIM when allocating the skb in
> the send path. The call then returns immediately if there is not enough
> memory to allocate, and the application has a chance to do other work
> instead of being blocked here.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  net/ipv4/tcp.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 43ef83b..fe4f5ce 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1182,6 +1182,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>         bool process_backlog = false;
>         bool zc = false;
>         long timeo;
> +       gfp_t gfp;
>
>         flags = msg->msg_flags;
>
> @@ -1255,6 +1256,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>         /* Ok commence sending. */
>         copied = 0;
>
> +       gfp = flags & MSG_DONTWAIT ? sk->sk_allocation & ~__GFP_DIRECT_RECLAIM :
> +             sk->sk_allocation;
> +
>  restart:
>         mss_now = tcp_send_mss(sk, &size_goal, flags);
>
> @@ -1283,8 +1287,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>                         }
>                         first_skb = tcp_rtx_and_write_queues_empty(sk);
>                         linear = select_size(first_skb, zc);
> -                       skb = sk_stream_alloc_skb(sk, linear, sk->sk_allocation,
> -                                                 first_skb);
> +                       skb = sk_stream_alloc_skb(sk, linear, gfp, first_skb);
>                         if (!skb)
>                                 goto wait_for_memory;


How have you tested this patch exactly ?

Most TCP payloads are added in page fragments, and you have not
changed the page fragment allocations.

Also, I do not see how an application will get future notifications
that it can retry the failed system call ?
How are you really going to deal with this in high performance applications ?

I would rather prefer a socket setsockopt() to eventually be able to
flip __GFP_DIRECT_RECLAIM in sk->sk_allocation,
to not add all these tests in fast path, but honestly I do not see how
applications can really make use of this.
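
The setsockopt() idea above could look roughly like the following userspace
model. Everything in it is hypothetical: this is not an existing socket
option, and struct model_sock, the MODEL_ bit values and the function names
are stand-ins invented for the sketch. The point is only that flipping the
bit once in sk_allocation covers every allocation site that reads that field
(skb and page fragments alike) without a per-call MSG_DONTWAIT test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int gfp_t;

/* Made-up bit values for illustration only; not the kernel's real ones. */
#define MODEL_GFP_DIRECT_RECLAIM 0x04u
#define MODEL_GFP_KSWAPD_RECLAIM 0x08u
#define MODEL_GFP_KERNEL (0x01u | 0x02u | \
                          MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM)

/* Stand-in for the one field of struct sock that matters here. */
struct model_sock {
    gfp_t sk_allocation;
};

/* Hypothetical option handler: set or clear direct reclaim for the socket.
 * Runs once, at option time, so the send fast path needs no extra branch. */
static void model_set_no_direct_reclaim(struct model_sock *sk, bool on)
{
    assert(sk != NULL);
    if (on)
        sk->sk_allocation &= ~MODEL_GFP_DIRECT_RECLAIM;
    else
        sk->sk_allocation |= MODEL_GFP_DIRECT_RECLAIM;
}

/* Two allocation sites, both modelled as simply reading the shared mask. */
static gfp_t model_skb_alloc_gfp(const struct model_sock *sk)
{
    return sk->sk_allocation;
}

static gfp_t model_page_frag_gfp(const struct model_sock *sk)
{
    return sk->sk_allocation;
}
```

With this shape, a caller would opt in once per socket and every later send
would inherit the mask, whatever flags it passes.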
Yafang Shao Oct. 9, 2018, 2:52 p.m. UTC | #2
On Tue, Oct 9, 2018 at 10:12 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Oct 9, 2018 at 5:05 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > By default, sk->sk_allocation is GFP_KERNEL, which means that when there is
> > not enough memory the allocation does both direct reclaim and background
> > reclaim. On systems with a large amount of memory, direct reclaim may cause
> > large latency spikes.
> >
> > When MSG_DONTWAIT is set in a send syscall, we really don't want the call to
> > block, so we'd better clear __GFP_DIRECT_RECLAIM when allocating the skb in
> > the send path. The call then returns immediately if there is not enough
> > memory to allocate, and the application has a chance to do other work
> > instead of being blocked here.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  net/ipv4/tcp.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 43ef83b..fe4f5ce 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1182,6 +1182,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> >         bool process_backlog = false;
> >         bool zc = false;
> >         long timeo;
> > +       gfp_t gfp;
> >
> >         flags = msg->msg_flags;
> >
> > @@ -1255,6 +1256,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> >         /* Ok commence sending. */
> >         copied = 0;
> >
> > +       gfp = flags & MSG_DONTWAIT ? sk->sk_allocation & ~__GFP_DIRECT_RECLAIM :
> > +             sk->sk_allocation;
> > +
> >  restart:
> >         mss_now = tcp_send_mss(sk, &size_goal, flags);
> >
> > @@ -1283,8 +1287,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> >                         }
> >                         first_skb = tcp_rtx_and_write_queues_empty(sk);
> >                         linear = select_size(first_skb, zc);
> > -                       skb = sk_stream_alloc_skb(sk, linear, sk->sk_allocation,
> > -                                                 first_skb);
> > +                       skb = sk_stream_alloc_skb(sk, linear, gfp, first_skb);
> >                         if (!skb)
> >                                 goto wait_for_memory;
>
>
> How have you tested this patch exactly ?
>
There was network latency (hundreds of msecs or even one sec) recently
in our production environment.
I finally diagnosed that this latency was caused by direct reclaim
in tcp_sendmsg.
That issue could be resolved by keeping some memory reserved.
But then I thought: why not forbid direct reclaim if MSG_DONTWAIT is set?
So I made this change and tested it. The application got an errno
returned instead of being blocked in the send path.
That's why I submitted this patch.

> Most TCP payloads are added in page fragments, and you have not
> changed the page fragment allocations.
>
> Also, I do not see how an application will get future notifications
> that it can retry the failed system call ?
> How are you really going to deal with this in high performance applications ?
>

I think that returning immediately with an errno is better than being blocked.
Maybe this solution is not good enough.
At least it can tell the application that something is wrong and that it
can't send now.

> I would rather prefer a socket setsockopt() to eventually be able to
> flip __GFP_DIRECT_RECLAIM in sk->sk_allocation,
> to not add all these tests in fast path, but honestly I do not see how
> applications can really make use of this.

Maybe an event is needed to tell the application it can send now.
I don't have a better idea either.

Thanks
Yafang
Eric Dumazet Oct. 9, 2018, 2:58 p.m. UTC | #3
> >
> There was network latency (hundreds of msecs or even one sec) recently
> in our production environment.
> I finally diagnosed that this latency was caused by direct reclaim
> in tcp_sendmsg.
> That issue could be resolved by keeping some memory reserved.
> But then I thought: why not forbid direct reclaim if MSG_DONTWAIT is set?
> So I made this change and tested it. The application got an errno
> returned instead of being blocked in the send path.
> That's why I submitted this patch.

Sure, and I asked you how you have tested it, because it seems clear
to me that you missed the real memory allocation point (we fill up to
64 KB of page fragment memory into one (small) skb).

And how is the application going to use MSG_DONTWAIT in the real
world, I do wonder as well.

We do not add bloat in the kernel if no application is ever going to
use it, especially in the TCP fast path.

Give us a test, so that we can see how this can be used...

Thanks.
Eric Dumazet Oct. 9, 2018, 3:38 p.m. UTC | #4
On Tue, Oct 9, 2018 at 7:58 AM Eric Dumazet <edumazet@google.com> wrote:
>

> We do not add bloat in the kernel if no application is ever going to
> use it, especially in the TCP fast path.
>

BTW, are you willing to change all memory allocations in the kernel as well ?

Let's say an application is using a system call that takes a pathname
(open(), stat(), ...): how is this system call
going to ask the kernel for no direct reclaim ?

Even allocating a socket with socket() or accept() has no ability to
avoid direct reclaim.

So tcp_sendmsg() is only the tip of the iceberg.
Yafang Shao Oct. 10, 2018, 1:30 a.m. UTC | #5
On Tue, Oct 9, 2018 at 11:38 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Oct 9, 2018 at 7:58 AM Eric Dumazet <edumazet@google.com> wrote:
> >
>
> > We do not add bloat in the kernel if no application is ever going to
> > use it, especially in the TCP fast path.
> >
>
> BTW, are you willing to change all memory allocations in the kernel as well ?
>
> Let's say an application is using a system call that takes a pathname
> (open(), stat(), ...): how is this system call
> going to ask the kernel for no direct reclaim ?
>
> Even allocating a socket with socket() or accept() has no ability to
> avoid direct reclaim.
>
> So tcp_sendmsg() is only the tip of the iceberg.

If we can really find a solution that is good enough to handle direct
reclaim in tcp_sendmsg,
we could also implement it in other syscalls.
Unexpected latency is hateful.

Thanks
Yafang
Eric Dumazet Oct. 10, 2018, 1:44 a.m. UTC | #6
On 10/09/2018 06:30 PM, Yafang Shao wrote:
> On Tue, Oct 9, 2018 at 11:38 PM Eric Dumazet <edumazet@google.com> wrote:
>>
>> On Tue, Oct 9, 2018 at 7:58 AM Eric Dumazet <edumazet@google.com> wrote:
>>>
>>
>>> We do not add bloat in the kernel if no application is ever going to
>>> use it, especially in the TCP fast path.
>>>
>>
>> BTW, are you willing to change all memory allocations in the kernel as well ?
>>
>> Let's say an application is using a system call that takes a pathname
>> (open(), stat(), ...): how is this system call
>> going to ask the kernel for no direct reclaim ?
>>
>> Even allocating a socket with socket() or accept() has no ability to
>> avoid direct reclaim.
>>
>> So tcp_sendmsg() is only the tip of the iceberg.
> 
> If we can really find a solution that is good enough to handle direct
> reclaim in tcp_sendmsg,
> we could also implement it in other syscalls.
> Unexpected latency is hateful.

We have thousands of other places in the kernel. I want to find a generic solution,
not patch all the places one by one.

So come back when you have something more generic, and once applications have a way
to handle these memory allocation issues gracefully (without calling sendmsg() in an
infinite loop ...).

How is EPOLLOUT going to be generated ?
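
For context on the EPOLLOUT question, this is a sketch of how a non-blocking
sender is usually written in userspace: on EAGAIN it sleeps in poll() until
POLLOUT reports send-buffer space, then retries. send_nonblocking() is a
hypothetical helper name, not something from the patch. If the error came
from a failed allocation rather than a full send buffer, the socket would
likely still be reported writable, so a loop like this would wake up at once
and keep calling send() again, which is the infinite loop mentioned above.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send up to len bytes without blocking in send() itself.
 * Returns the number of bytes queued (possibly short if poll() timed out),
 * or -1 on a hard error with errno set by the failing call. */
static ssize_t send_nonblocking(int fd, const char *buf, size_t len,
                                int timeout_ms)
{
    size_t off = 0;

    assert(buf != NULL);
    while (off < len) {
        ssize_t n = send(fd, buf + off, len - off,
                         MSG_DONTWAIT | MSG_NOSIGNAL);
        if (n > 0) {
            off += (size_t)n;
            continue;
        }
        if (n == 0)
            break;                      /* no progress; give up */
        if (errno == EINTR)
            continue;                   /* interrupted; just retry */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                  /* hard error */

        /* Would block: wait for the kernel to report writability. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        int r = poll(&pfd, 1, timeout_ms);
        if (r < 0 && errno != EINTR)
            return -1;
        if (r == 0)
            break;                      /* timed out; return what was sent */
    }
    return (ssize_t)off;
}
```

The retry only makes sense because a full send buffer is followed by a
writability event once space frees up; the thread has not identified an
equivalent event for memory pressure.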

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 43ef83b..fe4f5ce 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1182,6 +1182,7 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	bool process_backlog = false;
 	bool zc = false;
 	long timeo;
+	gfp_t gfp;
 
 	flags = msg->msg_flags;
 
@@ -1255,6 +1256,9 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	/* Ok commence sending. */
 	copied = 0;
 
+	gfp = flags & MSG_DONTWAIT ? sk->sk_allocation & ~__GFP_DIRECT_RECLAIM :
+	      sk->sk_allocation;
+
 restart:
 	mss_now = tcp_send_mss(sk, &size_goal, flags);
 
@@ -1283,8 +1287,7 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			}
 			first_skb = tcp_rtx_and_write_queues_empty(sk);
 			linear = select_size(first_skb, zc);
-			skb = sk_stream_alloc_skb(sk, linear, sk->sk_allocation,
-						  first_skb);
+			skb = sk_stream_alloc_skb(sk, linear, gfp, first_skb);
 			if (!skb)
 				goto wait_for_memory;