mbox series

[SRU,Bionic,0/2] Cutting and Pasting files from NFS sec=sys to NFS sec=krb5p causes NFS to hang

Message ID 20200716015442.14270-1-matthew.ruffell@canonical.com
Headers show
Series Cutting and Pasting files from NFS sec=sys to NFS sec=krb5p causes NFS to hang | expand

Message

Matthew Ruffell July 16, 2020, 1:54 a.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1887607

[Impact]

If you have a desktop system, with two NFS mounts: 
- One that uses the baseline IP based security, aka sec=sys, 
- and the other that uses Kerberos sec=krb5p based security,

If you try and cut a file from the normal NFS mount, and paste it to a directory
on the kerberos krb5p mount (using Nautilus), the NFS subsystem will lock up,
Nautilus will hang, and the file won't be moved.

The problem only reproduces if you cut and paste. Copying and pasting does not
trigger any problems. Using mv in terminal doesn't reproduce either, you need to
use Nautilus.

The issue was introduced into 4.15.0-60-generic, by the commit:

commit 594d1644cd59447f4fceb592448d5cd09eb09b5e
Author: Chris Perl <cperl@janestreet.com>
Date: Mon Dec 17 10:56:38 2018 -0500
Subject: NFS: nfs_compare_mount_options always compare auth flavors.
Link: https://github.com/torvalds/linux/commit/594d1644cd59447f4fceb592448d5cd09eb09b5e 

It was backported to 4.15.0-60-generic from upstream -stable, and landed in
4.4.175, 4.14.99 and 4.19.21. The commit itself does not actually cause the
problem, it simply removes a check for NFS server security settings, which
simply reveals a broken codepath which the testcase triggers.

Xenial 4.4.0-185-generic is not affected, only Bionic 4.15.0-60-generic onward.

[Fix]

The fix landed in 5.1-rc1, in the following commit:

commit 02ef04e432babf8fc703104212314e54112ecd2d
Author: Chuck Lever <chuck.lever@oracle.com>
Date: Mon Feb 11 11:25:25 2019 -0500
Subject: NFS: Account for XDR pad of buf->pages
Link: https://github.com/torvalds/linux/commit/02ef04e432babf8fc703104212314e54112ecd2d 

The above commit more or less relies on the below commit as a dependency, and is
included in the SRU:

commit cf500bac8fd48b57f38ece890235923d4ed5ee91
Author: Chuck Lever <chuck.lever@oracle.com>
Date: Mon Feb 11 11:25:20 2019 -0500
Subject: SUNRPC: Introduce rpc_prepare_reply_pages()
Link: https://github.com/torvalds/linux/commit/cf500bac8fd48b57f38ece890235923d4ed5ee91

It appears that some NFS calls return a NFS payload which is not a multiple of
4 bytes, but any payload sent over the network needs to be padded to an exact
multiple of 4 bytes. It seems cutting and pasting from Nautilus triggers one
such payload which is missing a byte, and it causes the NFS subsystem to hang
during packet transmission. The fix ensures that all payloads use correct
padding.

[Testcase]

You will need four machines. The first, is a kerberos KDC. Set up Kerberos
correctly and create new service principals for the NFS server and for the
client.

The second machine will be a NFS server with the krb5p share. Add the nfs server
kerberos keys to the system's keytab, and set up a NFS server that exports a
directory with sec=krb5p.

The third machine is a regular NFS server. Export a directory with normal
sec=sys security.

The fourth is a desktop machine. Add the client kerberos keys to the system's
keytab. Mount both NFS shares, and generate some files full of random data.
I found 20MB from /dev/random works great.

Open each NFS share up in tabs in Nautilus. Copy the random data files to the
sec=sys NFS share. When they are done, one at a time cut and then paste the file
into the sec=krb5p NFS share. The bug will trigger either on the first, or
subsequent tries, but less than 10 tries are needed usually.

There is a test kernel available in the following PPA:
https://launchpad.net/~mruffell/+archive/ubuntu/sf285439-test

If you install the test kernel, files will cut and paste correctly, and NFS will
work as expected.

[Regression Potential]

If a regression were to occur, it would impact users of the NFS subsystem, since
the changes modify how padding is applied to all NFS packets, and a regression
would affect all versions of NFS.

If a regression were to occur, users would need to downgrade their kernel while
awaiting a fix.

Chuck Lever (2):
  SUNRPC: Introduce rpc_prepare_reply_pages()
  NFS: Account for XDR pad of buf->pages

 fs/nfs/nfs2xdr.c              | 33 ++++++---------------
 fs/nfs/nfs3xdr.c              | 37 +++++++-----------------
 fs/nfs/nfs4xdr.c              | 54 +++++++++++++++--------------------
 include/linux/sunrpc/clnt.h   |  3 ++
 include/trace/events/sunrpc.h | 37 ++++++++++++++++++++++++
 net/sunrpc/clnt.c             | 23 +++++++++++++++
 net/sunrpc/xdr.c              | 11 +++++++
 7 files changed, 117 insertions(+), 81 deletions(-)

Comments

Stefan Bader July 16, 2020, 7:51 a.m. UTC | #1
On 16.07.20 03:54, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1887607
> 
> [Impact]
> 
> If you have a desktop system, with two NFS mounts: 
> - One that uses the baseline IP based security, aka sec=sys, 
> - and the other that uses Kerberos sec=krb5p based security,
> 
> If you try and cut a file from the normal NFS mount, and paste it to a directory
> on the kerberos krb5p mount (using Nautilus), the NFS subsystem will lock up,
> Nautilus will hang, and the file won't be moved.
> 
> The problem only reproduces if you cut and paste. Copying and pasting does not
> trigger any problems. Using mv in terminal doesn't reproduce either, you need to
> use Nautilus.
> 
> The issue was introduced into 4.15.0-60-generic, by the commit:
> 
> commit 594d1644cd59447f4fceb592448d5cd09eb09b5e
> Author: Chris Perl <cperl@janestreet.com>
> Date: Mon Dec 17 10:56:38 2018 -0500
> Subject: NFS: nfs_compare_mount_options always compare auth flavors.
> Link: https://github.com/torvalds/linux/commit/594d1644cd59447f4fceb592448d5cd09eb09b5e 
> 
> It was backported to 4.15.0-60-generic from upstream -stable, and landed in
> 4.4.175, 4.14.99 and 4.19.21. The commit itself does not actually cause the
> problem, it simply removes a check for NFS server security settings, which
> simply reveals a broken codepath which the testcase triggers.
> 
> Xenial 4.4.0-185-generic is not affected, only Bionic 4.15.0-60-generic onward.
> 
> [Fix]
> 
> The fix landed in 5.1-rc1, in the following commit:
> 
> commit 02ef04e432babf8fc703104212314e54112ecd2d
> Author: Chuck Lever <chuck.lever@oracle.com>
> Date: Mon Feb 11 11:25:25 2019 -0500
> Subject: NFS: Account for XDR pad of buf->pages
> Link: https://github.com/torvalds/linux/commit/02ef04e432babf8fc703104212314e54112ecd2d 
> 
> The above commit more or less relies on the below commit as a dependency, and is
> included in the SRU:
> 
> commit cf500bac8fd48b57f38ece890235923d4ed5ee91
> Author: Chuck Lever <chuck.lever@oracle.com>
> Date: Mon Feb 11 11:25:20 2019 -0500
> Subject: SUNRPC: Introduce rpc_prepare_reply_pages()
> Link: https://github.com/torvalds/linux/commit/cf500bac8fd48b57f38ece890235923d4ed5ee91
> 
> It appears that some NFS calls return a NFS payload which is not a multiple of
> 4 bytes, but any payload sent over the network needs to be padded to an exact
> multiple of 4 bytes. It seems cutting and pasting from Nautilus triggers one
> such payload which is missing a byte, and it causes the NFS subsystem to hang
> during packet transmission. The fix ensures that all payloads use correct
> padding.
> 
> [Testcase]
> 
> You will need four machines. The first, is a kerberos KDC. Set up Kerberos
> correctly and create new service principals for the NFS server and for the
> client.
> 
> The second machine will be a NFS server with the krb5p share. Add the nfs server
> kerberos keys to the system's keytab, and set up a NFS server that exports a
> directory with sec=krb5p.
> 
> The third machine is a regular NFS server. Export a directory with normal
> sec=sys security.
> 
> The fourth is a desktop machine. Add the client kerberos keys to the system's
> keytab. Mount both NFS shares, and generate some files full of random data.
> I found 20MB from /dev/random works great.
> 
> Open each NFS share up in tabs in Nautilus. Copy the random data files to the
> sec=sys NFS share. When they are done, one at a time cut and then paste the file
> into the sec=krb5p NFS share. The bug will trigger either on the first, or
> subsequent tries, but less than 10 tries are needed usually.
> 
> There is a test kernel available in the following PPA:
> https://launchpad.net/~mruffell/+archive/ubuntu/sf285439-test
> 
> If you install the test kernel, files will cut and paste correctly, and NFS will
> work as expected.
> 
> [Regression Potential]
> 
> If a regression were to occur, it would impact users of the NFS subsystem, since
> the changes modify how padding is applied to all NFS packets, and a regression
> would affect all versions of NFS.
> 
> If a regression were to occur, users would need to downgrade their kernel while
> awaiting a fix.
> 
> Chuck Lever (2):
>   SUNRPC: Introduce rpc_prepare_reply_pages()
>   NFS: Account for XDR pad of buf->pages
> 
>  fs/nfs/nfs2xdr.c              | 33 ++++++---------------
>  fs/nfs/nfs3xdr.c              | 37 +++++++-----------------
>  fs/nfs/nfs4xdr.c              | 54 +++++++++++++++--------------------
>  include/linux/sunrpc/clnt.h   |  3 ++
>  include/trace/events/sunrpc.h | 37 ++++++++++++++++++++++++
>  net/sunrpc/clnt.c             | 23 +++++++++++++++
>  net/sunrpc/xdr.c              | 11 +++++++
>  7 files changed, 117 insertions(+), 81 deletions(-)
> 
Between the lines of the testcase and the regression risk assessment I believe
that you say, that you made some testing and saw no problems (v2/v3/v4 or just
one). Stating this more explicitly helps to decide about taking a set of
patches. Especially if backport work appears slightly more complex.

-Stefan

Acked-by: Stefan Bader <stefan.bader@canonical.com>
Andrea Righi July 16, 2020, 8:24 a.m. UTC | #2
On Thu, Jul 16, 2020 at 01:54:40PM +1200, Matthew Ruffell wrote:
...
> [Fix]
> 
> The fix landed in 5.1-rc1, in the following commit:
> 
> commit 02ef04e432babf8fc703104212314e54112ecd2d
> Author: Chuck Lever <chuck.lever@oracle.com>
> Date: Mon Feb 11 11:25:25 2019 -0500
> Subject: NFS: Account for XDR pad of buf->pages
> Link: https://github.com/torvalds/linux/commit/02ef04e432babf8fc703104212314e54112ecd2d 
> 
> The above commit more or less relies on the below commit as a dependency, and is
> included in the SRU:
> 
> commit cf500bac8fd48b57f38ece890235923d4ed5ee91
> Author: Chuck Lever <chuck.lever@oracle.com>
> Date: Mon Feb 11 11:25:20 2019 -0500
> Subject: SUNRPC: Introduce rpc_prepare_reply_pages()
> Link: https://github.com/torvalds/linux/commit/cf500bac8fd48b57f38ece890235923d4ed5ee91
> 
> It appears that some NFS calls return a NFS payload which is not a multiple of
> 4 bytes, but any payload sent over the network needs to be padded to an exact
> multiple of 4 bytes. It seems cutting and pasting from Nautilus triggers one
> such payload which is missing a byte, and it causes the NFS subsystem to hang
> during packet transmission. The fix ensures that all payloads use correct
> padding.

I think we may need to apply also:

 29e7ca715f2a ("NFS: Fix handling of reply page vector")

that seems to fix a bug introduced by 02ef04e432ba. If that's the case
could you re-send the patch set including also this fix.

Thanks,
-Andrea
Matthew Ruffell July 16, 2020, 11:51 p.m. UTC | #3
>
> I think we may need to apply also:
>
>  29e7ca715f2a ("NFS: Fix handling of reply page vector")
>
> that seems to fix a bug introduced by 02ef04e432ba. If that's the case
> could you re-send the patch set including also this fix.

Thanks Andrea, yes, I agree we need this patch. Sorry for missing it!

I will build a fresh test kernel, test it and resend the patch set.

Thanks,
Matthew
Matthew Ruffell July 17, 2020, 6:12 a.m. UTC | #4
On Fri, Jul 17, 2020 at 11:51 AM Matthew Ruffell
<matthew.ruffell@canonical.com> wrote:
>
> >
> > I think we may need to apply also:
> >
> >  29e7ca715f2a ("NFS: Fix handling of reply page vector")
> >
> > that seems to fix a bug introduced by 02ef04e432ba. If that's the case
> > could you re-send the patch set including also this fix.
>
> Thanks Andrea, yes, I agree we need this patch. Sorry for missing it!
>
> I will build a fresh test kernel, test it and resend the patch set.
>
Okay, I built a test kernel based on:

- Ubuntu-4.15.0-111-generic:
  + cf500bac8fd4 "SUNRPC: Introduce rpc_prepare_reply_pages()"
  + 02ef04e432ba "NFS: Account for XDR pad of buf->pages"
  + 29e7ca715f2a "NFS: Fix handling of reply page vector"

The most curious thing happens - the issue returns and is reproducible. It seems
the fixup commit either breaks things, or what I believed to be fixes don't
actually fix the problem, but only cover it up.

I am going to investigate this more deeply, and will resubmit for the next
SRU cycle once I have determined what is happening.

Engineering notes:

If I build a kernel based on:
- 29e7ca715f2a "NFS: Fix handling of reply page vector"

It works.

If I build a kernel based on:
- 02ef04e432ba "NFS: Account for XDR pad of buf->pages"

It works.

If I build a kernel based on:

- 02ef04e432ba "NFS: Account for XDR pad of buf->pages"
  + 29e7ca715f2a "NFS: Fix handling of reply page vector"

It fails, and the issue returns.

Therefore, something additionally got fixed between 02ef04e432ba..29e7ca715f2a
which might be the actual root cause.

Looks like I'm going to have to bisect again.