Message ID | 20200716015442.14270-1-matthew.ruffell@canonical.com |
---|---|
Headers | show |
Series | Cutting and Pasting files from NFS sec=sys to NFS sec=krb5p causes NFS to hang | expand |
On 16.07.20 03:54, Matthew Ruffell wrote: > BugLink: https://bugs.launchpad.net/bugs/1887607 > > [Impact] > > If you have a desktop system, with two NFS mounts: > - One that uses the baseline IP based security, aka sec=sys, > - and the other that uses Kerberos sec=krb5p based security, > > If you try and cut a file from the normal NFS mount, and paste it to a directory > on the kerberos krb5p mount (using Nautilus), the NFS subsystem will lock up, > Nautilus will hang, and the file won't be moved. > > The problem only reproduces if you cut and paste. Copying and pasting does not > trigger any problems. Using mv in terminal doesn't reproduce either, you need to > use Nautilus. > > The issue was introduced into 4.15.0-60-generic, by the commit: > > commit 594d1644cd59447f4fceb592448d5cd09eb09b5e > Author: Chris Perl <cperl@janestreet.com> > Date: Mon Dec 17 10:56:38 2018 -0500 > Subject: NFS: nfs_compare_mount_options always compare auth flavors. > Link: https://github.com/torvalds/linux/commit/594d1644cd59447f4fceb592448d5cd09eb09b5e > > It was backported to 4.15.0-60-generic from upstream -stable, and landed in > 4.4.175, 4.14.99 and 4.19.21. The commit itself does not actually cause the > problem, it simply removes a check for NFS server security settings, which > simply reveals a broken codepath which the testcase triggers. > > Xenial 4.4.0-185-generic is not affected, only Bionic 4.15.0-60-generic onward. > > [Fix] > > The fix landed in 5.1-rc1, in the following commit: > > commit 02ef04e432babf8fc703104212314e54112ecd2d > Author: Chuck Lever <chuck.lever@oracle.com> > Date: Mon Feb 11 11:25:25 2019 -0500 > Subject: NFS: Account for XDR pad of buf->pages > Link: https://github.com/torvalds/linux/commit/02ef04e432babf8fc703104212314e54112ecd2d > > The above commit more or less relies on the below commit as a dependency, and is > included in the SRU: > > commit cf500bac8fd48b57f38ece890235923d4ed5ee91 > Author: Chuck Lever <chuck.lever@oracle.com> > Date: Mon Feb 11 11:25:20 2019 -0500 > Subject: SUNRPC: Introduce rpc_prepare_reply_pages() > Link: https://github.com/torvalds/linux/commit/cf500bac8fd48b57f38ece890235923d4ed5ee91 > > It appears that some NFS calls return a NFS payload which is not a multiple of > 4 bytes, but any payload sent over the network needs to be padded to an exact > multiple of 4 bytes. It seems cutting and pasting from Nautilus triggers one > such payload which is missing a byte, and it causes the NFS subsystem to hang > during packet transmission. The fix ensures that all payloads use correct > padding. > > [Testcase] > > You will need four machines. The first, is a kerberos KDC. Set up Kerberos > correctly and create new service principals for the NFS server and for the > client. > > The second machine will be a NFS server with the krb5p share. Add the nfs server > kerberos keys to the system's keytab, and set up a NFS server that exports a > directory with sec=krb5p. > > The third machine is a regular NFS server. Export a directory with normal > sec=sys security. > > The fourth is a desktop machine. Add the client kerberos keys to the system's > keytab. Mount both NFS shares, and generate some files full of random data. > I found 20MB from /dev/random works great. > > Open each NFS share up in tabs in Nautilus. Copy the random data files to the > sec=sys NFS share. When they are done, one at a time cut and then paste the file > into the sec=krb5p NFS share. The bug will trigger either on the first, or > subsequent tries, but less than 10 tries are needed usually. > > There is a test kernel available in the following PPA: > https://launchpad.net/~mruffell/+archive/ubuntu/sf285439-test > > If you install the test kernel, files will cut and paste correctly, and NFS will > work as expected. > > [Regression Potential] > > If a regression were to occur, it would impact users of the NFS subsystem, since > the changes modify how padding is applied to all NFS packets, and a regression > would affect all versions of NFS. > > If a regression were to occur, users would need to downgrade their kernel while > awaiting a fix. > > Chuck Lever (2): > SUNRPC: Introduce rpc_prepare_reply_pages() > NFS: Account for XDR pad of buf->pages > > fs/nfs/nfs2xdr.c | 33 ++++++--------------- > fs/nfs/nfs3xdr.c | 37 +++++++----------------- > fs/nfs/nfs4xdr.c | 54 +++++++++++++++-------------------- > include/linux/sunrpc/clnt.h | 3 ++ > include/trace/events/sunrpc.h | 37 ++++++++++++++++++++++++ > net/sunrpc/clnt.c | 23 +++++++++++++++ > net/sunrpc/xdr.c | 11 +++++++ > 7 files changed, 117 insertions(+), 81 deletions(-) > Between the lines of the testcase and the regression risk assessment I believe that you say, that you made some testing and saw no problems (v2/v3/v4 or just one). Stating this more explicitly helps to decide about taking a set of patches. Especially if backport work appears slightly more complex. -Stefan Acked-by: Stefan Bader <stefan.bader@canonical.com>
On Thu, Jul 16, 2020 at 01:54:40PM +1200, Matthew Ruffell wrote: ... > [Fix] > > The fix landed in 5.1-rc1, in the following commit: > > commit 02ef04e432babf8fc703104212314e54112ecd2d > Author: Chuck Lever <chuck.lever@oracle.com> > Date: Mon Feb 11 11:25:25 2019 -0500 > Subject: NFS: Account for XDR pad of buf->pages > Link: https://github.com/torvalds/linux/commit/02ef04e432babf8fc703104212314e54112ecd2d > > The above commit more or less relies on the below commit as a dependency, and is > included in the SRU: > > commit cf500bac8fd48b57f38ece890235923d4ed5ee91 > Author: Chuck Lever <chuck.lever@oracle.com> > Date: Mon Feb 11 11:25:20 2019 -0500 > Subject: SUNRPC: Introduce rpc_prepare_reply_pages() > Link: https://github.com/torvalds/linux/commit/cf500bac8fd48b57f38ece890235923d4ed5ee91 > > It appears that some NFS calls return a NFS payload which is not a multiple of > 4 bytes, but any payload sent over the network needs to be padded to an exact > multiple of 4 bytes. It seems cutting and pasting from Nautilus triggers one > such payload which is missing a byte, and it causes the NFS subsystem to hang > during packet transmission. The fix ensures that all payloads use correct > padding. I think we may need to apply also: 29e7ca715f2a ("NFS: Fix handling of reply page vector") that seems to fix a bug introduced by 02ef04e432ba. If that's the case could you re-send the patch set including also this fix. Thanks, -Andrea
> > I think we may need to apply also: > > 29e7ca715f2a ("NFS: Fix handling of reply page vector") > > that seems to fix a bug introduced by 02ef04e432ba. If that's the case > could you re-send the patch set including also this fix. Thanks Andrea, yes, I agree we need this patch. Sorry for missing it! I will build a fresh test kernel, test it and resend the patch set. Thanks, Matthew
On Fri, Jul 17, 2020 at 11:51 AM Matthew Ruffell <matthew.ruffell@canonical.com> wrote: > > > > > I think we may need to apply also: > > > > 29e7ca715f2a ("NFS: Fix handling of reply page vector") > > > > that seems to fix a bug introduced by 02ef04e432ba. If that's the case > > could you re-send the patch set including also this fix. > > Thanks Andrea, yes, I agree we need this patch. Sorry for missing it! > > I will build a fresh test kernel, test it and resend the patch set. > Okay, I built a test kernel based on: - Ubuntu-4.15.0-111-generic: + cf500bac8fd4 "SUNRPC: Introduce rpc_prepare_reply_pages()" + 02ef04e432ba "NFS: Account for XDR pad of buf->pages" + 29e7ca715f2a "NFS: Fix handling of reply page vector" The most curious thing happens - the issue returns and is reproducible. It seems the fixup commit either breaks things, or what I believed to be fixes don't actually fix the problem, but only cover it up. I am going to investigate this more deeply, and will resubmit for the next SRU cycle once I have determined what is happening. Engineering notes: If I build a kernel based on: - 29e7ca715f2a "NFS: Fix handling of reply page vector" It works. If I build a kernel based on: - 02ef04e432ba "NFS: Account for XDR pad of buf->pages" It works. If I build a kernel based on: - 02ef04e432ba "NFS: Account for XDR pad of buf->pages" + 29e7ca715f2a "NFS: Fix handling of reply page vector" It fails, and the issue returns. Therefore, something additionally got fixed between 02ef04e432ba..29e7ca715f2a which might be the actual root cause. Looks like I'm going to have to bisect again.