Message ID | 20230816050803.15660-1-krisman@suse.de |
---|---|
Headers | show |
Series | [v6,1/9] ecryptfs: Reject casefold directory inodes | expand |
On Wed, Aug 16, 2023 at 01:07:54AM -0400, Gabriel Krisman Bertazi wrote: > Hi, > > This is v6 of the negative dentry on case-insensitive directories. > Thanks Eric for the review of the last iteration. This version > drops the patch to expose the helper to check casefolding directories, > since it is not necessary in ecryptfs and it might be going away. It > also addresses some documentation details, fix a build bot error and > simplifies the commit messages. See the changelog in each patch for > more details. > > Thanks, > > --- > > Gabriel Krisman Bertazi (9): > ecryptfs: Reject casefold directory inodes > 9p: Split ->weak_revalidate from ->revalidate > fs: Expose name under lookup to d_revalidate hooks > fs: Add DCACHE_CASEFOLDED_NAME flag > libfs: Validate negative dentries in case-insensitive directories > libfs: Chain encryption checks after case-insensitive revalidation > libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops > ext4: Enable negative dentries on case-insensitive lookup > f2fs: Enable negative dentries on case-insensitive lookup > Looks good, Reviewed-by: Eric Biggers <ebiggers@google.com> - Eric
On Thu, Aug 17, 2023 at 10:06:58AM -0700, Eric Biggers wrote: > On Wed, Aug 16, 2023 at 01:07:54AM -0400, Gabriel Krisman Bertazi wrote: > > Hi, > > > > This is v6 of the negative dentry on case-insensitive directories. > > Thanks Eric for the review of the last iteration. This version > > drops the patch to expose the helper to check casefolding directories, > > since it is not necessary in ecryptfs and it might be going away. It > > also addresses some documentation details, fix a build bot error and > > simplifies the commit messages. See the changelog in each patch for > > more details. > > > > Thanks, > > > > --- > > > > Gabriel Krisman Bertazi (9): > > ecryptfs: Reject casefold directory inodes > > 9p: Split ->weak_revalidate from ->revalidate > > fs: Expose name under lookup to d_revalidate hooks > > fs: Add DCACHE_CASEFOLDED_NAME flag > > libfs: Validate negative dentries in case-insensitive directories > > libfs: Chain encryption checks after case-insensitive revalidation > > libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops > > ext4: Enable negative dentries on case-insensitive lookup > > f2fs: Enable negative dentries on case-insensitive lookup > > > > Looks good, > > Reviewed-by: Eric Biggers <ebiggers@google.com> Thanks! We're a bit too late for v6.6 with this given that this hasn't even been in -next. So this will be up for v6.7.
Christian Brauner <brauner@kernel.org> writes: > On Thu, Aug 17, 2023 at 10:06:58AM -0700, Eric Biggers wrote: >> On Wed, Aug 16, 2023 at 01:07:54AM -0400, Gabriel Krisman Bertazi wrote: >> > Hi, >> > >> > This is v6 of the negative dentry on case-insensitive directories. >> > Thanks Eric for the review of the last iteration. This version >> > drops the patch to expose the helper to check casefolding directories, >> > since it is not necessary in ecryptfs and it might be going away. It >> > also addresses some documentation details, fix a build bot error and >> > simplifies the commit messages. See the changelog in each patch for >> > more details. >> > >> > Thanks, >> > >> > --- >> > >> > Gabriel Krisman Bertazi (9): >> > ecryptfs: Reject casefold directory inodes >> > 9p: Split ->weak_revalidate from ->revalidate >> > fs: Expose name under lookup to d_revalidate hooks >> > fs: Add DCACHE_CASEFOLDED_NAME flag >> > libfs: Validate negative dentries in case-insensitive directories >> > libfs: Chain encryption checks after case-insensitive revalidation >> > libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops >> > ext4: Enable negative dentries on case-insensitive lookup >> > f2fs: Enable negative dentries on case-insensitive lookup >> > >> >> Looks good, >> >> Reviewed-by: Eric Biggers <ebiggers@google.com> > > Thanks! We're a bit too late for v6.6 with this given that this hasn't > even been in -next. So this will be up for v6.7. Targeting 6.7 is fine by me. will you pick it up through the vfs tree? I prefer it goes through there since it mostly touches vfs.
> Targeting 6.7 is fine by me. will you pick it up through the vfs tree? I > prefer it goes through there since it mostly touches vfs. Yes, I think that's what will end up happening.
On Wed, 16 Aug 2023 01:07:54 -0400, Gabriel Krisman Bertazi wrote: > This is v6 of the negative dentry on case-insensitive directories. > Thanks Eric for the review of the last iteration. This version > drops the patch to expose the helper to check casefolding directories, > since it is not necessary in ecryptfs and it might be going away. It > also addresses some documentation details, fix a build bot error and > simplifies the commit messages. See the changelog in each patch for > more details. > > [...] Ok, let's put it into -next so it sees some testing. So it's too late for v6.7. Seems we forgot about this series. Sorry about that. --- Applied to the vfs.casefold branch of the vfs/vfs.git tree. Patches in the vfs.casefold branch should appear in linux-next soon. Please report any outstanding bugs that were missed during review in a new review to the original patch series allowing us to drop it. It's encouraged to provide Acked-bys and Reviewed-bys even though the patch has now been applied. If possible patch trailers will be updated. Note that commit hashes shown below are subject to change due to rebase, trailer updates or similar. If in doubt, please check the listed branch. tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git branch: vfs.casefold [1/9] ecryptfs: Reject casefold directory inodes https://git.kernel.org/vfs/vfs/c/8512e7c7e665 [2/9] 9p: Split ->weak_revalidate from ->revalidate https://git.kernel.org/vfs/vfs/c/17f4423cb24a [3/9] fs: Expose name under lookup to d_revalidate hooks https://git.kernel.org/vfs/vfs/c/24084e50e579 [4/9] fs: Add DCACHE_CASEFOLDED_NAME flag https://git.kernel.org/vfs/vfs/c/2daa2df800f8 [5/9] libfs: Validate negative dentries in case-insensitive directories https://git.kernel.org/vfs/vfs/c/8d879ccaf0f7 [6/9] libfs: Chain encryption checks after case-insensitive revalidation https://git.kernel.org/vfs/vfs/c/314e925d5a2c [7/9] libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops https://git.kernel.org/vfs/vfs/c/07f820b77c58 [8/9] ext4: Enable negative dentries on case-insensitive lookup https://git.kernel.org/vfs/vfs/c/2562ec77f11e [9/9] f2fs: Enable negative dentries on case-insensitive lookup https://git.kernel.org/vfs/vfs/c/39d2dd36a065
Christian Brauner <brauner@kernel.org> writes: > On Wed, 16 Aug 2023 01:07:54 -0400, Gabriel Krisman Bertazi wrote: >> This is v6 of the negative dentry on case-insensitive directories. >> Thanks Eric for the review of the last iteration. This version >> drops the patch to expose the helper to check casefolding directories, >> since it is not necessary in ecryptfs and it might be going away. It >> also addresses some documentation details, fix a build bot error and >> simplifies the commit messages. See the changelog in each patch for >> more details. >> >> [...] > > Ok, let's put it into -next so it sees some testing. > So it's too late for v6.7. Seems we forgot about this series. > Sorry about that. Christian, We are approaching -rc2 and, until last Friday, it didn't shown up in linux-next. So, to avoid turning a 6 month delay into 9 months, I pushed your signed tag to linux-next myself. That obviously uncovered a merge conflict: in v6.6, ceph added fscrypt, and the caller had to be updated. I fixed it and pushed again to linux-next to get more testing. Now, I don't want to send it to Linus myself. This is 100% VFS/FS code, I'm not the maintainer and it will definitely raise eyebrows. Can you please requeue and make sure it goes through this time? I'm happy to drop my branch from linux-next once yours shows up. https://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git/log/?h=negative-dentries This branch has the latest version with the ceph conflict folded in. I did it this way because I'd consider it was never picked up and there is no point in making the history complex by adding a fix on top of your signed tag, since it already fails to build ceph. I can send it as a v7; but I prefer you just pull from the branch above. Or you can ack and I'll send to Linus. This is the diff from you signed tag: diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index 629d8fb31d8f..21278a9d9baa 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -1869,7 +1869,7 @@ static int ceph_d_revalidate(struct dentry *dentry, const struct qstr *name, struct inode *dir, *inode; struct ceph_mds_client *mdsc; - valid = fscrypt_d_revalidate(dentry, flags); + valid = fscrypt_d_revalidate(dentry, name, flags); if (valid <= 0) return valid; diff --git a/fs/ecryptfs/dentry.c b/fs/ecryptfs/dentry.c index 56093648d838..ce86891a1711 100644 --- a/fs/ecryptfs/dentry.c +++ b/fs/ecryptfs/dentry.c @@ -18,6 +18,7 @@ /** * ecryptfs_d_revalidate - revalidate an ecryptfs dentry * @dentry: The ecryptfs dentry + * @name: The name under lookup * @flags: lookup flags * * Called when the VFS needs to revalidate a dentry. This diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c index 3dd93d36aaf2..5e4910e016a8 100644 --- a/fs/gfs2/dentry.c +++ b/fs/gfs2/dentry.c @@ -22,6 +22,7 @@ /** * gfs2_drevalidate - Check directory lookup consistency * @dentry: the mapping to check + * @name: The name under lookup * @flags: lookup flags * * Check to make sure the lookup necessary to arrive at this inode from its
On Sun, Nov 19, 2023 at 06:11:39PM -0500, Gabriel Krisman Bertazi wrote: > Christian Brauner <brauner@kernel.org> writes: > > > On Wed, 16 Aug 2023 01:07:54 -0400, Gabriel Krisman Bertazi wrote: > >> This is v6 of the negative dentry on case-insensitive directories. > >> Thanks Eric for the review of the last iteration. This version > >> drops the patch to expose the helper to check casefolding directories, > >> since it is not necessary in ecryptfs and it might be going away. It > >> also addresses some documentation details, fix a build bot error and > >> simplifies the commit messages. See the changelog in each patch for > >> more details. > >> > >> [...] > > > > Ok, let's put it into -next so it sees some testing. > > So it's too late for v6.7. Seems we forgot about this series. > > Sorry about that. > > Christian, > > We are approaching -rc2 and, until last Friday, it didn't shown up in > linux-next. So, to avoid turning a 6 month delay into 9 months, I pushed > your signed tag to linux-next myself. > > That obviously uncovered a merge conflict: in v6.6, ceph added fscrypt, > and the caller had to be updated. I fixed it and pushed again to > linux-next to get more testing. > > Now, I don't want to send it to Linus myself. This is 100% VFS/FS code, > I'm not the maintainer and it will definitely raise eyebrows. Can you > please requeue and make sure it goes through this time? I'm happy to My current understanding is that core dcache stuff is usually handled by Al. And he's got a dcache branches sitting in his tree. So this isn't me ignoring you in any way. My hands are tied and so I can't sort this out for you easily. > drop my branch from linux-next once yours shows up. > > https://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git/log/?h=negative-dentries > > This branch has the latest version with the ceph conflict folded in. I > did it this way because I'd consider it was never picked up and there is > no point in making the history complex by adding a fix on top of your > signed tag, since it already fails to build ceph. > > I can send it as a v7; but I prefer you just pull from the branch > above. Or you can ack and I'll send to Linus.
Christian Brauner <brauner@kernel.org> writes: > On Sun, Nov 19, 2023 at 06:11:39PM -0500, Gabriel Krisman Bertazi wrote: >> Christian Brauner <brauner@kernel.org> writes: >> >> > On Wed, 16 Aug 2023 01:07:54 -0400, Gabriel Krisman Bertazi wrote: >> >> This is v6 of the negative dentry on case-insensitive directories. >> >> Thanks Eric for the review of the last iteration. This version >> >> drops the patch to expose the helper to check casefolding directories, >> >> since it is not necessary in ecryptfs and it might be going away. It >> >> also addresses some documentation details, fix a build bot error and >> >> simplifies the commit messages. See the changelog in each patch for >> >> more details. >> >> >> >> [...] >> > >> > Ok, let's put it into -next so it sees some testing. >> > So it's too late for v6.7. Seems we forgot about this series. >> > Sorry about that. >> >> Christian, >> >> We are approaching -rc2 and, until last Friday, it didn't shown up in >> linux-next. So, to avoid turning a 6 month delay into 9 months, I pushed >> your signed tag to linux-next myself. >> >> That obviously uncovered a merge conflict: in v6.6, ceph added fscrypt, >> and the caller had to be updated. I fixed it and pushed again to >> linux-next to get more testing. >> >> Now, I don't want to send it to Linus myself. This is 100% VFS/FS code, >> I'm not the maintainer and it will definitely raise eyebrows. Can you >> please requeue and make sure it goes through this time? I'm happy to > > My current understanding is that core dcache stuff is usually handled by > Al. And he's got a dcache branches sitting in his tree. > > So this isn't me ignoring you in any way. My hands are tied and so I > can't sort this out for you easily. Please don't take it personally, but you surely see how frustrating this is. While I appreciate your very prompt answer, this is very different from: https://lore.kernel.org/linux-fsdevel/20230821-derart-serienweise-3506611e576d@brauner/ https://lore.kernel.org/linux-fsdevel/20230822-denkmal-operette-f16d8bd815fc@brauner/ https://lore.kernel.org/linux-fsdevel/20231025-selektiert-leibarzt-5d0070d85d93@brauner/ Perhaps it all has a vfs-specific meaning. But the following suggests it was accepted and queued long ago: > Thanks! We're a bit too late for v6.6 with this given that this hasn't > even been in -next. So this will be up for v6.7. [...] > Ok, let's put it into -next so it sees some testing. [...] > It's encouraged to provide Acked-bys and Reviewed-bys even though the > patch has now been applied. [...] > Patches in the vfs.casefold branch should appear in linux-next soon. [...] Obviously, there are big issues with the process here. But I fail to see how I could have communicated clearer or where I didn't follow the process in this entire thread. The branches you mentioned are 10 days old. This patchset was "accepted" two months ago. As one of the VFS maintainer, can you send an acked-by - or at least a r-b in cases like this, if you agree with the patches? Then it makes more sense for me to send to Linus directly. Viro, You are CC'ed since early 2022. Can you comment? Ted and Eric reviewed, Christian too. there's been only small changes since the first RFC. I'll ask Linus to pull from the unicode tree in the next merge window if I don't hear from you.
On Mon, 20 Nov 2023 at 07:06, Christian Brauner <brauner@kernel.org> wrote: > > My current understanding is that core dcache stuff is usually handled by > Al. And he's got a dcache branches sitting in his tree. > > So this isn't me ignoring you in any way. My hands are tied and so I > can't sort this out for you easily. Well, we all know - very much including Al - that Al isn't always the most responsive person, and tends to have his own ratholes that he dives deep into. The good news is that I do know the dcache code pretty well, and while I really would like Al to deal with any locking issues (because "pretty well" is not "as well as Al Viro"), for just about any other issue I'll happily take pulls from you. I dislike case folding with a passion - it's about the worst design decision a filesystem can ever do - but the other side of that is that if you have to have case folding, the last thing you want to do is to have each filesystem deal with that sh*t-for-brains decision itself. So moving more support for case folding into the VFS so that the horrid thing at least gets better support is something I'm perfectly fine with despite my dislike of it. Of course, "do it in shared generic code" doesn't tend to really fix the braindamage, but at least it's now shared braindamage and not spread out all over. I'm looking at things like generic_ci_d_compare(), and it hurts to see the mindless "let's do lookups and compares one utf8 character at a time". What a disgrace. Somebody either *really* didn't care, or was a Unicode person who didn't understand the point of UTF-8. Oh well. I guess people went "this is going to suck anyway, so let's make sure it *really* sucks". The patches look fine to me. Al - do you even care about them? Linus
On Mon, Nov 20, 2023 at 10:07:51AM -0800, Linus Torvalds wrote: > Of course, "do it in shared generic code" doesn't tend to really fix > the braindamage, but at least it's now shared braindamage and not > spread out all over. I'm looking at things like > generic_ci_d_compare(), and it hurts to see the mindless "let's do > lookups and compares one utf8 character at a time". What a disgrace. > Somebody either *really* didn't care, or was a Unicode person who > didn't understand the point of UTF-8. This isn't because of case-folding brain damage, but rather Unicode brain damage. We compare one character at a time because it's possible for some character like é to either be encoded as 0x0089 (aka "Latin Small Letter E with Acute") OR as 0x0065 0x0301 ("Latin Small Letter E" plus "Combining Acute Accent"). Typically, we pretend that UTF-8 means that we can just encode é, or 0x0089 as 0xC3 0xA9 and then call it a day and just use strcmp(3) on the sucker. But Unicode is a lot more insane than that. Technically, 0x65 0xCC 0x81 is the same character as 0xC3 0xA9. > Oh well. I guess people went "this is going to suck anyway, so let's > make sure it *really* sucks". It's more like, "this is going to suck, but if it's going to suck anyway, let's implement the full Unicode spec in all its gory^H^H^H^H glory, whether or not it's sane". - Ted
On Mon, Nov 20, 2023 at 10:07:51AM -0800, Linus Torvalds wrote: > Well, we all know - very much including Al - that Al isn't always the > most responsive person, and tends to have his own ratholes that he > dives deep into. FWIW, the right now I'm getting out of one of those - rename() deadlocks fun. Will post that, assuming it survives local beating, then post the dcache stuff and other branches (all rebased by now). Other ratholes - d_name/d_parent audit is about halfway through - we do have fun shite in quite a few ->d_revalidate() instances (UAF, etc.); RCU pathwalk methods need to be re-audited; the fixes caught in the last cycle are either in mainline by now, or rebased. But that stuff is relatively easy to suspend for a few days. > Of course, "do it in shared generic code" doesn't tend to really fix > the braindamage, but at least it's now shared braindamage and not > spread out all over. I'm looking at things like > generic_ci_d_compare(), and it hurts to see the mindless "let's do > lookups and compares one utf8 character at a time". What a disgrace. > Somebody either *really* didn't care, or was a Unicode person who > didn't understand the point of UTF-8. > > Oh well. I guess people went "this is going to suck anyway, so let's > make sure it *really* sucks". > > The patches look fine to me. Al - do you even care about them? I will review that series; my impression from the previous iterations had been fairly unpleasant, TBH, but I hadn't rechecked since April or so. Re dcache conflicts - see #untested-pile-dcache; most the dcache-affecting stuff should be there. One intersting exception is the general helper for safely picking parent/parent's inode/entry name for ->d_revalidate() use; we have the parent-related part boilerplated, but the name side tends to be missing. That's still being tweaked, looking for sane calling conventions.
On Mon, 20 Nov 2023 at 18:03, Theodore Ts'o <tytso@mit.edu> wrote: > > On Mon, Nov 20, 2023 at 10:07:51AM -0800, Linus Torvalds wrote: > > I'm looking at things like > > generic_ci_d_compare(), and it hurts to see the mindless "let's do > > lookups and compares one utf8 character at a time". What a disgrace. > > Somebody either *really* didn't care, or was a Unicode person who > > didn't understand the point of UTF-8. > > This isn't because of case-folding brain damage, but rather Unicode > brain damage. No, it really is just stupidity and horribleness. The thing is, when you check two strings for equality, the FIRST THING you should do is to just compare them for exactly that: equality. And no, the way you do that is not by checking each unicode character one by one. You do it by just doing a regular memcmp. In fact, you can do even better than that: while at it, check whether (a) all bytes are equal in everything but bit#5 (b) none of the bytes have the high bit set and you have now narrowed down things in a big way. You can do these things trivially one whole word at a time, and you'll handle 99% of all input without EVER doing any Unicode garbage AT ALL. Yes, yes, if you actually have complex characters, you end up having to deal with that mess. But no, that is *not* an excuse for saying "all characters are complex". So no. There is absolutely zero excuse for doing stupid things, except for "nobody has ever cared, because case folding is so stupid to begin with that people just expect it to perform horribly badly". End result: - generic_ci_d_compare() should *not* consider the memcmp() to be a "fall back to this for non-casefolded". You should start with that, and if the bytes are equal then the strings are equal. End of story. - if the bytes are not equal, then the strings *might* still compare equal if it's a casefolded directory. - but EVEN THEN you shouldn't fall back to actually doing UTF-8 decoding unless you saw the high bit being set at some point. - and if they different in anything but bit #5 and you didn't see the high bit, you know they are different. It's a bit complicated, yes. But no, doing things one unicode character at a time is just bad bad bad. Linus
On Mon, 20 Nov 2023 at 18:29, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It's a bit complicated, yes. But no, doing things one unicode > character at a time is just bad bad bad. Put another way: the _point_ of UTF-8 is that ASCII is still ASCII. It's literally why UTF-8 doesn't suck. So you can still compare ASCII strings as-is. No, that doesn't help people who are really using other locales, and are actively using complicated characters. But it very much does mean that you can compare "Bad" and "bad" and never ever look at any unicode translation ever. In a perfect world, you'd use all the complicated DCACHE_WORD_ACCESS stuff that can do all of this one word at a time. But even if you end up doing the rules just one byte at a time, it means that you can deal with the common cases without "unicode cursors" or function calls to extract unicode characters, or anything like that. You can still treat things as bytes. So the top of generic_ci_d_compare() should probably be something trivial like this: const char *ct = name.name; unsigned int tcount = name.len; /* Handle the exact equality quickly */ if (len == tcount && !dentry_string_cmp(str, ct, tcount)) return 0; because byte-wise equality is equality even if high bits are set. After that, it should probably do something like /* Not byte-identical, but maybe igncase identical in ASCII */ do { unsigned char a, b; /* Dentry name byte */ a = *str; /* If that's NUL, the qstr needs to be done too! */ if (!a) return !!tcount; /* Alternatively, if the qstr is done, it needed to be NUL */ if (!tcount) return 1; b = *ct; if ((a | b) & 0x80) break; if (a != b) { /* Quick "not same" igncase ASCII */ if ((a ^ b) & ~32) return 1; a &= ~32; if (a < 'A' || a > 'Z') return 1; } /* Ok, same ASCII, bytefolded, go to next */ str++; ct++; tcount--; len--; } and only after THAT should it do the utf name comparison (and only on the remaining parts, since the above will have checked for common ASCII beginnings). And the above was obviously never tested, and written in the MUA, and may be completely wrong in all the details, but you get the idea. Deal with the usual cases first. Do the full unicode only when you absolutely have to. Linus
On Mon, Nov 20, 2023 at 07:03:13PM -0800, Linus Torvalds wrote: > On Mon, 20 Nov 2023 at 18:29, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > It's a bit complicated, yes. But no, doing things one unicode > > character at a time is just bad bad bad. > > Put another way: the _point_ of UTF-8 is that ASCII is still ASCII. > It's literally why UTF-8 doesn't suck. > > So you can still compare ASCII strings as-is. > > No, that doesn't help people who are really using other locales, and > are actively using complicated characters. > > But it very much does mean that you can compare "Bad" and "bad" and > never ever look at any unicode translation ever. Yeah, agreed, that would be a nice optimization. However, in the unfortunate case where (a) it's non-ASCII, and (b) the input string is non-normalized and/or differs in case, we end up scanning some portion of the two strings twice; once doing the strcmp, and once doing the Unicode slow path. That being said, given that even in the case where we're dealing with non-ASCII strings, in the fairly common case where the program is doing a readdir() followed by a open() or stat(), the filename will be byte-identical and so a strcmp() will suffice. So I agree that it's a nice optimization. It'd be interesting how much such an optimization would actually show up in various benchmarks. It'd have to be something that was really metadata-heavy, or else the filenamea lookups would get drowned out. - Ted
Linus Torvalds <torvalds@linux-foundation.org> writes: > I dislike case folding with a passion - it's about the worst design > decision a filesystem can ever do - but the other side of that is that > if you have to have case folding, the last thing you want to do is to > have each filesystem deal with that sh*t-for-brains decision itself. Thanks for pitching in. We all agree it is a horrible feature that we support for very specific use cases. I'm the one who added the code, yes, but I was never under the illusion it's a good feature. It solely exists to let Linux handle the bad decisions made elsewhere. > So moving more support for case folding into the VFS so that the > horrid thing at least gets better support is something I'm perfectly > fine with despite my dislike of it. Yes. The entire implementation was meant to stay as far away as possible from the fast lookup path (didn't want to displease Viro). The negative dentry is the only exception that required more changes to vfs to address the issue he found of dangling negative dentries when turning a directory case-insensitive. But, fyi, there is work in progress to add support to more filesystems. This is why I really want to get all of this done first. There is a use case to enable it in shmem because of containerized environments running wine; I was recently cc'ed on a bcachefs implementation; and there are people working on adding it to btrfs (to support wine in specific products). > Of course, "do it in shared generic code" doesn't tend to really fix > the braindamage, but at least it's now shared braindamage and not > spread out all over. I'm looking at things like > generic_ci_d_compare(), and it hurts to see the mindless "let's do > lookups and compares one utf8 character at a time". What a disgrace. > Somebody either *really* didn't care, or was a Unicode person who > didn't understand the point of UTF-8. Yes. I saw the rest of the thread and you are obviously correct here. It needs to be fixed. I will follow up with patches. > The patches look fine to me. Al - do you even care about them? I saw that Al Viro answered. Thank you, Al. So I'll wait for either his review or the merge window. Thanks,
On Tue, Nov 21, 2023 at 12:12:15AM -0500, Theodore Ts'o wrote: > Yeah, agreed, that would be a nice optimization. However, in the > unfortunate case where (a) it's non-ASCII, and (b) the input string is > non-normalized and/or differs in case, we end up scanning some portion > of the two strings twice; once doing the strcmp, and once doing the > Unicode slow path. The first pass is cheap and the second one will be running entirely from the cache, so I doubt that scanning them twice will be a loss...
On Tue, Nov 21, 2023 at 02:27:34AM +0000, Al Viro wrote: > I will review that series; my impression from the previous iterations > had been fairly unpleasant, TBH, but I hadn't rechecked since April > or so. The serious gap, AFAICS, is the interplay with open-by-fhandle. It's not unfixable, but we need to figure out what to do when lookup runs into a disconnected directory alias. d_splice_alias() will move it in place, all right, but any state ->lookup() has hung off the dentry that had been passed to it will be lost. And I seriously suspect that we want to combine that state propagation with d_splice_alias() (or its variant to be used in such cases), rather than fixing the things up afterwards. In particular, propagating ->d_op is really not trivial at that point; it is safe to do to ->lookup() argument prior to d_splice_alias() (even though that's too subtle and brittle, IMO), but after d_splice_alias() has succeeded, the damn thing is live and can be hit by hash lookups, revalidate, etc. The only things that can't happen to it are ->d_delete(), ->d_prune(), ->d_iput() and ->d_init(). Everything else is fair game. And then there's an interesting question about the interplay with reparenting. It's OK to return an error rather than reparent, but we need some way to tell if we need to do so.
On Wed, 22 Nov 2023 at 13:19, Al Viro <viro@zeniv.linux.org.uk> wrote: > > The serious gap, AFAICS, is the interplay with open-by-fhandle. So I'm obviously not a fan of igncase filesystems, but I don't think this series actually changes any of that. > It's not unfixable, but we need to figure out what to do when > lookup runs into a disconnected directory alias. d_splice_alias() > will move it in place, all right, but any state ->lookup() has > hung off the dentry that had been passed to it will be lost. I guess this migth be about the new DCACHE_CASEFOLDED_NAME bit. At least for now, that is only used by generic_ci_d_revalidate() for negative dentries, so it shouldn't matter for that d_splice_alias() that only does positive dentries. No? Or is there something else you worry about? Side note: Gabriel, as things are now, instead of that if (!d_is_casefolded_name(dentry)) return 0; in generic_ci_d_revalidate(), I would suggest that any time a directory is turned into a case-folded one, you'd just walk all the dentries for that directory and invalidate negative ones at that point. Or was there some reason I missed that made it a good idea to do it at run-time after-the-fact? Linus
On Wed, Nov 22, 2023 at 09:19:01PM +0000, Al Viro wrote: > On Tue, Nov 21, 2023 at 02:27:34AM +0000, Al Viro wrote: > > > I will review that series; my impression from the previous iterations > > had been fairly unpleasant, TBH, but I hadn't rechecked since April > > or so. > > The serious gap, AFAICS, is the interplay with open-by-fhandle. > It's not unfixable, but we need to figure out what to do when > lookup runs into a disconnected directory alias. d_splice_alias() > will move it in place, all right, but any state ->lookup() has > hung off the dentry that had been passed to it will be lost. > > And I seriously suspect that we want to combine that state > propagation with d_splice_alias() (or its variant to be used in > such cases), rather than fixing the things up afterwards. > > In particular, propagating ->d_op is really not trivial at that > point; it is safe to do to ->lookup() argument prior to d_splice_alias() > (even though that's too subtle and brittle, IMO), but after > d_splice_alias() has succeeded, the damn thing is live and can > be hit by hash lookups, revalidate, etc. > > The only things that can't happen to it are ->d_delete(), ->d_prune(), > ->d_iput() and ->d_init(). Everything else is fair game. > > And then there's an interesting question about the interplay with > reparenting. It's OK to return an error rather than reparent, > but we need some way to tell if we need to do so. Hmm... int (*d_transfer)(struct dentry *alias, struct dentry *new)? Called if d_splice_alias() picks that sucker, under rename_lock, before the call of __d_move(). Can check IS_ROOT(alias) (due to rename_lock), so can tell attaching from reparenting, returning an error - failed d_splice_alias(). Perhaps would be even better inside __d_move(), once all ->d_lock are taken... Turn the current bool exchange in there into honest enum (exchange/move/splice) and call ->d_transfer() on splice. In case of failure it's still not too late to back out - __d_move() would return an int, ignored in d_move() and d_exchange() and treated as "fail in unlikely case it's non-zero" in d_splice_alias() and __d_unalias()... Comments? Note that e.g. res = d_splice_alias(inode, dentry); if (!IS_ERR(fid)) { if (!res) v9fs_fid_add(dentry, &fid); else if (!IS_ERR(res)) v9fs_fid_add(res, &fid); else p9_fid_put(fid); } in 9p ->lookup() would turn into v9fs_fid_add(dentry, &fid); return d_splice_alias(inode, dentry); with ->d_transfer(alias, new) being simply struct hlist_node *p = new->d_fsdata; hlist_del_init(p); __add_fid(alias, hlist_entry(p, struct p9_fid, dlist)); return 0; assuming the call from __d_move()...
On Thu, Nov 23, 2023 at 01:12:08AM +0000, Al Viro wrote: > On Wed, Nov 22, 2023 at 09:19:01PM +0000, Al Viro wrote: > > On Tue, Nov 21, 2023 at 02:27:34AM +0000, Al Viro wrote: > > > > > I will review that series; my impression from the previous iterations > > > had been fairly unpleasant, TBH, but I hadn't rechecked since April > > > or so. > > > > The serious gap, AFAICS, is the interplay with open-by-fhandle. > > It's not unfixable, but we need to figure out what to do when > > lookup runs into a disconnected directory alias. d_splice_alias() > > will move it in place, all right, but any state ->lookup() has > > hung off the dentry that had been passed to it will be lost. > > > > And I seriously suspect that we want to combine that state > > propagation with d_splice_alias() (or its variant to be used in > > such cases), rather than fixing the things up afterwards. > > > > In particular, propagating ->d_op is really not trivial at that > > point; it is safe to do to ->lookup() argument prior to d_splice_alias() > > (even though that's too subtle and brittle, IMO), but after > > d_splice_alias() has succeeded, the damn thing is live and can > > be hit by hash lookups, revalidate, etc. > > > > The only things that can't happen to it are ->d_delete(), ->d_prune(), > > ->d_iput() and ->d_init(). Everything else is fair game. > > > > And then there's an interesting question about the interplay with > > reparenting. It's OK to return an error rather than reparent, > > but we need some way to tell if we need to do so. > > Hmm... int (*d_transfer)(struct dentry *alias, struct dentry *new)? > Called if d_splice_alias() picks that sucker, under rename_lock, > before the call of __d_move(). Can check IS_ROOT(alias) (due to > rename_lock), so can tell attaching from reparenting, returning > an error - failed d_splice_alias(). > > Perhaps would be even better inside __d_move(), once all ->d_lock > are taken... Turn the current bool exchange in there into honest > enum (exchange/move/splice) and call ->d_transfer() on splice. > In case of failure it's still not too late to back out - __d_move() > would return an int, ignored in d_move() and d_exchange() and > treated as "fail in unlikely case it's non-zero" in d_splice_alias() > and __d_unalias()... > > Comments? Note that e.g. > res = d_splice_alias(inode, dentry); > if (!IS_ERR(fid)) { > if (!res) > v9fs_fid_add(dentry, &fid); > else if (!IS_ERR(res)) > v9fs_fid_add(res, &fid); > else > p9_fid_put(fid); > } > > in 9p ->lookup() would turn into > > v9fs_fid_add(dentry, &fid); > return d_splice_alias(inode, dentry); > > with ->d_transfer(alias, new) being simply > > struct hlist_node *p = new->d_fsdata; > hlist_del_init(p); > __add_fid(alias, hlist_entry(p, struct p9_fid, dlist)); > return 0; > > assuming the call from __d_move()... Incidentally, 9p and this one would not be the only places that could use it - affs - alias->d_fsdata = new->d_fsdata afs - ditto ocfs2 - smells like another possible benefitiary (attaching locks, etc. would be saner if done before d_splice_alias(), with ->d_transfer() moving the lock to the alias)...
On Wed, Nov 22, 2023 at 04:18:56PM -0800, Linus Torvalds wrote: > On Wed, 22 Nov 2023 at 13:19, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > The serious gap, AFAICS, is the interplay with open-by-fhandle. > > So I'm obviously not a fan of igncase filesystems, but I don't think > this series actually changes any of that. > > > It's not unfixable, but we need to figure out what to do when > > lookup runs into a disconnected directory alias. d_splice_alias() > > will move it in place, all right, but any state ->lookup() has > > hung off the dentry that had been passed to it will be lost. > > I guess this migth be about the new DCACHE_CASEFOLDED_NAME bit. > > At least for now, that is only used by generic_ci_d_revalidate() for > negative dentries, so it shouldn't matter for that d_splice_alias() > that only does positive dentries. No? > > Or is there something else you worry about? Dentries created by d_obtain_alias() will never go anywhere near generic_set_encrypted_ci_d_ops(). They do *not* get ->d_op set that way. When ext4_lookup() does a lookup in c-i directory it does have ->d_op set on dentry it got from the caller. Which is promptly discarded when d_splice_alias() finds a preexisting alias for it. Positive dentries eventually become negative; not invalidating them when that happens is a large part of the point of this series. ->d_revalidate() is taught to check if they are marked with that bit, but... they won't have that ->d_revalidate() associated with them, will they? ->d_hash() and ->d_compare() come from the parent, but ->d_revalidate() comes from dentry itself. In other words, generic_ci_d_revalidate() won't see the lack of that bit on dentry, etc. - it won't be called for that dentry in the first place.
Linus Torvalds <torvalds@linux-foundation.org> writes: > Side note: Gabriel, as things are now, instead of that > > if (!d_is_casefolded_name(dentry)) > return 0; > > in generic_ci_d_revalidate(), I would suggest that any time a > directory is turned into a case-folded one, you'd just walk all the > dentries for that directory and invalidate negative ones at that > point. Or was there some reason I missed that made it a good idea to > do it at run-time after-the-fact? > The problem I found with that approach, which I originally tried, was preventing concurrent lookups from racing with the invalidation and creating more 'case-sensitive' negative dentries. Did I miss a way to synchronize with concurrent lookups of the children of the dentry? We can trivially ensure the dentry doesn't have positive children by holding the parent lock, but that doesn't protect from concurrent lookups creating negative dentries, as far as I understand.
On Thu, 23 Nov 2023 at 07:57, Gabriel Krisman Bertazi <gabriel@krisman.be> wrote: > > The problem I found with that approach, which I originally tried, was > preventing concurrent lookups from racing with the invalidation and > creating more 'case-sensitive' negative dentries. Did I miss a way to > synchronize with concurrent lookups of the children of the dentry? We > can trivially ensure the dentry doesn't have positive children by > holding the parent lock, but that doesn't protect from concurrent > lookups creating negative dentries, as far as I understand. I'd just set the "casefolded" bit, then do a RCU grace period wait, and then invalidate all old negative dentries. Sure, there's technically a window there where somebody could hit an existing negative dentry that matches a casefolded name after casefolded has been set (but before the invalidation) and the lookup would result in a "does not exist" lookup that way. But that seems no different from the lookup having been done before the casefolded bit got set, so I don't think that's an _actual_ difference. If you do a lookup concurrently with the directory being set casefolded, you get one or the other. And no, I haven't thought about this a ton, but it seems the obvious thing to do. Temporary stale negative dentries while the casefolded bit is in the process of being set seems a harmless thing, exactly because they would seem to be the same thing as if the lookup was done before... And yes, that "wait for RCU grace period" is a somewhat slow operation, but how often do the casefolded bits get changed? This is not a huge deal. I don't hate your approach, I just found it surprising. Linus
On Thu, Nov 23, 2023 at 10:57:22AM -0500, Gabriel Krisman Bertazi wrote: > Linus Torvalds <torvalds@linux-foundation.org> writes: > > > Side note: Gabriel, as things are now, instead of that > > > > if (!d_is_casefolded_name(dentry)) > > return 0; > > > > in generic_ci_d_revalidate(), I would suggest that any time a > > directory is turned into a case-folded one, you'd just walk all the > > dentries for that directory and invalidate negative ones at that > > point. Or was there some reason I missed that made it a good idea to > > do it at run-time after-the-fact? > > > > The problem I found with that approach, which I originally tried, was > preventing concurrent lookups from racing with the invalidation and > creating more 'case-sensitive' negative dentries. Did I miss a way to > synchronize with concurrent lookups of the children of the dentry? We > can trivially ensure the dentry doesn't have positive children by > holding the parent lock, but that doesn't protect from concurrent > lookups creating negative dentries, as far as I understand. AFAICS, there is a problem with dentries that never came through ->lookup(). Unless I'm completely misreading your code, your generic_ci_d_revalidate() is not called for them. Ever. Hash lookups are controlled by ->d_op of parent; that's where ->d_hash() and ->d_compare() come from. Revalidate comes from *child*. You need ->d_op->d_revalidate of child dentry to be set to your generic_ci_d_revalidate(). The place where it gets set is generic_set_encrypted_ci_d_ops(). Look at its callchain; in case of ext4 it gets called from ext4_lookup_dentry(), which is called from ext4_lookup(). And dentry passed to it is the argument of ->lookup(). Now take a look at open-by-fhandle stuff; all methods in there (->fh_to_dentry(), ->fh_to_parent(), ->get_parent()) end up returning d_obtain_alias(some inode). We *do* call ->lookup(), all right - in reconnect_one(), while trying to connect those suckers with the main tree. But the way it works is that d_splice_alias() in ext4_lookup() moves the existing alias for subdirectory, connecting it to the parent. That's not the dentry ext4_lookup() had set ->d_op on - that's the dentry that came from d_obtain_alias(). And those do not have ->d_op set by anything in your tree. That's the problem I'd been talking about - there is a class of situations where the work done by ext4_lookup() to set the state of dentry gets completely lost. After lookup you do have a dentry in the right place, with the right name and inode, etc., but with NULL ->d_op->d_revalidate.
Al Viro <viro@zeniv.linux.org.uk> writes: > On Thu, Nov 23, 2023 at 10:57:22AM -0500, Gabriel Krisman Bertazi wrote: >> Linus Torvalds <torvalds@linux-foundation.org> writes: >> >> > Side note: Gabriel, as things are now, instead of that >> > >> > if (!d_is_casefolded_name(dentry)) >> > return 0; >> > >> > in generic_ci_d_revalidate(), I would suggest that any time a >> > directory is turned into a case-folded one, you'd just walk all the >> > dentries for that directory and invalidate negative ones at that >> > point. Or was there some reason I missed that made it a good idea to >> > do it at run-time after-the-fact? >> > >> >> The problem I found with that approach, which I originally tried, was >> preventing concurrent lookups from racing with the invalidation and >> creating more 'case-sensitive' negative dentries. Did I miss a way to >> synchronize with concurrent lookups of the children of the dentry? We >> can trivially ensure the dentry doesn't have positive children by >> holding the parent lock, but that doesn't protect from concurrent >> lookups creating negative dentries, as far as I understand. > > AFAICS, there is a problem with dentries that never came through > ->lookup(). Unless I'm completely misreading your code, your > generic_ci_d_revalidate() is not called for them. Ever. > > Hash lookups are controlled by ->d_op of parent; that's where ->d_hash() > and ->d_compare() come from. Revalidate comes from *child*. You need > ->d_op->d_revalidate of child dentry to be set to your generic_ci_d_revalidate(). > > The place where it gets set is generic_set_encrypted_ci_d_ops(). Look > at its callchain; in case of ext4 it gets called from ext4_lookup_dentry(), > which is called from ext4_lookup(). And dentry passed to it is the > argument of ->lookup(). > > Now take a look at open-by-fhandle stuff; all methods in there > (->fh_to_dentry(), ->fh_to_parent(), ->get_parent()) end up > returning d_obtain_alias(some inode). > > We *do* call ->lookup(), all right - in reconnect_one(), while > trying to connect those suckers with the main tree. But the way > it works is that d_splice_alias() in ext4_lookup() moves the > existing alias for subdirectory, connecting it to the parent. > That's not the dentry ext4_lookup() had set ->d_op on - that's > the dentry that came from d_obtain_alias(). And those do not > have ->d_op set by anything in your tree. > > That's the problem I'd been talking about - there is a class of situations > where the work done by ext4_lookup() to set the state of dentry gets > completely lost. After lookup you do have a dentry in the right place, > with the right name and inode, etc., but with NULL > ->d_op->d_revalidate. I get the problem now. I admit to not understanding all the details yet, which is why I haven't answered directly, but I understand already how it can get borked. I'm studying your explanation. Originally, ->d_op could be propagated trivially since we had sb->s_d_op set, which would be set by __d_alloc, but that is no longer the case since we combined fscrypt and CI support. What I still don't understand is why we shouldn't fixup ->d_op when calling d_obtain_alias (before __d_instantiate_anon) and you say we better do it in d_splice_alias. The ->d_op is going to be the same across the filesystem when the casefold feature is enabled, regardless if the directory is casefolded. If we set it there, the alias already has the right d_op from the start.
On Thu, Nov 23, 2023 at 12:37:43PM -0500, Gabriel Krisman Bertazi wrote: > > That's the problem I'd been talking about - there is a class of situations > > where the work done by ext4_lookup() to set the state of dentry gets > > completely lost. After lookup you do have a dentry in the right place, > > with the right name and inode, etc., but with NULL > > ->d_op->d_revalidate. > > I get the problem now. I admit to not understanding all the details yet, > which is why I haven't answered directly, but I understand already how > it can get borked. I'm studying your explanation. > > Originally, ->d_op could be propagated trivially since we had sb->s_d_op > set, which would be set by __d_alloc, but that is no longer the case > since we combined fscrypt and CI support. > > What I still don't understand is why we shouldn't fixup ->d_op when > calling d_obtain_alias (before __d_instantiate_anon) and you say we > better do it in d_splice_alias. The ->d_op is going to be the same > across the filesystem when the casefold feature is enabled, regardless > if the directory is casefolded. If we set it there, the alias already > has the right d_op from the start. *blink* A paragraph above you've said that it's not constant over the entire filesystem. Look, it's really simple - any setup work of that sort done in ->lookup() is either misplaced, or should be somehow transferred over to the alias if one gets picked. As for d_obtain_alias()... AFAICS, it's far more limited in what information it could access. It knows the inode, but it has no idea about the parent to be. The more I look at that, the more it feels like we need a method that would tell the filesystem that this dentry is about to be spliced here. 9p is another place where it would obviously simplify the things; ocfs2 'attach lock' stuff is another case where the things get much more complicated by having to do that stuff after splicing, etc. It's not even hard to do: 1. turn bool exchange in __d_move() arguments into 3-value thing - move, exchange or splice. Have the callers in d_splice_alias() and __d_unalias() pass "splice" instead of false (aka normal move). 2. make __d_move() return an int (normally 0) 3. if asked to splice and if there's target->d_op->d_transfer(), let __d_move() call it right after spin_lock_nested(&dentry->d_lock, 2); spin_lock_nested(&target->d_lock, 3); in there. Passing it target and dentry, obviously. In unlikely case of getting a non-zero returned by the method, undo locks and return that value to __d_move() caller. 4. d_move() and d_exchange() would ignore the value returned by __d_move(); __d_unalias() turn __d_move(alias, dentry, false); ret = 0; into ret = __d_move(alias, dentry, Splice); d_splice_alias() turn __d_move(new, dentry, false); write_sequnlock(&rename_lock); into err = __d_move(new, dentry, Splice); write_sequnlock(&rename_lock); if (unlikely(err)) { dput(new); new = ERR_PTR(err); } (actually, dput()-on-error part would be common to all 3 branches in there, so it would probably get pulled out of that if-else if-else). I can cook a patch doing that (and convert the obvious beneficiaries already in the tree to it) and throw it into dcache branch - just need to massage the series in there for repost... PS: note, BTW, that fscrypt folks have already placed a hook into __d_move(), exactly for the case of splice; I wonder if that would be foldable into the same mechanism - hadn't looked in details yet.
Al Viro <viro@zeniv.linux.org.uk> writes: > On Thu, Nov 23, 2023 at 12:37:43PM -0500, Gabriel Krisman Bertazi wrote: >> > That's the problem I'd been talking about - there is a class of situations >> > where the work done by ext4_lookup() to set the state of dentry gets >> > completely lost. After lookup you do have a dentry in the right place, >> > with the right name and inode, etc., but with NULL >> > ->d_op->d_revalidate. >> >> I get the problem now. I admit to not understanding all the details yet, >> which is why I haven't answered directly, but I understand already how >> it can get borked. I'm studying your explanation. >> >> Originally, ->d_op could be propagated trivially since we had sb->s_d_op >> set, which would be set by __d_alloc, but that is no longer the case >> since we combined fscrypt and CI support. >> >> What I still don't understand is why we shouldn't fixup ->d_op when >> calling d_obtain_alias (before __d_instantiate_anon) and you say we >> better do it in d_splice_alias. The ->d_op is going to be the same >> across the filesystem when the casefold feature is enabled, regardless >> if the directory is casefolded. If we set it there, the alias already >> has the right d_op from the start. > > *blink* > > A paragraph above you've said that it's not constant over the entire > filesystem. The same ->d_op is used by every dentry in the filesystem if the superblock has the casefold bit enabled, regardless of whether a specific inode is casefolded or not. See generic_set_encrypted_ci_d_ops in my tree. It is called unconditionally by ext4_lookup and only checks the superblock: void generic_set_encrypted_ci_d_ops(struct dentry *dentry) { if (dentry->d_sb->s_encoding) { d_set_d_op(dentry, &generic_encrypted_ci_dentry_ops); return; } ... What I meant was that this used to be set once at sb->s_d_op, and propagated during dentry allocation. Therefore, the propagation to the alias would happen inside __d_alloc. Once we enabled fscrypt and casefold to work together, sb->s_d_op is NULL and we always set the same handler for every dentry during lookup. > Look, it's really simple - any setup work of that sort done in ->lookup() > is either misplaced, or should be somehow transferred over to the alias > if one gets picked. > > As for d_obtain_alias()... AFAICS, it's far more limited in what information > it could access. It knows the inode, but it has no idea about the parent > to be. Since it has the inode, d_obtain_alias has the superblock. I think that's all we need for generic_set_encrypted_ci_d_ops. > The more I look at that, the more it feels like we need a method that would > tell the filesystem that this dentry is about to be spliced here. 9p is > another place where it would obviously simplify the things; ocfs2 'attach > lock' stuff is another case where the things get much more complicated > by having to do that stuff after splicing, etc. > > It's not even hard to do: > > 1. turn bool exchange in __d_move() arguments into 3-value thing - move, > exchange or splice. Have the callers in d_splice_alias() and __d_unalias() > pass "splice" instead of false (aka normal move). > > 2. make __d_move() return an int (normally 0) > > 3. if asked to splice and if there's target->d_op->d_transfer(), let > __d_move() call it right after > spin_lock_nested(&dentry->d_lock, 2); > spin_lock_nested(&target->d_lock, 3); > in there. Passing it target and dentry, obviously. In unlikely case > of getting a non-zero returned by the method, undo locks and return > that value to __d_move() caller. > > 4. d_move() and d_exchange() would ignore the value returned by __d_move(); > __d_unalias() turn > __d_move(alias, dentry, false); > ret = 0; > into > ret = __d_move(alias, dentry, Splice); > d_splice_alias() turn > __d_move(new, dentry, false); > write_sequnlock(&rename_lock); > into > err = __d_move(new, dentry, Splice); > write_sequnlock(&rename_lock); > if (unlikely(err)) { > dput(new); > new = ERR_PTR(err); > } > (actually, dput()-on-error part would be common to all 3 branches > in there, so it would probably get pulled out of that if-else if-else). > > I can cook a patch doing that (and convert the obvious beneficiaries already > in the tree to it) and throw it into dcache branch - just need to massage > the series in there for repost... if you can write that, I'll definitely appreciate it. It will surely take me much longer to figure it out myself.
On Thu, Nov 23, 2023 at 02:06:39PM -0500, Gabriel Krisman Bertazi wrote: > > A paragraph above you've said that it's not constant over the entire > > filesystem. > > The same ->d_op is used by every dentry in the filesystem if the superblock > has the casefold bit enabled, regardless of whether a specific inode is > casefolded or not. See generic_set_encrypted_ci_d_ops in my tree. It is > called unconditionally by ext4_lookup and only checks the superblock: > > void generic_set_encrypted_ci_d_ops(struct dentry *dentry) > { > if (dentry->d_sb->s_encoding) { > d_set_d_op(dentry, &generic_encrypted_ci_dentry_ops); > return; > } > ... > > What I meant was that this used to be set once at sb->s_d_op, and > propagated during dentry allocation. Therefore, the propagation to the > alias would happen inside __d_alloc. Once we enabled fscrypt and > casefold to work together, sb->s_d_op is NULL Why? That's what I don't understand - if you really want it for all dentries on that filesystem, that's what ->s_d_op is for. If it is not, you have that problem, no matter which way you flip ->d_op value. > and we always set the same > handler for every dentry during lookup. Not every dentry goes through lookup - see upthread for details. > > Look, it's really simple - any setup work of that sort done in ->lookup() > > is either misplaced, or should be somehow transferred over to the alias > > if one gets picked. > > > > As for d_obtain_alias()... AFAICS, it's far more limited in what information > > it could access. It knows the inode, but it has no idea about the parent > > to be. > > Since it has the inode, d_obtain_alias has the superblock. I think that's all > we need for generic_set_encrypted_ci_d_ops. Huh? If it really depends only upon the superblock, just set it in ->s_d_op when you set the superblock up. Again, whatever setup you do for dentry in ->lookup(), you either * have a filesystem that never picks an existing directory alias (e.g. doesn't allow open-by-fhandle or has a very unusual implementation of related methods, like e.g. shmem), or * have that setup misplaced, in part that applies to all dentries out there (->s_d_op for universal ->d_op value, ->d_init() for uniform allocation of objects hanging from ->d_fsdata and other things like that), or * need to figure out how to transfer the result to alias (manually after d_splice_alias(), if races do not matter or using a new method explicitly for that), or * lose that state for aliases.
On Thu, Nov 23, 2023 at 07:53:27PM +0000, Al Viro wrote: > Huh? If it really depends only upon the superblock, just set it in ->s_d_op > when you set the superblock up. > > Again, whatever setup you do for dentry in ->lookup(), you either > * have a filesystem that never picks an existing directory alias > (e.g. doesn't allow open-by-fhandle or has a very unusual implementation > of related methods, like e.g. shmem), or > * have that setup misplaced, in part that applies to all dentries out > there (->s_d_op for universal ->d_op value, ->d_init() for uniform allocation > of objects hanging from ->d_fsdata and other things like that), or > * need to figure out how to transfer the result to alias (manually > after d_splice_alias(), if races do not matter or using a new method explicitly > for that), or > * lose that state for aliases. Note, BTW, that fscrypt tries to be very special in its handling of that stuff - see fscrypt_handle_d_move() thing and comments in front of its definition. Then take a look at the place where it's called. BTW, it looks like it's broken, since it discounts the possibility of splice caused by lookup on no-key name. You get DCACHE_NOKEY_NAME removed unconditionally there, no-key or not. It's not impossible that the boilerplate around the fscrypt_has_permitted_context() calls in fscrypt-enabled ->lookup() instances somehow prevents those, but if so, it's not obvious from the causual look.
On Thu, Nov 23, 2023 at 02:06:39PM -0500, Gabriel Krisman Bertazi wrote: > > > > 4. d_move() and d_exchange() would ignore the value returned by __d_move(); > > __d_unalias() turn > > __d_move(alias, dentry, false); > > ret = 0; > > into > > ret = __d_move(alias, dentry, Splice); > > d_splice_alias() turn > > __d_move(new, dentry, false); > > write_sequnlock(&rename_lock); > > into > > err = __d_move(new, dentry, Splice); > > write_sequnlock(&rename_lock); > > if (unlikely(err)) { > > dput(new); > > new = ERR_PTR(err); > > } > > (actually, dput()-on-error part would be common to all 3 branches > > in there, so it would probably get pulled out of that if-else if-else). > > > > I can cook a patch doing that (and convert the obvious beneficiaries already > > in the tree to it) and throw it into dcache branch - just need to massage > > the series in there for repost... > > if you can write that, I'll definitely appreciate it. It will surely > take me much longer to figure it out myself. Speaking of other stuff in the series - passing the expected name to ->d_revalidate() is definitely the right thing to do, for a lot of other reasons. We do have ->d_name UAF issues in ->d_revalidate() instances, and that allows to solve them nicely. It's self-contained (your 2/9 and 3/9), so I'm going to grab that into a never-rebased branch, just to be able to base the followups propagating the use of stable name into instances. Anyway, need to finish writing up the description of existing dcache series first...
Al Viro <viro@zeniv.linux.org.uk> writes: > On Thu, Nov 23, 2023 at 02:06:39PM -0500, Gabriel Krisman Bertazi wrote: > >> > A paragraph above you've said that it's not constant over the entire >> > filesystem. >> >> The same ->d_op is used by every dentry in the filesystem if the superblock >> has the casefold bit enabled, regardless of whether a specific inode is >> casefolded or not. See generic_set_encrypted_ci_d_ops in my tree. It is >> called unconditionally by ext4_lookup and only checks the superblock: >> >> void generic_set_encrypted_ci_d_ops(struct dentry *dentry) >> { >> if (dentry->d_sb->s_encoding) { >> d_set_d_op(dentry, &generic_encrypted_ci_dentry_ops); >> return; >> } >> ... >> >> What I meant was that this used to be set once at sb->s_d_op, and >> propagated during dentry allocation. Therefore, the propagation to the >> alias would happen inside __d_alloc. Once we enabled fscrypt and >> casefold to work together, sb->s_d_op is NULL > > Why? That's what I don't understand - if you really want it for > all dentries on that filesystem, that's what ->s_d_op is for. > If it is not, you have that problem, no matter which way you flip ->d_op > value. I'm not sure why it changed. I'm guessing that, since it doesn't make sense to set fscrypt_d_revalidate for every dentry in the !case-insensitive case, they just kept the same behavior for case-insensitive+fscrypt. This is what I get from looking at the git history. I will get a new series reverting to use ->s_d_op, folding the dentry_cmp behavior you mentioned, and based on what you merge in your branch. >> and we always set the same >> handler for every dentry during lookup. > > Not every dentry goes through lookup - see upthread for details. Yes, I got that already. This should be "we always set the same handler for every dentry that goes through lookup and bork whatever doesn't come through lookup."
Al Viro <viro@zeniv.linux.org.uk> writes: > On Thu, Nov 23, 2023 at 02:06:39PM -0500, Gabriel Krisman Bertazi wrote: > >> > >> > 4. d_move() and d_exchange() would ignore the value returned by __d_move(); >> > __d_unalias() turn >> > __d_move(alias, dentry, false); >> > ret = 0; >> > into >> > ret = __d_move(alias, dentry, Splice); >> > d_splice_alias() turn >> > __d_move(new, dentry, false); >> > write_sequnlock(&rename_lock); >> > into >> > err = __d_move(new, dentry, Splice); >> > write_sequnlock(&rename_lock); >> > if (unlikely(err)) { >> > dput(new); >> > new = ERR_PTR(err); >> > } >> > (actually, dput()-on-error part would be common to all 3 branches >> > in there, so it would probably get pulled out of that if-else if-else). >> > >> > I can cook a patch doing that (and convert the obvious beneficiaries already >> > in the tree to it) and throw it into dcache branch - just need to massage >> > the series in there for repost... >> >> if you can write that, I'll definitely appreciate it. It will surely >> take me much longer to figure it out myself. > > Speaking of other stuff in the series - passing the expected name to > ->d_revalidate() is definitely the right thing to do, for a lot of > other reasons. We do have ->d_name UAF issues in ->d_revalidate() > instances, and that allows to solve them nicely. > > It's self-contained (your 2/9 and 3/9), so I'm going to grab that > into a never-rebased branch, just to be able to base the followups > propagating the use of stable name into instances. ack. I'll base the other changes we discussed on top of your branch. thanks,
On Fri, Nov 24, 2023 at 10:22:49AM -0500, Gabriel Krisman Bertazi wrote:
> ack. I'll base the other changes we discussed on top of your branch.
Rebased to v6.7-rc1, fixed up (ceph calls fscrypt_d_revalidate() directly,
and D/f/porting entry had been missing), pushed out as #no-rebase-d_revalidate
On Sat, Nov 25, 2023 at 10:01:36PM +0000, Al Viro wrote: > On Fri, Nov 24, 2023 at 10:22:49AM -0500, Gabriel Krisman Bertazi wrote: > > > ack. I'll base the other changes we discussed on top of your branch. > > Rebased to v6.7-rc1, fixed up (ceph calls fscrypt_d_revalidate() directly, > and D/f/porting entry had been missing), pushed out as #no-rebase-d_revalidate FWIW, ->d_revalidate() has an old unpleasant problem we might try to solve now. In non-RCU mode We treat 0 as "invalidate that sucker and do a fresh lookup". Fine, except that there's a possibility of race here - we'd hit ->d_revalidate() while another thread was renaming object in question. Or has just found it by doing lookup in a place where it had been moved on server. ->d_revalidate() decides that it needs to be looked up on server and forms a request before rename succeeds. So NFS (e.g.) request goes out with the old parent and name. By the time server sees it, RENAME has been processed and succeeded. There's no such file in the old place anymore. So ->d_revalidate() returns 0... and we proceed to invalidate the dentry. Which had been moved to *new* place by now. In that place it's perfectly valid and does not deserve invalidation. Scenario when rename had been done not from this client is even worse: server:/srv/nfs/foo is mounted on /mnt/foo we state /mnt/foo/bar /mnt/foo/bar is in dcache somebody on server renames /srv/nfs/foo/bar to /srv/nfs/foo/barf process A: stat /mnt/foo/bar/baz. process B: mount something on /mnt/foo/barf/ process B: no /mnt/foo/barf in dcache, let's look it up found fhandle of /mnt/foo sent LOOKUP "barf" in it got an fhandle and found it matching the inode of /mnt/foo/bar process A: has reached /mnt/foo/bar and decided to revalidate it. found fhandle of /mnt/foo sent a LOOKUP "bar" in that got "nothing with that name there" ->d_revalidate() returns 0 loses CPU process B: splice the dentry of /mnt/foo/bar to /mnt/foo/barf proceed to mount on top of it process A: gets CPU back calls d_invalidate() on the dentry that now is /mnt/foo/barf dissolves the mount created by process B Note that server:/srv/nfs/foo/barf has been there and perfectly valid since before B has started doing anything. It has no idea that the damn thing used to be in a different place and something on the same client had seen it at the old place once upon a time. As far as it is concerned, mount has succeeded and then quietly disappeared. The mountpoint is still there - with freshly looked up dentry, since the old one had been invalidated, but userland doesn't see that, so... WTF? It's not easy to hit, but I'd expect it to be feasible on SMP KVM, where instead of A losing CPU we might've had the virtual CPU losing the timeslice on host. IMO we should only do d_invalidate() if * ->d_revalidate() has returned 0 * dentry is still hashed, still has the same parent and still matches the name from ->d_compare() POV. If it doesn't, we should just leave it whereever it has been moved to and act as if we hadn't seen it in the first place. In other words, have d_revalidate(dentry, parent, name, flags) doing the following: if no ->d_revalidate return 1 ret = ->d_revalidate(...) if (unlikely(ret == 0) && !(flags & LOOKUP_RCU)) { spin_lock(&dentry->d_lock); if (!d_same_name(dentry, parent, name)) spin_lock(&dentry->d_lock); else d_invalidate_locked(dentry); } return ret where d_invalidate_locked() would be d_invalidate() sans the initial spin_lock(&dentry->d_lock); That would solve that problem, AFAICS. Objections, anyone? I'm too sleepy to put together a patch at the moment, will post after I get some sleep... PS: as the matter of fact, it might be a good idea to pass the parent as explicit argument to ->d_revalidate(), now that we are passing the name as well. Look at the boilerplate in the instances; all that parent = READ_ONCE(dentry->d_parent); dir = d_inode_rcu(parent); if (!dir) return -ECHILD; ... on the RCU side combined with parent = dget_parent(dentry); dir = d_inode(parent); ... dput(dir); stuff. It's needed only because the caller had not told us which directory is that thing supposed to be in; in non-RCU mode the parent is explicitly pinned down, no need to play those games. All we need is dir = d_inode_rcu(parent); if (!dir) // could happen only in RCU mode return -ECHILD; assuming we need the parent inode, that is. So... how about int (*d_revalidate)(struct dentry *dentry, struct dentry *parent, const struct qstr *name, unsigned int flags); since we are touching all instances anyway?
[folks involved into d_invalidate()/submount eviction stuff Cc'd] On Sun, Nov 26, 2023 at 04:52:19AM +0000, Al Viro wrote: > PS: as the matter of fact, it might be a good idea to pass the parent > as explicit argument to ->d_revalidate(), now that we are passing the > name as well. Look at the boilerplate in the instances; all that > parent = READ_ONCE(dentry->d_parent); > dir = d_inode_rcu(parent); > if (!dir) > return -ECHILD; > ... > on the RCU side combined with > parent = dget_parent(dentry); > dir = d_inode(parent); > ... > dput(dir); > stuff. > > It's needed only because the caller had not told us which directory > is that thing supposed to be in; in non-RCU mode the parent is > explicitly pinned down, no need to play those games. All we need > is > dir = d_inode_rcu(parent); > if (!dir) // could happen only in RCU mode > return -ECHILD; > assuming we need the parent inode, that is. > > So... how about > int (*d_revalidate)(struct dentry *dentry, struct dentry *parent, > const struct qstr *name, unsigned int flags); > since we are touching all instances anyway? OK, it's definitely a good idea for simplifying ->d_revalidate() instances and I think we should go for it on thes grounds alone. I'll do that. d_invalidate() situation is more subtle - we need to sort out its interplay with d_splice_alias(). More concise variant of the scenario in question: * we have /mnt/foo/bar and a lot of its descendents in dcache on client * server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz * somebody on the client does a lookup of /mnt/foo/bar and gets told by the server that there's no directory with that name anymore. * that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts evicting its descendents * We try to mount something on /mnt/foo/baz/blah. We look up baz, get an fhandle and notice that there's a directory inode for it (/mnt/foo/bar). d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing it in process, as it ought to. Then we find /mnt/foo/baz/blah in dcache and mount on top of it. * d_invalidate() finishes shrink_dcache_parent() and starts hunting for submounts to dissolve. And finds the mount we'd done. Which mount quietly disappears. Note that from the server POV the thing had been moved quite a while ago. No server-side races involved - all it seeem is a couple of LOOKUP in the same directory, one for the old name, one for the new. On the client on the mounter side we have an uneventful mount on /mnt/foo/baz, which had been there on server at the time we started and which remains in place after mount we'd created suddenly disappears. For the thread that ended up calling d_invalidate(), they'd been doing e.g. stat on a pathname that used to be there a while ago, but currently isn't. They get -ENOENT and no indication that something odd might have happened. From ->d_revalidate() point of view there's also nothing odd happening - dentry is not a mountpoint, it stays in place until we return and there's no directory entry with that name on in its parent. It's as clear-cut as it gets - dentry is stale. The only overlap happening there is d_splice_alias() hitting in the middle of already started d_invalidate(). For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately" and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something to do with it, but the same problem existed prior to that. FWIW, I suspect that the right answer would be along the lines of * if d_splice_alias() does move an exsiting (attached) alias in place, it ought to dissolve all mountpoints in subtree being moved. There might be subtleties, but in case when that __d_unalias() happens due to rename on server this is definitely the right thing to do. * d_invalidate() should *NOT* do anything with dentry that got moved (including moved by d_splice_alias()) from the place we'd found it in dcache. At least d_invalidate() done due to having ->d_revalidate() return 0. * d_invalidate() should dissolve all mountpoints in the subtree that existed when it got started (and found the victim still unmoved, that is). It should (as it does) prevent any new mountpoints added in that subtree, unless the mountpoint to be had been moved (spliced) out. What it really shouldn't do is touch the mountpoints that are currently outside of it due to moves. I'm going to look around and see if we have any weird cases where d_splice_alias() is used for things like "correct the case of dentry name on a case-mangled filesystem" - that would presumably not want to dissolve any submounts. I seem to recall seeing some shite of that sort, but that was a long time ago. Eric, Miklos - it might be a good idea if you at least took a look at whatever comes out of that (sub)thread; I'm trying to reconstruct the picture, but the last round of serious reworking of that area had been almost 10 years ago and your recollections of the considerations back then might help. I realize that they are probably rather fragmentary (mine definitely are) and any analysis will need to be redone on the current tree, but...
On Sun, Nov 26, 2023 at 06:41:41PM +0000, Al Viro wrote: > d_invalidate() situation is more subtle - we need to sort out its interplay > with d_splice_alias(). > > More concise variant of the scenario in question: > * we have /mnt/foo/bar and a lot of its descendents in dcache on client > * server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz > * somebody on the client does a lookup of /mnt/foo/bar and gets told by > the server that there's no directory with that name anymore. > * that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts > evicting its descendents > * We try to mount something on /mnt/foo/baz/blah. We look up baz, get > an fhandle and notice that there's a directory inode for it (/mnt/foo/bar). > d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing > it in process, as it ought to. Then we find /mnt/foo/baz/blah in dcache and > mount on top of it. > * d_invalidate() finishes shrink_dcache_parent() and starts hunting for > submounts to dissolve. And finds the mount we'd done. Which mount quietly > disappears. > > Note that from the server POV the thing had been moved quite a while ago. > No server-side races involved - all it seeem is a couple of LOOKUP in the > same directory, one for the old name, one for the new. > > On the client on the mounter side we have an uneventful mount on /mnt/foo/baz, > which had been there on server at the time we started and which remains in > place after mount we'd created suddenly disappears. > > For the thread that ended up calling d_invalidate(), they'd been doing e.g. > stat on a pathname that used to be there a while ago, but currently isn't. > They get -ENOENT and no indication that something odd might have happened. > > >From ->d_revalidate() point of view there's also nothing odd happening - > dentry is not a mountpoint, it stays in place until we return and there's > no directory entry with that name on in its parent. It's as clear-cut > as it gets - dentry is stale. > > The only overlap happening there is d_splice_alias() hitting in the middle > of already started d_invalidate(). > > For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately" > and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something > to do with it, but the same problem existed prior to that. > > FWIW, I suspect that the right answer would be along the lines of > * if d_splice_alias() does move an exsiting (attached) alias in > place, it ought to dissolve all mountpoints in subtree being moved. > There might be subtleties, but in case when that __d_unalias() happens > due to rename on server this is definitely the right thing to do. > * d_invalidate() should *NOT* do anything with dentry that > got moved (including moved by d_splice_alias()) from the place we'd > found it in dcache. At least d_invalidate() done due to having > ->d_revalidate() return 0. > * d_invalidate() should dissolve all mountpoints in the > subtree that existed when it got started (and found the victim > still unmoved, that is). It should (as it does) prevent any > new mountpoints added in that subtree, unless the mountpoint > to be had been moved (spliced) out. What it really shouldn't > do is touch the mountpoints that are currently outside of it > due to moves. > > I'm going to look around and see if we have any weird cases where > d_splice_alias() is used for things like "correct the case of > dentry name on a case-mangled filesystem" - that would presumably > not want to dissolve any submounts. I seem to recall seeing > some shite of that sort, but that was a long time ago. > > Eric, Miklos - it might be a good idea if you at least took a > look at whatever comes out of that (sub)thread; I'm trying to > reconstruct the picture, but the last round of serious reworking > of that area had been almost 10 years ago and your recollections > of the considerations back then might help. I realize that they > are probably rather fragmentary (mine definitely are) and any > analysis will need to be redone on the current tree, but... TBH, I wonder if we ought to have d_invalidate() variant that would unhash the dentry in question, do a variant of shrink_dcache_parent() that would report if there had been any mountpoints and if there had been any, do namespace_lock() and go hunting for mounts in that subtree, moving corresponding struct mountpoint to a private list as we go (removing them from mountpoint hash chains, that it). Then have them all evicted after we'd finished walking the subtree... The tricky part will be lock ordering - right now we have the mountpoint hash protected by mount_lock (same as mount hash, probably worth splitting anyway) and that nests outside of ->d_lock. Note that we don't do mountpoint hash lookups on mountpoint crossing - it's nowhere near the really hot paths. What we have is lookup_mountpoint() - plain hash lookup. Always under namespace_lock() and mount_lock. get_mountpoint() - there's an insertion into hash chain, with dentry passed through the d_set_mounted(), which would fail if we have d_invalidate() on the subtree. Also always under namespace_lock() and mount_lock. __put_mountpoint() - removal from the hash chain. We remove from hash chain after having cleared DCACHE_MOUNTED. _That_ can happen under mount_lock alone (taking out the stuck submounts on final mntput()). So convert the mountpoint hash chains to hlist_bl, bitlocks nesting under ->d_lock. Introduce a new dentry flag (DCHACE_MOUNT_INVALIDATION?) In d_walk() callback we would * do nothing if DCACHE_MOUNT is not set or DCACHE_MOUNT_INVALIDATION is. * otherwise set DCACHE_MOUNT_INVALIDATION, grab the bitlock on the mountpoint hash chain matching that dentry, find struct mountpoint in it, remove it from the chain and insert into a separate "collector" chain, all without messing with refcount. In lookup_mountpoint() and get_mountpoint() take the bitlock on chain. In __put_mountpoint(), once it has grabbed ->d_lock * check if it has DCACHE_MOUNT_INVALIDATION, use that to decide which chain we are locking - the normal one or the collector * clear both DCACHE_MOUNT and DCACHE_MOUNT_INVALIDATION * remove from chain * unlock the chain * drop ->d_lock. Once we are finished walking the tree, go over the collector list and do what __detach_mount() guts do. We are no longer under any ->d_lock, so locking is not a problem. namespace_unlock() will flush them all, same as it does for __detach_mount(). In __d_unalias() case do that d_invalidate() analogues of the alias. Yes, it might do final mntput() of other filesystems, while under ->i_rwsem on our parent. Not a problem, fs shutdown will go either through task_work or schedule_delayed_work(); in any case, it won't happen under ->i_rwsem. We obviously can't do that under rename_lock, though, so we'll need to massage that path in d_splice_alias() a bit. So, something like d_invalidate_locked(victim) called with victim->d_lock held. d_splice_alias() would use that (see above) and places where we do d_invalidate() after ->d_revalidate() having returned 0 would do this: lock dentry if it still has the same parent and name d_invalidate_locked() else unlock dentry probably folded into fs/namei.c:d_revalidate()... Not tonight, though - I'd rather do that while properly awake ;-/
Al Viro <viro@zeniv.linux.org.uk> writes: > On Sun, Nov 26, 2023 at 06:41:41PM +0000, Al Viro wrote: > >> d_invalidate() situation is more subtle - we need to sort out its interplay >> with d_splice_alias(). >> >> More concise variant of the scenario in question: >> * we have /mnt/foo/bar and a lot of its descendents in dcache on client >> * server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz >> * somebody on the client does a lookup of /mnt/foo/bar and gets told by >> the server that there's no directory with that name anymore. >> * that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts >> evicting its descendents >> * We try to mount something on /mnt/foo/baz/blah. We look up baz, get >> an fhandle and notice that there's a directory inode for it (/mnt/foo/bar). >> d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing >> it in process, as it ought to. Then we find /mnt/foo/baz/blah in dcache and >> mount on top of it. >> * d_invalidate() finishes shrink_dcache_parent() and starts hunting for >> submounts to dissolve. And finds the mount we'd done. Which mount quietly >> disappears. >> >> Note that from the server POV the thing had been moved quite a while ago. >> No server-side races involved - all it seeem is a couple of LOOKUP in the >> same directory, one for the old name, one for the new. >> >> On the client on the mounter side we have an uneventful mount on /mnt/foo/baz, >> which had been there on server at the time we started and which remains in >> place after mount we'd created suddenly disappears. >> >> For the thread that ended up calling d_invalidate(), they'd been doing e.g. >> stat on a pathname that used to be there a while ago, but currently isn't. >> They get -ENOENT and no indication that something odd might have happened. >> >> >From ->d_revalidate() point of view there's also nothing odd happening - >> dentry is not a mountpoint, it stays in place until we return and there's >> no directory entry with that name on in its parent. It's as clear-cut >> as it gets - dentry is stale. >> >> The only overlap happening there is d_splice_alias() hitting in the middle >> of already started d_invalidate(). >> >> For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately" >> and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something >> to do with it, but the same problem existed prior to that. >> >> FWIW, I suspect that the right answer would be along the lines of >> * if d_splice_alias() does move an exsiting (attached) alias in >> place, it ought to dissolve all mountpoints in subtree being moved. >> There might be subtleties, but in case when that __d_unalias() happens >> due to rename on server this is definitely the right thing to do. >> * d_invalidate() should *NOT* do anything with dentry that >> got moved (including moved by d_splice_alias()) from the place we'd >> found it in dcache. At least d_invalidate() done due to having >> ->d_revalidate() return 0. >> * d_invalidate() should dissolve all mountpoints in the >> subtree that existed when it got started (and found the victim >> still unmoved, that is). It should (as it does) prevent any >> new mountpoints added in that subtree, unless the mountpoint >> to be had been moved (spliced) out. What it really shouldn't >> do is touch the mountpoints that are currently outside of it >> due to moves. >> >> I'm going to look around and see if we have any weird cases where >> d_splice_alias() is used for things like "correct the case of >> dentry name on a case-mangled filesystem" - that would presumably >> not want to dissolve any submounts. I seem to recall seeing >> some shite of that sort, but that was a long time ago. >> >> Eric, Miklos - it might be a good idea if you at least took a >> look at whatever comes out of that (sub)thread; I'm trying to >> reconstruct the picture, but the last round of serious reworking >> of that area had been almost 10 years ago and your recollections >> of the considerations back then might help. I realize that they >> are probably rather fragmentary (mine definitely are) and any >> analysis will need to be redone on the current tree, but... By subthread I assume you are referring to the work to that generalized check_submounts_and_drop into the current d_invalidate. My memory is that there were deliberate restrictions on where d_revalidate could be called so as not to mess with mounts. I believe those restrictions either prevented or convinced us it prevented nasty interactions between d_invalidate and d_splice_alias. There is a lot going on there. I remember one of the relevant restrictions was marking dentries dont_mount, and inodes S_DEAD in unlink and rmdir. But even without out that marking if d_invalidate is called from d_revalidate the inode and all of it's dentries must be dead because the inode is stale and most go. There should be no resurrecting it at that point. I suspect the most fruitful way to think of the d_invalidate vs d_splice_alias races is an unlink vs rename race. I don't think the mechanism matters, but deeply and fundamentally if we detect a directory inode is dead we need to stick with that decision and not attempt to resurrect it with d_splice_alias. Looking at ext4 and f2fs it appears when case folding they are calling d_invalidate before the generic code can, and before marking like dont_mount happen. Is that the tie in with where the current conversation comes in? > TBH, I wonder if we ought to have d_invalidate() variant that would > unhash the dentry in question, You mean like the current d_invalidate does? It calls __d_drop which unhashes the thing and prevent lookups. You even pointed to the change that added that in the previous email. The only thing that does not happen currently is marking the dentry as unhashed. Looking the rmdir code uses not only dont_mount but marks the inode S_DEAD as well. Right now we can't even get to d_splice_alias unless the original dentry is unhashed. So I suspect it isn't d_invalidate you are fighting. > do a variant of shrink_dcache_parent() > that would report if there had been any mountpoints and if there > had been any, do namespace_lock() and go hunting for mounts in that > subtree, moving corresponding struct mountpoint to a private list > as we go (removing them from mountpoint hash chains, that it). Then > have them all evicted after we'd finished walking the subtree... > > The tricky part will be lock ordering - right now we have the > mountpoint hash protected by mount_lock (same as mount hash, probably > worth splitting anyway) and that nests outside of ->d_lock. I don't get get it. All we have to do is to prevent the inode lookup from succeeding if we have decided the inode has been deleted. It may be a little more subtle the path of the inode we are connecting goes through a dentry that is being invalidated. But either need to prevent it in the lookup that leads to d_alloc, or prevent the new dentry from being attached. I know d_splice_alias takes the rename_lock to prevent some of those races. I hope that helps on the recollection front. I am confused what is going on with ext4 and f2fs. I think they are calling d_invalidate when all they need to call is d_drop. Eric
"Eric W. Biederman" <ebiederm@xmission.com> writes: > I am confused what is going on with ext4 and f2fs. I think they > are calling d_invalidate when all they need to call is d_drop. ext4 and f2f2 are buggy in how they call d_invalidate, if I am reading the code correctly. d_invalidate calls detach_mounts. detach_mounts relies on setting D_CANT_MOUNT on the top level dentry to prevent races with new mounts. ext4 and f2fs (in their case insensitive code) are calling d_invalidate before dont_mount has been called to set D_CANT_MOUNT. Eric
On Mon, Nov 27, 2023 at 09:47:47AM -0600, Eric W. Biederman wrote: > There is a lot going on there. I remember one of the relevant > restrictions was marking dentries dont_mount, and inodes S_DEAD > in unlink and rmdir. > > But even without out that marking if d_invalidate is called > from d_revalidate the inode and all of it's dentries must be > dead because the inode is stale and most go. There should > be no resurrecting it at that point. > > I suspect the most fruitful way to think of the d_invalidate vs > d_splice_alias races is an unlink vs rename race. > > I don't think the mechanism matters, but deeply and fundamentally > if we detect a directory inode is dead we need to stick with > that decision and not attempt to resurrect it with d_splice_alias. Wrong. Deeply and fundamentally we detect a dentry that does not match the directory contents according to the server. For example, due to rename done on server. With object in question perfectly alive there - fhandle still works, etc. However, it's no longer where it used to be. And we would bloody better not have lookups for the old name result in access to that object. We also should never allow the access to *new* name lead to two live dentries for the same directory inode. Again, this is not about rmdir() or unlink() - invalidation can happen for object that is still open, still accessed and still very much alive. Does that all the time for any filesystem with ->d_revalidate().
On Mon, Nov 27, 2023 at 04:03:18PM +0000, Al Viro wrote: > On Mon, Nov 27, 2023 at 09:47:47AM -0600, Eric W. Biederman wrote: > > > There is a lot going on there. I remember one of the relevant > > restrictions was marking dentries dont_mount, and inodes S_DEAD > > in unlink and rmdir. > > > > But even without out that marking if d_invalidate is called > > from d_revalidate the inode and all of it's dentries must be > > dead because the inode is stale and most go. There should > > be no resurrecting it at that point. > > > > I suspect the most fruitful way to think of the d_invalidate vs > > d_splice_alias races is an unlink vs rename race. > > > > I don't think the mechanism matters, but deeply and fundamentally > > if we detect a directory inode is dead we need to stick with > > that decision and not attempt to resurrect it with d_splice_alias. > > Wrong. Deeply and fundamentally we detect a dentry that does not > match the directory contents according to the server. > > For example, due to rename done on server. With object in question > perfectly alive there - fhandle still works, etc. > > However, it's no longer where it used to be. And we would bloody better > not have lookups for the old name result in access to that object. > We also should never allow the access to *new* name lead to two live > dentries for the same directory inode. > > Again, this is not about rmdir() or unlink() - invalidation can happen > for object that is still open, still accessed and still very much alive. > Does that all the time for any filesystem with ->d_revalidate(). Put another way, there used to be very odd song and dance in ->d_revalidate() instances along the lines of "we can't possibly tell the caller to invalidate a mountpoint"; it was racy in the best case and during the rewrite of d_invalidate() to teach it how to evict submounts those attempts had been dropped - ->d_revalidate() returning 0 does end up with mounts dissolved by d_invalidate() from caller. It always had been racy, starting with the checks that used to be in ->d_revalidate() instances way before all those changes. So the switch of d_invalidate() to dissolving submounts had been a step in the right direction, but it's not being careful enough. Again, it's about d_invalidate() caused by pathwalk running into a dentry that doesn't match the reality vs. d_splice_alias() finding that it matches the inode we had looked up elsewhere.
On Mon, Nov 27, 2023 at 06:38:42AM +0000, Al Viro wrote: > On Sun, Nov 26, 2023 at 06:41:41PM +0000, Al Viro wrote: > > > d_invalidate() situation is more subtle - we need to sort out its interplay > > with d_splice_alias(). > > > > More concise variant of the scenario in question: > > * we have /mnt/foo/bar and a lot of its descendents in dcache on client > > * server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz > > * somebody on the client does a lookup of /mnt/foo/bar and gets told by > > the server that there's no directory with that name anymore. > > * that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts > > evicting its descendents > > * We try to mount something on /mnt/foo/baz/blah. We look up baz, get > > an fhandle and notice that there's a directory inode for it (/mnt/foo/bar). > > d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing > > it in process, as it ought to. Then we find /mnt/foo/baz/blah in dcache and > > mount on top of it. > > * d_invalidate() finishes shrink_dcache_parent() and starts hunting for > > submounts to dissolve. And finds the mount we'd done. Which mount quietly > > disappears. > > > > Note that from the server POV the thing had been moved quite a while ago. > > No server-side races involved - all it seeem is a couple of LOOKUP in the > > same directory, one for the old name, one for the new. > > > > On the client on the mounter side we have an uneventful mount on /mnt/foo/baz, > > which had been there on server at the time we started and which remains in > > place after mount we'd created suddenly disappears. > > > > For the thread that ended up calling d_invalidate(), they'd been doing e.g. > > stat on a pathname that used to be there a while ago, but currently isn't. > > They get -ENOENT and no indication that something odd might have happened. > > > > >From ->d_revalidate() point of view there's also nothing odd happening - > > dentry is not a mountpoint, it stays in place until we return and there's > > no directory entry with that name on in its parent. It's as clear-cut > > as it gets - dentry is stale. > > > > The only overlap happening there is d_splice_alias() hitting in the middle > > of already started d_invalidate(). > > > > For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately" > > and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something > > to do with it, but the same problem existed prior to that. > > > > FWIW, I suspect that the right answer would be along the lines of > > * if d_splice_alias() does move an exsiting (attached) alias in > > place, it ought to dissolve all mountpoints in subtree being moved. > > There might be subtleties, but in case when that __d_unalias() happens > > due to rename on server this is definitely the right thing to do. > > * d_invalidate() should *NOT* do anything with dentry that > > got moved (including moved by d_splice_alias()) from the place we'd > > found it in dcache. At least d_invalidate() done due to having > > ->d_revalidate() return 0. > > * d_invalidate() should dissolve all mountpoints in the > > subtree that existed when it got started (and found the victim > > still unmoved, that is). It should (as it does) prevent any > > new mountpoints added in that subtree, unless the mountpoint > > to be had been moved (spliced) out. What it really shouldn't > > do is touch the mountpoints that are currently outside of it > > due to moves. > > > > I'm going to look around and see if we have any weird cases where > > d_splice_alias() is used for things like "correct the case of > > dentry name on a case-mangled filesystem" - that would presumably > > not want to dissolve any submounts. I seem to recall seeing > > some shite of that sort, but that was a long time ago. > > > > Eric, Miklos - it might be a good idea if you at least took a > > look at whatever comes out of that (sub)thread; I'm trying to > > reconstruct the picture, but the last round of serious reworking > > of that area had been almost 10 years ago and your recollections > > of the considerations back then might help. I realize that they > > are probably rather fragmentary (mine definitely are) and any > > analysis will need to be redone on the current tree, but... > > TBH, I wonder if we ought to have d_invalidate() variant that would > unhash the dentry in question, do a variant of shrink_dcache_parent() > that would report if there had been any mountpoints and if there > had been any, do namespace_lock() and go hunting for mounts in that > subtree, moving corresponding struct mountpoint to a private list > as we go (removing them from mountpoint hash chains, that it). Then > have them all evicted after we'd finished walking the subtree... That sounds reasonable. > > The tricky part will be lock ordering - right now we have the > mountpoint hash protected by mount_lock (same as mount hash, probably > worth splitting anyway) and that nests outside of ->d_lock. > > Note that we don't do mountpoint hash lookups on mountpoint crossing > - it's nowhere near the really hot paths. What we have is > lookup_mountpoint() - plain hash lookup. Always > under namespace_lock() and mount_lock. > get_mountpoint() - there's an insertion into hash chain, > with dentry passed through the d_set_mounted(), which would > fail if we have d_invalidate() on the subtree. > Also always under namespace_lock() and mount_lock. > __put_mountpoint() - removal from the hash chain. > We remove from hash chain after having cleared DCACHE_MOUNTED. > _That_ can happen under mount_lock alone (taking out the stuck > submounts on final mntput()). > > So convert the mountpoint hash chains to hlist_bl, bitlocks nesting under > ->d_lock. Introduce a new dentry flag (DCHACE_MOUNT_INVALIDATION?) > In d_walk() callback we would > * do nothing if DCACHE_MOUNT is not set or DCACHE_MOUNT_INVALIDATION > is. > * otherwise set DCACHE_MOUNT_INVALIDATION, grab the bitlock on the > mountpoint hash chain matching that dentry, find struct mountpoint in it, > remove it from the chain and insert into a separate "collector" chain, all > without messing with refcount. Ok. > In lookup_mountpoint() and get_mountpoint() take the bitlock on chain. > In __put_mountpoint(), once it has grabbed ->d_lock > * check if it has DCACHE_MOUNT_INVALIDATION, use that to > decide which chain we are locking - the normal one or the collector > * clear both DCACHE_MOUNT and DCACHE_MOUNT_INVALIDATION > * remove from chain > * unlock the chain > * drop ->d_lock. > > Once we are finished walking the tree, go over the collector list > and do what __detach_mount() guts do. We are no longer under > any ->d_lock, so locking is not a problem. namespace_unlock() will > flush them all, same as it does for __detach_mount(). Ok.
On Mon, Nov 27, 2023 at 10:01:34AM -0600, Eric W. Biederman wrote: > "Eric W. Biederman" <ebiederm@xmission.com> writes: > > > I am confused what is going on with ext4 and f2fs. I think they > > are calling d_invalidate when all they need to call is d_drop. > > ext4 and f2f2 are buggy in how they call d_invalidate, if I am reading > the code correctly. > > d_invalidate calls detach_mounts. > > detach_mounts relies on setting D_CANT_MOUNT on the top level dentry to > prevent races with new mounts. > > ext4 and f2fs (in their case insensitive code) are calling d_invalidate > before dont_mount has been called to set D_CANT_MOUNT. Not really - note that the place where we check cant_mount() is under the lock on the mountpoint's inode, so anything inside ->unlink() or ->rmdir() is indistinguishable from the places where we do dont_mount() in vfs_{unlink,rmdir}.
Al Viro <viro@zeniv.linux.org.uk> writes: > On Mon, Nov 27, 2023 at 04:03:18PM +0000, Al Viro wrote: >> On Mon, Nov 27, 2023 at 09:47:47AM -0600, Eric W. Biederman wrote: >> >> > There is a lot going on there. I remember one of the relevant >> > restrictions was marking dentries dont_mount, and inodes S_DEAD >> > in unlink and rmdir. >> > >> > But even without out that marking if d_invalidate is called >> > from d_revalidate the inode and all of it's dentries must be >> > dead because the inode is stale and most go. There should >> > be no resurrecting it at that point. >> > >> > I suspect the most fruitful way to think of the d_invalidate vs >> > d_splice_alias races is an unlink vs rename race. >> > >> > I don't think the mechanism matters, but deeply and fundamentally >> > if we detect a directory inode is dead we need to stick with >> > that decision and not attempt to resurrect it with d_splice_alias. >> >> Wrong. Deeply and fundamentally we detect a dentry that does not >> match the directory contents according to the server. >> >> For example, due to rename done on server. With object in question >> perfectly alive there - fhandle still works, etc. >> >> However, it's no longer where it used to be. And we would bloody better >> not have lookups for the old name result in access to that object. >> We also should never allow the access to *new* name lead to two live >> dentries for the same directory inode. >> >> Again, this is not about rmdir() or unlink() - invalidation can happen >> for object that is still open, still accessed and still very much alive. >> Does that all the time for any filesystem with ->d_revalidate(). > > Put another way, there used to be very odd song and dance in ->d_revalidate() > instances along the lines of "we can't possibly tell the caller to invalidate > a mountpoint"; it was racy in the best case and during the rewrite of > d_invalidate() to teach it how to evict submounts those attempts had been > dropped - ->d_revalidate() returning 0 does end up with mounts dissolved > by d_invalidate() from caller. > > It always had been racy, starting with the checks that used to be in > ->d_revalidate() instances way before all those changes. So the switch > of d_invalidate() to dissolving submounts had been a step in the right > direction, but it's not being careful enough. > > Again, it's about d_invalidate() caused by pathwalk running into a dentry that > doesn't match the reality vs. d_splice_alias() finding that it matches the > inode we had looked up elsewhere. My point is we should have a atomic way to decide the disposition of such a dentry, and it's children. Either we should decide it is useless and remove it and all of it's children. Or we should decide it was renamed and just handle it that way. If we can record such a decision on the dentry or possibly on the inode then we can resolve the race by having it be a proper race of which comes first. It isn't a proper delete of the inode so anything messing with the inode and marking it S_DEAD is probably wrong. The code could do something like mark the dentry dont_mount which should be enough to for d_splice_alias to say oops, something is not proper here. Let the d_invalidate do it's thing. Or the code could remove the dentry from inode->i_dentry and keep d_splice alias from finding it, and it's children completely. That is different from unhashing it. Anyway that is my memory and my general sense of what is going on. I help it helps. Eric
On Mon, Nov 27, 2023 at 05:25:44PM +0000, Al Viro wrote: > On Mon, Nov 27, 2023 at 10:01:34AM -0600, Eric W. Biederman wrote: > > "Eric W. Biederman" <ebiederm@xmission.com> writes: > > > > > I am confused what is going on with ext4 and f2fs. I think they > > > are calling d_invalidate when all they need to call is d_drop. > > > > ext4 and f2f2 are buggy in how they call d_invalidate, if I am reading > > the code correctly. > > > > d_invalidate calls detach_mounts. > > > > detach_mounts relies on setting D_CANT_MOUNT on the top level dentry to > > prevent races with new mounts. > > > > ext4 and f2fs (in their case insensitive code) are calling d_invalidate > > before dont_mount has been called to set D_CANT_MOUNT. > > Not really - note that the place where we check cant_mount() is under > the lock on the mountpoint's inode, so anything inside ->unlink() or > ->rmdir() is indistinguishable from the places where we do dont_mount() > in vfs_{unlink,rmdir}. Said that, we could simply use d_drop() in those, since the caller will take care of mount eviction - we have ->unlink() or ->rmdir() returning success, after all. The same goes for xfs caller and for cifs_prime_dcache() (in the latter case we have just checked that they sucker is negative, so d_invalidate() and d_drop() are doing the same thing).
On Mon, Nov 27, 2023 at 12:19:09PM -0600, Eric W. Biederman wrote: > Either we should decide it is useless and remove it and all of it's > children. > > Or we should decide it was renamed and just handle it that way. How? An extra roundtrip to server trying to do getattr on the fhandle we've got? Cost of that aside, we *still* need to dissolve submounts in such case; there is no warranty that we'll ever guess the new name and no way to ask the server for one, so we can't let them sit around. Not that having mounts (local by definition) suddenly show up in the unexpected place because of rename on server looks like a good thing, especially since had that rename on server been done as cp -rl + rm -rf the same mounts would be gone... > If we can record such a decision on the dentry or possibly on the inode > then we can resolve the race by having it be a proper race of which > comes first. > > It isn't a proper delete of the inode so anything messing with the inode > and marking it S_DEAD is probably wrong. s/probably/certainly/, but where would d_invalidate() do such a thing? It's none of its business... > The code could do something like mark the dentry dont_mount which should > be enough to for d_splice_alias to say oops, something is not proper > here. Let the d_invalidate do it's thing. > > Or the code could remove the dentry from inode->i_dentry and keep > d_splice alias from finding it, and it's children completely. > That is different from unhashing it. We might be just in the middle of getdents(2) on the directory in question. It can be opened; we can't do anything that destructive there. Again, it's about the d_invalidate() on ->d_revalidate() reporting 0; uses like proc_invalidate_siblings_dcache() are separate story, simply because there d_splice_alias() is not going to move anything anywhere.
On Fri, Nov 24, 2023 at 10:20:39AM -0500, Gabriel Krisman Bertazi wrote: > I'm not sure why it changed. I'm guessing that, since it doesn't make > sense to set fscrypt_d_revalidate for every dentry in the > !case-insensitive case, they just kept the same behavior for > case-insensitive+fscrypt. This is what I get from looking at the git > history. > > I will get a new series reverting to use ->s_d_op, folding the > dentry_cmp behavior you mentioned, and based on what you merge in your > branch. FWIW, I suspect that we might want something along the lines of d_set_always_valid(dentry) { grab ->d_lock clear DCACHE_OP_REVALIDATE release ->d_lock } to make it possible to suppress ->d_revalidate() for this particular dentry...
On Mon, Nov 27, 2023 at 06:38:43AM +0000, Al Viro wrote: > > FWIW, I suspect that the right answer would be along the lines of > > * if d_splice_alias() does move an exsiting (attached) alias in > > place, it ought to dissolve all mountpoints in subtree being moved. > > There might be subtleties, Are there ever... Starting with the "our test for loop creation (alias is a direct ancestor, need to fail with -ELOOP) is dependent upon rename_lock being held all along". Folks, what semantics do we want for dissolving mounts on splice? The situation when it happens is when we have a subtree on e.g. NFS and have some mounts (on client) inside that. Then somebody on server moves the root of that subtree somewhere else and we try to do a lookup in new place. Options: 1) our dentry for directory that got moved on server is moved into new place, along with the entire subtree *and* everything mounted on it. Very dubious semantics, especially since if we look the old location up before looking for new one, the mounts will be dissolved; no way around that. 2) lookup fails. It's already possible; e.g. if server has /srv/nfs/1/2/3 moved to /srv/nfs/x, then /srv/nfs/1/2 moved to /srv/nfs/x/y and client has a process with cwd in /mnt/nfs/1/2/3 doing a lookup for "y", there's no way in hell to handle that - the lookup will return the fhandle of /srv/nfs/x, which is the same thing the client has for /mnt/nfs/1/2; we *can't* move that dentry to /mnt/nfs/1/2/3/y - not without creating a detached loop. We can also run into -ESTALE if one of the trylocks in __d_unalias() fails. Having the same happen if there are mounts in the subtree we are trying to splice would be unpleasant, but not fatal. The trouble is, that won't be a transient failure - not until somebody tries to look the old location up. 3) dissolve the mounts. Doable, but it's not easy; especially since we end up having to redo the loop-prevention check after the mounts had been dissolved. And that check may be failing by that time, with no way to undo that dissolving...
> 2) lookup fails. It's already possible; e.g. if server has I think that's the sanest option. The other options seem even less intuitive. > not fatal. The trouble is, that won't be a transient failure - > not until somebody tries to look the old location up. Eh, nfs semantics are quite special anyway already. I'd rather have that in lookup than more magic involving moving mounts around or having them disappear (Yes, we have the detach semantics on removal but that's different.).
Al Viro <viro@zeniv.linux.org.uk> writes: > On Mon, Nov 27, 2023 at 06:38:43AM +0000, Al Viro wrote: > >> > FWIW, I suspect that the right answer would be along the lines of >> > * if d_splice_alias() does move an exsiting (attached) alias in >> > place, it ought to dissolve all mountpoints in subtree being moved. >> > There might be subtleties, > > Are there ever... Starting with the "our test for loop creation > (alias is a direct ancestor, need to fail with -ELOOP) is dependent > upon rename_lock being held all along". > > Folks, what semantics do we want for dissolving mounts on splice? > The situation when it happens is when we have a subtree on e.g. NFS > and have some mounts (on client) inside that. Then somebody on > server moves the root of that subtree somewhere else and we try > to do a lookup in new place. Options: > > 1) our dentry for directory that got moved on server is moved into > new place, along with the entire subtree *and* everything mounted > on it. Very dubious semantics, especially since if we look the > old location up before looking for new one, the mounts will be > dissolved; no way around that. > > 2) lookup fails. It's already possible; e.g. if server has > /srv/nfs/1/2/3 moved to /srv/nfs/x, then /srv/nfs/1/2 moved > to /srv/nfs/x/y and client has a process with cwd in /mnt/nfs/1/2/3 > doing a lookup for "y", there's no way in hell to handle that - > the lookup will return the fhandle of /srv/nfs/x, which is the > same thing the client has for /mnt/nfs/1/2; we *can't* move that > dentry to /mnt/nfs/1/2/3/y - not without creating a detached loop. > We can also run into -ESTALE if one of the trylocks in > __d_unalias() fails. Having the same happen if there are mounts > in the subtree we are trying to splice would be unpleasant, but > not fatal. The trouble is, that won't be a transient failure - > not until somebody tries to look the old location up. > > 3) dissolve the mounts. Doable, but it's not easy; especially > since we end up having to redo the loop-prevention check after > the mounts had been dissolved. And that check may be failing > by that time, with no way to undo that dissolving... To be clear this is a change in current semantics and has a minuscule change of resulting in a regression. That should be called out in the change log. If we choose to change the semantics I would suggest that the new semantics be: If a different name for a directory already exists: * Detach the mounts unconditionally (leaving dentry descendants alone). * Attempt the current splice. - If the splice succeeds ( return the new dentry ) - If the splice fails ( fail the lookup, and d_invalidate the existing name ) Unconditionally dissolving the mounts before attempting the rename should simplify everything. In the worst case a race between d_invalidate and d_splice_alias will now become a race to see who can detach the mounts first. Eric
Hello: This series was applied to jaegeuk/f2fs.git (dev) by Gabriel Krisman Bertazi <krisman@suse.de>: On Wed, 16 Aug 2023 01:07:54 -0400 you wrote: > Hi, > > This is v6 of the negative dentry on case-insensitive directories. > Thanks Eric for the review of the last iteration. This version > drops the patch to expose the helper to check casefolding directories, > since it is not necessary in ecryptfs and it might be going away. It > also addresses some documentation details, fix a build bot error and > simplifies the commit messages. See the changelog in each patch for > more details. > > [...] Here is the summary with links: - [f2fs-dev,v6,1/9] ecryptfs: Reject casefold directory inodes https://git.kernel.org/jaegeuk/f2fs/c/cd72c7ef5fed - [f2fs-dev,v6,2/9] 9p: Split ->weak_revalidate from ->revalidate (no matching commit) - [f2fs-dev,v6,3/9] fs: Expose name under lookup to d_revalidate hooks (no matching commit) - [f2fs-dev,v6,4/9] fs: Add DCACHE_CASEFOLDED_NAME flag (no matching commit) - [f2fs-dev,v6,5/9] libfs: Validate negative dentries in case-insensitive directories (no matching commit) - [f2fs-dev,v6,6/9] libfs: Chain encryption checks after case-insensitive revalidation (no matching commit) - [f2fs-dev,v6,7/9] libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops (no matching commit) - [f2fs-dev,v6,8/9] ext4: Enable negative dentries on case-insensitive lookup (no matching commit) - [f2fs-dev,v6,9/9] f2fs: Enable negative dentries on case-insensitive lookup (no matching commit) You are awesome, thank you!