Message ID | 20081205144810.GA25585@dastardly.home.dghda.com |
---|---|
State | Superseded, archived |
On Fri, 5 Dec 2008 14:48:10 +0000 "Duane Griffin" <duaneg@dghda.com> wrote:

> Hi folks,
>
> I am looking at a report of an intermittent BUG caused by an
> intentionally corrupted ext2 filesystem:
> http://bugzilla.kernel.org/show_bug.cgi?id=11412
>
> What I think is happening is generic_readlink gets the name via
> i_ops->follow_link and passes it into vfs_readlink, without it
> necessarily being validated anywhere. If the name is not
> NULL-terminated the strlen call in vfs_readlink may run off past the end
> of the page. I think this is potentially happening in
> page_follow_link_light, as well as ext2_follow_link, so it isn't just
> ext* that is affected.
>
> Does this sound correct, or have I missed something?
>
> Assuming this is a real problem, does anyone have a better solution than
> scanning the name for a \0 (in ext2_follow_link and
> page_follow_link_light) and returning -ENAMETOOLONG if we can't find
> one? I.e. something like this:

It would be nice to fix this in a single place, for all filesystems, for
all time. But how to do that?

> diff --git a/fs/ext2/symlink.c b/fs/ext2/symlink.c
> index 4e2426e..9b01af2 100644
> --- a/fs/ext2/symlink.c
> +++ b/fs/ext2/symlink.c
> @@ -24,8 +24,14 @@
>  static void *ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
>  {
>  	struct ext2_inode_info *ei = EXT2_I(dentry->d_inode);
> -	nd_set_link(nd, (char *)ei->i_data);
> -	return NULL;
> +	void *err = NULL;
> +
> +	if (memchr(ei->i_data, 0, sizeof(ei->i_data)) == NULL)
> +		err = ERR_PTR(-ENAMETOOLONG);
> +	else
> +		nd_set_link(nd, (char *)ei->i_data);
> +
> +	return err;
>  }

Perhaps nd_set_link() is a suitable place? Change that function so
that it is passed a third argument (max_len) and then check that within
nd_set_link(). Change nd_set_link() to return a __must_check-marked
errno, change callers to handle errors appropriately.

Or something totally different ;) But along those lines?
-- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
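For illustration, the nd_set_link() change suggested above might look roughly like the following userspace sketch. It is not kernel code: `struct nd_model` and `nd_set_link_checked()` are hypothetical names invented here, the real `struct nameidata` has more fields, and in the kernel the prototype would presumably also carry `__must_check` as suggested. The check itself is the same memchr() scan as in the posted patch, just moved into the setter.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct nameidata; only the
 * field needed to hold the link target is modelled. */
struct nd_model {
	const char *link;
};

/*
 * Hypothetical checked variant of nd_set_link(): accept the name only
 * if a terminating NUL is found within the first max_len bytes.
 * Returns 0 on success or -ENAMETOOLONG, leaving nd untouched on error.
 */
int nd_set_link_checked(struct nd_model *nd, const char *name,
			size_t max_len)
{
	assert(nd != NULL && name != NULL);

	/* Bounded scan: never reads past name[max_len - 1]. */
	if (memchr(name, '\0', max_len) == NULL)
		return -ENAMETOOLONG;

	nd->link = name;
	return 0;
}
```

Under the same assumptions, a caller such as ext2_follow_link() would pass sizeof(ei->i_data) as max_len and turn a non-zero return into an ERR_PTR().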
Duane Griffin wrote:
> Hi folks,
>
> I am looking at a report of an intermittent BUG caused by an
> intentionally corrupted ext2 filesystem:
> http://bugzilla.kernel.org/show_bug.cgi?id=11412
>
> What I think is happening is generic_readlink gets the name via
> i_ops->follow_link and passes it into vfs_readlink, without it
> necessarily being validated anywhere. If the name is not
> NULL-terminated the strlen call in vfs_readlink may run off past the end
> of the page. I think this is potentially happening in
> page_follow_link_light, as well as ext2_follow_link, so it isn't just
> ext* that is affected.
>
> Does this sound correct, or have I missed something?
>
> Assuming this is a real problem, does anyone have a better solution than
> scanning the name for a \0 (in ext2_follow_link and
> page_follow_link_light) and returning -ENAMETOOLONG if we can't find
> one? I.e. something like this:
>
> diff --git a/fs/ext2/symlink.c b/fs/ext2/symlink.c
> index 4e2426e..9b01af2 100644
> --- a/fs/ext2/symlink.c
> +++ b/fs/ext2/symlink.c
> @@ -24,8 +24,14 @@
>  static void *ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
>  {
>  	struct ext2_inode_info *ei = EXT2_I(dentry->d_inode);
> -	nd_set_link(nd, (char *)ei->i_data);
> -	return NULL;
> +	void *err = NULL;
> +
> +	if (memchr(ei->i_data, 0, sizeof(ei->i_data)) == NULL)
> +		err = ERR_PTR(-ENAMETOOLONG);
> +	else
> +		nd_set_link(nd, (char *)ei->i_data);
> +
> +	return err;
>  }

Here (as below), just zero the very last byte in the buffer.
When this buffer was first strcpy'd to, it included the null-terminated
string, which was then written to the inode on disk. When read back, the
string can be at most as long as the space allocated in the inode
(including the null). If intentionally damaged, the symlink will be
corrupted but the kernel is safe.

>
>  const struct inode_operations ext2_symlink_inode_operations = {
> diff --git a/fs/namei.c b/fs/namei.c
> index d34e0f9..f20e94b 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2750,29 +2750,49 @@ static char *page_getlink(struct dentry * dentry, struct page **ppage)
>  {
>  	struct page * page;
>  	struct address_space *mapping = dentry->d_inode->i_mapping;
> +	char *kaddr;
> +
>  	page = read_mapping_page(mapping, 0, NULL);
>  	if (IS_ERR(page))
>  		return (char*)page;
> +
> +	kaddr = kmap(page);
> +	if (memchr(kaddr, 0, PAGE_SIZE) == NULL) {
> +		kunmap(kaddr);
> +		page_cache_release(page);
> +		return ERR_PTR(-ENAMETOOLONG);
> +	}
> +

You don't need to search and fail here. All you need is to NULL-terminate
at i_size_read() + 1 of the inode. The length of the written string was
set at write time; the buffer is big, since it is a full page, and I think
symlinks are limited to less than that. If someone damaged the symlink
inode on disk (set a different size) then the data is corrupted but the
kernel is still safe.

>  	*ppage = page;
> -	return kmap(page);
> +	return kaddr;
>  }
>
>  int page_readlink(struct dentry *dentry, char __user *buffer, int buflen)
>  {
> +	int res;
>  	struct page *page = NULL;
>  	char *s = page_getlink(dentry, &page);
> -	int res = vfs_readlink(dentry,buffer,buflen,s);
> +
> +	if (IS_ERR(s))
> +		return PTR_ERR(s);
> +

The above will not fail, so this change is not needed.

> +	res = vfs_readlink(dentry, buffer, buflen, s);
>  	if (page) {
>  		kunmap(page);
>  		page_cache_release(page);
>  	}
> +
>  	return res;
>  }
>
>  void *page_follow_link_light(struct dentry *dentry, struct nameidata *nd)
>  {
>  	struct page *page = NULL;
> -	nd_set_link(nd, page_getlink(dentry, &page));
> +	char *name = page_getlink(dentry, &page);
> +	if (IS_ERR(name))
> +		return name;
> +

Same here.

> +	nd_set_link(nd, name);
>  	return page;
>  }
>
> Cheers,
> Duane.
>

I hit this problem too, while developing a filesystem that was based
on ext2. The reason that it works is that the remainder of a page is
always zeroed out on writes. Then when read, you receive back your
zero-terminated link. (Which means that if you have a symlink of exactly
4k it will BUG, but I guess that is not possible.)

The solution is to use the i_size information for the string length, and
zero-terminate at i_size + 1.

The way I fixed it is that I zero out the last page's remainder on read,
and not on write like ext2 and others do. (A symlink is less than 4k,
right?)

Boaz
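A minimal userspace sketch of the "terminate at i_size" idea follows. It assumes that "i_size + 1" means the byte just past the i_size bytes of the name (offset i_size counting from zero), and it clamps a corrupted i_size to the buffer, which the message above implies but does not spell out. `terminate_link()` is a hypothetical helper, not an existing kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Force termination of a symlink body held in a buffer of buf_size
 * bytes. i_size is the length recorded in the (possibly corrupted)
 * inode. The NUL goes at offset i_size, clamped to the last byte of
 * the buffer so that a bogus i_size cannot cause a write outside it.
 * Returns the offset that was zeroed.
 */
size_t terminate_link(char *buf, size_t buf_size, size_t i_size)
{
	size_t end;

	assert(buf != NULL && buf_size > 0);

	end = i_size < buf_size ? i_size : buf_size - 1;
	buf[end] = '\0';
	return end;
}
```

With a sane i_size the write lands on padding that is already zero, so nothing changes; only a damaged link is altered, and then only by cutting it short.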
Hi Boaz, thanks for your review and comments...

2008/12/9 Boaz Harrosh <bharrosh@panasas.com>:
>> diff --git a/fs/ext2/symlink.c b/fs/ext2/symlink.c
>> index 4e2426e..9b01af2 100644
>> --- a/fs/ext2/symlink.c
>> +++ b/fs/ext2/symlink.c
>> @@ -24,8 +24,14 @@
>>  static void *ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
>>  {
>>  	struct ext2_inode_info *ei = EXT2_I(dentry->d_inode);
>> -	nd_set_link(nd, (char *)ei->i_data);
>> -	return NULL;
>> +	void *err = NULL;
>> +
>> +	if (memchr(ei->i_data, 0, sizeof(ei->i_data)) == NULL)
>> +		err = ERR_PTR(-ENAMETOOLONG);
>> +	else
>> +		nd_set_link(nd, (char *)ei->i_data);
>> +
>> +	return err;
>>  }
>
> Here (as below), just zero the very last byte in the buffer.
> When this buffer was first strcpy'd to, it included the null-terminated
> string, which was then written to the inode on disk. When read back, the
> string can be at most as long as the space allocated in the inode
> (including the null). If intentionally damaged, the symlink will be
> corrupted but the kernel is safe.

I considered this approach. Filesystems that allocate buffers for the
name (e.g. XFS) already tend to unconditionally NULL-terminate it, so
this is a non-issue for them. However others (including ext2) do not
allocate a buffer, instead pointing to the in-memory data representing
the on-disk data. If we NULL-terminate in those cases the in-memory
and on-disk data would differ. If the kernel writes out the data for
some other reason (say after updating atime) then we may
unintentionally modify the link target. That may not be a serious
problem in practice, but it doesn't feel right.

However, if the FS maintainers don't have a problem with it, it will
certainly be cleaner and easier to implement than scanning. Opinions?

[snip]

> I hit this problem too, while developing a filesystem that was based
> on ext2. The reason that it works is that the remainder of a page is
> always zeroed out on writes. Then when read, you receive back your
> zero-terminated link. (Which means that if you have a symlink of exactly
> 4k it will BUG, but I guess that is not possible.)

It is not possible for an uncorrupted symlink :)

> The solution is to use the i_size information for the string length, and
> zero-terminate at i_size + 1.
>
> The way I fixed it is that I zero out the last page's remainder on read,
> and not on write like ext2 and others do. (A symlink is less than 4k,
> right?)

Right. If PATH_MAX is larger than PAGE_SIZE no doubt all sorts of
things would start going horribly wrong.

Cheers,
Duane.
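The concern about in-memory and on-disk data diverging can be sidestepped by terminating a copy and leaving the original alone, which is roughly what filesystems that allocate a name buffer already do. The following is a userspace sketch only; `copy_link_terminated()` is a hypothetical name, and the extra buffer is a cost that the in-place approach avoids.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most src_size bytes of a possibly unterminated name into dst
 * and always terminate the copy. src is never modified, so a buffer
 * that mirrors on-disk data stays byte-for-byte identical to the disk.
 * Returns the length of the resulting string.
 */
size_t copy_link_terminated(char *dst, size_t dst_size,
			    const char *src, size_t src_size)
{
	const char *nul;
	size_t len;

	assert(dst != NULL && src != NULL && dst_size > 0);

	/* Bounded search: stops at src_size even if no NUL is present. */
	nul = memchr(src, '\0', src_size);
	len = nul ? (size_t)(nul - src) : src_size;

	/* Leave room for the terminator in the destination. */
	if (len > dst_size - 1)
		len = dst_size - 1;

	memcpy(dst, src, len);
	dst[len] = '\0';
	return len;
}
```

Whether the extra copy is worth it compared with writing one zero byte in place is exactly the judgment call being discussed in the message above.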
Duane Griffin wrote:
> Hi Boaz, thanks for your review and comments...
>
> 2008/12/9 Boaz Harrosh <bharrosh@panasas.com>:
>>> diff --git a/fs/ext2/symlink.c b/fs/ext2/symlink.c
>>> index 4e2426e..9b01af2 100644
>>> --- a/fs/ext2/symlink.c
>>> +++ b/fs/ext2/symlink.c
>>> @@ -24,8 +24,14 @@
>>>  static void *ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
>>>  {
>>>  	struct ext2_inode_info *ei = EXT2_I(dentry->d_inode);
>>> -	nd_set_link(nd, (char *)ei->i_data);
>>> -	return NULL;
>>> +	void *err = NULL;
>>> +
>>> +	if (memchr(ei->i_data, 0, sizeof(ei->i_data)) == NULL)
>>> +		err = ERR_PTR(-ENAMETOOLONG);
>>> +	else
>>> +		nd_set_link(nd, (char *)ei->i_data);
>>> +
>>> +	return err;
>>>  }
>> Here (as below), just zero the very last byte in the buffer.
>> When this buffer was first strcpy'd to, it included the null-terminated
>> string, which was then written to the inode on disk. When read back, the
>> string can be at most as long as the space allocated in the inode
>> (including the null). If intentionally damaged, the symlink will be
>> corrupted but the kernel is safe.
>
> I considered this approach. Filesystems that allocate buffers for the
> name (e.g. XFS) already tend to unconditionally NULL-terminate it, so
> this is a non-issue for them. However others (including ext2) do not
> allocate a buffer, instead pointing to the in-memory data representing
> the on-disk data. If we NULL-terminate in those cases the in-memory
> and on-disk data would differ. If the kernel writes out the data for
> some other reason (say after updating atime) then we may
> unintentionally modify the link target. That may not be a serious
> problem in practice, but it doesn't feel right.
>

I just want to make sure that you understand the code above and convince
you that this can/should be done and will damage nothing.

The code you see above is only for links that are shorter than some
constant. ext2 (and other filesystems) will cache this case and write the
symlink directly into the inode, which will then have 0 data blocks. The
space allocated in the inode is constant and is chosen for good inode
packing on disk. The inode starts empty; then, if a symlink is short, the
string is strcpy'd to the above buffer.

So even if intentional damage was done to on-disk data, putting another
null at the end will never hurt. At most it is redundant since there is
another one preceding. But in the case of damage the damage is fixed. No
information can ever be lost.

For symlinks that are longer than the above constant, 1 data block is
allocated and the symlink is written, padded by zeros. This is taken care
of by the generic layer in the code you patched at fs/namei.c. Terminating
at i_size + 1 will never reach the disk since only i_size bytes are ever
written.

> However, if the FS maintainers don't have a problem with it, it will
> certainly be cleaner and easier to implement than scanning. Opinions?
>
> [snip]
>
>> I hit this problem too, while developing a filesystem that was based
>> on ext2. The reason that it works is that the remainder of a page is
>> always zeroed out on writes. Then when read, you receive back your
>> zero-terminated link. (Which means that if you have a symlink of exactly
>> 4k it will BUG, but I guess that is not possible.)
>
> It is not possible for an uncorrupted symlink :)
>
>> The solution is to use the i_size information for the string length, and
>> zero-terminate at i_size + 1.
>>
>> The way I fixed it is that I zero out the last page's remainder on read,
>> and not on write like ext2 and others do. (A symlink is less than 4k,
>> right?)
>
> Right. If PATH_MAX is larger than PAGE_SIZE no doubt all sorts of
> things would start going horribly wrong.

Right, that's what I thought. So my approach should be safe: zero out at
i_size + 1.

>
> Cheers,
> Duane.
>

Thanks
Boaz
On Mon, Dec 08, 2008 at 02:30:03PM -0800, Andrew Morton wrote:
> Perhaps nd_set_link() is a suitable place? Change that function so
> that it is passed a third argument (max_len) and then check that within
> nd_set_link(). Change nd_set_link() to return a __must_check-marked
> errno, change callers to handle errors appropriately.
>
> Or something totally different ;) But along those lines?

Note that XFS and possibly other filesystems don't store the NULL
termination on disk. So having a follow_link interface that uses a
counted string would be a nice little optimization for the XFS
follow_link / readlink implementation. But I'm not really sure it's
worth complicating the VFS for that little gem.
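To make the counted-string idea concrete, here is a small userspace sketch. `struct counted_link` and `readlink_counted()` are hypothetical and not part of any VFS interface; the point is only that with an explicit length no strlen() is needed and the stored name does not have to be terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted link name: a pointer plus an explicit length,
 * with no terminator required in the underlying storage. */
struct counted_link {
	const char *name;
	size_t len;
};

/*
 * Copy a counted link into a caller buffer the way readlink(2) reports
 * it: at most buflen bytes, no terminator appended, and the return
 * value is the number of bytes copied. strlen() is never called.
 */
size_t readlink_counted(const struct counted_link *link, char *buffer,
			size_t buflen)
{
	size_t n;

	assert(link != NULL && buffer != NULL);

	n = link->len < buflen ? link->len : buflen;
	memcpy(buffer, link->name, n);
	return n;
}
```

The cost mentioned above is that every follow_link implementation and caller would have to carry the length alongside the pointer.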
Christoph Hellwig wrote:
> On Mon, Dec 08, 2008 at 02:30:03PM -0800, Andrew Morton wrote:
>> Perhaps nd_set_link() is a suitable place? Change that function so
>> that it is passed a third argument (max_len) and then check that within
>> nd_set_link(). Change nd_set_link() to return a __must_check-marked
>> errno, change callers to handle errors appropriately.
>>
>> Or something totally different ;) But along those lines?
>
> Note that XFS and possibly other filesystems don't store the NULL
> termination on disk.

Note that ext2, for example, also only writes the string bytes without
any NULLs. It only happens to be zero-padded because any last page is
zero-padded from i_size to the end of the page.

> So having a follow_link interface that uses a
> counted string would be a nice little optimization for the XFS
> follow_link / readlink implementation. But I'm not really sure it's
> worth complicating the VFS for that little gem.
>

The inode's i_size already holds the string count, so at the higher level
we have that information. But I'm convinced: nd_set_link() should receive
a new max_len, and all users should be changed as a matter of code audit.
nd_set_link() should then proceed to truncate the string at that length
unconditionally; no need for error returns.

My $0.017
Boaz
2008/12/9 Boaz Harrosh <bharrosh@panasas.com>:
> I just want to make sure that you understand the code above and convince
> you that this can/should be done and will damage nothing.

No problem, it never hurts to spell things out in detail :)

> The code you see above is only for links that are shorter than some
> constant. ext2 (and other filesystems) will cache this case and write the
> symlink directly into the inode, which will then have 0 data blocks. The
> space allocated in the inode is constant and is chosen for good inode
> packing on disk. The inode starts empty; then, if a symlink is short, the
> string is strcpy'd to the above buffer.

Sure, I understand this.

> So even if intentional damage was done to on-disk data, putting another
> null at the end will never hurt. At most it is redundant since there is
> another one preceding. But in the case of damage the damage is fixed. No
> information can ever be lost.

If the link name on disk is corrupted and the last character is not a
zero then it may be changed. That is information lost, albeit probably
not *useful* information. I wouldn't like to make such a change
without an OK from the FS maintainers, but I'll happily do it with
one. So far we seem to have one vote in favour and none against. :)

Note that the corruption doesn't have to be intentional, of course,
although it was in the particular bug report I was looking at
originally.

> For symlinks that are longer than the above constant, 1 data block is
> allocated and the symlink is written, padded by zeros. This is taken care
> of by the generic layer in the code you patched at fs/namei.c. Terminating
> at i_size + 1 will never reach the disk since only i_size bytes are ever
> written.

If the data on disk is corrupted such that i_size == PAGE_SIZE then
again we would have different data in-memory and on-disk. However in
this case the page would only contain the link name and so wouldn't be
dirtied and written out unless the name was changed anyway. So I
agree, it should be safe in this case, regardless of my concerns
above.

Thanks,
Duane.
2008/12/9 Boaz Harrosh <bharrosh@panasas.com>:
> Christoph Hellwig wrote:
>> On Mon, Dec 08, 2008 at 02:30:03PM -0800, Andrew Morton wrote:
>>> Perhaps nd_set_link() is a suitable place? Change that function so
>>> that it is passed a third argument (max_len) and then check that within
>>> nd_set_link(). Change nd_set_link() to return a __must_check-marked
>>> errno, change callers to handle errors appropriately.
>>>
>>> Or something totally different ;) But along those lines?
>>
>> Note that XFS and possibly other filesystems don't store the NULL
>> termination on disk.
>
> Note that ext2, for example, also only writes the string bytes without
> any NULLs. It only happens to be zero-padded because any last page is
> zero-padded from i_size to the end of the page.
>
>> So having a follow_link interface that uses a
>> counted string would be a nice little optimization for the XFS
>> follow_link / readlink implementation. But I'm not really sure it's
>> worth complicating the VFS for that little gem.
>
> The inode's i_size already holds the string count, so at the higher level
> we have that information. But I'm convinced: nd_set_link() should receive
> a new max_len, and all users should be changed as a matter of code audit.
> nd_set_link() should then proceed to truncate the string at that length
> unconditionally; no need for error returns.

I've looked at a few alternative options: scanning for NULLs,
NULL-terminating in nd_set_link, NULL-terminating in the FS code
(where it is necessary and not already being done), and passing the
length around explicitly.

NULL-terminating is definitely cleaner and easier than scanning.
Unfortunately, as Christoph indicated, passing the length around
explicitly does rather complicate the code. So the question is whether
to NULL-terminate in nd_set_link or earlier in the FS code.

Having tried both options, I'm inclined to do it in the FS code and
leave nd_set_link as it is. Many of the filesystems already take pains
to ensure the links are NULL-terminated and the minimal change of
fixing the others seems the safest option. However, this way we won't
solve things for all filesystems for all time, as Andrew wanted.

I'll post my preferred patches shortly, but if anyone would like to
see what the full nd_set_link change would look like let me know and
I'll post them for comparison.

FYI, here are the diffstats for the two options:

Terminating in FS code:

     fs/9p/vfs_inode.c        |    5 +++--
     fs/befs/linuxvfs.c       |    5 ++++-
     fs/ecryptfs/inode.c      |    3 ++-
     fs/ext2/symlink.c        |    4 +++-
     fs/ext3/symlink.c        |    4 +++-
     fs/ext4/symlink.c        |    4 +++-
     fs/freevxfs/vxfs_immed.c |    1 +
     fs/jfs/symlink.c         |    2 ++
     fs/namei.c               |    8 ++++++--
     fs/sysv/symlink.c        |    4 +++-
     fs/ufs/symlink.c         |    4 +++-
     11 files changed, 33 insertions(+), 11 deletions(-)

Adding length param and terminating in nd_set_link (but not removing
all the existing FS termination code):

     fs/9p/vfs_inode.c           |   10 +++++-----
     fs/autofs/symlink.c         |    2 +-
     fs/autofs4/symlink.c        |    2 +-
     fs/befs/linuxvfs.c          |   14 ++++++++++++--
     fs/cifs/link.c              |    8 ++------
     fs/configfs/symlink.c       |    4 ++--
     fs/debugfs/file.c           |    2 +-
     fs/ecryptfs/inode.c         |   20 ++++++++++----------
     fs/ext2/symlink.c           |    2 +-
     fs/ext3/symlink.c           |    2 +-
     fs/ext4/symlink.c           |    2 +-
     fs/freevxfs/vxfs_immed.c    |    2 +-
     fs/fuse/dir.c               |    2 +-
     fs/jffs2/symlink.c          |    2 +-
     fs/jfs/symlink.c            |    3 ++-
     fs/namei.c                  |   11 +++++++++--
     fs/nfs/symlink.c            |    4 ++--
     fs/proc/generic.c           |    2 +-
     fs/smbfs/symlink.c          |    8 ++++----
     fs/sysfs/symlink.c          |    2 +-
     fs/sysv/symlink.c           |    3 ++-
     fs/ubifs/file.c             |    2 +-
     fs/ufs/symlink.c            |    2 +-
     fs/xfs/linux-2.6/xfs_iops.c |    4 ++--
     include/linux/namei.h       |    4 +++-
     mm/shmem.c                  |    4 ++--
     26 files changed, 70 insertions(+), 53 deletions(-)

Cheers,
Duane.
    diff --git a/fs/ext2/symlink.c b/fs/ext2/symlink.c
    index 4e2426e..9b01af2 100644
    --- a/fs/ext2/symlink.c
    +++ b/fs/ext2/symlink.c
    @@ -24,8 +24,14 @@
     static void *ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
     {
     	struct ext2_inode_info *ei = EXT2_I(dentry->d_inode);
    -	nd_set_link(nd, (char *)ei->i_data);
    -	return NULL;
    +	void *err = NULL;
    +
    +	if (memchr(ei->i_data, 0, sizeof(ei->i_data)) == NULL)
    +		err = ERR_PTR(-ENAMETOOLONG);
    +	else
    +		nd_set_link(nd, (char *)ei->i_data);
    +
    +	return err;
     }
     
     const struct inode_operations ext2_symlink_inode_operations = {
    diff --git a/fs/namei.c b/fs/namei.c
    index d34e0f9..f20e94b 100644
    --- a/fs/namei.c
    +++ b/fs/namei.c
    @@ -2750,29 +2750,49 @@ static char *page_getlink(struct dentry * dentry, struct page **ppage)
     {
     	struct page * page;
     	struct address_space *mapping = dentry->d_inode->i_mapping;
    +	char *kaddr;
    +
     	page = read_mapping_page(mapping, 0, NULL);
     	if (IS_ERR(page))
     		return (char*)page;
    +
    +	kaddr = kmap(page);
    +	if (memchr(kaddr, 0, PAGE_SIZE) == NULL) {
    +		kunmap(kaddr);
    +		page_cache_release(page);
    +		return ERR_PTR(-ENAMETOOLONG);
    +	}
    +
     	*ppage = page;
    -	return kmap(page);
    +	return kaddr;
     }
     
     int page_readlink(struct dentry *dentry, char __user *buffer, int buflen)
     {
    +	int res;
     	struct page *page = NULL;
     	char *s = page_getlink(dentry, &page);
    -	int res = vfs_readlink(dentry,buffer,buflen,s);
    +
    +	if (IS_ERR(s))
    +		return PTR_ERR(s);
    +
    +	res = vfs_readlink(dentry, buffer, buflen, s);
     	if (page) {
     		kunmap(page);
     		page_cache_release(page);
     	}
    +
     	return res;
     }
     
     void *page_follow_link_light(struct dentry *dentry, struct nameidata *nd)
     {
     	struct page *page = NULL;
    -	nd_set_link(nd, page_getlink(dentry, &page));
    +	char *name = page_getlink(dentry, &page);
    +	if (IS_ERR(name))
    +		return name;
    +
    +	nd_set_link(nd, name);
     	return page;
     }