Message ID | 1495776116-27890-1-git-send-email-gwshan@linux.vnet.ibm.com |
---|---|
State | Rejected |
Headers | show |
On Fri, May 26, 2017 at 3:21 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote: > When adding memory regions to global list, the overlapped one > (including it's descriptor) is purged. However, the descriptor might > be in BSS (not HEAP) section. It will trigger the exception caused > by assert() in mem_free() as below backtrace indicates. The memory > regions causing this are "ibm,os-reserve" and "hostboot-reserve-0" > separately. The former descriptor is resident in BSS section while > the later one is in HEAP: > > CPU 0050 Backtrace: > S: 0000000031d43a00 R: 00000000300136d0 .backtrace+0x2c > S: 0000000031d43a90 R: 00000000300191ec ._abort+0x4c > S: 0000000031d43b10 R: 0000000030019268 .assert_fail+0x34 > S: 0000000031d43b90 R: 0000000030015d08 .mem_free+0x64 > S: 0000000031d43c20 R: 0000000030017080 .__free+0x38 > S: 0000000031d43cb0 R: 0000000030015518 .add_region+0x230 > S: 0000000031d43d60 R: 0000000030016854 .mem_region_init+0x2c8 > S: 0000000031d43e30 R: 0000000030014994 .main_cpu_entry+0x3ec > S: 0000000031d43f00 R: 0000000030002648 boot_entry+0x198 > > This fixes the issue by doing nothing when trying to freeing memory > block which is resident in unmanaged (not HEAP) region. > > Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Not a fan. The specific bug you're seeing here is because of hostboot being broken on FSP machines and reserving memory that it shouldn't. This patch papers over the problem by making any invalid free() silently fail which is something we should avoid. > --- > core/mem_region.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/core/mem_region.c b/core/mem_region.c > index a314558..bfb5ac9 100644 > --- a/core/mem_region.c > +++ b/core/mem_region.c > @@ -469,9 +469,10 @@ void mem_free(struct mem_region *region, void *mem, const char *location) > if (!mem) > return; > > - /* Your memory is in the region, right? */ > - assert(mem >= region_start(region) + sizeof(*hdr)); > - assert(mem < region_start(region) + region->len); > + /* Give up if it is in unmanaged and invalid regions */ > + if (mem < (region_start(region) + sizeof(*hdr)) || > + mem >= (region_start(region) + region->len)) > + return; > > /* Grab header. */ > hdr = mem - sizeof(*hdr); > -- > 2.7.4 > > _______________________________________________ > Skiboot mailing list > Skiboot@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/skiboot
On Fri, May 26, 2017 at 04:12:24PM +1000, Oliver O'Halloran wrote: >On Fri, May 26, 2017 at 3:21 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote: >> When adding memory regions to global list, the overlapped one >> (including it's descriptor) is purged. However, the descriptor might >> be in BSS (not HEAP) section. It will trigger the exception caused >> by assert() in mem_free() as below backtrace indicates. The memory >> regions causing this are "ibm,os-reserve" and "hostboot-reserve-0" >> separately. The former descriptor is resident in BSS section while >> the later one is in HEAP: >> >> CPU 0050 Backtrace: >> S: 0000000031d43a00 R: 00000000300136d0 .backtrace+0x2c >> S: 0000000031d43a90 R: 00000000300191ec ._abort+0x4c >> S: 0000000031d43b10 R: 0000000030019268 .assert_fail+0x34 >> S: 0000000031d43b90 R: 0000000030015d08 .mem_free+0x64 >> S: 0000000031d43c20 R: 0000000030017080 .__free+0x38 >> S: 0000000031d43cb0 R: 0000000030015518 .add_region+0x230 >> S: 0000000031d43d60 R: 0000000030016854 .mem_region_init+0x2c8 >> S: 0000000031d43e30 R: 0000000030014994 .main_cpu_entry+0x3ec >> S: 0000000031d43f00 R: 0000000030002648 boot_entry+0x198 >> >> This fixes the issue by doing nothing when trying to freeing memory >> block which is resident in unmanaged (not HEAP) region. >> >> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> > >Not a fan. The specific bug you're seeing here is because of hostboot >being broken on FSP machines and reserving memory that it shouldn't. >This patch papers over the problem by making any invalid free() >silently fail which is something we should avoid. > Well, It's not to fail sliently. The memory block resident in bss/data can't simply free'd. I agree it's reasonable to avoid freeing the memory chunks, which in unmanaged from source. If you're going to do that, this patch can be dropped sliently. On the other hand, I also found the last skiboot can't boot and it's related to get_hb_reserved_me(). Everything becomes fine when it's simply commented out (or skipped). It seems the memory blocks (in device-tree) that should have reserved by kernel, aren't reserved properly. It leads to data/memory corruption eventually. I'm thinking the correct thing to do is mering the memory region with the overlapped one in mem_region.c. Cheers, Gavin >> --- >> core/mem_region.c | 7 ++++--- >> 1 file changed, 4 insertions(+), 3 deletions(-) >> >> diff --git a/core/mem_region.c b/core/mem_region.c >> index a314558..bfb5ac9 100644 >> --- a/core/mem_region.c >> +++ b/core/mem_region.c >> @@ -469,9 +469,10 @@ void mem_free(struct mem_region *region, void *mem, const char *location) >> if (!mem) >> return; >> >> - /* Your memory is in the region, right? */ >> - assert(mem >= region_start(region) + sizeof(*hdr)); >> - assert(mem < region_start(region) + region->len); >> + /* Give up if it is in unmanaged and invalid regions */ >> + if (mem < (region_start(region) + sizeof(*hdr)) || >> + mem >= (region_start(region) + region->len)) >> + return; >> >> /* Grab header. */ >> hdr = mem - sizeof(*hdr); >> -- >> 2.7.4 >> >> _______________________________________________ >> Skiboot mailing list >> Skiboot@lists.ozlabs.org >> https://lists.ozlabs.org/listinfo/skiboot >
On Fri, May 26, 2017 at 4:30 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote: > On Fri, May 26, 2017 at 04:12:24PM +1000, Oliver O'Halloran wrote: >>On Fri, May 26, 2017 at 3:21 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote: >>> When adding memory regions to global list, the overlapped one >>> (including it's descriptor) is purged. However, the descriptor might >>> be in BSS (not HEAP) section. It will trigger the exception caused >>> by assert() in mem_free() as below backtrace indicates. The memory >>> regions causing this are "ibm,os-reserve" and "hostboot-reserve-0" >>> separately. The former descriptor is resident in BSS section while >>> the later one is in HEAP: >>> >>> CPU 0050 Backtrace: >>> S: 0000000031d43a00 R: 00000000300136d0 .backtrace+0x2c >>> S: 0000000031d43a90 R: 00000000300191ec ._abort+0x4c >>> S: 0000000031d43b10 R: 0000000030019268 .assert_fail+0x34 >>> S: 0000000031d43b90 R: 0000000030015d08 .mem_free+0x64 >>> S: 0000000031d43c20 R: 0000000030017080 .__free+0x38 >>> S: 0000000031d43cb0 R: 0000000030015518 .add_region+0x230 >>> S: 0000000031d43d60 R: 0000000030016854 .mem_region_init+0x2c8 >>> S: 0000000031d43e30 R: 0000000030014994 .main_cpu_entry+0x3ec >>> S: 0000000031d43f00 R: 0000000030002648 boot_entry+0x198 >>> >>> This fixes the issue by doing nothing when trying to freeing memory >>> block which is resident in unmanaged (not HEAP) region. >>> >>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> >> >>Not a fan. The specific bug you're seeing here is because of hostboot >>being broken on FSP machines and reserving memory that it shouldn't. >>This patch papers over the problem by making any invalid free() >>silently fail which is something we should avoid. >> > > Well, It's not to fail sliently. The memory block resident in > bss/data can't simply free'd. I agree it's reasonable to avoid freeing > the memory chunks, which in unmanaged from source. If you're going > to do that, this patch can be dropped sliently. On the other hand, > I also found the last skiboot can't boot and it's related to get_hb_reserved_me(). > Everything becomes fine when it's simply commented out (or skipped). It seems the > memory blocks (in device-tree) that should have reserved by kernel, aren't > reserved properly. It leads to data/memory corruption eventually. I'm > thinking the correct thing to do is mering the memory region with the > overlapped one in mem_region.c. In this situation merging the overlapping reservations is fine, but we can't rely on that being true in general. We've had problems in the past with reservations for the OCC being overlapped with those for the hostboot runtime services. In those situations I would prefer we just crashed early rather than silently corrupting state. > > Cheers, > Gavin > >>> --- >>> core/mem_region.c | 7 ++++--- >>> 1 file changed, 4 insertions(+), 3 deletions(-) >>> >>> diff --git a/core/mem_region.c b/core/mem_region.c >>> index a314558..bfb5ac9 100644 >>> --- a/core/mem_region.c >>> +++ b/core/mem_region.c >>> @@ -469,9 +469,10 @@ void mem_free(struct mem_region *region, void *mem, const char *location) >>> if (!mem) >>> return; >>> >>> - /* Your memory is in the region, right? */ >>> - assert(mem >= region_start(region) + sizeof(*hdr)); >>> - assert(mem < region_start(region) + region->len); >>> + /* Give up if it is in unmanaged and invalid regions */ >>> + if (mem < (region_start(region) + sizeof(*hdr)) || >>> + mem >= (region_start(region) + region->len)) >>> + return; >>> >>> /* Grab header. */ >>> hdr = mem - sizeof(*hdr); >>> -- >>> 2.7.4 >>> >>> _______________________________________________ >>> Skiboot mailing list >>> Skiboot@lists.ozlabs.org >>> https://lists.ozlabs.org/listinfo/skiboot >> >
On Fri, May 26, 2017 at 04:57:19PM +1000, Oliver O'Halloran wrote: >On Fri, May 26, 2017 at 4:30 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote: >> On Fri, May 26, 2017 at 04:12:24PM +1000, Oliver O'Halloran wrote: >>>On Fri, May 26, 2017 at 3:21 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote: >>>> When adding memory regions to global list, the overlapped one >>>> (including it's descriptor) is purged. However, the descriptor might >>>> be in BSS (not HEAP) section. It will trigger the exception caused >>>> by assert() in mem_free() as below backtrace indicates. The memory >>>> regions causing this are "ibm,os-reserve" and "hostboot-reserve-0" >>>> separately. The former descriptor is resident in BSS section while >>>> the later one is in HEAP: >>>> >>>> CPU 0050 Backtrace: >>>> S: 0000000031d43a00 R: 00000000300136d0 .backtrace+0x2c >>>> S: 0000000031d43a90 R: 00000000300191ec ._abort+0x4c >>>> S: 0000000031d43b10 R: 0000000030019268 .assert_fail+0x34 >>>> S: 0000000031d43b90 R: 0000000030015d08 .mem_free+0x64 >>>> S: 0000000031d43c20 R: 0000000030017080 .__free+0x38 >>>> S: 0000000031d43cb0 R: 0000000030015518 .add_region+0x230 >>>> S: 0000000031d43d60 R: 0000000030016854 .mem_region_init+0x2c8 >>>> S: 0000000031d43e30 R: 0000000030014994 .main_cpu_entry+0x3ec >>>> S: 0000000031d43f00 R: 0000000030002648 boot_entry+0x198 >>>> >>>> This fixes the issue by doing nothing when trying to freeing memory >>>> block which is resident in unmanaged (not HEAP) region. >>>> >>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> >>> >>>Not a fan. The specific bug you're seeing here is because of hostboot >>>being broken on FSP machines and reserving memory that it shouldn't. >>>This patch papers over the problem by making any invalid free() >>>silently fail which is something we should avoid. >>> >> >> Well, It's not to fail sliently. The memory block resident in >> bss/data can't simply free'd. I agree it's reasonable to avoid freeing >> the memory chunks, which in unmanaged from source. If you're going >> to do that, this patch can be dropped sliently. On the other hand, >> I also found the last skiboot can't boot and it's related to get_hb_reserved_me(). >> Everything becomes fine when it's simply commented out (or skipped). It seems the >> memory blocks (in device-tree) that should have reserved by kernel, aren't >> reserved properly. It leads to data/memory corruption eventually. I'm >> thinking the correct thing to do is mering the memory region with the >> overlapped one in mem_region.c. > >In this situation merging the overlapping reservations is fine, but we >can't rely on that being true in general. We've had problems in the >past with reservations for the OCC being overlapped with those for the >hostboot runtime services. In those situations I would prefer we just >crashed early rather than silently corrupting state. > Yes, It makes sense. This patch can be dropped. Cheers, Gavin
diff --git a/core/mem_region.c b/core/mem_region.c index a314558..bfb5ac9 100644 --- a/core/mem_region.c +++ b/core/mem_region.c @@ -469,9 +469,10 @@ void mem_free(struct mem_region *region, void *mem, const char *location) if (!mem) return; - /* Your memory is in the region, right? */ - assert(mem >= region_start(region) + sizeof(*hdr)); - assert(mem < region_start(region) + region->len); + /* Give up if it is in unmanaged and invalid regions */ + if (mem < (region_start(region) + sizeof(*hdr)) || + mem >= (region_start(region) + region->len)) + return; /* Grab header. */ hdr = mem - sizeof(*hdr);
When adding memory regions to global list, the overlapped one (including it's descriptor) is purged. However, the descriptor might be in BSS (not HEAP) section. It will trigger the exception caused by assert() in mem_free() as below backtrace indicates. The memory regions causing this are "ibm,os-reserve" and "hostboot-reserve-0" separately. The former descriptor is resident in BSS section while the later one is in HEAP: CPU 0050 Backtrace: S: 0000000031d43a00 R: 00000000300136d0 .backtrace+0x2c S: 0000000031d43a90 R: 00000000300191ec ._abort+0x4c S: 0000000031d43b10 R: 0000000030019268 .assert_fail+0x34 S: 0000000031d43b90 R: 0000000030015d08 .mem_free+0x64 S: 0000000031d43c20 R: 0000000030017080 .__free+0x38 S: 0000000031d43cb0 R: 0000000030015518 .add_region+0x230 S: 0000000031d43d60 R: 0000000030016854 .mem_region_init+0x2c8 S: 0000000031d43e30 R: 0000000030014994 .main_cpu_entry+0x3ec S: 0000000031d43f00 R: 0000000030002648 boot_entry+0x198 This fixes the issue by doing nothing when trying to freeing memory block which is resident in unmanaged (not HEAP) region. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> --- core/mem_region.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)