Message ID | 1342786906-12634-1-git-send-email-Shaohui.Xie@freescale.com (mailing list archive) |
---|---|
State | Rejected, archived |
On Fri, 2012-07-20 at 20:21 +0800, Shaohui Xie wrote:
> The PowerPC platform only supports the ZONE_DMA zone for the 64-bit kernel,
> so all memory will be put into this zone. If the memory size is greater than
> the device's DMA capability and the device uses dma_alloc_coherent to
> allocate memory, it will get an address which is beyond the device's DMA
> addressing range, and the device will fail.
>
> So we split the memory into two zones by adding a ZONE_NORMAL zone. Since we
> already allocate PCICSRBAR/PEXCSRBAR right below the 4G boundary (if the
> lowest PCI address is above 4G), we constrain the DMA zone ZONE_DMA to 2GB.
> Also, we clear the flag __GFP_DMA and set it only if the device's
> dma_mask < total memory size. By doing this, devices which cannot DMA all
> the memory will be limited to ZONE_DMA, but devices which can DMA all the
> memory will not be affected by this limitation.

This is wrong.

Don't you have an iommu to deal with those devices anyway ? What about
swiotlb ?

If you *really* need to honor 32 (or 31 even) bit DMAs, what you -may- want
to do is create a ZONE_DMA32 like other architectures, do not hijack the
historical ZONE_DMA.

But even then, I'm dubious this is really needed.

Cheers,
Ben.
On 07/23/2012 01:06 AM, Benjamin Herrenschmidt wrote:
> On Fri, 2012-07-20 at 20:21 +0800, Shaohui Xie wrote:
>> The PowerPC platform only supports the ZONE_DMA zone for the 64-bit kernel,
>> so all memory will be put into this zone. If the memory size is greater
>> than the device's DMA capability and the device uses dma_alloc_coherent to
>> allocate memory, it will get an address which is beyond the device's DMA
>> addressing range, and the device will fail.
>>
>> So we split the memory into two zones by adding a ZONE_NORMAL zone. Since
>> we already allocate PCICSRBAR/PEXCSRBAR right below the 4G boundary (if the
>> lowest PCI address is above 4G), we constrain the DMA zone ZONE_DMA to 2GB.
>> Also, we clear the flag __GFP_DMA and set it only if the device's
>> dma_mask < total memory size. By doing this, devices which cannot DMA all
>> the memory will be limited to ZONE_DMA, but devices which can DMA all the
>> memory will not be affected by this limitation.
>
> This is wrong.

How so?

> Don't you have an iommu to deal with those devices anyway ?

Yes, but we don't yet have DMA API support for it, it would lower performance
because we'd have to use a lot of subwindows which are poorly cached (and
even then we wouldn't be able to map more than 256 pages at once on a given
device), and the IOMMU may not be available at all if we're being
virtualized.

> What about swiotlb ?

That doesn't help with alloc_coherent().

> If you *really* need to honor 32 (or 31 even) bit DMAs,

31-bit is to accommodate PCI, which has PEXCSRBAR that must live under 4 GiB
and can't be disabled.

> what you -may- want to do is create a ZONE_DMA32 like other architectures,
> do not hijack the historical ZONE_DMA.

Could you point me to somewhere that clearly defines what ZONE_DMA is to be
used for, such that this counts as hijacking (but using ZONE_DMA32 to mean
31-bit wouldn't)? The only arches I see using ZONE_DMA32 (x86 and mips) also
have a separate, more restrictive ZONE_DMA. PowerPC doesn't. It uses ZONE_DMA
to point to all of memory (except highmem on 32-bit) -- how is that not
hijacking, if this is? We can't have ZONE_DMA be less restrictive than
ZONE_DMA32, because the fallback rules are hardcoded the other way around in
generic code.

The exact threshold for ZONE_DMA could be made platform-configurable.

> But even then, I'm dubious this is really needed.

We'd like our drivers to stop crashing with more than 4GiB of RAM on 64-bit.

-Scott
On Mon, 2012-07-23 at 11:17 -0500, Scott Wood wrote:
>> This is wrong.
>
> How so?
>
>> Don't you have an iommu to deal with those devices anyway ?
>
> Yes, but we don't yet have DMA API support for it, it would lower
> performance because we'd have to use a lot of subwindows which are
> poorly cached (and even then we wouldn't be able to map more than 256
> pages at once on a given device), and the IOMMU may not be available at
> all if we're being virtualized.

Ugh ? You mean some designers need to be fired urgently and wasted
everybody's time implementing an unusable iommu ? Nice one ...

>> What about swiotlb ?
>
> That doesn't help with alloc_coherent().

Somewhat... you can use the pool, but it sucks.

>> If you *really* need to honor 32 (or 31 even) bit DMAs,
>
> 31-bit is to accommodate PCI, which has PEXCSRBAR that must live under 4
> GiB and can't be disabled.
>
>> what you -may- want to do is create a ZONE_DMA32 like other architectures,
>> do not hijack the historical ZONE_DMA.
>
> Could you point me to somewhere that clearly defines what ZONE_DMA is to
> be used for, such that this counts as hijacking (but using ZONE_DMA32 to
> mean 31-bit wouldn't)?

Habit and history. ZONE_DMA used to be about ISA DMA, which doesn't apply to
us, and in general ZONE_NORMAL aliases to it iirc. It's old stuff I haven't
looked at for a long time. However, my understanding is that what you are
trying to solve is exactly what ZONE_DMA32 was created for.

> The only arches I see using ZONE_DMA32 (x86 and
> mips) also have a separate, more restrictive ZONE_DMA.

And ? Who cares ? Drivers who know about a 32-bit limitation use GFP_DMA32,
that's what is expected, don't mess around with ZONE_DMA.

> PowerPC doesn't.
> It uses ZONE_DMA to point to all of memory (except highmem on 32-bit)
> -- how is that not hijacking, if this is? We can't have ZONE_DMA be
> less restrictive than ZONE_DMA32, because the fallback rules are
> hardcoded the other way around in generic code.
>
> The exact threshold for ZONE_DMA could be made platform-configurable.
>
>> But even then, I'm dubious this is really needed.
>
> We'd like our drivers to stop crashing with more than 4GiB of RAM on
> 64-bit.

Fix your HW :-)

Cheers,
Ben.
On Mon, Jul 23, 2012 at 5:20 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> And ? Who cares ? Drivers who know about a 32-bit limitation use
> GFP_DMA32, that's what is expected, don't mess around with ZONE_DMA.

I thought drivers are supposed to set a dma_mask, and dma_alloc_coherent()
is supposed to use that to figure out how to honor that mask.
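For reference, the convention Timur describes, where a driver only declares
its addressing capability and never picks a zone itself, looks roughly like
this. It is a minimal sketch: the probe function, the 36-bit mask, and the
buffer size are illustrative, not taken from any posted driver.

```c
#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* Minimal sketch of the expected driver pattern: set the DMA masks once,
 * then let dma_alloc_coherent() worry about where the memory comes from. */
static int example_probe(struct device *dev)
{
	dma_addr_t dma_handle;
	void *vaddr;

	/* Declare what the device can address; no GFP_DMA* flags anywhere. */
	if (dma_set_mask(dev, DMA_BIT_MASK(36)) ||
	    dma_set_coherent_mask(dev, DMA_BIT_MASK(36)))
		return -EIO;

	/* The DMA API is expected to honor the masks set above. */
	vaddr = dma_alloc_coherent(dev, PAGE_SIZE, &dma_handle, GFP_KERNEL);
	if (!vaddr)
		return -ENOMEM;

	dma_free_coherent(dev, PAGE_SIZE, vaddr, dma_handle);
	return 0;
}
```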
On Mon, 2012-07-23 at 23:08 +0000, Tabi Timur-B04825 wrote:
>
> > And ? Who cares ? Drivers who know about a 32-bit limitation use
> > GFP_DMA32, that's what is expected, don't mess around with ZONE_DMA.
>
> I thought drivers are supposed to set a dma_mask, and
> dma_alloc_coherent() is supposed to use that to figure out how to
> honor that mask.

Sure, that's the right way to go, I meant the bits and pieces of the
infrastructure in between. Why diverge from other archs gratuitously here ?

Cheers,
Ben.
Benjamin Herrenschmidt wrote:
> Sure, that's the right way to go, I meant the bits and pieces of the
> infrastructure in between. Why diverge from other archs gratuitously
> here ?

Ok, I'm confused. Are you suggesting that drivers do this:

	u64 fsl_dma_dmamask = DMA_BIT_MASK(36);
	dev->dma_mask = &fsl_dma_dmamask;
	v = dma_alloc_coherent(dev, ..., GFP_DMA32);

That is, set the DMA mask *and* set GFP_DMA32? That seems redundant. I don't
understand why a driver would set GFP_DMA32 if it has already set the mask.
On Mon, 2012-07-23 at 18:15 -0500, Timur Tabi wrote:
> Benjamin Herrenschmidt wrote:
>> Sure, that's the right way to go, I meant the bits and pieces of the
>> infrastructure in between. Why diverge from other archs gratuitously
>> here ?
>
> Ok, I'm confused. Are you suggesting that drivers do this:
>
> 	u64 fsl_dma_dmamask = DMA_BIT_MASK(36);
> 	dev->dma_mask = &fsl_dma_dmamask;
> 	v = dma_alloc_coherent(dev, ..., GFP_DMA32);
>
> That is, set the DMA mask *and* set GFP_DMA32? That seems redundant.

No, but dma_alloc_coherent would under the hood.

> I don't understand why a driver would set GFP_DMA32 if it has already set
> the mask.

The layers in between, not the well behaved drivers. Again, we have
ZONE_DMA32 specifically for the purpose, why use something else ?

In any case, make the whole thing at the very least a config option, I don't
want sane HW to have to deal with split zones.

Cheers,
Ben.
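What Ben means by "under the hood" might look roughly like the following
inside an architecture's coherent allocator: the allocator, not the driver,
translates the device's mask into a zone modifier. This is a sketch using
ZONE_DMA32/__GFP_DMA32 as Ben suggests, rather than the __GFP_DMA approach in
Shaohui's patch; the helper name is hypothetical and it assumes the platform
has configured a ZONE_DMA32.

```c
#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/memblock.h>

/*
 * Hypothetical helper: derive a zone modifier from the device's coherent
 * mask so that callers can pass plain GFP_KERNEL.
 */
static gfp_t zone_gfp_for_device(struct device *dev, gfp_t flag)
{
	/* Region specifiers are the allocator's business, not the caller's. */
	flag &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM);

	/* If the mask cannot reach the last byte of RAM, restrict the zone. */
	if (dev->coherent_dma_mask < memblock_end_of_DRAM() - 1)
		flag |= __GFP_DMA32;

	return flag;
}
```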
On Tue, 2012-07-24 at 09:29 +1000, Benjamin Herrenschmidt wrote:
> The layers in between, not the well behaved drivers. Again, we have
> ZONE_DMA32 specifically for the purpose, why use something else ?
>
> In any case, make the whole thing at the very least a config option, I
> don't want sane HW to have to deal with split zones.

Or if possible a flag set by machine probe()

Cheers,
Ben.
Benjamin Herrenschmidt wrote:
> No, but dma_alloc_coherent would under the hood.

Which is what Shaohui's patch does. Well, it does it for GFP_DMA instead of
GFP_DMA32, but still. When you said, "Drivers who know about a 32-bit
limitation use GFP_DMA32", I thought you meant that drivers should *set*
GFP_DMA32.

>> I don't understand why a driver would set GFP_DMA32 if it has already set
>> the mask.
>
> The layers in between, not the well behaved drivers. Again, we have
> ZONE_DMA32 specifically for the purpose, why use something else ?
>
> In any case, make the whole thing at the very least a config option, I
> don't want sane HW to have to deal with split zones.

The DMA zone only kicks in if the DMA mask is set to a size smaller than
available physical memory. Sane HW should set the DMA mask to
DMA_BIT_MASK(36). And we have plenty of sane HW on our SOCs, but not every
device is like that.

The whole point behind this patch is that some drivers are setting a DMA mask
of 32, but still getting memory above 4GB.
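The symptom Timur describes, a 32-bit mask being "honored" with memory above
4GB, can be made visible with a check like the one below. This is an
illustrative wrapper, not part of the patch; the function name is
hypothetical.

```c
#include <linux/device.h>
#include <linux/dma-mapping.h>

/* Illustrative only: warn when a coherent allocation lands outside the
 * device's declared coherent DMA mask (the crash scenario being discussed). */
static void *checked_alloc_coherent(struct device *dev, size_t size,
				    dma_addr_t *handle, gfp_t gfp)
{
	void *vaddr = dma_alloc_coherent(dev, size, handle, gfp);

	if (vaddr && *handle + size - 1 > dev->coherent_dma_mask)
		dev_warn(dev, "coherent buffer at 0x%llx exceeds mask 0x%llx\n",
			 (unsigned long long)*handle,
			 (unsigned long long)dev->coherent_dma_mask);
	return vaddr;
}
```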
On 07/23/2012 06:30 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2012-07-24 at 09:29 +1000, Benjamin Herrenschmidt wrote:
>
>> The layers in between, not the well behaved drivers. Again, we have
>> ZONE_DMA32 specifically for the purpose, why use something else ?
>>
>> In any case, make the whole thing at the very least a config option, I
>> don't want sane HW to have to deal with split zones.
>
> Or if possible a flag set by machine probe()

I suggested making the threshold configurable by platform code. Sane hardware
would leave it at its default of infinity (~0ULL), and you wouldn't have a
split zone. Our hardware would set it at 31-bit.

It looks like this is already sort-of done for ISA_DMA_THRESHOLD for 32-bit
non-coherent-DMA platforms (amigaone sets a 24-bit threshold), though I don't
see where ZONE_DMA is limited accordingly. We'd add an additional threshold
for ZONE_DMA32, make it actually affect the zone definition, and make the
standard alloc_coherent() honor it.

-Scott
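A rough sketch of the platform-configurable threshold Scott is proposing is
shown below. The variable zone_dma32_threshold and both function names are
hypothetical (no such patch was posted), and it assumes CONFIG_ZONE_DMA32;
the default of ~0ULL leaves sane hardware with a single, unsplit zone.

```c
#include <linux/kernel.h>
#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/string.h>

u64 zone_dma32_threshold = ~0ULL;	/* default: no ZONE_DMA32 split */

/* A platform with 32-bit-limited devices could lower the limit in its
 * probe()/setup_arch(), e.g. to 31 bits for the PEXCSRBAR case: */
static void __init example_platform_setup(void)
{
	zone_dma32_threshold = 1ull << 31;
}

/* paging_init() would then clamp ZONE_DMA32 to the threshold and put the
 * rest of memory in ZONE_NORMAL: */
static void __init example_paging_init(void)
{
	unsigned long max_zone_pfns[MAX_NR_ZONES];
	phys_addr_t top_of_ram = memblock_end_of_DRAM();

	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
	max_zone_pfns[ZONE_DMA32] = min_t(phys_addr_t, top_of_ram,
					  zone_dma32_threshold) >> PAGE_SHIFT;
	max_zone_pfns[ZONE_NORMAL] = top_of_ram >> PAGE_SHIFT;

	free_area_init_nodes(max_zone_pfns);
}
```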
On 07/23/2012 05:20 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2012-07-23 at 11:17 -0500, Scott Wood wrote:
>>> This is wrong.
>>
>> How so?
>>
>>> Don't you have an iommu to deal with those devices anyway ?
>>
>> Yes, but we don't yet have DMA API support for it, it would lower
>> performance because we'd have to use a lot of subwindows which are
>> poorly cached (and even then we wouldn't be able to map more than 256
>> pages at once on a given device), and the IOMMU may not be available at
>> all if we're being virtualized.
>
> Ugh ? You mean some designers need to be fired urgently and wasted
> everybody's time implementing an unusable iommu ? Nice one ...

Yeah, that's old news. It's somewhat usable for specific purposes, using very
large pages, but not for arbitrary use.

>>> But even then, I'm dubious this is really needed.
>>
>> We'd like our drivers to stop crashing with more than 4GiB of RAM on
>> 64-bit.
>
> Fix your HW :-)

I wish... Some hardware people may solicit input from us every now and again,
but ultimately they do what they want.

Returning addresses in excess of a device's declared DMA mask is something
that needs fixing too, though.

-Scott
On Mon, 2012-07-23 at 18:36 -0500, Timur Tabi wrote:
> The DMA zone only kicks in if the DMA mask is set to a size smaller than
> available physical memory. Sane HW should set the DMA mask to
> DMA_BIT_MASK(36). And we have plenty of sane HW on our SOCs, but not
> every device is like that.
>
> The whole point behind this patch is that some drivers are setting a DMA
> mask of 32, but still getting memory above 4GB.

Sure but I don't want to create the zones in the first place (and thus
introduce the added pressure on the memory management) on machines that
don't need it.

Cheers,
Ben.
Benjamin Herrenschmidt wrote:
> Sure but I don't want to create the zones in the first place (and thus
> introduce the added pressure on the memory management) on machines that
> don't need it.

Ah yes, I forgot about memory pressure.
> -----Original Message-----
> From: Linuxppc-dev [mailto:linuxppc-dev-bounces+tie-
> fei.zang=freescale.com@lists.ozlabs.org] On Behalf Of Benjamin
> Herrenschmidt
> Sent: Tuesday, July 24, 2012 11:09 AM
> To: Tabi Timur-B04825
> Cc: Wood Scott-B07421; Hu Mingkai-B21284; linuxppc-dev@lists.ozlabs.org;
> Xie Shaohui-B21989; Chen Yuanquan-B41889
> Subject: Re: [PATCH] powerpc/mm: add ZONE_NORMAL zone for 64 bit kernel
>
> On Mon, 2012-07-23 at 18:36 -0500, Timur Tabi wrote:
> > The DMA zone only kicks in if the DMA mask is set to a size smaller
> > than available physical memory. Sane HW should set the DMA mask to
> > DMA_BIT_MASK(36). And we have plenty of sane HW on our SOCs, but not
> > every device is like that.
> >
> > The whole point behind this patch is that some drivers are setting a
> > DMA mask of 32, but still getting memory above 4GB.
>
> Sure but I don't want to create the zones in the first place (and thus
> introduce the added pressure on the memory management) on machines that
> don't need it.

OK, then what do you think of Scott's suggestion to add a configurable
threshold?

Roy
Benjamin Herrenschmidt wrote:
> Sure but I don't want to create the zones in the first place (and thus
> introduce the added pressure on the memory management) on machines that
> don't need it.

One thing that does confuse me -- by default, we don't create a ZONE_NORMAL.
We only create a ZONE_DMA. Why is that? Shouldn't it be the other way around?
On Tue, 2012-07-24 at 04:04 +0000, Tabi Timur-B04825 wrote:
> Benjamin Herrenschmidt wrote:
>> Sure but I don't want to create the zones in the first place (and thus
>> introduce the added pressure on the memory management) on machines that
>> don't need it.
>
> One thing that does confuse me -- by default, we don't create a
> ZONE_NORMAL. We only create a ZONE_DMA. Why is that? Shouldn't it be
> the other way around?

Because ZONE_NORMAL allocations can be serviced from ZONE_DMA, while the
other way isn't possible.

Especially in the old days, there were quite a few cases of drivers and/or
subsystems that were a bit heavy-handed at using ZONE_DMA, so not having one
would essentially make them not work at all.

Cheers,
Ben.
> -----Original Message-----
> From: Linuxppc-dev [mailto:linuxppc-dev-
> bounces+bharat.bhushan=freescale.com@lists.ozlabs.org] On Behalf Of
> Benjamin Herrenschmidt
> Sent: Tuesday, July 24, 2012 10:16 AM
> To: Tabi Timur-B04825
> Cc: Wood Scott-B07421; Hu Mingkai-B21284; linuxppc-dev@lists.ozlabs.org;
> Xie Shaohui-B21989; Chen Yuanquan-B41889
> Subject: Re: [PATCH] powerpc/mm: add ZONE_NORMAL zone for 64 bit kernel
>
> On Tue, 2012-07-24 at 04:04 +0000, Tabi Timur-B04825 wrote:
> > Benjamin Herrenschmidt wrote:
> > > Sure but I don't want to create the zones in the first place (and
> > > thus introduce the added pressure on the memory management) on
> > > machines that don't need it.
> >
> > One thing that does confuse me -- by default, we don't create a
> > ZONE_NORMAL. We only create a ZONE_DMA. Why is that? Shouldn't it
> > be the other way around?
>
> Because ZONE_NORMAL allocations can be serviced from ZONE_DMA, while the
> other way isn't possible.

Say we have defined only one zone (ZONE_DMA) to which we give all memory
(> 4G), and a device sets its dma_mask to 4G or less. dma_alloc_coherent()
will set the GFP_DMA flag, but that is of no use, because the memory
allocator has only one zone, which holds all memory (all of it assumed to be
DMA-able). It can return memory at an address above 4G, which will crash!

I think we have to have at least one zone whose memory is DMA-able for all
devices (the memory limit should be set by the platform, because different
platforms have different devices with different limits), and another zone (or
more) to cover the rest of memory.

Thanks
-Bharat

> Especially in the old days, there were quite a few cases of drivers and/or
> subsystems that were a bit heavy-handed at using ZONE_DMA, so not having
> one would essentially make them not work at all.
>
> Cheers,
> Ben.
>
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
```diff
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index b1ec983..8029295 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -30,6 +30,7 @@ void *dma_direct_alloc_coherent(struct device *dev, size_t size,
 				struct dma_attrs *attrs)
 {
 	void *ret;
+	phys_addr_t top_ram_pfn = memblock_end_of_DRAM();
 #ifdef CONFIG_NOT_COHERENT_CACHE
 	ret = __dma_alloc_coherent(dev, size, dma_handle, flag);
 	if (ret == NULL)
@@ -40,8 +41,18 @@ void *dma_direct_alloc_coherent(struct device *dev, size_t size,
 	struct page *page;
 	int node = dev_to_node(dev);
 
+	/*
+	 * check for crappy device which has dma_mask < ZONE_DMA, and
+	 * we are not going to support it, just warn and fail.
+	 */
+	if (*dev->dma_mask < DMA_BIT_MASK(31)) {
+		dev_err(dev, "Unsupported dma_mask 0x%llx\n", *dev->dma_mask);
+		return NULL;
+	}
 	/* ignore region specifiers */
-	flag &= ~(__GFP_HIGHMEM);
+	flag &= ~(__GFP_HIGHMEM | __GFP_DMA);
+	if (*dev->dma_mask < top_ram_pfn - 1)
+		flag |= GFP_DMA;
 
 	page = alloc_pages_node(node, flag, get_order(size));
 	if (page == NULL)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index baaafde..a494555 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -281,7 +281,9 @@ void __init paging_init(void)
 	max_zone_pfns[ZONE_DMA] = lowmem_end_addr >> PAGE_SHIFT;
 	max_zone_pfns[ZONE_HIGHMEM] = top_of_ram >> PAGE_SHIFT;
 #else
-	max_zone_pfns[ZONE_DMA] = top_of_ram >> PAGE_SHIFT;
+	max_zone_pfns[ZONE_DMA] = min_t(phys_addr_t, top_of_ram,
+					1ull << 31) >> PAGE_SHIFT;
+	max_zone_pfns[ZONE_NORMAL] = top_of_ram >> PAGE_SHIFT;
 #endif
 
 	free_area_init_nodes(max_zone_pfns);
```