| Message ID | 20100317145950.GA5752@random.random |
| --- | --- |
| State | New |
> +    if (size >= PREFERRED_RAM_ALIGN)
> +        new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN, size);

Is this deliberately bigger-than rather than multiple-of?
Having the size not be a multiple of alignment seems somewhat strange, it's
always going to be wrong at one end...

Paul
On Wed, Mar 17, 2010 at 03:05:57PM +0000, Paul Brook wrote:
> > +    if (size >= PREFERRED_RAM_ALIGN)
> > +        new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN, size);
>
> Is this deliberately bigger-than rather than multiple-of?
> Having the size not be a multiple of alignment seems somewhat strange, it's
> always going to be wrong at one end...

Size not a multiple I think is legitimate: the below-4G chunk isn't
required to end 2M aligned; all that matters is that the above-4G chunk
then starts aligned. In short, one thing to add in the future as a
parameter to qemu_ram_alloc is the guest physical address that the host
virtual address corresponds to. The guest physical address that the host
retval corresponds to has to be aligned with PREFERRED_RAM_ALIGN for
NPT/EPT to work. I don't think it's a big concern right now.
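[Editorial illustration, not part of the thread: the condition being discussed is the "(gfn ^ pfn) & (HPAGE_SIZE-1) == 0" check mentioned in the patch comment; the function name and types below are hypothetical.]

#include <stdbool.h>
#include <stdint.h>

#define HPAGE_SIZE (2 * 1024 * 1024)  /* x86_64 transparent hugepage size */

/* True when the guest physical start and the host virtual start are
 * congruent modulo the hugepage size, i.e. misaligned by the same amount,
 * so KVM can back the range with host hugepages under NPT/EPT.  Making the
 * host start 2M-aligned is the simplest way to satisfy this as long as
 * pc.c places RAM at 2M-aligned guest physical addresses. */
static bool hugepage_backing_possible(uint64_t guest_phys_start,
                                      uintptr_t host_virt_start)
{
    return ((guest_phys_start ^ (uint64_t)host_virt_start)
            & (HPAGE_SIZE - 1)) == 0;
}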
> On Wed, Mar 17, 2010 at 03:05:57PM +0000, Paul Brook wrote:
> > > +    if (size >= PREFERRED_RAM_ALIGN)
> > > +        new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN, size);
> >
> > Is this deliberately bigger-than rather than multiple-of?
> > Having the size not be a multiple of alignment seems somewhat strange,
> > it's always going to be wrong at one end...
>
> Size not a multiple I think is legitimate: the below-4G chunk isn't
> required to end 2M aligned; all that matters is that the above-4G chunk
> then starts aligned. In short, one thing to add in the future as a
> parameter to qemu_ram_alloc is the guest physical address that the host
> virtual address corresponds to.

In general you don't know this at allocation time.

> The guest physical address that the host retval corresponds to has to be
> aligned with PREFERRED_RAM_ALIGN for NPT/EPT to work. I don't think it's
> a big concern right now.

If you aren't allocating chunks that are multiples of the relevant page
size, then I don't think you can expect anything particularly sensible to
happen.

Paul
On Wed, Mar 17, 2010 at 03:21:26PM +0000, Paul Brook wrote:
> > On Wed, Mar 17, 2010 at 03:05:57PM +0000, Paul Brook wrote:
> > > > +    if (size >= PREFERRED_RAM_ALIGN)
> > > > +        new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN, size);
> > >
> > > Is this deliberately bigger-than rather than multiple-of?
> > > Having the size not be a multiple of alignment seems somewhat strange,
> > > it's always going to be wrong at one end...
> >
> > Size not a multiple I think is legitimate: the below-4G chunk isn't
> > required to end 2M aligned; all that matters is that the above-4G chunk
> > then starts aligned. In short, one thing to add in the future as a
> > parameter to qemu_ram_alloc is the guest physical address that the host
> > virtual address corresponds to.
>
> In general you don't know this at allocation time.

Caller knows it, it's not like the caller is outside of qemu, it's not
some library. We know this is enough with the caller that there is now.

Again, there is absolutely no relation between the "size" and this. Size
can be anything and it's absolutely irrelevant. All that matters is the
_start_: both the guest physical address _start_ and the host virtual
address _start_. And they don't have to be aligned to 2M; simply their
alignment or misalignment have to match, and this is the simplest way to
have them match.

> > The guest physical address that the host retval corresponds to has to
> > be aligned with PREFERRED_RAM_ALIGN for NPT/EPT to work. I don't think
> > it's a big concern right now.
>
> If you aren't allocating chunks that are multiples of the relevant page
> size, then I don't think you can expect anything particularly sensible
> to happen.

If you want me to do a bigger, more complex patch that passes down to
qemu_ram_alloc the actual guest physical address that the virtual address
returned by qemu_ram_alloc will correspond to, I will do it. That would
likely be something like qemu_ram_alloc_align. And if somebody volunteers
to do it so I don't have to, you're welcome. I don't care how this happens,
but it must happen.
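[Editorial illustration, not part of the thread: a minimal sketch of how the hypothetical qemu_ram_alloc_align mentioned above could match the host virtual start to an arbitrary guest physical start; it assumes posix_memalign as the allocator and is not the author's code.]

#include <stdint.h>
#include <stdlib.h>

#define PREFERRED_RAM_ALIGN (2 * 1024 * 1024)

/* Over-allocate by one hugepage and return a pointer whose offset within
 * a 2M boundary equals the offset of the guest physical start, so that
 * (gpa ^ hva) & (2M - 1) == 0 even when gpa itself is not 2M aligned.
 * A real version would keep the base pointer around for freeing and use
 * qemu's own allocator instead of posix_memalign. */
static void *ram_alloc_matching_align(uint64_t guest_phys_start, size_t size)
{
    size_t offset = guest_phys_start & (PREFERRED_RAM_ALIGN - 1);
    void *base;

    if (posix_memalign(&base, PREFERRED_RAM_ALIGN,
                       size + PREFERRED_RAM_ALIGN) != 0) {
        return NULL;
    }
    return (char *)base + offset;
}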
> > > Size not a multiple I think is legitimate: the below-4G chunk isn't
> > > required to end 2M aligned; all that matters is that the above-4G
> > > chunk then starts aligned. In short, one thing to add in the future
> > > as a parameter to qemu_ram_alloc is the guest physical address that
> > > the host virtual address corresponds to.
> >
> > In general you don't know this at allocation time.
>
> Caller knows it, it's not like the caller is outside of qemu, it's not
> some library. We know this is enough with the caller that there is now.

No we don't. As discussed previously, there are machines where the
physical location of RAM is configurable at runtime. In fact it's common
for the RAM to be completely absent at reset.

Paul
On Wed, Mar 17, 2010 at 03:52:15PM +0000, Paul Brook wrote:
> > > > Size not a multiple I think is legitimate: the below-4G chunk isn't
> > > > required to end 2M aligned; all that matters is that the above-4G
> > > > chunk then starts aligned. In short, one thing to add in the future
> > > > as a parameter to qemu_ram_alloc is the guest physical address that
> > > > the host virtual address corresponds to.
> > >
> > > In general you don't know this at allocation time.
> >
> > Caller knows it, it's not like the caller is outside of qemu, it's not
> > some library. We know this is enough with the caller that there is now.
>
> No we don't. As discussed previously, there are machines where the
> physical location of RAM is configurable at runtime. In fact it's common
> for the RAM to be completely absent at reset.

This is why PREFERRED_RAM_ALIGN is only defined for __x86_64__. I'm not
talking about other archs that may never support transparent hugepages in
the kernel because of other architectural constraints that may prevent
mapping hugepages mixed with regular pages in the same vma.
> On Wed, Mar 17, 2010 at 03:52:15PM +0000, Paul Brook wrote:
> > > > > Size not a multiple I think is legitimate: the below-4G chunk
> > > > > isn't required to end 2M aligned; all that matters is that the
> > > > > above-4G chunk then starts aligned. In short, one thing to add in
> > > > > the future as a parameter to qemu_ram_alloc is the guest physical
> > > > > address that the host virtual address corresponds to.
> > > >
> > > > In general you don't know this at allocation time.
> > >
> > > Caller knows it, it's not like the caller is outside of qemu, it's
> > > not some library. We know this is enough with the caller that there
> > > is now.
> >
> > No we don't. As discussed previously, there are machines where the
> > physical location of RAM is configurable at runtime. In fact it's
> > common for the RAM to be completely absent at reset.
>
> This is why PREFERRED_RAM_ALIGN is only defined for __x86_64__. I'm not
> talking about other archs that may never support transparent hugepages
> in the kernel because of other architectural constraints that may
> prevent mapping hugepages mixed with regular pages in the same vma.

__x86_64__ only tells you about the host. I'm talking about the guest
machine.

Paul
On Wed, Mar 17, 2010 at 04:07:09PM +0000, Paul Brook wrote:
> > On Wed, Mar 17, 2010 at 03:52:15PM +0000, Paul Brook wrote:
> > > > > > Size not a multiple I think is legitimate: the below-4G chunk
> > > > > > isn't required to end 2M aligned; all that matters is that the
> > > > > > above-4G chunk then starts aligned. In short, one thing to add
> > > > > > in the future as a parameter to qemu_ram_alloc is the guest
> > > > > > physical address that the host virtual address corresponds to.
> > > > >
> > > > > In general you don't know this at allocation time.
> > > >
> > > > Caller knows it, it's not like the caller is outside of qemu, it's
> > > > not some library. We know this is enough with the caller that there
> > > > is now.
> > >
> > > No we don't. As discussed previously, there are machines where the
> > > physical location of RAM is configurable at runtime. In fact it's
> > > common for the RAM to be completely absent at reset.
> >
> > This is why PREFERRED_RAM_ALIGN is only defined for __x86_64__. I'm
> > not talking about other archs that may never support transparent
> > hugepages in the kernel because of other architectural constraints
> > that may prevent mapping hugepages mixed with regular pages in the
> > same vma.
>
> __x86_64__ only tells you about the host. I'm talking about the guest
> machine.

When it's qemu and not kvm (so when the guest might not be x86), the
guest physical address becomes as irrelevant as the size, and only the
host virtual address has to start 2M aligned on an x86_64 host. I think
this already takes care of all practical issues, and there's no need for
further work until pc.c starts allocating chunks of ram at guest physical
addresses that aren't 2M aligned. Maybe if we add memory hotplug or
something.
diff --git a/exec.c b/exec.c
index 14767b7..ab33f6b 100644
--- a/exec.c
+++ b/exec.c
@@ -2745,6 +2745,18 @@ static void *file_ram_alloc(ram_addr_t memory, const char *path)
 }
 #endif
 
+#if defined(__linux__) && defined(__x86_64__)
+/*
+ * Align on the max transparent hugepage size so that
+ * "(gfn ^ pfn) & (HPAGE_SIZE-1) == 0" to allow KVM to
+ * take advantage of hugepages with NPT/EPT or to
+ * ensure the first 2M of the guest physical ram will
+ * be mapped by the same hugetlb for QEMU (it is worth
+ * it even without NPT/EPT).
+ */
+#define PREFERRED_RAM_ALIGN (2*1024*1024)
+#endif
+
 ram_addr_t qemu_ram_alloc(ram_addr_t size)
 {
     RAMBlock *new_block;
@@ -2768,11 +2780,19 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
                                PROT_EXEC|PROT_READ|PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
 #else
-        new_block->host = qemu_vmalloc(size);
+#ifdef PREFERRED_RAM_ALIGN
+        if (size >= PREFERRED_RAM_ALIGN)
+            new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN, size);
+        else
+#endif
+            new_block->host = qemu_vmalloc(size);
 #endif
 #ifdef MADV_MERGEABLE
         madvise(new_block->host, size, MADV_MERGEABLE);
 #endif
+#ifdef MADV_HUGEPAGE
+        madvise(new_block->host, size, MADV_HUGEPAGE);
+#endif
     }
     new_block->offset = last_ram_offset;
     new_block->length = size;
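[Editorial illustration, not part of the thread: a standalone sketch of the allocation pattern the patch introduces, assuming a Linux/x86_64 host whose headers define MADV_HUGEPAGE; qemu_memalign is, at this point, essentially a posix_memalign wrapper.]

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define PREFERRED_RAM_ALIGN (2 * 1024 * 1024)

int main(void)
{
    size_t size = 128 * 1024 * 1024;  /* e.g. 128M of guest RAM */
    void *host;
    int ret;

    /* A 2M-aligned start lets the kernel back the region with hugepages. */
    ret = posix_memalign(&host, PREFERRED_RAM_ALIGN, size);
    if (ret != 0) {
        fprintf(stderr, "posix_memalign: error %d\n", ret);
        return 1;
    }
#ifdef MADV_HUGEPAGE
    /* Ask for transparent hugepages; failure is harmless on kernels
     * without THP support. */
    if (madvise(host, size, MADV_HUGEPAGE) != 0) {
        perror("madvise(MADV_HUGEPAGE)");
    }
#endif
    printf("guest RAM backing at %p\n", host);
    free(host);
    return 0;
}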