Message ID | 509627A2.3040509@web.de |
---|---|
State | New |
Headers | show |
On 11/04/2012 10:30 AM, Jan Kiszka wrote: > From: Jan Kiszka <jan.kiszka@siemens.com> > > Cirrus is triggering this, e.g. during Win2k boot: Changes only on > disabled regions require no topology update when transaction depth drops > to 0 again. 817dcc5368988b0 (pci: give each device its own address space) mad this much worse by multiplying the number of address spaces. Each change is now evaluated N+2 times, where N is the number of PCI devices. It also causes a corresponding expansion in memory usage. I want to address this by caching AddressSpaceDispatch trees with the key being the contents of the FlatView for that address space. This will drop the number of distinct trees to 2-4 (3 if some devices have PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different from the cpu memory address space) but will fail if we make each address space different (for example filtering out the device's own BARs). If this change also improves cpu usage sufficiently, then it will be better than your patch, which doesn't recognize changes in an enabled region inside a disabled or hidden region. In other words, your patch fits the problem at hand but isn't general. On the other hand my approach doesn't eliminate render_memory_region(), just the exec.c stuff and listener updates. So we need to understand where the slowness comes from. > > @@ -1543,6 +1554,7 @@ void address_space_init(AddressSpace *as, MemoryRegion *root) > flatview_init(as->current_map); > QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link); > as->name = NULL; > + memory_region_update_pending = true; > memory_region_transaction_commit(); > address_space_init_dispatch(as); > } > @@ -1552,6 +1564,7 @@ void address_space_destroy(AddressSpace *as) > /* Flush out anything from MemoryListeners listening in on this */ > memory_region_transaction_begin(); > as->root = NULL; > + memory_region_update_pending = true; > memory_region_transaction_commit(); > QTAILQ_REMOVE(&address_spaces, as, address_spaces_link); > address_space_destroy_dispatch(as); init and destroy cannot happen to a region that is mapped (and cannot happen within a transaction), so these two are redundant.
On 2012-11-04 20:21, Avi Kivity wrote: > On 11/04/2012 10:30 AM, Jan Kiszka wrote: >> From: Jan Kiszka <jan.kiszka@siemens.com> >> >> Cirrus is triggering this, e.g. during Win2k boot: Changes only on >> disabled regions require no topology update when transaction depth drops >> to 0 again. > > 817dcc5368988b0 (pci: give each device its own address space) mad this > much worse by multiplying the number of address spaces. Each change is > now evaluated N+2 times, where N is the number of PCI devices. It also > causes a corresponding expansion in memory usage. I know... But this regression predates your changes, is already visible right after 02e2b95fb4. > > I want to address this by caching AddressSpaceDispatch trees with the > key being the contents of the FlatView for that address space. This > will drop the number of distinct trees to 2-4 (3 if some devices have > PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different > from the cpu memory address space) but will fail if we make each address > space different (for example filtering out the device's own BARs). > > If this change also improves cpu usage sufficiently, then it will be > better than your patch, which doesn't recognize changes in an enabled > region inside a disabled or hidden region. True, though the question is how common such scenarios are. This one (cirrus with win2k) is already special. > In other words, your patch > fits the problem at hand but isn't general. On the other hand my > approach doesn't eliminate render_memory_region(), just the exec.c stuff > and listener updates. So we need to understand where the slowness comes > from. I would just like to have some even intermediate solution for 1.3. We can still make it more perfect later on if required. Jan
On 11/05/2012 08:26 AM, Jan Kiszka wrote: > On 2012-11-04 20:21, Avi Kivity wrote: > > On 11/04/2012 10:30 AM, Jan Kiszka wrote: > >> From: Jan Kiszka <jan.kiszka@siemens.com> > >> > >> Cirrus is triggering this, e.g. during Win2k boot: Changes only on > >> disabled regions require no topology update when transaction depth drops > >> to 0 again. > > > > 817dcc5368988b0 (pci: give each device its own address space) mad this > > much worse by multiplying the number of address spaces. Each change is > > now evaluated N+2 times, where N is the number of PCI devices. It also > > causes a corresponding expansion in memory usage. > > I know... But this regression predates your changes, is already visible > right after 02e2b95fb4. > > > > > I want to address this by caching AddressSpaceDispatch trees with the > > key being the contents of the FlatView for that address space. This > > will drop the number of distinct trees to 2-4 (3 if some devices have > > PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different > > from the cpu memory address space) but will fail if we make each address > > space different (for example filtering out the device's own BARs). > > > > If this change also improves cpu usage sufficiently, then it will be > > better than your patch, which doesn't recognize changes in an enabled > > region inside a disabled or hidden region. > > True, though the question is how common such scenarios are. This one > (cirrus with win2k) is already special. > > > In other words, your patch > > fits the problem at hand but isn't general. On the other hand my > > approach doesn't eliminate render_memory_region(), just the exec.c stuff > > and listener updates. So we need to understand where the slowness comes > > from. > > I would just like to have some even intermediate solution for 1.3. We > can still make it more perfect later on if required. > I think we should apply a v2 then, the more general optimizations will take some time.
On 2012-11-05 09:12, Avi Kivity wrote: > On 11/05/2012 08:26 AM, Jan Kiszka wrote: >> On 2012-11-04 20:21, Avi Kivity wrote: >>> On 11/04/2012 10:30 AM, Jan Kiszka wrote: >>>> From: Jan Kiszka <jan.kiszka@siemens.com> >>>> >>>> Cirrus is triggering this, e.g. during Win2k boot: Changes only on >>>> disabled regions require no topology update when transaction depth drops >>>> to 0 again. >>> >>> 817dcc5368988b0 (pci: give each device its own address space) mad this >>> much worse by multiplying the number of address spaces. Each change is >>> now evaluated N+2 times, where N is the number of PCI devices. It also >>> causes a corresponding expansion in memory usage. >> >> I know... But this regression predates your changes, is already visible >> right after 02e2b95fb4. >> >>> >>> I want to address this by caching AddressSpaceDispatch trees with the >>> key being the contents of the FlatView for that address space. This >>> will drop the number of distinct trees to 2-4 (3 if some devices have >>> PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different >>> from the cpu memory address space) but will fail if we make each address >>> space different (for example filtering out the device's own BARs). >>> >>> If this change also improves cpu usage sufficiently, then it will be >>> better than your patch, which doesn't recognize changes in an enabled >>> region inside a disabled or hidden region. >> >> True, though the question is how common such scenarios are. This one >> (cirrus with win2k) is already special. >> >>> In other words, your patch >>> fits the problem at hand but isn't general. On the other hand my >>> approach doesn't eliminate render_memory_region(), just the exec.c stuff >>> and listener updates. So we need to understand where the slowness comes >>> from. >> >> I would just like to have some even intermediate solution for 1.3. We >> can still make it more perfect later on if required. >> > > I think we should apply a v2 then, the more general optimizations will > take some time. OK - what should v2 do differently? Jan
On 11/05/2012 10:51 AM, Jan Kiszka wrote: > On 2012-11-05 09:12, Avi Kivity wrote: > > On 11/05/2012 08:26 AM, Jan Kiszka wrote: > >> On 2012-11-04 20:21, Avi Kivity wrote: > >>> On 11/04/2012 10:30 AM, Jan Kiszka wrote: > >>>> From: Jan Kiszka <jan.kiszka@siemens.com> > >>>> > >>>> Cirrus is triggering this, e.g. during Win2k boot: Changes only on > >>>> disabled regions require no topology update when transaction depth drops > >>>> to 0 again. > >>> > >>> 817dcc5368988b0 (pci: give each device its own address space) mad this > >>> much worse by multiplying the number of address spaces. Each change is > >>> now evaluated N+2 times, where N is the number of PCI devices. It also > >>> causes a corresponding expansion in memory usage. > >> > >> I know... But this regression predates your changes, is already visible > >> right after 02e2b95fb4. > >> > >>> > >>> I want to address this by caching AddressSpaceDispatch trees with the > >>> key being the contents of the FlatView for that address space. This > >>> will drop the number of distinct trees to 2-4 (3 if some devices have > >>> PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different > >>> from the cpu memory address space) but will fail if we make each address > >>> space different (for example filtering out the device's own BARs). > >>> > >>> If this change also improves cpu usage sufficiently, then it will be > >>> better than your patch, which doesn't recognize changes in an enabled > >>> region inside a disabled or hidden region. > >> > >> True, though the question is how common such scenarios are. This one > >> (cirrus with win2k) is already special. > >> > >>> In other words, your patch > >>> fits the problem at hand but isn't general. On the other hand my > >>> approach doesn't eliminate render_memory_region(), just the exec.c stuff > >>> and listener updates. So we need to understand where the slowness comes > >>> from. > >> > >> I would just like to have some even intermediate solution for 1.3. We > >> can still make it more perfect later on if required. > >> > > > > I think we should apply a v2 then, the more general optimizations will > > take some time. > > OK - what should v2 do differently? > As I noted, init and destroy cannot cause a topology update.
On 2012-11-05 13:33, Avi Kivity wrote: > On 11/05/2012 10:51 AM, Jan Kiszka wrote: >> On 2012-11-05 09:12, Avi Kivity wrote: >>> On 11/05/2012 08:26 AM, Jan Kiszka wrote: >>>> On 2012-11-04 20:21, Avi Kivity wrote: >>>>> On 11/04/2012 10:30 AM, Jan Kiszka wrote: >>>>>> From: Jan Kiszka <jan.kiszka@siemens.com> >>>>>> >>>>>> Cirrus is triggering this, e.g. during Win2k boot: Changes only on >>>>>> disabled regions require no topology update when transaction depth drops >>>>>> to 0 again. >>>>> >>>>> 817dcc5368988b0 (pci: give each device its own address space) mad this >>>>> much worse by multiplying the number of address spaces. Each change is >>>>> now evaluated N+2 times, where N is the number of PCI devices. It also >>>>> causes a corresponding expansion in memory usage. >>>> >>>> I know... But this regression predates your changes, is already visible >>>> right after 02e2b95fb4. >>>> >>>>> >>>>> I want to address this by caching AddressSpaceDispatch trees with the >>>>> key being the contents of the FlatView for that address space. This >>>>> will drop the number of distinct trees to 2-4 (3 if some devices have >>>>> PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different >>>>> from the cpu memory address space) but will fail if we make each address >>>>> space different (for example filtering out the device's own BARs). >>>>> >>>>> If this change also improves cpu usage sufficiently, then it will be >>>>> better than your patch, which doesn't recognize changes in an enabled >>>>> region inside a disabled or hidden region. >>>> >>>> True, though the question is how common such scenarios are. This one >>>> (cirrus with win2k) is already special. >>>> >>>>> In other words, your patch >>>>> fits the problem at hand but isn't general. On the other hand my >>>>> approach doesn't eliminate render_memory_region(), just the exec.c stuff >>>>> and listener updates. So we need to understand where the slowness comes >>>>> from. >>>> >>>> I would just like to have some even intermediate solution for 1.3. We >>>> can still make it more perfect later on if required. >>>> >>> >>> I think we should apply a v2 then, the more general optimizations will >>> take some time. >> >> OK - what should v2 do differently? >> > > As I noted, init and destroy cannot cause a topology update. Ah, right. Why are we wrapping them in transaction_begin/commit at all then? Jan
On 11/05/2012 02:37 PM, Jan Kiszka wrote: >> >> As I noted, init and destroy cannot cause a topology update. > > Ah, right. Why are we wrapping them in transaction_begin/commit at all then? > We aren't. void memory_region_destroy(MemoryRegion *mr) { assert(QTAILQ_EMPTY(&mr->subregions)); assert(memory_region_transaction_depth == 0);
On 2012-11-25 11:18, Avi Kivity wrote: > On 11/05/2012 02:37 PM, Jan Kiszka wrote: >>> >>> As I noted, init and destroy cannot cause a topology update. >> >> Ah, right. Why are we wrapping them in transaction_begin/commit at all then? >> > > We aren't. > > > void memory_region_destroy(MemoryRegion *mr) > { > assert(QTAILQ_EMPTY(&mr->subregions)); > assert(memory_region_transaction_depth == 0); > We were talking about address_space_init/destroy. Jan
On 11/25/2012 12:45 PM, Jan Kiszka wrote: > On 2012-11-25 11:18, Avi Kivity wrote: >> On 11/05/2012 02:37 PM, Jan Kiszka wrote: >>>> >>>> As I noted, init and destroy cannot cause a topology update. >>> >>> Ah, right. Why are we wrapping them in transaction_begin/commit at all then? >>> >> >> We aren't. >> >> >> void memory_region_destroy(MemoryRegion *mr) >> { >> assert(QTAILQ_EMPTY(&mr->subregions)); >> assert(memory_region_transaction_depth == 0); >> > > We were talking about address_space_init/destroy. This is to force a re-rendering of the address space, so that listeners see the construction/destruction. Simply assigning as->root wouldn't do that. This kind of reliance on side effects should be documented with a comment (and forbidden to anything that is outside the implementation). My bad.
diff --git a/memory.c b/memory.c index d5150f8..122a58e 100644 --- a/memory.c +++ b/memory.c @@ -22,7 +22,8 @@ #include "memory-internal.h" -unsigned memory_region_transaction_depth = 0; +static unsigned memory_region_transaction_depth; +static bool memory_region_update_pending; static bool global_dirty_log = false; static QTAILQ_HEAD(memory_listeners, MemoryListener) memory_listeners @@ -741,7 +742,8 @@ void memory_region_transaction_commit(void) assert(memory_region_transaction_depth); --memory_region_transaction_depth; - if (!memory_region_transaction_depth) { + if (!memory_region_transaction_depth && memory_region_update_pending) { + memory_region_update_pending = false; MEMORY_LISTENER_CALL_GLOBAL(begin, Forward); QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) { @@ -1060,6 +1062,7 @@ void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client) memory_region_transaction_begin(); mr->dirty_log_mask = (mr->dirty_log_mask & ~mask) | (log * mask); + memory_region_update_pending |= mr->enabled; memory_region_transaction_commit(); } @@ -1097,6 +1100,7 @@ void memory_region_set_readonly(MemoryRegion *mr, bool readonly) if (mr->readonly != readonly) { memory_region_transaction_begin(); mr->readonly = readonly; + memory_region_update_pending |= mr->enabled; memory_region_transaction_commit(); } } @@ -1106,6 +1110,7 @@ void memory_region_rom_device_set_readable(MemoryRegion *mr, bool readable) if (mr->readable != readable) { memory_region_transaction_begin(); mr->readable = readable; + memory_region_update_pending |= mr->enabled; memory_region_transaction_commit(); } } @@ -1248,6 +1253,7 @@ void memory_region_add_eventfd(MemoryRegion *mr, memmove(&mr->ioeventfds[i+1], &mr->ioeventfds[i], sizeof(*mr->ioeventfds) * (mr->ioeventfd_nb-1 - i)); mr->ioeventfds[i] = mrfd; + memory_region_update_pending |= mr->enabled; memory_region_transaction_commit(); } @@ -1280,6 +1286,7 @@ void memory_region_del_eventfd(MemoryRegion *mr, --mr->ioeventfd_nb; mr->ioeventfds = g_realloc(mr->ioeventfds, sizeof(*mr->ioeventfds)*mr->ioeventfd_nb + 1); + memory_region_update_pending |= mr->enabled; memory_region_transaction_commit(); } @@ -1323,6 +1330,7 @@ static void memory_region_add_subregion_common(MemoryRegion *mr, } QTAILQ_INSERT_TAIL(&mr->subregions, subregion, subregions_link); done: + memory_region_update_pending |= mr->enabled && subregion->enabled; memory_region_transaction_commit(); } @@ -1353,6 +1361,7 @@ void memory_region_del_subregion(MemoryRegion *mr, assert(subregion->parent == mr); subregion->parent = NULL; QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link); + memory_region_update_pending |= mr->enabled && subregion->enabled; memory_region_transaction_commit(); } @@ -1363,6 +1372,7 @@ void memory_region_set_enabled(MemoryRegion *mr, bool enabled) } memory_region_transaction_begin(); mr->enabled = enabled; + memory_region_update_pending = true; memory_region_transaction_commit(); } @@ -1397,6 +1407,7 @@ void memory_region_set_alias_offset(MemoryRegion *mr, hwaddr offset) memory_region_transaction_begin(); mr->alias_offset = offset; + memory_region_update_pending |= mr->enabled; memory_region_transaction_commit(); } @@ -1543,6 +1554,7 @@ void address_space_init(AddressSpace *as, MemoryRegion *root) flatview_init(as->current_map); QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link); as->name = NULL; + memory_region_update_pending = true; memory_region_transaction_commit(); address_space_init_dispatch(as); } @@ -1552,6 +1564,7 @@ void address_space_destroy(AddressSpace *as) /* Flush out anything from MemoryListeners listening in on this */ memory_region_transaction_begin(); as->root = NULL; + memory_region_update_pending = true; memory_region_transaction_commit(); QTAILQ_REMOVE(&address_spaces, as, address_spaces_link); address_space_destroy_dispatch(as);