Message ID | 20100111134248.GA25622@lst.de |
---|---|
State | New |
Headers | show |
On 01/11/2010 07:42 AM, Christoph Hellwig wrote: > On Mon, Jan 11, 2010 at 10:30:53AM +0200, Avi Kivity wrote: > >> The patch has potential to reduce performance on volumes with multiple >> spindles. Consider two processes issuing sequential reads into a RAID >> array. With this patch, the reads will be executed sequentially rather >> than in parallel, so I think a follow-on patch to make the minimum depth >> a parameter (set by the guest? the host?) would be helpful. >> > Let's think about the life cycle of I/O requests a bit. > > We have an idle virtqueue (aka one virtio-blk device). The first (read) > request comes in, we get the virtio notify from the guest, which calls > into virtio_blk_handle_output. With the new code we now disable the > notify once we start processing the first request. If the second > request hits the queue before we call into virtio_blk_get_request > the second time we're fine even with the new code as we keep picking it > up. If however it hits after we leave virtio_blk_handle_output, but > before we complete the first request we do indeed introduce additional > latency. > > So instead of disabling notify while requests are active we might want > to only disable it while we are inside virtio_blk_handle_output. > Something like the following minimally tested patch: > I'd suggest that we get even more aggressive and install an idle bottom half that checks the queue for newly submitted requests. If we keep getting requests submitted before a new one completes, we'll never take an I/O exit. The same approach is probably a good idea for virtio-net. Regards, Anthony Liguori
On 01/11/2010 03:42 PM, Christoph Hellwig wrote: > On Mon, Jan 11, 2010 at 10:30:53AM +0200, Avi Kivity wrote: > >> The patch has potential to reduce performance on volumes with multiple >> spindles. Consider two processes issuing sequential reads into a RAID >> array. With this patch, the reads will be executed sequentially rather >> than in parallel, so I think a follow-on patch to make the minimum depth >> a parameter (set by the guest? the host?) would be helpful. >> > Let's think about the life cycle of I/O requests a bit. > > We have an idle virtqueue (aka one virtio-blk device). The first (read) > request comes in, we get the virtio notify from the guest, which calls > into virtio_blk_handle_output. With the new code we now disable the > notify once we start processing the first request. If the second > request hits the queue before we call into virtio_blk_get_request > the second time we're fine even with the new code as we keep picking it > up. If however it hits after we leave virtio_blk_handle_output, but > before we complete the first request we do indeed introduce additional > latency. > > So instead of disabling notify while requests are active we might want > to only disable it while we are inside virtio_blk_handle_output. > Something like the following minimally tested patch: > > > Index: qemu/hw/virtio-blk.c > =================================================================== > --- qemu.orig/hw/virtio-blk.c 2010-01-11 14:28:42.896010503 +0100 > +++ qemu/hw/virtio-blk.c 2010-01-11 14:40:13.535256353 +0100 > @@ -328,7 +328,15 @@ static void virtio_blk_handle_output(Vir > int num_writes = 0; > BlockDriverState *old_bs = NULL; > > + /* > + * While we are processing requests there is no need to get further > + * notifications from the guest - it'll just burn cpu cycles doing > + * useless context switches into the host. > + */ > + virtio_queue_set_notification(s->vq, 0); > + > while ((req = virtio_blk_get_request(s))) { > +handle_request: > if (req->elem.out_num< 1 || req->elem.in_num< 1) { > fprintf(stderr, "virtio-blk missing headers\n"); > exit(1); > @@ -358,6 +366,18 @@ static void virtio_blk_handle_output(Vir > } > } > > + /* > + * Once we're done processing all pending requests re-enable the queue > + * notification. If there's an entry pending after we enabled > + * notification again we hit a race and need to process it before > + * returning. > + */ > + virtio_queue_set_notification(s->vq, 1); > + req = virtio_blk_get_request(s); > + if (req) { > + goto handle_request; > + } > + > I don't think this will have much effect. First, the time spent in virtio_blk_handle_output() is a small fraction of total guest time, so the probability of the guest hitting the notifications closed window is low. Second, while we're in that function, the vcpu that kicked us is stalled, and other vcpus are likely to be locked out of the queue by the guest.
On 01/11/2010 03:49 PM, Anthony Liguori wrote: >> So instead of disabling notify while requests are active we might want >> to only disable it while we are inside virtio_blk_handle_output. >> Something like the following minimally tested patch: > > > I'd suggest that we get even more aggressive and install an idle > bottom half that checks the queue for newly submitted requests. If we > keep getting requests submitted before a new one completes, we'll > never take an I/O exit. > That has the downside of bouncing a cache line on unrelated exits. It probably doesn't matter with qemu as it is now, since it will bounce qemu_mutex, but it will hurt with large guests (especially if they have many rings). IMO we should get things to work well without riding on unrelated exits, especially as we're trying to reduce those exits. > The same approach is probably a good idea for virtio-net. With vhost-net you don't see exits.
On 01/11/2010 08:29 AM, Avi Kivity wrote: > On 01/11/2010 03:49 PM, Anthony Liguori wrote: >>> So instead of disabling notify while requests are active we might want >>> to only disable it while we are inside virtio_blk_handle_output. >>> Something like the following minimally tested patch: >> >> >> I'd suggest that we get even more aggressive and install an idle >> bottom half that checks the queue for newly submitted requests. If >> we keep getting requests submitted before a new one completes, we'll >> never take an I/O exit. >> > > That has the downside of bouncing a cache line on unrelated exits. The read and write sides of the ring are widely separated in physical memory specifically to avoid cache line bouncing. > It probably doesn't matter with qemu as it is now, since it will > bounce qemu_mutex, but it will hurt with large guests (especially if > they have many rings). > > IMO we should get things to work well without riding on unrelated > exits, especially as we're trying to reduce those exits. A block I/O request can potentially be very, very long lived. By serializing requests like this, there's a high likelihood that it's going to kill performance with anything capable of processing multiple requests. OTOH, if we aggressively poll the ring when we have an opportunity to, there's very little down side to that and it addresses the serialization problem. >> The same approach is probably a good idea for virtio-net. > > With vhost-net you don't see exits. The point is, when we've disabled notification, we should poll on the ring for additional requests instead of waiting for one to complete before looking at another one. Even with vhost-net, this logic is still applicable although potentially harder to achieve. Regards, Anthony Liguori
On 01/11/2010 04:37 PM, Anthony Liguori wrote: >> That has the downside of bouncing a cache line on unrelated exits. > > > The read and write sides of the ring are widely separated in physical > memory specifically to avoid cache line bouncing. I meant, exits on random vcpus will cause the cacheline containing the notification disable flag to bounce around. As it is, we read it on the vcpu that owns the queue and write it on that vcpu or the I/O thread. >> It probably doesn't matter with qemu as it is now, since it will >> bounce qemu_mutex, but it will hurt with large guests (especially if >> they have many rings). >> >> IMO we should get things to work well without riding on unrelated >> exits, especially as we're trying to reduce those exits. > > A block I/O request can potentially be very, very long lived. By > serializing requests like this, there's a high likelihood that it's > going to kill performance with anything capable of processing multiple > requests. Right, that's why I suggested having a queue depth at which disabling notification kicks in. The patch hardcodes this depth to 1, unpatched qemu is infinite, a good value is probably spindle count + VAT. > OTOH, if we aggressively poll the ring when we have an opportunity to, > there's very little down side to that and it addresses the > serialization problem. But we can't guarantee that we'll get those opportunities, so it doesn't address the problem in a general way. A guest that doesn't use hpet and only has a single virtio-blk device will not have any reason to exit to qemu.
On 01/11/2010 08:46 AM, Avi Kivity wrote: > On 01/11/2010 04:37 PM, Anthony Liguori wrote: >>> That has the downside of bouncing a cache line on unrelated exits. >> >> >> The read and write sides of the ring are widely separated in physical >> memory specifically to avoid cache line bouncing. > > I meant, exits on random vcpus will cause the cacheline containing the > notification disable flag to bounce around. As it is, we read it on > the vcpu that owns the queue and write it on that vcpu or the I/O thread. Bottom halves are always run from the IO thread. >>> It probably doesn't matter with qemu as it is now, since it will >>> bounce qemu_mutex, but it will hurt with large guests (especially if >>> they have many rings). >>> >>> IMO we should get things to work well without riding on unrelated >>> exits, especially as we're trying to reduce those exits. >> >> A block I/O request can potentially be very, very long lived. By >> serializing requests like this, there's a high likelihood that it's >> going to kill performance with anything capable of processing >> multiple requests. > > Right, that's why I suggested having a queue depth at which disabling > notification kicks in. The patch hardcodes this depth to 1, unpatched > qemu is infinite, a good value is probably spindle count + VAT. That means we would need a user visible option which is quite unfortunate. Also, that logic only really makes sense with cache=off. With cache=writethrough, you can get pathological cases whereas you have an uncached access followed by cached accesses. In fact, with read-ahead, this is probably not an uncommon scenario. >> OTOH, if we aggressively poll the ring when we have an opportunity >> to, there's very little down side to that and it addresses the >> serialization problem. > > But we can't guarantee that we'll get those opportunities, so it > doesn't address the problem in a general way. A guest that doesn't > use hpet and only has a single virtio-blk device will not have any > reason to exit to qemu. We can mitigate this with a timer but honestly, we need to do perf measurements to see. My feeling is that we will need some more aggressive form of polling than just waiting for IO completion. I don't think queue depth is enough because it assumes that all requests are equal. When dealing with cache=off or even just storage with it's own cache, that's simply not the case. Regards, Anthony Liguori
On 01/11/2010 05:13 PM, Anthony Liguori wrote: > On 01/11/2010 08:46 AM, Avi Kivity wrote: >> On 01/11/2010 04:37 PM, Anthony Liguori wrote: >>>> That has the downside of bouncing a cache line on unrelated exits. >>> >>> >>> The read and write sides of the ring are widely separated in >>> physical memory specifically to avoid cache line bouncing. >> >> I meant, exits on random vcpus will cause the cacheline containing >> the notification disable flag to bounce around. As it is, we read it >> on the vcpu that owns the queue and write it on that vcpu or the I/O >> thread. > > Bottom halves are always run from the IO thread. Okay, so that won't be an issue. >>>> It probably doesn't matter with qemu as it is now, since it will >>>> bounce qemu_mutex, but it will hurt with large guests (especially >>>> if they have many rings). >>>> >>>> IMO we should get things to work well without riding on unrelated >>>> exits, especially as we're trying to reduce those exits. >>> >>> A block I/O request can potentially be very, very long lived. By >>> serializing requests like this, there's a high likelihood that it's >>> going to kill performance with anything capable of processing >>> multiple requests. >> >> Right, that's why I suggested having a queue depth at which disabling >> notification kicks in. The patch hardcodes this depth to 1, >> unpatched qemu is infinite, a good value is probably spindle count + >> VAT. > > That means we would need a user visible option which is quite > unfortunate. We could guess it, perhaps. > Also, that logic only really makes sense with cache=off. With > cache=writethrough, you can get pathological cases whereas you have an > uncached access followed by cached accesses. In fact, with > read-ahead, this is probably not an uncommon scenario. So you'd increase the disable depths when cache!=none. >>> OTOH, if we aggressively poll the ring when we have an opportunity >>> to, there's very little down side to that and it addresses the >>> serialization problem. >> >> But we can't guarantee that we'll get those opportunities, so it >> doesn't address the problem in a general way. A guest that doesn't >> use hpet and only has a single virtio-blk device will not have any >> reason to exit to qemu. > > We can mitigate this with a timer but honestly, we need to do perf > measurements to see. My feeling is that we will need some more > aggressive form of polling than just waiting for IO completion. I > don't think queue depth is enough because it assumes that all requests > are equal. When dealing with cache=off or even just storage with it's > own cache, that's simply not the case. Maybe we can adapt behaviour dynamically based on how fast results come in.
On 01/11/2010 09:19 AM, Avi Kivity wrote: >>>> OTOH, if we aggressively poll the ring when we have an opportunity >>>> to, there's very little down side to that and it addresses the >>>> serialization problem. >>> >>> But we can't guarantee that we'll get those opportunities, so it >>> doesn't address the problem in a general way. A guest that doesn't >>> use hpet and only has a single virtio-blk device will not have any >>> reason to exit to qemu. >> >> We can mitigate this with a timer but honestly, we need to do perf >> measurements to see. My feeling is that we will need some more >> aggressive form of polling than just waiting for IO completion. I >> don't think queue depth is enough because it assumes that all >> requests are equal. When dealing with cache=off or even just storage >> with it's own cache, that's simply not the case. > > Maybe we can adapt behaviour dynamically based on how fast results > come in. Based on our experiences with virtio-net, what I'd suggest is to make a lot of tunable options (ring size, various tx mitigation schemes, timeout durations, etc) and then we can do some deep performance studies to see how things interact with each other. I think we should do that before making any changes because I'm deeply concerned that we'll introduce significant performance regressions. Regards, Anthony Liguori
On 01/11/2010 05:22 PM, Anthony Liguori wrote: > > Based on our experiences with virtio-net, what I'd suggest is to make > a lot of tunable options (ring size, various tx mitigation schemes, > timeout durations, etc) and then we can do some deep performance > studies to see how things interact with each other. > > I think we should do that before making any changes because I'm deeply > concerned that we'll introduce significant performance regressions. > I agree. We can start with this patch, with a tunable depth, defaulting to current behaviour.
On 01/11/2010 09:31 AM, Avi Kivity wrote: > On 01/11/2010 05:22 PM, Anthony Liguori wrote: >> >> Based on our experiences with virtio-net, what I'd suggest is to make >> a lot of tunable options (ring size, various tx mitigation schemes, >> timeout durations, etc) and then we can do some deep performance >> studies to see how things interact with each other. >> >> I think we should do that before making any changes because I'm >> deeply concerned that we'll introduce significant performance >> regressions. >> > > I agree. We can start with this patch, with a tunable depth, > defaulting to current behaviour. I wouldn't be opposed to that provided we made it clear that these options were not supported long term. I don't want management tools (like libvirt) to start relying on them. Regards, Anthony Liguori
On 01/11/2010 05:32 PM, Anthony Liguori wrote: > On 01/11/2010 09:31 AM, Avi Kivity wrote: >> On 01/11/2010 05:22 PM, Anthony Liguori wrote: >>> >>> Based on our experiences with virtio-net, what I'd suggest is to >>> make a lot of tunable options (ring size, various tx mitigation >>> schemes, timeout durations, etc) and then we can do some deep >>> performance studies to see how things interact with each other. >>> >>> I think we should do that before making any changes because I'm >>> deeply concerned that we'll introduce significant performance >>> regressions. >>> >> >> I agree. We can start with this patch, with a tunable depth, >> defaulting to current behaviour. > > I wouldn't be opposed to that provided we made it clear that these > options were not supported long term. I don't want management tools > (like libvirt) to start relying on them. > x-option-name for experimental options? -device disk,if=virtio,x-queue-depth-suppress-notify=4
On 01/11/2010 09:35 AM, Avi Kivity wrote: > On 01/11/2010 05:32 PM, Anthony Liguori wrote: >> On 01/11/2010 09:31 AM, Avi Kivity wrote: >>> On 01/11/2010 05:22 PM, Anthony Liguori wrote: >>>> >>>> Based on our experiences with virtio-net, what I'd suggest is to >>>> make a lot of tunable options (ring size, various tx mitigation >>>> schemes, timeout durations, etc) and then we can do some deep >>>> performance studies to see how things interact with each other. >>>> >>>> I think we should do that before making any changes because I'm >>>> deeply concerned that we'll introduce significant performance >>>> regressions. >>>> >>> >>> I agree. We can start with this patch, with a tunable depth, >>> defaulting to current behaviour. >> >> I wouldn't be opposed to that provided we made it clear that these >> options were not supported long term. I don't want management tools >> (like libvirt) to start relying on them. >> > > x-option-name for experimental options? > > -device disk,if=virtio,x-queue-depth-suppress-notify=4 Sounds reasonable to me. Regards, Anthony Liguori
On Mon, Jan 11, 2010 at 08:37:10AM -0600, Anthony Liguori wrote: > On 01/11/2010 08:29 AM, Avi Kivity wrote: >> On 01/11/2010 03:49 PM, Anthony Liguori wrote: >>>> So instead of disabling notify while requests are active we might want >>>> to only disable it while we are inside virtio_blk_handle_output. >>>> Something like the following minimally tested patch: >>> >>> >>> I'd suggest that we get even more aggressive and install an idle >>> bottom half that checks the queue for newly submitted requests. If >>> we keep getting requests submitted before a new one completes, we'll >>> never take an I/O exit. >>> >> >> That has the downside of bouncing a cache line on unrelated exits. > > The read and write sides of the ring are widely separated in physical > memory specifically to avoid cache line bouncing. > >> It probably doesn't matter with qemu as it is now, since it will >> bounce qemu_mutex, but it will hurt with large guests (especially if >> they have many rings). >> >> IMO we should get things to work well without riding on unrelated >> exits, especially as we're trying to reduce those exits. > > A block I/O request can potentially be very, very long lived. By > serializing requests like this, there's a high likelihood that it's > going to kill performance with anything capable of processing multiple > requests. > > OTOH, if we aggressively poll the ring when we have an opportunity to, > there's very little down side to that and it addresses the serialization > problem. > >>> The same approach is probably a good idea for virtio-net. >> >> With vhost-net you don't see exits. > > The point is, when we've disabled notification, we should poll on the > ring for additional requests instead of waiting for one to complete > before looking at another one. > > Even with vhost-net, this logic is still applicable although potentially > harder to achieve. > > Regards, > > Anthony Liguori > vhost net does this already: it has a mode where it poll when skbs have left send queue: for tap this is when they have crossed the bridge, for packet socket this is when they have been transmitted.
On Mon, Jan 11, 2010 at 09:13:23AM -0600, Anthony Liguori wrote: > On 01/11/2010 08:46 AM, Avi Kivity wrote: >> On 01/11/2010 04:37 PM, Anthony Liguori wrote: >>>> That has the downside of bouncing a cache line on unrelated exits. >>> >>> >>> The read and write sides of the ring are widely separated in physical >>> memory specifically to avoid cache line bouncing. >> >> I meant, exits on random vcpus will cause the cacheline containing the >> notification disable flag to bounce around. As it is, we read it on >> the vcpu that owns the queue and write it on that vcpu or the I/O >> thread. > > Bottom halves are always run from the IO thread. >>>> It probably doesn't matter with qemu as it is now, since it will >>>> bounce qemu_mutex, but it will hurt with large guests (especially >>>> if they have many rings). >>>> >>>> IMO we should get things to work well without riding on unrelated >>>> exits, especially as we're trying to reduce those exits. >>> >>> A block I/O request can potentially be very, very long lived. By >>> serializing requests like this, there's a high likelihood that it's >>> going to kill performance with anything capable of processing >>> multiple requests. >> >> Right, that's why I suggested having a queue depth at which disabling >> notification kicks in. The patch hardcodes this depth to 1, unpatched >> qemu is infinite, a good value is probably spindle count + VAT. > > That means we would need a user visible option which is quite unfortunate. > > Also, that logic only really makes sense with cache=off. With > cache=writethrough, you can get pathological cases whereas you have an > uncached access followed by cached accesses. In fact, with read-ahead, > this is probably not an uncommon scenario. > >>> OTOH, if we aggressively poll the ring when we have an opportunity >>> to, there's very little down side to that and it addresses the >>> serialization problem. >> >> But we can't guarantee that we'll get those opportunities, so it >> doesn't address the problem in a general way. A guest that doesn't >> use hpet and only has a single virtio-blk device will not have any >> reason to exit to qemu. > > We can mitigate this with a timer but honestly, we need to do perf > measurements to see. My feeling is that we will need some more > aggressive form of polling than just waiting for IO completion. I don't > think queue depth is enough because it assumes that all requests are > equal. When dealing with cache=off or even just storage with it's own > cache, that's simply not the case. > > Regards, > > Anthony Liguori > BTW this is why vhost net uses queue depth in bytes.
Index: qemu/hw/virtio-blk.c =================================================================== --- qemu.orig/hw/virtio-blk.c 2010-01-11 14:28:42.896010503 +0100 +++ qemu/hw/virtio-blk.c 2010-01-11 14:40:13.535256353 +0100 @@ -328,7 +328,15 @@ static void virtio_blk_handle_output(Vir int num_writes = 0; BlockDriverState *old_bs = NULL; + /* + * While we are processing requests there is no need to get further + * notifications from the guest - it'll just burn cpu cycles doing + * useless context switches into the host. + */ + virtio_queue_set_notification(s->vq, 0); + while ((req = virtio_blk_get_request(s))) { +handle_request: if (req->elem.out_num < 1 || req->elem.in_num < 1) { fprintf(stderr, "virtio-blk missing headers\n"); exit(1); @@ -358,6 +366,18 @@ static void virtio_blk_handle_output(Vir } } + /* + * Once we're done processing all pending requests re-enable the queue + * notification. If there's an entry pending after we enabled + * notification again we hit a race and need to process it before + * returning. + */ + virtio_queue_set_notification(s->vq, 1); + req = virtio_blk_get_request(s); + if (req) { + goto handle_request; + } + if (num_writes > 0) { do_multiwrite(old_bs, blkreq, num_writes); }