Message ID: 20121212105101.GA6461@redhat.com
State: New
On Wed, Dec 12, 2012 at 12:51:01PM +0200, Michael S. Tsirkin wrote:
> Add sanity check to address the following concern:
>
> During migration, all we pass is the index of the request;
> the rest can be re-read from the ring.
>
> This is not generally enough if any available requests are outstanding.
> Imagine a ring of size 4. Below A means available, U means used.
>
> A 1
> A 2
> U 2
> A 2
> U 2
> A 2
> U 2
> A 2
> U 2
>
> At this point available ring has wrapped around, the only
> way to know head 1 is outstanding is because backend
> has stored this info somewhere.
>
> The reason we manage to migrate without tracking this in migration
> state is because we flush outstanding requests before
> migration.
> This flush is device-specific though, so let's add
> a safeguard in virtio core to ensure it's done properly.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>
> Changes from v1:
> v1 was against the wrong tree; it didn't build against qemu.git
>
>  hw/virtio.c | 2 ++
>  1 file changed, 2 insertions(+)

VirtIOBlock->rq can trigger the assertion.

IIUC hw/virtio-blk.c may handle I/O errors by keeping the request
pending and on a list (->rq). This allows the user to restart them
after, for example, adding more space to the host file system containing
the disk image file.

We keep a list of failed requests and we migrate this list. So I think
inuse != 0 when migrating with pending failed I/O requests.

Stefan
On 12/12/2012 14:50, Stefan Hajnoczi wrote:
> VirtIOBlock->rq can trigger the assertion.
>
> IIUC hw/virtio-blk.c may handle I/O errors by keeping the request
> pending and on a list (->rq). This allows the user to restart them
> after, for example, adding more space to the host file system containing
> the disk image file.
>
> We keep a list of failed requests and we migrate this list. So I think
> inuse != 0 when migrating with pending failed I/O requests.

Same for virtio-scsi. Each request in that case is sent as part of the
SCSIDevice that it refers to, via callbacks in SCSIBusInfo.

Paolo
On Wed, Dec 12, 2012 at 02:50:50PM +0100, Stefan Hajnoczi wrote:
> On Wed, Dec 12, 2012 at 12:51:01PM +0200, Michael S. Tsirkin wrote:
> > Add sanity check to address the following concern:
> >
> > During migration, all we pass is the index of the request;
> > the rest can be re-read from the ring.
> >
> > This is not generally enough if any available requests are outstanding.
> > Imagine a ring of size 4. Below A means available, U means used.
> >
> > A 1
> > A 2
> > U 2
> > A 2
> > U 2
> > A 2
> > U 2
> > A 2
> > U 2
> >
> > At this point available ring has wrapped around, the only
> > way to know head 1 is outstanding is because backend
> > has stored this info somewhere.
> >
> > The reason we manage to migrate without tracking this in migration
> > state is because we flush outstanding requests before
> > migration.
> > This flush is device-specific though, so let's add
> > a safeguard in virtio core to ensure it's done properly.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >
> > Changes from v1:
> > v1 was against the wrong tree; it didn't build against qemu.git
> >
> >  hw/virtio.c | 2 ++
> >  1 file changed, 2 insertions(+)
>
> VirtIOBlock->rq can trigger the assertion.
>
> IIUC hw/virtio-blk.c may handle I/O errors by keeping the request
> pending and on a list (->rq). This allows the user to restart them
> after, for example, adding more space to the host file system containing
> the disk image file.
>
> We keep a list of failed requests and we migrate this list.

Could not find it. It needs to be in virtio-blk in order to know to
update the used ring once it completes, right?

> So I think
> inuse != 0 when migrating with pending failed I/O requests.
>
> Stefan

Okay, but let's make sure this is not a bug.
On Wed, Dec 12, 2012 at 03:01:55PM +0100, Paolo Bonzini wrote:
> On 12/12/2012 14:50, Stefan Hajnoczi wrote:
> > VirtIOBlock->rq can trigger the assertion.
> >
> > IIUC hw/virtio-blk.c may handle I/O errors by keeping the request
> > pending and on a list (->rq). This allows the user to restart them
> > after, for example, adding more space to the host file system containing
> > the disk image file.
> >
> > We keep a list of failed requests and we migrate this list. So I think
> > inuse != 0 when migrating with pending failed I/O requests.
>
> Same for virtio-scsi. Each request in that case is sent as part of the
> SCSIDevice that it refers to, via callbacks in SCSIBusInfo.
>
> Paolo

Looks like this will leak ring entries.

All I see is: virtio_scsi_load calling virtio_load.
When the loading side will get last avail index it
will assume all requests up to that value have
completed, so it will never put the missing heads
in the used ring. And it is at this point too late
for the source to change the used ring, as guest
memory has migrated.
On 12/12/2012 15:30, Michael S. Tsirkin wrote:
> > Same for virtio-scsi. Each request in that case is sent as part of the
> > SCSIDevice that it refers to, via callbacks in SCSIBusInfo.

It is in virtio_scsi_load_request.

> Looks like this will leak ring entries.
>
> All I see is: virtio_scsi_load calling virtio_load.
> When the loading side will get last avail index it
> will assume all requests up to that value have
> completed, so it will never put the missing heads
> in the used ring.

Ok, so we need some API for virtio-{blk,scsi} to communicate back the
indexes of in-flight requests to virtio. The indexes are known from the
VirtQueueElement, so that's fine.

Even better would be a virtio_save_request/virtio_load_request API...

Paolo
On Wed, Dec 12, 2012 at 03:36:10PM +0100, Paolo Bonzini wrote:
> On 12/12/2012 15:30, Michael S. Tsirkin wrote:
> > > Same for virtio-scsi. Each request in that case is sent as part of the
> > > SCSIDevice that it refers to, via callbacks in SCSIBusInfo.
>
> It is in virtio_scsi_load_request.
>
> > Looks like this will leak ring entries.
> >
> > All I see is: virtio_scsi_load calling virtio_load.
> > When the loading side will get last avail index it
> > will assume all requests up to that value have
> > completed, so it will never put the missing heads
> > in the used ring.
>
> Ok, so we need some API for virtio-{blk,scsi} to communicate back the
> indexes of in-flight requests to virtio. The indexes are known from the
> VirtQueueElement, so that's fine.
>
> Even better would be a virtio_save_request/virtio_load_request API...
>
> Paolo

So you are saying this is a bug then? Great.
This is exactly what the assert above is out there to catch.
And you really can't fix it without breaking migration compatibility.

As step 1, I think we should just complete all outstanding
requests when VM stops.

Yes it means you can't do the retry hack after migration
but this is hardly common scenario.
On 12/12/2012 15:47, Michael S. Tsirkin wrote:
> > Ok, so we need some API for virtio-{blk,scsi} to communicate back the
> > indexes of in-flight requests to virtio. The indexes are known from the
> > VirtQueueElement, so that's fine.
> >
> > Even better would be a virtio_save_request/virtio_load_request API...
>
> So you are saying this is a bug then? Great.

I'm not sure what you mean by "it will never put the missing heads
in the used ring". The serialized requests are put in the used rings
when they are completed and virtio-{blk,scsi} calls virtqueue_push. Is
the problem if you have a vring that looks like this:

A A U A U U A A

? Which heads are leaked? {0,1}, {2} or {6,7}? Or a combination thereof?

Also, I'm not sure your "fix" (crash QEMU) is correct. I hope we can
make it work.

> This is exactly what the assert above is out there to catch.
> And you really can't fix it without breaking migration compatibility.

Why not? The index in the vring is in the migration data.

> As step 1, I think we should just complete all outstanding
> requests when VM stops.
>
> Yes it means you can't do the retry hack after migration
> but this is hardly common scenario.

I disagree...

Paolo
On Wed, Dec 12, 2012 at 04:01:27PM +0100, Paolo Bonzini wrote:
> On 12/12/2012 15:47, Michael S. Tsirkin wrote:
> > > Ok, so we need some API for virtio-{blk,scsi} to communicate back the
> > > indexes of in-flight requests to virtio. The indexes are known from the
> > > VirtQueueElement, so that's fine.
> > >
> > > Even better would be a virtio_save_request/virtio_load_request API...
> >
> > So you are saying this is a bug then? Great.
>
> I'm not sure what you mean by "it will never put the missing heads
> in the used ring". The serialized requests are put in the used rings
> when they are completed and virtio-{blk,scsi} calls virtqueue_push. Is
> the problem if you have a vring that looks like this:
>
> A A U A U U A A
>
> ? Which heads are leaked? {0,1}, {2} or {6,7}? Or a combination thereof?

I don't know what A A U A U U A A means.
Pls see the example of the problem in the original commit log.

> Also, I'm not sure your "fix" (crash QEMU) is correct. I hope we can
> make it work.

Of course assert is not a fix. The patch is just to catch bugs like
this one. To make it work, complete all requests when vm is stopped.
This needs to be done in the specific device; the assert will catch
buggy devices.

> > This is exactly what the assert above is out there to catch.
> > And you really can't fix it without breaking migration compatibility.
>
> Why not? The index in the vring is in the migration data.

index is not enough if requests are outstanding.
Pls check the example in the log of the patch.

> > As step 1, I think we should just complete all outstanding
> > requests when VM stops.
> >
> > Yes it means you can't do the retry hack after migration
> > but this is hardly common scenario.
>
> I disagree...
>
> Paolo

Disagree with what? You are saying it's common?
On 12/12/2012 16:25, Michael S. Tsirkin wrote:
> On Wed, Dec 12, 2012 at 04:01:27PM +0100, Paolo Bonzini wrote:
>> On 12/12/2012 15:47, Michael S. Tsirkin wrote:
>>>> Ok, so we need some API for virtio-{blk,scsi} to communicate back the
>>>> indexes of in-flight requests to virtio. The indexes are known from the
>>>> VirtQueueElement, so that's fine.
>>>>
>>>> Even better would be a virtio_save_request/virtio_load_request API...
>>>
>>> So you are saying this is a bug then? Great.
>>
>> I'm not sure what you mean by "it will never put the missing heads
>> in the used ring". The serialized requests are put in the used rings
>> when they are completed and virtio-{blk,scsi} calls virtqueue_push. Is
>> the problem if you have a vring that looks like this:
>>
>> A A U A U U A A
>>
>> ? Which heads are leaked? {0,1}, {2} or {6,7}? Or a combination thereof?
>
> I don't know what A A U A U U A A means.

Right, not very clear.... It means that descriptor 2 and 4 and 5 are
free, while the others are in-flight.

> To make it work, complete all requests when vm is stopped.

That's not a choice, sorry.

>>> This is exactly what the assert above is out there to catch.
>>> And you really can't fix it without breaking migration compatibility.
>>
>> Why not? The index in the vring is in the migration data.
>
> index is not enough if requests are outstanding.

Sorry, I meant the descriptor index.

> Pls check the example in the log of the patch.

I'm likewise not sure what you meant by

A 1
A 2
U 2
A 2
U 2
A 2
U 2
A 2 <---
U 2

If I understand it, before the point marked with the arrow, the avail
ring is

1 2 2 2

vring_avail_idx(vq) == 3
last_avail_idx == 3

After, it is

2 2 2 2

vring_avail_idx(vq) == 4
last_avail_idx == 3

What's wrong with that?

You wrote "the only way to know head 1 is outstanding is because backend
has stored this info somewhere". But the backend _is_ tracking it (by
serializing and then restoring the VirtQueueElement) and no leak happens
because virtqueue_fill/flush will put the head on the used ring sooner
or later.

>>> As step 1, I think we should just complete all outstanding
>>> requests when VM stops.
>>>
>>> Yes it means you can't do the retry hack after migration
>>> but this is hardly common scenario.
>>
>> I disagree...
>
> Disagree with what? You are saying it's common?

It's not common, but you cannot block migration because you have an I/O
error. Solving the error may involve migrating the guests away from
that host.

Paolo
On Wed, Dec 12, 2012 at 04:52:53PM +0100, Paolo Bonzini wrote:
> On 12/12/2012 16:25, Michael S. Tsirkin wrote:
> > On Wed, Dec 12, 2012 at 04:01:27PM +0100, Paolo Bonzini wrote:
> >> On 12/12/2012 15:47, Michael S. Tsirkin wrote:
> >>>> Ok, so we need some API for virtio-{blk,scsi} to communicate back the
> >>>> indexes of in-flight requests to virtio. The indexes are known from the
> >>>> VirtQueueElement, so that's fine.
> >>>>
> >>>> Even better would be a virtio_save_request/virtio_load_request API...
> >>>
> >>> So you are saying this is a bug then? Great.
> >>
> >> I'm not sure what you mean by "it will never put the missing heads
> >> in the used ring". The serialized requests are put in the used rings
> >> when they are completed and virtio-{blk,scsi} calls virtqueue_push. Is
> >> the problem if you have a vring that looks like this:
> >>
> >> A A U A U U A A
> >>
> >> ? Which heads are leaked? {0,1}, {2} or {6,7}? Or a combination thereof?
> >
> > I don't know what A A U A U U A A means.
>
> Right, not very clear.... It means that descriptor 2 and 4 and 5 are
> free, while the others are in-flight.
>
> > To make it work, complete all requests when vm is stopped.
>
> That's not a choice, sorry.
>
> >>> This is exactly what the assert above is out there to catch.
> >>> And you really can't fix it without breaking migration compatibility.
> >>
> >> Why not? The index in the vring is in the migration data.
> >
> > index is not enough if requests are outstanding.
>
> Sorry, I meant the descriptor index.
>
> > Pls check the example in the log of the patch.
>
> I'm likewise not sure what you meant by
>
> A 1
> A 2
> U 2
> A 2
> U 2
> A 2
> U 2
> A 2 <---
> U 2
>
> If I understand it, before the point marked with the arrow, the avail
> ring is
>
> 1 2 2 2
>
> vring_avail_idx(vq) == 3
> last_avail_idx == 3
>
> After, it is
>
> 2 2 2 2
>
> vring_avail_idx(vq) == 4
> last_avail_idx == 3
>
> What's wrong with that?
>
> You wrote "the only way to know head 1 is outstanding is because backend
> has stored this info somewhere". But the backend _is_ tracking it (by
> serializing and then restoring the VirtQueueElement) and no leak happens
> because virtqueue_fill/flush will put the head on the used ring sooner
> or later.

If you did this before save vm inuse would be 0.
You said that at the point where we save state, some entries are
outstanding. It is too late to put head at that point.

> >>> As step 1, I think we should just complete all outstanding
> >>> requests when VM stops.
> >>>
> >>> Yes it means you can't do the retry hack after migration
> >>> but this is hardly common scenario.
> >>
> >> I disagree...
> >
> > Disagree with what? You are saying it's common?
>
> It's not common, but you cannot block migration because you have an I/O
> error. Solving the error may involve migrating the guests away from
> that host.
>
> Paolo

No, you should complete with error.
On 12/12/2012 17:37, Michael S. Tsirkin wrote:
> > You wrote "the only way to know head 1 is outstanding is because backend
> > has stored this info somewhere". But the backend _is_ tracking it (by
> > serializing and then restoring the VirtQueueElement) and no leak happens
> > because virtqueue_fill/flush will put the head on the used ring sooner
> > or later.
>
> If you did this before save vm inuse would be 0.

No, I won't. I want a simple API that the device can call to keep inuse
up-to-date. Perhaps a bit ugly compared to just saving inuse, but it
works. Or are there other bits that need resyncing besides inuse? Bits
that cannot be recovered from the existing migration data?

> You said that at the point where we save state,
> some entries are outstanding. It is too late to
> put head at that point.

I don't want to put head on the source. I want to put it on the
destination, when the request is completed. Same as it is done now,
with bugfixes of course. Are there any problems doing so, except that
inuse will not be up-to-date (easily fixed)?

> > It's not common, but you cannot block migration because you have an I/O
> > error. Solving the error may involve migrating the guests away from
> > that host.
>
> No, you should complete with error.

Knowing that the request will fail, the admin will not be able to do
migration, even if that will solve the error transparently.

Paolo
On Wed, Dec 12, 2012 at 05:51:51PM +0100, Paolo Bonzini wrote:
> On 12/12/2012 17:37, Michael S. Tsirkin wrote:
> > > You wrote "the only way to know head 1 is outstanding is because backend
> > > has stored this info somewhere". But the backend _is_ tracking it (by
> > > serializing and then restoring the VirtQueueElement) and no leak happens
> > > because virtqueue_fill/flush will put the head on the used ring sooner
> > > or later.
> >
> > If you did this before save vm inuse would be 0.
>
> No, I won't. I want a simple API that the device can call to keep inuse
> up-to-date. Perhaps a bit ugly compared to just saving inuse, but it
> works. Or are there other bits that need resyncing besides inuse? Bits
> that cannot be recovered from the existing migration data?

Saving inuse counter is useless. We need to know which requests
are outstanding if we want to retry them on remote.

> > You said that at the point where we save state,
> > some entries are outstanding. It is too late to
> > put head at that point.
>
> I don't want to put head on the source. I want to put it on the
> destination, when the request is completed. Same as it is done now,
> with bugfixes of course. Are there any problems doing so, except that
> inuse will not be up-to-date (easily fixed)?

You have an outstanding request that is behind last avail index.
You do not want to complete it. You migrate. There is no
way for remote to understand that the request is outstanding.

> > > It's not common, but you cannot block migration because you have an I/O
> > > error. Solving the error may involve migrating the guests away from
> > > that host.
> >
> > No, you should complete with error.
>
> Knowing that the request will fail, the admin will not be able to do
> migration, even if that will solve the error transparently.

You are saying there's no way to complete all requests?
On 12/12/2012 18:14, Michael S. Tsirkin wrote:
> On Wed, Dec 12, 2012 at 05:51:51PM +0100, Paolo Bonzini wrote:
>> On 12/12/2012 17:37, Michael S. Tsirkin wrote:
>>>> You wrote "the only way to know head 1 is outstanding is because backend
>>>> has stored this info somewhere". But the backend _is_ tracking it (by
>>>> serializing and then restoring the VirtQueueElement) and no leak happens
>>>> because virtqueue_fill/flush will put the head on the used ring sooner
>>>> or later.
>>>
>>> If you did this before save vm inuse would be 0.
>>
>> No, I won't. I want a simple API that the device can call to keep inuse
>> up-to-date. Perhaps a bit ugly compared to just saving inuse, but it
>> works. Or are there other bits that need resyncing besides inuse? Bits
>> that cannot be recovered from the existing migration data?
>
> Saving inuse counter is useless. We need to know which requests
> are outstanding if we want to retry them on remote.

And that's what virtio-blk and virtio-scsi have been doing for years.
They store the VirtQueueElement including the index and the sglists.
Can you explain *why* the index is not enough to reconstruct the state
on the destination? There may be bugs and you may need help from
virtio_blk_load, but that's okay.

>>> You said that at the point where we save state,
>>> some entries are outstanding. It is too late to
>>> put head at that point.
>>
>> I don't want to put head on the source. I want to put it on the
>> destination, when the request is completed. Same as it is done now,
>> with bugfixes of course. Are there any problems doing so, except that
>> inuse will not be up-to-date (easily fixed)?
>
> You have an outstanding request that is behind last avail index.
> You do not want to complete it. You migrate. There is no
> way for remote to understand that the request is outstanding.

The savevm callbacks know which request is outstanding and pass the
information to the destination. See virtio_blk_save and virtio_blk_load.

What is not clear, and you haven't explained, is how you get to a bug in
the handling of the avail ring. What's wrong with this explanation:

A 1
A 2
U 2
A 2
U 2
A 2
U 2
A 2 <---
U 2

where before the point marked with the arrow, the avail ring is

1 2 2 2

vring_avail_idx(vq) == 3
last_avail_idx == 3

and after the point marked with the arrow, the avail ring is

2 2 2 2

vring_avail_idx(vq) == 4
last_avail_idx == 3

?!?

>>>> It's not common, but you cannot block migration because you have an I/O
>>>> error. Solving the error may involve migrating the guests away from
>>>> that host.
>>>
>>> No, you should complete with error.
>>
>> Knowing that the request will fail, the admin will not be able to do
>> migration, even if that will solve the error transparently.
>
> You are saying there's no way to complete all requests?

With an error, yes. Transparently after fixing the error (which may
involve migration), no.

Paolo
On Wed, Dec 12, 2012 at 06:39:15PM +0100, Paolo Bonzini wrote:
> On 12/12/2012 18:14, Michael S. Tsirkin wrote:
> > On Wed, Dec 12, 2012 at 05:51:51PM +0100, Paolo Bonzini wrote:
> >> On 12/12/2012 17:37, Michael S. Tsirkin wrote:
> >>>> You wrote "the only way to know head 1 is outstanding is because backend
> >>>> has stored this info somewhere". But the backend _is_ tracking it (by
> >>>> serializing and then restoring the VirtQueueElement) and no leak happens
> >>>> because virtqueue_fill/flush will put the head on the used ring sooner
> >>>> or later.
> >>>
> >>> If you did this before save vm inuse would be 0.
> >>
> >> No, I won't. I want a simple API that the device can call to keep inuse
> >> up-to-date. Perhaps a bit ugly compared to just saving inuse, but it
> >> works. Or are there other bits that need resyncing besides inuse? Bits
> >> that cannot be recovered from the existing migration data?
> >
> > Saving inuse counter is useless. We need to know which requests
> > are outstanding if we want to retry them on remote.
>
> And that's what virtio-blk and virtio-scsi have been doing for years.

I don't see it - all I see in save is virtio_save.
Where's the extra code to save the elements in flight and send them to
remote?

> They store the VirtQueueElement including the index and the sglists.
> Can you explain *why* the index is not enough to reconstruct the state
> on the destination? There may be bugs and you may need help from
> virtio_blk_load, but that's okay.
>
> >>> You said that at the point where we save state,
> >>> some entries are outstanding. It is too late to
> >>> put head at that point.
> >>
> >> I don't want to put head on the source. I want to put it on the
> >> destination, when the request is completed. Same as it is done now,
> >> with bugfixes of course. Are there any problems doing so, except that
> >> inuse will not be up-to-date (easily fixed)?
> >
> > You have an outstanding request that is behind last avail index.
> > You do not want to complete it. You migrate. There is no
> > way for remote to understand that the request is outstanding.
>
> The savevm callbacks know which request is outstanding and pass the
> information to the destination. See virtio_blk_save and virtio_blk_load.
>
> What is not clear, and you haven't explained, is how you get to a bug in
> the handling of the avail ring. What's wrong with this explanation:
>
> A 1
> A 2
> U 2
> A 2
> U 2
> A 2
> U 2
> A 2 <---
> U 2
>
> where before the point marked with the arrow, the avail ring is
>
> 1 2 2 2
>
> vring_avail_idx(vq) == 3
> last_avail_idx == 3
>
> and after the point marked with the arrow, the avail ring is
>
> 2 2 2 2
>
> vring_avail_idx(vq) == 4
> last_avail_idx == 3
>
> ?!?

You need to retry A1 on remote. How do you do that? There's
no way to find out it has not been completed
from the ring itself.

> >>>> It's not common, but you cannot block migration because you have an I/O
> >>>> error. Solving the error may involve migrating the guests away from
> >>>> that host.
> >>>
> >>> No, you should complete with error.
> >>
> >> Knowing that the request will fail, the admin will not be able to do
> >> migration, even if that will solve the error transparently.
> >
> > You are saying there's no way to complete all requests?
>
> With an error, yes. Transparently after fixing the error (which may
> involve migration), no.
>
> Paolo
On 12/12/2012 20:23, Michael S. Tsirkin wrote:
>>> Saving inuse counter is useless. We need to know which requests
>>> are outstanding if we want to retry them on remote.
>>
>> And that's what virtio-blk and virtio-scsi have been doing for years.
>
> I don't see it - all I see in save is virtio_save.

static void virtio_blk_save(QEMUFile *f, void *opaque)
{
    VirtIOBlock *s = opaque;
    VirtIOBlockReq *req = s->rq;

    virtio_save(&s->vdev, f);

    while (req) {
        qemu_put_sbyte(f, 1);
        qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
        req = req->next;
    }
    qemu_put_sbyte(f, 0);
}

virtio-scsi does it in virtio_scsi_save_request.

> You need to retry A1 on remote. How do you do that? There's
> no way to find out it has not been completed
> from the ring itself.

virtio_blk_dma_restart_bh and scsi_dma_restart_bh do it.

Paolo
On Wed, Dec 12, 2012 at 10:00:37PM +0100, Paolo Bonzini wrote:
> On 12/12/2012 20:23, Michael S. Tsirkin wrote:
> >>> Saving inuse counter is useless. We need to know which requests
> >>> are outstanding if we want to retry them on remote.
> >>
> >> And that's what virtio-blk and virtio-scsi have been doing for years.
> >
> > I don't see it - all I see in save is virtio_save.
>
> static void virtio_blk_save(QEMUFile *f, void *opaque)
> {
>     VirtIOBlock *s = opaque;
>     VirtIOBlockReq *req = s->rq;
>
>     virtio_save(&s->vdev, f);
>
>     while (req) {
>         qemu_put_sbyte(f, 1);
>         qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));

Ow. Does it really save VirtQueueElement?

typedef struct VirtQueueElement
{
    unsigned int index;
    unsigned int out_num;
    unsigned int in_num;
    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
} VirtQueueElement;

Complete with pointers into qemu memory and all?
That's got to hurt.

All we really need is the index.

> req = req->next;
> }
> qemu_put_sbyte(f, 0);
> }
>
> virtio-scsi does it in virtio_scsi_save_request.
>
> > You need to retry A1 on remote. How do you do that? There's
> > no way to find out it has not been completed
> > from the ring itself.
>
> virtio_blk_dma_restart_bh and scsi_dma_restart_bh do it.
>
> Paolo

Okay, so the only bug is inuse getting negative right?
So all we need to do is fix up the inuse value
after restoring the outstanding requests - basically
count the restored buffers and set inuse accordingly.
"Michael S. Tsirkin" <mst@redhat.com> writes:

> On Wed, Dec 12, 2012 at 10:00:37PM +0100, Paolo Bonzini wrote:
>> On 12/12/2012 20:23, Michael S. Tsirkin wrote:
>> >>> Saving inuse counter is useless. We need to know which requests
>> >>> are outstanding if we want to retry them on remote.
>> >>
>> >> And that's what virtio-blk and virtio-scsi have been doing for years.
>> >
>> > I don't see it - all I see in save is virtio_save.
>>
>> static void virtio_blk_save(QEMUFile *f, void *opaque)
>> {
>>     VirtIOBlock *s = opaque;
>>     VirtIOBlockReq *req = s->rq;
>>
>>     virtio_save(&s->vdev, f);
>>
>>     while (req) {
>>         qemu_put_sbyte(f, 1);
>>         qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
>
> Ow. Does it really save VirtQueueElement?
>
> typedef struct VirtQueueElement
> {
>     unsigned int index;
>     unsigned int out_num;
>     unsigned int in_num;
>     hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
>     hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
>     struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
>     struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> } VirtQueueElement;
>
> Complete with pointers into qemu memory and all?
> That's got to hurt.
>
> All we really need is the index.

Yes, take a look at the series I sent out that scrubs all of this to
just send the index and the addresses of the element.

We technically should save the addresses and sizes too. It makes it a
heck of a lot safer than re-reading guest memory since we do some
validation on the size of the sg elements. But we could get away with
only saving the index if we really wanted to.

Regards,

Anthony Liguori

>> req = req->next;
>> }
>> qemu_put_sbyte(f, 0);
>> }
>>
>> virtio-scsi does it in virtio_scsi_save_request.
>>
>> > You need to retry A1 on remote. How do you do that? There's
>> > no way to find out it has not been completed
>> > from the ring itself.
>>
>> virtio_blk_dma_restart_bh and scsi_dma_restart_bh do it.
>>
>> Paolo
>
> Okay, so the only bug is inuse getting negative right?
> So all we need to do is fix up the inuse value
> after restoring the outstanding requests - basically
> count the restored buffers and set inuse accordingly.
>
> --
> MST
On Wed, Dec 12, 2012 at 03:33:32PM -0600, Anthony Liguori wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
>
> > On Wed, Dec 12, 2012 at 10:00:37PM +0100, Paolo Bonzini wrote:
> >> On 12/12/2012 20:23, Michael S. Tsirkin wrote:
> >> >>> Saving inuse counter is useless. We need to know which requests
> >> >>> are outstanding if we want to retry them on remote.
> >> >>
> >> >> And that's what virtio-blk and virtio-scsi have been doing for years.
> >> >
> >> > I don't see it - all I see in save is virtio_save.
> >>
> >> static void virtio_blk_save(QEMUFile *f, void *opaque)
> >> {
> >>     VirtIOBlock *s = opaque;
> >>     VirtIOBlockReq *req = s->rq;
> >>
> >>     virtio_save(&s->vdev, f);
> >>
> >>     while (req) {
> >>         qemu_put_sbyte(f, 1);
> >>         qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
> >
> > Ow. Does it really save VirtQueueElement?
> >
> > typedef struct VirtQueueElement
> > {
> >     unsigned int index;
> >     unsigned int out_num;
> >     unsigned int in_num;

BTW there's a hole after in_num which is uninitialized; that's also a
nasty thing to send on the wire.

> >     hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
> >     hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
> >     struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> >     struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> > } VirtQueueElement;
> >
> > Complete with pointers into qemu memory and all?
> > That's got to hurt.
> >
> > All we really need is the index.
>
> Yes, take a look at the series I sent out that scrubs all of this to
> just send the index and the addresses of the element.

Will do.

> We technically should save the addresses and sizes too.

I guess as long as these are guest addresses, not the qemu ones.

> It makes it a
> heck of a lot safer than re-reading guest memory since we do some
> validation on the size of the sg elements.
> But we could get away with only saving the index if we really wanted to.

I guess re-validating it is needed anyway: we should not trust remote
more than we trust the guest.

> Regards,
>
> Anthony Liguori
>
> >> req = req->next;
> >> }
> >> qemu_put_sbyte(f, 0);
> >> }
> >>
> >> virtio-scsi does it in virtio_scsi_save_request.
> >>
> >> > You need to retry A1 on remote. How do you do that? There's
> >> > no way to find out it has not been completed
> >> > from the ring itself.
> >>
> >> virtio_blk_dma_restart_bh and scsi_dma_restart_bh do it.
> >>
> >> Paolo
> >
> > Okay, so the only bug is inuse getting negative right?
> > So all we need to do is fix up the inuse value
> > after restoring the outstanding requests - basically
> > count the restored buffers and set inuse accordingly.
Il 12/12/2012 22:19, Michael S. Tsirkin ha scritto:
> On Wed, Dec 12, 2012 at 10:00:37PM +0100, Paolo Bonzini wrote:
>> Il 12/12/2012 20:23, Michael S. Tsirkin ha scritto:
>>>>> Saving inuse counter is useless. We need to know which requests
>>>>> are outstanding if we want to retry them on remote.
>>>>
>>>> And that's what virtio-blk and virtio-scsi have been doing for years.
>>>
>>> I don't see it - all I see in save is virtio_save.
>>
>> static void virtio_blk_save(QEMUFile *f, void *opaque)
>> {
>>     VirtIOBlock *s = opaque;
>>     VirtIOBlockReq *req = s->rq;
>>
>>     virtio_save(&s->vdev, f);
>>
>>     while (req) {
>>         qemu_put_sbyte(f, 1);
>>         qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
>
> Ow. Does it really save VirtQueueElement?
>
> typedef struct VirtQueueElement
> {
>     unsigned int index;
>     unsigned int out_num;
>     unsigned int in_num;
>     hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
>     hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
>     struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
>     struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> } VirtQueueElement;
>
> Complete with pointers into qemu memory and all?

Yes, that sucks (also the endianness is totally broken), but the
pointers into QEMU memory are restored afterwards by remapping the
addresses.

Like Anthony, I'm not sure whether reloading the sglist from guest
memory is safe; it would require re-validation of everything.

We _can_ trust the remote. Things like races on connecting to the
incoming-migration socket are best handled outside QEMU, with a
firewall or a separate network that is not guest-accessible. I'm
pretty sure that virtio-blk is the least of our worries here.

> Okay, so the only bug is inuse getting negative right?
> So all we need to do is fix up the inuse value
> after restoring the outstanding requests - basically
> count the restored buffers and set inuse accordingly.

Yes, I agree.

Paolo
Am 12.12.2012 17:37, schrieb Michael S. Tsirkin:
> On Wed, Dec 12, 2012 at 04:52:53PM +0100, Paolo Bonzini wrote:
>> Il 12/12/2012 16:25, Michael S. Tsirkin ha scritto:
>>> On Wed, Dec 12, 2012 at 04:01:27PM +0100, Paolo Bonzini wrote:
>>>> Il 12/12/2012 15:47, Michael S. Tsirkin ha scritto:
>>>>> As step 1, I think we should just complete all outstanding
>>>>> requests when VM stops.
>>>>>
>>>>> Yes it means you can't do the retry hack after migration
>>>>> but this is hardly common scenario.
>>>>
>>>> I disagree...
>>>
>>> Disagree with what? You are saying it's common?
>>
>> It's not common, but you cannot block migration because you have an I/O
>> error. Solving the error may involve migrating the guests away from
>> that host.
>>
>> Paolo
>
> No, you should complete with error.

Just to confirm that this isn't possible: rerror/werror=stop is a
supported feature, and I do get bug reports when it breaks (including
migration while the VM is stopped for an I/O error).

Makes it common enough that breaking it is definitely not an option,
right?

Kevin
Anthony Liguori <aliguori@us.ibm.com> writes:
> Yes, take a look at the series I sent out that scrubs all of this to
> just send the index and the addresses of the element.
>
> We technically should save the addresses and sizes too. It makes it a
> heck of a lot safer then re-reading guest memory since we do some
> validation on the size of the sg elements.

Not really.

The guest puts the descriptors in the ring and leaves them there until
the device acks. If it changes them once they're exposed but before
they're acked, it can get either the before or the after version, and
always could.

Cheers,
Rusty.
> > We technically should save the addresses and sizes too. It makes
> > it a heck of a lot safer then re-reading guest memory since we do some
> > validation on the size of the sg elements.
>
> Not really.
>
> The guest puts the descriptors in the ring and leaves them there until
> the device acks. If it changes them once they're exposed but before
> they're acked, it can get either before or after version, and always
> could.

The problems start when the guest tries to race against QEMU and defy
the validation. Always using the validated version is a bit easier
than redoing the validation after migration.

Paolo
On Thu, Dec 13, 2012 at 11:48:17AM +0100, Kevin Wolf wrote:
> Am 12.12.2012 17:37, schrieb Michael S. Tsirkin:
> > On Wed, Dec 12, 2012 at 04:52:53PM +0100, Paolo Bonzini wrote:
> >> Il 12/12/2012 16:25, Michael S. Tsirkin ha scritto:
> >>> On Wed, Dec 12, 2012 at 04:01:27PM +0100, Paolo Bonzini wrote:
> >>>> Il 12/12/2012 15:47, Michael S. Tsirkin ha scritto:
> >>>>> As step 1, I think we should just complete all outstanding
> >>>>> requests when VM stops.
> >>>>>
> >>>>> Yes it means you can't do the retry hack after migration
> >>>>> but this is hardly common scenario.
> >>>>
> >>>> I disagree...
> >>>
> >>> Disagree with what? You are saying it's common?
> >>
> >> It's not common, but you cannot block migration because you have an I/O
> >> error. Solving the error may involve migrating the guests away from
> >> that host.
> >>
> >> Paolo
> >
> > No, you should complete with error.
>
> Just to confirm that this isn't possible: rerror/werror=stop is a
> supported feature, and I do get bug reports when it breaks (including
> migration while the VM is stopped for an I/O error).
>
> Makes it common enough that breaking it is definitely not an option, right?
>
> Kevin

Nod. So we need to fix inuse, which is currently broken. We can still
have an assert, but it needs to be done more carefully: count the
outstanding requests and check that the count matches inuse.
Paolo Bonzini <pbonzini@redhat.com> writes:
>> > We technically should save the addresses and sizes too. It makes
>> > it a heck of a lot safer then re-reading guest memory since we do some
>> > validation on the size of the sg elements.
>>
>> Not really.
>>
>> The guest puts the descriptors in the ring and leaves them there until
>> the device acks. If it changes them once they're exposed but before
>> they're acked, it can get either before or after version, and always
>> could.
>
> The problems start when the guest tries to race against QEMU and defy
> the validation. Always using the validated version is a bit easier
> than redoing the validation after migration.

Exactly.

Regards,
Anthony Liguori

>
> Paolo
diff --git a/hw/virtio.c b/hw/virtio.c
index f40a8c5..6227642 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -788,6 +788,8 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
         if (vdev->vq[i].vring.num == 0)
             break;
 
+        assert(!vdev->vq[i].inuse);
+
         qemu_put_be32(f, vdev->vq[i].vring.num);
         qemu_put_be64(f, vdev->vq[i].pa);
         qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
Add a sanity check to address the following concern:

During migration, all we pass is the index of the request;
the rest can be re-read from the ring.

This is not generally enough if any available requests are
outstanding. Imagine a ring of size 4. Below, A means available,
U means used.

A 1
A 2
U 2
A 2
U 2
A 2
U 2
A 2
U 2

At this point the available ring has wrapped around; the only
way to know head 1 is outstanding is because the backend
has stored this info somewhere.

The reason we manage to migrate without tracking this in the
migration state is that we flush outstanding requests before
migration. This flush is device-specific, though, so let's add
a safeguard in the virtio core to ensure it's done properly.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Changes from v1:
v1 was against the wrong tree; it didn't build against qemu.git

 hw/virtio.c | 2 ++
 1 file changed, 2 insertions(+)