Message ID | 20230918144206.560120-1-armbru@redhat.com |
---|---|
Headers | show |
Series | migration/rdma: Error handling fixes | expand |
On Mon, Sep 18, 2023 at 04:41:14PM +0200, Markus Armbruster wrote: > Oh dear, where to start. There's so much wrong, and in pretty obvious > ways. This code should never have passed review. I'm refraining from > saying more; see the commit messages instead. > > Issues remaining after this series include: > > * Terrible error messages > > * Some error message cascades remain > > * There is no written contract for QEMUFileHooks, and the > responsibility for reporting errors is unclear Even being removed.. because no one is really extending that.. https://lore.kernel.org/all/20230509120700.78359-1-quintela@redhat.com/#t > > * There seem to be no tests whatsoever I always see rdma as "odd fixes" stage.. for a long time. But maybe I was wrong. Copying Zhijian for status of rdma; Zhijian, I saw that you just replied to the hwpoison issue. Maybe we should have one entry for rdma too, just like colo? Thanks,
On Tue, Sep 19, 2023 at 12:49:46PM -0400, Peter Xu wrote: > On Mon, Sep 18, 2023 at 04:41:14PM +0200, Markus Armbruster wrote: > > Oh dear, where to start. There's so much wrong, and in pretty obvious > > ways. This code should never have passed review. I'm refraining from > > saying more; see the commit messages instead. > > > > Issues remaining after this series include: > > > > * Terrible error messages > > > > * Some error message cascades remain > > > > * There is no written contract for QEMUFileHooks, and the > > responsibility for reporting errors is unclear > > Even being removed.. because no one is really extending that.. > > https://lore.kernel.org/all/20230509120700.78359-1-quintela@redhat.com/#t One day (in another 5-10 years) I still hope we'll get to the point where QEMUFile itself is obsolete :-) Getting rid of QEMUFileHooks is a great step in that direction. Me finishing a old PoC to bring buffering to QIOChannel would be another big step. The data rate limiting would be the biggest missing piece to enable migration/vmstate logic to directly consume a QIOChannel. Eliminating QEMUFile would help to bring Error **errp to all the vmstate codepaths. > > * There seem to be no tests whatsoever > > I always see rdma as "odd fixes" stage.. for a long time. But maybe I was > wrong. In the MAINTAINERS file RDMA still get classified as formally supported under the migration maintainers. I'm not convinced that is an accurate description of its status. I tend to agree with you that it is 'odd fixes' at the very best. Dave Gilbert had previously speculated about whether we should even consider deprecating it on the basis that latest non-RDMA migration is too much better than in the past, with multifd and zerocopy, that RDMA might not even offer a significant enough peformance win to justify. > Copying Zhijian for status of rdma; Zhijian, I saw that you just replied to > the hwpoison issue. Maybe we should have one entry for rdma too, just like > colo? With regards, Daniel
Daniel P. Berrangé <berrange@redhat.com> writes: > On Tue, Sep 19, 2023 at 12:49:46PM -0400, Peter Xu wrote: >> On Mon, Sep 18, 2023 at 04:41:14PM +0200, Markus Armbruster wrote: >> > Oh dear, where to start. There's so much wrong, and in pretty obvious >> > ways. This code should never have passed review. I'm refraining from >> > saying more; see the commit messages instead. >> > >> > Issues remaining after this series include: >> > >> > * Terrible error messages >> > >> > * Some error message cascades remain >> > >> > * There is no written contract for QEMUFileHooks, and the >> > responsibility for reporting errors is unclear >> >> Even being removed.. because no one is really extending that.. >> >> https://lore.kernel.org/all/20230509120700.78359-1-quintela@redhat.com/#t > > One day (in another 5-10 years) I still hope we'll get to > the point where QEMUFile itself is obsolete :-) Getting > rid of QEMUFileHooks is a great step in that direction. > Me finishing a old PoC to bring buffering to QIOChannel > would be another big step. > If you need any help with that let me know. I've been tripping over QEMUFile weirdness on a daily basis. Just last week I was looking into restricting the usage of qemu_file_set_error() to qemu-file.c so we can get rid of this situation where any piece of code that has a pointer to the QEMUFile can put whatever it wants in f->last_error* and the rest of the code has to guess when to call qemu_file_get_error(). *last_error actually stores the first error Moving all the interesting parts into the channel and removing QEMUFile would of course be the better solution. Multifd already ignores it completly, so there's probably more code that could be made generic after that change. Also, looking at what people do with iovs in the block layer, it seems the migration code is a little behind. > The data rate limiting would be the biggest missing piece > to enable migration/vmstate logic to directly consume > a QIOChannel. > > Eliminating QEMUFile would help to bring Error **errp > to all the vmstate codepaths. > >> > * There seem to be no tests whatsoever >> >> I always see rdma as "odd fixes" stage.. for a long time. But maybe I was >> wrong. > > In the MAINTAINERS file RDMA still get classified as formally > supported under the migration maintainers. I'm not convinced > that is an accurate description of its status. I tend to agree > with you that it is 'odd fixes' at the very best. > > Dave Gilbert had previously speculated about whether we should > even consider deprecating it on the basis that latest non-RDMA > migration is too much better than in the past, with multifd > and zerocopy, that RDMA might not even offer a significant > enough peformance win to justify. > >> Copying Zhijian for status of rdma; Zhijian, I saw that you just replied to >> the hwpoison issue. Maybe we should have one entry for rdma too, just like >> colo? > > With regards, > Daniel
Daniel P. Berrangé <berrange@redhat.com> writes: > On Tue, Sep 19, 2023 at 12:49:46PM -0400, Peter Xu wrote: >> On Mon, Sep 18, 2023 at 04:41:14PM +0200, Markus Armbruster wrote: >> > Oh dear, where to start. There's so much wrong, and in pretty obvious >> > ways. This code should never have passed review. I'm refraining from >> > saying more; see the commit messages instead. >> > >> > Issues remaining after this series include: >> > >> > * Terrible error messages >> > >> > * Some error message cascades remain >> > >> > * There is no written contract for QEMUFileHooks, and the >> > responsibility for reporting errors is unclear >> >> Even being removed.. because no one is really extending that.. >> >> https://lore.kernel.org/all/20230509120700.78359-1-quintela@redhat.com/#t > > One day (in another 5-10 years) I still hope we'll get to > the point where QEMUFile itself is obsolete :-) Getting > rid of QEMUFileHooks is a great step in that direction. > Me finishing a old PoC to bring buffering to QIOChannel > would be another big step. > > The data rate limiting would be the biggest missing piece > to enable migration/vmstate logic to directly consume > a QIOChannel. > > Eliminating QEMUFile would help to bring Error **errp > to all the vmstate codepaths. Sounds like improvement to me. >> > * There seem to be no tests whatsoever >> >> I always see rdma as "odd fixes" stage.. for a long time. But maybe I was >> wrong. To be honest, it doesn't look or smell maintained to me. More like thrown over the fence and left to rot. Given the shape it is in, I wouldn't let friends use it in production. > In the MAINTAINERS file RDMA still get classified as formally > supported under the migration maintainers. I'm not convinced > that is an accurate description of its status. I tend to agree > with you that it is 'odd fixes' at the very best. Let's fix MAINTAINERS not to raise unrealistic expectations. > Dave Gilbert had previously speculated about whether we should > even consider deprecating it on the basis that latest non-RDMA > migration is too much better than in the past, with multifd > and zerocopy, that RDMA might not even offer a significant > enough peformance win to justify. I provided approximately 52 additional arguments for deprecating it :) >> Copying Zhijian for status of rdma; Zhijian, I saw that you just replied to >> the hwpoison issue. Maybe we should have one entry for rdma too, just like >> colo? > > With regards, > Daniel
Perter, On 20/09/2023 00:49, Peter Xu wrote: > On Mon, Sep 18, 2023 at 04:41:14PM +0200, Markus Armbruster wrote: >> Oh dear, where to start. There's so much wrong, and in pretty obvious >> ways. This code should never have passed review. I'm refraining from >> saying more; see the commit messages instead. >> >> Issues remaining after this series include: >> >> * Terrible error messages >> >> * Some error message cascades remain >> >> * There is no written contract for QEMUFileHooks, and the >> responsibility for reporting errors is unclear > > Even being removed.. because no one is really extending that.. > > https://lore.kernel.org/all/20230509120700.78359-1-quintela@redhat.com/#t > >> >> * There seem to be no tests whatsoever > > I always see rdma as "odd fixes" stage.. for a long time. But maybe I was > wrong. > > Copying Zhijian for status of rdma; Thanks, Yeah, sometimes I will pay attention to migration, especially patches related to RDMA and COLO. I just knew i have missed so much patches to RDMA, most of them had got RVB, but dropped at PULL phase at last. What a pity. Zhijian, I saw that you just replied to > the hwpoison issue. Maybe we should have one entry for rdma too, just like > colo? I'm worried that I may not have enough time, ability, or environment to review/test the RDMA patches. but for this patch set, i will take a look later. Thanks Zhijian > > Thanks, >
On Thu, Sep 21, 2023 at 08:27:24AM +0000, Zhijian Li (Fujitsu) wrote: > I'm worried that I may not have enough time, ability, or environment to review/test > the RDMA patches. but for this patch set, i will take a look later. That'll be helpful, thanks! So it seems maybe at least we should have an entry for rdma migration to reflect the state of the code there. AFAIU we don't strictly need a maintainer for the entries; an empty entry should suffice, which I can prepare a patch for it: RDMA Migration S: Odd Fixes F: migration/rdma* Zhijian, if you still want to get emails when someone changes the code, maybe you still wanted be listed as a reviewer (even if not a maintainer)? So that you don't necessarily need to review every time, but just in case you still want to get notified whenever someone changes it, that means one line added onto above: R: Li Zhijian <lizhijian@fujitsu.com> Do you prefer me to add that R: for you when I send the maintainer file update? I'm curious whether Fujitsu is using this code in production, if so it'll be great if you can be supported as, perhaps part of, your job to maintain the rdma code. But maybe that's not the case. In all cases, thanks a lot for the helps already.
On 22/09/2023 23:21, Peter Xu wrote: > On Thu, Sep 21, 2023 at 08:27:24AM +0000, Zhijian Li (Fujitsu) wrote: >> I'm worried that I may not have enough time, ability, or environment to review/test >> the RDMA patches. but for this patch set, i will take a look later. > > That'll be helpful, thanks! > > So it seems maybe at least we should have an entry for rdma migration to > reflect the state of the code there. AFAIU we don't strictly need a > maintainer for the entries; an empty entry should suffice, which I can > prepare a patch for it: > > RDMA Migration > S: Odd Fixes > F: migration/rdma* > > Zhijian, if you still want to get emails when someone changes the code, > maybe you still wanted be listed as a reviewer (even if not a maintainer)? > So that you don't necessarily need to review every time, but just in case > you still want to get notified whenever someone changes it, that means one > line added onto above: > > R: Li Zhijian <lizhijian@fujitsu.com> > > Do you prefer me to add that R: for you when I send the maintainer file > update? Yes, thanks. my pleasure :) > > I'm curious whether Fujitsu is using this code in production, if so it'll > be great if you can be supported as, This depends on Fujitsu's customer requirements, I don't really know either :) Thanks Zhijian perhaps part of, your job to maintain > the rdma code. But maybe that's not the case.> > In all cases, thanks a lot for the helps already. >
Thank you for reviewing! Especially Li Zhijian, who went through pretty much all the patches. Big chunk of work, much appreciated. Will send v2 shortly.
Daniel P. Berrangé <berrange@redhat.com> wrote: > On Tue, Sep 19, 2023 at 12:49:46PM -0400, Peter Xu wrote: >> On Mon, Sep 18, 2023 at 04:41:14PM +0200, Markus Armbruster wrote: >> > Oh dear, where to start. There's so much wrong, and in pretty obvious >> > ways. This code should never have passed review. I'm refraining from >> > saying more; see the commit messages instead. >> > >> > Issues remaining after this series include: >> > >> > * Terrible error messages >> > >> > * Some error message cascades remain >> > >> > * There is no written contract for QEMUFileHooks, and the >> > responsibility for reporting errors is unclear >> >> Even being removed.. because no one is really extending that.. >> >> https://lore.kernel.org/all/20230509120700.78359-1-quintela@redhat.com/#t > > One day (in another 5-10 years) I still hope we'll get to > the point where QEMUFile itself is obsolete :-) If you see the patches on list, I have move rate limit check outside of QEMUFile, so right now it is only a buffer to write in the main migration thread. > Getting > rid of QEMUFileHooks is a great step in that direction. Thanks. > Me finishing a old PoC to bring buffering to QIOChannel > would be another big step. I want to get rid of qemu_file_set_error() and friends first. We should handle errors correctly when they happen, and go from there. > The data rate limiting would be the biggest missing piece > to enable migration/vmstate logic to directly consume > a QIOChannel. As said, that is done. I have three atomic counters: - qemu_file_bytes - rdma_bytes - multifd_bytes We do the rate limiting adding that 3 counters. The only thing that QEMUFile does know is increase qemu_file_bytes when it needs to do it. > Eliminating QEMUFile would help to bring Error **errp > to all the vmstate codepaths. Yes! See the last problem on the list where they couldn't use migrate_set_error() in vmstate.c because that breaks test-vmstate.c. >> I always see rdma as "odd fixes" stage.. for a long time. But maybe I was >> wrong. > > In the MAINTAINERS file RDMA still get classified as formally > supported under the migration maintainers. I'm not convinced > that is an accurate description of its status. I tend to agree > with you that it is 'odd fixes' at the very best. It is not exact, we wanted to get rid of it. > Dave Gilbert had previously speculated about whether we should > even consider deprecating it on the basis that latest non-RDMA > migration is too much better than in the past, with multifd > and zerocopy, that RDMA might not even offer a significant > enough peformance win to justify. The main problem with RDMA is that it uses a really weird model for migration point of view: - everything there is asynchronous (nothing else is like that) - it uses its own everything, i.e. abuses QEMUFile and QIOChannel to try to look like one, but it fails. - Its error handling is "peculiar", to be friendly. See this series from Markus to make it look normal. Thanks, Juan.
On Wed, Oct 04, 2023 at 08:00:34PM +0200, Juan Quintela wrote: > Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Tue, Sep 19, 2023 at 12:49:46PM -0400, Peter Xu wrote: > >> On Mon, Sep 18, 2023 at 04:41:14PM +0200, Markus Armbruster wrote: > >> > Oh dear, where to start. There's so much wrong, and in pretty obvious > >> > ways. This code should never have passed review. I'm refraining from > >> > saying more; see the commit messages instead. > >> > > >> > Issues remaining after this series include: > >> > > >> > * Terrible error messages > >> > > >> > * Some error message cascades remain > >> > > >> > * There is no written contract for QEMUFileHooks, and the > >> > responsibility for reporting errors is unclear > >> > >> Even being removed.. because no one is really extending that.. > >> > >> https://lore.kernel.org/all/20230509120700.78359-1-quintela@redhat.com/#t > > > > One day (in another 5-10 years) I still hope we'll get to > > the point where QEMUFile itself is obsolete :-) > > If you see the patches on list, I have move rate limit check outside of > QEMUFile, so right now it is only a buffer to write in the main > migration thread. Can you point me to that patch(es) as I'm not identifying them yet. With regards, Daniel
Daniel P. Berrangé <berrange@redhat.com> wrote: > On Wed, Oct 04, 2023 at 08:00:34PM +0200, Juan Quintela wrote: >> Daniel P. Berrangé <berrange@redhat.com> wrote: >> > On Tue, Sep 19, 2023 at 12:49:46PM -0400, Peter Xu wrote: >> >> On Mon, Sep 18, 2023 at 04:41:14PM +0200, Markus Armbruster wrote: >> >> > Oh dear, where to start. There's so much wrong, and in pretty obvious >> >> > ways. This code should never have passed review. I'm refraining from >> >> > saying more; see the commit messages instead. >> >> > >> >> > Issues remaining after this series include: >> >> > >> >> > * Terrible error messages >> >> > >> >> > * Some error message cascades remain >> >> > >> >> > * There is no written contract for QEMUFileHooks, and the >> >> > responsibility for reporting errors is unclear >> >> >> >> Even being removed.. because no one is really extending that.. >> >> >> >> https://lore.kernel.org/all/20230509120700.78359-1-quintela@redhat.com/#t >> > >> > One day (in another 5-10 years) I still hope we'll get to >> > the point where QEMUFile itself is obsolete :-) >> >> If you see the patches on list, I have move rate limit check outside of >> QEMUFile, so right now it is only a buffer to write in the main >> migration thread. > > Can you point me to that patch(es) as I'm not identifying > them yet. Yet another set of counters. They are on today PULL request. Later, Juan.