Message ID:   20090814154308.26116.46980.stgit@dev.haskins.net
State:        Not Applicable, archived
Delegated to: David Miller
* Gregory Haskins <ghaskins@novell.com> wrote:

> This will generally be used for hypervisors to publish any host-side
> virtual devices up to a guest.  The guest will have the opportunity
> to consume any devices present on the vbus-proxy as if they were
> platform devices, similar to existing buses like PCI.
>
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
>
>  MAINTAINERS                 |    6 ++
>  arch/x86/Kconfig            |    2 +
>  drivers/Makefile            |    1
>  drivers/vbus/Kconfig        |   14 ++++
>  drivers/vbus/Makefile       |    3 +
>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>  7 files changed, 251 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vbus/Kconfig
>  create mode 100644 drivers/vbus/Makefile
>  create mode 100644 drivers/vbus/bus-proxy.c
>  create mode 100644 include/linux/vbus_driver.h

Is there a consensus on this with the KVM folks? (i've added the KVM
list to the Cc:)

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ingo Molnar wrote:
> * Gregory Haskins <ghaskins@novell.com> wrote:
>
>> This will generally be used for hypervisors to publish any host-side
>> virtual devices up to a guest.  The guest will have the opportunity
>> to consume any devices present on the vbus-proxy as if they were
>> platform devices, similar to existing buses like PCI.
>>
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> ---
>>
>>  MAINTAINERS                 |    6 ++
>>  arch/x86/Kconfig            |    2 +
>>  drivers/Makefile            |    1
>>  drivers/vbus/Kconfig        |   14 ++++
>>  drivers/vbus/Makefile       |    3 +
>>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>>  7 files changed, 251 insertions(+), 0 deletions(-)
>>  create mode 100644 drivers/vbus/Kconfig
>>  create mode 100644 drivers/vbus/Makefile
>>  create mode 100644 drivers/vbus/bus-proxy.c
>>  create mode 100644 include/linux/vbus_driver.h
>
> Is there a consensus on this with the KVM folks? (i've added the KVM
> list to the Cc:)

I'll let Avi comment about it from a KVM perspective but from a QEMU
perspective, I don't think we want to support two paravirtual IO
frameworks.  I'd like to see them converge.  Since there's an install
base of guests today with virtio drivers, there really ought to be a
compelling reason to change the virtio ABI in a non-backwards
compatible way.  This means convergence really ought to be adding
features to virtio.

On paper, I don't think vbus really has any features over virtio.
vbus does things in different ways (paravirtual bus vs. pci for
discovery) but I think we're happy with how virtio does things today.

I think the reason vbus gets better performance for networking today
is that vbus' backends are in the kernel while virtio's backends are
currently in userspace.  Since Michael has a functioning in-kernel
backend for virtio-net now, I suspect we're weeks (maybe days) away
from performance results.
My expectation is that vhost + virtio-net will be as good as venet +
vbus.  If that's the case, then I don't see any reason to adopt vbus
unless Greg thinks there are other compelling features over virtio.

Regards,

Anthony Liguori
* Anthony Liguori <anthony@codemonkey.ws> wrote:

> Ingo Molnar wrote:
>> * Gregory Haskins <ghaskins@novell.com> wrote:
>>
>>> This will generally be used for hypervisors to publish any host-side
>>> virtual devices up to a guest.  The guest will have the opportunity
>>> to consume any devices present on the vbus-proxy as if they were
>>> platform devices, similar to existing buses like PCI.
>>>
>>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>>> ---
>>>
>>>  MAINTAINERS                 |    6 ++
>>>  arch/x86/Kconfig            |    2 +
>>>  drivers/Makefile            |    1
>>>  drivers/vbus/Kconfig        |   14 ++++
>>>  drivers/vbus/Makefile       |    3 +
>>>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>>>  7 files changed, 251 insertions(+), 0 deletions(-)
>>>  create mode 100644 drivers/vbus/Kconfig
>>>  create mode 100644 drivers/vbus/Makefile
>>>  create mode 100644 drivers/vbus/bus-proxy.c
>>>  create mode 100644 include/linux/vbus_driver.h
>>
>> Is there a consensus on this with the KVM folks? (i've added the KVM
>> list to the Cc:)
>
> I'll let Avi comment about it from a KVM perspective but from a QEMU
> perspective, I don't think we want to support two paravirtual IO
> frameworks.  I'd like to see them converge.  Since there's an
> install base of guests today with virtio drivers, there really
> ought to be a compelling reason to change the virtio ABI in a
> non-backwards compatible way.  This means convergence really ought
> to be adding features to virtio.

I agree.  While different paravirt drivers are inevitable for things
that are externally constrained (say support different hypervisors),
doing different _Linux internal_ paravirt drivers looks plain stupid
and counter-productive.  It splits testing and development.

So either the vbus code replaces virtio (for technical merits such as
performance and other details), or virtio is enhanced with the vbus
performance enhancements.
> On paper, I don't think vbus really has any features over virtio.
> vbus does things in different ways (paravirtual bus vs. pci for
> discovery) but I think we're happy with how virtio does things
> today.
>
> I think the reason vbus gets better performance for networking
> today is that vbus' backends are in the kernel while virtio's
> backends are currently in userspace.  Since Michael has a
> functioning in-kernel backend for virtio-net now, I suspect we're
> weeks (maybe days) away from performance results.  My expectation
> is that vhost + virtio-net will be as good as venet + vbus.  If
> that's the case, then I don't see any reason to adopt vbus unless
> Greg thinks there are other compelling features over virtio.

Keeping virtio's backend in user-space was rather stupid IMHO.
Having the _option_ to piggyback to user-space (for flexibility,
extensibility, etc.) is OK, but not having kernel acceleration is
bad.

	Ingo
On 08/15/2009 01:32 PM, Ingo Molnar wrote:
>> This will generally be used for hypervisors to publish any host-side
>> virtual devices up to a guest.  The guest will have the opportunity
>> to consume any devices present on the vbus-proxy as if they were
>> platform devices, similar to existing buses like PCI.
>
> Is there a consensus on this with the KVM folks? (i've added the KVM
> list to the Cc:)

My opinion is that this is a duplication of effort and we'd be better
off if everyone contributed to enhancing virtio, which already has
widely deployed guest drivers and non-Linux guest support.

It may have merit if it is proven that it is technically superior to
virtio (and I don't mean some benchmark in some point in time; I mean
design wise).  So far I haven't seen any indications that it is.
Ingo Molnar wrote:
> * Gregory Haskins <ghaskins@novell.com> wrote:
>
>> This will generally be used for hypervisors to publish any host-side
>> virtual devices up to a guest.  The guest will have the opportunity
>> to consume any devices present on the vbus-proxy as if they were
>> platform devices, similar to existing buses like PCI.
>>
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> ---
>>
>>  MAINTAINERS                 |    6 ++
>>  arch/x86/Kconfig            |    2 +
>>  drivers/Makefile            |    1
>>  drivers/vbus/Kconfig        |   14 ++++
>>  drivers/vbus/Makefile       |    3 +
>>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>>  7 files changed, 251 insertions(+), 0 deletions(-)
>>  create mode 100644 drivers/vbus/Kconfig
>>  create mode 100644 drivers/vbus/Makefile
>>  create mode 100644 drivers/vbus/bus-proxy.c
>>  create mode 100644 include/linux/vbus_driver.h
>
> Is there a consensus on this with the KVM folks? (i've added the KVM
> list to the Cc:)

Hi Ingo,

Avi can correct me if I am wrong, but the agreement that he and I
came to a few months ago was something to the effect of:

  kvm will be neutral towards various external IO subsystems, and
  instead provide various hooks (see irqfd, ioeventfd) to permit
  these IO subsystems to interface with kvm.

AlacrityVM is one of the first projects to take advantage of that
interface.  AlacrityVM is kvm-core + vbus-core + vbus-kvm-connector +
vbus-enhanced qemu + guest drivers.  This thread is part of the
guest-drivers portion.  Note that it is specific to alacrityvm, not
kvm, which is why the kvm list was not included in the conversation
(also an agreement with Avi: http://lkml.org/lkml/2009/8/6/231).

Kind Regards,
-Greg
Ingo Molnar wrote:
>> I think the reason vbus gets better performance for networking
>> today is that vbus' backends are in the kernel while virtio's
>> backends are currently in userspace.  Since Michael has a
>> functioning in-kernel backend for virtio-net now, I suspect we're
>> weeks (maybe days) away from performance results.  My expectation
>> is that vhost + virtio-net will be as good as venet + vbus.  If
>> that's the case, then I don't see any reason to adopt vbus unless
>> Greg thinks there are other compelling features over virtio.
>
> Keeping virtio's backend in user-space was rather stupid IMHO.

I don't think it's quite so clear.

There's nothing about vhost_net that would prevent a userspace
application from using it as a higher performance replacement for
tun/tap.  The fact that we can avoid userspace for most of the fast
paths is nice but that's really an issue of vhost_net vs. tun/tap.

From the kernel's perspective, a KVM guest is just a userspace
process.  Having new userspace interfaces that are only useful to KVM
guests would be a bad thing.

Regards,

Anthony Liguori
Anthony Liguori wrote:
> Ingo Molnar wrote:
>> * Gregory Haskins <ghaskins@novell.com> wrote:
>>
>>> This will generally be used for hypervisors to publish any host-side
>>> virtual devices up to a guest.  The guest will have the opportunity
>>> to consume any devices present on the vbus-proxy as if they were
>>> platform devices, similar to existing buses like PCI.
>>>
>>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>>> ---
>>>
>>>  MAINTAINERS                 |    6 ++
>>>  arch/x86/Kconfig            |    2 +
>>>  drivers/Makefile            |    1
>>>  drivers/vbus/Kconfig        |   14 ++++
>>>  drivers/vbus/Makefile       |    3 +
>>>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>>>  7 files changed, 251 insertions(+), 0 deletions(-)
>>>  create mode 100644 drivers/vbus/Kconfig
>>>  create mode 100644 drivers/vbus/Makefile
>>>  create mode 100644 drivers/vbus/bus-proxy.c
>>>  create mode 100644 include/linux/vbus_driver.h
>>
>> Is there a consensus on this with the KVM folks? (i've added the KVM
>> list to the Cc:)
>
> I'll let Avi comment about it from a KVM perspective but from a QEMU
> perspective, I don't think we want to support two paravirtual IO
> frameworks.  I'd like to see them converge.  Since there's an
> install base of guests today with virtio drivers, there really ought
> to be a compelling reason to change the virtio ABI in a
> non-backwards compatible way.

Note: No one has ever proposed to change the virtio-ABI.  In fact,
this thread in question doesn't even touch virtio, and even the
patches that I have previously posted to add virtio-capability do it
in a backwards compatible way.

Case in point: Take an upstream kernel and you can modprobe the
vbus-pcibridge in, and virtio devices will work over that transport
unmodified.  See http://lkml.org/lkml/2009/8/6/244 for details.

Note that I have tentatively dropped the virtio-vbus patch from the
queue due to lack of interest, but I can resurrect it if need be.
> This means convergence really ought to be adding features to virtio.

virtio is a device model.  vbus is a bus model and a host backend
facility.  Adding features to virtio would be orthogonal to some kind
of convergence goal.  virtio can run unmodified or add new features
within its own namespace independent of vbus, as it pleases.  vbus
will simply transport those changes.

> On paper, I don't think vbus really has any features over virtio.

Again, do not confuse vbus with virtio.  They are different layers of
the stack.

> vbus does things in different ways (paravirtual bus vs. pci for
> discovery) but I think we're happy with how virtio does things
> today.

That's fine.  KVM can stick with virtio-pci if it wants.  AlacrityVM
will support virtio-pci and vbus (with possible convergence with
virtio-vbus).  If at some point KVM thinks vbus is interesting, I
will gladly work with getting it integrated into upstream KVM as
well.  Until then, they can happily coexist without issue between the
two projects.

> I think the reason vbus gets better performance for networking
> today is that vbus' backends are in the kernel while virtio's
> backends are currently in userspace.

Well, with all due respect, you also said initially when I announced
vbus that in-kernel doesn't matter, and tried to make virtio-net run
as fast as venet from userspace ;)  Given that we never saw those
userspace patches from you that in fact equaled my performance, I
assume you were wrong about that statement.  Perhaps you were wrong
about other things too?

> Since Michael has a functioning in-kernel backend for virtio-net
> now, I suspect we're weeks (maybe days) away from performance
> results.  My expectation is that vhost + virtio-net will be as good
> as venet + vbus.

This is not entirely impossible, at least for certain simple
benchmarks like singleton throughput and latency.  But if you think
that this somehow invalidates vbus as a concept, you have missed the
point entirely.
vbus is about creating flexible (e.g. cross hypervisor, and even
physical system or userspace application) in-kernel IO containers
with linux.  The "guest" interface represents what I believe to be
the ideal interface for ease of use, yet maximum performance for
software-to-software interaction.  This means very low latency and
high throughput for both synchronous and asynchronous IO, minimizing
enters/exits, reducing enter/exit cost, prioritization, parallel
computation, etc.  The things that we (the alacrityvm community) have
coming down the pipeline for high-performance virtualization require
that these issues be addressed.

venet was originally crafted just to validate the approach and test
the vbus interface.  It ended up being so much faster than virtio-net
that people in the vbus community started coding against its ABI.
Therefore, I decided to support it formally and indefinitely.  If I
can get consensus on virtio-vbus going forward, it will probably be
the last vbus-specific driver for which there is overlap with virtio
(e.g. virtio-block, virtio-console, etc).  Instead, you will only see
native vbus devices for non-native virtio type things, like real-time
and advanced fabric support.

OTOH, Michael's patch is purely targeted at improving virtio-net on
kvm, and it's likewise constrained by various limitations of that
decision (such as its reliance on the PCI model, and the kvm memory
scheme).  The tradeoff is that his approach will work in all existing
virtio-net kvm guests, and is probably significantly less code since
he can re-use the qemu PCI bus model.

Conversely, I am not afraid of requiring a new driver to optimize the
general PV interface.  In the long term, this will reduce the amount
of reimplementing the same code over and over, reduce system
overhead, and it adds new features not previously available (for
instance, coalescing and prioritizing interrupts).
> If that's the case, then I don't see any reason to adopt vbus
> unless Greg thinks there are other compelling features over virtio.

Aside from the fact that this is another confusion of the vbus/virtio
relationship... yes, of course there are compelling features (IMHO)
or I wouldn't be expending effort ;)  They are at least compelling
enough to put in AlacrityVM.  If upstream KVM doesn't want them,
that's KVM's decision and I am fine with that.  Simply never apply my
qemu patches to qemu-kvm.git, and KVM will be blissfully unaware of
whether vbus is present.  I do hope that I can convince the KVM
community otherwise, however. :)

Kind Regards,
-Greg
Avi Kivity wrote:
> On 08/15/2009 01:32 PM, Ingo Molnar wrote:
>>> This will generally be used for hypervisors to publish any host-side
>>> virtual devices up to a guest.  The guest will have the opportunity
>>> to consume any devices present on the vbus-proxy as if they were
>>> platform devices, similar to existing buses like PCI.
>>
>> Is there a consensus on this with the KVM folks? (i've added the KVM
>> list to the Cc:)
>
> My opinion is that this is a duplication of effort and we'd be
> better off if everyone contributed to enhancing virtio, which
> already has widely deployed guest drivers and non-Linux guest
> support.
>
> It may have merit if it is proven that it is technically superior
> to virtio (and I don't mean some benchmark in some point in time; I
> mean design wise).  So far I haven't seen any indications that it
> is.

The design is very different, so hopefully I can start to convince
you why it might be interesting.

Kind Regards,
-Greg
* Anthony Liguori <anthony@codemonkey.ws> wrote:

> Ingo Molnar wrote:
>>> I think the reason vbus gets better performance for networking
>>> today is that vbus' backends are in the kernel while virtio's
>>> backends are currently in userspace.  Since Michael has a
>>> functioning in-kernel backend for virtio-net now, I suspect we're
>>> weeks (maybe days) away from performance results.  My expectation
>>> is that vhost + virtio-net will be as good as venet + vbus.  If
>>> that's the case, then I don't see any reason to adopt vbus unless
>>> Greg thinks there are other compelling features over virtio.
>>
>> Keeping virtio's backend in user-space was rather stupid IMHO.
>
> I don't think it's quite so clear.

in such a narrow quote it's not so clear indeed - that's why i
qualified it with:

>> Having the _option_ to piggyback to user-space (for flexibility,
>> extensibility, etc.) is OK, but not having kernel acceleration is
>> bad.

	Ingo
* Gregory Haskins <gregory.haskins@gmail.com> wrote:

> Ingo Molnar wrote:
>> * Gregory Haskins <ghaskins@novell.com> wrote:
>>
>>> This will generally be used for hypervisors to publish any host-side
>>> virtual devices up to a guest.  The guest will have the opportunity
>>> to consume any devices present on the vbus-proxy as if they were
>>> platform devices, similar to existing buses like PCI.
>>>
>>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>>> ---
>>>
>>>  MAINTAINERS                 |    6 ++
>>>  arch/x86/Kconfig            |    2 +
>>>  drivers/Makefile            |    1
>>>  drivers/vbus/Kconfig        |   14 ++++
>>>  drivers/vbus/Makefile       |    3 +
>>>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>>>  7 files changed, 251 insertions(+), 0 deletions(-)
>>>  create mode 100644 drivers/vbus/Kconfig
>>>  create mode 100644 drivers/vbus/Makefile
>>>  create mode 100644 drivers/vbus/bus-proxy.c
>>>  create mode 100644 include/linux/vbus_driver.h
>>
>> Is there a consensus on this with the KVM folks? (i've added the KVM
>> list to the Cc:)
>
> Hi Ingo,
>
> Avi can correct me if I am wrong, but the agreement that he and I
> came to a few months ago was something to the effect of:
>
>   kvm will be neutral towards various external IO subsystems, and
>   instead provide various hooks (see irqfd, ioeventfd) to permit
>   these IO subsystems to interface with kvm.
>
> AlacrityVM is one of the first projects to take advantage of that
> interface.  AlacrityVM is kvm-core + vbus-core + vbus-kvm-connector
> + vbus-enhanced qemu + guest drivers.  This thread is part of the
> guest-drivers portion.  Note that it is specific to alacrityvm, not
> kvm, which is why the kvm list was not included in the conversation
> (also an agreement with Avi: http://lkml.org/lkml/2009/8/6/231).
Well my own opinion is that the fracturing of the Linux internal
driver space into diverging pieces of duplicate functionality
(absent compelling technical reasons) is harmful.

	Ingo
On 08/17/2009 05:14 PM, Gregory Haskins wrote:
> Note: No one has ever proposed to change the virtio-ABI.  In fact,
> this thread in question doesn't even touch virtio, and even the
> patches that I have previously posted to add virtio-capability do
> it in a backwards compatible way.

Your patches include venet, which is a direct competitor to
virtio-net, so it splits the development effort.

> Case in point: Take an upstream kernel and you can modprobe the
> vbus-pcibridge in, and virtio devices will work over that transport
> unmodified.

Older kernels don't support it, and Windows doesn't support it.

>> vbus does things in different ways (paravirtual bus vs. pci for
>> discovery) but I think we're happy with how virtio does things
>> today.
>
> That's fine.  KVM can stick with virtio-pci if it wants.  AlacrityVM
> will support virtio-pci and vbus (with possible convergence with
> virtio-vbus).  If at some point KVM thinks vbus is interesting, I
> will gladly work with getting it integrated into upstream KVM as
> well.  Until then, they can happily coexist without issue between
> the two projects.

If vbus is to go upstream, it must go via the same path other drivers
go.  Virtio wasn't merged via the kvm tree and virtio-host won't be
either.

I don't have any technical objections to vbus/venet (I had some in
the past re interrupts but I believe you've addressed them), and it
appears to perform very well.  However I still think we should
address virtio's shortcomings (as Michael is doing) rather than
create a competitor.  We have enough external competition, we don't
need in-tree competitors.

>> I think the reason vbus gets better performance for networking
>> today is that vbus' backends are in the kernel while virtio's
>> backends are currently in userspace.
> Well, with all due respect, you also said initially when I announced
> vbus that in-kernel doesn't matter, and tried to make virtio-net run
> as fast as venet from userspace ;)  Given that we never saw those
> userspace patches from you that in fact equaled my performance, I
> assume you were wrong about that statement.

I too thought that if we'd improved the userspace interfaces we'd get
fast networking without pushing virtio details into the kernel,
benefiting not just kvm but the Linux community at large.  This might
still be correct but in fact no one turned up with the patches.
Maybe they're impossible to write, hard to write, or uninteresting to
write for those who are capable of writing them.  As it is, we've
given up and Michael wrote vhost.

> Perhaps you were wrong about other things too?

I'm pretty sure Anthony doesn't possess a Diploma of Perpetual
Omniscience.

>> Since Michael has a functioning in-kernel backend for virtio-net
>> now, I suspect we're weeks (maybe days) away from performance
>> results.  My expectation is that vhost + virtio-net will be as
>> good as venet + vbus.
>
> This is not entirely impossible, at least for certain simple
> benchmarks like singleton throughput and latency.

What about more complex benchmarks?  Do you think vbus+venet has an
advantage there?

> But if you think that this somehow invalidates vbus as a concept,
> you have missed the point entirely.
>
> vbus is about creating flexible (e.g. cross hypervisor, and even
> physical system or userspace application) in-kernel IO containers
> with linux.  The "guest" interface represents what I believe to be
> the ideal interface for ease of use, yet maximum performance for
> software-to-software interaction.

Maybe.  But layering venet or vblock on top of it makes it specific
to hypervisors.  The venet/vblock ABIs are not very interesting for
user-to-user (and anyway, they could use virtio just as well).
> venet was originally crafted just to validate the approach and test
> the vbus interface.  It ended up being so much faster than
> virtio-net that people in the vbus community started coding against
> its ABI.

It ended up being much faster than qemu's host implementation, not
the virtio ABI.  When asked, you've indicated that you don't see any
deficiencies in the virtio protocol.

> OTOH, Michael's patch is purely targeted at improving virtio-net on
> kvm, and it's likewise constrained by various limitations of that
> decision (such as its reliance on the PCI model, and the kvm memory
> scheme).  The tradeoff is that his approach will work in all
> existing virtio-net kvm guests, and is probably significantly less
> code since he can re-use the qemu PCI bus model.

virtio does not depend on PCI, and virtio-host does not either.

> Conversely, I am not afraid of requiring a new driver to optimize
> the general PV interface.  In the long term, this will reduce the
> amount of reimplementing the same code over and over, reduce system
> overhead, and it adds new features not previously available (for
> instance, coalescing and prioritizing interrupts).

If it were proven to me that a new driver is needed, I'd switch too.
So far no proof has materialized.

>> If that's the case, then I don't see any reason to adopt vbus
>> unless Greg thinks there are other compelling features over
>> virtio.
>
> Aside from the fact that this is another confusion of the
> vbus/virtio relationship... yes, of course there are compelling
> features (IMHO) or I wouldn't be expending effort ;)  They are at
> least compelling enough to put in AlacrityVM.  If upstream KVM
> doesn't want them, that's KVM's decision and I am fine with that.
> Simply never apply my qemu patches to qemu-kvm.git, and KVM will be
> blissfully unaware of whether vbus is present.  I do hope that I
> can convince the KVM community otherwise, however. :)

If the vbus patches make it into the kernel I see no reason not to
support them in qemu.
qemu supports dozens if not hundreds of devices; one more wouldn't
matter.  But there's a lot of work before that can happen; for
example, you must support save/restore/migrate for vbus to be
mergeable.
On 08/17/2009 05:16 PM, Gregory Haskins wrote:
>> My opinion is that this is a duplication of effort and we'd be
>> better off if everyone contributed to enhancing virtio, which
>> already has widely deployed guest drivers and non-Linux guest
>> support.
>>
>> It may have merit if it is proven that it is technically superior
>> to virtio (and I don't mean some benchmark in some point in time;
>> I mean design wise).  So far I haven't seen any indications that
>> it is.
>
> The design is very different, so hopefully I can start to convince
> you why it might be interesting.

We've been through this before, I believe.  If you can point out
specific differences that make venet outperform virtio-net, I'll be
glad to hear (and steal) them, though.
Ingo Molnar wrote:
> * Gregory Haskins <gregory.haskins@gmail.com> wrote:
>
>> Ingo Molnar wrote:
>>> * Gregory Haskins <ghaskins@novell.com> wrote:
>>>
>>>> This will generally be used for hypervisors to publish any host-side
>>>> virtual devices up to a guest.  The guest will have the opportunity
>>>> to consume any devices present on the vbus-proxy as if they were
>>>> platform devices, similar to existing buses like PCI.
>>>>
>>>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>>>> ---
>>>>
>>>>  MAINTAINERS                 |    6 ++
>>>>  arch/x86/Kconfig            |    2 +
>>>>  drivers/Makefile            |    1
>>>>  drivers/vbus/Kconfig        |   14 ++++
>>>>  drivers/vbus/Makefile       |    3 +
>>>>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>>>>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>>>>  7 files changed, 251 insertions(+), 0 deletions(-)
>>>>  create mode 100644 drivers/vbus/Kconfig
>>>>  create mode 100644 drivers/vbus/Makefile
>>>>  create mode 100644 drivers/vbus/bus-proxy.c
>>>>  create mode 100644 include/linux/vbus_driver.h
>>>
>>> Is there a consensus on this with the KVM folks? (i've added the KVM
>>> list to the Cc:)
>>
>> Hi Ingo,
>>
>> Avi can correct me if I am wrong, but the agreement that he and I
>> came to a few months ago was something to the effect of:
>>
>>   kvm will be neutral towards various external IO subsystems, and
>>   instead provide various hooks (see irqfd, ioeventfd) to permit
>>   these IO subsystems to interface with kvm.
>>
>> AlacrityVM is one of the first projects to take advantage of that
>> interface.  AlacrityVM is kvm-core + vbus-core + vbus-kvm-connector
>> + vbus-enhanced qemu + guest drivers.  This thread is part of the
>> guest-drivers portion.  Note that it is specific to alacrityvm, not
>> kvm, which is why the kvm list was not included in the conversation
>> (also an agreement with Avi: http://lkml.org/lkml/2009/8/6/231).
> Well my own opinion is that the fracturing of the Linux internal
> driver space into diverging pieces of duplicate functionality
> (absent compelling technical reasons) is harmful.

[Adding Michael Tsirkin]

Hi Ingo,

1) First off, let me state that I have made every effort to propose
this as a solution to integrate with KVM, the most recent of which is
from April:

   http://lkml.org/lkml/2009/4/21/408

If you read through the various vbus related threads on LKML/KVM
posted this year, I think you will see that I made numerous polite
offers to work with people on finding a common solution here,
including Michael.  In the end, Michael decided to go a different
route, using some of the ideas proposed in vbus + venet-tap to create
vhost-net.  This is fine, and I respect his decision.  But do not try
to pin "fracturing" on me, because I tried everything to avoid it. :)

Since I still disagree with the fundamental approach of how KVM IO
works, I am continuing my effort in the downstream project
"AlacrityVM", which will hopefully serve to build a better
understanding of what it is I am doing with the vbus technology, and
a point to maintain the subsystem.

2) There *are* technical reasons for this change (and IMHO, they are
compelling), many of which have already been previously discussed
(including my last reply to Anthony), so I won't rehash them here.

3) Even if there really is some duplication here, I disagree with you
that it is somehow harmful to the Linux community per se.  Case in
point, look at the graphs posted on the AlacrityVM wiki:

   http://developer.novell.com/wiki/index.php/AlacrityVM

Prior to my effort, KVM was humming along at the status quo, and I
came along with a closer eye and almost doubled the throughput and
cut latency by 78%.  Given an apparent disagreement with aspects of
my approach, Michael went off and created a counter example that was
motivated by my performance findings.
Therefore, even if Avi ultimately accepts Michaels vhost approach instead of mine, Linux as a hypervisor platform has been significantly _improved_ by a little friendly competition, not somehow damaged by it. 4) Lastly, these patches are almost entirely just stand alone Linux drivers that do not affect KVM if KVM doesn't wish to acknowledge them. Its just like any of the other numerous drivers that are accepted upstream into Linux every day. The only maintained subsystem that is technically touched by this series is netdev, and David Miller already approved of the relevant patch's inclusion: http://lkml.org/lkml/2009/8/3/505 So with all due respect, where is the problem? The patches are all professionally developed according to the Linux coding standards, pass checkpatch, are GPL'ed, and work with a freely available platform which you can download today (http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=summary) Kind Regards, -Greg
* Avi Kivity <avi@redhat.com> wrote:

> I don't have any technical objections to vbus/venet (I had in the
> past re interrupts but I believe you've addressed them), and it
> appears to perform very well. However I still think we should
> address virtio's shortcomings (as Michael is doing) rather than
> create a competitor. We have enough external competition, we
> don't need in-tree competitors.

I do have strong technical objections: distributions really want to
standardize on as few Linux-internal virtualization APIs as possible,
so splintering them just because /bin/cp is easy to do is bad.

If virtio pulls even with vbus's performance, and vbus has no
advantages over virtio, i do NAK vbus on that basis. Let's stop the
silliness before it starts hurting users.

Coming up with something better is good, but doing an incompatible,
duplicative framework just for NIH reasons is stupid and should be
resisted. People don't get to add a new sys_read_v2() without strong
technical arguments either - the same holds for our Linux-internal
driver abstractions, APIs and ABIs.

	ingo
* Gregory Haskins <gregory.haskins@gmail.com> wrote:

> Hi Ingo,
>
> 1) First off, let me state that I have made every effort to
> propose this as a solution to integrate with KVM, the most recent
> of which is April:
>
> http://lkml.org/lkml/2009/4/21/408
>
> If you read through the various vbus related threads on LKML/KVM
> posted this year, I think you will see that I made numerous polite
> offerings to work with people on finding a common solution here,
> including Michael.
>
> In the end, Michael decided to go a different route, using some
> of the ideas proposed in vbus + venet-tap to create vhost-net.
> This is fine, and I respect his decision. But do not try to pin
> "fracturing" on me, because I tried everything to avoid it. :)

That's good.

So if virtio is fixed to be as fast as vbus, and if there are no
other technical advantages of vbus over virtio, you'll be glad to
drop vbus and stand behind virtio?

Also, are you willing to help virtio to become faster? Or do you
have arguments for why that is impossible and why the only possible
solution is vbus? Avi says no such arguments were offered so far.

	Ingo
Avi Kivity wrote:
> On 08/17/2009 05:16 PM, Gregory Haskins wrote:
>>> My opinion is that this is a duplication of effort and we'd be better
>>> off if everyone contributed to enhancing virtio, which already has
>>> widely deployed guest drivers and non-Linux guest support.
>>>
>>> It may have merit if it is proven that it is technically superior to
>>> virtio (and I don't mean some benchmark in some point in time; I mean
>>> design wise). So far I haven't seen any indications that it is.
>>
>> The design is very different, so hopefully I can start to convince you
>> why it might be interesting.
>
> We've been through this before I believe. If you can point out specific
> differences that make venet outperform virtio-net I'll be glad to hear
> (and steal) them though.

You sure know how to convince someone to collaborate with you, eh?

Unfortunately, I've answered that question numerous times, but it
apparently falls on deaf ears.

-Greg
On 08/17/2009 06:05 PM, Gregory Haskins wrote: > Hi Ingo, > > 1) First off, let me state that I have made every effort to propose this > as a solution to integrate with KVM, the most recent of which is April: > > http://lkml.org/lkml/2009/4/21/408 > > If you read through the various vbus related threads on LKML/KVM posted > this year, I think you will see that I made numerous polite offerings to > work with people on finding a common solution here, including Michael. > > In the end, Michael decided that go a different route using some of the > ideas proposed in vbus + venet-tap to create vhost-net. This is fine, > and I respect his decision. But do not try to pin "fracturing" on me, > because I tried everything to avoid it. :) > Given your post, there are only three possible ways to continue kvm guest driver development: - develop virtio/vhost, drop vbus/venet - develop vbus/venet, drop virtio - develop both Developing both fractures the community. Dropping virtio invalidates the installed base and Windows effort. There were no strong technical reasons shown in favor of the remaining option. > Since I still disagree with the fundamental approach of how KVM IO > works, What's that? > Prior to my effort, KVM was humming along at the status quo and I came > along with a closer eye and almost doubled the throughput and cut > latency by 78%. Given an apparent disagreement with aspects of my > approach, Michael went off and created a counter example that was > motivated by my performance findings. > Oh, virtio-net performance was a thorn in our side for a long time. I agree that venet was an additional spur. > Therefore, even if Avi ultimately accepts Michaels vhost approach > instead of mine, Linux as a hypervisor platform has been significantly > _improved_ by a little friendly competition, not somehow damaged by it. > Certainly, and irqfd/ioeventfd are a net win in any case. 
> 4) Lastly, these patches are almost entirely just stand alone Linux > drivers that do not affect KVM if KVM doesn't wish to acknowledge them. > Its just like any of the other numerous drivers that are accepted > upstream into Linux every day. The only maintained subsystem that is > technically touched by this series is netdev, and David Miller already > approved of the relevant patch's inclusion: > > http://lkml.org/lkml/2009/8/3/505 > > So with all due respect, where is the problem? The patches are all > professionally developed according to the Linux coding standards, pass > checkpatch, are GPL'ed, and work with a freely available platform which > you can download today > (http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=summary) > As I mentioned before, I have no technical objections to the patches, I just wish the effort could be concentrated in one direction.
* Gregory Haskins <gregory.haskins@gmail.com> wrote:

> Avi Kivity wrote:
> > On 08/17/2009 05:16 PM, Gregory Haskins wrote:
> >>> My opinion is that this is a duplication of effort and we'd be better
> >>> off if everyone contributed to enhancing virtio, which already has
> >>> widely deployed guest drivers and non-Linux guest support.
> >>>
> >>> It may have merit if it is proven that it is technically superior to
> >>> virtio (and I don't mean some benchmark in some point in time; I mean
> >>> design wise). So far I haven't seen any indications that it is.
> >>
> >> The design is very different, so hopefully I can start to convince you
> >> why it might be interesting.
> >
> > We've been through this before I believe. If you can point out
> > specific differences that make venet outperform virtio-net I'll
> > be glad to hear (and steal) them though.
>
> You sure know how to convince someone to collaborate with you, eh?
>
> Unfortunately, I've answered that question numerous times, but it
> apparently falls on deaf ears.

I'm trying to find the relevant discussion. The link you gave in the
previous mail:

http://lkml.org/lkml/2009/4/21/408

does not offer any design analysis of vbus versus virtio, or explain
why the only fix to virtio is vbus. It offers a comparison and a
blanket statement that vbus is superior, but no arguments.

(If you've already explained in a past thread then please give me an
URL to that reply if possible, or forward me that prior reply.
Thanks!)

	Ingo
On 08/17/2009 06:09 PM, Gregory Haskins wrote: > >> We've been through this before I believe. If you can point out specific >> differences that make venet outperform virtio-net I'll be glad to hear >> (and steal) them though. >> >> > You sure know how to convince someone to collaborate with you, eh? > > If I've offended you, I apologize. > Unforunately, i've answered that question numerous times, but it > apparently falls on deaf ears. > Well, I'm sorry, I truly don't think I've had that question answered with specificity. I'm really interested in it (out of a selfish desire to improve virtio), but the only comment I recall from you was to the effect that the virtio rings were better than ioq in terms of cache placement.
On Mon, Aug 17, 2009 at 10:14:56AM -0400, Gregory Haskins wrote: > Case in point: Take an upstream kernel and you can modprobe the > vbus-pcibridge in and virtio devices will work over that transport > unmodified. > > See http://lkml.org/lkml/2009/8/6/244 for details. The modprobe you are talking about would need to be done in guest kernel, correct? > OTOH, Michael's patch is purely targeted at improving virtio-net on kvm, > and its likewise constrained by various limitations of that decision > (such as its reliance of the PCI model, and the kvm memory scheme). vhost is actually not related to PCI in any way. It simply leaves all setup for userspace to do. And the memory scheme was intentionally separated from kvm so that it can easily support e.g. lguest.
Ingo Molnar wrote:
> * Gregory Haskins <gregory.haskins@gmail.com> wrote:
>
>> Hi Ingo,
>>
>> 1) First off, let me state that I have made every effort to
>> propose this as a solution to integrate with KVM, the most recent
>> of which is April:
>>
>> http://lkml.org/lkml/2009/4/21/408
>>
>> If you read through the various vbus related threads on LKML/KVM
>> posted this year, I think you will see that I made numerous polite
>> offerings to work with people on finding a common solution here,
>> including Michael.
>>
>> In the end, Michael decided to go a different route, using some
>> of the ideas proposed in vbus + venet-tap to create vhost-net.
>> This is fine, and I respect his decision. But do not try to pin
>> "fracturing" on me, because I tried everything to avoid it. :)
>
> That's good.
>
> So if virtio is fixed to be as fast as vbus, and if there are no
> other technical advantages of vbus over virtio, you'll be glad to
> drop vbus and stand behind virtio?

To reiterate: vbus and virtio are not mutually exclusive. The virtio
device model rides happily on top of the vbus bus model. This is
primarily a question of the virtio-pci adapter vs. virtio-vbus. For
more details, see this post:

http://lkml.org/lkml/2009/8/6/244

There is a secondary question of venet (a vbus-native device) versus
virtio-net (a virtio-native device that works with PCI or VBUS). If
this contention is really around venet vs. virtio-net, I may possibly
concede and retract its submission to mainline. I've been pushing it
to date because people are using it, and I don't see any reason that
the driver couldn't be upstream.

> Also, are you willing to help virtio to become faster?

Yes, that is not a problem. Note that virtio in general, and
virtio-net/venet in particular, are not the primary goal here,
however. Improved 802.x and block IO are just positive side-effects
of the effort. I started with 802.x networking just to demonstrate
the IO layer capabilities, and to test it.
It ended up being so good, in contrast to existing facilities, that
developers in the vbus community started using it for production
development.

Ultimately, I created vbus to address areas of performance that have
not yet been addressed in things like KVM: areas such as real-time
guests, or RDMA (host-bypass) interfaces. I also designed it in such
a way that we could, in theory, write one set of (Linux-based)
backends, and have them work across a variety of environments (such
as containers/VMs like KVM, lguest, openvz, but also physical systems
like blade enclosures and clusters, or even applications running on
the host).

> Or do you
> have arguments why that is impossible to do so and why the only
> possible solution is vbus? Avi says no such arguments were offered
> so far.

Not for lack of trying. I think my points have just been missed every
time I try to describe them. ;) Basically I write a message very
similar to this one, and the next conversation starts back from
square one. But I digress, let me try again...

Noting that this discussion is really about the layer *below* virtio,
not virtio itself (e.g. PCI vs. vbus), let's start with a little
background:

-- Background --

So on one level, we have the resource-container technology called
"vbus". It lets you create a container on the host, fill it with
virtual devices, and assign that container to some context (such as a
KVM guest). These "devices" are LKMs, and each device has a very
simple verb namespace consisting of a synchronous "call()" method,
and a "shm()" method for establishing async channels.

The async channels are just shared memory with a signal path (e.g.
interrupts and hypercalls), which the device+driver can use to
overlay things like rings (virtqueues, IOQs), or other shared-memory
based constructs of their choosing (such as a shared table). The
signal path is designed to minimize enter/exits and reduce spurious
signals in a unified way (see the shm-signal patch).
call() can be used both for config-space-like details, as well as
fast-path messaging that requires synchronous behavior (such as guest
scheduler updates).

All of this is managed via sysfs/configfs.

On the guest, we have a "vbus-proxy", which is how the guest gets
access to devices assigned to its container. (As an aside, "virtio"
devices can be populated in the container, and then surfaced up to
the virtio-bus via that virtio-vbus patch I mentioned.)

There is a thing called a "vbus-connector", which is the
guest-specific part. Its job is to connect the vbus-proxy in the
guest to the vbus container on the host. How it does its job is
specific to the connector implementation, but its role is to
transport messages between the guest and the host (such as for call()
and shm() invocations) and to handle things like discovery and
hotswap.

-- Issues --

Out of all this, I think the biggest contention point is the design
of the vbus-connector that I use in AlacrityVM (Avi, correct me if I
am wrong and you object to other aspects as well). I suspect that if
I had designed the vbus-connector to surface vbus devices as PCI
devices via QEMU, the patches would potentially have been pulled in a
while ago.

There are, of course, reasons why vbus does *not* render as PCI, so
this is the meat of your question, I believe.

At a high level, PCI was designed for software-to-hardware
interaction, so it makes assumptions about that relationship that do
not necessarily apply to virtualization. For instance:

A) hardware can only generate byte/word-sized requests at a time,
   because that is all the pcb-etch and silicon support. So hardware
   is usually expressed in terms of some number of "registers".

B) each access to one of these registers is relatively cheap

C) the target end-point has no visibility into the CPU machine state
   other than the parameters passed in the bus-cycle (usually an
   address and data tuple)
D) device-ids are in a fixed-width register and centrally assigned by
   an authority (e.g. PCI-SIG)

E) interrupt/MSI routing is per-device oriented

F) interrupts/MSI are assumed cheap to inject

G) interrupts/MSI are non-prioritizable

H) interrupts/MSI are statically established

These assumptions and constraints may be completely different or
simply invalid in a virtualized guest. For instance, the hypervisor
is just software, and therefore it's not restricted to "etch"
constraints. IO requests can be arbitrarily large, just as if you are
invoking a library function-call or OS system-call. Likewise, each
one of those requests is a branch and a context switch, so it often
has greater performance implications than a simple register bus-cycle
in hardware. If you use an MMIO variant, it has to run through the
page-fault code to be decoded.

The result is typically decreased performance if you try to do the
same thing real hardware does. This is why hypervisor-specific
drivers (e.g. virtio-net, vmnet, etc.) are a common feature.

_Some_ performance-oriented items can technically be accomplished in
PCI, albeit in a much more awkward way. For instance, you can set up
a really fast, low-latency "call()" mechanism using a PIO port on a
PCI model and ioeventfd. As a matter of fact, this is exactly what
the vbus pci-bridge does:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=drivers/vbus/pci-bridge.c;h=f0ed51af55b5737b3ae4239ed2adfe12c7859941;hb=ee557a5976921650b792b19e6a93cd03fcad304a#l102

(Also note that the enabling technology, ioeventfd, is something that
came out of my efforts on vbus.)

The problem here is that this is incredibly awkward to set up. You
have all that per-cpu goo and the registration of the memory on the
guest. And on the host side, you have all the vmapping of the
registered memory, and the file descriptor to manage. In short, it's
really painful.
I would much prefer to do this *once*, and then let all my devices
simply re-use that infrastructure. This is, in fact, what I do. Here
is the device model that a guest sees:

struct vbus_device_proxy_ops {
	int (*open)(struct vbus_device_proxy *dev, int version, int flags);
	int (*close)(struct vbus_device_proxy *dev, int flags);
	int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
		   void *ptr, size_t len,
		   struct shm_signal_desc *sigdesc, struct shm_signal **signal,
		   int flags);
	int (*call)(struct vbus_device_proxy *dev, u32 func,
		    void *data, size_t len, int flags);
	void (*release)(struct vbus_device_proxy *dev);
};

Now the client just calls dev->call() and it's lightning quick, and
they don't have to worry about all the details of making it quick,
nor expend additional per-cpu heap and address space to get it.

Moving on: _other_ items cannot be replicated (at least, not without
hacking it into something that is no longer PCI). Things like the
pci-id namespace are just silly for software. I would rather have a
namespace that does not require central management, so people are
free to create vbus-backends at will. This is akin to registering a
device MAJOR/MINOR, versus using the various dynamic assignment
mechanisms. vbus uses a string identifier in place of a pci-id. This
is superior IMHO, and not compatible with PCI.

As another example, the connector design coalesces *all* shm-signals
into a single interrupt (by prio) that uses the same context-switch
mitigation techniques that help boost things like networking. This
effectively means we can detect and optimize out ack/eoi cycles from
the APIC as the IO load increases (which is when you need it most).
PCI has no such concept.

In addition, the signals and interrupts are priority-aware, which is
useful for things like 802.1p networking, where you may establish
8-tx and 8-rx queues for your virtio-net device. The x86 APIC really
has no usable equivalent, so PCI is stuck here.
Also, the signals can be allocated on demand for implementing things
like IPC channels in response to guest requests, since there is no
assumption about device-to-interrupt mappings. This is more flexible.

And through all of this, this design would work in any guest, even if
it doesn't have PCI (e.g. lguest, UML, physical systems, etc.).

-- Bottom Line --

The idea here is to generalize all the interesting parts that are
common (fast sync+async IO, context-switch mitigation, back-end
models, memory abstractions, signal-path routing, etc.) that a
variety of Linux-based technologies can use (kvm, lguest, openvz,
uml, physical systems), and only require the thin "connector" code to
port the system around. The idea is to try to get this aspect of PV
right once, and at some point in the future, perhaps vbus will be as
ubiquitous as PCI. Well, perhaps not *that* ubiquitous, but you get
the idea. ;)

Then device models like virtio can ride happily on top, and we end up
with a really robust and high-performance Linux-based stack.

I don't buy the argument that we already have PCI so let's use it. I
don't think it's the best design, and I am not afraid to make an
investment in a change here because I think it will pay off in the
long run.

I hope this helps to clarify my motivation.

Kind Regards,
-Greg
Ingo Molnar wrote: > * Gregory Haskins <gregory.haskins@gmail.com> wrote: > >> Avi Kivity wrote: >>> On 08/17/2009 05:16 PM, Gregory Haskins wrote: >>>>> My opinion is that this is a duplication of effort and we'd be better >>>>> off if everyone contributed to enhancing virtio, which already has >>>>> widely deployed guest drivers and non-Linux guest support. >>>>> >>>>> It may have merit if it is proven that it is technically superior to >>>>> virtio (and I don't mean some benchmark in some point in time; I mean >>>>> design wise). So far I haven't seen any indications that it is. >>>>> >>>>> >>>> The design is very different, so hopefully I can start to convince you >>>> why it might be interesting. >>>> >>> We've been through this before I believe. If you can point out >>> specific differences that make venet outperform virtio-net I'll >>> be glad to hear (and steal) them though. >> You sure know how to convince someone to collaborate with you, eh? >> >> Unforunately, i've answered that question numerous times, but it >> apparently falls on deaf ears. > > I'm trying to find the relevant discussion. The link you gave in the > previous mail: > > http://lkml.org/lkml/2009/4/21/408 > > does not offer any design analysis of vbus versus virtio, and why > the only fix to virtio is vbus. It offers a comparison and a blanket > statement that vbus is superior but no arguments. > > (If you've already explained in a past thread then please give me an > URL to that reply if possible, or forward me that prior reply. > Thanks!) Sorry, it was a series of long threads from quite a while back. I will see if I can find some references, but it might be easier to just start fresh (see the last reply I sent). Kind Regards, -Greg
Michael S. Tsirkin wrote:
> On Mon, Aug 17, 2009 at 10:14:56AM -0400, Gregory Haskins wrote:
>> Case in point: Take an upstream kernel and you can modprobe the
>> vbus-pcibridge in and virtio devices will work over that transport
>> unmodified.
>>
>> See http://lkml.org/lkml/2009/8/6/244 for details.
>
> The modprobe you are talking about would need
> to be done in guest kernel, correct?

Yes, and your point is? "Unmodified" (pardon the pseudo pun) modifies
"virtio", not "guest". It means you can take an off-the-shelf kernel
with off-the-shelf virtio (a la a distro kernel), modprobe
vbus-pcibridge, and get alacrityvm acceleration. It is not a design
goal of mine to forbid the loading of a new driver, so I am ok with
that requirement.

>> OTOH, Michael's patch is purely targeted at improving virtio-net on kvm,
>> and its likewise constrained by various limitations of that decision
>> (such as its reliance of the PCI model, and the kvm memory scheme).
>
> vhost is actually not related to PCI in any way. It simply leaves all
> setup for userspace to do. And the memory scheme was intentionally
> separated from kvm so that it can easily support e.g. lguest.

I think you have missed my point. I mean that vhost requires a
separate bus-model (a la qemu-pci). And no, your memory scheme is not
separated, at least not very well. It still assumes memory-regions
and copy_to_user(), which is very kvm-esque. Vbus has people using
things like userspace containers (no regions) and physical hardware
(dma controllers, so no regions or copy_to_user), so your scheme
quickly falls apart once you get away from KVM.

Don't get me wrong: that design may have its place. Perhaps you only
care about fixing KVM, which is a perfectly acceptable strategy. It's
just not a strategy that I think is the best approach. Essentially
you are promoting the proliferation of competing backends, and I am
trying to unify them (which is ironic, given that this thread started
with concerns that I was fragmenting things ;).
The bottom line is: you have a simpler solution that is more finely
targeted at KVM and virtio networking. It fixes probably a lot of
problems with the existing implementation, but it still has
limitations. OTOH, what I am promoting is more complex, but more
flexible. That is the tradeoff. You can't have both. ;)

So do not for one second think that what you implemented is
equivalent, because it is not. In fact, I believe I warned you about
this potential problem when you decided to implement your own
version. I think I said something to the effect of "you will either
have a subset of functionality, or you will ultimately reinvent what
I did". Right now you are in the subset phase. Perhaps someday you
will be in the complete-reinvent phase.

Why you wanted to go that route when I had already worked through the
issues is something perhaps only you will ever know, but I'm sure you
had your reasons. But do note you could have saved yourself grief by
reusing my already implemented and tested variant, as I politely
offered to work with you on making it meet your needs.

Kind Regards,
-Greg
Gregory Haskins wrote: > Note: No one has ever proposed to change the virtio-ABI. virtio-pci is part of the virtio ABI. You are proposing changing that. You cannot add new kernel modules to guests and expect them to remain supported. So there is value in reusing existing ABIs >> I think the reason vbus gets better performance for networking today is >> that vbus' backends are in the kernel while virtio's backends are >> currently in userspace. >> > > Well, with all due respect, you also said initially when I announced > vbus that in-kernel doesn't matter, and tried to make virtio-net run as > fast as venet from userspace ;) Given that we never saw those userspace > patches from you that in fact equaled my performance, I assume you were > wrong about that statement. Perhaps you were wrong about other things too? > I'm wrong about a lot of things :-) I haven't yet been convinced that I'm wrong here though. One of the gray areas here is what constitutes an in-kernel backend. tun/tap is a sort of an in-kernel backend. Userspace is still involved in all of the paths. vhost seems to be an intermediate step between tun/tap and vbus. The fast paths avoid userspace completely. Many of the slow paths involve userspace still (like migration apparently). With vbus, userspace is avoided entirely. In some ways, you could argue that slirp and vbus are opposite ends of the virtual I/O spectrum. I believe strongly that we should avoid putting things in the kernel unless they absolutely have to be. I'm definitely interested in playing with vhost to see if there are ways to put even less in the kernel. In particular, I think it would be a big win to avoid knowledge of slots in the kernel by doing ring translation in userspace. This implies a userspace transition in the fast path. This may or may not be acceptable. I think this is going to be a very interesting experiment and will ultimately determine whether my intuition about the cost of dropping to userspace is right or wrong. 
> Conversely, I am not afraid of requiring a new driver to optimize the
> general PV interface. In the long term, this will reduce the amount of
> reimplementing the same code over and over, reduce system overhead, and
> it adds new features not previously available (for instance, coalescing
> and prioritizing interrupts).

I think you have a lot of ideas, and I don't know that we've been
able to really understand your vision. Do you have any plans to write
a paper about vbus that goes into some of your thoughts in detail?

>> If that's the case, then I don't see any
>> reason to adopt vbus unless Greg thinks there are other compelling
>> features over virtio.
>
> Aside from the fact that this is another confusion of the vbus/virtio
> relationship...yes, of course there are compelling features (IMHO) or I
> wouldn't be expending effort ;) They are at least compelling enough to
> put in AlacrityVM.

This whole AlacrityVM thing is really hitting this nail with a
sledgehammer. While the kernel needs to be very careful about what it
pulls in, as long as you're willing to commit to ABI compatibility,
we can pull code into QEMU to support vbus. Then you can just offer
vbus host and guest drivers instead of forking the kernel.

> If upstream KVM doesn't want them, that's KVM's
> decision and I am fine with that. Simply never apply my qemu patches to
> qemu-kvm.git, and KVM will be blissfully unaware if vbus is present.

As I mentioned before, if you submit patches to upstream QEMU, we'll
apply them (after appropriate review). As I said previously, we want
to avoid user confusion as much as possible. Maybe this means
limiting it to -device or a separate machine type. I'm not sure, but
that's something we can discuss on qemu-devel.

> I
> do hope that I can convince the KVM community otherwise, however.
:)

> Regards,

Anthony Liguori
On 08/18/2009 04:08 AM, Anthony Liguori wrote: > I believe strongly that we should avoid putting things in the kernel > unless they absolutely have to be. I'm definitely interested in > playing with vhost to see if there are ways to put even less in the > kernel. In particular, I think it would be a big win to avoid > knowledge of slots in the kernel by doing ring translation in > userspace. This implies a userspace transition in the fast path. > This may or may not be acceptable. I think this is going to be a very > interesting experiment and will ultimately determine whether my > intuition about the cost of dropping to userspace is right or wrong. I believe with a perfectly scaling qemu this should be feasible. Currently qemu is far from scaling perfectly, but inefficient userspace is not a reason to put things into the kernel. Having a translated ring is also a nice solution for migration - userspace can mark the pages dirty while translating the receive ring. Still, in-kernel translation is simple enough that I think we should keep it.
On 08/17/2009 10:33 PM, Gregory Haskins wrote: > > There is a secondary question of venet (a vbus native device) verses > virtio-net (a virtio native device that works with PCI or VBUS). If > this contention is really around venet vs virtio-net, I may possibly > conceed and retract its submission to mainline. I've been pushing it to > date because people are using it and I don't see any reason that the > driver couldn't be upstream. > That's probably the cause of much confusion. The primary kvm pain point is now networking, so in any vbus discussion we're concentrating on that aspect. >> Also, are you willing to help virtio to become faster? >> > Yes, that is not a problem. Note that virtio in general, and > virtio-net/venet in particular are not the primary goal here, however. > Improved 802.x and block IO are just positive side-effects of the > effort. I started with 802.x networking just to demonstrate the IO > layer capabilities, and to test it. It ended up being so good on > contrast to existing facilities, that developers in the vbus community > started using it for production development. > > Ultimately, I created vbus to address areas of performance that have not > yet been addressed in things like KVM. Areas such as real-time guests, > or RDMA (host bypass) interfaces. Can you explain how vbus achieves RDMA? I also don't see the connection to real time guests. > I also designed it in such a way that > we could, in theory, write one set of (linux-based) backends, and have > them work across a variety of environments (such as containers/VMs like > KVM, lguest, openvz, but also physical systems like blade enclosures and > clusters, or even applications running on the host). > Sorry, I'm still confused. Why would openvz need vbus? It already has zero-copy networking since it's a shared kernel. Shared memory should also work seamlessly, you just need to expose the shared memory object on a shared part of the namespace. 
And of course, anything in the kernel is already shared.

>> Or do you
>> have arguments why that is impossible to do so and why the only
>> possible solution is vbus? Avi says no such arguments were offered
>> so far.
>>
> Not for lack of trying. I think my points have just been missed
> every time I try to describe them. ;) Basically I write a message very
> similar to this one, and the next conversation starts back from square
> one. But I digress, let me try again..
>
> Noting that this discussion is really about the layer *below* virtio,
> not virtio itself (e.g. PCI vs vbus). Let's start with a little background:
>
> -- Background --
>
> So on one level, we have the resource-container technology called
> "vbus". It lets you create a container on the host, fill it with
> virtual devices, and assign that container to some context (such as a
> KVM guest). These "devices" are LKMs, and each device has a very simple
> verb namespace consisting of a synchronous "call()" method, and a
> "shm()" method for establishing async channels.
>
> The async channels are just shared-memory with a signal path (e.g.
> interrupts and hypercalls), which the device+driver can use to overlay
> things like rings (virtqueues, IOQs), or other shared-memory based
> constructs of their choosing (such as a shared table). The signal path
> is designed to minimize enter/exits and reduce spurious signals in a
> unified way (see shm-signal patch).
>
> call() can be used both for config-space-like details, as well as
> fast-path messaging that requires synchronous behavior (such as guest
> scheduler updates).
>
> All of this is managed via sysfs/configfs.
>

One point of contention is that this is all managementy stuff and should be kept out of the host kernel. Exposing shared memory, interrupts, and guest hypercalls can all be easily done from userspace (as virtio demonstrates). True, some devices need kernel acceleration, but that's no reason to put everything into the host kernel.
> On the guest, we have a "vbus-proxy" which is how the guest gets access
> to devices assigned to its container. (as an aside, "virtio" devices
> can be populated in the container, and then surfaced up to the
> virtio-bus via that virtio-vbus patch I mentioned).
>
> There is a thing called a "vbus-connector" which is the guest-specific
> part. Its job is to connect the vbus-proxy in the guest to the vbus
> container on the host. How it does its job is specific to the connector
> implementation, but its role is to transport messages between the guest
> and the host (such as for call() and shm() invocations) and to handle
> things like discovery and hotswap.
>

virtio has an exact parallel here (virtio-pci and friends).

> Out of all this, I think the biggest contention point is the design of
> the vbus-connector that I use in AlacrityVM (Avi, correct me if I am
> wrong and you object to other aspects as well). I suspect that if I had
> designed the vbus-connector to surface vbus devices as PCI devices via
> QEMU, the patches would potentially have been pulled in a while ago.
>

Exposing devices as PCI is an important issue for me, as I have to consider non-Linux guests. Another issue is the host kernel management code which I believe is superfluous.

But the biggest issue is compatibility. virtio exists and has Windows and Linux drivers. Without a fatal flaw in virtio we'll continue to support it. Given that, why spread to a new model?

Of course, I understand you're interested in non-ethernet, non-block devices. I can't comment on these until I see them. Maybe they can fit the virtio model, and maybe they can't.

> There are, of course, reasons why vbus does *not* render as PCI, so this
> is the meat of your question, I believe.
>
> At a high level, PCI was designed for software-to-hardware interaction,
> so it makes assumptions about that relationship that do not necessarily
> apply to virtualization.
>
> For instance:
>
> A) hardware can only generate byte/word sized requests at a time because
> that is all that the pcb-etch and silicon support. So hardware is usually
> expressed in terms of some number of "registers".
>

No, hardware happily DMAs to and from main memory. Some hardware of course uses mmio registers extensively, but not virtio hardware. With the recent MSI support no registers are touched in the fast path.

> C) the target end-point has no visibility into the CPU machine state
> other than the parameters passed in the bus-cycle (usually an address
> and data tuple).
>

That's not an issue. Accessing memory is cheap.

> D) device-ids are in a fixed-width register and centrally assigned from
> an authority (e.g. PCI-SIG).
>

That's not an issue either. Qumranet/Red Hat has donated a range of device IDs for use in virtio. Device IDs are how devices are associated with drivers, so you'll need something similar for vbus.

> E) Interrupt/MSI routing is per-device oriented
>

Please elaborate. What is the issue? How does vbus solve it?

> F) Interrupts/MSI are assumed cheap to inject
>

Interrupts are not assumed cheap; that's why interrupt mitigation is used (on real and virtual hardware).

> G) Interrupts/MSI are non-prioritizable.
>

They are prioritizable; Linux ignores this though (Windows doesn't). Please elaborate on what the problem is and how vbus solves it.

> H) Interrupts/MSI are statically established
>

Can you give an example of why this is a problem?

> These assumptions and constraints may be completely different or simply
> invalid in a virtualized guest. For instance, the hypervisor is just
> software, and therefore it's not restricted to "etch" constraints. IO
> requests can be arbitrarily large, just as if you are invoking a library
> function-call or OS system-call. Likewise, each one of those requests is
> a branch and a context switch, so it often has greater performance
> implications than a simple register bus-cycle in hardware.
> If you use
> an MMIO variant, it has to run through the page-fault code to be decoded.
>
> The result is typically decreased performance if you try to do the same
> thing real hardware does. This is why you usually see hypervisor-specific
> drivers (e.g. virtio-net, vmnet, etc.) as a common feature.
>
> _Some_ performance oriented items can technically be accomplished in
> PCI, albeit in a much more awkward way. For instance, you can set up a
> really fast, low-latency "call()" mechanism using a PIO port on a
> PCI-model and ioeventfd. As a matter of fact, this is exactly what the
> vbus pci-bridge does:
>

What performance oriented items have been left unaddressed? virtio and vbus use three communications channels: call from guest to host (implemented as pio and reasonably fast), call from host to guest (implemented as msi and reasonably fast) and shared memory (as fast as it can be). Where does PCI limit you in any way?

> The problem here is that this is incredibly awkward to set up. You have
> all that per-cpu goo and the registration of the memory on the guest.
> And on the host side, you have all the vmapping of the registered
> memory, and the file-descriptor to manage. In short, it's really painful.
>
> I would much prefer to do this *once*, and then let all my devices
> simply re-use that infrastructure. This is, in fact, what I do. Here
> is the device model that a guest sees:
>

virtio also reuses the pci code, on both guest and host.

> Moving on: _Other_ items cannot be replicated (at least, not without
> hacking it into something that is no longer PCI).
>
> Things like the pci-id namespace are just silly for software. I would
> rather have a namespace that does not require central management so
> people are free to create vbus-backends at will. This is akin to
> registering a device MAJOR/MINOR, versus using the various dynamic
> assignment mechanisms. vbus uses a string identifier in place of a
> pci-id. This is superior IMHO, and not compatible with PCI.
> How do you handle conflicts? Again you need a central authority to hand out names or prefixes. > As another example, the connector design coalesces *all* shm-signals > into a single interrupt (by prio) that uses the same context-switch > mitigation techniques that help boost things like networking. This > effectively means we can detect and optimize out ack/eoi cycles from the > APIC as the IO load increases (which is when you need it most). PCI has > no such concept. > That's a bug, not a feature. It means poor scaling as the number of vcpus increases and as the number of devices increases. Note nothing prevents steering multiple MSIs into a single vector. It's a bad idea though. > In addition, the signals and interrupts are priority aware, which is > useful for things like 802.1p networking where you may establish 8-tx > and 8-rx queues for your virtio-net device. x86 APIC really has no > usable equivalent, so PCI is stuck here. > x86 APIC is priority aware. > Also, the signals can be allocated on-demand for implementing things > like IPC channels in response to guest requests since there is no > assumption about device-to-interrupt mappings. This is more flexible. > Yes. However given that vectors are a scarce resource you're severely limited in that. And if you're multiplexing everything on one vector, then you can just as well demultiplex your channels in the virtio driver code. > And through all of this, this design would work in any guest even if it > doesn't have PCI (e.g. lguest, UML, physical systems, etc). > That is true for virtio which works on pci-less lguest and s390. > -- Bottom Line -- > > The idea here is to generalize all the interesting parts that are common > (fast sync+async io, context-switch mitigation, back-end models, memory > abstractions, signal-path routing, etc) that a variety of linux based > technologies can use (kvm, lguest, openvz, uml, physical systems) and > only require the thin "connector" code to port the system around. 
> The
> idea is to try to get this aspect of PV right once, and at some point in
> the future, perhaps vbus will be as ubiquitous as PCI. Well, perhaps
> not *that* ubiquitous, but you get the idea ;)
>

That is exactly the design goal of virtio (except it limits itself to virtualization).

> Then device models like virtio can ride happily on top and we end up
> with a really robust and high-performance Linux-based stack. I don't
> buy the argument that we already have PCI so let's use it. I don't think
> it's the best design and I am not afraid to make an investment in a
> change here because I think it will pay off in the long run.
>

Sorry, I don't think you've shown any quantifiable advantages.
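For readers following the "fast call() over PIO plus ioeventfd" point in the message above: the host-side routing such a mechanism needs can be pictured as a small port-to-callback table consulted on a PIO vmexit, so one decode serves every registered device. This is a toy model — `register_doorbell` and friends are invented names for illustration, not the KVM, ioeventfd, or vbus API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DOORBELLS 16

struct doorbell {
    uint16_t port;              /* PIO doorbell address */
    void (*kick)(void *priv);   /* backend notification callback */
    void *priv;
};

static struct doorbell doorbells[MAX_DOORBELLS];
static int nr_doorbells;

/* A backend registers its doorbell once, at setup time. */
int register_doorbell(uint16_t port, void (*kick)(void *), void *priv)
{
    if (nr_doorbells == MAX_DOORBELLS)
        return -1;
    doorbells[nr_doorbells++] = (struct doorbell){ port, kick, priv };
    return 0;
}

/* Called on a PIO write vmexit: route by port, report whether handled. */
int handle_pio_write(uint16_t port)
{
    for (int i = 0; i < nr_doorbells; i++) {
        if (doorbells[i].port == port) {
            doorbells[i].kick(doorbells[i].priv);
            return 1;
        }
    }
    return 0;
}

/* Demo backend: just count kicks. */
static int demo_kicks;
static void demo_kick(void *priv) { (void)priv; demo_kicks++; }
```

The design point both sides seem to agree on is that this lookup is cheap; the disagreement is over how much of the surrounding setup plumbing belongs in the kernel.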
On Mon, Aug 17, 2009 at 04:17:09PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Mon, Aug 17, 2009 at 10:14:56AM -0400, Gregory Haskins wrote:
> >> Case in point: Take an upstream kernel and you can modprobe the
> >> vbus-pcibridge in and virtio devices will work over that transport
> >> unmodified.
> >>
> >> See http://lkml.org/lkml/2009/8/6/244 for details.
> >
> > The modprobe you are talking about would need
> > to be done in guest kernel, correct?
>
> Yes, and your point is? "unmodified" (pardon the pseudo-pun) modifies
> "virtio", not "guest".
> It means you can take an off-the-shelf kernel
> with off-the-shelf virtio (a la distro-kernel) and modprobe
> vbus-pcibridge and get alacrityvm acceleration.

Heh, by that logic ksplice does not modify the running kernel either :)

> It is not a design goal of mine to forbid the loading of a new driver,
> so I am ok with that requirement.
>
> >> OTOH, Michael's patch is purely targeted at improving virtio-net on kvm,
> >> and it's likewise constrained by various limitations of that decision
> >> (such as its reliance on the PCI model, and the kvm memory scheme).
> >
> > vhost is actually not related to PCI in any way. It simply leaves all
> > setup for userspace to do. And the memory scheme was intentionally
> > separated from kvm so that it can easily support e.g. lguest.
> >
>
> I think you have missed my point. I mean that vhost requires a separate
> bus-model (a la qemu-pci).

So? That can be in userspace, and can be anything including vbus.

> And no, your memory scheme is not separated,
> at least, not very well. It still assumes memory-regions and
> copy_to_user(), which is very kvm-esque.

I don't think so: works for lguest, kvm, UML and containers

> Vbus has people using things
> like userspace containers (no regions),

vhost by default works without regions

> and physical hardware (dma
> controllers, so no regions or copy_to_user) so your scheme quickly falls
> apart once you get away from KVM.
Someone took a driver and is building hardware for it ... so what?

> Don't get me wrong: That design may have its place. Perhaps you only
> care about fixing KVM, which is a perfectly acceptable strategy.
> It's just not a strategy that I think is the best approach. Essentially you
> are promoting the proliferation of competing backends, and I am trying
> to unify them (which is ironic, given that this thread started with concerns I
> was fragmenting things ;).

So, you don't see how venet fragments things? It's pretty obvious ...

> The bottom line is, you have a simpler solution that is more finely
> targeted at KVM and virtio-networking. It fixes probably a lot of
> problems with the existing implementation, but it still has limitations.
>
> OTOH, what I am promoting is more complex, but more flexible. That is
> the tradeoff. You can't have both ;)

We can: connect eventfds to hypercalls, and vhost will work with vbus.

> So do not for one second think
> that what you implemented is equivalent, because it is not.
>
> In fact, I believe I warned you about this potential problem when you
> decided to implement your own version. I think I said something to the
> effect of "you will either have a subset of functionality, or you will
> ultimately reinvent what I did". Right now you are in the subset phase.

No. Unlike vbus, vhost supports unmodified guests and live migration.

> Perhaps someday you will be in the complete-reinvent phase. Why you
> wanted to go that route when I had already worked through the issues is
> something perhaps only you will ever know, but I'm sure you had your
> reasons. But do note you could have saved yourself grief by reusing my
> already implemented and tested variant, as I politely offered to work
> with you on making it meet your needs.
> Kind Regards
> -Greg
>

you have a midlayer. I could not use it without pulling in all of it.
On Mon, Aug 17, 2009 at 08:08:24PM -0500, Anthony Liguori wrote: > In particular, I think it would be a big win to avoid knowledge of slots in > the kernel by doing ring translation in userspace. vhost supports this BTW: just don't call the memory table ioctl. In-kernel translation is simple, as well, though.
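The memory table Michael mentions boils down to a list of guest-physical regions paired with host-virtual counterparts, and "in-kernel translation is simple" because it is just a linear walk. The struct below is a simplified stand-in for the real vhost uapi layout, not the actual definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified memory region: layout is illustrative, not the uapi one. */
struct mem_region {
    uint64_t gpa_start;   /* guest-physical base */
    uint64_t size;        /* region length in bytes */
    uint64_t hva_start;   /* host-virtual base backing it */
};

/*
 * Translate a guest-physical address to a host-virtual address by
 * walking the region table.  Returns 0 if no region covers gpa
 * (0 doubles as "not found" in this sketch).
 */
uint64_t gpa_to_hva(const struct mem_region *tbl, int n, uint64_t gpa)
{
    for (int i = 0; i < n; i++)
        if (gpa >= tbl[i].gpa_start &&
            gpa < tbl[i].gpa_start + tbl[i].size)
            return tbl[i].hva_start + (gpa - tbl[i].gpa_start);
    return 0;
}
```

Skipping the memory-table setup, as Michael suggests, simply means this walk never happens in the kernel and userspace performs the equivalent lookup itself.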
On Mon, Aug 17, 2009 at 03:33:30PM -0400, Gregory Haskins wrote:
> There is a secondary question of venet (a vbus native device) versus
> virtio-net (a virtio native device that works with PCI or VBUS). If
> this contention is really around venet vs virtio-net, I may possibly
> concede and retract its submission to mainline.

For me yes, venet+ioq competing with virtio+virtqueue.

> I've been pushing it to
> date because people are using it and I don't
> see any reason that the driver couldn't be upstream.

If virtio is just as fast, they can just use it without knowing it. Clearly, that's better since we support virtio anyway ...

> -- Issues --
>
> Out of all this, I think the biggest contention point is the design of
> the vbus-connector that I use in AlacrityVM (Avi, correct me if I am
> wrong and you object to other aspects as well). I suspect that if I had
> designed the vbus-connector to surface vbus devices as PCI devices via
> QEMU, the patches would potentially have been pulled in a while ago.
>
> There are, of course, reasons why vbus does *not* render as PCI, so this
> is the meat of your question, I believe.
>
> At a high level, PCI was designed for software-to-hardware interaction,
> so it makes assumptions about that relationship that do not necessarily
> apply to virtualization.

I'm not hung up on PCI, myself. An idea that might help you get Avi on-board: do setup in userspace, over PCI. Negotiate hypercall support (e.g. with a PCI capability) and then switch to that for fastpath. Hmm?

> As another example, the connector design coalesces *all* shm-signals
> into a single interrupt (by prio) that uses the same context-switch
> mitigation techniques that help boost things like networking. This
> effectively means we can detect and optimize out ack/eoi cycles from the
> APIC as the IO load increases (which is when you need it most). PCI has
> no such concept.

Could you elaborate on this one for me? How does context-switch mitigation work?
> In addition, the signals and interrupts are priority aware, which is > useful for things like 802.1p networking where you may establish 8-tx > and 8-rx queues for your virtio-net device. x86 APIC really has no > usable equivalent, so PCI is stuck here. By the way, multiqueue support in virtio would be very nice to have, and seems mostly unrelated to vbus.
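Gregory's claim quoted above — all shm-signals coalesced behind one interrupt and drained in priority order — can be modeled in a few lines. This sketch is only an interpretation of the description in the thread, not the AlacrityVM connector code; names and data structures are invented for illustration.

```c
#include <assert.h>

#define NSIG 8

static int pending[NSIG];   /* 1 = signal raised and not yet serviced */
static int prio[NSIG];      /* larger value = more urgent */

/* A device raises its shm-signal; many may accumulate behind
 * a single injected interrupt. */
void raise_signal(int sig)
{
    pending[sig] = 1;
}

/*
 * Drain step: after one interrupt fires, the handler repeatedly asks
 * for the highest-priority pending signal until none remain.
 * Returns the signal to service next, or -1 when drained.
 */
int next_signal(void)
{
    int best = -1;
    for (int s = 0; s < NSIG; s++)
        if (pending[s] && (best < 0 || prio[s] > prio[best]))
            best = s;
    if (best >= 0)
        pending[best] = 0;
    return best;
}
```

The trade-off Avi raises applies directly to this model: one vector means one drain loop, which saves ack/eoi cycles under load but serializes delivery across vcpus.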
On 08/18/2009 12:53 PM, Michael S. Tsirkin wrote: > I'm not hung up on PCI, myself. An idea that might help you get Avi > on-board: do setup in userspace, over PCI. Negotiate hypercall support > (e.g. with a PCI capability) and then switch to that for fastpath. Hmm? > Hypercalls don't nest well. When a nested guest issues a hypercall, you have to assume it is destined to the enclosing guest, so you can't assign a hypercall-capable device to a nested guest. mmio and pio don't have this problem since the host can use the address to locate the destination.
On Tue, Aug 18, 2009 at 01:00:25PM +0300, Avi Kivity wrote: > On 08/18/2009 12:53 PM, Michael S. Tsirkin wrote: >> I'm not hung up on PCI, myself. An idea that might help you get Avi >> on-board: do setup in userspace, over PCI. Negotiate hypercall support >> (e.g. with a PCI capability) and then switch to that for fastpath. Hmm? >> > > Hypercalls don't nest well. When a nested guest issues a hypercall, you > have to assume it is destined to the enclosing guest, so you can't > assign a hypercall-capable device to a nested guest. > > mmio and pio don't have this problem since the host can use the address > to locate the destination. So userspace could map hypercall to address during setup and tell the host kernel? > -- > error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 08/18/2009 01:09 PM, Michael S. Tsirkin wrote: > >> mmio and pio don't have this problem since the host can use the address >> to locate the destination. >> > So userspace could map hypercall to address during setup and tell the > host kernel? > Suppose a nested guest has two devices. One a virtual device backed by its host (our guest), and one a virtual device backed by us (the real host), and assigned by the guest to the nested guest. If both devices use hypercalls, there is no way to distinguish between them.
On Tue, Aug 18, 2009 at 01:13:57PM +0300, Avi Kivity wrote:
> On 08/18/2009 01:09 PM, Michael S. Tsirkin wrote:
>>
>>> mmio and pio don't have this problem since the host can use the address
>>> to locate the destination.
>>>
>> So userspace could map hypercall to address during setup and tell the
>> host kernel?
>>
>
> Suppose a nested guest has two devices. One a virtual device backed by
> its host (our guest), and one a virtual device backed by us (the real
> host), and assigned by the guest to the nested guest. If both devices
> use hypercalls, there is no way to distinguish between them.

Not sure I understand. What I had in mind is that devices would have to either use different hypercalls and map hypercall to address during setup, or pass address with each hypercall. We get the hypercall, translate the address as if it was pio access, and know the destination?
On 08/18/2009 01:28 PM, Michael S. Tsirkin wrote:
>
>> Suppose a nested guest has two devices. One a virtual device backed by
>> its host (our guest), and one a virtual device backed by us (the real
>> host), and assigned by the guest to the nested guest. If both devices
>> use hypercalls, there is no way to distinguish between them.
>>
> Not sure I understand. What I had in mind is that devices would have to
> either use different hypercalls and map hypercall to address during
> setup, or pass address with each hypercall. We get the hypercall,
> translate the address as if it was pio access, and know the destination?
>

There are no different hypercalls. There's just one hypercall instruction, and there's no standard on how it's used. If a nested guest issues a hypercall instruction, you have no idea if it's calling a Hyper-V hypercall or a vbus/virtio kick.

You could have a protocol where you register the hypercall instruction's address with its recipient, but it quickly becomes a tangled mess. And for what? pio and hypercalls have the same performance characteristics.
On Tue, Aug 18, 2009 at 01:45:05PM +0300, Avi Kivity wrote:
> On 08/18/2009 01:28 PM, Michael S. Tsirkin wrote:
>>
>>> Suppose a nested guest has two devices. One a virtual device backed by
>>> its host (our guest), and one a virtual device backed by us (the real
>>> host), and assigned by the guest to the nested guest. If both devices
>>> use hypercalls, there is no way to distinguish between them.
>>>
>> Not sure I understand. What I had in mind is that devices would have to
>> either use different hypercalls and map hypercall to address during
>> setup, or pass address with each hypercall. We get the hypercall,
>> translate the address as if it was pio access, and know the destination?
>>
>
> There are no different hypercalls. There's just one hypercall
> instruction, and there's no standard on how it's used. If a nested guest
> issues a hypercall instruction, you have no idea if it's calling a
> Hyper-V hypercall or a vbus/virtio kick.

userspace will know which it is, because hypercall capability in the device has been activated, and can tell kernel, using something similar to iosignalfd. No?

> You could have a protocol where you register the hypercall instruction's
> address with its recipient, but it quickly becomes a tangled mess.

I really thought we could pass the io address in a register as an input parameter. Is there a way to do this in a secure manner?

Hmm. Doesn't kvm use hypercalls now? How does this work with nesting? For example, in this code in arch/x86/kvm/x86.c:

	switch (nr) {
	case KVM_HC_VAPIC_POLL_IRQ:
		ret = 0;
		break;
	case KVM_HC_MMU_OP:
		r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), &ret);
		break;
	default:
		ret = -KVM_ENOSYS;
		break;
	}

how do we know that it's the guest and not the nested guest performing the hypercall?

> And for what? pio and hypercalls have the same performance characteristics.

No idea about that. I'm assuming Gregory knows why he wants to use hypercalls, I was just trying to help find a way that is also palatable and flexible.
On 08/18/2009 02:07 PM, Michael S. Tsirkin wrote: > On Tue, Aug 18, 2009 at 01:45:05PM +0300, Avi Kivity wrote: > >> On 08/18/2009 01:28 PM, Michael S. Tsirkin wrote: >> >>> >>>> Suppose a nested guest has two devices. One a virtual device backed by >>>> its host (our guest), and one a virtual device backed by us (the real >>>> host), and assigned by the guest to the nested guest. If both devices >>>> use hypercalls, there is no way to distinguish between them. >>>> >>>> >>> Not sure I understand. What I had in mind is that devices would have to >>> either use different hypercalls and map hypercall to address during >>> setup, or pass address with each hypercall. We get the hypercall, >>> translate the address as if it was pio access, and know the destination? >>> >>> >> There are no different hypercalls. There's just one hypercall >> instruction, and there's no standard on how it's used. If a nested call >> issues a hypercall instruction, you have no idea if it's calling a >> Hyper-V hypercall or a vbus/virtio kick. >> > userspace will know which it is, because hypercall capability > in the device has been activated, and can tell kernel, using > something similar to iosignalfd. No? > The host kernel sees a hypercall vmexit. How does it know if it's a nested-guest-to-guest hypercall or a nested-guest-to-host hypercall? The two are equally valid at the same time. >> You could have a protocol where you register the hypercall instruction's >> address with its recipient, but it quickly becomes a tangled mess. >> > I really thought we could pass the io address in register as an input > parameter. Is there a way to do this in a secure manner? > > Hmm. Doesn't kvm use hypercalls now? How does this work with nesting? 
> For example, in this code in arch/x86/kvm/x86.c: > > switch (nr) { > case KVM_HC_VAPIC_POLL_IRQ: > ret = 0; > break; > case KVM_HC_MMU_OP: > r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2),&ret); > break; > default: > ret = -KVM_ENOSYS; > break; > } > > how do we know that it's the guest and not the nested guest performing > the hypercall? > The host knows whether the guest or nested guest are running. If the guest is running, it's a guest-to-host hypercall. If the nested guest is running, it's a nested-guest-to-guest hypercall. We don't have nested-guest-to-host hypercalls (and couldn't unless we get agreement on a protocol from all hypervisor vendors).
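Avi's dispatch rule above can be stated as a one-line function: a hypercall vmexit is attributed entirely by which guest level was executing when it occurred, which is exactly why there is no nested-guest-to-host case. The enums and function name below are illustrative, not KVM code.

```c
#include <assert.h>

enum level  { GUEST, NESTED_GUEST };  /* who was running at vmexit */
enum target { TO_HOST, TO_GUEST };    /* who the hypercall is for  */

/*
 * The routing rule described in the message above: a guest's hypercall
 * goes to the host; a nested guest's hypercall goes to its enclosing
 * guest.  There is no way to express nested-guest-to-host.
 */
enum target route_hypercall(enum level running)
{
    return running == GUEST ? TO_HOST : TO_GUEST;
}
```

This is why address-carrying transports (pio/mmio) sidestep the problem: the address, not the execution level, identifies the destination.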
On Tue, Aug 18, 2009 at 02:15:57PM +0300, Avi Kivity wrote: > On 08/18/2009 02:07 PM, Michael S. Tsirkin wrote: >> On Tue, Aug 18, 2009 at 01:45:05PM +0300, Avi Kivity wrote: >> >>> On 08/18/2009 01:28 PM, Michael S. Tsirkin wrote: >>> >>>> >>>>> Suppose a nested guest has two devices. One a virtual device backed by >>>>> its host (our guest), and one a virtual device backed by us (the real >>>>> host), and assigned by the guest to the nested guest. If both devices >>>>> use hypercalls, there is no way to distinguish between them. >>>>> >>>>> >>>> Not sure I understand. What I had in mind is that devices would have to >>>> either use different hypercalls and map hypercall to address during >>>> setup, or pass address with each hypercall. We get the hypercall, >>>> translate the address as if it was pio access, and know the destination? >>>> >>>> >>> There are no different hypercalls. There's just one hypercall >>> instruction, and there's no standard on how it's used. If a nested call >>> issues a hypercall instruction, you have no idea if it's calling a >>> Hyper-V hypercall or a vbus/virtio kick. >>> >> userspace will know which it is, because hypercall capability >> in the device has been activated, and can tell kernel, using >> something similar to iosignalfd. No? >> > > The host kernel sees a hypercall vmexit. How does it know if it's a > nested-guest-to-guest hypercall or a nested-guest-to-host hypercall? > The two are equally valid at the same time. Here is how this can work - it is similar to MSI if you like: - by default, the device uses pio kicks - nested guest driver can enable hypercall capability in the device, probably with pci config cycle - guest userspace (hypervisor running in guest) will see this request and perform pci config cycle on the "real" device, telling it to which nested guest this device is assigned - host userspace (hypervisor running in host) will see this. 
it now knows both which guest the hypercalls will be for, and that the device in question is an emulated one, and can set up kvm appropriately

>>> You could have a protocol where you register the hypercall instruction's
>>> address with its recipient, but it quickly becomes a tangled mess.
>>>
>> I really thought we could pass the io address in a register as an input
>> parameter. Is there a way to do this in a secure manner?
>>
>> Hmm. Doesn't kvm use hypercalls now? How does this work with nesting?
>> For example, in this code in arch/x86/kvm/x86.c:
>>
>> switch (nr) {
>> case KVM_HC_VAPIC_POLL_IRQ:
>> 	ret = 0;
>> 	break;
>> case KVM_HC_MMU_OP:
>> 	r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), &ret);
>> 	break;
>> default:
>> 	ret = -KVM_ENOSYS;
>> 	break;
>> }
>>
>> how do we know that it's the guest and not the nested guest performing
>> the hypercall?
>>
>
> The host knows whether the guest or nested guest are running. If the
> guest is running, it's a guest-to-host hypercall. If the nested guest
> is running, it's a nested-guest-to-guest hypercall. We don't have
> nested-guest-to-host hypercalls (and couldn't unless we get agreement on
> a protocol from all hypervisor vendors).

Not necessarily. What I am saying is we could make this protocol part of the guest paravirt driver. The guest that loads the driver and enables the capability has to agree to the protocol. If it doesn't want to, it does not have to use that driver.
On 08/18/2009 02:49 PM, Michael S. Tsirkin wrote: > >> The host kernel sees a hypercall vmexit. How does it know if it's a >> nested-guest-to-guest hypercall or a nested-guest-to-host hypercall? >> The two are equally valid at the same time. >> > Here is how this can work - it is similar to MSI if you like: > - by default, the device uses pio kicks > - nested guest driver can enable hypercall capability in the device, > probably with pci config cycle > - guest userspace (hypervisor running in guest) will see this request > and perform pci config cycle on the "real" device, telling it to which > nested guest this device is assigned > So far so good. > - host userspace (hypervisor running in host) will see this. > it now knows both which guest the hypercalls will be for, > and that the device in question is an emulated one, > and can set up kvm appropriately > No it doesn't. The fact that one device uses hypercalls doesn't mean all hypercalls are for that device. Hypercalls are a shared resource, and there's no way to tell for a given hypercall what device it is associated with (if any). >> The host knows whether the guest or nested guest are running. If the >> guest is running, it's a guest-to-host hypercall. If the nested guest >> is running, it's a nested-guest-to-guest hypercall. We don't have >> nested-guest-to-host hypercalls (and couldn't unless we get agreement on >> a protocol from all hypervisor vendors). >> > Not necessarily. What I am saying is we could make this protocol part of > guest paravirt driver. the guest that loads the driver and enables the > capability, has to agree to the protocol. If it doesn't want to, it does > not have to use that driver. > It would only work for kvm-on-kvm.
Anthony Liguori wrote:
> Gregory Haskins wrote:
>> Note: No one has ever proposed to change the virtio-ABI.
>
> virtio-pci is part of the virtio ABI. You are proposing changing that.

I'm sorry, but I respectfully disagree with you here. virtio has an ABI...I am not modifying that. virtio-pci has an ABI...I am not modifying that either. The subsystem in question is virtio-vbus, and is a completely standalone addition to the virtio ecosystem.

By your argument, virtio and virtio-pci should fuse together, and virtio-lguest and virtio-s390 should go away because they diverge from the virtio-pci ABI, right? I seriously doubt you would agree with that statement. The fact is, the design of virtio not only permits modular replacement of its transport ABI, it encourages it. So how is virtio-vbus any different from the other three?

I understand that it means you need to load a new driver in the guest, and I am ok with that. virtio-pci was once a non-upstream driver too and required someone to explicitly load it, didn't it? You gotta crawl before you can walk...

> You cannot add new kernel modules to guests and expect them to remain
> supported.

??? Of course you can. How is this different from any other driver?

> So there is value in reusing existing ABIs

Well, I won't argue with you on that one. There is certainly value there. My contention is that sometimes the liability of that ABI is greater than its value, and that's when it's time to evaluate the design decisions that led to re-use vs re-design.

>>> I think the reason vbus gets better performance for networking today is
>>> that vbus' backends are in the kernel while virtio's backends are
>>> currently in userspace.
>>>

>> Well, with all due respect, you also said initially when I announced
>> vbus that in-kernel doesn't matter, and tried to make virtio-net run as
>> fast as venet from userspace ;)  Given that we never saw those userspace
>> patches from you that in fact equaled my performance, I assume you were
>> wrong about that statement.  Perhaps you were wrong about other things
>> too?
>>
>
> I'm wrong about a lot of things :-)  I haven't yet been convinced that
> I'm wrong here though.
>
> One of the gray areas here is what constitutes an in-kernel backend.
> tun/tap is a sort of an in-kernel backend.  Userspace is still involved
> in all of the paths.  vhost seems to be an intermediate step between
> tun/tap and vbus.  The fast paths avoid userspace completely.  Many of
> the slow paths involve userspace still (like migration apparently).
> With vbus, userspace is avoided entirely.  In some ways, you could argue
> that slirp and vbus are opposite ends of the virtual I/O spectrum.
>
> I believe strongly that we should avoid putting things in the kernel
> unless they absolutely have to be.

I would generally agree with you on that.  Particularly in the case of
kvm, having slow-path bus-management code in-kernel is not strictly
necessary because KVM has qemu in userspace.

The issue here is that vbus is designed to be a generic solution to
in-kernel virtual-IO.  It will support (via abstraction of key
subsystems) a variety of environments that may or may not be similar in
facilities to KVM, and therefore it represents the
least-common-denominator as far as what external dependencies it
requires.

The bottom line is this: despite the tendency for people to jump at
"don't put much in the kernel!", the fact is that a "bus" designed for
software to software (such as vbus) is almost laughably trivial.  It's
essentially a list of objects that have an int (dev-id) and char*
(dev-type) attribute.
All the extra goo that you see me setting up in something like the
kvm-connector needs to be done for fast-path _anyway_, so transporting
the verbs to query this simple list is not really a big deal.  If we
were talking about full ICH emulation for a PCI bus, I would agree with
you.  In the case of vbus, I think it's overstated.

> I'm definitely interested in playing
> with vhost to see if there are ways to put even less in the kernel.  In
> particular, I think it would be a big win to avoid knowledge of slots in
> the kernel by doing ring translation in userspace.

Ultimately I think that would not be a very good proposition.  Ring
translation is actually not that hard, and that would definitely be a
measurable latency source to try and do as you propose.  But, I will not
discourage you from trying if that is what you want to do.

> This implies a
> userspace transition in the fast path.  This may or may not be
> acceptable.  I think this is going to be a very interesting experiment
> and will ultimately determine whether my intuition about the cost of
> dropping to userspace is right or wrong.

I can already tell you it's wrong, just based on the fact that even
extra kthread switches can hurt from my own experience playing in this
area...

>
>
>> Conversely, I am not afraid of requiring a new driver to optimize the
>> general PV interface.  In the long term, this will reduce the amount of
>> reimplementing the same code over and over, reduce system overhead, and
>> it adds new features not previously available (for instance, coalescing
>> and prioritizing interrupts).
>>
>
> I think you have a lot of ideas and I don't know that we've been able to
> really understand your vision.  Do you have any plans on writing a paper
> about vbus that goes into some of your thoughts in detail?

I really need to, I know...

>
>>> If that's the case, then I don't see any
>>> reason to adopt vbus unless Greg thinks there are other compelling
>>> features over virtio.
>>>

>> Aside from the fact that this is another confusion of the vbus/virtio
>> relationship...yes, of course there are compelling features (IMHO) or I
>> wouldn't be expending effort ;)  They are at least compelling enough to
>> put in AlacrityVM.
>
> This whole AlacrityVM thing is really hitting this nail with a
> sledgehammer.

Note that I didn't really want to go that route.  As you know, I tried
pushing this straight through kvm first since earlier this year, but I
was met with reluctance to even bother truly understanding what I was
proposing, comments like "tell me your ideas so I can steal them", and
"sorry, we are going to reinvent our own instead".  This isn't exactly
going to motivate someone to continue pushing these ideas within that
community.  I was made to feel (purposely?) unwelcome at times.  So I
can either roll over and die, or start my own project.

In addition, almost all of vbus is completely independent of kvm anyway
(I think there are only 3 patches that actually touch KVM, and they are
relatively minor).  And vbus doesn't really fit into any other category
of maintained subsystem either.  So it really calls for a new branch of
maintainership, which I currently hold.  AlacrityVM will serve as the
collaboration point of that maintainership.

The bottom line is, there are people out there who are interested in
what we are doing (and that number grows every day).  Starting a new
project wasn't what I wanted per se, but I don't think there was much
choice.

> While the kernel needs to be very careful about what it
> pulls in, as long as you're willing to commit to ABI compatibility, we
> can pull code into QEMU to support vbus.  Then you can just offer vbus
> host and guest drivers instead of forking the kernel.

Ok, I will work on pushing those patches next.

>
>> If upstream KVM doesn't want them, that's KVM's
>> decision and I am fine with that.  Simply never apply my qemu patches to
>> qemu-kvm.git, and KVM will be blissfully unaware if vbus is present.
>
> As I mentioned before, if you submit patches to upstream QEMU, we'll
> apply them (after appropriate review).  As I said previously, we want to
> avoid user confusion as much as possible.  Maybe this means limiting it
> to -device or a separate machine type.  I'm not sure, but that's
> something we can discuss on qemu-devel.

Ok.

Kind Regards,
-Greg
On 08/18/2009 04:16 PM, Gregory Haskins wrote:
> The issue here is that vbus is designed to be a generic solution to
> in-kernel virtual-IO.  It will support (via abstraction of key
> subsystems) a variety of environments that may or may not be similar in
> facilities to KVM, and therefore it represents the
> least-common-denominator as far as what external dependencies it requires.
>

Maybe it will be easier to evaluate it in the context of these other
environments.  It's difficult to assess this without an example.

> The bottom line is this: despite the tendency for people to jump at
> "don't put much in the kernel!", the fact is that a "bus" designed for
> software to software (such as vbus) is almost laughably trivial.  Its
> essentially a list of objects that have an int (dev-id) and char*
> (dev-type) attribute.  All the extra goo that you see me setting up in
> something like the kvm-connector needs to be done for fast-path
> _anyway_, so transporting the verbs to query this simple list is not
> really a big deal.
>

It's not laughably trivial when you try to support the full feature set
of kvm (for example, live migration will require dirty memory tracking,
and exporting all state stored in the kernel to userspace).

> Note that I didn't really want to go that route.  As you know, I tried
> pushing this straight through kvm first since earlier this year, but I
> was met with reluctance to even bother truly understanding what I was
> proposing, comments like "tell me your ideas so I can steal them", and
>

Oh come on, I wrote "steal" as a convenient shorthand for
"cross-pollinate your ideas into our code according to the letter and
spirit of the GNU General Public License".  Since we're all trying to
improve Linux we may as well cooperate.

> "sorry, we are going to reinvent our own instead".

No.  Adopting venet/vbus would mean reinventing something that already
existed.  Continuing to support virtio/pci is not reinventing anything.
> This isn't exactly
> going to motivate someone to continue pushing these ideas within that
> community.  I was made to feel (purposely?) unwelcome at times.  So I
> can either roll over and die, or start my own project.
>

You haven't convinced me that your ideas are worth the effort of
abandoning virtio/pci or maintaining both venet/vbus and virtio/pci.
I'm sorry if that made you feel unwelcome.  There's no reason to
interpret disagreement as malice though.
Avi Kivity wrote:
> On 08/17/2009 10:33 PM, Gregory Haskins wrote:
>>
>> There is a secondary question of venet (a vbus native device) versus
>> virtio-net (a virtio native device that works with PCI or VBUS).  If
>> this contention is really around venet vs virtio-net, I may possibly
>> concede and retract its submission to mainline.  I've been pushing it to
>> date because people are using it and I don't see any reason that the
>> driver couldn't be upstream.
>>
>
> That's probably the cause of much confusion.  The primary kvm pain point
> is now networking, so in any vbus discussion we're concentrating on that
> aspect.
>
>>> Also, are you willing to help virtio to become faster?
>>>
>> Yes, that is not a problem.  Note that virtio in general, and
>> virtio-net/venet in particular are not the primary goal here, however.
>> Improved 802.x and block IO are just positive side-effects of the
>> effort.  I started with 802.x networking just to demonstrate the IO
>> layer capabilities, and to test it.  It ended up being so good in
>> contrast to existing facilities, that developers in the vbus community
>> started using it for production development.
>>
>> Ultimately, I created vbus to address areas of performance that have not
>> yet been addressed in things like KVM.  Areas such as real-time guests,
>> or RDMA (host bypass) interfaces.
>
> Can you explain how vbus achieves RDMA?
>
> I also don't see the connection to real time guests.

Both of these are still in development.  Trying to stay true to the
"release early and often" mantra, the core vbus technology is being
pushed now so it can be reviewed.  Stay tuned for these other
developments.

>
>> I also designed it in such a way that
>> we could, in theory, write one set of (linux-based) backends, and have
>> them work across a variety of environments (such as containers/VMs like
>> KVM, lguest, openvz, but also physical systems like blade enclosures and
>> clusters, or even applications running on the host).
>>
>
> Sorry, I'm still confused.  Why would openvz need vbus?

It's just an example.  The point is that I abstracted what I think are
the key points of fast-io, memory routing, signal routing, etc, so that
it will work in a variety of (ideally, _any_) environments.

There may not be _performance_ motivations for certain classes of VMs
because they already have decent support, but they may want a connector
anyway to gain some of the new features available in vbus.  And looking
forward, the idea is that we have commoditized the backend so we don't
need to redo this each time a new container comes along.

> It already has
> zero-copy networking since it's a shared kernel.  Shared memory should
> also work seamlessly, you just need to expose the shared memory object
> on a shared part of the namespace.  And of course, anything in the
> kernel is already shared.
>
>>> Or do you
>>> have arguments why that is impossible to do so and why the only
>>> possible solution is vbus?  Avi says no such arguments were offered
>>> so far.
>>>
>> Not for lack of trying.  I think my points have just been missed
>> every time I try to describe them. ;)  Basically I write a message very
>> similar to this one, and the next conversation starts back from square
>> one.  But I digress, let me try again..
>>
>> Noting that this discussion is really about the layer *below* virtio,
>> not virtio itself (e.g. PCI vs vbus).  Let's start with a little
>> background:
>>
>> -- Background --
>>
>> So on one level, we have the resource-container technology called
>> "vbus".  It lets you create a container on the host, fill it with
>> virtual devices, and assign that container to some context (such as a
>> KVM guest).  These "devices" are LKMs, and each device has a very simple
>> verb namespace consisting of a synchronous "call()" method, and a
>> "shm()" method for establishing async channels.
>>
>> The async channels are just shared-memory with a signal path (e.g.
>> interrupts and hypercalls), which the device+driver can use to overlay
>> things like rings (virtqueues, IOQs), or other shared-memory based
>> constructs of their choosing (such as a shared table).  The signal path
>> is designed to minimize enter/exits and reduce spurious signals in a
>> unified way (see shm-signal patch).
>>
>> call() can be used both for config-space like details, as well as
>> fast-path messaging that requires synchronous behavior (such as guest
>> scheduler updates).
>>
>> All of this is managed via sysfs/configfs.
>>
>
> One point of contention is that this is all managementy stuff and should
> be kept out of the host kernel.  Exposing shared memory, interrupts, and
> guest hypercalls can all be easily done from userspace (as virtio
> demonstrates).  True, some devices need kernel acceleration, but that's
> no reason to put everything into the host kernel.

See my last reply to Anthony.  My two points here are that:

a) having it in-kernel makes it a complete subsystem, which perhaps has
diminished value in kvm, but adds value in most other places that we are
looking to use vbus.

b) the in-kernel code is being overstated as "complex".  We are not
talking about your typical virt thing, like an emulated ICH/PCI chipset.
It's really a simple list of devices with a handful of attributes.  They
are managed using established linux interfaces, like sysfs/configfs.

>
>> On the guest, we have a "vbus-proxy" which is how the guest gets access
>> to devices assigned to its container.  (as an aside, "virtio" devices
>> can be populated in the container, and then surfaced up to the
>> virtio-bus via that virtio-vbus patch I mentioned).
>>
>> There is a thing called a "vbus-connector" which is the guest specific
>> part.  Its job is to connect the vbus-proxy in the guest, to the vbus
>> container on the host.
>> How it does its job is specific to the connector
>> implementation, but its role is to transport messages between the guest
>> and the host (such as for call() and shm() invocations) and to handle
>> things like discovery and hotswap.
>>
>
> virtio has an exact parallel here (virtio-pci and friends).
>
>> Out of all this, I think the biggest contention point is the design of
>> the vbus-connector that I use in AlacrityVM (Avi, correct me if I am
>> wrong and you object to other aspects as well).  I suspect that if I had
>> designed the vbus-connector to surface vbus devices as PCI devices via
>> QEMU, the patches would potentially have been pulled in a while ago.
>>
>
> Exposing devices as PCI is an important issue for me, as I have to
> consider non-Linux guests.

That's your prerogative, but obviously not everyone agrees with you.
Getting non-Linux guests to work is my problem if you choose not to be
part of the vbus community.

> Another issue is the host kernel management code which I believe is
> superfluous.

In your opinion, right?

>
> But the biggest issue is compatibility.  virtio exists and has Windows
> and Linux drivers.  Without a fatal flaw in virtio we'll continue to
> support it.

So go ahead.

> Given that, why spread to a new model?

Note: I haven't asked you to (at least, not since April with the vbus-v3
release).  Spreading to a new model is currently the role of the
AlacrityVM project, since we disagree on the utility of a new model.

>
> Of course, I understand you're interested in non-ethernet, non-block
> devices.  I can't comment on these until I see them.  Maybe they can fit
> the virtio model, and maybe they can't.

Yes, of that I am not sure.  They may.  I will certainly explore that
angle at some point.

>
>> There are, of course, reasons why vbus does *not* render as PCI, so this
>> is the meat of your question, I believe.
>>
>> At a high level, PCI was designed for software-to-hardware interaction,
>> so it makes assumptions about that relationship that do not necessarily
>> apply to virtualization.
>>
>> For instance:
>>
>> A) hardware can only generate byte/word sized requests at a time because
>> that is all the pcb-etch and silicon support.  So hardware is usually
>> expressed in terms of some number of "registers".
>>
>
> No, hardware happily DMAs to and fro main memory.

Yes, now walk me through how you set up DMA to do something like a call
when you do not know addresses a priori.  Hint: count the number of
MMIO/PIOs you need.  If the number is > 1, you've lost.

> Some hardware of
> course uses mmio registers extensively, but not virtio hardware.  With
> the recent MSI support no registers are touched in the fast path.

Note we are not talking about virtio here.  Just raw PCI and why I
advocate vbus over it.

>
>> C) the target end-point has no visibility into the CPU machine state
>> other than the parameters passed in the bus-cycle (usually an address
>> and data tuple).
>>
>
> That's not an issue.  Accessing memory is cheap.
>
>> D) device-ids are in a fixed width register and centrally assigned from
>> an authority (e.g. PCI-SIG).
>>
>
> That's not an issue either.  Qumranet/Red Hat has donated a range of
> device IDs for use in virtio.

Yes, and to get one you have to do what?  Register it with kvm.git,
right?  Kind of like registering a MAJOR/MINOR, would you agree?  Maybe
you do not mind (especially given your relationship to kvm.git), but
there are disadvantages to that model for most of the rest of us.

> Device IDs are how devices are associated
> with drivers, so you'll need something similar for vbus.

Nope, just like you don't need to do anything ahead of time for using a
dynamic misc-device name.  You just have both the driver and device know
what they are looking for (it's part of the ABI).

>
>> E) Interrupt/MSI routing is per-device oriented
>>
>
> Please elaborate.
What is the issue?  How does vbus solve it?

There are no "interrupts" in vbus...only shm-signals.  You can establish
an arbitrary number of shm regions, each with an optional shm-signal
associated with it.  To do this, the driver calls dev->shm(), and you
get back a shm_signal object.

Underneath the hood, the vbus-connector (e.g. vbus-pcibridge) decides
how it maps real interrupts to shm-signals (on a system level, not per
device).  This can be 1:1, or any other scheme.  vbus-pcibridge uses one
system-wide interrupt per priority level (today this is 8 levels), each
with an IOQ based event channel.  "signals" come as an event on that
channel.

So the "issue" is that you have no real choice with PCI.  You just get
device oriented interrupts.  With vbus, it's abstracted.  So you can
still get per-device standard MSI, or you can do fancier things like
coalescing and prioritization.

>
>> F) Interrupts/MSI are assumed cheap to inject
>>
>
> Interrupts are not assumed cheap; that's why interrupt mitigation is
> used (on real and virtual hardware).

It's all relative.  IDT dispatch and EOI overhead are "baseline" on real
hardware, whereas the vmenters and vmexits are significantly more
expensive on virt (and you have new exit causes, like irq-windows, etc,
that do not exist in real HW).

>
>> G) Interrupts/MSI are non-prioritizable.
>>
>
> They are prioritizable; Linux ignores this though (Windows doesn't).
> Please elaborate on what the problem is and how vbus solves it.

It doesn't work right.  The x86 sense of interrupt priority is, sorry to
say it, half-assed at best.  I've worked with embedded systems that have
real interrupt priority support in the hardware, end to end, including
the PIC.  The LAPIC on the other hand is really weak in this dept, and
as you said, Linux doesn't even attempt to use what's there.

>
>> H) Interrupts/MSI are statically established
>>
>
> Can you give an example of why this is a problem?
Some of the things we are building use the model of having a device that
hands out shm-signals in response to guest events (say, the creation of
an IPC channel).  This would generally be handled by a specific device
model instance, and it would need to do this without pre-declaring the
MSI vectors (to use PCI as an example).

>
>> These assumptions and constraints may be completely different or simply
>> invalid in a virtualized guest.  For instance, the hypervisor is just
>> software, and therefore it's not restricted to "etch" constraints.  IO
>> requests can be arbitrarily large, just as if you are invoking a library
>> function-call or OS system-call.  Likewise, each one of those requests is
>> a branch and a context switch, so it often has greater performance
>> implications than a simple register bus-cycle in hardware.  If you use
>> an MMIO variant, it has to run through the page-fault code to be decoded.
>>
>> The result is typically decreased performance if you try to do the same
>> thing real hardware does.  This is why you usually see hypervisor
>> specific drivers (e.g. virtio-net, vmnet, etc) as a common feature.
>>
>> _Some_ performance oriented items can technically be accomplished in
>> PCI, albeit in a much more awkward way.  For instance, you can set up a
>> really fast, low-latency "call()" mechanism using a PIO port on a
>> PCI-model and ioeventfd.  As a matter of fact, this is exactly what the
>> vbus pci-bridge does:
>>
>
> What performance oriented items have been left unaddressed?

Well, the interrupt model, to name one.

>
> virtio and vbus use three communications channels: call from guest to
> host (implemented as pio and reasonably fast), call from host to guest
> (implemented as msi and reasonably fast) and shared memory (as fast as
> it can be).  Where does PCI limit you in any way?
>
>> The problem here is that this is incredibly awkward to set up.  You have
>> all that per-cpu goo and the registration of the memory on the guest.
>> And on the host side, you have all the vmapping of the registered
>> memory, and the file-descriptor to manage.  In short, it's really
>> painful.
>>
>> I would much prefer to do this *once*, and then let all my devices
>> simply re-use that infrastructure.  This is, in fact, what I do.  Here
>> is the device model that a guest sees:
>>
>
> virtio also reuses the pci code, on both guest and host.
>
>> Moving on: _Other_ items cannot be replicated (at least, not without
>> hacking it into something that is no longer PCI).
>>
>> Things like the pci-id namespace are just silly for software.  I would
>> rather have a namespace that does not require central management so
>> people are free to create vbus-backends at will.  This is akin to
>> registering a device MAJOR/MINOR, versus using the various dynamic
>> assignment mechanisms.  vbus uses a string identifier in place of a
>> pci-id.  This is superior IMHO, and not compatible with PCI.
>>
>
> How do you handle conflicts?  Again you need a central authority to hand
> out names or prefixes.

Not really, no.  If you really wanted to be formal about it, you could
adopt any series of UUID schemes.  For instance, perhaps venet should be
"com.novell::virtual-ethernet".  Heck, I could use uuidgen.

>
>> As another example, the connector design coalesces *all* shm-signals
>> into a single interrupt (by prio) that uses the same context-switch
>> mitigation techniques that help boost things like networking.  This
>> effectively means we can detect and optimize out ack/eoi cycles from the
>> APIC as the IO load increases (which is when you need it most).  PCI has
>> no such concept.
>>
>
> That's a bug, not a feature.  It means poor scaling as the number of
> vcpus increases and as the number of devices increases.

So the "avi-vbus-connector" can use 1:1, if you prefer.
Large vcpu counts (which are not typical) and irq-affinity are not a
target application for my design, so I prefer the coalescing model in
the vbus-pcibridge included in this series.  YMMV

Note: If you really wanted to, you could have priority queues per-cpu,
and get the best of both worlds (irq routing and coalescing/priority).

>
> Note nothing prevents steering multiple MSIs into a single vector.  It's
> a bad idea though.

Yes, it is a bad idea...and not the same thing either.  This would
effectively create a shared-line scenario in the irq code, which is not
what happens in vbus.

>
>> In addition, the signals and interrupts are priority aware, which is
>> useful for things like 802.1p networking where you may establish 8-tx
>> and 8-rx queues for your virtio-net device.  x86 APIC really has no
>> usable equivalent, so PCI is stuck here.
>>
>
> x86 APIC is priority aware.

Have you ever tried to use it?

>
>> Also, the signals can be allocated on-demand for implementing things
>> like IPC channels in response to guest requests since there is no
>> assumption about device-to-interrupt mappings.  This is more flexible.
>>
>
> Yes.  However given that vectors are a scarce resource you're severely
> limited in that.

The connector I am pushing out does not have this limitation.

> And if you're multiplexing everything on one vector,
> then you can just as well demultiplex your channels in the virtio driver
> code.

Only per-device, not system wide.

>
>> And through all of this, this design would work in any guest even if it
>> doesn't have PCI (e.g. lguest, UML, physical systems, etc).
>>
>
> That is true for virtio which works on pci-less lguest and s390.

Yes, and lguest and s390 had to build their own bus-model to do it,
right?  Thank you for bringing this up, because it is one of the main
points here.  What I am trying to do is generalize the bus to prevent
the proliferation of more of these isolated models in the future.
Build one fast, in-kernel model so that we wouldn't need virtio-X and
virtio-Y in the future.  They can just reuse the (performance optimized)
bus and models, and only need to build the connector to bridge them.

>
>> -- Bottom Line --
>>
>> The idea here is to generalize all the interesting parts that are common
>> (fast sync+async io, context-switch mitigation, back-end models, memory
>> abstractions, signal-path routing, etc) that a variety of linux based
>> technologies can use (kvm, lguest, openvz, uml, physical systems) and
>> only require the thin "connector" code to port the system around.  The
>> idea is to try to get this aspect of PV right once, and at some point in
>> the future, perhaps vbus will be as ubiquitous as PCI.  Well, perhaps
>> not *that* ubiquitous, but you get the idea ;)
>>
>
> That is exactly the design goal of virtio (except it limits itself to
> virtualization).

No, virtio is only part of the picture.  It does not include the backend
models, or the memory/signal-path abstraction for in-kernel use, for
instance.  But otherwise, virtio as a device model is compatible with
vbus as a bus model.  They complement one another.

>
>> Then device models like virtio can ride happily on top and we end up
>> with a really robust and high-performance Linux-based stack.  I don't
>> buy the argument that we already have PCI so let's use it.  I don't
>> think it's the best design and I am not afraid to make an investment in
>> a change here because I think it will pay off in the long run.
>>
>
> Sorry, I don't think you've shown any quantifiable advantages.

We can agree to disagree then, eh?  There are certainly quantifiable
differences.  Waving your hand at the differences to say they are not
advantages is merely an opinion, one that is not shared universally.

The bottom line is all of these design distinctions are encapsulated
within the vbus subsystem and do not affect the kvm code-base.
So agreement with kvm upstream is not a requirement, but would be
advantageous for collaboration.

Kind Regards,
-Greg
Michael S. Tsirkin wrote:
> On Mon, Aug 17, 2009 at 04:17:09PM -0400, Gregory Haskins wrote:
>> Michael S. Tsirkin wrote:
>>> On Mon, Aug 17, 2009 at 10:14:56AM -0400, Gregory Haskins wrote:
>>>> Case in point: Take an upstream kernel and you can modprobe the
>>>> vbus-pcibridge in and virtio devices will work over that transport
>>>> unmodified.
>>>>
>>>> See http://lkml.org/lkml/2009/8/6/244 for details.
>>> The modprobe you are talking about would need
>>> to be done in guest kernel, correct?
>> Yes, and your point is?  "unmodified" (pardon the pseudo-pun) modifies
>> "virtio", not "guest".
>> It means you can take an off-the-shelf kernel
>> with off-the-shelf virtio (ala distro-kernel) and modprobe
>> vbus-pcibridge and get alacrityvm acceleration.
>
> Heh, by that logic ksplice does not modify running kernel either :)

Sigh...this is just FUD.  Again, I never said I do not modify the guest.
I only said that virtio is unmodified and all the existing devices can
work unmodified.  I hardly think it's fair to compare loading a
pci-bridge driver into a running kernel with patching the kernel.  You
just load a driver to get access to your IO resources...standard stuff
really.

>
>> It is not a design goal of mine to forbid the loading of a new driver,
>> so I am ok with that requirement.
>>
>>>> OTOH, Michael's patch is purely targeted at improving virtio-net on kvm,
>>>> and it's likewise constrained by various limitations of that decision
>>>> (such as its reliance on the PCI model, and the kvm memory scheme).
>>> vhost is actually not related to PCI in any way.  It simply leaves all
>>> setup for userspace to do.  And the memory scheme was intentionally
>>> separated from kvm so that it can easily support e.g. lguest.
>>>
>> I think you have missed my point.  I mean that vhost requires a separate
>> bus-model (ala qemu-pci).
>
> So?  That can be in userspace, and can be anything including vbus.

-ENOPARSE.  Can you elaborate?
>
>> And no, your memory scheme is not separated,
>> at least, not very well.  It still assumes memory-regions and
>> copy_to_user(), which is very kvm-esque.
>
> I don't think so: works for lguest, kvm, UML and containers

kvm-_esque_, meaning anything that follows the region+copy_to_user
model.  Not all things do.

>
>> Vbus has people using things
>> like userspace containers (no regions),
>
> vhost by default works without regions

That's a start, but not good enough if you were trying to achieve the
same thing as vbus.  As I said before, I've never said you had to
achieve the same thing, but do note they are distinctly different with
different goals.  You are solving a directed problem.  I am solving a
general problem, and trying to solve it once.

>
>> and physical hardware (dma
>> controllers, so no regions or copy_to_user) so your scheme quickly falls
>> apart once you get away from KVM.
>
> Someone took a driver and is building hardware for it ... so what?

What is your point?

>
>> Don't get me wrong: That design may have its place.  Perhaps you only
>> care about fixing KVM, which is a perfectly acceptable strategy.
>> It's just not a strategy that I think is the best approach.  Essentially
>> you are promoting the proliferation of competing backends, and I am
>> trying to unify them (which is ironic that this thread started with
>> concerns I was fragmenting things ;).
>
> So, you don't see how venet fragments things?  It's pretty obvious ...

I never said it doesn't.  venet started as a test harness, but now it is
inadvertently fragmenting the virtio-net effort.  I admit it.  It wasn't
intentional, but just worked out that way.  Until your vhost idea is
vetted and benchmarked, it's not even in the running.  Venet is
currently the highest performing 802.x acceleration for KVM that I am
aware of, so it will continue to garner interest from users concerned
with performance.  But likewise, vhost has the potential to fragment the
back-end model.  That was my point.
>
>> The bottom line is, you have a simpler solution that is more finely
>> targeted at KVM and virtio-networking.  It fixes probably a lot of
>> problems with the existing implementation, but it still has limitations.
>>
>> OTOH, what I am promoting is more complex, but more flexible.  That is
>> the tradeoff.  You can't have both ;)
>
> We can.  connect eventfds to hypercalls, and vhost will work with vbus.

-ENOPARSE.  vbus doesn't use hypercalls, and I do not see why or how you
would connect two backend models together like this.  Can you elaborate?

>
>> So do not for one second think
>> that what you implemented is equivalent, because they are not.
>>
>> In fact, I believe I warned you about this potential problem when you
>> decided to implement your own version.  I think I said something to the
>> effect of "you will either have a subset of functionality, or you will
>> ultimately reinvent what I did".  Right now you are in the subset phase.
>
> No.  Unlike vbus, vhost supports unmodified guests and live migration.

By "subset", I am referring to your interfaces and the scope of their
applicability.  The things you need to do to make vhost work and a vbus
device work from a memory and signaling abstraction POV are going to be
extremely similar.  The difference in how the guest sees these backends
is all contained in the vbus-connector.

Therefore, what you *could* have done is simply written a connector that
does something like only support "virtio" backends, and surfaced them as
regular PCI devices to the guest.  Then you could have reused all the
abstraction features in vbus, instead of reinventing them (case in
point, your region+copy_to_user code).  And likewise, anyone using vbus
could use your virtio-net backend.

Instead, I am still left with no virtio-net backend implemented, and you
were left designing, writing, and testing facilities that I've already
completed.  So it was duplicative effort.

Kind Regards,
-Greg
Michael S. Tsirkin wrote:
> On Mon, Aug 17, 2009 at 03:33:30PM -0400, Gregory Haskins wrote:
>> There is a secondary question of venet (a vbus native device) versus virtio-net (a virtio native device that works with PCI or VBUS). If this contention is really around venet vs. virtio-net, I may possibly concede and retract its submission to mainline.
>
> For me yes, venet+ioq competing with virtio+virtqueue.
>
>> I've been pushing it to date because people are using it and I don't see any reason that the driver couldn't be upstream.
>
> If virtio is just as fast, they can just use it without knowing it. Clearly, that's better since we support virtio anyway ...

More specifically: kvm can support whatever it wants. I am not asking kvm to support venet. If we (the alacrityvm community) decide to keep maintaining venet, _we_ will support it, and I have no problem with that. As of right now, we are doing some interesting things with it in the lab, and it's certainly more flexible for us as a platform since we maintain the ABI and feature set.

So for now, I do not think it's a big deal if they both co-exist, and it has no bearing on KVM upstream.

>
>> -- Issues --
>>
>> Out of all this, I think the biggest contention point is the design of the vbus-connector that I use in AlacrityVM (Avi, correct me if I am wrong and you object to other aspects as well). I suspect that if I had designed the vbus-connector to surface vbus devices as PCI devices via QEMU, the patches would potentially have been pulled in a while ago.
>>
>> There are, of course, reasons why vbus does *not* render as PCI, so this is the meat of your question, I believe.
>>
>> At a high level, PCI was designed for software-to-hardware interaction, so it makes assumptions about that relationship that do not necessarily apply to virtualization.
>
> I'm not hung up on PCI, myself. An idea that might help you get Avi on-board: do setup in userspace, over PCI.
Note that this is exactly what I do. In AlacrityVM, the guest learns of the available acceleration by the presence of the PCI-BRIDGE. It then uses that bridge, using standard PCI mechanisms, to set everything up in the slow-path.

> Negotiate hypercall support (e.g. with a PCI capability) and then switch to that for fastpath.

Hmm?

>
>> As another example, the connector design coalesces *all* shm-signals into a single interrupt (by prio) that uses the same context-switch mitigation techniques that help boost things like networking. This effectively means we can detect and optimize out ack/eoi cycles from the APIC as the IO load increases (which is when you need it most). PCI has no such concept.
>
> Could you elaborate on this one for me? How does context-switch mitigation work?

What I did was commoditize the concept of signal mitigation. I then reuse that concept all over the place to do "NAPI"-like mitigation of the signal path for everything: for individual interrupts, of course, but also for things like hypercalls, kthread wakeups, and the interrupt controller too.

>
>> In addition, the signals and interrupts are priority aware, which is useful for things like 802.1p networking where you may establish 8-tx and 8-rx queues for your virtio-net device. x86 APIC really has no usable equivalent, so PCI is stuck here.
>
> By the way, multiqueue support in virtio would be very nice to have,

Actually what I am talking about is a little different than MQ, but I agree that both priority-based and concurrency-based MQ would require similar facilities.

> and seems mostly unrelated to vbus.

Mostly, but not totally. The priority stuff wouldn't work quite right without similar provisions to the entire signal path, like vbus does.

Kind Regards,
-Greg
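[Editor's note: the "NAPI-like" signal mitigation discussed above follows a common disable/drain/re-enable pattern. Below is a rough single-threaded user-space sketch of that pattern; the `shm_signal` names are illustrative only, not the actual vbus API, and real code would additionally need memory barriers and atomics for the producer/consumer race.]

```c
#include <stdbool.h>

/* Illustrative shm_signal-style state shared between producer and consumer. */
struct shm_signal {
	bool enabled;  /* consumer currently wants notifications */
	bool pending;  /* producer has posted work */
	int  notifies; /* how many "interrupts" were actually raised */
};

/* Producer side: only notify while the consumer has notifications enabled;
 * further posts are coalesced until the consumer re-enables. */
static void shm_signal_inject(struct shm_signal *s)
{
	s->pending = true;
	if (s->enabled) {
		s->enabled = false; /* subsequent posts piggy-back on this one */
		s->notifies++;      /* stand-in for raising an interrupt/exit */
	}
}

/* Consumer side: drain all work, then re-enable and re-check to close the
 * window where the producer posted after the drain but before the enable. */
static int shm_signal_drain(struct shm_signal *s)
{
	int work = 0;

	do {
		while (s->pending) {
			s->pending = false;
			work++;
		}
		s->enabled = true;
	} while (s->pending);

	return work;
}
```

Under load, ten back-to-back injections before the consumer runs cost one notification instead of ten, which is the ack/EOI saving being claimed for the coalescing model.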
Avi Kivity wrote:
> On 08/18/2009 04:16 PM, Gregory Haskins wrote:
>> The issue here is that vbus is designed to be a generic solution to in-kernel virtual-IO. It will support (via abstraction of key subsystems) a variety of environments that may or may not be similar in facilities to KVM, and therefore it represents the least-common-denominator as far as what external dependencies it requires.
>
> Maybe it will be easier to evaluate it in the context of these other environments. It's difficult to assess this without an example.

When they are ready, I will cross-post the announcement to KVM.

>
>> The bottom line is this: despite the tendency for people to jump at "don't put much in the kernel!", the fact is that a "bus" designed for software-to-software (such as vbus) is almost laughably trivial. It's essentially a list of objects that have an int (dev-id) and char* (dev-type) attribute. All the extra goo that you see me setting up in something like the kvm-connector needs to be done for fast-path _anyway_, so transporting the verbs to query this simple list is not really a big deal.
>
> It's not laughably trivial when you try to support the full feature set of kvm (for example, live migration will require dirty memory tracking, and exporting all state stored in the kernel to userspace).

Doesn't vhost suffer from the same issue? If not, could I also apply the same technique to support live migration in vbus?

>
>> Note that I didn't really want to go that route. As you know, I tried pushing this straight through kvm first since earlier this year, but I was met with reluctance to even bother truly understanding what I was proposing, comments like "tell me your ideas so I can steal them", and
>
> Oh come on, I wrote "steal" as a convenient shorthand for "cross-pollinate your ideas into our code according to the letter and spirit of the GNU General Public License".
Is that supposed to make me feel better about working with you? I mean, writing, testing, and polishing patches for LKML-type submission is time consuming. If all you are going to do is take those ideas and rewrite them yourself, why should I go through that effort? And it's not like that was the first time you have said that to me.

> Since we're all trying to improve Linux we may as well cooperate.

Well, I don't think anyone can say that I haven't been trying.

>
>> "sorry, we are going to reinvent our own instead".
>
> No. Adopting venet/vbus would mean reinventing something that already existed.

But yet, it doesn't.

> Continuing to support virtio/pci is not reinventing anything.

No one asked you to do otherwise.

>
>> This isn't exactly going to motivate someone to continue pushing these ideas within that community. I was made to feel (purposely?) unwelcome at times. So I can either roll over and die, or start my own project.
>
> You haven't convinced me that your ideas are worth the effort of abandoning virtio/pci or maintaining both venet/vbus and virtio/pci.

With all due respect, I didn't ask you to do anything, especially not abandon something you are happy with.

All I did was push guest drivers to LKML. The code in question is independent of KVM, and it's proven to improve the experience of using Linux as a platform. There are people interested in using them (by virtue of the number of people that have signed up for the AlacrityVM list, and have mailed me privately about this work).

So where is the problem here?

> I'm sorry if that made you feel unwelcome. There's no reason to interpret disagreement as malice though.

Ok.

Kind Regards,
-Greg
On Tue, Aug 18, 2009 at 11:46:06AM +0300, Michael S. Tsirkin wrote:
> On Mon, Aug 17, 2009 at 04:17:09PM -0400, Gregory Haskins wrote:
> > Michael S. Tsirkin wrote:
> > > On Mon, Aug 17, 2009 at 10:14:56AM -0400, Gregory Haskins wrote:
> > >> Case in point: Take an upstream kernel and you can modprobe the vbus-pcibridge in and virtio devices will work over that transport unmodified.
> > >>
> > >> See http://lkml.org/lkml/2009/8/6/244 for details.
> > >
> > > The modprobe you are talking about would need to be done in guest kernel, correct?
> >
> > Yes, and your point is? "unmodified" (pardon the pseudo-pun) modifies "virtio", not "guest". It means you can take an off-the-shelf kernel with off-the-shelf virtio (a la distro-kernel), modprobe vbus-pcibridge, and get alacrityvm acceleration.
>
> Heh, by that logic ksplice does not modify the running kernel either :)
>
> > It is not a design goal of mine to forbid the loading of a new driver, so I am ok with that requirement.
> >
> > >> OTOH, Michael's patch is purely targeted at improving virtio-net on kvm, and it's likewise constrained by various limitations of that decision (such as its reliance on the PCI model, and the kvm memory scheme).
> > >
> > > vhost is actually not related to PCI in any way. It simply leaves all setup for userspace to do. And the memory scheme was intentionally separated from kvm so that it can easily support e.g. lguest.
> >
> > I think you have missed my point. I mean that vhost requires a separate bus-model (a la qemu-pci).
>
> So? That can be in userspace, and can be anything including vbus.
>
> > And no, your memory scheme is not separated, at least, not very well. It still assumes memory-regions and copy_to_user(), which is very kvm-esque.
> I don't think so: works for lguest, kvm, UML and containers
>
> > Vbus has people using things like userspace containers (no regions),
>
> vhost by default works without regions
>
> > and physical hardware (dma controllers, so no regions or copy_to_user) so your scheme quickly falls apart once you get away from KVM.
>
> Someone took a driver and is building hardware for it ... so what?

I think Greg is referring to something like my virtio-over-PCI patch. I'm pretty sure that vhost is completely useless for my situation. I'd like to see vhost work for my use, so I'll try to explain what I'm doing.

I've got a system where I have about 20 computers connected via PCI. The PCI master is a normal x86 system, and the PCI agents are PowerPC systems. The PCI agents act just like any other PCI card, except they are running Linux, and have their own RAM and peripherals.

I wrote a custom driver which imitated a network interface and a serial port. I tried to push it towards mainline, and DavidM rejected it, with the argument, "use virtio, don't add another virtualization layer to the kernel." I think he has a decent argument, so I wrote virtio-over-PCI.

Now, there are some things about virtio that don't work over PCI. Mainly, memory is not truly shared. It is extremely slow to access memory that is "far away", meaning "across the PCI bus." This can be worked around by using a DMA controller to transfer all data, along with an intelligent scheme to perform only writes across the bus. If you're careful, reads are never needed.

So, in my system, copy_(to|from)_user() is completely wrong. There is no userspace, only a physical system. In fact, because normal x86 computers do not have DMA controllers, the host system doesn't actually handle any data transfer!

I used virtio-net in both the guest and host systems in my example virtio-over-PCI patch, and succeeded in getting them to communicate.
However, the lack of any setup interface means that the devices must be hardcoded into both drivers, when the decision could be up to userspace. I think this is a problem that vbus could solve.

For my own selfish reasons (I don't want to maintain an out-of-tree driver) I'd like to see *something* useful in mainline Linux. I'm happy to answer questions about my setup, just ask.

Ira

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
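[Editor's note: the "perform only writes across the bus" scheme Ira describes can be sketched as a ring where each side reads only its own local memory and publishes progress by writing into the peer's memory. In the sketch below, plain stores stand in for DMA/posted PCI writes, and all names are illustrative, not taken from the virtio-over-PCI patch.]

```c
#include <stdint.h>

#define RING_SIZE 8

/* Each side owns one of these in its local RAM; the *peer* fills it in
 * with posted writes. Nothing here is ever read across the bus. */
struct mailbox {
	uint32_t buf[RING_SIZE]; /* payloads written by the peer        */
	uint32_t peer_index;     /* peer's head (consumer side) or tail */
};

/* Producer: private head index, reads only its own mailbox (where the
 * consumer mirrors its tail), writes only the consumer's mailbox. */
static int ring_send(uint32_t *head, struct mailbox *mine,
		     struct mailbox *peers, uint32_t val)
{
	if (*head - mine->peer_index == RING_SIZE)
		return -1;                       /* full: consumer is behind */
	peers->buf[*head % RING_SIZE] = val;     /* payload, posted write */
	peers->peer_index = ++(*head);           /* mirror head across bus */
	return 0;
}

/* Consumer: private tail index, reads only its own mailbox, and mirrors
 * its tail back into the producer's mailbox with a posted write. */
static int ring_recv(uint32_t *tail, struct mailbox *mine,
		     struct mailbox *peers, uint32_t *val)
{
	if (*tail == mine->peer_index)
		return -1;                       /* empty */
	*val = mine->buf[*tail % RING_SIZE];
	peers->peer_index = ++(*tail);           /* mirror tail across bus */
	return 0;
}
```

The full/empty tests compare a private index against a locally mirrored copy of the peer's index, so the slow cross-bus reads Ira mentions never happen on the data path.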
* Gregory Haskins <gregory.haskins@gmail.com> wrote:

> > You haven't convinced me that your ideas are worth the effort of abandoning virtio/pci or maintaining both venet/vbus and virtio/pci.
>
> With all due respect, I didn't ask you to do anything, especially not abandon something you are happy with.
>
> All I did was push guest drivers to LKML. The code in question is independent of KVM, and it's proven to improve the experience of using Linux as a platform. There are people interested in using them (by virtue of the number of people that have signed up for the AlacrityVM list, and have mailed me privately about this work).

This thread started because I asked you about your technical arguments for why we'd want vbus instead of virtio. Your answer above now basically boils down to: "because I want it so, why don't you leave me alone".

What you are doing here is, in essence, to fork KVM, regardless of the technical counter-arguments given against such a fork and regardless of the ample opportunity given to you to demonstrate the technical advantages of your code. (in which case KVM would happily migrate to your code)

We all love faster code and better management interfaces, and tons of your prior patches got accepted by Avi. This time you didn't even _try_ to improve virtio. It's not like you posted a lot of virtio patches which were not applied. You didn't even try, and you need to try _much_ harder than that before forking a project.

And fragmentation matters quite a bit. To Linux users, developers, administrators, and packagers it's a big deal whether two overlapping pieces of functionality for the same thing exist within the same kernel.

The kernel is not an anarchy where everyone can have their own sys_fork() version or their own sys_write() version. Would you want to have two dozen read() variants, sys_read_oracle() and sys_read_db2()? I certainly don't want that. Instead we (at great expense and work) try to reach the best technical solution.
That means we throw away inferior code and adopt the better one. (with a reasonable migration period)

You are ignoring that principle with hand-waving about 'the community wants this'. I can assure you, users _DON'T WANT_ split interfaces and incompatible drivers for the same thing. They want stuff that works well.

If the community wants this then why can't you convince one of the most prominent representatives of that community, the KVM developers?

Furthermore, 99% of your work is KVM, why don't you respect that work by not forking it? Why don't you respect the KVM community and Linux in general by improving existing pieces of infrastructure instead of forcefully forking it?

	Ingo
On Tue, Aug 18, 2009 at 11:19:40AM -0400, Gregory Haskins wrote:
> >>>> OTOH, Michael's patch is purely targeted at improving virtio-net on kvm, and it's likewise constrained by various limitations of that decision (such as its reliance on the PCI model, and the kvm memory scheme).
> >>> vhost is actually not related to PCI in any way. It simply leaves all setup for userspace to do. And the memory scheme was intentionally separated from kvm so that it can easily support e.g. lguest.
> >> I think you have missed my point. I mean that vhost requires a separate bus-model (a la qemu-pci).
> >
> > So? That can be in userspace, and can be anything including vbus.
>
> -ENOPARSE
>
> Can you elaborate?

Write a device that signals an eventfd on virtio kick, and poll eventfd for notifications, and you can use vhost-net. vbus, surely, can do this?

> >> And no, your memory scheme is not separated, at least, not very well. It still assumes memory-regions and copy_to_user(), which is very kvm-esque.
> >
> > I don't think so: works for lguest, kvm, UML and containers
>
> kvm-esque, meaning anything that follows the region+copy_to_user model. Not all things do.

Pretty much all things where it makes sense to share code with vhost-net. If there's hardware that wants direct access to descriptor rings, it just needs a driver.

> >> Vbus has people using things like userspace containers (no regions),
> >
> > vhost by default works without regions
>
> That's a start, but not good enough if you were trying to achieve the same thing as vbus. As I said before, I've never said you had to achieve the same thing, but do note they are distinctly different with different goals. You are solving a directed problem. I am solving a general problem, and trying to solve it once.

Heh. A good demonstration of vbus generality would be a solution that speeds up virtio in guests.
What venet seems to illustrate instead is that one has to rework all of host, guest and hypervisor to use vbus. Maybe it does not need to be that way - it just seems so.

> >> and physical hardware (dma controllers, so no regions or copy_to_user) so your scheme quickly falls apart once you get away from KVM.
> >
> > Someone took a driver and is building hardware for it ... so what?
>
> What is your point?

OK, can we forget about that physical hardware then?

> >> Don't get me wrong: That design may have its place. Perhaps you only care about fixing KVM, which is a perfectly acceptable strategy. It's just not a strategy that I think is the best approach. Essentially you are promoting the proliferation of competing backends, and I am trying to unify them (which is ironic that this thread started with concerns I was fragmenting things ;).
> >
> > So, you don't see how venet fragments things? It's pretty obvious ...
>
> I never said it doesn't. venet started as a test harness, but now it is inadvertently fragmenting the virtio-net effort. I admit it. It wasn't intentional, but just worked out that way. Until your vhost idea is vetted and benchmarked, it's not even in the running.
>
> Venet is currently the highest performing 802.x acceleration for KVM that I am aware of, so it will continue to garner interest from users concerned with performance.
>
> But likewise, vhost has the potential to fragment the back-end model. That was my point.

You don't see the difference? Long term, vhost-net can just be enabled by default whenever it is present, and there is a single guest driver to support. OTOH, venet means that we have to support two guest drivers, virtio and venet, for a long time.

> >> The bottom line is, you have a simpler solution that is more finely targeted at KVM and virtio-networking. It fixes probably a lot of problems with the existing implementation, but it still has limitations.
> >> OTOH, what I am promoting is more complex, but more flexible. That is the tradeoff. You can't have both ;)
> >
> > We can. connect eventfds to hypercalls, and vhost will work with vbus.
>
> -ENOPARSE
>
> vbus doesn't use hypercalls, and I do not see why or how you would connect two backend models together like this. Can you elaborate?

I think some older version did. But whatever. Signal eventfd on guest kick, poll eventfd to notify guest, and you can use vhost-net with vbus.

> >> So do not for one second think that what you implemented is equivalent, because they are not.
> >>
> >> In fact, I believe I warned you about this potential problem when you decided to implement your own version. I think I said something to the effect of "you will either have a subset of functionality, or you will ultimately reinvent what I did". Right now you are in the subset phase.
> >
> > No. Unlike vbus, vhost supports unmodified guests and live migration.
>
> By "subset", I am referring to your interfaces and the scope of their applicability. The things you need to do to make vhost work and a vbus device work, from a memory and signaling abstraction POV, are going to be extremely similar.
>
> The difference in how the guest sees these backends is all contained in the vbus-connector. Therefore, what you *could* have done is simply written a connector that does something like only support "virtio" backends, and surfaced them as regular PCI devices to the guest. Then you could have reused all the abstraction features in vbus, instead of reinventing them (case in point, your region+copy_to_user code). And likewise, anyone using vbus could use your virtio-net backend.
>
> Instead, I am still left with no virtio-net backend implemented, and you were left designing, writing, and testing facilities that I've already completed. So it was duplicative effort.
> Kind Regards,
> -Greg

As I said, I couldn't reuse your code the way it's written. But happily you can reuse vhost - it's just a library, link with it - or even vhost-net, as I explained above.
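[Editor's note: Michael's "signal an eventfd on kick, poll it for notifications" suggestion rests on eventfd semantics: each write adds to a 64-bit counter, and a single read returns and resets it, so back-to-back kicks coalesce into one wakeup. A minimal user-space illustration follows; the helper names are hypothetical, not the vhost API.]

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Guest->host "kick": whatever connector is in use (qemu-pci,
 * vbus-pcibridge, ...) just signals the eventfd. */
static int kick(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}

/* Host-side worker: one read drains all pending kicks at once and
 * resets the counter; returns 0 if nothing was pending (nonblocking). */
static uint64_t drain(int efd)
{
	uint64_t count = 0;

	if (read(efd, &count, sizeof(count)) != sizeof(count))
		return 0;
	return count;
}
```

Because the counter accumulates, three kicks issued before the worker runs cost one wakeup, which is why the backend can stay agnostic about how the guest-facing side generates the signal (PIO exit, hypercall, or anything else).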
On 08/18/2009 05:46 PM, Gregory Haskins wrote:
>> Can you explain how vbus achieves RDMA?
>>
>> I also don't see the connection to real time guests.
>
> Both of these are still in development. Trying to stay true to the "release early and often" mantra, the core vbus technology is being pushed now so it can be reviewed. Stay tuned for these other developments.

Hopefully you can outline how it works. AFAICT, RDMA and kernel bypass will need device assignment. If you're bypassing the call into the host kernel, it doesn't really matter how that call is made, does it?

>>> I also designed it in such a way that we could, in theory, write one set of (linux-based) backends, and have them work across a variety of environments (such as containers/VMs like KVM, lguest, openvz, but also physical systems like blade enclosures and clusters, or even applications running on the host).
>>
>> Sorry, I'm still confused. Why would openvz need vbus?
>
> It's just an example. The point is that I abstracted what I think are the key points of fast-io, memory routing, signal routing, etc, so that it will work in a variety of (ideally, _any_) environments.
>
> There may not be _performance_ motivations for certain classes of VMs because they already have decent support, but they may want a connector anyway to gain some of the new features available in vbus.
>
> And looking forward, the idea is that we have commoditized the backend so we don't need to redo this each time a new container comes along.

I'll wait until a concrete example shows up as I still don't understand.

>> One point of contention is that this is all managementy stuff and should be kept out of the host kernel. Exposing shared memory, interrupts, and guest hypercalls can all be easily done from userspace (as virtio demonstrates). True, some devices need kernel acceleration, but that's no reason to put everything into the host kernel.
>
> See my last reply to Anthony.
> My two points here are that:
>
> a) having it in-kernel makes it a complete subsystem, which perhaps has diminished value in kvm, but adds value in most other places that we are looking to use vbus.

It's not a complete system unless you want users to administer VMs using echo and cat and configfs. Some userspace support will always be necessary.

> b) the in-kernel code is being overstated as "complex". We are not talking about your typical virt thing, like an emulated ICH/PCI chipset. It's really a simple list of devices with a handful of attributes. They are managed using established linux interfaces, like sysfs/configfs.

They need to be connected to the real world somehow. What about security? Can any user create a container and devices and link them to real interfaces? If not, do you need to run the VM as root? virtio and vhost-net solve these issues. Does vbus?

The code may be simple to you. But the question is whether it's necessary, not whether it's simple or complex.

>> Exposing devices as PCI is an important issue for me, as I have to consider non-Linux guests.
>
> That's your prerogative, but obviously not everyone agrees with you.

I hope everyone agrees that it's an important issue for me and that I have to consider non-Linux guests. I also hope that you're considering non-Linux guests since they have considerable market share.

> Getting non-Linux guests to work is my problem if you choose to not be part of the vbus community.

I won't be writing those drivers in any case.

>> Another issue is the host kernel management code which I believe is superfluous.
>
> In your opinion, right?

Yes, this is why I wrote "I believe".

>> Given that, why spread to a new model?
>
> Note: I haven't asked you to (at least, not since April with the vbus-v3 release). Spreading to a new model is currently the role of the AlacrityVM project, since we disagree on the utility of a new model.
Given I'm not the gateway to inclusion of vbus/venet, you don't need to ask me anything. I'm still free to give my opinion.

>>> A) hardware can only generate byte/word sized requests at a time because that is all the pcb-etch and silicon support. So hardware is usually expressed in terms of some number of "registers".
>>
>> No, hardware happily DMAs to and fro main memory.
>
> Yes, now walk me through how you set up DMA to do something like a call when you do not know addresses a priori. Hint: count the number of MMIO/PIOs you need. If the number is > 1, you've lost.

With virtio, the number is 1 (or less if you amortize). Set up the ring entries and kick.

>> Some hardware of course uses mmio registers extensively, but not virtio hardware. With the recent MSI support no registers are touched in the fast path.
>
> Note we are not talking about virtio here. Just raw PCI and why I advocate vbus over it.

There's no such thing as raw PCI. Every PCI device has a protocol. The protocol virtio chose is optimized for virtualization.

>>> D) device-ids are in a fixed width register and centrally assigned from an authority (e.g. PCI-SIG).
>>
>> That's not an issue either. Qumranet/Red Hat has donated a range of device IDs for use in virtio.
>
> Yes, and to get one you have to do what? Register it with kvm.git, right? Kind of like registering a MAJOR/MINOR, would you agree? Maybe you do not mind (especially given your relationship to kvm.git), but there are disadvantages to that model for most of the rest of us.

Send an email, it's not that difficult. There's also an experimental range.

>> Device IDs are how devices are associated with drivers, so you'll need something similar for vbus.
>
> Nope, just like you don't need to do anything ahead of time for using a dynamic misc-device name. You just have both the driver and device know what they are looking for (it's part of the ABI).
If you get a device ID clash, you fail. If you get a device name clash, you fail in the same way.

>>> E) Interrupt/MSI routing is per-device oriented
>>
>> Please elaborate. What is the issue? How does vbus solve it?
>
> There are no "interrupts" in vbus, only shm-signals. You can establish an arbitrary number of shm regions, each with an optional shm-signal associated with it. To do this, the driver calls dev->shm(), and you get back a shm_signal object.
>
> Underneath the hood, the vbus-connector (e.g. vbus-pcibridge) decides how it maps real interrupts to shm-signals (on a system level, not per device). This can be 1:1, or any other scheme. vbus-pcibridge uses one system-wide interrupt per priority level (today this is 8 levels), each with an IOQ based event channel. "Signals" come as an event on that channel.
>
> So the "issue" is that you have no real choice with PCI. You just get device oriented interrupts. With vbus, it's abstracted. So you can still get per-device standard MSI, or you can do fancier things like coalescing and prioritization.

As I've mentioned before, prioritization is available on x86, and coalescing scales badly.

>>> F) Interrupts/MSI are assumed cheap to inject
>>
>> Interrupts are not assumed cheap; that's why interrupt mitigation is used (on real and virtual hardware).
>
> It's all relative. IDT dispatch and EOI overhead are "baseline" on real hardware, whereas the vmenters and vmexits are significantly more expensive on virt (and you have new exit causes, like irq-windows, etc, that do not exist in real HW).

irq window exits ought to be pretty rare, so we're only left with injection vmexits. At around 1us/vmexit, even 100,000 interrupts/vcpu (which is excessive) will only cost you 10% cpu time.

>>> G) Interrupts/MSI are non-prioritizable.
>>
>> They are prioritizable; Linux ignores this though (Windows doesn't).
>> Please elaborate on what the problem is and how vbus solves it.
>
> It doesn't work right. The x86 sense of interrupt priority is, sorry to say it, half-assed at best. I've worked with embedded systems that have real interrupt priority support in the hardware, end to end, including the PIC. The LAPIC, on the other hand, is really weak in this dept, and as you said, Linux doesn't even attempt to use what's there.

Maybe prioritization is not that important then. If it is, it needs to be fixed at the lapic level, otherwise you have no real prioritization wrt non-vbus interrupts.

>>> H) Interrupts/MSI are statically established
>>
>> Can you give an example of why this is a problem?
>
> Some of the things we are building use the model of having a device that hands out shm-signals in response to guest events (say, the creation of an IPC channel). This would generally be handled by a specific device model instance, and it would need to do this without pre-declaring the MSI vectors (to use PCI as an example).

You're free to demultiplex an MSI to however many consumers you want; there's no need for a new bus for that.

>> What performance oriented items have been left unaddressed?
>
> Well, the interrupt model, to name one.

Like I mentioned, you can merge MSI interrupts, but that's not necessarily a good idea.

>> How do you handle conflicts? Again you need a central authority to hand out names or prefixes.
>
> Not really, no. If you really wanted to be formal about it, you could adopt any series of UUID schemes. For instance, perhaps venet should be "com.novell::virtual-ethernet". Heck, I could use uuidgen.

Do you use DNS? We use PCI-SIG. If Novell is a PCI-SIG member you can get a vendor ID and control your own virtio space.

>>> As another example, the connector design coalesces *all* shm-signals into a single interrupt (by prio) that uses the same context-switch mitigation techniques that help boost things like networking.
>>> This effectively means we can detect and optimize out ack/eoi cycles from the APIC as the IO load increases (which is when you need it most). PCI has no such concept.
>>
>> That's a bug, not a feature. It means poor scaling as the number of vcpus increases and as the number of devices increases.
>
> So the "avi-vbus-connector" can use 1:1, if you prefer. Large vcpu counts (which are not typical) and irq-affinity are not a target application for my design, so I prefer the coalescing model in the vbus-pcibridge included in this series. YMMV

So far you've left live migration, Windows, large guests, and multiqueue out of your design. If you wish to position vbus/venet for large scale use you'll need to address all of them.

>> Note nothing prevents steering multiple MSIs into a single vector. It's a bad idea though.
>
> Yes, it is a bad idea... and not the same thing either. This would effectively create a shared-line scenario in the irq code, which is not what happens in vbus.

Ok.

>>> In addition, the signals and interrupts are priority aware, which is useful for things like 802.1p networking where you may establish 8-tx and 8-rx queues for your virtio-net device. x86 APIC really has no usable equivalent, so PCI is stuck here.
>>
>> x86 APIC is priority aware.
>
> Have you ever tried to use it?

I haven't, but Windows does.

>>> Also, the signals can be allocated on-demand for implementing things like IPC channels in response to guest requests since there is no assumption about device-to-interrupt mappings. This is more flexible.
>>
>> Yes. However given that vectors are a scarce resource you're severely limited in that.
>
> The connector I am pushing out does not have this limitation.

Okay.

>> And if you're multiplexing everything on one vector, then you can just as well demultiplex your channels in the virtio driver code.
>
> Only per-device, not system wide.

Right.
I still think multiplexing interrupts is a bad idea in a large system. In a small system... why would you do it at all? >>> And through all of this, this design would work in any guest even if it >>> doesn't have PCI (e.g. lguest, UML, physical systems, etc). >>> >>> >> That is true for virtio which works on pci-less lguest and s390. >> > Yes, and lguest and s390 had to build their own bus-model to do it, right? > They had to build connectors just like you propose to do. > Thank you for bringing this up, because it is one of the main points > here. What I am trying to do is generalize the bus to prevent the > proliferation of more of these isolated models in the future. Build > one, fast, in-kernel model so that we wouldn't need virtio-X, and > virtio-Y in the future. They can just reuse the (performance optimized) > bus and models, and only need to build the connector to bridge them. > But you still need vbus-connector-lguest and vbus-connector-s390 because they all talk to the host differently. So what's changed? The names? >> That is exactly the design goal of virtio (except it limits itself to >> virtualization). >> > No, virtio is only part of the picture. It does not include the backend > models, or how to do memory/signal-path abstraction in-kernel, for > instance. But otherwise, virtio as a device model is compatible with > vbus as a bus model. They complement one another. > Well, venet doesn't complement virtio-net, and virtio-pci doesn't complement vbus-connector. >>> Then device models like virtio can ride happily on top and we end up >>> with a really robust and high-performance Linux-based stack. I don't >>> buy the argument that we already have PCI so let's use it. I don't think >>> it's the best design and I am not afraid to make an investment in a >>> change here because I think it will pay off in the long run. >>> >>> >> Sorry, I don't think you've shown any quantifiable advantages. >> > We can agree to disagree then, eh? 
There are certainly quantifiable > differences. Waving your hand at the differences to say they are not > advantages is merely an opinion, one that is not shared universally. > I've addressed them one by one. We can agree to disagree on interrupt multiplexing, and the importance of compatibility, Windows, large guests, multiqueue, and DNS vs. PCI-SIG. > The bottom line is all of these design distinctions are encapsulated > within the vbus subsystem and do not affect the kvm code-base. So > agreement with kvm upstream is not a requirement, but would be > advantageous for collaboration. > Certainly.
On Tue, Aug 18, 2009 at 11:39:25AM -0400, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Mon, Aug 17, 2009 at 03:33:30PM -0400, Gregory Haskins wrote: > >> There is a secondary question of venet (a vbus native device) versus > >> virtio-net (a virtio native device that works with PCI or VBUS). If > >> this contention is really around venet vs virtio-net, I may possibly > >> concede and retract its submission to mainline. > > > > For me yes, venet+ioq competing with virtio+virtqueue. > > > >> I've been pushing it to date because people are using it and I don't > >> see any reason that the driver couldn't be upstream. > > > > If virtio is just as fast, they can just use it without knowing it. > > Clearly, that's better since we support virtio anyway ... > > More specifically: kvm can support whatever it wants. I am not asking > kvm to support venet. > > If we (the alacrityvm community) decide to keep maintaining venet, _we_ > will support it, and I have no problem with that. > > As of right now, we are doing some interesting things with it in the lab > and it's certainly more flexible for us as a platform since we maintain > the ABI and feature set. So for now, I do not think it's a big deal if > they both co-exist, and it has no bearing on KVM upstream. As someone who extended them recently, both ABI and feature set with virtio are pretty flexible. What's the problem? Will every single contributor now push a driver with an incompatible ABI upstream because this way he maintains both ABI and feature set? Oh well ...
On 08/18/2009 06:51 PM, Gregory Haskins wrote: > >> It's not laughably trivial when you try to support the full feature set >> of kvm (for example, live migration will require dirty memory tracking, >> and exporting all state stored in the kernel to userspace). >> > Doesn't vhost suffer from the same issue? If not, could I also apply > the same technique to support live-migration in vbus? > It does. There are two possible solutions to that: dropping the entire protocol to userspace, or the one I prefer, proxying the ring and eventfds in userspace but otherwise letting vhost-net run normally. This way userspace gets to see descriptors and mark the pages as dirty. Both these approaches rely on vhost-net being an accelerator to a userspace based component, but maybe you can adapt venet to use something similar. >> Oh come on, I wrote "steal" as a convenient shorthand for >> "cross-pollinate your ideas into our code according to the letter and >> spirit of the GNU General Public License". >> > Is that supposed to make me feel better about working with you? I mean, > writing, testing, polishing patches for LKML-type submission is time > consuming. If all you are going to do is take those ideas and rewrite > it yourself, why should I go through that effort? > If you're posting your ideas for everyone to read in the form of code, why not post them in the form of design ideas as well? In any case you've given up any secrets. In the worst case you've lost nothing, in the best case you may get some hopefully constructive criticism and maybe improvements. I'm perfectly happy picking up ideas from competing projects (and I have) and seeing my ideas picked up in competing projects (which I also have). Really, isn't that the point of open source? Share code, but also share ideas? > And its not like that was the first time you have said that to me. > And I meant it every time. Haven't you just asked how vhost-net plans to do live migration? 
>> Since we're all trying to improve Linux we may as well cooperate. >> > Well, I don't think anyone can say that I haven't been trying. > I'd be obliged if you reveal some of your secret sauce then (only the parts you plan to GPL anyway of course). >>> "sorry, we are going to reinvent our own instead". >>> >> No. Adopting venet/vbus would mean reinventing something that already >> existed. >> > But yet, it doesn't. > We'll need to do the agree to disagree thing again here. >> Continuing to support virtio/pci is not reinventing anything. >> > No one asked you to do otherwise. > Right, and I'm not keen on supporting both. See why I want to stick to virtio/pci as long as I possibly can? >> You haven't convinced me that your ideas are worth the effort of >> abandoning virtio/pci or maintaining both venet/vbus and virtio/pci. >> > With all due respect, I didn't ask you to do anything, especially not > abandon something you are happy with. > > All I did was push guest drivers to LKML. The code in question is > independent of KVM, and it's proven to improve the experience of using > Linux as a platform. There are people interested in using them (by > virtue of the number of people that have signed up for the AlacrityVM > list, and have mailed me privately about this work). > > So where is the problem here? > I'm unhappy with the duplication of effort and potential fragmentation of the developer and user communities, that's all. I'd rather see the work going into vbus/venet going into virtio. I think it's a legitimate concern.
On 08/18/2009 06:53 PM, Ira W. Snyder wrote: > So, in my system, copy_(to|from)_user() is completely wrong. There is no > userspace, only a physical system. In fact, because normal x86 computers > do not have DMA controllers, the host system doesn't actually handle any > data transfer! > In fact, modern x86s do have dma engines these days (google for Intel I/OAT), and one of our plans for vhost-net is to allow their use for packets above a certain size. So a patch allowing vhost-net to optionally use a dma engine is a good thing. > I used virtio-net in both the guest and host systems in my example > virtio-over-PCI patch, and succeeded in getting them to communicate. > However, the lack of any setup interface means that the devices must be > hardcoded into both drivers, when the decision could be up to userspace. > I think this is a problem that vbus could solve. > Exposing a knob to userspace is not an insurmountable problem; vhost-net already allows changing the memory layout, for example.
On Tue, Aug 18, 2009 at 11:51:59AM -0400, Gregory Haskins wrote: > > It's not laughably trivial when you try to support the full feature set > > of kvm (for example, live migration will require dirty memory tracking, > > and exporting all state stored in the kernel to userspace). > > Doesn't vhost suffer from the same issue? If not, could I also apply > the same technique to support live-migration in vbus? vhost does this by switching to userspace for the duration of live migration. venet could do this I guess, but you'd need to write a userspace implementation. vhost just reuses existing userspace virtio. > With all due respect, I didn't ask you to do anything, especially not > abandon something you are happy with. > > All I did was push guest drivers to LKML. The code in question is > independent of KVM, and it's proven to improve the experience of using > Linux as a platform. There are people interested in using them (by > virtue of the number of people that have signed up for the AlacrityVM > list, and have mailed me privately about this work). > > So where is the problem here? If virtio net in guest could be improved instead, everyone would benefit. I am doing this, and I wish more people would join. Instead, you change the ABI in an incompatible way. So now, there's no single place to work on kvm networking performance. Now, it would all be understandable if the reason was e.g. better performance. But you say yourself it isn't. See the problem?
On Tue, Aug 18, 2009 at 07:51:21PM +0300, Avi Kivity wrote: > On 08/18/2009 06:53 PM, Ira W. Snyder wrote: >> So, in my system, copy_(to|from)_user() is completely wrong. There is no >> userspace, only a physical system. In fact, because normal x86 computers >> do not have DMA controllers, the host system doesn't actually handle any >> data transfer! >> > > In fact, modern x86s do have dma engines these days (google for Intel > I/OAT), and one of our plans for vhost-net is to allow their use for > packets above a certain size. So a patch allowing vhost-net to > optionally use a dma engine is a good thing. > Yes, I'm aware that very modern x86 PCs have general purpose DMA engines, even though I don't have any capable hardware. However, I think it is better to support using any PC (with or without DMA engine, any architecture) as the PCI master, and just handle the DMA all from the PCI agent, which is known to have DMA? >> I used virtio-net in both the guest and host systems in my example >> virtio-over-PCI patch, and succeeded in getting them to communicate. >> However, the lack of any setup interface means that the devices must be >> hardcoded into both drivers, when the decision could be up to userspace. >> I think this is a problem that vbus could solve. >> > > Exposing a knob to userspace is not an insurmountable problem; vhost-net > already allows changing the memory layout, for example. > Let me explain the most obvious problem I ran into: setting the MAC addresses used in virtio. On the host (PCI master), I want eth0 (virtio-net) to get a random MAC address. On the guest (PCI agent), I want eth0 (virtio-net) to get a specific MAC address, aa:bb:cc:dd:ee:ff. The virtio feature negotiation code handles this, by seeing the VIRTIO_NET_F_MAC feature in its configuration space. If BOTH drivers do not have VIRTIO_NET_F_MAC set, then NEITHER will use the specified MAC address. 
This is because the feature negotiation code only accepts a feature if it is offered by both sides of the connection. In this case, I must have the guest generate a random MAC address and have the host put aa:bb:cc:dd:ee:ff into the guest's configuration space. This basically means hardcoding the MAC addresses in the Linux drivers, which is a big no-no. What would I expose to userspace to make this situation manageable? Thanks for the response, Ira -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 08/18/2009 08:27 PM, Ira W. Snyder wrote: >> In fact, modern x86s do have dma engines these days (google for Intel >> I/OAT), and one of our plans for vhost-net is to allow their use for >> packets above a certain size. So a patch allowing vhost-net to >> optionally use a dma engine is a good thing. >> > Yes, I'm aware that very modern x86 PCs have general purpose DMA > engines, even though I don't have any capable hardware. However, I think > it is better to support using any PC (with or without DMA engine, any > architecture) as the PCI master, and just handle the DMA all from the > PCI agent, which is known to have DMA? > Certainly; but if your PCI agent will support the DMA API, then the same vhost code will work with both I/OAT and your specialized hardware. >> Exposing a knob to userspace is not an insurmountable problem; vhost-net >> already allows changing the memory layout, for example. >> >> > Let me explain the most obvious problem I ran into: setting the MAC > addresses used in virtio. > > On the host (PCI master), I want eth0 (virtio-net) to get a random MAC > address. > > On the guest (PCI agent), I want eth0 (virtio-net) to get a specific MAC > address, aa:bb:cc:dd:ee:ff. > > The virtio feature negotiation code handles this, by seeing the > VIRTIO_NET_F_MAC feature in it's configuration space. If BOTH drivers do > not have VIRTIO_NET_F_MAC set, then NEITHER will use the specified MAC > address. This is because the feature negotiation code only accepts a > feature if it is offered by both sides of the connection. > > In this case, I must have the guest generate a random MAC address and > have the host put aa:bb:cc:dd:ee:ff into the guest's configuration > space. This basically means hardcoding the MAC addresses in the Linux > drivers, which is a big no-no. > > What would I expose to userspace to make this situation manageable? 
> > I think in this case you want one side to be virtio-net (I'm guessing the x86) and the other side vhost-net (the ppc boards with the dma engine). virtio-net on x86 would communicate with userspace on the ppc board to negotiate features and get a mac address, the fast path would be between virtio-net and vhost-net (which would use the dma engine to push and pull data).
On Tuesday 18 August 2009, Gregory Haskins wrote: > Avi Kivity wrote: > > On 08/17/2009 10:33 PM, Gregory Haskins wrote: > > > > One point of contention is that this is all managementy stuff and should > > be kept out of the host kernel. Exposing shared memory, interrupts, and > > guest hypercalls can all be easily done from userspace (as virtio > > demonstrates). True, some devices need kernel acceleration, but that's > > no reason to put everything into the host kernel. > > See my last reply to Anthony. My two points here are that: > > a) having it in-kernel makes it a complete subsystem, which perhaps has > diminished value in kvm, but adds value in most other places that we are > looking to use vbus. > > b) the in-kernel code is being overstated as "complex". We are not > talking about your typical virt thing, like an emulated ICH/PCI chipset. > It's really a simple list of devices with a handful of attributes. They > are managed using established linux interfaces, like sysfs/configfs. IMHO the complexity of the code is not so much of a problem. What I see as a problem is the complexity of a kernel/user-space interface that manages the devices with global state. One of the greatest features of Michael's vhost driver is that all the state is associated with open file descriptors that either exist already or belong to the vhost_net misc device. When a process dies, all the file descriptors get closed and the whole state is cleaned up implicitly. AFAICT, you can't do that with the vbus host model. > > What performance oriented items have been left unaddressed? > > Well, the interrupt model to name one. The performance aspects of your interrupt model are independent of the vbus proxy, or at least they should be. Let's assume for now that your event notification mechanism gives significant performance improvements (which we can't measure independently right now). 
I don't see a reason why we could not get the same performance out of a paravirtual interrupt controller that uses the same method, and it would be straightforward to implement one and use that together with all the existing emulated PCI devices and virtio devices including vhost_net. Arnd <><
On Tue, Aug 18, 2009 at 08:47:04PM +0300, Avi Kivity wrote: > On 08/18/2009 08:27 PM, Ira W. Snyder wrote: >>> In fact, modern x86s do have dma engines these days (google for Intel >>> I/OAT), and one of our plans for vhost-net is to allow their use for >>> packets above a certain size. So a patch allowing vhost-net to >>> optionally use a dma engine is a good thing. >>> >> Yes, I'm aware that very modern x86 PCs have general purpose DMA >> engines, even though I don't have any capable hardware. However, I think >> it is better to support using any PC (with or without DMA engine, any >> architecture) as the PCI master, and just handle the DMA all from the >> PCI agent, which is known to have DMA? >> > > Certainly; but if your PCI agent will support the DMA API, then the same > vhost code will work with both I/OAT and your specialized hardware. > Yes, that's true. My ppc is a Freescale MPC8349EMDS. It has a Linux DMAEngine driver in mainline, which I've used. That's excellent. >>> Exposing a knob to userspace is not an insurmountable problem; vhost-net >>> already allows changing the memory layout, for example. >>> >>> >> Let me explain the most obvious problem I ran into: setting the MAC >> addresses used in virtio. >> >> On the host (PCI master), I want eth0 (virtio-net) to get a random MAC >> address. >> >> On the guest (PCI agent), I want eth0 (virtio-net) to get a specific MAC >> address, aa:bb:cc:dd:ee:ff. >> >> The virtio feature negotiation code handles this, by seeing the >> VIRTIO_NET_F_MAC feature in it's configuration space. If BOTH drivers do >> not have VIRTIO_NET_F_MAC set, then NEITHER will use the specified MAC >> address. This is because the feature negotiation code only accepts a >> feature if it is offered by both sides of the connection. >> >> In this case, I must have the guest generate a random MAC address and >> have the host put aa:bb:cc:dd:ee:ff into the guest's configuration >> space. 
This basically means hardcoding the MAC addresses in the Linux >> drivers, which is a big no-no. >> >> What would I expose to userspace to make this situation manageable? >> >> > > I think in this case you want one side to be virtio-net (I'm guessing > the x86) and the other side vhost-net (the ppc boards with the dma > engine). virtio-net on x86 would communicate with userspace on the ppc > board to negotiate features and get a mac address, the fast path would > be between virtio-net and vhost-net (which would use the dma engine to > push and pull data). > Ah, that seems backwards, but it should work after vhost-net learns how to use the DMAEngine API. I haven't studied vhost-net very carefully yet. As soon as I saw the copy_(to|from)_user() I stopped reading, because it seemed useless for my case. I'll look again and try to find where vhost-net supports setting MAC addresses and other features. Also, in my case I'd like to boot Linux with my rootfs over NFS. Is vhost-net capable of this? I've had Arnd, BenH, and Grant Likely (and others, privately) contact me about devices they are working with that would benefit from something like virtio-over-PCI. I'd like to see vhost-net be merged with the capability to support my use case. There are plenty of others that would benefit, not just myself. I'm not sure vhost-net is being written with this kind of future use in mind. I'd hate to see it get merged, and then have to change the ABI to support physical-device-to-device usage. It would be better to keep future use in mind now, rather than try and hack it in later. Thanks for the comments. Ira
On 08/18/2009 09:27 PM, Ira W. Snyder wrote: >> I think in this case you want one side to be virtio-net (I'm guessing >> the x86) and the other side vhost-net (the ppc boards with the dma >> engine). virtio-net on x86 would communicate with userspace on the ppc >> board to negotiate features and get a mac address, the fast path would >> be between virtio-net and vhost-net (which would use the dma engine to >> push and pull data). >> >> > > Ah, that seems backwards, but it should work after vhost-net learns how > to use the DMAEngine API. > > I haven't studied vhost-net very carefully yet. As soon as I saw the > copy_(to|from)_user() I stopped reading, because it seemed useless for > my case. I'll look again and try to find where vhost-net supports > setting MAC addresses and other features. > It doesn't; all it does is pump the rings, leaving everything else to userspace. > Also, in my case I'd like to boot Linux with my rootfs over NFS. Is > vhost-net capable of this? > It's just another network interface. You'd need an initramfs though to contain the needed userspace. > I've had Arnd, BenH, and Grant Likely (and others, privately) contact me > about devices they are working with that would benefit from something > like virtio-over-PCI. I'd like to see vhost-net be merged with the > capability to support my use case. There are plenty of others that would > benefit, not just myself. > > I'm not sure vhost-net is being written with this kind of future use in > mind. I'd hate to see it get merged, and then have to change the ABI to > support physical-device-to-device usage. It would be better to keep > future use in mind now, rather than try and hack it in later. > Please review and comment then. I'm fairly confident there won't be any ABI issues since vhost-net does so little outside pumping the rings. 
Note the signalling paths go through eventfd: when vhost-net wants the other side to look at its ring, it tickles an eventfd which is supposed to trigger an interrupt on the other side. Conversely, when another eventfd is signalled, vhost-net will look at the ring and process any data there. You'll need to wire your signalling to those eventfds, either in userspace or in the kernel.
On 08/18/2009 09:20 PM, Arnd Bergmann wrote: >> Well, the interrupt model to name one. >> > The performance aspects of your interrupt model are independent > of the vbus proxy, or at least they should be. Let's assume for > now that your event notification mechanism gives significant > performance improvements (which we can't measure independently > right now). I don't see a reason why we could not get the > same performance out of a paravirtual interrupt controller > that uses the same method, and it would be straightforward > to implement one and use that together with all the existing > emulated PCI devices and virtio devices including vhost_net. > Interesting. You could even configure those vectors using the standard MSI configuration mechanism; simply replace the address/data pair with something meaningful to the paravirt interrupt controller. I'd have to see really hard numbers to be tempted to merge something like this though. We've merged paravirt mmu, for example, and now it underperforms both hardware two-level paging and software shadow paging.
On Tue, Aug 18, 2009 at 11:27:35AM -0700, Ira W. Snyder wrote: > I haven't studied vhost-net very carefully yet. As soon as I saw the > copy_(to|from)_user() I stopped reading, because it seemed useless for > my case. I'll look again and try to find where vhost-net supports > setting MAC addresses and other features. vhost net doesn't do this at all. You bind a raw socket to a network device, and program that with the usual userspace interfaces. > Also, in my case I'd like to boot Linux with my rootfs over NFS. Is > vhost-net capable of this? > > I've had Arnd, BenH, and Grant Likely (and others, privately) contact me > about devices they are working with that would benefit from something > like virtio-over-PCI. I'd like to see vhost-net be merged with the > capability to support my use case. There are plenty of others that would > benefit, not just myself. > > I'm not sure vhost-net is being written with this kind of future use in > mind. I'd hate to see it get merged, and then have to change the ABI to > support physical-device-to-device usage. It would be better to keep > future use in mind now, rather than try and hack it in later. I still need to think your usage over. I am not so sure this fits what vhost is trying to do. If not, possibly it's better to just have a separate driver for your device. > Thanks for the comments. > Ira
On Tue, Aug 18, 2009 at 10:27:52AM -0700, Ira W. Snyder wrote: > On Tue, Aug 18, 2009 at 07:51:21PM +0300, Avi Kivity wrote: > > On 08/18/2009 06:53 PM, Ira W. Snyder wrote: > >> So, in my system, copy_(to|from)_user() is completely wrong. There is no > >> userspace, only a physical system. In fact, because normal x86 computers > >> do not have DMA controllers, the host system doesn't actually handle any > >> data transfer! > >> > > > > In fact, modern x86s do have dma engines these days (google for Intel > > I/OAT), and one of our plans for vhost-net is to allow their use for > > packets above a certain size. So a patch allowing vhost-net to > > optionally use a dma engine is a good thing. > > > > Yes, I'm aware that very modern x86 PCs have general purpose DMA > engines, even though I don't have any capable hardware. However, I think > it is better to support using any PC (with or without DMA engine, any > architecture) as the PCI master, and just handle the DMA all from the > PCI agent, which is known to have DMA? > > >> I used virtio-net in both the guest and host systems in my example > >> virtio-over-PCI patch, and succeeded in getting them to communicate. > >> However, the lack of any setup interface means that the devices must be > >> hardcoded into both drivers, when the decision could be up to userspace. > >> I think this is a problem that vbus could solve. > >> > > > > Exposing a knob to userspace is not an insurmountable problem; vhost-net > > already allows changing the memory layout, for example. > > > > Let me explain the most obvious problem I ran into: setting the MAC > addresses used in virtio. > > On the host (PCI master), I want eth0 (virtio-net) to get a random MAC > address. > > On the guest (PCI agent), I want eth0 (virtio-net) to get a specific MAC > address, aa:bb:cc:dd:ee:ff. > > The virtio feature negotiation code handles this, by seeing the > VIRTIO_NET_F_MAC feature in it's configuration space. 
If BOTH drivers do > not have VIRTIO_NET_F_MAC set, then NEITHER will use the specified MAC > address. This is because the feature negotiation code only accepts a > feature if it is offered by both sides of the connection. > > In this case, I must have the guest generate a random MAC address and > have the host put aa:bb:cc:dd:ee:ff into the guest's configuration > space. This basically means hardcoding the MAC addresses in the Linux > drivers, which is a big no-no. > > What would I expose to userspace to make this situation manageable? > > Thanks for the response, > Ira This calls for some kind of change in guest virtio. vhost, being a host-kernel-only feature, does not deal with this problem. But assuming virtio in guest supports this somehow, vhost will not interfere: you do the setup in qemu userspace anyway, and vhost will happily use a network device however you choose to set it up.
On Tue, Aug 18, 2009 at 08:53:29AM -0700, Ira W. Snyder wrote: > I think Greg is referring to something like my virtio-over-PCI patch. > I'm pretty sure that vhost is completely useless for my situation. I'd > like to see vhost work for my use, so I'll try to explain what I'm > doing. > > I've got a system where I have about 20 computers connected via PCI. The > PCI master is a normal x86 system, and the PCI agents are PowerPC > systems. The PCI agents act just like any other PCI card, except they > are running Linux, and have their own RAM and peripherals. > > I wrote a custom driver which imitated a network interface and a serial > port. I tried to push it towards mainline, and DavidM rejected it, with > the argument, "use virtio, don't add another virtualization layer to the > kernel." I think he has a decent argument, so I wrote virtio-over-PCI. > > Now, there are some things about virtio that don't work over PCI. > Mainly, memory is not truly shared. It is extremely slow to access > memory that is "far away", meaning "across the PCI bus." This can be > worked around by using a DMA controller to transfer all data, along with > an intelligent scheme to perform only writes across the bus. If you're > careful, reads are never needed. > > So, in my system, copy_(to|from)_user() is completely wrong. > There is no userspace, only a physical system. Can guests do DMA to random host memory? Or is there some kind of IOMMU and DMA API involved? If the latter, then note that you'll still need some kind of driver for your device. The question we need to ask ourselves then is whether this driver can reuse bits from vhost. > In fact, because normal x86 computers > do not have DMA controllers, the host system doesn't actually handle any > data transfer! Is it true that PPC has to initiate all DMA then? How do you manage not to do DMA reads then? 
> I used virtio-net in both the guest and host systems in my example > virtio-over-PCI patch, and succeeded in getting them to communicate. > However, the lack of any setup interface means that the devices must be > hardcoded into both drivers, when the decision could be up to userspace. > I think this is a problem that vbus could solve. What you describe (passing setup from host to guest) seems like a feature that guest devices need to support. It seems unlikely that vbus, being a transport layer, can address this. > > For my own selfish reasons (I don't want to maintain an out-of-tree > driver) I'd like to see *something* useful in mainline Linux. I'm happy > to answer questions about my setup, just ask. > > Ira Thanks Ira, I'll think about it. A couple of questions: - Could you please describe what kind of communication needs to happen? - I'm not familiar with DMA engine in question. I'm guessing it's the usual thing: in/out buffers need to be kernel memory, interface is asynchronous, small limited number of outstanding requests? Is there a userspace interface for it and if yes how does it work?
On Tue, Aug 18, 2009 at 09:52:48PM +0300, Avi Kivity wrote: > On 08/18/2009 09:27 PM, Ira W. Snyder wrote: >>> I think in this case you want one side to be virtio-net (I'm guessing >>> the x86) and the other side vhost-net (the ppc boards with the dma >>> engine). virtio-net on x86 would communicate with userspace on the ppc >>> board to negotiate features and get a mac address, the fast path would >>> be between virtio-net and vhost-net (which would use the dma engine to >>> push and pull data). >>> >>> >> >> Ah, that seems backwards, but it should work after vhost-net learns how >> to use the DMAEngine API. >> >> I haven't studied vhost-net very carefully yet. As soon as I saw the >> copy_(to|from)_user() I stopped reading, because it seemed useless for >> my case. I'll look again and try to find where vhost-net supports >> setting MAC addresses and other features. >> > > It doesn't; all it does is pump the rings, leaving everything else to > userspace. > Ok. On a non shared-memory system (where the guest's RAM is not just a chunk of userspace RAM in the host system), virtio's management model seems to fall apart. Feature negotiation doesn't work as one would expect. This does appear to be solved by vbus, though I haven't written a vbus-over-PCI implementation, so I cannot be completely sure. I'm not at all clear on how to get feature negotiation to work on a system like mine. From my study of lguest and kvm (see below) it looks like userspace will need to be involved, via a miscdevice. >> Also, in my case I'd like to boot Linux with my rootfs over NFS. Is >> vhost-net capable of this? >> > > It's just another network interface. You'd need an initramfs though to > contain the needed userspace. > Ok. I'm using an initramfs already, so adding some more userspace to it isn't a problem. >> I've had Arnd, BenH, and Grant Likely (and others, privately) contact me >> about devices they are working with that would benefit from something >> like virtio-over-PCI. 
I'd like to see vhost-net be merged with the >> capability to support my use case. There are plenty of others that would >> benefit, not just myself. >> >> I'm not sure vhost-net is being written with this kind of future use in >> mind. I'd hate to see it get merged, and then have to change the ABI to >> support physical-device-to-device usage. It would be better to keep >> future use in mind now, rather than try and hack it in later. >> > > Please review and comment then. I'm fairly confident there won't be any > ABI issues since vhost-net does so little outside pumping the rings. > Ok. I thought I should at least express my concerns while we're discussing this, rather than being too late after finding the time to study the driver. Off the top of my head, I would think that transporting userspace addresses in the ring (for copy_(to|from)_user()) vs. physical addresses (for DMAEngine) might be a problem. Pinning userspace pages into memory for DMA is a bit of a pain, though it is possible. There is also the problem of different endianness between host and guest in virtio-net. The struct virtio_net_hdr (include/linux/virtio_net.h) defines fields in host byte order. Which totally breaks if the guest has a different endianness. This is a virtio-net problem though, and is not transport specific. > Note the signalling paths go through eventfd: when vhost-net wants the > other side to look at its ring, it tickles an eventfd which is supposed > to trigger an interrupt on the other side. Conversely, when another > eventfd is signalled, vhost-net will look at the ring and process any > data there. You'll need to wire your signalling to those eventfds, > either in userspace or in the kernel. > Ok. I've never used eventfd before, so that'll take yet more studying. I've browsed over both the kvm and lguest code, and it looks like they each re-invent a mechanism for transporting interrupts between the host and guest, using eventfd. 
They both do this by implementing a miscdevice, which is basically their management interface. See drivers/lguest/lguest_user.c (see write() and LHREQ_EVENTFD) and kvm-kmod-devel-88/x86/kvm_main.c (see kvm_vm_ioctl(), called via kvm_dev_ioctl()) for how they hook up eventfd's. I can now imagine how two userspace programs (host and guest) could work together to implement a management interface, including hotplug of devices, etc. Of course, this would basically reinvent the vbus management interface into a specific driver. I think this is partly what Greg is trying to abstract out into generic code. I haven't studied the actual data transport mechanisms in vbus, though I have studied virtio's transport mechanism. I think a generic management interface for virtio might be a good thing to consider, because it seems there are at least two implementations already: kvm and lguest. Thanks for answering my questions. It helps to talk with someone more familiar with the issues than I am. Ira -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tuesday 18 August 2009 20:35:22 Michael S. Tsirkin wrote: > On Tue, Aug 18, 2009 at 10:27:52AM -0700, Ira W. Snyder wrote: > > Also, in my case I'd like to boot Linux with my rootfs over NFS. Is > > vhost-net capable of this? > > > > I've had Arnd, BenH, and Grant Likely (and others, privately) contact me > > about devices they are working with that would benefit from something > > like virtio-over-PCI. I'd like to see vhost-net be merged with the > > capability to support my use case. There are plenty of others that would > > benefit, not just myself. yes. > > I'm not sure vhost-net is being written with this kind of future use in > > mind. I'd hate to see it get merged, and then have to change the ABI to > > support physical-device-to-device usage. It would be better to keep > > future use in mind now, rather than try and hack it in later. > > I still need to think your usage over. I am not so sure this fits what > vhost is trying to do. If not, possibly it's better to just have a > separate driver for your device. I now think we need both. virtio-over-PCI does it the right way for its purpose and can be rather generic. It could certainly be extended to support virtio-net on both sides (host and guest) of KVM, but I think it better fits the use where a kernel wants to communicate with some other machine where you normally wouldn't think of using qemu. Vhost-net OTOH is great in the way that it serves as an easy way to move the virtio-net code from qemu into the kernel, without changing its behaviour. It should even be straightforward to do live-migration between hosts with and without it, something that would be much harder with the virtio-over-PCI logic. Also, its internal state is local to the process owning its file descriptor, which makes it much easier to manage permissions and cleanup of its resources. 
Arnd <><
On 08/18/2009 11:59 PM, Ira W. Snyder wrote: > On a non shared-memory system (where the guest's RAM is not just a chunk > of userspace RAM in the host system), virtio's management model seems to > fall apart. Feature negotiation doesn't work as one would expect. > In your case, virtio-net on the main board accesses PCI config space registers to perform the feature negotiation; software on your PCI cards needs to trap these config space accesses and respond to them according to virtio ABI. (There's no real guest on your setup, right? just a kernel running on an x86 system and other kernels running on the PCI cards?) > This does appear to be solved by vbus, though I haven't written a > vbus-over-PCI implementation, so I cannot be completely sure. > Even if virtio-pci doesn't work out for some reason (though it should), you can write your own virtio transport and implement its config space however you like. > I'm not at all clear on how to get feature negotiation to work on a > system like mine. From my study of lguest and kvm (see below) it looks > like userspace will need to be involved, via a miscdevice. > I don't see why. Is the kernel on the PCI cards in full control of all accesses? > Ok. I thought I should at least express my concerns while we're > discussing this, rather than being too late after finding the time to > study the driver. > > Off the top of my head, I would think that transporting userspace > addresses in the ring (for copy_(to|from)_user()) vs. physical addresses > (for DMAEngine) might be a problem. Pinning userspace pages into memory > for DMA is a bit of a pain, though it is possible. > Oh, the ring doesn't transport userspace addresses. It transports guest addresses, and it's up to vhost to do something with them. Currently vhost supports two translation modes: 1. virtio address == host virtual address (using copy_to_user) 2. 
virtio address == offsetted host virtual address (using copy_to_user) The latter mode is used for kvm guests (with multiple offsets, skipping some details). I think you need to add a third mode, virtio address == host physical address (using dma engine). Once you do that, and wire up the signalling, things should work. > There is also the problem of different endianness between host and guest > in virtio-net. The struct virtio_net_hdr (include/linux/virtio_net.h) > defines fields in host byte order. Which totally breaks if the guest has > a different endianness. This is a virtio-net problem though, and is not > transport specific. > Yeah. You'll need to add byteswaps. > I've browsed over both the kvm and lguest code, and it looks like they > each re-invent a mechanism for transporting interrupts between the host > and guest, using eventfd. They both do this by implementing a > miscdevice, which is basically their management interface. > > See drivers/lguest/lguest_user.c (see write() and LHREQ_EVENTFD) and > kvm-kmod-devel-88/x86/kvm_main.c (see kvm_vm_ioctl(), called via > kvm_dev_ioctl()) for how they hook up eventfd's. > > I can now imagine how two userspace programs (host and guest) could work > together to implement a management interface, including hotplug of > devices, etc. Of course, this would basically reinvent the vbus > management interface into a specific driver. > You don't need anything in the guest userspace (virtio-net) side. > I think this is partly what Greg is trying to abstract out into generic > code. I haven't studied the actual data transport mechanisms in vbus, > though I have studied virtio's transport mechanism. I think a generic > management interface for virtio might be a good thing to consider, > because it seems there are at least two implementations already: kvm and > lguest. > Management code in the kernel doesn't really help unless you plan to manage things with echo and cat.
On 08/19/2009 12:26 AM, Avi Kivity wrote: >> >> Off the top of my head, I would think that transporting userspace >> addresses in the ring (for copy_(to|from)_user()) vs. physical addresses >> (for DMAEngine) might be a problem. Pinning userspace pages into memory >> for DMA is a bit of a pain, though it is possible. > > > Oh, the ring doesn't transport userspace addresses. It transports > guest addresses, and it's up to vhost to do something with them. > > Currently vhost supports two translation modes: > > 1. virtio address == host virtual address (using copy_to_user) > 2. virtio address == offsetted host virtual address (using copy_to_user) > > The latter mode is used for kvm guests (with multiple offsets, > skipping some details). > > I think you need to add a third mode, virtio address == host physical > address (using dma engine). Once you do that, and wire up the > signalling, things should work. You don't need in fact a third mode. You can mmap the x86 address space into your ppc userspace and use the second mode. All you need then is the dma engine glue and byte swapping.
On Tue, Aug 18, 2009 at 11:57:48PM +0300, Michael S. Tsirkin wrote: > On Tue, Aug 18, 2009 at 08:53:29AM -0700, Ira W. Snyder wrote: > > I think Greg is referring to something like my virtio-over-PCI patch. > > I'm pretty sure that vhost is completely useless for my situation. I'd > > like to see vhost work for my use, so I'll try to explain what I'm > > doing. > > > > I've got a system where I have about 20 computers connected via PCI. The > > PCI master is a normal x86 system, and the PCI agents are PowerPC > > systems. The PCI agents act just like any other PCI card, except they > > are running Linux, and have their own RAM and peripherals. > > > > I wrote a custom driver which imitated a network interface and a serial > > port. I tried to push it towards mainline, and DavidM rejected it, with > > the argument, "use virtio, don't add another virtualization layer to the > > kernel." I think he has a decent argument, so I wrote virtio-over-PCI. > > > > Now, there are some things about virtio that don't work over PCI. > > Mainly, memory is not truly shared. It is extremely slow to access > > memory that is "far away", meaning "across the PCI bus." This can be > > worked around by using a DMA controller to transfer all data, along with > > an intelligent scheme to perform only writes across the bus. If you're > > careful, reads are never needed. > > > > So, in my system, copy_(to|from)_user() is completely wrong. > > There is no userspace, only a physical system. > > Can guests do DMA to random host memory? Or is there some kind of IOMMU > and DMA API involved? If the later, then note that you'll still need > some kind of driver for your device. The question we need to ask > ourselves then is whether this driver can reuse bits from vhost. > Mostly. All of my systems are 32 bit (both x86 and ppc). From the view of the ppc (and DMAEngine), I can view the first 1GB of host memory. This limited view is due to address space limitations on the ppc. 
The view of PCI memory must live somewhere in the ppc address space, along with the ppc's SDRAM, flash, and other peripherals. Since this is a 32bit processor, I only have 4GB of address space to work with. The PCI address space could be up to 4GB in size. If I tried to allow the ppc boards to view all 4GB of PCI address space, then they would have no address space left for their onboard SDRAM, etc. Hopefully that makes sense. I use dma_set_mask(dev, DMA_BIT_MASK(30)) on the host system to ensure that when dma_map_sg() is called, it returns addresses that can be accessed directly by the device. The DMAEngine can access any local (ppc) memory without any restriction. I have used the Linux DMAEngine API (include/linux/dmaengine.h) to handle all data transfer across the PCI bus. The Intel I/OAT (and many others) use the same API. > > In fact, because normal x86 computers > > do not have DMA controllers, the host system doesn't actually handle any > > data transfer! > > Is it true that PPC has to initiate all DMA then? How do you > manage not to do DMA reads then? > Yes, the ppc initiates all DMA. It handles all data transfer (both reads and writes) across the PCI bus, for speed reasons. A CPU cannot create burst transactions on the PCI bus. This is the reason that most (all?) network cards (as a familiar example) use DMA to transfer packet contents into RAM. Sorry if I made a confusing statement ("no reads are necessary") earlier. What I meant to say was: If you are very careful, it is not necessary for the CPU to do any reads over the PCI bus to maintain state. Writes are the only necessary CPU-initiated transaction. I implemented this in my virtio-over-PCI patch, copying as much as possible from the virtio vring structure. The descriptors in the rings are only changed by one "side" of the connection, therefore they can be cached as they are written (via the CPU) across the PCI bus, with the knowledge that both sides will have a consistent view. 
I'm sorry, this is hard to explain via email. It is much easier in a room with a whiteboard. :) > > I used virtio-net in both the guest and host systems in my example > > virtio-over-PCI patch, and succeeded in getting them to communicate. > > However, the lack of any setup interface means that the devices must be > > hardcoded into both drivers, when the decision could be up to userspace. > > I think this is a problem that vbus could solve. > > What you describe (passing setup from host to guest) seems like > a feature that guest devices need to support. It seems unlikely that > vbus, being a transport layer, can address this. > I think I explained this poorly as well. Virtio needs two things to function: 1) a set of descriptor rings (1 or more) 2) a way to kick each ring. With the amount of space available in the ppc's PCI BAR's (which point at a small chunk of SDRAM), I could potentially make ~6 virtqueues + 6 kick interrupts available. Right now, my virtio-over-PCI driver hardcoded the first and second virtqueues to be for virtio-net only, and nothing else. What if the user wanted 2 virtio-console and 2 virtio-net? They'd have to change the driver, because virtio doesn't have much of a management interface. Vbus does have a management interface: you create devices via configfs. The vbus-connector on the guest notices new devices, and triggers hotplug events on the guest. As far as I understand it, vbus is a bus model, not just a transport layer. > > > > For my own selfish reasons (I don't want to maintain an out-of-tree > > driver) I'd like to see *something* useful in mainline Linux. I'm happy > > to answer questions about my setup, just ask. > > > > Ira > > Thanks Ira, I'll think about it. > A couple of questions: > - Could you please describe what kind of communication needs to happen? > - I'm not familiar with DMA engine in question. 
I'm guessing it's the > usual thing: in/out buffers need to be kernel memory, interface is > asynchronous, small limited number of outstanding requests? Is there a > userspace interface for it and if yes how does it work? > The DMA engine can handle transferring from any two physical addresses, as seen from the ppc address map. The things of interest are: 1) ppc sdram 2) host sdram (first 1GB only, explained above) The Linux DMAEngine API allows you to do sync or async requests with callbacks, and an unlimited number of outstanding requests (until you exhaust memory). The interface is in-kernel only. See include/linux/dmaengine.h for the details, but the most important part is dma_async_memcpy_buf_to_buf(), which will copy between two kernel virtual addresses. It is trivial to code up an implementation which will transfer between physical addresses instead, which I found much more convenient in my code. I'm happy to provide the function if/when needed. Ira
On Wed, Aug 19, 2009 at 12:26:23AM +0300, Avi Kivity wrote: > On 08/18/2009 11:59 PM, Ira W. Snyder wrote: >> On a non shared-memory system (where the guest's RAM is not just a chunk >> of userspace RAM in the host system), virtio's management model seems to >> fall apart. Feature negotiation doesn't work as one would expect. >> > > In your case, virtio-net on the main board accesses PCI config space > registers to perform the feature negotiation; software on your PCI cards > needs to trap these config space accesses and respond to them according > to virtio ABI. > Is this "real PCI" (physical hardware) or "fake PCI" (software PCI emulation) that you are describing? The host (x86, PCI master) must use "real PCI" to actually configure the boards, enable bus mastering, etc. Just like any other PCI device, such as a network card. On the guests (ppc, PCI agents) I cannot add/change PCI functions (the last .[0-9] in the PCI address) nor can I change PCI BAR's once the board has started. I'm pretty sure that would violate the PCI spec, since the PCI master would need to re-scan the bus, and re-assign addresses, which is a task for the BIOS. > (There's no real guest on your setup, right? just a kernel running on > an x86 system and other kernels running on the PCI cards?) > Yes, the x86 (PCI master) runs Linux (booted via PXELinux). The ppc's (PCI agents) also run Linux (booted via U-Boot). They are independent Linux systems, with a physical PCI interconnect. The x86 has CONFIG_PCI=y, however the ppc's have CONFIG_PCI=n. Linux's PCI stack does bad things as a PCI agent. It always assumes it is a PCI master. It is possible for me to enable CONFIG_PCI=y on the ppc's by removing the PCI bus from their list of devices provided by OpenFirmware. They can not access PCI via normal methods. PCI drivers cannot work on the ppc's, because Linux assumes it is a PCI master. To the best of my knowledge, I cannot trap configuration space accesses on the PCI agents. 
I haven't needed that for anything I've done thus far. >> This does appear to be solved by vbus, though I haven't written a >> vbus-over-PCI implementation, so I cannot be completely sure. >> > > Even if virtio-pci doesn't work out for some reason (though it should), > you can write your own virtio transport and implement its config space > however you like. > This is what I did with virtio-over-PCI. The way virtio-net negotiates features makes this work non-intuitively. >> I'm not at all clear on how to get feature negotiation to work on a >> system like mine. From my study of lguest and kvm (see below) it looks >> like userspace will need to be involved, via a miscdevice. >> > > I don't see why. Is the kernel on the PCI cards in full control of all > accesses? > I'm not sure what you mean by this. Could you be more specific? This is a normal, unmodified vanilla Linux kernel running on the PCI agents. >> Ok. I thought I should at least express my concerns while we're >> discussing this, rather than being too late after finding the time to >> study the driver. >> >> Off the top of my head, I would think that transporting userspace >> addresses in the ring (for copy_(to|from)_user()) vs. physical addresses >> (for DMAEngine) might be a problem. Pinning userspace pages into memory >> for DMA is a bit of a pain, though it is possible. >> > > Oh, the ring doesn't transport userspace addresses. It transports guest > addresses, and it's up to vhost to do something with them. > > Currently vhost supports two translation modes: > > 1. virtio address == host virtual address (using copy_to_user) > 2. virtio address == offsetted host virtual address (using copy_to_user) > > The latter mode is used for kvm guests (with multiple offsets, skipping > some details). > > I think you need to add a third mode, virtio address == host physical > address (using dma engine). Once you do that, and wire up the > signalling, things should work. > Ok. 
In my virtio-over-PCI patch, I hooked two virtio-net's together. I wrote an algorithm to pair the tx/rx queues together. Since virtio-net pre-fills its rx queues with buffers, I was able to use the DMA engine to copy from the tx queue into the pre-allocated memory in the rx queue. I have an intuitive idea about how I think vhost-net works in this case. >> There is also the problem of different endianness between host and guest >> in virtio-net. The struct virtio_net_hdr (include/linux/virtio_net.h) >> defines fields in host byte order. Which totally breaks if the guest has >> a different endianness. This is a virtio-net problem though, and is not >> transport specific. >> > > Yeah. You'll need to add byteswaps. > I wonder if Rusty would accept a new feature: VIRTIO_F_NET_LITTLE_ENDIAN, which would allow the virtio-net driver to use LE for all of its multi-byte fields. I don't think the transport should have to care about the endianness. >> I've browsed over both the kvm and lguest code, and it looks like they >> each re-invent a mechanism for transporting interrupts between the host >> and guest, using eventfd. They both do this by implementing a >> miscdevice, which is basically their management interface. >> >> See drivers/lguest/lguest_user.c (see write() and LHREQ_EVENTFD) and >> kvm-kmod-devel-88/x86/kvm_main.c (see kvm_vm_ioctl(), called via >> kvm_dev_ioctl()) for how they hook up eventfd's. >> >> I can now imagine how two userspace programs (host and guest) could work >> together to implement a management interface, including hotplug of >> devices, etc. Of course, this would basically reinvent the vbus >> management interface into a specific driver. >> > > You don't need anything in the guest userspace (virtio-net) side. > >> I think this is partly what Greg is trying to abstract out into generic >> code. I haven't studied the actual data transport mechanisms in vbus, >> though I have studied virtio's transport mechanism. 
I think a generic >> management interface for virtio might be a good thing to consider, >> because it seems there are at least two implementations already: kvm and >> lguest. >> > > Management code in the kernel doesn't really help unless you plan to > manage things with echo and cat. > True. It's slowpath setup, so I don't care how fast it is. For reasons outside my control, the x86 (PCI master) is running a RHEL5 system. This means glibc-2.5, which doesn't have eventfd support, AFAIK. I could try and push for an upgrade. This obviously makes cat/echo really nice, it doesn't depend on glibc, only the kernel version. I don't give much weight to the above, because I can use the eventfd syscalls directly, without glibc support. It is just more painful. Ira
On Wed, Aug 19, 2009 at 01:06:45AM +0300, Avi Kivity wrote: > On 08/19/2009 12:26 AM, Avi Kivity wrote: >>> >>> Off the top of my head, I would think that transporting userspace >>> addresses in the ring (for copy_(to|from)_user()) vs. physical addresses >>> (for DMAEngine) might be a problem. Pinning userspace pages into memory >>> for DMA is a bit of a pain, though it is possible. >> >> >> Oh, the ring doesn't transport userspace addresses. It transports >> guest addresses, and it's up to vhost to do something with them. >> >> Currently vhost supports two translation modes: >> >> 1. virtio address == host virtual address (using copy_to_user) >> 2. virtio address == offsetted host virtual address (using copy_to_user) >> >> The latter mode is used for kvm guests (with multiple offsets, >> skipping some details). >> >> I think you need to add a third mode, virtio address == host physical >> address (using dma engine). Once you do that, and wire up the >> signalling, things should work. > > > You don't need in fact a third mode. You can mmap the x86 address space > into your ppc userspace and use the second mode. All you need then is > the dma engine glue and byte swapping. > Hmm, I'll have to think about that. The ppc is a 32-bit processor, so it has 4GB of address space for everything, including PCI, SDRAM, flash memory, and all other peripherals. This is exactly like 32bit x86, where you cannot have a PCI card that exposes a 4GB PCI BAR. The system would have no address space left for its own SDRAM. On my x86 computers, I only have 1GB of physical RAM, and so the ppc's have plenty of room in their address spaces to map the entire x86 RAM into their own address space. That is exactly what I do now. Accesses to ppc physical address 0x80000000 "magically" hit x86 physical address 0x0. 
Ira
Ingo Molnar wrote: > * Gregory Haskins <gregory.haskins@gmail.com> wrote: > >>> You haven't convinced me that your ideas are worth the effort >>> of abandoning virtio/pci or maintaining both venet/vbus and >>> virtio/pci. >> With all due respect, I didnt ask you do to anything, especially >> not abandon something you are happy with. >> >> All I did was push guest drivers to LKML. The code in question >> is independent of KVM, and its proven to improve the experience >> of using Linux as a platform. There are people interested in >> using them (by virtue of the number of people that have signed up >> for the AlacrityVM list, and have mailed me privately about this >> work). > > This thread started because i asked you about your technical > arguments why we'd want vbus instead of virtio. (You mean vbus vs pci, right? virtio works fine, is untouched, and is out-of-scope here) Right, and I do believe I answered your questions. Do you feel as though this was not a satisfactory response? > Your answer above > now basically boils down to: "because I want it so, why dont you > leave me alone". Well, with all due respect, please do not put words in my mouth. This is not what I am saying at all. What I *am* saying is: fact: this thread is about linux guest drivers to support vbus fact: these drivers do not touch kvm code. fact: these drivers do not force kvm to alter its operation in any way. fact: these drivers do not alter ABIs that KVM currently supports. Therefore, all this talk about "abandoning", "supporting", and "changing" things in KVM is premature, irrelevant, and/or FUD. No one proposed such changes, so I am highlighting this fact to bring the thread back on topic. That KVM talk is merely a distraction at this point in time. > > What you are doing here is to in essence to fork KVM, No, that is incorrect. What I am doing here is a downstream development point for the integration of KVM and vbus. 
It's akin to kvm.git or tip.git to develop a subsystem intended for eventual inclusion upstream. If and when the code goes upstream in a manner acceptable to all parties involved, and AlacrityVM exceeds its utility as a separate project, I will _gladly_ dissolve it and migrate to use upstream KVM instead. As stated on the project wiki: "It is a goal of AlacrityVM to work towards upstream acceptance of the project on a timeline that suits the community. In the meantime, this wiki will serve as the central coordination point for development and discussion of the technology" (citation: http://developer.novell.com/wiki/index.php/AlacrityVM) And I meant it when I said it. Until then, the project is a much more efficient way for us (the vbus developers) to work together than pointing people at my patch series posted to kvm@vger. I tried that way first. It sucked, and didn't work. Users were having trouble patching the various pieces, building, etc. Now I can offer a complete solution from a central point, with all the proper pieces in place to play around with it. Ultimately, it is up to upstream to decide if this is to become merged or remain out of tree forever as a "fork". Not me. I will continue to make every effort to find common ground with my goals coincident with the blessing of upstream, as I have been from the beginning. Now I have a more official forum to do it in. > regardless of > the technical counter arguments given against such a fork and > regardless of the ample opportunity given to you to demostrate the > technical advantages of your code. (in which case KVM would happily > migrate to your code) In an ideal world, perhaps. Avi and I currently have a fundamental disagreement about the best way to do PV. He sees the world through PCI glasses, and I don't. Despite attempts on both sides to rectify this disagreement, we currently do not see eye to eye on every front. This doesn't mean he is right, and I am wrong per se. It just means we disagree. Period. 
Avi is a sharp guy, and I respect him. But upstream KVM doesn't have a corner on "correct" ;) The community as a whole will ultimately decide if my ideas live or die, wouldn't you agree? Avi can correct me if I am wrong, but what we _do_ agree on is that core KVM doesn't need to be directly involved in this vbus (or vhost) discussion, per se. It just wants to have the hooks to support various PV solutions (such as irqfd/ioeventfd), and vbus is one such solution. > > We all love faster code and better management interfaces and tons > of your prior patches got accepted by Avi. This time you didnt even > _try_ to improve virtio. I'm sorry, but you are mistaken: http://lkml.indiana.edu/hypermail/linux/kernel/0904.2/02443.html > It's not like you posted a lot of virtio > patches which were not applied. You didnt even try and you need to > try _much_ harder than that before forking a project. I really do not think you are in a position to say when someone can or cannot fork a project, so please do not try to lecture on that. Perhaps you could offer advice on when someone, in your opinion, *should* or *should not* fork, because that would be interesting to hear. You are also wrong to say that I didn't try to avoid creating a downstream effort first. I believe the public record of the mailing lists will back me up that I tried politely pushing this directly through kvm first. It was only after Avi recently informed me that they would be building their own version of an in-kernel backend in lieu of working with me to adapt vbus to their needs that I decided to put my own project together. What should I have done otherwise, in your opinion? > > And fragmentation matters quite a bit. To Linux users, developers, > administrators, packagers it's a big deal whether two overlapping > pieces of functionality for the same thing exist within the same > kernel. So the only thing that could be construed as overlapping here is venet vs virtio-net. 
If I dropped the contentious venet and focused on making a virtio-net backend that we can all re-use, do you see that as a path of compromise here? > The kernel is not an anarchy where everyone can have their > own sys_fork() version or their own sys_write() version. Would you > want to have two dozen read() variants, sys_read_oracle() and a > sys_read_db2()? No, and I am not advocating that either. > > I certainly dont want that. Instead we (at great expense and work) > try to reach the best technical solution. This is all I want, as well. > That means we throw away > inferior code and adopt the better one. (with a reasonable > migration period) > > You are ignoring that principle with hand-waving about 'the > community wants this'. I call it like I see it. I get private emails all the time encouraging my efforts and asking about the project. I'm sorry if you see this as hand-waving. Perhaps the people involved will become more vocal in the community to back me up, perhaps not. Time will tell. > I can assure you, users _DONT WANT_ split > interfaces and incompatible drivers for the same thing. They want > stuff that works well. And I can respect that, and am trying to provide that. > > If the community wants this then why cannot you convince one of the > most prominent representatives of that community, the KVM > developers? It's a chicken and egg at times. Perhaps the KVM developers do not have the motivation or time to properly consider such a proposal _until_ the community presents its demand. And sometimes you cannot build demand unless you have an easy way to use the idea, such as a project to back it. Since vbus+kvm has many moving parts (guest side, host-side, userspace-side, etc), it's difficult to use as a patch series pulled in from a mailing list. This is the role of the AlacrityVM project. Make it easy to use and develop. If it draws a community, perhaps KVM will reconsider its stance. If it does not draw a community, it will naturally die. End of story. 
But please do not confuse one particular group's opinion with the sole validation of an idea, no matter how "prominent" that group may be. There are numerous reasons why a group may hold an opinion that have nothing to do with the actual technical merits of the idea, or the community demand for it.

> Furthermore, 99% of your work is KVM

Actually, no. Almost none of it is. I think there are about 2-3 patches in the series that touch KVM; the rest are all original (and primarily stand-alone) code. AlacrityVM is the application of kvm and vbus (and, of course, Linux) together as a complete unit, but I do not try to hide this relationship.

By your argument, KVM is 99% QEMU+Linux. ;)

> why don't you respect that work by not forking it?

Lighten up on the fork FUD, please. It's counterproductive.

Kind Regards,
-Greg
On 08/19/2009 07:27 AM, Gregory Haskins wrote:

>> This thread started because i asked you about your technical
>> arguments why we'd want vbus instead of virtio.
>
> (You mean vbus vs pci, right? virtio works fine, is untouched, and is
> out-of-scope here)

I guess he meant venet vs virtio-net. Without venet, vbus is currently userless.

> Right, and I do believe I answered your questions. Do you feel as
> though this was not a satisfactory response?

Others and I have shown you it's wrong. There's no inherent performance problem in pci. The vbus approach has inherent problems (the biggest of which is compatibility, the second manageability).

>> Your answer above
>> now basically boils down to: "because I want it so, why dont you
>> leave me alone".
>
> Well, with all due respect, please do not put words in my mouth. This
> is not what I am saying at all.
>
> What I *am* saying is:
>
> fact: this thread is about linux guest drivers to support vbus
>
> fact: these drivers do not touch kvm code.
>
> fact: these drivers do not force kvm to alter its operation in any way.
>
> fact: these drivers do not alter ABIs that KVM currently supports.
>
> Therefore, all this talk about "abandoning", "supporting", and
> "changing" things in KVM is premature, irrelevant, and/or FUD. No one
> proposed such changes, so I am highlighting this fact to bring the
> thread back on topic. That KVM talk is merely a distraction at this
> point in time.

s/kvm/kvm stack/. virtio/pci is part of the kvm stack, even if it is not part of kvm itself.

If vbus/venet were to be merged, users and developers would have to choose one or the other. That's the fragmentation I'm worried about. And you can prefix that with "fact:" as well.

>> We all love faster code and better management interfaces and tons
>> of your prior patches got accepted by Avi. This time you didnt even
>> _try_ to improve virtio.
>>
> I'm sorry, but you are mistaken:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0904.2/02443.html

That does nothing to improve virtio. Existing guests (Linux and Windows) which support virtio will cease to work if the host moves to vbus-virtio. Existing hosts (running virtio-pci) won't be able to talk to newer guests running virtio-vbus. The patch doesn't improve performance without the entire vbus stack in the host kernel and a vbus-virtio-net-host host kernel driver.

Perhaps if you posted everything needed to make vbus-virtio work and perform, we could compare that to vhost-net and you'd see another reason why vhost-net is the better approach.

> You are also wrong to say that I didn't try to avoid creating a
> downstream effort first. I believe the public record of the mailing
> lists will back me up that I tried politely pushing this directly through
> kvm first. It was only after Avi recently informed me that they would
> be building their own version of an in-kernel backend in lieu of working
> with me to adapt vbus to their needs that I decided to put my own
> project together.

There's no way we can adapt vbus to our needs. Don't you think we'd have preferred that to writing our own? The current virtio-net issues are hurting us. Our needs are compatibility, performance, and manageability. vbus fails all three, your impressive venet numbers notwithstanding.

> What should I have done otherwise, in your opinion?

You could come up with uses where vbus truly is superior to virtio/pci/whatever (not words about etch constraints). Showing some of those non-virt uses, for example. The fact that your only user duplicates existing functionality doesn't help.

>> And fragmentation matters quite a bit. To Linux users, developers,
>> administrators, packagers it's a big deal whether two overlapping
>> pieces of functionality for the same thing exist within the same
>> kernel.
>> > So the only thing that could be construed as overlapping here is venet > vs virtio-net. If I dropped the contentious venet and focused on making > a virtio-net backend that we can all re-use, do you see that as a path > of compromise here? > That's a step in the right direction. >> I certainly dont want that. Instead we (at great expense and work) >> try to reach the best technical solution. >> > This is all I want, as well. > Note whenever I mention migration, large guests, or Windows you say these are not your design requirements. The best technical solution will have to consider those. >> If the community wants this then why cannot you convince one of the >> most prominent representatives of that community, the KVM >> developers? >> > Its a chicken and egg at times. Perhaps the KVM developers do not have > the motivation or time to properly consider such a proposal _until_ the > community presents its demand. I've spent quite a lot of time arguing with you, no doubt influenced by the fact that you can write a lot faster than I can read. >> Furthermore, 99% of your work is KVM >> > Actually, no. Almost none of it is. I think there are about 2-3 > patches in the series that touch KVM, the rest are all original (and > primarily stand-alone code). AlacrityVM is the application of kvm and > vbus (and, of course, Linux) together as a complete unit, but I do not > try to hide this relationship. > > By your argument, KVM is 99% QEMU+Linux. ;) > That's one of the kvm strong points...
On 08/19/2009 03:44 AM, Ira W. Snyder wrote: >> You don't need in fact a third mode. You can mmap the x86 address space >> into your ppc userspace and use the second mode. All you need then is >> the dma engine glue and byte swapping. >> >> > Hmm, I'll have to think about that. > > The ppc is a 32-bit processor, so it has 4GB of address space for > everything, including PCI, SDRAM, flash memory, and all other > peripherals. > > This is exactly like 32bit x86, where you cannot have a PCI card that > exposes a 4GB PCI BAR. The system would have no address space left for > its own SDRAM. > (you actually can, since x86 has a 36-40 bit physical address space even with a 32-bit virtual address space, but that doesn't help you). > On my x86 computers, I only have 1GB of physical RAM, and so the ppc's > have plenty of room in their address spaces to map the entire x86 RAM > into their own address space. That is exactly what I do now. Accesses to > ppc physical address 0x80000000 "magically" hit x86 physical address > 0x0. > So if you mmap() that, you could work with virtual addresses. It may be more efficient to work with physical addresses directly though.
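Avi's suggestion — mmap the x86 address space into ppc userspace and work through that window — can be illustrated with a small userspace sketch. The file-backed window below is only a stand-in for the real mapping (which would come from something like a PCI resource file), and `map_window`/`window_demo` are invented names for illustration:

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a window of another address space (here: any file descriptor
 * standing in for a PCI BAR / shared-RAM region) into our own. */
static void *map_window(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Write through one mapping, observe through another: the two
 * mappings model the ppc-side and x86-side views of the same memory. */
static int window_demo(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)len) < 0)
        return -1;

    volatile uint32_t *ppc_view = map_window(fd, len);
    volatile uint32_t *x86_view = map_window(fd, len);
    if (!ppc_view || !x86_view)
        return -1;

    ppc_view[0] = 0xcafef00d;             /* "device" side writes... */
    int ok = (x86_view[0] == 0xcafef00d); /* ..."host" side sees it  */

    munmap((void *)ppc_view, len);
    munmap((void *)x86_view, len);
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

In the real setup the file descriptor would refer to the board's shared-memory window rather than a scratch file; the access pattern through the mapping is the same.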
Michael S. Tsirkin wrote:
> On Tue, Aug 18, 2009 at 11:51:59AM -0400, Gregory Haskins wrote:
>>> It's not laughably trivial when you try to support the full feature set
>>> of kvm (for example, live migration will require dirty memory tracking,
>>> and exporting all state stored in the kernel to userspace).
>> Doesn't vhost suffer from the same issue? If not, could I also apply
>> the same technique to support live-migration in vbus?
>
> vhost does this by switching to userspace for the duration of live
> migration. venet could do this I guess, but you'd need to write a
> userspace implementation. vhost just reuses existing userspace virtio.
>
>> With all due respect, I didn't ask you to do anything, especially not
>> abandon something you are happy with.
>>
>> All I did was push guest drivers to LKML. The code in question is
>> independent of KVM, and it's proven to improve the experience of using
>> Linux as a platform. There are people interested in using them (by
>> virtue of the number of people that have signed up for the AlacrityVM
>> list, and have mailed me privately about this work).
>>
>> So where is the problem here?
>
> If virtio net in guest could be improved instead, everyone would
> benefit.

So if I whip up a virtio-net backend for vbus with a PCI-compliant connector, are you happy?

> I am doing this, and I wish more people would join. Instead,
> you change ABI in an incompatible way.

Only by choice of my particular connector. The ABI is a function of the connector design. So one such model is to terminate the connector in qemu, and surface the resulting objects as PCI devices. I choose not to use this particular design for the connector that I am pushing upstream, because I am of the opinion that I can do better by terminating it in the guest directly as a PV-optimized bus. However, both connectors can theoretically coexist peacefully.
The advantage that this would give us is that one in-kernel virtio-net model could be surfaced to all vbus users (pci or otherwise), a group which will hopefully be growing over time. This would have gained vbus a virtio-net backend, and it would have saved you from re-inventing the various abstractions and management interfaces that vbus has in place.

> So now, there's no single place to
> work on kvm networking performance. Now, it would all be understandable
> if the reason was e.g. better performance. But you say yourself it
> isn't.

Actually, I really didn't say that. Your patch hasn't been performance-proven as far as I know, but I just gave you the benefit of the doubt. What I said was that for a limited type of benchmark, it *may* get similar numbers if you implemented vhost optimally. For others (for instance, when we can start to take advantage of priority, or scaling the number of interfaces) it may not, since my proposed connector was designed to optimize this over raw PCI facilities. But I digress. Please post results when you have numbers, as I had to give up my 10GE rig in the lab. I suspect you will have performance issues until you at least address GSO, but you may already be there by now.

Kind Regards,
-Greg
Arnd Bergmann wrote:
> On Tuesday 18 August 2009, Gregory Haskins wrote:
>> Avi Kivity wrote:
>>> On 08/17/2009 10:33 PM, Gregory Haskins wrote:
>>>
>>> One point of contention is that this is all managementy stuff and should
>>> be kept out of the host kernel. Exposing shared memory, interrupts, and
>>> guest hypercalls can all be easily done from userspace (as virtio
>>> demonstrates). True, some devices need kernel acceleration, but that's
>>> no reason to put everything into the host kernel.
>> See my last reply to Anthony. My two points here are that:
>>
>> a) having it in-kernel makes it a complete subsystem, which perhaps has
>> diminished value in kvm, but adds value in most other places that we are
>> looking to use vbus.
>>
>> b) the in-kernel code is being overstated as "complex". We are not
>> talking about your typical virt thing, like an emulated ICH/PCI chipset.
>> It's really a simple list of devices with a handful of attributes. They
>> are managed using established linux interfaces, like sysfs/configfs.
>
> IMHO the complexity of the code is not so much of a problem. What I
> see as a problem is the complexity of a kernel/userspace interface that
> manages the devices with global state.
>
> One of the greatest features of Michael's vhost driver is that all
> the state is associated with open file descriptors that either exist
> already or belong to the vhost_net misc device. When a process dies,
> all the file descriptors get closed and the whole state is cleaned
> up implicitly.
>
> AFAICT, you can't do that with the vbus host model.

It should work the same. When a driver opens a vbus device, it calls "interface->connect()" and gets back a "connection" object. The connection->release() method is invoked when the driver "goes away", which would include the scenario you present. This gives the device-model the opportunity to clean up in the same way.

>>> What performance oriented items have been left unaddressed?
>> Well, the interrupt model to name one. > > The performance aspects of your interrupt model are independent > of the vbus proxy, or at least they should be. Let's assume for > now that your event notification mechanism gives significant > performance improvements (which we can't measure independently > right now). I don't see a reason why we could not get the > same performance out of a paravirtual interrupt controller > that uses the same method, and it would be straightforward > to implement one and use that together with all the existing > emulated PCI devices and virtio devices including vhost_net. Agreed. I proposed this before and Avi rejected the idea. -Greg
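Arnd's file-descriptor lifecycle point, and Greg's interface->connect()/connection->release() answer above, both amount to tying all device state to an object whose release callback tears everything down. A minimal userspace model of that lifecycle (the names `vbus_connect` and `conn_release` are invented; this is not the actual vbus or vhost API):

```c
#include <stdlib.h>

/* Model of the lifecycle described above: connect() hands back a
 * connection object, and release() tears all its state down --
 * analogous to a file descriptor's ->release() running implicitly
 * when the owning process exits. */
struct connection {
    void (*release)(struct connection *self);
    int *cleaned_up;   /* flag so a caller can observe the teardown */
};

static void conn_release(struct connection *self)
{
    *self->cleaned_up = 1;  /* free rings, unpin memory, etc. */
    free(self);
}

/* interface->connect() analogue */
static struct connection *vbus_connect(int *cleaned_up_flag)
{
    struct connection *c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    c->release = conn_release;
    c->cleaned_up = cleaned_up_flag;
    *cleaned_up_flag = 0;
    return c;
}
```

The design point being argued is only *who* invokes release(): vhost gets it for free from fd close, while vbus has to wire its connection object to an equivalent trigger.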
On 08/19/2009 03:38 AM, Ira W. Snyder wrote: > On Wed, Aug 19, 2009 at 12:26:23AM +0300, Avi Kivity wrote: > >> On 08/18/2009 11:59 PM, Ira W. Snyder wrote: >> >>> On a non shared-memory system (where the guest's RAM is not just a chunk >>> of userspace RAM in the host system), virtio's management model seems to >>> fall apart. Feature negotiation doesn't work as one would expect. >>> >>> >> In your case, virtio-net on the main board accesses PCI config space >> registers to perform the feature negotiation; software on your PCI cards >> needs to trap these config space accesses and respond to them according >> to virtio ABI. >> >> > Is this "real PCI" (physical hardware) or "fake PCI" (software PCI > emulation) that you are describing? > > Real PCI. > The host (x86, PCI master) must use "real PCI" to actually configure the > boards, enable bus mastering, etc. Just like any other PCI device, such > as a network card. > > On the guests (ppc, PCI agents) I cannot add/change PCI functions (the > last .[0-9] in the PCI address) nor can I change PCI BAR's once the > board has started. I'm pretty sure that would violate the PCI spec, > since the PCI master would need to re-scan the bus, and re-assign > addresses, which is a task for the BIOS. > Yes. Can the boards respond to PCI config space cycles coming from the host, or is the config space implemented in silicon and immutable? (reading on, I see the answer is no). virtio-pci uses the PCI config space to configure the hardware. >> (There's no real guest on your setup, right? just a kernel running on >> and x86 system and other kernels running on the PCI cards?) >> >> > Yes, the x86 (PCI master) runs Linux (booted via PXELinux). The ppc's > (PCI agents) also run Linux (booted via U-Boot). They are independent > Linux systems, with a physical PCI interconnect. > > The x86 has CONFIG_PCI=y, however the ppc's have CONFIG_PCI=n. Linux's > PCI stack does bad things as a PCI agent. It always assumes it is a PCI > master. 
> > It is possible for me to enable CONFIG_PCI=y on the ppc's by removing > the PCI bus from their list of devices provided by OpenFirmware. They > can not access PCI via normal methods. PCI drivers cannot work on the > ppc's, because Linux assumes it is a PCI master. > > To the best of my knowledge, I cannot trap configuration space accesses > on the PCI agents. I haven't needed that for anything I've done thus > far. > > Well, if you can't do that, you can't use virtio-pci on the host. You'll need another virtio transport (equivalent to "fake pci" you mentioned above). >>> This does appear to be solved by vbus, though I haven't written a >>> vbus-over-PCI implementation, so I cannot be completely sure. >>> >>> >> Even if virtio-pci doesn't work out for some reason (though it should), >> you can write your own virtio transport and implement its config space >> however you like. >> >> > This is what I did with virtio-over-PCI. The way virtio-net negotiates > features makes this work non-intuitively. > I think you tried to take two virtio-nets and make them talk together? That won't work. You need the code from qemu to talk to virtio-net config space, and vhost-net to pump the rings. >>> I'm not at all clear on how to get feature negotiation to work on a >>> system like mine. From my study of lguest and kvm (see below) it looks >>> like userspace will need to be involved, via a miscdevice. >>> >>> >> I don't see why. Is the kernel on the PCI cards in full control of all >> accesses? >> >> > I'm not sure what you mean by this. Could you be more specific? This is > a normal, unmodified vanilla Linux kernel running on the PCI agents. > I meant, does board software implement the config space accesses issued from the host, and it seems the answer is no. > In my virtio-over-PCI patch, I hooked two virtio-net's together. I wrote > an algorithm to pair the tx/rx queues together. 
Since virtio-net > pre-fills its rx queues with buffers, I was able to use the DMA engine > to copy from the tx queue into the pre-allocated memory in the rx queue. > > Please find a name other than virtio-over-PCI since it conflicts with virtio-pci. You're tunnelling virtio config cycles (which are usually done on pci config cycles) on a new protocol which is itself tunnelled over PCI shared memory. >>> >>> >> Yeah. You'll need to add byteswaps. >> >> > I wonder if Rusty would accept a new feature: > VIRTIO_F_NET_LITTLE_ENDIAN, which would allow the virtio-net driver to > use LE for all of it's multi-byte fields. > > I don't think the transport should have to care about the endianness. > Given this is not mainstream use, it would have to have zero impact when configured out. > True. It's slowpath setup, so I don't care how fast it is. For reasons > outside my control, the x86 (PCI master) is running a RHEL5 system. This > means glibc-2.5, which doesn't have eventfd support, AFAIK. I could try > and push for an upgrade. This obviously makes cat/echo really nice, it > doesn't depend on glibc, only the kernel version. > > I don't give much weight to the above, because I can use the eventfd > syscalls directly, without glibc support. It is just more painful. > The x86 side only needs to run virtio-net, which is present in RHEL 5.3. You'd only need to run virtio-tunnel or however it's called. All the eventfd magic takes place on the PCI agents.
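Ira's remark about calling the eventfd syscalls directly, without glibc support, looks roughly like this on Linux (this assumes `__NR_eventfd2`, available since kernel 2.6.27; the `raw_eventfd*` wrapper names are invented for the sketch):

```c
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Create an eventfd without glibc's eventfd() wrapper, as would be
 * needed on an older userland such as RHEL5's glibc-2.5. */
static int raw_eventfd(unsigned int initval, int flags)
{
    return (int)syscall(__NR_eventfd2, initval, flags);
}

/* Signal: add 'n' to the counter (the kernel requires an 8-byte write). */
static int raw_eventfd_signal(int fd, uint64_t n)
{
    return write(fd, &n, sizeof(n)) == sizeof(n) ? 0 : -1;
}

/* Consume: read returns the accumulated counter and resets it to zero. */
static uint64_t raw_eventfd_consume(int fd)
{
    uint64_t v = 0;
    if (read(fd, &v, sizeof(v)) != sizeof(v))
        return 0;
    return v;
}
```

As Avi notes, in Ira's topology only the PCI agents would need this; the x86 side runs stock virtio-net.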
On 08/19/2009 08:36 AM, Gregory Haskins wrote: >> If virtio net in guest could be improved instead, everyone would >> benefit. >> > So if I whip up a virtio-net backend for vbus with a PCI compliant > connector, you are happy? > This doesn't improve virtio-net in any way. >> I am doing this, and I wish more people would join. Instead, >> you change ABI in a incompatible way. >> > Only by choice of my particular connector. The ABI is a function of the > connector design. So one such model is to terminate the connector in > qemu, and surface the resulting objects as PCI devices. I choose not to > use this particular design for my connector that I am pushing upstream > because I am of the opinion that I can do better by terminating it in > the guest directly as a PV optimized bus. However, both connectors can > theoretically coexist peacefully. > virtio already supports this model; see lguest and s390. Transporting virtio over vbus and vbus over something else doesn't gain anything over directly transporting virtio over that something else.
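Avi's point that virtio already supports alternative transports (lguest, s390) rests on the fact that a transport only has to supply a small set of config accessors plus queue setup and kick. The userspace model below sketches such accessors over a shared-memory region; the layout and names (`shm_virtio_cfg`, `cfg_negotiate`) are invented for illustration and do not match the real in-kernel `virtio_config_ops` interface:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical device-config layout living in memory shared with the
 * host -- the moral equivalent of virtio-pci's config space. */
struct shm_virtio_cfg {
    uint32_t host_features;  /* offered by the device/host */
    uint32_t guest_features; /* acked by the guest         */
    uint8_t  status;
    uint8_t  device_cfg[64]; /* device-specific area (e.g. a MAC) */
};

/* Transport accessors: reading and writing the device-specific area. */
static void cfg_get(struct shm_virtio_cfg *c, unsigned off,
                    void *buf, unsigned len)
{
    memcpy(buf, c->device_cfg + off, len);
}

static void cfg_set(struct shm_virtio_cfg *c, unsigned off,
                    const void *buf, unsigned len)
{
    memcpy(c->device_cfg + off, buf, len);
}

/* Feature negotiation: the guest acks the subset it understands. */
static uint32_t cfg_negotiate(struct shm_virtio_cfg *c,
                              uint32_t driver_features)
{
    c->guest_features = c->host_features & driver_features;
    return c->guest_features;
}
```

This is the sense in which a new transport (for Ira's PCI-agent boards, or for vbus) can reuse unmodified virtio drivers: only these accessors change, not the device ABI above them.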
Avi Kivity wrote: > On 08/18/2009 05:46 PM, Gregory Haskins wrote: >> >>> Can you explain how vbus achieves RDMA? >>> >>> I also don't see the connection to real time guests. >>> >> Both of these are still in development. Trying to stay true to the >> "release early and often" mantra, the core vbus technology is being >> pushed now so it can be reviewed. Stay tuned for these other >> developments. >> > > Hopefully you can outline how it works. AFAICT, RDMA and kernel bypass > will need device assignment. If you're bypassing the call into the host > kernel, it doesn't really matter how that call is made, does it? This is for things like the setup of queue-pairs, and the transport of door-bells, and ib-verbs. I am not on the team doing that work, so I am not an expert in this area. What I do know is having a flexible and low-latency signal-path was deemed a key requirement. For real-time, a big part of it is relaying the guest scheduler state to the host, but in a smart way. For instance, the cpu priority for each vcpu is in a shared-table. When the priority is raised, we can simply update the table without taking a VMEXIT. When it is lowered, we need to inform the host of the change in case the underlying task needs to reschedule. This is where the really fast call() type mechanism is important. Its also about having the priority flow-end to end, and having the vcpu interrupt state affect the task-priority, etc (e.g. pending interrupts affect the vcpu task prio). etc, etc. I can go on and on (as you know ;), but will wait till this work is more concrete and proven. > >>>> I also designed it in such a way that >>>> we could, in theory, write one set of (linux-based) backends, and have >>>> them work across a variety of environments (such as containers/VMs like >>>> KVM, lguest, openvz, but also physical systems like blade enclosures >>>> and >>>> clusters, or even applications running on the host). >>>> >>>> >>> Sorry, I'm still confused. Why would openvz need vbus? 
>>> >> Its just an example. The point is that I abstracted what I think are >> the key points of fast-io, memory routing, signal routing, etc, so that >> it will work in a variety of (ideally, _any_) environments. >> >> There may not be _performance_ motivations for certain classes of VMs >> because they already have decent support, but they may want a connector >> anyway to gain some of the new features available in vbus. >> >> And looking forward, the idea is that we have commoditized the backend >> so we don't need to redo this each time a new container comes along. >> > > I'll wait until a concrete example shows up as I still don't understand. Ok. > >>> One point of contention is that this is all managementy stuff and should >>> be kept out of the host kernel. Exposing shared memory, interrupts, and >>> guest hypercalls can all be easily done from userspace (as virtio >>> demonstrates). True, some devices need kernel acceleration, but that's >>> no reason to put everything into the host kernel. >>> >> See my last reply to Anthony. My two points here are that: >> >> a) having it in-kernel makes it a complete subsystem, which perhaps has >> diminished value in kvm, but adds value in most other places that we are >> looking to use vbus. >> > > It's not a complete system unless you want users to administer VMs using > echo and cat and configfs. Some userspace support will always be > necessary. Well, more specifically, it doesn't require a userspace app to hang around. For instance, you can set up your devices with udev scripts, or whatever. But that is kind of a silly argument, since the kernel always needs userspace around to give it something interesting, right? ;) Basically, what it comes down to is both vbus and vhost need configuration/management. Vbus does it with sysfs/configfs, and vhost does it with ioctls. 
I ultimately decided to go with sysfs/configfs because, at least that the time I looked, it seemed like the "blessed" way to do user->kernel interfaces. > >> b) the in-kernel code is being overstated as "complex". We are not >> talking about your typical virt thing, like an emulated ICH/PCI chipset. >> Its really a simple list of devices with a handful of attributes. They >> are managed using established linux interfaces, like sysfs/configfs. >> > > They need to be connected to the real world somehow. What about > security? can any user create a container and devices and link them to > real interfaces? If not, do you need to run the VM as root? Today it has to be root as a result of weak mode support in configfs, so you have me there. I am looking for help patching this limitation, though. Also, venet-tap uses a bridge, which of course is not as slick as a raw-socket w.r.t. perms. > > virtio and vhost-net solve these issues. Does vbus? > > The code may be simple to you. But the question is whether it's > necessary, not whether it's simple or complex. > >>> Exposing devices as PCI is an important issue for me, as I have to >>> consider non-Linux guests. >>> >> Thats your prerogative, but obviously not everyone agrees with you. >> > > I hope everyone agrees that it's an important issue for me and that I > have to consider non-Linux guests. I also hope that you're considering > non-Linux guests since they have considerable market share. I didn't mean non-Linux guests are not important. I was disagreeing with your assertion that it only works if its PCI. There are numerous examples of IHV/ISV "bridge" implementations deployed in Windows, no? If vbus is exposed as a PCI-BRIDGE, how is this different? > >> Getting non-Linux guests to work is my problem if you chose to not be >> part of the vbus community. >> > > I won't be writing those drivers in any case. Ok. > >>> Another issue is the host kernel management code which I believe is >>> superfluous. 
>>> >> In your opinion, right? >> > > Yes, this is why I wrote "I believe". Fair enough. > > >>> Given that, why spread to a new model? >>> >> Note: I haven't asked you to (at least, not since April with the vbus-v3 >> release). Spreading to a new model is currently the role of the >> AlacrityVM project, since we disagree on the utility of a new model. >> > > Given I'm not the gateway to inclusion of vbus/venet, you don't need to > ask me anything. I'm still free to give my opinion. Agreed, and I didn't mean to suggest otherwise. It not clear if you are wearing the "kvm maintainer" hat, or the "lkml community member" hat at times, so its important to make that distinction. Otherwise, its not clear if this is edict as my superior, or input as my peer. ;) > >>>> A) hardware can only generate byte/word sized requests at a time >>>> because >>>> that is all the pcb-etch and silicon support. So hardware is usually >>>> expressed in terms of some number of "registers". >>>> >>>> >>> No, hardware happily DMAs to and fro main memory. >>> >> Yes, now walk me through how you set up DMA to do something like a call >> when you do not know addresses apriori. Hint: count the number of >> MMIO/PIOs you need. If the number is> 1, you've lost. >> > > With virtio, the number is 1 (or less if you amortize). Set up the ring > entries and kick. Again, I am just talking about basic PCI here, not the things we build on top. The point is: the things we build on top have costs associated with them, and I aim to minimize it. For instance, to do a "call()" kind of interface, you generally need to pre-setup some per-cpu mappings so that you can just do a single iowrite32() to kick the call off. Those per-cpu mappings have a cost if you want them to be high-performance, so my argument is that you ideally want to limit the number of times you have to do this. My current design reduces this to "once". > >>> Some hardware of >>> course uses mmio registers extensively, but not virtio hardware. 
With >>> the recent MSI support no registers are touched in the fast path. >>> >> Note we are not talking about virtio here. Just raw PCI and why I >> advocate vbus over it. >> > > There's no such thing as raw PCI. Every PCI device has a protocol. The > protocol virtio chose is optimized for virtualization. And its a question of how that protocol scales, more than how the protocol works. Obviously the general idea of the protocol works, as vbus itself is implemented as a PCI-BRIDGE and is therefore limited to the underlying characteristics that I can get out of PCI (like PIO latency). > > >>>> D) device-ids are in a fixed width register and centrally assigned from >>>> an authority (e.g. PCI-SIG). >>>> >>>> >>> That's not an issue either. Qumranet/Red Hat has donated a range of >>> device IDs for use in virtio. >>> >> Yes, and to get one you have to do what? Register it with kvm.git, >> right? Kind of like registering a MAJOR/MINOR, would you agree? Maybe >> you do not mind (especially given your relationship to kvm.git), but >> there are disadvantages to that model for most of the rest of us. >> > > Send an email, it's not that difficult. There's also an experimental > range. Ugly.... > >>> Device IDs are how devices are associated >>> with drivers, so you'll need something similar for vbus. >>> >> Nope, just like you don't need to do anything ahead of time for using a >> dynamic misc-device name. You just have both the driver and device know >> what they are looking for (its part of the ABI). >> > > If you get a device ID clash, you fail. If you get a device name clash, > you fail in the same way. No argument here. > >>>> E) Interrupt/MSI routing is per-device oriented >>>> >>>> >>> Please elaborate. What is the issue? How does vbus solve it? >>> >> There are no "interrupts" in vbus..only shm-signals. You can establish >> an arbitrary amount of shm regions, each with an optional shm-signal >> associated with it. 
To do this, the driver calls dev->shm(), and you >> get back a shm_signal object. >> >> Underneath the hood, the vbus-connector (e.g. vbus-pcibridge) decides >> how it maps real interrupts to shm-signals (on a system level, not per >> device). This can be 1:1, or any other scheme. vbus-pcibridge uses one >> system-wide interrupt per priority level (today this is 8 levels), each >> with an IOQ based event channel. "signals" come as an event on that >> channel. >> >> So the "issue" is that you have no real choice with PCI. You just get >> device oriented interrupts. With vbus, its abstracted. So you can >> still get per-device standard MSI, or you can do fancier things like do >> coalescing and prioritization. >> > > As I've mentioned before, prioritization is available on x86 But as Ive mentioned, it doesn't work very well. >, and coalescing scales badly. Depends on what is scaling. Scaling vcpus? Yes, you are right. Scaling the number of devices? No, this is where it improves. > >>>> F) Interrupts/MSI are assumed cheap to inject >>>> >>>> >>> Interrupts are not assumed cheap; that's why interrupt mitigation is >>> used (on real and virtual hardware). >>> >> Its all relative. IDT dispatch and EOI overhead are "baseline" on real >> hardware, whereas they are significantly more expensive to do the >> vmenters and vmexits on virt (and you have new exit causes, like >> irq-windows, etc, that do not exist in real HW). >> > > irq window exits ought to be pretty rare, so we're only left with > injection vmexits. At around 1us/vmexit, even 100,000 interrupts/vcpu > (which is excessive) will only cost you 10% cpu time. 1us is too much for what I am building, IMHO. Besides, there are a slew of older machines (like Woodcrests) that are more like 2+us per exit, so 1us is a best-case scenario. > >>>> G) Interrupts/MSI are non-priortizable. >>>> >>>> >>> They are prioritizable; Linux ignores this though (Windows doesn't). 
>>> Please elaborate on what the problem is and how vbus solves it. >>> >> It doesn't work right. The x86 sense of interrupt priority is, sorry to >> say it, half-assed at best. I've worked with embedded systems that have >> real interrupt priority support in the hardware, end to end, including >> the PIC. The LAPIC on the other hand is really weak in this dept, and >> as you said, Linux doesn't even attempt to use whats there. >> > > Maybe prioritization is not that important then. If it is, it needs to > be fixed at the lapic level, otherwise you have no real prioritization > wrt non-vbus interrupts. While this is true, I am generally not worried about it. For the environments that care, I plan on having it be predominantly vbus devices and using an -rt kernel (with irq-threads). > >>>> H) Interrupts/MSI are statically established >>>> >>>> >>> Can you give an example of why this is a problem? >>> >> Some of the things we are building use the model of having a device that >> hands out shm-signal in response to guest events (say, the creation of >> an IPC channel). This would generally be handled by a specific device >> model instance, and it would need to do this without pre-declaring the >> MSI vectors (to use PCI as an example). >> > > You're free to demultiplex an MSI to however many consumers you want, > there's no need for a new bus for that. Hmmm...can you elaborate? > >>> What performance oriented items have been left unaddressed? >>> >> Well, the interrupt model to name one. >> > > Like I mentioned, you can merge MSI interrupts, but that's not > necessarily a good idea. > >>> How do you handle conflicts? Again you need a central authority to hand >>> out names or prefixes. >>> >> Not really, no. If you really wanted to be formal about it, you could >> adopt any series of UUID schemes. For instance, perhaps venet should be >> "com.novell::virtual-ethernet". Heck, I could use uuidgen. >> > > Do you use DNS. We use PCI-SIG. 
If Novell is a PCI-SIG member you can > get a vendor ID and control your own virtio space. Yeah, we have our own id. I am more concerned about making this design make sense outside of PCI oriented environments. > >>>> As another example, the connector design coalesces *all* shm-signals >>>> into a single interrupt (by prio) that uses the same context-switch >>>> mitigation techniques that help boost things like networking. This >>>> effectively means we can detect and optimize out ack/eoi cycles from >>>> the >>>> APIC as the IO load increases (which is when you need it most). PCI >>>> has >>>> no such concept. >>>> >>>> >>> That's a bug, not a feature. It means poor scaling as the number of >>> vcpus increases and as the number of devices increases. vcpu increases, I agree (and am ok with, as I expect low vcpu count machines to be typical). nr of devices, I disagree. can you elaborate? >>> >> So the "avi-vbus-connector" can use 1:1, if you prefer. Large vcpu >> counts (which are not typical) and irq-affinity is not a target >> application for my design, so I prefer the coalescing model in the >> vbus-pcibridge included in this series. YMMV >> > > So far you've left out live migration guilty as charged. > Windows, Work in progress. > large guests Can you elaborate? I am not familiar with the term. > and multiqueue out of your design. AFAICT, multiqueue should work quite nicely with vbus. Can you elaborate on where you see the problem? > If you wish to position vbus/venet for > large scale use you'll need to address all of them. > >>> Note nothing prevents steering multiple MSIs into a single vector. It's >>> a bad idea though. >>> >> Yes, it is a bad idea...and not the same thing either. This would >> effectively create a shared-line scenario in the irq code, which is not >> what happens in vbus. >> > > Ok. 
> >>>> In addition, the signals and interrupts are priority aware, which is >>>> useful for things like 802.1p networking where you may establish 8-tx >>>> and 8-rx queues for your virtio-net device. x86 APIC really has no >>>> usable equivalent, so PCI is stuck here. >>>> >>>> >>> x86 APIC is priority aware. >>> >> Have you ever tried to use it? >> > > I haven't, but Windows does. Yeah, it doesn't really work well. It's an extremely rigid model that (IIRC) only lets you prioritize in 16 groups spaced by IDT (0-15 are one level, 16-31 are another, etc). Most of the embedded PICs I have worked with supported direct remapping, etc. But in any case, Linux doesn't support it so we are hosed no matter how good it is. > >>>> Also, the signals can be allocated on-demand for implementing things >>>> like IPC channels in response to guest requests since there is no >>>> assumption about device-to-interrupt mappings. This is more flexible. >>>> >>>> >>> Yes. However given that vectors are a scarce resource you're severely >>> limited in that. >>> >> The connector I am pushing out does not have this limitation. >> > > Okay. > >> >>> And if you're multiplexing everything on one vector, >>> then you can just as well demultiplex your channels in the virtio driver >>> code. >>> >> Only per-device, not system-wide. >> > > Right. I still think multiplexing interrupts is a bad idea in a large > system. In a small system... why would you do it at all? Device scaling, like for running a device-domain / bridge in a guest. > >>>> And through all of this, this design would work in any guest even if it >>>> doesn't have PCI (e.g. lguest, UML, physical systems, etc). >>>> >>>> >>> That is true for virtio which works on pci-less lguest and s390. >>> >> Yes, and lguest and s390 had to build their own bus-model to do it, >> right? >> > > They had to build connectors just like you propose to do. More importantly, they had to build back-end busses too, no?
> >> Thank you for bringing this up, because it is one of the main points >> here. What I am trying to do is generalize the bus to prevent the >> proliferation of more of these isolated models in the future. Build >> one, fast, in-kernel model so that we wouldn't need virtio-X, and >> virtio-Y in the future. They can just reuse the (performance optimized) >> bus and models, and only need to build the connector to bridge them. >> > > But you still need vbus-connector-lguest and vbus-connector-s390 because > they all talk to the host differently. So what's changed? The names? The fact that they don't need to redo most of the in-kernel backend stuff. Just the connector. > >>> That is exactly the design goal of virtio (except it limits itself to >>> virtualization). >>> >> No, virtio is only part of the picture. It does not include the backend >> models, or the memory/signal-path abstractions for in-kernel use, for >> instance. But otherwise, virtio as a device model is compatible with >> vbus as a bus model. They complement one another. >> > > Well, venet doesn't complement virtio-net, and virtio-pci doesn't > complement vbus-connector. Agreed, but virtio complements vbus by virtue of virtio-vbus. > >>>> Then device models like virtio can ride happily on top and we end up >>>> with a really robust and high-performance Linux-based stack. I don't >>>> buy the argument that we already have PCI so let's use it. I don't >>>> think >>>> it's the best design and I am not afraid to make an investment in a >>>> change here because I think it will pay off in the long run. >>>> >>>> >>> Sorry, I don't think you've shown any quantifiable advantages. >>> >> We can agree to disagree then, eh? There are certainly quantifiable >> differences. Waving your hand at the differences to say they are not >> advantages is merely an opinion, one that is not shared universally. >> > > I've addressed them one by one.
We can agree to disagree on interrupt > multiplexing, and the importance of compatibility, Windows, large > guests, multiqueue, and DNS vs. PCI-SIG. > >> The bottom line is all of these design distinctions are encapsulated >> within the vbus subsystem and do not affect the kvm code-base. So >> agreement with kvm upstream is not a requirement, but would be >> advantageous for collaboration. >> > > Certainly. > Kind Regards, -Greg
>>> On 8/19/2009 at 1:48 AM, in message <4A8B9241.20300@redhat.com>, Avi Kivity <avi@redhat.com> wrote: > On 08/19/2009 08:36 AM, Gregory Haskins wrote: >>> If virtio net in guest could be improved instead, everyone would >>> benefit. >>> >> So if I whip up a virtio-net backend for vbus with a PCI compliant >> connector, you are happy? >> > > This doesn't improve virtio-net in any way. And why not? (Did you notice I said "PCI compliant", i.e. over virtio-pci) > >>> I am doing this, and I wish more people would join. Instead, >>> you change ABI in an incompatible way. >>> >> Only by choice of my particular connector. The ABI is a function of the >> connector design. So one such model is to terminate the connector in >> qemu, and surface the resulting objects as PCI devices. I choose not to >> use this particular design for my connector that I am pushing upstream >> because I am of the opinion that I can do better by terminating it in >> the guest directly as a PV optimized bus. However, both connectors can >> theoretically coexist peacefully. >> > > virtio already supports this model; see lguest and s390. Transporting > virtio over vbus and vbus over something else doesn't gain anything over > directly transporting virtio over that something else. This is not what I am advocating. Kind Regards, -Greg -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 08/19/2009 09:28 AM, Gregory Haskins wrote: > Avi Kivity wrote: > >> On 08/18/2009 05:46 PM, Gregory Haskins wrote: >> >>> >>>> Can you explain how vbus achieves RDMA? >>>> >>>> I also don't see the connection to real time guests. >>>> >>>> >>> Both of these are still in development. Trying to stay true to the >>> "release early and often" mantra, the core vbus technology is being >>> pushed now so it can be reviewed. Stay tuned for these other >>> developments. >>> >>> >> Hopefully you can outline how it works. AFAICT, RDMA and kernel bypass >> will need device assignment. If you're bypassing the call into the host >> kernel, it doesn't really matter how that call is made, does it? >> > This is for things like the setup of queue-pairs, and the transport of > door-bells, and ib-verbs. I am not on the team doing that work, so I am > not an expert in this area. What I do know is having a flexible and > low-latency signal-path was deemed a key requirement. > That's not a full bypass, then. AFAIK kernel bypass has userspace talking directly to the device. Given that both virtio and vbus can use ioeventfds, I don't see how one can perform better than the other. > For real-time, a big part of it is relaying the guest scheduler state to > the host, but in a smart way. For instance, the cpu priority for each > vcpu is in a shared-table. When the priority is raised, we can simply > update the table without taking a VMEXIT. When it is lowered, we need > to inform the host of the change in case the underlying task needs to > reschedule. > This is best done using cr8/tpr so you don't have to exit at all. See also my vtpr support for Windows which does this in software, generally avoiding the exit even when lowering priority. > This is where the really fast call() type mechanism is important. > > It's also about having the priority flow end-to-end, and having the vcpu > interrupt state affect the task-priority, etc. (e.g. pending interrupts > affect the vcpu task prio).
> > etc, etc. > > I can go on and on (as you know ;), but will wait till this work is more > concrete and proven. > Generally cpu state shouldn't flow through a device but rather through MSRs, hypercalls, and cpu registers. > Basically, what it comes down to is both vbus and vhost need > configuration/management. Vbus does it with sysfs/configfs, and vhost > does it with ioctls. I ultimately decided to go with sysfs/configfs > because, at least at the time I looked, it seemed like the "blessed" > way to do user->kernel interfaces. > I really dislike that trend but that's an unrelated discussion. >> They need to be connected to the real world somehow. What about >> security? Can any user create a container and devices and link them to >> real interfaces? If not, do you need to run the VM as root? >> > Today it has to be root as a result of weak mode support in configfs, so > you have me there. I am looking for help patching this limitation, though. > > Well, do you plan to address this before submission for inclusion? >> I hope everyone agrees that it's an important issue for me and that I >> have to consider non-Linux guests. I also hope that you're considering >> non-Linux guests since they have considerable market share. >> > I didn't mean non-Linux guests are not important. I was disagreeing > with your assertion that it only works if it's PCI. There are numerous > examples of IHV/ISV "bridge" implementations deployed in Windows, no? > I don't know. > If vbus is exposed as a PCI-BRIDGE, how is this different? > Technically it would work, but given you're not interested in Windows, who would write a driver? >> Given I'm not the gateway to inclusion of vbus/venet, you don't need to >> ask me anything. I'm still free to give my opinion. >> > Agreed, and I didn't mean to suggest otherwise. It's not clear if you are > wearing the "kvm maintainer" hat, or the "lkml community member" hat at > times, so it's important to make that distinction.
Otherwise, it's not > clear if this is an edict from my superior, or input from my peer. ;) > When I wear a hat, it is a Red Hat. However I am bareheaded most often. (that is, look at the contents of my message, not who wrote it or his role). >> With virtio, the number is 1 (or less if you amortize). Set up the ring >> entries and kick. >> > Again, I am just talking about basic PCI here, not the things we build > on top. > Whatever that means, it isn't interesting. Performance is measured for the whole stack. > The point is: the things we build on top have costs associated with > them, and I aim to minimize it. For instance, to do a "call()" kind of > interface, you generally need to pre-setup some per-cpu mappings so that > you can just do a single iowrite32() to kick the call off. Those > per-cpu mappings have a cost if you want them to be high-performance, so > my argument is that you ideally want to limit the number of times you > have to do this. My current design reduces this to "once". > Do you mean minimizing the setup cost? Seriously? >> There's no such thing as raw PCI. Every PCI device has a protocol. The >> protocol virtio chose is optimized for virtualization. >> > And it's a question of how that protocol scales, more than how the > protocol works. > > Obviously the general idea of the protocol works, as vbus itself is > implemented as a PCI-BRIDGE and is therefore limited to the underlying > characteristics that I can get out of PCI (like PIO latency). > I thought we agreed that was insignificant? >> As I've mentioned before, prioritization is available on x86 >> > But as I've mentioned, it doesn't work very well. > I guess it isn't that important then. I note that clever prioritization in a guest is pointless if you can't do the same prioritization in the host. >> , and coalescing scales badly. >> > Depends on what is scaling. Scaling vcpus? Yes, you are right. > Scaling the number of devices? No, this is where it improves.
> If you queue pending messages instead of walking the device list, you may be right. Still, if hard interrupt processing takes 10% of your time you'll only have coalesced 10% of interrupts on average. >> irq window exits ought to be pretty rare, so we're only left with >> injection vmexits. At around 1us/vmexit, even 100,000 interrupts/vcpu >> (which is excessive) will only cost you 10% cpu time. >> > 1us is too much for what I am building, IMHO. You can't use current hardware then. >> You're free to demultiplex an MSI to however many consumers you want, >> there's no need for a new bus for that. >> > Hmmm...can you elaborate? > Point all those MSIs at one vector. Its handler will have to poll all the attached devices though. >> Do you use DNS. We use PCI-SIG. If Novell is a PCI-SIG member you can >> get a vendor ID and control your own virtio space. >> > Yeah, we have our own id. I am more concerned about making this design > make sense outside of PCI oriented environments. > IIRC we reuse the PCI IDs for non-PCI. >>>> That's a bug, not a feature. It means poor scaling as the number of >>>> vcpus increases and as the number of devices increases. >>>> > vcpu increases, I agree (and am ok with, as I expect low vcpu count > machines to be typical). I'm not okay with it. If you wish people to adopt vbus over virtio you'll have to address all concerns, not just yours. > nr of devices, I disagree. can you elaborate? > With message queueing, I retract my remark. >> Windows, >> > Work in progress. > Interesting. Do you plan to open source the code? If not, will the binaries be freely available? > >> large guests >> > Can you elaborate? I am not familiar with the term. > Many vcpus. > >> and multiqueue out of your design. >> > AFAICT, multiqueue should work quite nicely with vbus. Can you > elaborate on where you see the problem? > You said you aren't interested in it previously IIRC. >>>> x86 APIC is priority aware. >>>> >>>> >>> Have you ever tried to use it? 
>>> >>> >> I haven't, but Windows does. >> > Yeah, it doesn't really work well. Its an extremely rigid model that > (IIRC) only lets you prioritize in 16 groups spaced by IDT (0-15 are one > level, 16-31 are another, etc). Most of the embedded PICs I have worked > with supported direct remapping, etc. But in any case, Linux doesn't > support it so we are hosed no matter how good it is. > I agree that it isn't very clever (not that I am a real time expert) but I disagree about dismissing Linux support so easily. If prioritization is such a win it should be a win on the host as well and we should make it work on the host as well. Further I don't see how priorities on the guest can work if they don't on the host. >>> >>> >> They had to build connectors just like you propose to do. >> > More importantly, they had to build back-end busses too, no? > They had to write 414 lines in drivers/s390/kvm/kvm_virtio.c and something similar for lguest. >> But you still need vbus-connector-lguest and vbus-connector-s390 because >> they all talk to the host differently. So what's changed? the names? >> > The fact that they don't need to redo most of the in-kernel backend > stuff. Just the connector. > So they save 414 lines but have to write a connector which is... how large? >> Well, venet doesn't complement virtio-net, and virtio-pci doesn't >> complement vbus-connector. >> > Agreed, but virtio complements vbus by virtue of virtio-vbus. > I don't see what vbus adds to virtio-net.
On 08/19/2009 09:40 AM, Gregory Haskins wrote: > > >>> So if I whip up a virtio-net backend for vbus with a PCI compliant >>> connector, you are happy? >>> >>> >> This doesn't improve virtio-net in any way. >> > And why not? (Did you notice I said "PCI compliant", i.e. over virtio-pci) > Because virtio-net will have gained nothing that it didn't have before. >> virtio already supports this model; see lguest and s390. Transporting >> virtio over vbus and vbus over something else doesn't gain anything over >> directly transporting virtio over that something else. >> > This is not what I am advocating. > > What are you advocating? As far as I can tell, your virtio-vbus connector plus the vbus-kvm connector is just that.
>>> On 8/19/2009 at 3:13 AM, in message <4A8BA635.9010902@redhat.com>, Avi Kivity <avi@redhat.com> wrote: > On 08/19/2009 09:40 AM, Gregory Haskins wrote: >> >> >>>> So if I whip up a virtio-net backend for vbus with a PCI compliant >>>> connector, you are happy? >>>> >>>> >>> This doesn't improve virtio-net in any way. >>> >> And why not? (Did you notice I said "PCI compliant", i.e. over virtio-pci) >> > > Because virtio-net will have gained nothing that it didn't have before. ?? *) ABI is virtio-pci compatible, as you like *) fast-path is in-kernel, as we all like *) model is in vbus so it would work in all environments that vbus supports. > > > > >>> virtio already supports this model; see lguest and s390. Transporting >>> virtio over vbus and vbus over something else doesn't gain anything over >>> directly transporting virtio over that something else. >>> >> This is not what I am advocating. >> >> > > What are you advocating? As far as I can tell your virtio-vbus > connector plus the vbus-kvm connector is just that. I wouldn't classify it as anything like that, no. It's just virtio over vbus. -Greg
On 08/19/2009 02:40 PM, Gregory Haskins wrote: > >>>>> So if I whip up a virtio-net backend for vbus with a PCI compliant >>>>> connector, you are happy? >>>>> >>>>> >>>>> >>>> This doesn't improve virtio-net in any way. >>>> >>>> >>> And why not? (Did you notice I said "PCI compliant", i.e. over virtio-pci) >>> >>> >> Because virtio-net will have gained nothing that it didn't have before. >> > ?? > > *) ABI is virtio-pci compatible, as you like > That's not a gain, that's staying in the same place. > *) fast-path is in-kernel, as we all like > That's not a gain as we have vhost-net (sure, in development, but your proposed backend isn't even there yet). > *) model is in vbus so it would work in all environments that vbus supports. > The ABI can be virtio-pci compatible or it can be vbus-compatible. How can it be both? The ABIs are different. Note that if you had submitted a virtio-net backend I'd have asked you to strip away all the management / bus layers and we'd have ended up with vhost-net. >>>> virtio already supports this model; see lguest and s390. Transporting >>>> virtio over vbus and vbus over something else doesn't gain anything over >>>> directly transporting virtio over that something else. >>>> >>>> >>> This is not what I am advocating. >>> >>> >>> >> What are you advocating? As far as I can tell your virtio-vbus >> connector plus the vbus-kvm connector is just that. >> > I wouldn't classify it as anything like that, no. It's just virtio over vbus. > We're in a loop. Doesn't virtio over vbus need a virtio-vbus connector? And doesn't vbus need a connector to talk to the hypervisor?
Avi Kivity wrote: > On 08/19/2009 02:40 PM, Gregory Haskins wrote: >> >>>>>> So if I whip up a virtio-net backend for vbus with a PCI compliant >>>>>> connector, you are happy? >>>>>> >>>>>> >>>>>> >>>>> This doesn't improve virtio-net in any way. >>>>> >>>>> >>>> And why not? (Did you notice I said "PCI compliant", i.e. over >>>> virtio-pci) >>>> >>>> >>> Because virtio-net will have gained nothing that it didn't have before. >>> >> ?? >> >> *) ABI is virtio-pci compatible, as you like >> > > That's not a gain, that's staying in the same place. > >> *) fast-path is in-kernel, as we all like >> > > That's not a gain as we have vhost-net (sure, in development, but your > proposed backend isn't even there yet). > >> *) model is in vbus so it would work in all environments that vbus >> supports. >> > > The ABI can be virtio-pci compatible or it can be vbus-compatible. How > can it be both? The ABIs are different. > > Note that if you had submitted a virtio-net backend I'd have asked you > to strip away all the management / bus layers and we'd have ended up > with vhost-net. Sigh... > >>>>> virtio already supports this model; see lguest and s390. Transporting >>>>> virtio over vbus and vbus over something else doesn't gain anything >>>>> over >>>>> directly transporting virtio over that something else. >>>>> >>>>> >>>> This is not what I am advocating. >>>> >>>> >>>> >>> What are you advocating? As far as I can tell your virtio-vbus >>> connector plus the vbus-kvm connector is just that. >>> >> I wouldn't classify it as anything like that, no. It's just virtio over >> vbus. >> > > We're in a loop. Doesn't virtio over vbus need a virtio-vbus > connector? And doesn't vbus need a connector to talk to the hypervisor? > No, it doesn't work like that. There is only one connector. Kind Regards, -Greg
Avi Kivity wrote: > On 08/19/2009 07:27 AM, Gregory Haskins wrote: >> >>> This thread started because I asked you about your technical >>> arguments why we'd want vbus instead of virtio. >>> >> (You mean vbus vs pci, right? virtio works fine, is untouched, and is >> out-of-scope here) >> > > I guess he meant venet vs virtio-net. Without venet vbus is currently > userless. > >> Right, and I do believe I answered your questions. Do you feel as >> though this was not a satisfactory response? >> > > Others and I have shown you it's wrong. No, you have shown me that you disagree. I'm sorry, but do not assume they are the same. Case in point: You also said that threading the ethernet model was wrong when I proposed it, and later conceded when I showed you the numbers that you were wrong. I don't say this to be a jerk. I am wrong myself all the time too. I only say it to highlight that perhaps we just don't (yet) see each other's POV. Therefore, do not be so quick to put a "wrong" label on something, especially when the line of questioning/debate indicates to me that there are still fundamental issues in understanding exactly how things work. > There's no inherent performance > problem in pci. The vbus approach has inherent problems (the biggest of > which is compatibility Trying to be backwards compatible in all dimensions is not a design goal, as already stated. , the second manageability). > Where are the management problems? >>> Your answer above >>> now basically boils down to: "because I want it so, why don't you >>> leave me alone". >>> >> Well, with all due respect, please do not put words in my mouth. This >> is not what I am saying at all. >> >> What I *am* saying is: >> >> fact: this thread is about Linux guest drivers to support vbus >> >> fact: these drivers do not touch kvm code. >> >> fact: these drivers do not force kvm to alter its operation in any way. >> >> fact: these drivers do not alter ABIs that KVM currently supports.
>> >> Therefore, all this talk about "abandoning", "supporting", and >> "changing" things in KVM is premature, irrelevant, and/or FUD. No one >> proposed such changes, so I am highlighting this fact to bring the >> thread back on topic. That KVM talk is merely a distraction at this >> point in time. >> > > s/kvm/kvm stack/. virtio/pci is part of the kvm stack, even if it is > not part of kvm itself. If vbus/venet were to be merged, users and > developers would have to choose one or the other. That's the > fragmentation I'm worried about. And you can prefix that with "fact:" > as well. Noted > >>> We all love faster code and better management interfaces and tons >>> of your prior patches got accepted by Avi. This time you didn't even >>> _try_ to improve virtio. >>> >> I'm sorry, but you are mistaken: >> >> http://lkml.indiana.edu/hypermail/linux/kernel/0904.2/02443.html >> > > That does nothing to improve virtio. I'm sorry, but that's just plain false. > Existing guests (Linux and > Windows) which support virtio will cease to work if the host moves to > vbus-virtio. Sigh... please re-read the "fact" section. And even if this work is accepted upstream as it is, how you configure the host and guest is just that: a configuration. If your guest and host both speak vbus, use it. If they don't, don't use it. Simple as that. Saying anything else is just more FUD, and I can say the same thing about a variety of other configuration options currently available. > Existing hosts (running virtio-pci) won't be able to talk > to newer guests running virtio-vbus. The patch doesn't improve > performance without the entire vbus stack in the host kernel and a > vbus-virtio-net-host host kernel driver. <rewind years=2>Existing hosts (running realtek emulation) won't be able to talk to newer guests running virtio-net. Virtio-net doesn't do anything to improve realtek emulation without the entire virtio stack in the host.</rewind> You gotta start somewhere.
Your argument buys you nothing other than backwards compat, which I've already stated is not a specific goal here. I am not against "modprobe vbus-pcibridge", and I am sure there are users out there that do not object to this either. > > Perhaps if you posted everything needed to make vbus-virtio work and > perform we could compare that to vhost-net and you'll see another reason > why vhost-net is the better approach. Yet, you must recognize that an alternative outcome is that we can look at issues outside of virtio-net on KVM and perhaps you will see vbus is a better approach. > >> You are also wrong to say that I didn't try to avoid creating a >> downstream effort first. I believe the public record of the mailing >> lists will back me up that I tried politely pushing this directly through >> kvm first. It was only after Avi recently informed me that they would >> be building their own version of an in-kernel backend in lieu of working >> with me to adapt vbus to their needs that I decided to put my own >> project together. >> > > There's no way we can adapt vbus to our needs. Really? Did you ever bother to ask how? I'm pretty sure you can. And if you couldn't, I would have considered changes to make it work. > Don't you think we'd have preferred it rather than writing our own? Honestly, I am not so sure based on your responses. > the current virtio-net issues > are hurting us. Indeed. > > Our needs are compatibility, performance, and manageability. vbus fails > all three, your impressive venet numbers notwithstanding. > >> What should I have done otherwise, in your opinion? >> > > You could come up with uses where vbus truly is superior to > virtio/pci/whatever I've already listed numerous examples of why I advocate vbus over PCI, and have already stated I am not competing against virtio. > (not words about etch constraints). I was asked about the design, and that was background on some of my motivations. Don't try to spin that into something it's not.
> Showing some of those non-virt uses, for example. Actually, Ira's chassis discussed earlier is a classic example. Vbus actually fits neatly into his model, I believe (and much better than the vhost proposals, IMO). Basically, IMO we want to invert Ira's bus (so that the PPC boards see host-based devices, instead of the other way around). You write a connector that transports the vbus verbs over the PCI link. You write a udev rule that responds to the PPC board "arrival" event to create a new vbus container, and assign the board to that context. Then, whatever devices you instantiate in the vbus container will surface on the PPC board's "vbus-proxy" bus. This can include "virtio" type devices which are serviced by the virtio-vbus code to render these devices to the virtio-bus. Finally, drivers like virtio-net and virtio-console load and run normally. The host-side administers the available inventory on a per-board basis and its configuration using sysfs operations. > The fact that your only user duplicates existing functionality doesn't help. Certainly at some level, that is true and is unfortunate, I agree. In retrospect, I wish I had started with something non-overlapping with virtio as the demo, just to avoid this aspect of controversy. At another level, it's the highest-performance 802.x interface for KVM at the moment, since we still have not seen benchmarks for vhost. Given that I have spent a lot of time lately optimizing KVM, I can tell you it's not trivial to get it to work better than the userspace virtio. Michael is clearly a smart guy, so the odds are in his favor. But do not count your chickens before they hatch, because it's not a guaranteed success. Long story short, my patches are not duplicative on all levels (i.e. performance). It's just another ethernet driver, of which there are probably hundreds of alternatives in the kernel already.
You could also argue that we already have multiple models in qemu (realtek, e1000, virtio-net, etc) so this is not without precedent. So really all this "fragmentation" talk is FUD. Let's stay on-point, please. > > >>> And fragmentation matters quite a bit. To Linux users, developers, >>> administrators, packagers it's a big deal whether two overlapping >>> pieces of functionality for the same thing exist within the same >>> kernel. >>> >> So the only thing that could be construed as overlapping here is venet >> vs virtio-net. If I dropped the contentious venet and focused on making >> a virtio-net backend that we can all re-use, do you see that as a path >> of compromise here? >> > > That's a step in the right direction. Ok. I am concerned it would be a waste of my time given your current statements regarding the backend aspects of my design. Can we talk more about that at some point? I think you will see it's not some "evil, heavy-duty" infrastructure that some comments seem to be trying to paint it as. I think it's similar in concept to what you need to do for a vhost-like design, but (with all due respect to Michael) with a little bit more thought put into the necessary abstraction points to allow broader application. > >>> I certainly don't want that. Instead we (at great expense and work) >>> try to reach the best technical solution. >>> >> This is all I want, as well. >> > > Note whenever I mention migration, large guests, or Windows you say > these are not your design requirements. Actually, I don't think I've ever said that, per se. I said that those things are not a priority for me, personally. I never made a design decision that I knew would preclude the support for such concepts. In fact, afaict, the design would support them just fine, given the resources to develop them. For the record: I never once said "vbus is done". There is plenty of work left to do.
This is natural (kvm I'm sure wasn't 100% when it went in either, nor is it today). > The best technical solution will have to consider those. We are on the same page here. > >>> If the community wants this then why cannot you convince one of the >>> most prominent representatives of that community, the KVM >>> developers? >>> >> It's a chicken-and-egg problem at times. Perhaps the KVM developers do not have >> the motivation or time to properly consider such a proposal _until_ the >> community presents its demand. > > I've spent quite a lot of time arguing with you, no doubt influenced by > the fact that you can write a lot faster than I can read. :) > >>> Furthermore, 99% of your work is KVM >>> >> Actually, no. Almost none of it is. I think there are about 2-3 >> patches in the series that touch KVM; the rest are all original (and >> primarily stand-alone) code. AlacrityVM is the application of kvm and >> vbus (and, of course, Linux) together as a complete unit, but I do not >> try to hide this relationship. >> >> By your argument, KVM is 99% QEMU+Linux. ;) >> > > That's one of the kvm strong points... It's one of AlacrityVM's, as well ;) Kind Regards, -Greg
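The udev-rule idea sketched earlier in this exchange (react to a PPC board "arrival" event by creating a vbus container and assigning the board to it) might look roughly like the fragment below. Note this is purely illustrative: the subsystem name, helper path, and configfs layout are all invented here, since the actual vbus management interface is not shown in this thread.

```
# /etc/udev/rules.d/99-vbus-ppc.rules  (hypothetical sketch)
#
# When a PPC board appears on the PCI link, run a helper that
# creates a per-board vbus container via configfs and binds the
# board to it.  "ppc_board" and the helper are invented names.
ACTION=="add", SUBSYSTEM=="ppc_board", RUN+="/usr/sbin/vbus-attach %k"
```

The helper itself would presumably do little more than a `mkdir` under the vbus configfs root and write the board identifier into an attribute, in keeping with the sysfs/configfs management model described later in the thread.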
On Wed, Aug 19, 2009 at 01:36:14AM -0400, Gregory Haskins wrote: > Please post results when you have numbers, as I had to > give up my 10GE rig in the lab. > I suspect you will have performance > issues until you at least address GSO, but you may already be there by now. Yes, measuring streaming bandwidth probably does not make sense yet, as I do not have GSO, and I do not have VM exit mitigation. But RSN. Meanwhile udp_rr does not need any of these, so I checked that and the numbers look like what you'd expect. My systems seem slower than yours, but the virtualization overhead is the same: around 20us (sometimes it's a bit higher, up to 25us). host to host: [root@virtlab18 netperf-2.4.5]# ~mst/netperf-2.4.5/bin/netperf -H 20.1.50.1 -t udp_rr UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 20.1.50.1 (20.1.50.1) port 0 AF_INET Local /Remote Socket Size Request Resp. Elapsed Trans. Send Recv Size Size Time Rate bytes Bytes bytes bytes secs. per sec 262144 262144 1 1 10.00 13890.41 124928 124928 host to guest: [root@virtlab18 linux-2.6]# ~mst/netperf-2.4.5/bin/netperf -H 20.1.50.3 -t udp_rr UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 20.1.50.3 (20.1.50.3) port 0 AF_INET Local /Remote Socket Size Request Resp. Elapsed Trans. Send Recv Size Size Time Rate bytes Bytes bytes bytes secs. per sec 262144 262144 1 1 10.00 10884.78 124928 124928
On 08/19/2009 04:27 PM, Gregory Haskins wrote: >> There's no inherent performance >> problem in pci. The vbus approach has inherent problems (the biggest of >> which is compatibility >> > Trying to be backwards compatible in all dimensions is not a design > goal, as already stated. > It's important to me. If you ignore what's important to me, don't expect me to support your code. > > , the second manageability). > >> > Where are the management problems? > Requiring root, and negotiation in the kernel (making it harder to set up a compatible "migration pool"; but wait, you don't care about migration either). > No, you have shown me that you disagree. I'm sorry, but do not assume > they are the same. [...] > I'm sorry, but that's just plain false. > Don't you mean, "I disagree but that's completely different from you being wrong". >> Existing guests (Linux and >> Windows) which support virtio will cease to work if the host moves to >> vbus-virtio. >> > Sigh... please re-read the "fact" section. And even if this work is accepted > upstream as it is, how you configure the host and guest is just that: a > configuration. If your guest and host both speak vbus, use it. If they > don't, don't use it. Simple as that. Saying anything else is just more > FUD, and I can say the same thing about a variety of other configuration > options currently available. > The host, yes. The guest, no. I have RHEL 5.3 and Windows guests that work with virtio now, and I'd like to keep it that way. Given that I need to keep the current virtio-net/pci ABI, I have no motivation to add other ABIs. Given that host userspace configuration works, I have no motivation to move it into a kernel configfs/vbus based system. The only thing that's hurting me is virtio-net's performance problems, and we're addressing it by moving the smallest possible component into the kernel: vhost-net. >> Existing hosts (running virtio-pci) won't be able to talk >> to newer guests running virtio-vbus. 
The patch doesn't improve >> performance without the entire vbus stack in the host kernel and a >> vbus-virtio-net-host host kernel driver. >> > <rewind years=2>Existing hosts (running realtek emulation) won't be able > to talk to newer guests running virtio-net. Virtio-net doesn't do > anything to improve realtek emulation without the entire virtio stack in > the host.</rewind> > > You gotta start somewhere. Your argument buys you nothing other than > backwards compat, which I've already stated is not a specific goal here. > I am not against "modprobe vbus-pcibridge", and I am sure there are > users out there that do not object to this either. > Two years ago we had something that was set in stone and had a very limited performance future. That's not the case now. If every two years we start from scratch we'll be in a pretty pickle fairly soon. virtio-net/pci is here to stay. I see no convincing reason to pour efforts into a competitor and then have to support both. >> Perhaps if you posted everything needed to make vbus-virtio work and >> perform we could compare that to vhost-net and you'll see another reason >> why vhost-net is the better approach. >> > Yet, you must recognize that an alternative outcome is that we can look > at issues outside of virtio-net on KVM and perhaps you will see vbus is > a better approach. > We won't know until that experiment takes place. >>> You are also wrong to say that I didn't try to avoid creating a >>> downstream effort first. I believe the public record of the mailing >>> lists will back me up that I tried politely pushing this directly through >>> kvm first. It was only after Avi recently informed me that they would >>> be building their own version of an in-kernel backend in lieu of working >>> with me to adapt vbus to their needs that I decided to put my own >>> project together. >>> >>> >> There's no way we can adapt vbus to our needs. >> > Really? Did you ever bother to ask how? I'm pretty sure you can. 
And > if you couldn't, I would have considered changes to make it work. > Our needs are: compatibility, live migration, Windows, manageability (non-root, userspace control over configuration). Non-requirements, but highly desirable: minimal kernel impact. >> Don't you think we'd have preferred it rather than writing our own? >> > Honestly, I am not so sure based on your responses. > Does your experience indicate that I reject patches from others in favour of writing my own? Look for your own name in the kernel's git log. > I've already listed numerous examples on why I advocate vbus over PCI, > and have already stated I am not competing against virtio. > Well, your examples didn't convince me, and vbus's deficiencies (compatibility, live migration, Windows, manageability, kernel impact) aren't helping. >> Showing some of those non-virt uses, for example. >> > Actually, Ira's chassis discussed earlier is a classic example. Vbus > actually fits neatly into his model, I believe (and much better than the > vhost proposals, IMO). > > Basically, IMO we want to invert Ira's bus (so that the PPC boards see > host-based devices, instead of the other way around). You write a > connector that transports the vbus verbs over the PCI link. You write a > udev rule that responds to the PPC board "arrival" event to create a new > vbus container, and assign the board to that context. > It's not inverted at all. vhost-net corresponds to the device side, where a real NIC's DMA engine lives, while virtio-net is the guest side which drives the device and talks only to its main memory (and device registers). It may seem backwards but it's quite natural when you consider DMA. If you wish to push vbus for non-virt uses, I have nothing to say. If you wish to push vbus for some other hypervisor (like AlacrityVM), that's the other hypervisor's maintainer's turf. But vbus as I understand it doesn't suit kvm's needs (compatibility, live migration, Windows, manageability, kernel impact). 
>> The fact that your only user duplicates existing functionality doesn't help. >> > Certainly at some level, that is true and is unfortunate, I agree. In > retrospect, I wish I had started with something non-overlapping with virtio > as the demo, just to avoid this aspect of controversy. > > At another level, it's the highest-performance 802.x interface for KVM at > the moment, since we still have not seen benchmarks for vhost. Given > that I have spent a lot of time lately optimizing KVM, I can tell you > it's not trivial to get it to work better than the userspace virtio. > Michael is clearly a smart guy, so the odds are in his favor. But do > not count your chickens before they hatch, because it's not a guaranteed > success. > Well, the latency numbers seem to match (after normalizing for the host-host baseline). Obviously throughput needs more work, but I have confidence we'll see pretty good results. > Long story short, my patches are not duplicative on all levels (i.e. > performance). It's just another ethernet driver, of which there are > probably hundreds of alternatives in the kernel already. You could also > argue that we already have multiple models in qemu (realtek, e1000, > virtio-net, etc) so this is not without precedent. So really all this > "fragmentation" talk is FUD. Let's stay on-point, please. > It's not FUD; please talk technicalities instead of throwing words around. If there are a limited number of kvm developers, then every new device dilutes the effort. Further, e1000 and friends don't need drivers for a bunch of OSs; v* do. > Can we talk more about that at some point? I think you will see it's not > some "evil, heavy duty" infrastructure that some comments seem to be > trying to paint it as. I think it's similar in concept to what you need > to do for a vhost-like design, but (with all due respect to Michael) with a > little more thought put into the necessary abstraction points to allow > broader application. > vhost-net only pumps the rings. 
It leaves everything else for userspace. vbus/venet leave almost nothing to userspace. vbus redoes everything that the guest's native bus provides; virtio-pci relies on pci. I haven't called it evil or heavy duty, just unnecessary. (btw, your current alacrityvm patch is larger than kvm was when it was first merged into Linux) >>> >>> >> Note whenever I mention migration, large guests, or Windows you say >> these are not your design requirements. >> > Actually, I don't think I've ever said that, per se. I said that those > things are not a priority for me, personally. I never made a design > decision that I knew would preclude support for such concepts. In > fact, afaict, the design would support them just fine, given the resources > to develop them. > So given three choices: 1. merge vbus without those things that we need 2. merge vbus and start working on them 3. not merge vbus As choice 1 gives me nothing and choice 2 takes away development effort, choice 3 is the winner. > For the record: I never once said "vbus is done". There is plenty of > work left to do. This is natural (kvm I'm sure wasn't 100% when it went > in either, nor is it today) > Which is why I want to concentrate effort in one direction, not wander off in many.
On Wed, Aug 19, 2009 at 08:40:33AM +0300, Avi Kivity wrote: > On 08/19/2009 03:38 AM, Ira W. Snyder wrote: >> On Wed, Aug 19, 2009 at 12:26:23AM +0300, Avi Kivity wrote: >> >>> On 08/18/2009 11:59 PM, Ira W. Snyder wrote: >>> >>>> On a non shared-memory system (where the guest's RAM is not just a chunk >>>> of userspace RAM in the host system), virtio's management model seems to >>>> fall apart. Feature negotiation doesn't work as one would expect. >>>> >>>> >>> In your case, virtio-net on the main board accesses PCI config space >>> registers to perform the feature negotiation; software on your PCI cards >>> needs to trap these config space accesses and respond to them according >>> to virtio ABI. >>> >>> >> Is this "real PCI" (physical hardware) or "fake PCI" (software PCI >> emulation) that you are describing? >> >> > > Real PCI. > >> The host (x86, PCI master) must use "real PCI" to actually configure the >> boards, enable bus mastering, etc. Just like any other PCI device, such >> as a network card. >> >> On the guests (ppc, PCI agents) I cannot add/change PCI functions (the >> last .[0-9] in the PCI address) nor can I change PCI BAR's once the >> board has started. I'm pretty sure that would violate the PCI spec, >> since the PCI master would need to re-scan the bus, and re-assign >> addresses, which is a task for the BIOS. >> > > Yes. Can the boards respond to PCI config space cycles coming from the > host, or is the config space implemented in silicon and immutable? > (reading on, I see the answer is no). virtio-pci uses the PCI config > space to configure the hardware. > Yes, the PCI config space is implemented in silicon. I can change a few things (mostly PCI BAR attributes), but not much. >>> (There's no real guest on your setup, right? just a kernel running on >>> and x86 system and other kernels running on the PCI cards?) >>> >>> >> Yes, the x86 (PCI master) runs Linux (booted via PXELinux). The ppc's >> (PCI agents) also run Linux (booted via U-Boot). 
They are independent >> Linux systems, with a physical PCI interconnect. >> >> The x86 has CONFIG_PCI=y, however the ppc's have CONFIG_PCI=n. Linux's >> PCI stack does bad things as a PCI agent. It always assumes it is a PCI >> master. >> >> It is possible for me to enable CONFIG_PCI=y on the ppc's by removing >> the PCI bus from their list of devices provided by OpenFirmware. They >> can not access PCI via normal methods. PCI drivers cannot work on the >> ppc's, because Linux assumes it is a PCI master. >> >> To the best of my knowledge, I cannot trap configuration space accesses >> on the PCI agents. I haven't needed that for anything I've done thus >> far. >> >> > > Well, if you can't do that, you can't use virtio-pci on the host. > You'll need another virtio transport (equivalent to "fake pci" you > mentioned above). > Ok. Is there something similar that I can study as an example? Should I look at virtio-pci? >>>> This does appear to be solved by vbus, though I haven't written a >>>> vbus-over-PCI implementation, so I cannot be completely sure. >>>> >>>> >>> Even if virtio-pci doesn't work out for some reason (though it should), >>> you can write your own virtio transport and implement its config space >>> however you like. >>> >>> >> This is what I did with virtio-over-PCI. The way virtio-net negotiates >> features makes this work non-intuitively. >> > > I think you tried to take two virtio-nets and make them talk together? > That won't work. You need the code from qemu to talk to virtio-net > config space, and vhost-net to pump the rings. > It *is* possible to make two unmodified virtio-net's talk together. I've done it, and it is exactly what the virtio-over-PCI patch does. Study it and you'll see how I connected the rx/tx queues together. The feature negotiation code also works, but in a very unintuitive manner. I made it work in the virtio-over-PCI patch, but the devices are hardcoded into the driver. 
It would be quite a bit of work to swap virtio-net and virtio-console, for example. >>>> I'm not at all clear on how to get feature negotiation to work on a >>>> system like mine. From my study of lguest and kvm (see below) it looks >>>> like userspace will need to be involved, via a miscdevice. >>>> >>>> >>> I don't see why. Is the kernel on the PCI cards in full control of all >>> accesses? >>> >>> >> I'm not sure what you mean by this. Could you be more specific? This is >> a normal, unmodified vanilla Linux kernel running on the PCI agents. >> > > I meant, does board software implement the config space accesses issued > from the host, and it seems the answer is no. > > >> In my virtio-over-PCI patch, I hooked two virtio-net's together. I wrote >> an algorithm to pair the tx/rx queues together. Since virtio-net >> pre-fills its rx queues with buffers, I was able to use the DMA engine >> to copy from the tx queue into the pre-allocated memory in the rx queue. >> >> > > Please find a name other than virtio-over-PCI since it conflicts with > virtio-pci. You're tunnelling virtio config cycles (which are usually > done on pci config cycles) on a new protocol which is itself tunnelled > over PCI shared memory. > Sorry about that. Do you have suggestions for a better name? I called it virtio-over-PCI in my previous postings to LKML, so until a new patch is written and posted, I'll keep referring to it by the name used in the past, so people can search for it. When I post virtio patches, should I CC another mailing list in addition to LKML? >>>> >>>> >>> Yeah. You'll need to add byteswaps. >>> >>> >> I wonder if Rusty would accept a new feature: >> VIRTIO_F_NET_LITTLE_ENDIAN, which would allow the virtio-net driver to >> use LE for all of its multi-byte fields. >> >> I don't think the transport should have to care about the endianness. >> > > Given this is not mainstream use, it would have to have zero impact when > configured out. > Yes, of course. 
That said, I'm not sure how qemu-system-ppc running on x86 could possibly communicate using virtio-net. This would mean the guest is an emulated big-endian PPC, while the host is a little-endian x86. I haven't actually tested this situation, so perhaps I am wrong. >> True. It's slowpath setup, so I don't care how fast it is. For reasons >> outside my control, the x86 (PCI master) is running a RHEL5 system. This >> means glibc-2.5, which doesn't have eventfd support, AFAIK. I could try >> to push for an upgrade. This is what makes cat/echo really nice: it >> doesn't depend on glibc, only on the kernel version. >> >> I don't give much weight to the above, because I can use the eventfd >> syscalls directly, without glibc support. It is just more painful. >> > > The x86 side only needs to run virtio-net, which is present in RHEL 5.3. > You'd only need to run virtio-tunnel or however it's called. All the > eventfd magic takes place on the PCI agents. > I can upgrade the kernel to anything I want on both the x86 and ppc's. I'd like to avoid changing the x86 (RHEL5) userspace, though. On the ppc's, I have full control over the userspace environment. Thanks, Ira
On 08/19/2009 06:28 PM, Ira W. Snyder wrote: > >> Well, if you can't do that, you can't use virtio-pci on the host. >> You'll need another virtio transport (equivalent to "fake pci" you >> mentioned above). >> >> > Ok. > > Is there something similar that I can study as an example? Should I look > at virtio-pci? > > There's virtio-lguest, virtio-s390, and virtio-vbus. >> I think you tried to take two virtio-nets and make them talk together? >> That won't work. You need the code from qemu to talk to virtio-net >> config space, and vhost-net to pump the rings. >> >> > It *is* possible to make two unmodified virtio-net's talk together. I've > done it, and it is exactly what the virtio-over-PCI patch does. Study it > and you'll see how I connected the rx/tx queues together. > Right, crossing the cables works, but feature negotiation is screwed up, and both sides think the data is in their RAM. vhost-net doesn't do negotiation and doesn't assume the data lives in its address space. >> Please find a name other than virtio-over-PCI since it conflicts with >> virtio-pci. You're tunnelling virtio config cycles (which are usually >> done on pci config cycles) on a new protocol which is itself tunnelled >> over PCI shared memory. >> >> > Sorry about that. Do you have suggestions for a better name? > > virtio-$yourhardware or maybe virtio-dma > I called it virtio-over-PCI in my previous postings to LKML, so until a > new patch is written and posted, I'll keep referring to it by the name > used in the past, so people can search for it. > > When I post virtio patches, should I CC another mailing list in addition > to LKML? > virtualization@lists.linux-foundation.org is virtio's home. > That said, I'm not sure how qemu-system-ppc running on x86 could > possibly communicate using virtio-net. This would mean the guest is an > emulated big-endian PPC, while the host is a little-endian x86. I > haven't actually tested this situation, so perhaps I am wrong. > I'm confused now. 
You don't actually have any guest, do you, so why would you run qemu at all? >> The x86 side only needs to run virtio-net, which is present in RHEL 5.3. >> You'd only need to run virtio-tunnel or however it's called. All the >> eventfd magic takes place on the PCI agents. >> >> > I can upgrade the kernel to anything I want on both the x86 and ppc's. > I'd like to avoid changing the x86 (RHEL5) userspace, though. On the > ppc's, I have full control over the userspace environment. > You don't need any userspace on virtio-net's side. Your ppc boards emulate a virtio-net device, so all you need is the virtio-net module (and virtio bindings). If you chose to emulate, say, an e1000 card all you'd need is the e1000 driver.
On Wed, Aug 19, 2009 at 06:37:06PM +0300, Avi Kivity wrote: > On 08/19/2009 06:28 PM, Ira W. Snyder wrote: >> >>> Well, if you can't do that, you can't use virtio-pci on the host. >>> You'll need another virtio transport (equivalent to "fake pci" you >>> mentioned above). >>> >>> >> Ok. >> >> Is there something similar that I can study as an example? Should I look >> at virtio-pci? >> >> > > There's virtio-lguest, virtio-s390, and virtio-vbus. > >>> I think you tried to take two virtio-nets and make them talk together? >>> That won't work. You need the code from qemu to talk to virtio-net >>> config space, and vhost-net to pump the rings. >>> >>> >> It *is* possible to make two unmodified virtio-net's talk together. I've >> done it, and it is exactly what the virtio-over-PCI patch does. Study it >> and you'll see how I connected the rx/tx queues together. >> > > Right, crossing the cables works, but feature negotiation is screwed up, > and both sides think the data is in their RAM. > > vhost-net doesn't do negotiation and doesn't assume the data lives in > its address space. > Yes, that is exactly what I did: crossed the cables (in software). I'll take a closer look at vhost-net now, and make sure I understand how it works. >>> Please find a name other than virtio-over-PCI since it conflicts with >>> virtio-pci. You're tunnelling virtio config cycles (which are usually >>> done on pci config cycles) on a new protocol which is itself tunnelled >>> over PCI shared memory. >>> >>> >> Sorry about that. Do you have suggestions for a better name? >> >> > > virtio-$yourhardware or maybe virtio-dma > How about virtio-phys? Arnd and BenH are both looking at PPC systems (similar to mine). Grant Likely is looking at talking to a processor core running on an FPGA, IIRC. Most of the code can be shared; very little should need to be board-specific, I hope. 
>> I called it virtio-over-PCI in my previous postings to LKML, so until a >> new patch is written and posted, I'll keep referring to it by the name >> used in the past, so people can search for it. >> >> When I post virtio patches, should I CC another mailing list in addition >> to LKML? >> > > virtualization@lists.linux-foundation.org is virtio's home. > >> That said, I'm not sure how qemu-system-ppc running on x86 could >> possibly communicate using virtio-net. This would mean the guest is an >> emulated big-endian PPC, while the host is a little-endian x86. I >> haven't actually tested this situation, so perhaps I am wrong. >> > > I'm confused now. You don't actually have any guest, do you, so why > would you run qemu at all? > I do not run qemu. I am just stating a problem with virtio-net that I noticed. This is just so someone more knowledgeable can be aware of the problem. >>> The x86 side only needs to run virtio-net, which is present in RHEL 5.3. >>> You'd only need to run virtio-tunnel or however it's called. All the >>> eventfd magic takes place on the PCI agents. >>> >>> >> I can upgrade the kernel to anything I want on both the x86 and ppc's. >> I'd like to avoid changing the x86 (RHEL5) userspace, though. On the >> ppc's, I have full control over the userspace environment. >> > > You don't need any userspace on virtio-net's side. > > Your ppc boards emulate a virtio-net device, so all you need is the > virtio-net module (and virtio bindings). If you chose to emulate, say, > an e1000 card all you'd need is the e1000 driver. > Thanks for the replies. Ira 
On 08/19/2009 07:29 PM, Ira W. Snyder wrote: > > >> virtio-$yourhardware or maybe virtio-dma >> >> > How about virtio-phys? > Could work. > Arnd and BenH are both looking at PPC systems (similar to mine). Grant > Likely is looking at talking to an processor core running on an FPGA, > IIRC. Most of the code can be shared, very little should need to be > board-specific, I hope. > Excellent. >>> That said, I'm not sure how qemu-system-ppc running on x86 could >>> possibly communicate using virtio-net. This would mean the guest is an >>> emulated big-endian PPC, while the host is a little-endian x86. I >>> haven't actually tested this situation, so perhaps I am wrong. >>> >>> >> I'm confused now. You don't actually have any guest, do you, so why >> would you run qemu at all? >> >> > I do not run qemu. I am just stating a problem with virtio-net that I > noticed. This is just so someone more knowledgeable can be aware of the > problem. > > Ah, it certainly doesn't byteswap. Maybe nobody tried it. Hollis?
On Wed, 2009-08-19 at 10:11 +0300, Avi Kivity wrote: > On 08/19/2009 09:28 AM, Gregory Haskins wrote: > > Avi Kivity wrote: <SNIP> > > Basically, what it comes down to is both vbus and vhost need > > configuration/management. Vbus does it with sysfs/configfs, and vhost > > does it with ioctls. I ultimately decided to go with sysfs/configfs > > because, at least at the time I looked, it seemed like the "blessed" > > way to do user->kernel interfaces. > > > > I really dislike that trend but that's an unrelated discussion. > > >> They need to be connected to the real world somehow. What about > >> security? can any user create a container and devices and link them to > >> real interfaces? If not, do you need to run the VM as root? > >> > > Today it has to be root as a result of weak mode support in configfs, so > > you have me there. I am looking for help patching this limitation, though. > > > > > > Well, do you plan to address this before submission for inclusion? > Greetings Avi and Co, I have been following this thread, and although I cannot say that I am intimately familiar with all of the virtualization considerations involved to really add anything useful to that side of the discussion, I think you guys are doing a good job of explaining the technical issues for the non-virtualization wizards following this thread. :-) Anyways, I was wondering if you might be interested in sharing your concerns wrt configfs (configfs maintainer CC'ed) at some point..? 
As you may recall, I have been using configfs extensively for the 3.x generic target core infrastructure and iSCSI fabric modules living in lio-core-2.6.git/drivers/target/target_core_configfs.c and lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found it to be extraordinarily useful for the purposes of implementing a complex kernel-level target mode stack that is expected to manage massive amounts of metadata, allow for real-time configuration, share data structures (eg: SCSI Target Ports) between other kernel fabric modules, and manage the entire set of fabrics using only interpreted userspace code. Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs <-> iSCSI Target Endpoints inside of a KVM Guest (from the results posted in May with IOMMU-aware 10 Gb on modern Nehalem hardware, see http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to dump the entire running target fabric configfs hierarchy to a single struct file on a KVM Guest root device using python code, on the order of ~30 seconds for those 10000 active iSCSI endpoints. In configfs terms, this means: *) 7 configfs groups (directories), ~50 configfs attributes (files) per Virtual HBA+FILEIO LUN *) 15 configfs groups (directories), ~60 configfs attributes (files) per iSCSI fabric Endpoint Which comes out to a total of ~220000 groups and ~1100000 attributes: active configfs objects living in the configfs_dir_cache that are being dumped inside of a single KVM guest instance, including symlinks between the fabric modules to establish the SCSI ports, containing the complete set of SPC-4 and RFC-3720 features, et al. 
Also on the kernel <-> user API interaction compatibility side, I have found the 3.x configfs-enabled code advantageous over the LIO 2.9 code (which used an ioctl for everything) because it allows us to do backwards compat for future versions without using any userspace C code, which IMHO makes maintaining userspace packages for complex kernel stacks with massive amounts of metadata and real-time configuration considerations much easier. No longer having ioctl compatibility issues between LIO versions as the structures passed via ioctl change, and being able to do backwards compat against configfs layout changes with small amounts of interpreted code, have made maintaining the kernel <-> user API that much easier for me. Anyways, I thought these might be useful to the discussion as it relates to potential uses of configfs on the KVM Host or other projects where it really makes sense, and/or to improving the upstream implementation so that other users (like myself) can benefit from improvements to configfs. Many thanks for your most valuable of time, --nab 
Avi Kivity wrote: > On 08/19/2009 09:28 AM, Gregory Haskins wrote: >> Avi Kivity wrote: >> >>> On 08/18/2009 05:46 PM, Gregory Haskins wrote: >>> >>>> >>>>> Can you explain how vbus achieves RDMA? >>>>> >>>>> I also don't see the connection to real time guests. >>>>> >>>>> >>>> Both of these are still in development. Trying to stay true to the >>>> "release early and often" mantra, the core vbus technology is being >>>> pushed now so it can be reviewed. Stay tuned for these other >>>> developments. >>>> >>>> >>> Hopefully you can outline how it works. AFAICT, RDMA and kernel bypass >>> will need device assignment. If you're bypassing the call into the host >>> kernel, it doesn't really matter how that call is made, does it? >>> >> This is for things like the setup of queue-pairs, and the transport of >> door-bells, and ib-verbs. I am not on the team doing that work, so I am >> not an expert in this area. What I do know is having a flexible and >> low-latency signal-path was deemed a key requirement. >> > > That's not a full bypass, then. AFAIK kernel bypass has userspace > talking directly to the device. Like I said, I am not an expert on the details here. I only work on the vbus plumbing. FWIW, the work is derivative from the "Xen-IB" project http://www.openib.org/archives/nov2006sc/xen-ib-presentation.pdf There were issues with getting Xen-IB to map well into the Xen model. Vbus was specifically designed to address some of those short-comings. > > Given that both virtio and vbus can use ioeventfds, I don't see how one > can perform better than the other. > >> For real-time, a big part of it is relaying the guest scheduler state to >> the host, but in a smart way. For instance, the cpu priority for each >> vcpu is in a shared-table. When the priority is raised, we can simply >> update the table without taking a VMEXIT. When it is lowered, we need >> to inform the host of the change in case the underlying task needs to >> reschedule. 
>> > > This is best done using cr8/tpr so you don't have to exit at all. See > also my vtpr support for Windows which does this in software, generally > avoiding the exit even when lowering priority. You can think of vTPR as a good model, yes. Generally, you can't actually use it for our purposes, however, for several reasons:

1) the prio granularity is too coarse (16 levels, -rt has 100)

2) it is too scope-limited (it covers only interrupts; we need additional considerations, like nested guest/host scheduling algorithms against the vcpu, and prio-remap policies)

3) I use "priority" generally... there may be other non-priority based policies that need to add state to the table (such as EDF deadlines, etc).

But otherwise, the idea is the same. Besides, this was one example. > >> This is where the really fast call() type mechanism is important. >> >> Its also about having the priority flow end-to-end, and having the vcpu >> interrupt state affect the task-priority, etc (e.g. pending interrupts >> affect the vcpu task prio). >> >> etc, etc. >> >> I can go on and on (as you know ;), but will wait till this work is more >> concrete and proven. >> > > Generally cpu state shouldn't flow through a device but rather through > MSRs, hypercalls, and cpu registers. Well, you can blame yourself for that one ;) The original vbus was implemented as cpuid+hypercalls, partly for that reason. You kicked me out of kvm.ko, so I had to make do with plan B via a less direct PCI-BRIDGE route. But in reality, it doesn't matter much. You can certainly have "system" devices sitting on vbus that fill a similar role to "MSRs", so the access method is more of an implementation detail. The key is it needs to be fast, and optimize out extraneous exits when possible. > >> Basically, what it comes down to is both vbus and vhost need >> configuration/management. Vbus does it with sysfs/configfs, and vhost >> does it with ioctls. 
I ultimately decided to go with sysfs/configfs >> because, at least that the time I looked, it seemed like the "blessed" >> way to do user->kernel interfaces. >> > > I really dislike that trend but that's an unrelated discussion. Ok > >>> They need to be connected to the real world somehow. What about >>> security? can any user create a container and devices and link them to >>> real interfaces? If not, do you need to run the VM as root? >>> >> Today it has to be root as a result of weak mode support in configfs, so >> you have me there. I am looking for help patching this limitation, >> though. >> >> > > Well, do you plan to address this before submission for inclusion? Maybe, maybe not. It's workable for now (i.e. run as root), so its inclusion is not predicated on the availability of the fix, per se (at least IMHO). If I can get it working before I get to pushing the core, great! Patches welcome. > >>> I hope everyone agrees that it's an important issue for me and that I >>> have to consider non-Linux guests. I also hope that you're considering >>> non-Linux guests since they have considerable market share. >>> >> I didn't mean non-Linux guests are not important. I was disagreeing >> with your assertion that it only works if its PCI. There are numerous >> examples of IHV/ISV "bridge" implementations deployed in Windows, no? >> > > I don't know. > >> If vbus is exposed as a PCI-BRIDGE, how is this different? >> > > Technically it would work, but given you're not interested in Windows, s/interested in/prioritizing For the time being, Windows will not be RT, and Windows can fall back to use virtio-net, etc. So I am ok with this. It will come in due time. > who would write a driver? Someone from the vbus community who is motivated enough and has the time to do it, I suppose. We have people interested in looking at this internally, but other items have pushed it primarily to the back-burner. 
> >>> Given I'm not the gateway to inclusion of vbus/venet, you don't need to >>> ask me anything. I'm still free to give my opinion. >>> >> Agreed, and I didn't mean to suggest otherwise. It not clear if you are >> wearing the "kvm maintainer" hat, or the "lkml community member" hat at >> times, so its important to make that distinction. Otherwise, its not >> clear if this is edict as my superior, or input as my peer. ;) >> > > When I wear a hat, it is a Red Hat. However I am bareheaded most often. > > (that is, look at the contents of my message, not who wrote it or his > role). Like it or not, maintainers always carry more weight when they opine what can and can't be done w.r.t. what can be perceived as their relevant subsystem. > >>> With virtio, the number is 1 (or less if you amortize). Set up the ring >>> entries and kick. >>> >> Again, I am just talking about basic PCI here, not the things we build >> on top. >> > > Whatever that means, it isn't interesting. Performance is measure for > the whole stack. > >> The point is: the things we build on top have costs associated with >> them, and I aim to minimize it. For instance, to do a "call()" kind of >> interface, you generally need to pre-setup some per-cpu mappings so that >> you can just do a single iowrite32() to kick the call off. Those >> per-cpu mappings have a cost if you want them to be high-performance, so >> my argument is that you ideally want to limit the number of times you >> have to do this. My current design reduces this to "once". >> > > Do you mean minimizing the setup cost? Seriously? Not the time-to-complete-setup overhead. The residual costs, like heap/vmap usage at run-time. You generally have to set up per-cpu mappings to gain maximum performance. You would need it per-device, I do it per-system. Its not a big deal in the grand-scheme of things, really. But chalk that up as an advantage to my approach over yours, nonetheless. > >>> There's no such thing as raw PCI. 
Every PCI device has a protocol. The >>> protocol virtio chose is optimized for virtualization. >>> >> And its a question of how that protocol scales, more than how the >> protocol works. >> >> Obviously the general idea of the protocol works, as vbus itself is >> implemented as a PCI-BRIDGE and is therefore limited to the underlying >> characteristics that I can get out of PCI (like PIO latency). >> > > I thought we agreed that was insignificant? I think I was agreeing with you, there. (e.g. obviously PIO latency is acceptable, as I use it to underpin vbus) > >>> As I've mentioned before, prioritization is available on x86 >>> >> But as Ive mentioned, it doesn't work very well. >> > > I guess it isn't that important then. I note that clever prioritization > in a guest is pointless if you can't do the same prioritization in the > host. I answer this below... > >>> , and coalescing scales badly. >>> >> Depends on what is scaling. Scaling vcpus? Yes, you are right. >> Scaling the number of devices? No, this is where it improves. >> > > If you queue pending messages instead of walking the device list, you > may be right. Still, if hard interrupt processing takes 10% of your > time you'll only have coalesced 10% of interrupts on average. > >>> irq window exits ought to be pretty rare, so we're only left with >>> injection vmexits. At around 1us/vmexit, even 100,000 interrupts/vcpu >>> (which is excessive) will only cost you 10% cpu time. >>> >> 1us is too much for what I am building, IMHO. > > You can't use current hardware then. The point is that I am eliminating as many exits as possible, so 1us, 2us, whatever...it doesn't matter. The fastest exit is the one you don't have to take. > >>> You're free to demultiplex an MSI to however many consumers you want, >>> there's no need for a new bus for that. >>> >> Hmmm...can you elaborate? >> > > Point all those MSIs at one vector. Its handler will have to poll all > the attached devices though. Right, thats broken. 
> >>> Do you use DNS. We use PCI-SIG. If Novell is a PCI-SIG member you can >>> get a vendor ID and control your own virtio space. >>> >> Yeah, we have our own id. I am more concerned about making this design >> make sense outside of PCI oriented environments. >> > > IIRC we reuse the PCI IDs for non-PCI. You already know how I feel about this gem. > > > > >>>>> That's a bug, not a feature. It means poor scaling as the number of >>>>> vcpus increases and as the number of devices increases. >>>>> >> vcpu increases, I agree (and am ok with, as I expect low vcpu count >> machines to be typical). > > I'm not okay with it. If you wish people to adopt vbus over virtio > you'll have to address all concerns, not just yours. By building a community around the development of vbus, isnt this what I am doing? Working towards making it usable for all? > >> nr of devices, I disagree. can you elaborate? >> > > With message queueing, I retract my remark. Ok. > >>> Windows, >>> >> Work in progress. >> > > Interesting. Do you plan to open source the code? If not, will the > binaries be freely available? Ideally, yeah. But I guess that has to go through legal, etc. Right now its primarily back-burnered. If someone wants to submit code to support this, great! > >> >>> large guests >>> >> Can you elaborate? I am not familiar with the term. >> > > Many vcpus. > >> >>> and multiqueue out of your design. >>> >> AFAICT, multiqueue should work quite nicely with vbus. Can you >> elaborate on where you see the problem? >> > > You said you aren't interested in it previously IIRC. > I don't think so, no. Perhaps I misspoke or was misunderstood. I actually think its a good idea and will be looking to do this. >>>>> x86 APIC is priority aware. >>>>> >>>>> >>>> Have you ever tried to use it? >>>> >>>> >>> I haven't, but Windows does. >>> >> Yeah, it doesn't really work well. 
Its an extremely rigid model that >> (IIRC) only lets you prioritize in 16 groups spaced by IDT (0-15 are one >> level, 16-31 are another, etc). Most of the embedded PICs I have worked >> with supported direct remapping, etc. But in any case, Linux doesn't >> support it so we are hosed no matter how good it is. >> > > I agree that it isn't very clever (not that I am a real time expert) but > I disagree about dismissing Linux support so easily. If prioritization > is such a win it should be a win on the host as well and we should make > it work on the host as well. Further I don't see how priorities on the > guest can work if they don't on the host. It's more about task priority in the case of real-time. We do stuff with 802.1p as well for control messages, etc. But for the most part, this is an orthogonal effort. And yes, you are right, it would be nice to have this interrupt classification capability in the host. Generally this is mitigated by the use of irq-threads. You could argue that if irq-threads help the host without a prioritized interrupt controller, why can't the guest? The answer is simply that the host can afford sub-optimal behavior w.r.t. IDT injection here, where the guest cannot (due to the disparity of hw-injection vs guest-injection overheads). IOW: The cost of an IDT dispatch in real hardware adds minimal latency, even if a low-priority IDT preempts a high-priority interrupt thread. The cost of an IDT dispatch in a guest, OTOH, especially when you factor in the complete picture (IPI-exit, inject, eoi exit, re-enter) is greater... too great, in fact. So if you can get the guest's interrupts priority aware, you can avoid even the IDT preempting the irq-thread until the system is in the ideal state. > >>>> >>>> >>> They had to build connectors just like you propose to do. >>> >> More importantly, they had to build back-end busses too, no? >> > > They had to write 414 lines in drivers/s390/kvm/kvm_virtio.c and > something similar for lguest. 
Well, then I retract that statement. I think the small amount of code is probably because they are re-using the qemu device-models, however. Note that I am essentially advocating the same basic idea here. > >>> But you still need vbus-connector-lguest and vbus-connector-s390 because >>> they all talk to the host differently. So what's changed? the names? >>> >> The fact that they don't need to redo most of the in-kernel backend >> stuff. Just the connector. >> > > So they save 414 lines but have to write a connector which is... how large? I guess that depends on the features they want. A pci-based connector would probably be pretty thin, since you don't need event channels like I use in the pci-bridge connector. The idea, of course, is that the vbus can become your whole bus if you want. So you wouldn't need to tunnel, say, vbus over some lguest bus. You just base the design on vbus outright. Note that this was kind of what the first pass of vbus did for KVM. The bus was exposed via cpuid and hypercalls as kind of a system-service. It wasn't until later that I surfaced it as a bridge model. > >>> Well, venet doesn't complement virtio-net, and virtio-pci doesn't >>> complement vbus-connector. >>> >> Agreed, but virtio complements vbus by virtue of virtio-vbus. >> > > I don't see what vbus adds to virtio-net. Well, as you stated in your last reply, you don't want it. So I guess that doesn't matter much at this point. I will continue developing vbus, and pushing things your way. You can opt to accept or reject those things at your own discretion. Kind Regards, -Greg
Hi Nicholas Nicholas A. Bellinger wrote: > On Wed, 2009-08-19 at 10:11 +0300, Avi Kivity wrote: >> On 08/19/2009 09:28 AM, Gregory Haskins wrote: >>> Avi Kivity wrote: > > <SNIP> > >>> Basically, what it comes down to is both vbus and vhost need >>> configuration/management. Vbus does it with sysfs/configfs, and vhost >>> does it with ioctls. I ultimately decided to go with sysfs/configfs >>> because, at least that the time I looked, it seemed like the "blessed" >>> way to do user->kernel interfaces. >>> >> I really dislike that trend but that's an unrelated discussion. >> >>>> They need to be connected to the real world somehow. What about >>>> security? can any user create a container and devices and link them to >>>> real interfaces? If not, do you need to run the VM as root? >>>> >>> Today it has to be root as a result of weak mode support in configfs, so >>> you have me there. I am looking for help patching this limitation, though. >>> >>> >> Well, do you plan to address this before submission for inclusion? >> > > Greetings Avi and Co, > > I have been following this thread, and although I cannot say that I am > intimately fimilar with all of the virtualization considerations > involved to really add anything use to that side of the discussion, I > think you guys are doing a good job of explaining the technical issues > for the non virtualization wizards following this thread. :-) > > Anyways, I was wondering if you might be interesting in sharing your > concerns wrt to configfs (conigfs maintainer CC'ed), at some point..? 
So for those tuning in, the reference here is the use of configfs for the management of this component of AlacrityVM, called "virtual-bus" http://developer.novell.com/wiki/index.php/Virtual-bus > As you may recall, I have been using configfs extensively for the 3.x > generic target core infrastructure and iSCSI fabric modules living in > lio-core-2.6.git/drivers/target/target_core_configfs.c and > lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found > it to be extraordinarly useful for the purposes of a implementing a > complex kernel level target mode stack that is expected to manage > massive amounts of metadata, allow for real-time configuration, share > data structures (eg: SCSI Target Ports) between other kernel fabric > modules and manage the entire set of fabrics using only intrepetered > userspace code. I concur. Configfs provided me a very natural model to express resource-containers and their respective virtual-device objects. > > Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs <-> iSCSI Target > Endpoints inside of a KVM Guest (from the results in May posted with > IOMMU aware 10 Gb on modern Nahelem hardware, see > http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to > dump the entire running target fabric configfs hierarchy to a single > struct file on a KVM Guest root device using python code on the order of > ~30 seconds for those 10000 active iSCSI endpoints. 
In configfs terms, > this means: > > *) 7 configfs groups (directories), ~50 configfs attributes (files) per > Virtual HBA+FILEIO LUN > *) 15 configfs groups (directories), ~60 configfs attributes (files per > iSCSI fabric Endpoint > > Which comes out to a total of ~220000 groups and ~1100000 attributes > active configfs objects living in the configfs_dir_cache that are being > dumped inside of the single KVM guest instances, including symlinks > between the fabric modules to establish the SCSI ports containing > complete set of SPC-4 and RFC-3720 features, et al. > > Also on the kernel <-> user API interaction compatibility side, I have > found the 3.x configfs enabled code adventagous over the LIO 2.9 code > (that used an ioctl for everything) because it allows us to do backwards > compat for future versions without using any userspace C code, which in > IMHO makes maintaining userspace packages for complex kernel stacks with > massive amounts of metadata + real-time configuration considerations. > No longer having ioctl compatibility issues between LIO versions as the > structures passed via ioctl change, and being able to do backwards > compat with small amounts of interpreted code against configfs layout > changes makes maintaining the kernel <-> user API really have made this > that much easier for me. > > Anyways, I though these might be useful to the discussion as it releates > to potental uses of configfs on the KVM Host or other projects that > really make sense, and/or to improve the upstream implementation so that > other users (like myself) can benefit from improvements to configfs. > > Many thanks for your most valuable of time, Thank you for the explanation of your setup. Configfs mostly works for the vbus project "as is". As Avi pointed out, I currently have a limitation w.r.t. perms. Forgive me if what I am about to say is overly simplistic. Its been quite a few months since I worked on the configfs portion of the code, so my details may be fuzzy. 
What it boiled down to is that I need a way to better manage perms (being able to do it across both sysfs and configfs would be ideal). For instance, I would like to be able to assign groups to configfs directories, like /config/vbus/devices, such that mkdir /config/vbus/devices/foo would not require root if that GID was permitted. Are there ways to do this (now, or in upcoming releases)? If not, I may be interested in helping to add this feature, so please advise how best to achieve this. Kind Regards, -Greg
On Wed, 2009-08-19 at 14:39 -0400, Gregory Haskins wrote: > Hi Nicholas > > Nicholas A. Bellinger wrote: > > On Wed, 2009-08-19 at 10:11 +0300, Avi Kivity wrote: > >> On 08/19/2009 09:28 AM, Gregory Haskins wrote: > >>> Avi Kivity wrote: > > > > <SNIP> > > > >>> Basically, what it comes down to is both vbus and vhost need > >>> configuration/management. Vbus does it with sysfs/configfs, and vhost > >>> does it with ioctls. I ultimately decided to go with sysfs/configfs > >>> because, at least that the time I looked, it seemed like the "blessed" > >>> way to do user->kernel interfaces. > >>> > >> I really dislike that trend but that's an unrelated discussion. > >> > >>>> They need to be connected to the real world somehow. What about > >>>> security? can any user create a container and devices and link them to > >>>> real interfaces? If not, do you need to run the VM as root? > >>>> > >>> Today it has to be root as a result of weak mode support in configfs, so > >>> you have me there. I am looking for help patching this limitation, though. > >>> > >>> > >> Well, do you plan to address this before submission for inclusion? > >> > > > > Greetings Avi and Co, > > > > I have been following this thread, and although I cannot say that I am > > intimately fimilar with all of the virtualization considerations > > involved to really add anything use to that side of the discussion, I > > think you guys are doing a good job of explaining the technical issues > > for the non virtualization wizards following this thread. :-) > > > > Anyways, I was wondering if you might be interesting in sharing your > > concerns wrt to configfs (conigfs maintainer CC'ed), at some point..? 
> > So for those tuning in, the reference here is the use of configfs for > the management of this component of AlacrityVM, called "virtual-bus" > > http://developer.novell.com/wiki/index.php/Virtual-bus > > > As you may recall, I have been using configfs extensively for the 3.x > > generic target core infrastructure and iSCSI fabric modules living in > > lio-core-2.6.git/drivers/target/target_core_configfs.c and > > lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found > > it to be extraordinarly useful for the purposes of a implementing a > > complex kernel level target mode stack that is expected to manage > > massive amounts of metadata, allow for real-time configuration, share > > data structures (eg: SCSI Target Ports) between other kernel fabric > > modules and manage the entire set of fabrics using only intrepetered > > userspace code. > > I concur. Configfs provided me a very natural model to express > resource-containers and their respective virtual-device objects. > > > > > Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs <-> iSCSI Target > > Endpoints inside of a KVM Guest (from the results in May posted with > > IOMMU aware 10 Gb on modern Nahelem hardware, see > > http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to > > dump the entire running target fabric configfs hierarchy to a single > > struct file on a KVM Guest root device using python code on the order of > > ~30 seconds for those 10000 active iSCSI endpoints. 
In configfs terms, > > this means: > > > > *) 7 configfs groups (directories), ~50 configfs attributes (files) per > > Virtual HBA+FILEIO LUN > > *) 15 configfs groups (directories), ~60 configfs attributes (files per > > iSCSI fabric Endpoint > > > > Which comes out to a total of ~220000 groups and ~1100000 attributes > > active configfs objects living in the configfs_dir_cache that are being > > dumped inside of the single KVM guest instances, including symlinks > > between the fabric modules to establish the SCSI ports containing > > complete set of SPC-4 and RFC-3720 features, et al. > > > > Also on the kernel <-> user API interaction compatibility side, I have > > found the 3.x configfs enabled code adventagous over the LIO 2.9 code > > (that used an ioctl for everything) because it allows us to do backwards > > compat for future versions without using any userspace C code, which in > > IMHO makes maintaining userspace packages for complex kernel stacks with > > massive amounts of metadata + real-time configuration considerations. > > No longer having ioctl compatibility issues between LIO versions as the > > structures passed via ioctl change, and being able to do backwards > > compat with small amounts of interpreted code against configfs layout > > changes makes maintaining the kernel <-> user API really have made this > > that much easier for me. > > > > Anyways, I though these might be useful to the discussion as it releates > > to potental uses of configfs on the KVM Host or other projects that > > really make sense, and/or to improve the upstream implementation so that > > other users (like myself) can benefit from improvements to configfs. > > > > Many thanks for your most valuable of time, > > Thank you for the explanation of your setup. > > Configfs mostly works for the vbus project "as is". As Avi pointed out, > I currently have a limitation w.r.t. perms. Forgive me if what I am > about to say is overly simplistic. 
Its been quite a few months since I > worked on the configfs portion of the code, so my details may be fuzzy. > > What it boiled down to is I need is a way to better manage perms I have not looked at implementing this personally, so I am not sure how this would look in fs/configfs/ off the top of my head.. Joel, have you had any thoughts on this..? > (and to > be able to do it cross sysfs and configfs would be ideal). > I had coded up a patch last year to to allow configfs to access sysfs symlinks in the context of target_core_mod storage object (Linux/SCSI, Linux/Block, Linux/FILEIO) registration, which did work but ended up not really making sense and was (thankully) rejected by GregKH, more of that discussion here: http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-10/msg06559.html I am not sure if the sharing of permissions between sysfs and configfs would run into the same types of limitiations as the above.. > For instance, I would like to be able to assign groups to configfs > directories, like /config/vbus/devices, such that > > mkdir /config/vbus/devices/foo > > would not require root if that GID was permitted. > > Are there ways to do this (now, or in upcoming releases)? If not, I may > be interested in helping to add this feature, so please advise how best > to achieve this. > Not that I am aware of. However, I think this would be useful for generic configfs, and I think user/group permissions on configfs groups/dirs and attribute/items would be quite useful for the LIO 3.x configfs enabled generic target engine. Many thanks for your most valuable of time, --nab > Kind Regards, > -Greg > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2009-08-19 at 12:19 -0700, Nicholas A. Bellinger wrote: > On Wed, 2009-08-19 at 14:39 -0400, Gregory Haskins wrote: > > Hi Nicholas > > > > Nicholas A. Bellinger wrote: > > > On Wed, 2009-08-19 at 10:11 +0300, Avi Kivity wrote: > > >> On 08/19/2009 09:28 AM, Gregory Haskins wrote: > > >>> Avi Kivity wrote: > > > > > > <SNIP> > > > > > >>> Basically, what it comes down to is both vbus and vhost need > > >>> configuration/management. Vbus does it with sysfs/configfs, and vhost > > >>> does it with ioctls. I ultimately decided to go with sysfs/configfs > > >>> because, at least that the time I looked, it seemed like the "blessed" > > >>> way to do user->kernel interfaces. > > >>> > > >> I really dislike that trend but that's an unrelated discussion. > > >> > > >>>> They need to be connected to the real world somehow. What about > > >>>> security? can any user create a container and devices and link them to > > >>>> real interfaces? If not, do you need to run the VM as root? > > >>>> > > >>> Today it has to be root as a result of weak mode support in configfs, so > > >>> you have me there. I am looking for help patching this limitation, though. > > >>> > > >>> > > >> Well, do you plan to address this before submission for inclusion? > > >> > > > > > > Greetings Avi and Co, > > > > > > I have been following this thread, and although I cannot say that I am > > > intimately fimilar with all of the virtualization considerations > > > involved to really add anything use to that side of the discussion, I > > > think you guys are doing a good job of explaining the technical issues > > > for the non virtualization wizards following this thread. :-) > > > > > > Anyways, I was wondering if you might be interesting in sharing your > > > concerns wrt to configfs (conigfs maintainer CC'ed), at some point..? 
> > > > So for those tuning in, the reference here is the use of configfs for > > the management of this component of AlacrityVM, called "virtual-bus" > > > > http://developer.novell.com/wiki/index.php/Virtual-bus > > > > > As you may recall, I have been using configfs extensively for the 3.x > > > generic target core infrastructure and iSCSI fabric modules living in > > > lio-core-2.6.git/drivers/target/target_core_configfs.c and > > > lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found > > > it to be extraordinarly useful for the purposes of a implementing a > > > complex kernel level target mode stack that is expected to manage > > > massive amounts of metadata, allow for real-time configuration, share > > > data structures (eg: SCSI Target Ports) between other kernel fabric > > > modules and manage the entire set of fabrics using only intrepetered > > > userspace code. > > > > I concur. Configfs provided me a very natural model to express > > resource-containers and their respective virtual-device objects. > > > > > > > > Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs <-> iSCSI Target > > > Endpoints inside of a KVM Guest (from the results in May posted with > > > IOMMU aware 10 Gb on modern Nahelem hardware, see > > > http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to > > > dump the entire running target fabric configfs hierarchy to a single > > > struct file on a KVM Guest root device using python code on the order of > > > ~30 seconds for those 10000 active iSCSI endpoints. 
In configfs terms, > > > this means: > > > > > > *) 7 configfs groups (directories), ~50 configfs attributes (files) per > > > Virtual HBA+FILEIO LUN > > > *) 15 configfs groups (directories), ~60 configfs attributes (files per > > > iSCSI fabric Endpoint > > > > > > Which comes out to a total of ~220000 groups and ~1100000 attributes > > > active configfs objects living in the configfs_dir_cache that are being > > > dumped inside of the single KVM guest instances, including symlinks > > > between the fabric modules to establish the SCSI ports containing > > > complete set of SPC-4 and RFC-3720 features, et al. > > > > > > Also on the kernel <-> user API interaction compatibility side, I have > > > found the 3.x configfs enabled code adventagous over the LIO 2.9 code > > > (that used an ioctl for everything) because it allows us to do backwards > > > compat for future versions without using any userspace C code, which in > > > IMHO makes maintaining userspace packages for complex kernel stacks with > > > massive amounts of metadata + real-time configuration considerations. > > > No longer having ioctl compatibility issues between LIO versions as the > > > structures passed via ioctl change, and being able to do backwards > > > compat with small amounts of interpreted code against configfs layout > > > changes makes maintaining the kernel <-> user API really have made this > > > that much easier for me. > > > > > > Anyways, I though these might be useful to the discussion as it releates > > > to potental uses of configfs on the KVM Host or other projects that > > > really make sense, and/or to improve the upstream implementation so that > > > other users (like myself) can benefit from improvements to configfs. > > > > > > Many thanks for your most valuable of time, > > > > Thank you for the explanation of your setup. > > > > Configfs mostly works for the vbus project "as is". As Avi pointed out, > > I currently have a limitation w.r.t. perms. 
Forgive me if what I am > > about to say is overly simplistic. Its been quite a few months since I > > worked on the configfs portion of the code, so my details may be fuzzy. > > > > What it boiled down to is I need is a way to better manage perms > > I have not looked at implementing this personally, so I am not sure how > > this would look in fs/configfs/ off the top of my head.. Joel, have you > > had any thoughts on this..? > Actually, something that I have been using for simple stuff is: if (!capable(CAP_SYS_ADMIN)) for controlling I/O to configfs attributes from non-privileged users for iSCSI authentication information living in struct config_item_operations lio_target_nacl_auth_cit, the code is here: http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=blob;f=drivers/lio-core/iscsi_target_configfs.c;h=1230b74577076a184b756b3883fb56c6050c7d87;hb=HEAD#l803 I am also using the CONFIGFS Extended Macros, CONFIGFS_EATTR(), which I created to allow me to use more than one struct config_groups per parent structure, and use fewer lines of code when defining configfs attributes using generic store() and show() functions: http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=blob;f=include/target/configfs_macros.h --nab -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
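[The capable(CAP_SYS_ADMIN) pattern referenced above boils down to the shape below. This is a userspace mock, not the actual LIO code: the real capable() queries the calling task's credentials in-kernel, so here a flag stands in for it purely to show the control flow.]

```c
#include <stdio.h>
#include <errno.h>
#include <string.h>

/* Mock of capable(CAP_SYS_ADMIN); in the kernel this checks the
 * current task's capability set. */
static int fake_capable_sys_admin = 0;

/* Shape of a configfs attribute ->store() gated on CAP_SYS_ADMIN,
 * as in the iSCSI authentication attributes nab points to. */
static ssize_t auth_store(const char *buf, size_t count)
{
    if (!fake_capable_sys_admin)
        return -EACCES;                 /* unprivileged writer rejected */
    /* ... would parse and apply the authentication attribute here ... */
    return (ssize_t)count;
}

int main(void)
{
    ssize_t ret = auth_store("password=secret", 15);
    printf("unprivileged store: %zd (%s)\n", ret,
           ret < 0 ? strerror((int)-ret) : "ok");

    fake_capable_sys_admin = 1;         /* pretend we hold CAP_SYS_ADMIN */
    ret = auth_store("password=secret", 15);
    printf("privileged store: %zd\n", ret);
    return 0;
}
```

Note this gates on a capability rather than on file ownership, which is exactly why it does not by itself give Greg the per-GID delegation he is asking for.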
On 08/19/2009 09:23 PM, Nicholas A. Bellinger wrote: > Anyways, I was wondering if you might be interested in sharing your > concerns wrt configfs (configfs maintainer CC'ed), at some point..? > My concerns aren't specifically with configfs, but with all the text based pseudo filesystems that the kernel exposes. My high level concern is that we're optimizing for the active sysadmin, not for libraries and management programs. configfs and sysfs are easy to use from the shell, discoverable, and easily scripted. But they discourage documentation, the text format is ambiguous, and they require a lot of boilerplate to use in code. You could argue that you can wrap *fs in a library that hides the details of accessing it, but that's the wrong approach IMO. We should make the information easy to use and manipulate for programs; one of these programs can be a fuse filesystem for the active sysadmin if someone thinks it's important. Now for the low level concerns:

- efficiency

Each attribute access requires an open/read/close triplet and binary->ascii->binary conversions. In contrast an ordinary syscall/ioctl interface can fetch all attributes of an object, or even all attributes of all objects, in one call.

- atomicity

One attribute per file means that, lacking userspace-visible transactions, there is no way to change several attributes at once. When you read attributes, there is no way to read several attributes atomically so you can be sure their values correlate. Another example of a problem is when an object disappears while reading its attributes. Sure, openat() can mitigate this, but it's better to avoid introducing a problem than to fix it afterwards.

- ambiguity

What format is the attribute? does it accept lowercase or uppercase hex digits? is there a newline at the end? how many digits can it take before the attribute overflows? All of this has to be documented and checked by the OS, otherwise we risk regressions later.
In contrast, __u64 says everything in a binary interface.

- lifetime and access control

If a process brings an object into being (using mkdir) and then dies, the object remains behind. The syscall/ioctl approach ties the object into an fd, which will be destroyed when the process dies, and which can be passed around using SCM_RIGHTS, allowing a server process to create and configure an object before passing it to an unprivileged program.

- notifications

It's hard to notify users about changes in attributes. Sure, you can use inotify, but that limits you to watching subtrees. Once you do get the notification, you run into the atomicity problem. When do you know all attributes are valid? This can be solved using sequence counters, but that's just gratuitous complexity. Netlink type interfaces are much more robust and flexible.

- readdir

You can either list everything, or nothing. Sure, you can have trees to ease searching, even multiple views of the same data, but it's painful.

You may argue, correctly, that syscalls and ioctls are not as flexible. But this is because no one has invested the effort in making them so. A struct passed as an argument to a syscall is not extensible. But if you pass the size of the structure, and also a bitmap of which attributes are present, you gain extensibility and retain the atomicity property of a syscall interface. I don't think a lot of effort is needed to make an extensible syscall interface just as usable and a lot more efficient than configfs/sysfs. It should also be simple to bolt a fuse interface on top to expose it to us commandline types.
> As you may recall, I have been using configfs extensively for the 3.x > generic target core infrastructure and iSCSI fabric modules living in > lio-core-2.6.git/drivers/target/target_core_configfs.c and > lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found > it to be extraordinarily useful for the purposes of implementing a > complex kernel level target mode stack that is expected to manage > massive amounts of metadata, allow for real-time configuration, share > data structures (eg: SCSI Target Ports) between other kernel fabric > modules and manage the entire set of fabrics using only interpreted > userspace code. > > Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs <-> iSCSI Target > Endpoints inside of a KVM Guest (from the results in May posted with > IOMMU aware 10 Gb on modern Nehalem hardware, see > http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to > dump the entire running target fabric configfs hierarchy to a single > struct file on a KVM Guest root device using python code on the order of > ~30 seconds for those 10000 active iSCSI endpoints. In configfs terms, > this means: > > *) 7 configfs groups (directories), ~50 configfs attributes (files) per > Virtual HBA+FILEIO LUN > *) 15 configfs groups (directories), ~60 configfs attributes (files) per > iSCSI fabric Endpoint > > Which comes out to a total of ~220000 group and ~1100000 attribute active > configfs objects living in the configfs_dir_cache that are being > dumped inside of a single KVM guest instance, including symlinks > between the fabric modules to establish the SCSI ports containing the > complete set of SPC-4 and RFC-3720 features, et al. > You achieved 3 million syscalls/sec from Python code? That's very impressive. Note with syscalls you could have done it with 10K syscalls (Python supports packing and unpacking structs quite well, and also directly calling C code IIRC).
> Also on the kernel <-> user API interaction compatibility side, I have > found the 3.x configfs enabled code advantageous over the LIO 2.9 code > (that used an ioctl for everything) because it allows us to do backwards > compat for future versions without using any userspace C code, which > IMHO makes maintaining userspace packages for complex kernel stacks with > massive amounts of metadata + real-time configuration considerations that much easier. > No longer having ioctl compatibility issues between LIO versions as the > structures passed via ioctl change, and being able to do backwards > compat with small amounts of interpreted code against configfs layout > changes, has really made maintaining the kernel <-> user API that much easier for me. > configfs is more maintainable than a bunch of hand-maintained ioctls. But if we put some effort into an extendable syscall infrastructure (perhaps to the point of using an IDL) I'm sure we can improve on that without the problems pseudo filesystems introduce. > Anyways, I thought these might be useful to the discussion as it relates > to potential uses of configfs on the KVM Host or other projects where it > really makes sense, and/or to improve the upstream implementation so that > other users (like myself) can benefit from improvements to configfs. > I can't really fault a project for using configfs; it's an accepted and recommended (by the community) interface. I'd much prefer it though if there was an effort to create a usable fd/struct based alternative.
On 08/19/2009 09:26 PM, Gregory Haskins wrote: >>> This is for things like the setup of queue-pairs, and the transport of >>> door-bells, and ib-verbs. I am not on the team doing that work, so I am >>> not an expert in this area. What I do know is having a flexible and >>> low-latency signal-path was deemed a key requirement. >>> >>> >> That's not a full bypass, then. AFAIK kernel bypass has userspace >> talking directly to the device. >> > Like I said, I am not an expert on the details here. I only work on the > vbus plumbing. FWIW, the work is derivative from the "Xen-IB" project > > http://www.openib.org/archives/nov2006sc/xen-ib-presentation.pdf > > There were issues with getting Xen-IB to map well into the Xen model. > Vbus was specifically designed to address some of those short-comings. > Well I'm not an Infiniband expert. But from what I understand VMM bypass means avoiding the call to the VMM entirely by exposing hardware registers directly to the guest. >> This is best done using cr8/tpr so you don't have to exit at all. See >> also my vtpr support for Windows which does this in software, generally >> avoiding the exit even when lowering priority. >> > You can think of vTPR as a good model, yes. Generally, you can't > actually use it for our purposes for several reasons, however: > > 1) the prio granularity is too coarse (16 levels, -rt has 100) > > 2) it is too scope limited (it covers only interrupts, we need to have > additional considerations, like nested guest/host scheduling algorithms > against the vcpu, and prio-remap policies) > > 3) I use "priority" generally..there may be other non-priority based > policies that need to add state to the table (such as EDF deadlines, etc). > > but, otherwise, the idea is the same. Besides, this was one example. > Well, if priority is so important then I'd recommend exposing it via a virtual interrupt controller. 
A bus is the wrong model to use, because its scope is only the devices it contains, and because it is system-wide in nature, not per-cpu. >>> This is where the really fast call() type mechanism is important. >>> >>> Its also about having the priority flow end-to-end, and having the vcpu >>> interrupt state affect the task-priority, etc (e.g. pending interrupts >>> affect the vcpu task prio). >>> >>> etc, etc. >>> >>> I can go on and on (as you know ;), but will wait till this work is more >>> concrete and proven. >>> >>> >> Generally cpu state shouldn't flow through a device but rather through >> MSRs, hypercalls, and cpu registers. >> > > Well, you can blame yourself for that one ;) > > The original vbus was implemented as cpuid+hypercalls, partly for that > reason. You kicked me out of kvm.ko, so I had to make do with plan B > via a less direct PCI-BRIDGE route. > A bus has no business doing these things. But cpu state definitely needs to be manipulated using hypercalls, see the pvmmu and vtpr hypercalls or the pvclock msr. > But in reality, it doesn't matter much. You can certainly have "system" > devices sitting on vbus that fit a similar role as "MSRs", so the access > method is more of an implementation detail. The key is it needs to be > fast, and optimize out extraneous exits when possible. > No, percpu state belongs in the vcpu model, not the device model. cpu priority is logically a cpu register or state, not device state. >> Well, do you plan to address this before submission for inclusion? >> > Maybe, maybe not. Its workable for now (i.e. run as root), so its > inclusion is not predicated on the availability of the fix, per se (at > least IMHO). If I can get it working before I get to pushing the core, > great! Patches welcome. > The lack of so many features indicates the whole thing is immature. That would be fine if the whole thing was the first of its kind, but it isn't.
> For the time being, windows will not be RT, and windows can fall-back to > use virtio-net, etc. So I am ok with this. It will come in due time. > > So we need to work on optimizing both virtio-net and venet. Great. >>> The point is: the things we build on top have costs associated with >>> them, and I aim to minimize it. For instance, to do a "call()" kind of >>> interface, you generally need to pre-setup some per-cpu mappings so that >>> you can just do a single iowrite32() to kick the call off. Those >>> per-cpu mappings have a cost if you want them to be high-performance, so >>> my argument is that you ideally want to limit the number of times you >>> have to do this. My current design reduces this to "once". >>> >>> >> Do you mean minimizing the setup cost? Seriously? >> > Not the time-to-complete-setup overhead. The residual costs, like > heap/vmap usage at run-time. You generally have to set up per-cpu > mappings to gain maximum performance. You would need it per-device, I > do it per-system. Its not a big deal in the grand-scheme of things, > really. But chalk that up as an advantage to my approach over yours, > nonetheless. > Without measurements, it's just handwaving. >> I guess it isn't that important then. I note that clever prioritization >> in a guest is pointless if you can't do the same prioritization in the >> host. >> > I answer this below... > > The point is that I am eliminating as many exits as possible, so 1us, > 2us, whatever...it doesn't matter. The fastest exit is the one you > don't have to take. > You'll still have to exit if the host takes a low priority interrupt, schedule the irq thread according to its priority, and return to the guest. At this point you may as well inject the interrupt and let the guest do the same thing. >> IIRC we reuse the PCI IDs for non-PCI. >> > > You already know how I feel about this gem. > The earth keeps rotating despite the widespread use of PCI IDs. >> I'm not okay with it. 
If you wish people to adopt vbus over virtio >> you'll have to address all concerns, not just yours. >> > By building a community around the development of vbus, isn't this what I > am doing? Working towards making it usable for all? > I've no idea if you're actually doing that. Maybe inclusion should be predicated on achieving feature parity. >>>> and multiqueue out of your design. >>>> >>>> >>> AFAICT, multiqueue should work quite nicely with vbus. Can you >>> elaborate on where you see the problem? >>> >>> >> You said you aren't interested in it previously IIRC. >> >> > I don't think so, no. Perhaps I misspoke or was misunderstood. I > actually think it's a good idea and will be looking to do this. > When I pointed out that multiplexing all interrupts onto a single vector is bad for per-vcpu multiqueue, you said you're not interested in that. >> I agree that it isn't very clever (not that I am a real time expert) but >> I disagree about dismissing Linux support so easily. If prioritization >> is such a win it should be a win on the host as well and we should make >> it work on the host as well. Further I don't see how priorities on the >> guest can work if they don't on the host. >> > It's more about task priority in the case of real-time. We do stuff with > 802.1p as well for control messages, etc. But for the most part, this > is an orthogonal effort. And yes, you are right, it would be nice to > have this interrupt classification capability in the host. > > Generally this is mitigated by the use of irq-threads. You could argue > that if irq-threads help the host without a prioritized interrupt > controller, why can't the guest? The answer is simply that the host can > afford sub-optimal behavior w.r.t. IDT injection here, where the guest > cannot (due to the disparity of hw-injection vs guest-injection overheads). > Guest injection overhead is not too bad, most of the cost is the exit itself, and you can't avoid that without host task priorities.
>> They had to write 414 lines in drivers/s390/kvm/kvm_virtio.c and >> something similar for lguest. >> > Well, then I retract that statement. I think the small amount of code > is probably because they are re-using the qemu device-models, however. > No that's guest code, it isn't related to qemu. > Note that I am essentially advocating the same basic idea here. > Right, duplicating existing infrastructure. >> I don't see what vbus adds to virtio-net. >> > Well, as you stated in your last reply, you don't want it. So I guess > that doesn't matter much at this point. I will continue developing > vbus, and pushing things your way. You can opt to accept or reject > those things at your own discretion. > I'm not the one to merge it. However my opinion is that it shouldn't be merged.
* Avi Kivity <avi@redhat.com> wrote: > You may argue, correctly, that syscalls and ioctls are > not as flexible. But this is because no one has > invested the effort in making them so. A struct passed > as an argument to a syscall is not extensible. But if > you pass the size of the structure, and also a bitmap > of which attributes are present, you gain extensibility > and retain the atomicity property of a syscall > interface. I don't think a lot of effort is needed to > make an extensible syscall interface just as usable and > a lot more efficient than configfs/sysfs. It should > also be simple to bolt a fuse interface on top to > expose it to us commandline types. FYI, an example of such a syscall design and implementation has been merged upstream in the .31 merge window, see: kernel/perf_counter.c::sys_perf_counter_open()

SYSCALL_DEFINE5(perf_counter_open, struct perf_counter_attr __user *, attr_uptr, pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)

We embed a '.size' field in struct perf_counter_attr. We copy the attribute from user-space in an 'auto-extend-to-zero' way:

ret = perf_copy_attr(attr_uptr, &attr); if (ret) return ret;

where perf_copy_attr() extends the possibly-smaller user-space structure to the in-kernel structure and zeroes out remaining fields. This means that older binaries can pass in older (smaller) versions of the structure. This syscall ABI design works very well and has a lot of advantages:

- is extensible in a flexible way

- it is forwards ABI compatible - the kernel is backwards compatible with applications

- extensions to the ABI don't uglify the interface.

- new applications can fall back gracefully to older ABI versions if they so choose. (the kernel will reject overlarge attr.size) So full forwards and backwards compatibility can be implemented, if an app wants to.

- 'same version' ABI uses don't have any interface quirk or performance penalty. (i.e.
there's no increasingly complex maze of add-on ABI details for the syscall to multiplex through)

- the system call stays nice and readable

We've made use of this property of the perfcounters ABI and extended it in a compatible way several times already, with great success. Ingo
* Avi Kivity <avi@redhat.com> wrote: >>> IIRC we reuse the PCI IDs for non-PCI. >>> >> >> You already know how I feel about this gem. > > The earth keeps rotating despite the widespread use of > PCI IDs. Btw., PCI IDs are a great way to arbitrate interfaces planet-wide, in an OS-neutral, depoliticized and well-established way. It's a bit like CPUID for CPUs, just on a much larger scope. Ingo
On 08/19/2009 11:48 PM, Ingo Molnar wrote: > > FYI, an example of such a syscall design and > implementation has been merged upstream in the .31 merge > window, see: > > <big snip> > > Exactly. It's beautiful.
On Wed, 2009-08-19 at 19:38 +0300, Avi Kivity wrote: > On 08/19/2009 07:29 PM, Ira W. Snyder wrote: > > > > > >> virtio-$yourhardware or maybe virtio-dma > >> > >> > > How about virtio-phys? > > > > Could work. > > > Arnd and BenH are both looking at PPC systems (similar to mine). Grant > > Likely is looking at talking to an processor core running on an FPGA, > > IIRC. Most of the code can be shared, very little should need to be > > board-specific, I hope. > > > > Excellent. > > >>> That said, I'm not sure how qemu-system-ppc running on x86 could > >>> possibly communicate using virtio-net. This would mean the guest is an > >>> emulated big-endian PPC, while the host is a little-endian x86. I > >>> haven't actually tested this situation, so perhaps I am wrong. > >>> > >>> > >> I'm confused now. You don't actually have any guest, do you, so why > >> would you run qemu at all? > >> > >> > > I do not run qemu. I am just stating a problem with virtio-net that I > > noticed. This is just so someone more knowledgeable can be aware of the > > problem. > > > > > > Ah, it certainly doesn't byteswap. Maybe nobody tried it. Hollis? I've never tried it. I've only used virtio with matching guest/host architectures.
On Wed, 2009-08-19 at 23:12 +0300, Avi Kivity wrote: > On 08/19/2009 09:23 PM, Nicholas A. Bellinger wrote: > > Anyways, I was wondering if you might be interested in sharing your > > concerns wrt configfs (configfs maintainer CC'ed), at some point..? > > > > My concerns aren't specifically with configfs, but with all the text > based pseudo filesystems that the kernel exposes. > <nod> > My high level concern is that we're optimizing for the active sysadmin, > not for libraries and management programs. configfs and sysfs are easy > to use from the shell, discoverable, and easily scripted. But they > discourage documentation, the text format is ambiguous, and they require > a lot of boilerplate to use in code. > > You could argue that you can wrap *fs in a library that hides the > details of accessing it, but that's the wrong approach IMO. We should > make the information easy to use and manipulate for programs; one of > these programs can be a fuse filesystem for the active sysadmin if > someone thinks it's important. > > Now for the low level concerns: > > - efficiency > > Each attribute access requires an open/read/close triplet and > binary->ascii->binary conversions. In contrast an ordinary > syscall/ioctl interface can fetch all attributes of an object, or even > all attributes of all objects, in one call. > I agree that syscalls/ioctls can, given enough coding effort, use a potentially much smaller number of total syscalls than a pseudo filesystem such as configfs. In the case of the configfs enabled generic target engine, I have not found this to be particularly limiting in terms of management on modern x86_64 virtualized hardware inside of KVM Guests with my development so far. > - atomicity > > One attribute per file means that, lacking userspace-visible > transactions, there is no way to change several attributes at once.
> When you read attributes, Actually, something like this can be done in struct config_item_type->ct_attrs[] by changing the attributes you want, but not making them active until pulling a separate configfs item 'trigger' in the group to make the changes take effect. I am doing something similar to this now during fabric bringup while each iSCSI Target module is configured, and then an enable trigger is thrown to allow iSCSI Initiators to actually log in to the endpoint, and to prevent endpoints from being active before all of the Ports and ACLs have been configured for each configured iSCSI endpoint. This logic is not built into ConfigFS of course, but it does give the same effect. > there is no way to read several attributes > atomically so you can be sure their values correlate. In this case, even though adding multiple values per attribute is discouraged per the upstream sysfs layout, using a single configfs attribute to read multiple values of other individual attributes that need to be read atomically is the primary option today wrt existing code. Not ideal with configfs, but it is easy to do. > Another example > of a problem is when an object disappears while reading its attributes. > Sure, openat() can mitigate this, but it's better to avoid introducing a > problem than to fix it afterwards. > <not sure on this one..> > - ambiguity > > What format is the attribute? does it accept lowercase or uppercase hex > digits? is there a newline at the end? how many digits can it take > before the attribute overflows? All of this has to be documented and > checked by the OS, otherwise we risk regressions later. In contrast, > __u64 says everything in a binary interface. > Yes, you need to make strict_str*() calls on the configfs attribute store() functions with casts to locally defined variable types.
Using strtoul() and strtoull() has been working fine for me in the context of the generic target engine, but point taken about the usefulness in having access to the format metadata of a given attribute. > - lifetime and access control > > If a process brings an object into being (using mkdir) and then dies, > the object remains behind. I think this depends on how the struct configfs_group_operations->make_group() and ->drop_item() are being used. For example, I typically allocate a TCM related data structure during the make_group() call containing a struct config_group member that is registered with config_group_init_type_name() upon a successful mkdir(2) call. When drop_item() is called via rmdir(2) with a reference to that struct config_group, the original data structure containing the struct config_group is released with config_item_put(), and the TCM allocated data structure is freed. While in use, the registered struct config_group can be pinned with configfs_depend_item(), which has some interesting limitations of its own. > The syscall/ioctl approach ties the object > into an fd, which will be destroyed when the process dies, and which can > be passed around using SCM_RIGHTS, allowing a server process to create > and configure an object before passing it to an unprivileged program > <nod> I have not personally had this requirement so I can't add much here.. > - notifications > > It's hard to notify users about changes in attributes. Sure, you can > use inotify, but that limits you to watching subtrees. Once you do get > the notification, you run into the atomicity problem. When do you know > all attributes are valid? This can be solved using sequence counters, > but that's just gratuitous complexity. Netlink type interfaces are much > more robust and flexible. > nor the notify case either.. > - readdir > > You can either list everything, or nothing. Sure, you can have trees to > ease searching, even multiple views of the same data, but it's painful.
> > You may argue, correctly, that syscalls and ioctls are not as flexible. > But this is because no one has invested the effort in making them so. I think that new syscalls are great when you can get them merged (as KVM is quite important, that means not a problem), and I am sure you guys can make an ioctl contort into all manner of positions. Perhaps it is just that I think that the code to manage complex ioctl interaction can get quite ugly from my experience, and doing backwards compat with interpreted code makes life easier, at least for me. > A > struct passed as an argument to a syscall is not extensible. But if you > pass the size of the structure, and also a bitmap of which attributes > are present, you gain extensibility and retain the atomicity property of > a syscall interface. I don't think a lot of effort is needed to make an > extensible syscall interface just as usable and a lot more efficient > than configfs/sysfs. Good point, however in terms of typical management scenarios in my experience with TCM/LIO 3.x, I have not found the lost efficiency of using configfs, compared to the legacy ioctl, to be an issue for controlling the fabric in typical usage cases. That said, I am sure there must be particular cases in the virtualization world where having those syscalls is critical, for which a configfs enabled generic target does not make sense. > It should also be simple to bolt a fuse interface > on top to expose it to us commandline types. > That would be interesting.
> > As you may recall, I have been using configfs extensively for the 3.x > > generic target core infrastructure and iSCSI fabric modules living in > > lio-core-2.6.git/drivers/target/target_core_configfs.c and > > lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found > > it to be extraordinarily useful for the purposes of implementing a > > complex kernel level target mode stack that is expected to manage > > massive amounts of metadata, allow for real-time configuration, share > > data structures (eg: SCSI Target Ports) between other kernel fabric > > modules and manage the entire set of fabrics using only interpreted > > userspace code. > > > > Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs <-> iSCSI Target > > Endpoints inside of a KVM Guest (from the results in May posted with > > IOMMU aware 10 Gb on modern Nehalem hardware, see > > http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to > > dump the entire running target fabric configfs hierarchy to a single > > struct file on a KVM Guest root device using python code on the order of > > ~30 seconds for those 10000 active iSCSI endpoints. In configfs terms, > > this means: > > > > *) 7 configfs groups (directories), ~50 configfs attributes (files) per > > Virtual HBA+FILEIO LUN > > *) 15 configfs groups (directories), ~60 configfs attributes (files) per > > iSCSI fabric Endpoint > > > > Which comes out to a total of ~220000 group and ~1100000 attribute active > > configfs objects living in the configfs_dir_cache that are being > > dumped inside of a single KVM guest instance, including symlinks > > between the fabric modules to establish the SCSI ports containing the > > complete set of SPC-4 and RFC-3720 features, et al. > > > > You achieved 3 million syscalls/sec from Python code? That's very > impressive. Well, that is dumping the running configfs for everything.
In more typical usage cases of the TCM/LIO configfs fabric, specific Virtual HBAs+LUNs and iSCSI Fabric endpoints would be changing individually, as each Virtual HBA and iSCSI endpoint are completely independent of each other and are intended to be administered that way. You can even run multiple for loops from different shell processes to create the endpoints in parallel using UUID and iSCSI WWN naming for doing multithreaded configfs fabric bringup. > > Note with syscalls you could have done it with 10K syscalls (Python > supports packing and unpacking structs quite well, and also directly > calling C code IIRC). > > > Also on the kernel <-> user API interaction compatibility side, I have > > found the 3.x configfs enabled code advantageous over the LIO 2.9 code > > (that used an ioctl for everything) because it allows us to do backwards > > compat for future versions without using any userspace C code, which > > IMHO makes maintaining userspace packages for complex kernel stacks with > > massive amounts of metadata + real-time configuration considerations that much easier. > > No longer having ioctl compatibility issues between LIO versions as the > > structures passed via ioctl change, and being able to do backwards > > compat with small amounts of interpreted code against configfs layout > > changes, has really made maintaining the kernel <-> user API that much easier for me. > > > > configfs is more maintainable than a bunch of hand-maintained ioctls. <nod> > But if we put some effort into an extendable syscall infrastructure > (perhaps to the point of using an IDL) I'm sure we can improve on that > without the problems pseudo filesystems introduce.
> Understood, while I think configfs is grand for a number of purposes, I am certainly not foolish enough to think it is perfect for everything. > > Anyways, I thought these might be useful to the discussion as it relates > > to potential uses of configfs on the KVM Host or other projects where it > > really makes sense, and/or to improve the upstream implementation so that > > other users (like myself) can benefit from improvements to configfs. > > > > I can't really fault a project for using configfs; it's an accepted and > recommended (by the community) interface. I'd much prefer it though if > there was an effort to create a usable fd/struct based alternative. > Thanks for your great comments, Avi! --nab
Avi Kivity wrote: > On 08/19/2009 09:23 PM, Nicholas A. Bellinger wrote: >> Anyways, I was wondering if you might be interested in sharing your >> concerns wrt configfs (configfs maintainer CC'ed), at some point..? >> > > My concerns aren't specifically with configfs, but with all the text > based pseudo filesystems that the kernel exposes. > > My high level concern is that we're optimizing for the active sysadmin, > not for libraries and management programs. configfs and sysfs are easy > to use from the shell, discoverable, and easily scripted. But they > discourage documentation, the text format is ambiguous, and they require > a lot of boilerplate to use in code. > > You could argue that you can wrap *fs in a library that hides the > details of accessing it, but that's the wrong approach IMO. We should > make the information easy to use and manipulate for programs; one of > these programs can be a fuse filesystem for the active sysadmin if > someone thinks it's important. > > Now for the low level concerns: > > - efficiency > > Each attribute access requires an open/read/close triplet and > binary->ascii->binary conversions. In contrast an ordinary > syscall/ioctl interface can fetch all attributes of an object, or even > all attributes of all objects, in one call. I can only speak for vbus, but *fs access efficiency is not a problem. It's all slow-path anyway. > > - atomicity > > One attribute per file means that, lacking userspace-visible > transactions, there is no way to change several attributes at once. Actually, I do think configfs has some rudimentary, but incomplete IIUC, support for transactional commits of updates. In lieu of formal support, this is also not generally a problem: you can just add your own transaction in the form of an explicit attribute. For instance, see the "enabled" attribute in venet-tap. This lets you set all the parameters and then hit "enabled" to make it act on the other settings atomically.
For sysfs kernel updates, I think you can update the values under a lock. For sysfs userspace updates, I suppose you could do a similar "explicit commit" attribute if it were needed.

> When you read attributes, there is no way to read several attributes atomically so you can be sure their values correlate.

This isn't a valid concern for configfs, unless you have multiple userspace applications updating concurrently. IIUC, configfs is only changed by userspace, not the kernel. So I suppose if you were concerned about supporting this, you could use an advisory flock or something.

For sysfs, this is a valid concern. Generally, though, I do not think *fs interfaces are a good match if you need that type of behavior (atomic reads of rapidly changing attributes). FWIW, vbus does not need this (the parameters do not generally change once established).

> Another example of a problem is when an object disappears while reading its attributes. Sure, openat() can mitigate this, but it's better to avoid introducing a problem than having a fix.

Again, that can only happen if another userspace app did that to you. Possible solutions might be advisory locking.

> - ambiguity
>
> What format is the attribute? does it accept lowercase or uppercase hex digits? is there a newline at the end? how many digits can it take before the attribute overflows? All of this has to be documented and checked by the OS, otherwise we risk regressions later. In contrast, __u64 says everything in a binary interface.

I don't think this is a legit concern. I would think you have to understand the ABI to use the interface regardless, no matter the transport. And either way, the kernel has to validate the input.

> - lifetime and access control
>
> If a process brings an object into being (using mkdir) and then dies, the object remains behind.

This is one of the big problems with configfs, I agree.
I guess you could argue that the ioctl approach has the opposite problem (the resource goes if the owner goes), which is to say it requires the app to hang around. Syscall is kind of in the middle, since it doesn't expressly have a policy for a given resource if a task dies. You can certainly modify kernel/exit.c to add such a policy, I suppose. But ioctl has a distinct advantage in this regard. All in all, I think ioctl wins here.

> The syscall/ioctl approach ties the object into an fd, which will be destroyed when the process dies, and which can be passed around using SCM_RIGHTS, allowing a server process to create and configure an object before passing it to an unprivileged program.
>
> - notifications
>
> It's hard to notify users about changes in attributes. Sure, you can use inotify, but that limits you to watching subtrees.

What's worse, inotify doesn't seem to work very well against *fs from what I hear.

> Once you do get the notification, you run into the atomicity problem. When do you know all attributes are valid? This can be solved using sequence counters, but that's just gratuitous complexity. Netlink type interfaces are much more robust and flexible.
>
> - readdir
>
> You can either list everything, or nothing. Sure, you can have trees to ease searching, even multiple views of the same data, but it's painful.

I do not see the problem here. *fs structures dirs as objects, and files as attributes. A logical presentation of the data from that perspective ensues. Why is "readdir" a problem? It gets all the attributes of an "object" (sans potential consistency problems, as you point out above).

> You may argue, correctly, that syscalls and ioctls are not as flexible. But this is because no one has invested the effort in making them so. A struct passed as an argument to a syscall is not extensible.
> But if you pass the size of the structure, and also a bitmap of which attributes are present, you gain extensibility and retain the atomicity property of a syscall interface. I don't think a lot of effort is needed to make an extensible syscall interface just as usable and a lot more efficient than configfs/sysfs. It should also be simple to bolt a fuse interface on top to expose it to us commandline types.

I think the strongest argument for having *fs-like models is that it's a way to keep the "management tool" coupled with the kernel that understands it. This is quite nice in practice.

It's true that the interface exposed by *fs could be construed as an "ABI", but that is generally more of an issue for userspace tools that would turn around and read it, as opposed to a human sitting at the shell. So therefore, both *fs and syscall/ioctl approaches suffer from ABI mis-sync issues w.r.t. tools. But the *fs wins here because generally a human can adapt dynamically to the change (e.g. by running 'tree' and looking for something recognizable), whereas syscall/ioctl have no choice...they are hosed.

It's true you could make an extensible syscall/ioctl interface, but do note you can use similar techniques (e.g. only add new attributes to existing objects) on the *fs front as well.

So to me it comes down to more or less the lifetime question (ioctl wins) vs. the auto-synchronized tool benefit (*fs wins). I am honestly not sure which is better.
>> As you may recall, I have been using configfs extensively for the 3.x generic target core infrastructure and iSCSI fabric modules living in lio-core-2.6.git/drivers/target/target_core_configfs.c and lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found it to be extraordinarily useful for the purposes of implementing a complex kernel level target mode stack that is expected to manage massive amounts of metadata, allow for real-time configuration, share data structures (eg: SCSI Target Ports) between other kernel fabric modules and manage the entire set of fabrics using only interpreted userspace code.
>>
>> Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs <-> iSCSI Target Endpoints inside of a KVM Guest (from the results in May posted with IOMMU aware 10 Gb on modern Nehalem hardware, see http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to dump the entire running target fabric configfs hierarchy to a single struct file on a KVM Guest root device using python code on the order of ~30 seconds for those 10000 active iSCSI endpoints. In configfs terms, this means:
>>
>> *) 7 configfs groups (directories), ~50 configfs attributes (files) per Virtual HBA+FILEIO LUN
>> *) 15 configfs groups (directories), ~60 configfs attributes (files) per iSCSI fabric Endpoint
>>
>> Which comes out to a total of ~220000 groups and ~1100000 attributes as active configfs objects living in the configfs_dir_cache that are being dumped inside of the single KVM guest instance, including symlinks between the fabric modules to establish the SCSI ports containing the complete set of SPC-4 and RFC-3720 features, et al.
>
> You achieved 3 million syscalls/sec from Python code? That's very impressive.
>
> Note with syscalls you could have done it with 10K syscalls (Python supports packing and unpacking structs quite well, and also directly calling C code IIRC).
>> Also on the kernel <-> user API interaction compatibility side, I have found the 3.x configfs enabled code advantageous over the LIO 2.9 code (that used an ioctl for everything) because it allows us to do backwards compat for future versions without using any userspace C code, which IMHO makes maintaining userspace packages for complex kernel stacks with massive amounts of metadata + real-time configuration considerations much easier. No longer having ioctl compatibility issues between LIO versions as the structures passed via ioctl change, and being able to do backwards compat with small amounts of interpreted code against configfs layout changes, has really made maintaining the kernel <-> user API that much easier for me.
>
> configfs is more maintainable than a bunch of hand-maintained ioctls. But if we put some effort into an extendable syscall infrastructure (perhaps to the point of using an IDL) I'm sure we can improve on that without the problems pseudo filesystems introduce.
>
>> Anyways, I thought these might be useful to the discussion as it relates to potential uses of configfs on the KVM Host or other projects that really make sense, and/or to improve the upstream implementation so that other users (like myself) can benefit from improvements to configfs.
>
> I can't really fault a project for using configfs; it's an accepted and recommended (by the community) interface. I'd much prefer it though if there was an effort to create a usable fd/struct based alternative.

Yeah, doing it manually with all the CAP bits gets old, fast, so I agree that improvement here is welcome.

Kind Regards,

-Greg
On Wed, Aug 19, 2009 at 11:12:43PM +0300, Avi Kivity wrote:
> On 08/19/2009 09:23 PM, Nicholas A. Bellinger wrote:
> > Anyways, I was wondering if you might be interested in sharing your concerns wrt configfs (configfs maintainer CC'ed), at some point..?
>
> My concerns aren't specifically with configfs, but with all the text based pseudo filesystems that the kernel exposes.

Phew! It's not just me :-)

> My high level concern is that we're optimizing for the active sysadmin, not for libraries and management programs. configfs and sysfs are easy to use from the shell, discoverable, and easily scripted. But they discourage documentation, the text format is ambiguous, and they require a lot of boilerplate to use in code.

I don't think they "discourage documentation" any more than any ioctl we've ever had. At least you can look at the names and values and take a good stab at it (configfs is better than sysfs at this, by virtue of what it does, but discoverability is certainly not as good as real documentation).

With an ioctl() that isn't (well) documented, you have to go read the structure and probably even read the code that uses the structure to be sure what you are doing.

> You could argue that you can wrap *fs in a library that hides the details of accessing it, but that's the wrong approach IMO. We should make the information easy to use and manipulate for programs; one of these programs can be a fuse filesystem for the active sysadmin if someone thinks it's important.

You are absolutely correct that they are a boon to the sysadmin, where in theory programs can do better with binary interfaces. Except what programs? I can't do an ioctl or a syscall from a shell script (no, using bash's network capabilities to talk to netlink does not count). Same with perl/python/whatever, where you have to write boilerplate to create binary structures.

These interfaces have two opposing forces acting on them.
They provide a reasonably nice way to cross the user<->kernel boundary, so people want to use them. Programmatic things, like a power management daemon for example, don't want sysadmins touching anything. It's just an interface for the daemon. Conversely, some things are really knobs for the sysadmin. There's nothing else to it. Why should they have to code up a C program just to turn a knob? Configfs, as its name implies, really does exist for that second case. It turns out that it's quite nice to use for the first case too, but if folks wanted to go the syscall route, no worries.

I've said it many times. We will never come up with one over-arching solution to all the disparate use cases. Instead, we should use each facility - syscalls, ioctls, sysfs, configfs, etc - as appropriate. Even in the same program or subsystem.

> - atomicity
>
> One attribute per file means that, lacking userspace-visible transactions, there is no way to change several attributes at once. When you read attributes, there is no way to read several attributes atomically so you can be sure their values correlate. Another example of a problem is when an object disappears while reading its attributes. Sure, openat() can mitigate this, but it's better to avoid introducing a problem than having a fix.

configfs has some atomicity capabilities, but not full atomicity. It's not the right tool for that sort of thing.

> - ambiguity
>
> What format is the attribute? does it accept lowercase or uppercase hex digits? is there a newline at the end? how many digits can it take before the attribute overflows? All of this has to be documented and checked by the OS, otherwise we risk regressions later. In contrast, __u64 says everything in a binary interface.

Um, is that __u64 a pointer to a userspace object? A key to a lookup table? A file descriptor that is padded out? It's no less ambiguous.
> - lifetime and access control > > If a process brings an object into being (using mkdir) and then > dies, the object remains behind. The syscall/ioctl approach ties > the object into an fd, which will be destroyed when the process > dies, and which can be passed around using SCM_RIGHTS, allowing a > server process to create and configure an object before passing it > to an unprivileged program Most things here do *not* want to be tied to the lifetime of one process. We don't want our cpu_freq governor changing just because the power manager died. > You may argue, correctly, that syscalls and ioctls are not as > flexible. But this is because no one has invested the effort in > making them so. A struct passed as an argument to a syscall is not > extensible. But if you pass the size of the structure, and also a > bitmap of which attributes are present, you gain extensibility and > retain the atomicity property of a syscall interface. I don't think > a lot of effort is needed to make an extensible syscall interface > just as usable and a lot more efficient than configfs/sysfs. It > should also be simple to bolt a fuse interface on top to expose it > to us commandline types. Your extensible syscall still needs to be known. The flexibility provided by configfs and sysfs is of generic access to non-generic things. It's different. The follow-ups regarding the perf_counter call are a good example. If you know the perf_counter call, you can code up a C program that asks what attributes or things are there. But if you don't, you've first got to find out that there's a perf_counter call, then learn how to use it. With configfs/sysfs, you notice that there's now a perf_counter directory under a tree, and you can figure out what attributes and items are there. But this is not the be-all-end-all. Our syscalls should be more flexible in the perf_counter way. Not everything really needs to be listable by some yokel sysadmin. 
> configfs is more maintainable than a bunch of hand-maintained ioctls. But if we put some effort into an extendable syscall infrastructure (perhaps to the point of using an IDL) I'm sure we can improve on that without the problems pseudo filesystems introduce.

Oh, boy, IDL :-) Seriously, if you can solve the "how do I just poke around without actually writing C code or installing a domain-specific binary" problem, you will probably get somewhere.

> I can't really fault a project for using configfs; it's an accepted and recommended (by the community) interface. I'd much prefer it though if there was an effort to create a usable fd/struct based alternative.

Oh, and configfs was explicitly designed to be interface agnostic to the client. The filesystem portions, to the best of my ability, are not exposed to client drivers. So you can replace the configfs filesystem interface with a system call set that does the same operations, and no configfs user will actually need to change their code (if you want to change from text values to non-text, that would require changing the show/store operation prototypes, but that's about it).

Joel
On Wed, 2009-08-19 at 15:16 -0700, Joel Becker wrote:
> On Wed, Aug 19, 2009 at 11:12:43PM +0300, Avi Kivity wrote:
> > On 08/19/2009 09:23 PM, Nicholas A. Bellinger wrote:
> > > Anyways, I was wondering if you might be interested in sharing your concerns wrt configfs (configfs maintainer CC'ed), at some point..?
> >
> > My concerns aren't specifically with configfs, but with all the text based pseudo filesystems that the kernel exposes.
>
> Phew! It's not just me :-)

The points on *fs vs. ioctl are interesting. I think both have their benefits and their downfalls, for example efficiency vs. ease of (human) use. I suppose it comes down to whether you're in the fast path or not, for the most part. However, just because an interface is not as efficient as it could be does not necessarily mean that it's not a good one.

As an example, many moons ago I worked on implementing some serial comms between an embedded speed controller and its command console. Being young and efficiency starved ;) I disregarded our other controllers, which implemented these serial comms with ASCII strings, and used binary blobs instead. I indeed got some respectable performance out of doing this, even to the effect of creating a "real time" status monitor that updated multiple times a second via the hand-held terminal.

However, I totally missed the point of intentionally doing things "inefficiently." For example, our serial debugging setup consisted of two VT100 terminals wired up with a custom serial cable that went between two communicating units. Each term would show what each end was saying. Kinda crude, but effective. Of course with my new and improved controller that "spoke the binary language of moisture evaporators," well, all one saw was garbage.
:) Additionally, someone debugging the controllers could just use a term to talk to one directly if it used simple ASCII commands, taking terminals, hosts and other software out of the picture; but for my controller, well, you could only use the custom programmed hand-held term.

I ended up supporting both ASCII and binary communications on the controller for these (and other) reasons. However, in the end, I ditched the binary comms since they really didn't add the efficiency in the fast path, where it should be added. (Well, I also ran out of eprom space... :)

In any case, having a humanly understandable communications protocol (or ABI) can be extremely useful, and just because it's not efficient doesn't automatically mean that it's a bad thing, especially if it's in the slow path. It does have its downsides, as mentioned in this thread, so we really need both types. Because of that, the fuse layer on top of a binary ABI is an interesting idea.

Alex
On Wed, 2009-08-19 at 15:16 -0700, Joel Becker wrote:
> On Wed, Aug 19, 2009 at 11:12:43PM +0300, Avi Kivity wrote:
> > On 08/19/2009 09:23 PM, Nicholas A. Bellinger wrote:
> > > Anyways, I was wondering if you might be interested in sharing your concerns wrt configfs (configfs maintainer CC'ed), at some point..?
> >
> > My concerns aren't specifically with configfs, but with all the text based pseudo filesystems that the kernel exposes.
>
> Phew! It's not just me :-)
>
> > My high level concern is that we're optimizing for the active sysadmin, not for libraries and management programs. configfs and sysfs are easy to use from the shell, discoverable, and easily scripted. But they discourage documentation, the text format is ambiguous, and they require a lot of boilerplate to use in code.
>
> I don't think they "discourage documentation" any more than any ioctl we've ever had. At least you can look at the names and values and take a good stab at it (configfs is better than sysfs at this, by virtue of what it does, but discoverability is certainly not as good as real documentation).
>
> With an ioctl() that isn't (well) documented, you have to go read the structure and probably even read the code that uses the structure to be sure what you are doing.

Good point..

> > You could argue that you can wrap *fs in a library that hides the details of accessing it, but that's the wrong approach IMO. We should make the information easy to use and manipulate for programs; one of these programs can be a fuse filesystem for the active sysadmin if someone thinks it's important.
>
> You are absolutely correct that they are a boon to the sysadmin, where in theory programs can do better with binary interfaces. Except what programs? I can't do an ioctl or a syscall from a shell script (no, using bash's network capabilities to talk to netlink does not count).
> Same with perl/python/whatever where you have to write boilerplate to create binary structures.

<nod>, then I suppose it begins to get down to how easily that boilerplate can be used to add new groups and attributes for developers. In my experience, using the CONFIGFS_EATTR() macros with multiple struct config_groups hanging off the same make_group()-allocated internal TCM structure has been very easy, once I figured out why I really needed the extended macro set (again, to hang multiple differently named struct config_groups off a single internally allocated structure).

Joel, I know that you have been keeping the configfs macros in sync with the parameters used for the original matching sysfs macros (and that I have been using my own configfs macro that can be used together with existing code), but I really do think the extended macro set has benefit for users of configfs who put a little bit of effort into understanding how they work.

> These interfaces have two opposing forces acting on them. They provide a reasonably nice way to cross the user<->kernel boundary, so people want to use them. Programmatic things, like a power management daemon for example, don't want sysadmins touching anything. It's just an interface for the daemon. Conversely, some things are really knobs for the sysadmin. There's nothing else to it. Why should they have to code up a C program just to turn a knob? Configfs, as its name implies, really does exist for that second case.

I think this is a very good point that really shows the benefits of a configfs based design for real-world admin usability and configurability (CLI building blocks for higher level UIs).
Admins gain the ability to modify non-compiled code to suit their needs on top of a user-defined configfs directory structure of groups/directories (assuming config groups have some sort of project-defined naming requirements in each defined struct configfs_group_operations->make_group()), with synchronization done on an individual configfs group context for creation/deletion and, optionally, for the I/O access of attributes within said group.

> It turns out that it's quite nice to use for the first case too, but if folks wanted to go the syscall route, no worries.
>
> I've said it many times. We will never come up with one over-arching solution to all the disparate use cases. Instead, we should use each facility - syscalls, ioctls, sysfs, configfs, etc - as appropriate. Even in the same program or subsystem.
>
> > - atomicity
> >
> > One attribute per file means that, lacking userspace-visible transactions, there is no way to change several attributes at once. When you read attributes, there is no way to read several attributes atomically so you can be sure their values correlate. Another example of a problem is when an object disappears while reading its attributes. Sure, openat() can mitigate this, but it's better to avoid introducing a problem than having a fix.
>
> configfs has some atomicity capabilities, but not full atomicity. It's not the right tool for that sort of thing.
>
> > - ambiguity
> >
> > What format is the attribute? does it accept lowercase or uppercase hex digits? is there a newline at the end? how many digits can it take before the attribute overflows? All of this has to be documented and checked by the OS, otherwise we risk regressions later. In contrast, __u64 says everything in a binary interface.
>
> Um, is that __u64 a pointer to a userspace object? A key to a lookup table? A file descriptor that is padded out? It's no less ambiguous.
> > > - lifetime and access control > > > > If a process brings an object into being (using mkdir) and then > > dies, the object remains behind. The syscall/ioctl approach ties > > the object into an fd, which will be destroyed when the process > > dies, and which can be passed around using SCM_RIGHTS, allowing a > > server process to create and configure an object before passing it > > to an unprivileged program > > Most things here do *not* want to be tied to the lifetime of one > process. We don't want our cpu_freq governor changing just because the > power manager died. > > > > You may argue, correctly, that syscalls and ioctls are not as > > flexible. But this is because no one has invested the effort in > > making them so. A struct passed as an argument to a syscall is not > > extensible. But if you pass the size of the structure, and also a > > bitmap of which attributes are present, you gain extensibility and > > retain the atomicity property of a syscall interface. I don't think > > a lot of effort is needed to make an extensible syscall interface > > just as usable and a lot more efficient than configfs/sysfs. It > > should also be simple to bolt a fuse interface on top to expose it > > to us commandline types. > > Your extensible syscall still needs to be known. The > flexibility provided by configfs and sysfs is of generic access to > non-generic things. It's different. > The follow-ups regarding the perf_counter call are a good > example. If you know the perf_counter call, you can code up a C program > that asks what attributes or things are there. But if you don't, you've > first got to find out that there's a perf_counter call, then learn how > to use it. With configfs/sysfs, you notice that there's now a > perf_counter directory under a tree, and you can figure out what > attributes and items are there. > But this is not the be-all-end-all. Our syscalls should be more > flexible in the perf_counter way. 
> Not everything really needs to be listable by some yokel sysadmin.
>
> > configfs is more maintainable than a bunch of hand-maintained ioctls. But if we put some effort into an extendable syscall infrastructure (perhaps to the point of using an IDL) I'm sure we can improve on that without the problems pseudo filesystems introduce.
>
> Oh, boy, IDL :-) Seriously, if you can solve the "how do I just poke around without actually writing C code or installing a domain-specific binary" problem, you will probably get somewhere.

Also, having the configfs directory hierarchy based on names provided by the user, which can be accessed by higher level code or directly from the shell with 'tree' and friends, is pretty nice too if you are the admin running the box. ;-)

> > I can't really fault a project for using configfs; it's an accepted and recommended (by the community) interface. I'd much prefer it though if there was an effort to create a usable fd/struct based alternative.
>
> Oh, and configfs was explicitly designed to be interface agnostic to the client. The filesystem portions, to the best of my ability, are not exposed to client drivers. So you can replace the configfs filesystem interface with a system call set that does the same operations, and no configfs user will actually need to change their code (if you want to change from text values to non-text, that would require changing the show/store operation prototypes, but that's about it).

Wow, really..? I was wondering if something like this was possible in terms of different client interfaces for configfs ops, and where it would (ever..?) make sense..

--nab

> Joel
On 08/20/2009 01:16 AM, Joel Becker wrote: >> My high level concern is that we're optimizing for the active >> sysadmin, not for libraries and management programs. configfs and >> sysfs are easy to use from the shell, discoverable, and easily >> scripted. But they discourage documentation, the text format is >> ambiguous, and they require a lot of boilerplate to use in code. >> > I don't think they "discourage documentation" anymore than any > ioctl we've ever had. At least you can look at the names and values and > take a good stab at it (configfs is better than sysfs at this, by virtue > of what it does, but discoverability is certainly not as good as real > documentation). > With an ioctl() that isn't (well) documented, you have to go > read the structure and probably even read the code that uses the > structure to be sure what you are doing. > An ioctl structure and a configfs/sysfs readdir provide similar information (the structure also provides the types of fields and isn't able to hide some of these fields). "Looking at the values" is what I meant by discouraging documentation. That implies looking at a self-documenting live system. But that tells you nothing about which fields were added in which versions, or fields which are hidden because your hardware doesn't support them or because you didn't echo 1 > somewhere. >> You could argue that you can wrap *fs in a library that hides the >> details of accessing it, but that's the wrong approach IMO. We >> should make the information easy to use and manipulate for programs; >> one of these programs can be a fuse filesystem for the active >> sysadmin if someone thinks it's important. >> > You are absolutely correct that they are a boon to the sysadmin, > where in theory programs can do better with binary interfaces. Except > what programs? I can't do an ioctl or a syscall from a shell script > (no, using bash's network capabilities to talk to netlink does not > count). 
Same with perl/python/whatever where you have to write > boilerplate to create binary structures. > The maintainer of the subsystem should provide a library that talks to the binary interface and a CLI program that talks to the library. Boring nonkernely work. Alternatively a fuse filesystem to talk to the library, or an IDL can replace the library. > These interfaces have two opposing forces acting on them. They > provide a reasonably nice way to cross the user<->kernel boundary, so > people want to use them. Programmatic things, like a power management > daemon for example, don't want sysadmins touching anything. It's just > an interface for the daemon. Many things start oriented at people and then, if they're useful, cross the lines to machines. You can convert a machine interface to a human interface at the cost of some work, but it's difficult to undo the deficiencies of a human oriented interface so it can be used by a program. > Conversely, some things are really knobs > for the sysadmin. I disagree. If it's useful for a human, it's useful for a machine. Moreover, *fs+bash is a user interface. It happens that bash is good at processing files, and filesystems are easily discoverable, so we code to that. But we make it more difficult to provide other interfaces to the same controls. > There's nothing else to it. Why should they have to > code up a C program just to turn a knob? Many kernel developers believe that userspace is burned into ROM and the only thing they can change is the kernel. That turns out to be incorrect. If you don't want users to write C programs to access your interface, write your own library+CLI. That will have the added benefit of providing meaningful errors as well ("Invalid argument" vs "frob must be between 52 and 91"). The program can have a configuration file so you don't need to reecho the values on boot. It can have a --daemon mode and do something when an event occurs. 
> Configfs, as its name implies,
> really does exist for that second case. It turns out that it's quite
> nice to use for the first case too, but if folks wanted to go the
> syscall route, no worries.

Eventually everything is used in the first case. For example in the virtualization space it is common to have a zillion nodes running virtual machines that are only accessed by a management node.

> I've said it many times. We will never come up with one
> over-arching solution to all the disparate use cases. Instead, we
> should use each facility - syscalls, ioctls, sysfs, configfs, etc - as
> appropriate. Even in the same program or subsystem.

configfs is optional, but sysfs is not. Everything exposed via sysfs needs to continue to be exposed via sysfs, and new things as well for consistency. So now if someone wants a syscall interface they must duplicate the sysfs interface, not replace it.

>> - ambiguity
>>
>> What format is the attribute? does it accept lowercase or uppercase
>> hex digits? is there a newline at the end? how many digits can it
>> take before the attribute overflows? All of this has to be
>> documented and checked by the OS, otherwise we risk regressions
>> later. In contrast, __u64 says everything in a binary interface.
>>
> Um, is that __u64 a pointer to a userspace object? A key to a
> lookup table? A file descriptor that is padded out? It's no less
> ambiguous.

__u64 says everything about the type and space requirements of a field. It doesn't describe everything (like the name of the field or what it means) but it does provide a bunch of boring information that people rarely document in other ways.

If my program reads a *fs field into a u32 and it later turns out the field was a u64, I'll get an overflow. It's a lot harder to get that wrong with a typed interface.

>> - lifetime and access control
>>
>> If a process brings an object into being (using mkdir) and then
>> dies, the object remains behind.
>> The syscall/ioctl approach ties
>> the object into an fd, which will be destroyed when the process
>> dies, and which can be passed around using SCM_RIGHTS, allowing a
>> server process to create and configure an object before passing it
>> to an unprivileged program
>>
> Most things here do *not* want to be tied to the lifetime of one
> process. We don't want our cpu_freq governor changing just because the
> power manager died.

Using file descriptors doesn't force you to tie their lifetime to the fd; it only allows it.

>> You may argue, correctly, that syscalls and ioctls are not as
>> flexible. But this is because no one has invested the effort in
>> making them so. A struct passed as an argument to a syscall is not
>> extensible. But if you pass the size of the structure, and also a
>> bitmap of which attributes are present, you gain extensibility and
>> retain the atomicity property of a syscall interface. I don't think
>> a lot of effort is needed to make an extensible syscall interface
>> just as usable and a lot more efficient than configfs/sysfs. It
>> should also be simple to bolt a fuse interface on top to expose it
>> to us commandline types.
>>
> Your extensible syscall still needs to be known. The
> flexibility provided by configfs and sysfs is of generic access to
> non-generic things. It's different.
> The follow-ups regarding the perf_counter call are a good
> example. If you know the perf_counter call, you can code up a C program
> that asks what attributes or things are there. But if you don't, you've
> first got to find out that there's a perf_counter call, then learn how
> to use it. With configfs/sysfs, you notice that there's now a
> perf_counter directory under a tree, and you can figure out what
> attributes and items are there.

Right, that's the great allure of *fs, discoverability. Everything is at your fingertips. Except if you're writing a program to manage things.
The program can't explore *fs until it's run and usually does not want to present nongeneric things in a generic way. Ultimately most of our users are behind programs.

>> configfs is more maintainable than a bunch of hand-maintained
>> ioctls. But if we put some effort into an extendable syscall
>> infrastructure (perhaps to the point of using an IDL) I'm sure we
>> can improve on that without the problems pseudo filesystems
>> introduce.
>>
> Oh, boy, IDL :-) Seriously, if you can solve the "how do I just
> poke around without actually writing C code or installing a
> domain-specific binary" problem, you will probably get somewhere.

IDL is very unpleasant to work with but it gets the work done. I don't see an issue with domain specific binaries (except that you have to write them). Some say there's the problem of distribution, but if the kernel distributed itself to the user somehow then the tool can be distributed just as well (maybe via tools/).

>> I can't really fault a project for using configfs; it's an accepted
>> and recommended (by the community) interface. I'd much prefer it
>> though if there was an effort to create a usable fd/struct based
>> alternative.
>>
> Oh, and configfs was explicitly designed to be interface
> agnostic to the client. The filesystem portions, to the best of my
> ability, are not exposed to client drivers. So you can replace the
> configfs filesystem interface with a system call set that does the same
> operations, and no configfs user will actually need to change their
> code (if you want to change from text values to non-text, that would
> require changing the show/store operation prototypes, but that's about
> it).

But the user visible part is now ABI. I have no issues with the kernel internals.
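Avi's extensible-syscall suggestion earlier in this message (pass the structure size plus a bitmap of which attributes are present) can be made concrete in a few lines of C; the structure, field names, and defaults below are hypothetical, not any real syscall ABI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FROB_ATTR_RATE  (1ULL << 0)
#define FROB_ATTR_DEPTH (1ULL << 1)     /* attribute added in a later version */

struct frob_config {
        uint32_t size;          /* sizeof() as the caller was compiled with */
        uint64_t attr_mask;     /* which fields below the caller filled in */
        uint64_t rate;
        uint64_t depth;         /* v1 addition; v0 binaries end before here */
};

/* v0 of the structure ended just before 'depth' */
#define FROB_V0_SIZE offsetof(struct frob_config, depth)

/*
 * "Kernel" side: honor only attributes the caller both declared room
 * for (size) and marked present (attr_mask); default the rest.  A v0
 * binary keeps working against a v1 kernel, and vice versa, and the
 * whole configuration still lands in one atomic call.
 */
int frob_configure(const struct frob_config *cfg,
                   uint64_t *rate_out, uint64_t *depth_out)
{
        *rate_out  = 1000;      /* defaults */
        *depth_out = 16;

        if (cfg->size < FROB_V0_SIZE)
                return -1;      /* too short to be any known version */

        if (cfg->attr_mask & FROB_ATTR_RATE)
                *rate_out = cfg->rate;

        if ((cfg->attr_mask & FROB_ATTR_DEPTH) &&
            cfg->size >= sizeof(struct frob_config))
                *depth_out = cfg->depth;

        return 0;
}
```

An old binary compiled against the short v0 structure keeps working unmodified against a kernel that knows about depth, and vice versa, which is the extensibility-with-atomicity property claimed above.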
On Wed, Aug 19, 2009 at 10:05 PM, Hollis Blanchard<hollisb@us.ibm.com> wrote:
> On Wed, 2009-08-19 at 19:38 +0300, Avi Kivity wrote:
>> On 08/19/2009 07:29 PM, Ira W. Snyder wrote:
>> >>> That said, I'm not sure how qemu-system-ppc running on x86 could
>> >>> possibly communicate using virtio-net. This would mean the guest is an
>> >>> emulated big-endian PPC, while the host is a little-endian x86. I
>> >>> haven't actually tested this situation, so perhaps I am wrong.

Cross-platform virtio works when endianness is known in advance. For a hypervisor and a guest:
1. virtio-pci I/O registers use PCI endianness
2. vring uses guest endianness (hypervisor must byteswap)
3. guest memory buffers use guest endianness (hypervisor must byteswap)

I know of no existing way when endianness is not known in advance. Perhaps a transport bit could be added to mark the endianness of the guest/driver side. This can be negotiated because virtio-pci has a known endianness. After negotiation, the host knows whether or not byteswapping is necessary for structures in guest memory.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 08/20/2009 12:57 PM, Stefan Hajnoczi wrote:
> Cross-platform virtio works when endianness is known in advance. For
> a hypervisor and a guest:
> 1. virtio-pci I/O registers use PCI endianness
> 2. vring uses guest endianness (hypervisor must byteswap)
> 3. guest memory buffers use guest endianness (hypervisor must byteswap)
>
> I know of no existing way when endianness is not known in advance.
> Perhaps a transport bit could be added to mark the endianness of the
> guest/driver side. This can be negotiated because virtio-pci has a
> known endianness. After negotiation, the host knows whether or not
> byteswapping is necessary for structures in guest memory.

Some processors are capable of switching their endianness at runtime, so you cannot tell the guest endianness in advance.
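For the known-in-advance case Stefan lists, the host-side rule reduces to "swap only when guest and host endianness differ". A standalone sketch of that rule (names hypothetical; a real implementation would use the kernel's cpu_to_be16()-style helpers rather than open-coding the swap):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Guest endianness, negotiated once via the transport (virtio-pci
 * registers have a known endianness, so the bit can be exchanged there).
 */
enum guest_endian { GUEST_LITTLE, GUEST_BIG };

int host_is_big_endian(void)
{
        const uint16_t probe = 1;

        /* big-endian hosts store the most significant byte first */
        return *(const uint8_t *)&probe == 0;
}

uint16_t bswap16(uint16_t v)
{
        return (uint16_t)((v >> 8) | (v << 8));
}

/*
 * Read a 16-bit vring field (say, avail->idx) that the guest stored in
 * its own endianness.  The host swaps only when the two sides differ.
 */
uint16_t vring_read16(uint16_t raw, enum guest_endian guest)
{
        int guest_big = (guest == GUEST_BIG);

        return guest_big != host_is_big_endian() ? bswap16(raw) : raw;
}
```

The same predicate would be applied symmetrically on writes, and widened to 32- and 64-bit fields for the rest of the ring layout.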
On Wed, Aug 19, 2009 at 01:36:14AM -0400, Gregory Haskins wrote:
> >> So where is the problem here?
> >
> > If virtio net in guest could be improved instead, everyone would
> > benefit.
>
> So if I whip up a virtio-net backend for vbus with a PCI compliant
> connector, you are happy?

I'm currently worried about the venet versus virtio-net guest situation; if you drop venet and switch to virtio-net instead, that issue's resolved. I don't have an opinion on vbus versus pci, and I only speak for myself.
On Wed, Aug 19, 2009 at 11:37:16PM +0300, Avi Kivity wrote:
> On 08/19/2009 09:26 PM, Gregory Haskins wrote:
>>>> This is for things like the setup of queue-pairs, and the
>>>> transport of door-bells, and ib-verbs. I am not on the team
>>>> doing that work, so I am not an expert in this area. What I do
>>>> know is having a flexible and low-latency signal-path was deemed
>>>> a key requirement.
>>>>
>>> That's not a full bypass, then. AFAIK kernel bypass has userspace
>>> talking directly to the device.
>>>
>> Like I said, I am not an expert on the details here. I only work
>> on the vbus plumbing. FWIW, the work is derivative from the
>> "Xen-IB" project
>>
>> http://www.openib.org/archives/nov2006sc/xen-ib-presentation.pdf
>>
>> There were issues with getting Xen-IB to map well into the Xen
>> model. Vbus was specifically designed to address some of those
>> short-comings.
>
> Well I'm not an Infiniband expert. But from what I understand VMM
> bypass means avoiding the call to the VMM entirely by exposing
> hardware registers directly to the guest.

The original IB VMM bypass work predates SR-IOV (i.e., does not assume that the adapter has multiple hardware register windows for multiple devices). The way it worked was to split all device operations into `privileged' and `non-privileged'. Privileged operations such as mapping and pinning memory went through the hypervisor. Non-privileged operations such as reading or writing previously mapped memory went directly to the adapter. Nowadays with SR-IOV devices, VMM bypass usually means bypassing the hypervisor completely.

Cheers,
Muli
On Wed, Aug 19, 2009 at 1:37 PM, Avi Kivity<avi@redhat.com> wrote:
>
> Well I'm not an Infiniband expert. But from what I understand VMM bypass
> means avoiding the call to the VMM entirely by exposing hardware registers
> directly to the guest.
>

It enables clients to talk directly to the hardware. Whether or not that involves registers would be model specific. But frequently the queues being written were in the client's memory, and only a "doorbell ring" involved actual device resources. But whatever the mechanism, it enables the client to provide buffer addresses directly to the hardware in a manner that cannot damage another client.

The two key requirements are a) client cannot enable access to pages that it does not already have access to, and b) client can delegate that authority to the Adapter without needing to invoke OS or Hypervisor on a per message basis. Traditionally that meant that memory maps ("Regions") were created on the privileged path to enable fast/non-privileged references by the client.
On Thu, Aug 20, 2009 at 09:09:21AM +0300, Avi Kivity wrote:
> On 08/20/2009 01:16 AM, Joel Becker wrote:
> > With an ioctl() that isn't (well) documented, you have to go
> >read the structure and probably even read the code that uses the
> >structure to be sure what you are doing.
>
> An ioctl structure and a configfs/sysfs readdir provide similar
> information (the structure also provides the types of fields and
> isn't able to hide some of these fields).

With an ioctl structure, I can't take a look at what the values look like unless I read the code or write up a C program. With a configfs file, I can just cat the thing.

> "Looking at the values" is what I meant by discouraging
> documentation. That implies looking at a self-documenting live
> system. But that tells you nothing about which fields were added in
> which versions, or fields which are hidden because your hardware
> doesn't support them or because you didn't echo 1 > somewhere.

Most ioctls don't tell you that either. It certainly won't let you know that field foo_arg1 is ignored unless foo_arg2 is set to 2, or things like that. The problem of versioning requires discipline either way. It's not obvious from many ioctls. Conversely, you can create versioned configfs items via attributes or directories (same for sysfs, etc).

> The maintainer of the subsystem should provide a library that talks
> to the binary interface and a CLI program that talks to the library.
> Boring nonkernely work. Alternatively a fuse filesystem to talk to
> the library, or an IDL can replace the library.

Again, that helps the user nothing. I don't know it exists. I don't have it installed. Unless it ships with the kernel, I have no idea about it.

> Many things start oriented at people and then, if they're useful,
> cross the lines to machines.
> You can convert a machine interface to
> a human interface at the cost of some work, but it's difficult to
> undo the deficiencies of a human oriented interface so it can be
> used by a program.

It's work to convert either way. Outside of fast-path things, the time it takes to strtoll() is unimportant. Don't use configfs/sysfs for fast-path things.

> I disagree. If it's useful for a human, it's useful for a machine.

And if it's useful for a machine, a human might want to peek at it by hand someday to debug it.

> Moreover, *fs+bash is a user interface. It happens that bash is
> good at processing files, and filesystems are easily discoverable,
> so we code to that. But we make it more difficult to provide other
> interfaces to the same controls.

Not really. Writing a sane CLI to a binary interface takes about as much work as writing a sane API library to a text interface. The hard part is not the conversion, in either direction. The hard part is defining the interface.

> >Configfs, as its name implies,
> >really does exist for that second case. It turns out that it's quite
> >nice to use for the first case too, but if folks wanted to go the
> >syscall route, no worries.
>
> Eventually everything is used in the first case. For example in the
> virtualization space it is common to have a zillion nodes running
> virtual machines that are only accessed by a management node.

Everything is eventually used in the second case, an admin or a developer debugging why the daemon is going wrong. Much easier from a shell or other generic accessor. Much faster than having to download your library's source, learn how to build it, add some printfs, discover you have the wrong printfs...

> __u64 says everything about the type and space requirements of a
> field. It doesn't describe everything (like the name of the field
> or what it means) but it does provide a bunch of boring information
> that people rarely document in other ways.
>
> If my program reads a *fs field into a u32 and it later turns out
> the field was a u64, I'll get an overflow. It's a lot harder to get
> that wrong with a typed interface.

And if you send the wrong thing to configfs or sysfs you'll get an EINVAL or the like.

It doesn't look like configfs and sysfs will work for you. Don't use 'em! Write your interfaces with ioctls and syscalls. Write your libraries and CLIs. In the end, you're the one who has to maintain them. I don't ever want anyone thinking I want to force configfs on them. I wrote it because it solves its class of problem well, and many people find it fits them too. So I'll use configfs, you'll use ioctl, and our users will be happy either way because we make it work!

Joel
On 08/21/2009 01:48 AM, Joel Becker wrote:
> On Thu, Aug 20, 2009 at 09:09:21AM +0300, Avi Kivity wrote:
>> On 08/20/2009 01:16 AM, Joel Becker wrote:
>>> With an ioctl() that isn't (well) documented, you have to go
>>> read the structure and probably even read the code that uses the
>>> structure to be sure what you are doing.
>>>
>> An ioctl structure and a configfs/sysfs readdir provide similar
>> information (the structure also provides the types of fields and
>> isn't able to hide some of these fields).
>>
> With an ioctl structure, I can't take a look at what the values
> look like unless I read the code or write up a C program. With a
> configfs file, I can just cat the thing.

Unless it's system dependent like many sysfs files. If you're coding something that's supposed to run on several boxes, coding by example is not a good idea. Look up the documentation to find out what the values look like (unfortunately often there is no documentation). Looking at the value on your box does not indicate the range of values on other boxes or even if the value will be present on other boxes (due to having older kernels or different configurations).

>> "Looking at the values" is what I meant by discouraging
>> documentation. That implies looking at a self-documenting live
>> system. But that tells you nothing about which fields were added in
>> which versions, or fields which are hidden because your hardware
>> doesn't support them or because you didn't echo 1 > somewhere.
>>
> Most ioctls don't tell you that either. It certainly won't let
> you know that field foo_arg1 is ignored unless foo_arg2 is set to 2, or
> things like that.

Correct. What I mean is that discoverability is great for a sysadmin or kernel developers exploring the system, but pretty useless for a programmer writing code that will run on other systems. The majority of lkml users will find *fs easy to use and useful, but that's not the majority of our users.
> The problem of versioning requires discipline either way. It's
> not obvious from many ioctls. Conversely, you can create versioned
> configfs items via attributes or directories (same for sysfs, etc).

Sure.

>> The maintainer of the subsystem should provide a library that talks
>> to the binary interface and a CLI program that talks to the library.
>> Boring nonkernely work. Alternatively a fuse filesystem to talk to
>> the library, or an IDL can replace the library.
>>
> Again, that helps the user nothing. I don't know it exists. I
> don't have it installed. Unless it ships with the kernel, I have no
> idea about it.

That's true for the lkml reader downloading a kernel from kernel.org (use git already) and running it on a random system. But again the majority of users will run a distro which is supposed to integrate the kernel and userspace. The short term gratification of early adopters harms the integration that more mainstream users expect.

>> Many things start oriented at people and then, if they're useful,
>> cross the lines to machines. You can convert a machine interface to
>> a human interface at the cost of some work, but it's difficult to
>> undo the deficiencies of a human oriented interface so it can be
>> used by a program.
>>
> It's work to convert either way. Outside of fast-path things,
> the time it takes to strtoll() is unimportant. Don't use configfs/sysfs
> for fast-path things.

Infrastructure must be careful not to code itself into a corner. Already udev takes quite a bit of time to run and I have some memories of problems on thousand-disk configurations. What works reasonably well with one disk may not work as well with 1000. No doubt some of the problem is with udev, but I'm sure sysfs contributes. As a software development exercise reading a table of 1000 objects each with a couple dozen attributes should take less than a millisecond.

>> I disagree. If it's useful for a human, it's useful for a machine.
>>
> And if it's useful for a machine, a human might want to peek at
> it by hand someday to debug it.

We have strace and wireshark to decode binary syscall and wire streams.

>> Moreover, *fs+bash is a user interface. It happens that bash is
>> good at processing files, and filesystems are easily discoverable,
>> so we code to that. But we make it more difficult to provide other
>> interfaces to the same controls.
>>
> Not really. Writing a sane CLI to a binary interface takes
> about as much work as writing a sane API library to a text interface.
> The hard part is not the conversion, in either direction. The hard part
> is defining the interface.

A *fs interface limits what you can do, so it makes writing the API library harder. I'm talking about the issues with atomicity and notifications.

>>> Configfs, as its name implies,
>>> really does exist for that second case. It turns out that it's quite
>>> nice to use for the first case too, but if folks wanted to go the
>>> syscall route, no worries.
>>>
>> Eventually everything is used in the first case. For example in the
>> virtualization space it is common to have a zillion nodes running
>> virtual machines that are only accessed by a management node.
>>
> Everything is eventually used in the second case, an admin or a
> developer debugging why the daemon is going wrong. Much easier from a
> shell or other generic accessor. Much faster than having to download
> your library's source, learn how to build it, add some printfs, discover
> you have the wrong printfs...

As a kernel/user interface, any syscall replacement for *fs is exposed via strace. It's true that debugging C code is harder than a bit of bash.

>> __u64 says everything about the type and space requirements of a
>> field. It doesn't describe everything (like the name of the field
>> or what it means) but it does provide a bunch of boring information
>> that people rarely document in other ways.
>>
>> If my program reads a *fs field into a u32 and it later turns out
>> the field was a u64, I'll get an overflow. It's a lot harder to get
>> that wrong with a typed interface.
>>
> And if you send the wrong thing to configfs or sysfs you'll get
> an EINVAL or the like.
> It doesn't look like configfs and sysfs will work for you.
> Don't use 'em! Write your interfaces with ioctls and syscalls. Write
> your libraries and CLIs. In the end, you're the one who has to maintain
> them. I don't ever want anyone thinking I want to force configfs on
> them. I wrote it because it solves its class of problem well, and many
> people find it fits them too. So I'll use configfs, you'll use ioctl,
> and our users will be happy either way because we make it work!

No, I have to use *fs (at least sysfs) since that's the current blessed interface. Fragmenting the kernel/userspace interface is the wrong thing to do, I value a consistent interface more than fixing the *fs problems (which are all fixable or tolerable). This is not a call to deprecate *fs and switch over to yet another new thing. Users (and programmers) need some ABI stability. It just arose because I remarked that I'm not in love with *fs interfaces in an unrelated flamewar and someone asked me why.
diff --git a/MAINTAINERS b/MAINTAINERS
index d0ea25c..83624e7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5437,6 +5437,12 @@ S:	Maintained
 F:	Documentation/fb/uvesafb.txt
 F:	drivers/video/uvesafb.*
 
+VBUS
+M:	Gregory Haskins <ghaskins@novell.com>
+S:	Maintained
+F:	include/linux/vbus*
+F:	drivers/vbus/*
+
 VFAT/FAT/MSDOS FILESYSTEM
 M:	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
 S:	Maintained
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 13ffa5d..12f8fb3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2037,6 +2037,8 @@ source "drivers/pcmcia/Kconfig"
 
 source "drivers/pci/hotplug/Kconfig"
 
+source "drivers/vbus/Kconfig"
+
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..d5bedb1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -110,3 +110,4 @@ obj-$(CONFIG_VLYNQ)	+= vlynq/
 obj-$(CONFIG_STAGING)	+= staging/
 obj-y			+= platform/
 obj-y			+= ieee802154/
+obj-y			+= vbus/
diff --git a/drivers/vbus/Kconfig b/drivers/vbus/Kconfig
new file mode 100644
index 0000000..e1939f5
--- /dev/null
+++ b/drivers/vbus/Kconfig
@@ -0,0 +1,14 @@
+#
+# Virtual-Bus (VBus) driver configuration
+#
+
+config VBUS_PROXY
+	tristate "Virtual-Bus support"
+	select SHM_SIGNAL
+	default n
+	help
+	 Adds support for virtual-bus model drivers in a guest to connect
+	 to host side virtual-bus resources.  If you are using this kernel
+	 in a virtualization solution which implements virtual-bus devices
+	 on the backend, say Y.  If unsure, say N.
+
diff --git a/drivers/vbus/Makefile b/drivers/vbus/Makefile
new file mode 100644
index 0000000..a29a1e0
--- /dev/null
+++ b/drivers/vbus/Makefile
@@ -0,0 +1,3 @@
+
+vbus-proxy-objs += bus-proxy.o
+obj-$(CONFIG_VBUS_PROXY) += vbus-proxy.o
diff --git a/drivers/vbus/bus-proxy.c b/drivers/vbus/bus-proxy.c
new file mode 100644
index 0000000..3177f9f
--- /dev/null
+++ b/drivers/vbus/bus-proxy.c
@@ -0,0 +1,152 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *	Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/vbus_driver.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+#define VBUS_PROXY_NAME "vbus-proxy"
+
+static struct vbus_device_proxy *to_dev(struct device *_dev)
+{
+	return _dev ? container_of(_dev, struct vbus_device_proxy, dev) : NULL;
+}
+
+static struct vbus_driver *to_drv(struct device_driver *_drv)
+{
+	return container_of(_drv, struct vbus_driver, drv);
+}
+
+/*
+ * This function is invoked whenever a new driver and/or device is added
+ * to check if there is a match
+ */
+static int vbus_dev_proxy_match(struct device *_dev, struct device_driver *_drv)
+{
+	struct vbus_device_proxy *dev = to_dev(_dev);
+	struct vbus_driver *drv = to_drv(_drv);
+
+	return !strcmp(dev->type, drv->type);
+}
+
+/*
+ * This function is invoked after the bus infrastructure has already made a
+ * match.  The device will contain a reference to the paired driver which
+ * we will extract.
+ */
+static int vbus_dev_proxy_probe(struct device *_dev)
+{
+	int ret = 0;
+	struct vbus_device_proxy *dev = to_dev(_dev);
+	struct vbus_driver *drv = to_drv(_dev->driver);
+
+	if (drv->ops->probe)
+		ret = drv->ops->probe(dev);
+
+	return ret;
+}
+
+static struct bus_type vbus_proxy = {
+	.name  = VBUS_PROXY_NAME,
+	.match = vbus_dev_proxy_match,
+};
+
+static struct device vbus_proxy_rootdev = {
+	.parent    = NULL,
+	.init_name = VBUS_PROXY_NAME,
+};
+
+static int __init vbus_init(void)
+{
+	int ret;
+
+	ret = bus_register(&vbus_proxy);
+	BUG_ON(ret < 0);
+
+	ret = device_register(&vbus_proxy_rootdev);
+	BUG_ON(ret < 0);
+
+	return 0;
+}
+
+postcore_initcall(vbus_init);
+
+static void device_release(struct device *dev)
+{
+	struct vbus_device_proxy *_dev;
+
+	_dev = container_of(dev, struct vbus_device_proxy, dev);
+
+	_dev->ops->release(_dev);
+}
+
+int vbus_device_proxy_register(struct vbus_device_proxy *new)
+{
+	new->dev.parent  = &vbus_proxy_rootdev;
+	new->dev.bus     = &vbus_proxy;
+	new->dev.release = &device_release;
+
+	return device_register(&new->dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_register);
+
+void vbus_device_proxy_unregister(struct vbus_device_proxy *dev)
+{
+	device_unregister(&dev->dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_unregister);
+
+static int match_device_id(struct device *_dev, void *data)
+{
+	struct vbus_device_proxy *dev = to_dev(_dev);
+	u64 id = *(u64 *)data;
+
+	return dev->id == id;
+}
+
+struct vbus_device_proxy *vbus_device_proxy_find(u64 id)
+{
+	struct device *dev;
+
+	dev = bus_find_device(&vbus_proxy, NULL, &id, &match_device_id);
+
+	return to_dev(dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_find);
+
+int vbus_driver_register(struct vbus_driver *new)
+{
+	new->drv.bus   = &vbus_proxy;
+	new->drv.name  = new->type;
+	new->drv.owner = new->owner;
+	new->drv.probe = vbus_dev_proxy_probe;
+
+	return driver_register(&new->drv);
+}
+EXPORT_SYMBOL_GPL(vbus_driver_register);
+
+void vbus_driver_unregister(struct vbus_driver *drv)
+{
+	driver_unregister(&drv->drv);
+}
+EXPORT_SYMBOL_GPL(vbus_driver_unregister);
+
diff --git a/include/linux/vbus_driver.h b/include/linux/vbus_driver.h
new file mode 100644
index 0000000..c53e13f
--- /dev/null
+++ b/include/linux/vbus_driver.h
@@ -0,0 +1,73 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Mediates access to a host VBUS from a guest kernel by providing a
+ * global view of all VBUS devices
+ *
+ * Author:
+ *	Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_DRIVER_H
+#define _LINUX_VBUS_DRIVER_H
+
+#include <linux/device.h>
+#include <linux/shm_signal.h>
+
+struct vbus_device_proxy;
+struct vbus_driver;
+
+struct vbus_device_proxy_ops {
+	int (*open)(struct vbus_device_proxy *dev, int version, int flags);
+	int (*close)(struct vbus_device_proxy *dev, int flags);
+	int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
+		   void *ptr, size_t len,
+		   struct shm_signal_desc *sigdesc, struct shm_signal **signal,
+		   int flags);
+	int (*call)(struct vbus_device_proxy *dev, u32 func,
+		    void *data, size_t len, int flags);
+	void (*release)(struct vbus_device_proxy *dev);
+};
+
+struct vbus_device_proxy {
+	char                         *type;
+	u64                           id;
+	void                         *priv; /* Used by drivers */
+	struct vbus_device_proxy_ops *ops;
+	struct device                 dev;
+};
+
+int vbus_device_proxy_register(struct vbus_device_proxy *dev);
+void vbus_device_proxy_unregister(struct vbus_device_proxy *dev);
+
+struct vbus_device_proxy *vbus_device_proxy_find(u64 id);
+
+struct vbus_driver_ops {
+	int (*probe)(struct vbus_device_proxy *dev);
+	int (*remove)(struct vbus_device_proxy *dev);
+};
+
+struct vbus_driver {
+	char                   *type;
+	struct module          *owner;
+	struct vbus_driver_ops *ops;
+	struct device_driver    drv;
+};
+
+int vbus_driver_register(struct vbus_driver *drv);
+void vbus_driver_unregister(struct vbus_driver *drv);
+
+#endif /* _LINUX_VBUS_DRIVER_H */
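Stripped of the driver-model plumbing, the two binding rules in bus-proxy.c are an exact type-string compare (vbus_dev_proxy_match) and a linear id lookup (what bus_find_device() performs for vbus_device_proxy_find). A standalone userspace sketch of just those rules, with pared-down stand-ins for the real structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pared-down stand-ins for the structures in vbus_driver.h */
struct proxy_dev {
        const char *type;
        uint64_t    id;
};

struct proxy_drv {
        const char *type;
};

/* Mirrors vbus_dev_proxy_match(): bind on an exact type-string match */
int proxy_match(const struct proxy_dev *dev, const struct proxy_drv *drv)
{
        return !strcmp(dev->type, drv->type);
}

/* Mirrors vbus_device_proxy_find(): scan the bus for a 64-bit device id */
const struct proxy_dev *proxy_find(const struct proxy_dev *devs, int n,
                                   uint64_t id)
{
        int i;

        for (i = 0; i < n; i++)
                if (devs[i].id == id)
                        return &devs[i];
        return NULL;
}
```

This is why a guest driver only has to publish a type string and the host only has to publish (type, id) pairs: the bus does the rest, exactly as PCI does with vendor/device ids.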
This will generally be used for hypervisors to publish any host-side virtual devices up to a guest. The guest will have the opportunity to consume any devices present on the vbus-proxy as if they were platform devices, similar to existing buses like PCI.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 MAINTAINERS                 |    6 ++
 arch/x86/Kconfig            |    2 +
 drivers/Makefile            |    1 
 drivers/vbus/Kconfig        |   14 ++++
 drivers/vbus/Makefile       |    3 +
 drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/vbus_driver.h |   73 +++++++++++++++++++++
 7 files changed, 251 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vbus/Kconfig
 create mode 100644 drivers/vbus/Makefile
 create mode 100644 drivers/vbus/bus-proxy.c
 create mode 100644 include/linux/vbus_driver.h