Message ID: 1418399932-7658-2-git-send-email-lersek@redhat.com
On Fri, Dec 12, 2014 at 04:58:45PM +0100, Laszlo Ersek wrote:
> Make it clear that the maximum access size to the MMIO data register
> determines the full size of the memory region.
>
> Currently the max access size is 1. Ensure that if a larger size were
> used in "fw_cfg_data_mem_ops.valid.max_access_size", the memory
> subsystem would split the access to byte-sized accesses internally,
> in increasing address order.
>
> fw_cfg_data_mem_read() and fw_cfg_data_mem_write() don't care about
> "address" or "size"; they just call the sequential fw_cfg_read() and
> fw_cfg_write() functions, correspondingly. Therefore the automatic
> splitting is just right. (The endianness of "fw_cfg_data_mem_ops" is
> native.)

This 'is native' caught my eye. Laszlo's
Documentation/devicetree/bindings/arm/fw-cfg.txt patch states that the
selector register is LE and

"
The data register allows accesses with 8, 16, 32 and 64-bit width.
Accesses larger than a byte are interpreted as arrays, bundled
together only for better performance. The bytes constituting such a
word, in increasing address order, correspond to the bytes that would
have been transferred by byte-wide accesses in chronological order.
"

I chatted with Laszlo to make sure the host-is-BE case was considered.
It looks like there may be an issue there that Laszlo promised to look
into. Laszlo had already noticed that the selector was defined as
native in qemu as well, but should be LE. Now that we support > 1 byte
reads and writes from the data port, then maybe we should look into
changing that as well.

drew

> This patch doesn't change behavior.
>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
>
> Notes:
>     v4:
>     - unchanged
>
>     v3:
>     - new in v3 [Drew Jones]
>
>  hw/nvram/fw_cfg.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
> index c4b78ed..7f6031c 100644
> --- a/hw/nvram/fw_cfg.c
> +++ b/hw/nvram/fw_cfg.c
> @@ -30,9 +30,8 @@
>  #include "qemu/error-report.h"
>  #include "qemu/config-file.h"
>
>  #define FW_CFG_SIZE 2
> -#define FW_CFG_DATA_SIZE 1
>  #define TYPE_FW_CFG "fw_cfg"
>  #define FW_CFG_NAME "fw_cfg"
>  #define FW_CFG_PATH "/machine/" FW_CFG_NAME
>  #define FW_CFG(obj) OBJECT_CHECK(FWCfgState, (obj), TYPE_FW_CFG)
> @@ -323,8 +322,9 @@ static const MemoryRegionOps fw_cfg_data_mem_ops = {
>      .valid = {
>          .min_access_size = 1,
>          .max_access_size = 1,
>      },
> +    .impl.max_access_size = 1,
>  };
>
>  static const MemoryRegionOps fw_cfg_comb_mem_ops = {
>      .read = fw_cfg_comb_read,
> @@ -608,9 +608,10 @@ static void fw_cfg_initfn(Object *obj)
>      memory_region_init_io(&s->ctl_iomem, OBJECT(s), &fw_cfg_ctl_mem_ops, s,
>                            "fwcfg.ctl", FW_CFG_SIZE);
>      sysbus_init_mmio(sbd, &s->ctl_iomem);
>      memory_region_init_io(&s->data_iomem, OBJECT(s), &fw_cfg_data_mem_ops, s,
> -                          "fwcfg.data", FW_CFG_DATA_SIZE);
> +                          "fwcfg.data",
> +                          fw_cfg_data_mem_ops.valid.max_access_size);
>      sysbus_init_mmio(sbd, &s->data_iomem);
>      /* In case ctl and data overlap: */
>      memory_region_init_io(&s->comb_iomem, OBJECT(s), &fw_cfg_comb_mem_ops, s,
>                            "fwcfg", FW_CFG_SIZE);
> --
> 1.8.3.1
On 12/16/14 14:48, Andrew Jones wrote:
> On Fri, Dec 12, 2014 at 04:58:45PM +0100, Laszlo Ersek wrote:
>> Make it clear that the maximum access size to the MMIO data register
>> determines the full size of the memory region.
>>
>> Currently the max access size is 1. Ensure that if a larger size were
>> used in "fw_cfg_data_mem_ops.valid.max_access_size", the memory
>> subsystem would split the access to byte-sized accesses internally,
>> in increasing address order.
>>
>> fw_cfg_data_mem_read() and fw_cfg_data_mem_write() don't care about
>> "address" or "size"; they just call the sequential fw_cfg_read() and
>> fw_cfg_write() functions, correspondingly. Therefore the automatic
>> splitting is just right. (The endianness of "fw_cfg_data_mem_ops" is
>> native.)
>
> This 'is native' caught my eye. Laszlo's
> Documentation/devicetree/bindings/arm/fw-cfg.txt patch states that the
> selector register is LE and
>
> "
> The data register allows accesses with 8, 16, 32 and 64-bit width.
> Accesses larger than a byte are interpreted as arrays, bundled
> together only for better performance. The bytes constituting such a
> word, in increasing address order, correspond to the bytes that would
> have been transferred by byte-wide accesses in chronological order.
> "
>
> I chatted with Laszlo to make sure the host-is-BE case was considered.
> It looks like there may be an issue there that Laszlo promised to look
> into. Laszlo had already noticed that the selector was defined as
> native in qemu as well, but should be LE. Now that we support > 1 byte
> reads and writes from the data port, then maybe we should look into
> changing that as well.

Yes.

The root of this question is what each of

enum device_endian {
    DEVICE_NATIVE_ENDIAN,
    DEVICE_BIG_ENDIAN,
    DEVICE_LITTLE_ENDIAN,
};

means.
Consider the following call tree, which implements the splitting /
combining of an MMIO read:

  memory_region_dispatch_read() [memory.c]
    memory_region_dispatch_read1()
      access_with_adjusted_size()
        memory_region_big_endian()
        for each byte in the wide access:
          memory_region_read_accessor()
            fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c]
              fw_cfg_read()
    adjust_endianness()
      memory_region_wrong_endianness()
      bswapXX()

The function access_with_adjusted_size() always iterates over the MMIO
address range in incrementing address order. However, it can calculate
the shift count for memory_region_read_accessor() in two ways.

When memory_region_big_endian() returns "true", the shift count
decreases as the MMIO address increases.

When memory_region_big_endian() returns "false", the shift count
increases as the MMIO address increases.

In memory_region_read_accessor(), the shift count is used for a logical
(ie. numeric) bitwise left shift (<<). That operator works with numeric
values and hides (ie. accounts for) host endianness.

Let's consider
- an array of two bytes, [0xaa, 0xbb],
- a uint16_t access made from the guest,
- and all twelve possible cases.

In the table below, the "host", "guest" and "device_endian" columns are
input variables. The columns to the right are calculated / derived
values.

After memory_region_dispatch_read1() constructs the value that is
visible in the "host value" column, it is passed to adjust_endianness().
If memory_region_wrong_endianness() returns "true", then the host value
is byte-swapped. The result is then passed to the guest.
 # host guest device_endian memory_region_big_endian() host value host repr.   memory_region_wrong_endianness() guest repr.   guest value
-- ---- ----- ------------- -------------------------- ---------- ------------ -------------------------------- ------------- -----------
 1 LE   LE    native        0                          0xbbaa     [0xaa, 0xbb] 0                                [0xaa, 0xbb]  0xbbaa
 2 LE   LE    BE            1                          0xaabb     [0xbb, 0xaa] 1                                [0xaa, 0xbb]  0xbbaa
 3 LE   LE    LE            0                          0xbbaa     [0xaa, 0xbb] 0                                [0xaa, 0xbb]  0xbbaa

 4 LE   BE    native        1                          0xaabb     [0xbb, 0xaa] 0                                [0xbb, 0xaa]  0xbbaa
 5 LE   BE    BE            1                          0xaabb     [0xbb, 0xaa] 0                                [0xbb, 0xaa]  0xbbaa
 6 LE   BE    LE            0                          0xbbaa     [0xaa, 0xbb] 1                                [0xbb, 0xaa]  0xbbaa

 7 BE   LE    native        0                          0xbbaa     [0xbb, 0xaa] 0                                [0xbb, 0xaa]  0xaabb
 8 BE   LE    BE            1                          0xaabb     [0xaa, 0xbb] 1                                [0xbb, 0xaa]  0xaabb
 9 BE   LE    LE            0                          0xbbaa     [0xbb, 0xaa] 0                                [0xbb, 0xaa]  0xaabb

10 BE   BE    native        1                          0xaabb     [0xaa, 0xbb] 0                                [0xaa, 0xbb]  0xaabb
11 BE   BE    BE            1                          0xaabb     [0xaa, 0xbb] 0                                [0xaa, 0xbb]  0xaabb
12 BE   BE    LE            0                          0xbbaa     [0xbb, 0xaa] 1                                [0xaa, 0xbb]  0xaabb

Looking at the two rightmost columns, we must conclude:

(a) The "device_endian" field has absolutely no significance wrt. what
    the guest sees. In each triplet of cases, when we go from
    DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN,
    the guest sees the exact same value.

    I don't understand this result (it makes me doubt device_endian
    makes any sense). What did I do wrong?

    Just to be sure that I was not seeing ghosts, I actually tested
    this result.
On an x86_64 host I tested the aarch64 guest firmware (using TCG),
cycling the "fw_cfg_data_mem_ops.endianness" field through all three
possible values. That is, I tested cases #1 to #3. They *all* worked!

(b) Considering a given host endianness (ie. a group of six cases):
    when the guest goes from little endian to big endian, the
    *numerical value* the guest sees does not change. In addition,
    fixing the guest endianness and changing the host endianness, the
    *numerical value* that the guest sees (for the original host-side
    array [0xaa, 0xbb]) changes.

    This means that this interface is *value preserving*, not
    representation preserving. In other words: when host and guest
    endiannesses are identical, the *array* is transferred okay (the
    guest array matches the initial host array [0xaa, 0xbb]). When
    guest and host differ in endianness, the guest will see an
    incorrect *array*.

    Which, alas, makes this interface completely unsuitable for the
    purpose at hand (ie. transferring kernel & initrd images in words
    wider than a byte). We couldn't care less what numerical value the
    array [0xaa, 0xbb] encodes on the host -- whatever it encodes, the
    guest wants to receive the same *array* (not the same value). And
    the byte order cannot be fixed up in the guest, because it depends
    on the XOR of guest and host endiannesses, and the guest doesn't
    know about the host's endianness.

I apologize for wasting everyone's time, but I think both results are
very non-intuitive. I noticed the use of the bitwise shift in
memory_region_read_accessor() -- which internally accounts for the
host-side byte order -- just today, while discussing this with Drew on
IRC. I had assumed that it would store bytes in the recipient uint64_t
by address, not by numerical order of magnitude.

Looks like the DMA-like interface is the only way forward. :(

Laszlo
On 16/12/2014 20:00, Laszlo Ersek wrote:
> Yes.
>
> The root of this question is what each of
>
> enum device_endian {
>     DEVICE_NATIVE_ENDIAN,
>     DEVICE_BIG_ENDIAN,
>     DEVICE_LITTLE_ENDIAN,
> };

Actually, I think the root of the answer :) is that fw_cfg_read (and
thus fw_cfg_data_mem_read) is not idempotent. The split/compose stuff
accesses the bytes at offsets 8,9,10,11,12,13,14,15 and composes them
according to the endianness.

In the case of fw_cfg it just retrieves 8 bytes, but in the case of
host big endian it reads them in the "wrong" order for some reason
(sorry, I haven't looked at this thoroughly).

So the solution is:

1) make fw_cfg_data_mem_ops DEVICE_LITTLE_ENDIAN

2) make fw_cfg_data_mem_read and fw_cfg_data_mem_write call fw_cfg_read
   and fw_cfg_write SIZE times and build up a value from the lowest
   byte up.

Paolo
On 12/16/14 20:49, Paolo Bonzini wrote:
> On 16/12/2014 20:00, Laszlo Ersek wrote:
>> Yes.
>>
>> The root of this question is what each of
>>
>> enum device_endian {
>>     DEVICE_NATIVE_ENDIAN,
>>     DEVICE_BIG_ENDIAN,
>>     DEVICE_LITTLE_ENDIAN,
>> };
>
> Actually, I think the root of the answer :) is that fw_cfg_read (and
> thus fw_cfg_data_mem_read) is not idempotent. The split/compose stuff
> accesses the bytes at offsets 8,9,10,11,12,13,14,15 and composes them
> according to the endianness.
>
> In the case of fw_cfg it just retrieves 8 bytes, but in the case of
> host big endian it reads them in the "wrong" order for some reason
> (sorry, I haven't looked at this thoroughly).

I can't imagine how that would happen; fw_cfg_data_mem_read() ignores
both "addr" and "size", and fw_cfg_read() simply advances the
"cur_offset" member.

> So the solution is:
>
> 1) make fw_cfg_data_mem_ops DEVICE_LITTLE_ENDIAN
>
> 2) make fw_cfg_data_mem_read and fw_cfg_data_mem_write call
>    fw_cfg_read and fw_cfg_write SIZE times and build up a value from
>    the lowest byte up.

Nonetheless, that's a really nice idea! I got so stuck with the
automatic splitting that I forgot about the possibility to act upon the
"size" parameter in fw_cfg_data_mem_read(). Thanks!

... Another thing that Andrew mentioned but I didn't cover in my other
email -- what about fw_cfg_ctl_mem_ops? It's currently
DEVICE_NATIVE_ENDIAN.

You flipped the combined ops to LE in commit 6fdf98f2 (and, apparently,
I reviewed it). Shouldn't we do the same for the standalone selector?

Thanks
Laszlo
On 12/16/14 21:06, Laszlo Ersek wrote:
> On 12/16/14 20:49, Paolo Bonzini wrote:
>> On 16/12/2014 20:00, Laszlo Ersek wrote:
>>> Yes.
>>>
>>> The root of this question is what each of
>>>
>>> enum device_endian {
>>>     DEVICE_NATIVE_ENDIAN,
>>>     DEVICE_BIG_ENDIAN,
>>>     DEVICE_LITTLE_ENDIAN,
>>> };
>>
>> Actually, I think the root of the answer :) is that fw_cfg_read (and
>> thus fw_cfg_data_mem_read) is not idempotent. The split/compose stuff
>> accesses the bytes at offsets 8,9,10,11,12,13,14,15 and composes them
>> according to the endianness.
>>
>> In the case of fw_cfg it just retrieves 8 bytes, but in the case of
>> host big endian it reads them in the "wrong" order for some reason
>> (sorry, I haven't looked at this thoroughly).
>
> I can't imagine how that would happen; fw_cfg_data_mem_read() ignores
> both "addr" and "size", and fw_cfg_read() simply advances the
> "cur_offset" member.

Ah okay, I understand your point now; you're probably saying that
access_with_adjusted_size() traverses the offsets in the wrong order.
... I don't see how; the only difference in the access() param list is
the shift count. (I don't know how it should work by design.)

In any case fw_cfg should be able to handle the larger accesses itself.

Thanks
Laszlo
On 16/12/2014 21:06, Laszlo Ersek wrote:
> On 12/16/14 20:49, Paolo Bonzini wrote:
>> fw_cfg_read (and thus fw_cfg_data_mem_read) is not idempotent. The
>> split/compose stuff accesses the bytes at offsets
>> 8,9,10,11,12,13,14,15 and composes them according to the endianness.
>>
>> In the case of fw_cfg it just retrieves 8 bytes, but in the case of
>> host big endian it reads them in the "wrong" order for some reason
>> (sorry, I haven't looked at this thoroughly).
>
> I can't imagine how that would happen; fw_cfg_data_mem_read() ignores
> both "addr" and "size", and fw_cfg_read() simply advances the
> "cur_offset" member.

Honestly neither can I. But still the automatic splitting (which is
even tested by tests/endianness-test.c :)) assumes idempotency of the
components and it's not entirely surprising that it somehow/sometimes
breaks if you don't respect that.

>> So the solution is:
>>
>> 1) make fw_cfg_data_mem_ops DEVICE_LITTLE_ENDIAN
>>
>> 2) make fw_cfg_data_mem_read and fw_cfg_data_mem_write call
>>    fw_cfg_read and fw_cfg_write SIZE times and build up a value from
>>    the lowest byte up.
>
> Nonetheless, that's a really nice idea! I got so stuck with the
> automatic splitting that I forgot about the possibility to act upon
> the "size" parameter in fw_cfg_data_mem_read(). Thanks!
>
> ... Another thing that Andrew mentioned but I didn't cover in my other
> email -- what about fw_cfg_ctl_mem_ops? It's currently
> DEVICE_NATIVE_ENDIAN.
>
> You flipped the combined ops to LE in commit 6fdf98f2 (and,
> apparently, I reviewed it). Shouldn't we do the same for the
> standalone selector?

No. The standalone selector is used as MMIO, and the BE platforms
expect the platform to be big-endian. The combined ops are only used on
ISA ports, where the firmware expects them to be little-endian (as
mentioned in the commit message).

That said, the standalone selector is used by BE platforms only, so we
know that the standalone selector is always DEVICE_BIG_ENDIAN.
So if you want, you can make the standalone selector and the standalone
datum BE and swap them in the firmware. If the suggestion doesn't make
you jump up and down, I understand that. :)

Paolo
On 16 December 2014 at 19:00, Laszlo Ersek <lersek@redhat.com> wrote:
> The root of this question is what each of
>
> enum device_endian {
>     DEVICE_NATIVE_ENDIAN,
>     DEVICE_BIG_ENDIAN,
>     DEVICE_LITTLE_ENDIAN,
> };
>
> means.

An opening remark: endianness is a horribly confusing topic and support
of more than one endianness is even worse. I may have made some
inadvertent errors in this reply; if you think so please let me know
and I'll have another stab at it.

That said: the device_endian options indicate what a device's internal
idea of its endianness is. This is most relevant if a device accepts
accesses at wider than byte width (for instance, if you can read a
32-bit value from address offset 0 and also an 8-bit value from offset
3 then how do those values line up? If you read a 32-bit value then
which way round is it compared to what the device's io read/write
function return?).

NATIVE_ENDIAN means "same order as the CPU's main data bus's natural
representation". (Note that this is not necessarily the same as "the
endianness the CPU currently has"; on ARM you can flip the CPU between
LE and BE at runtime, which is basically inserting a byte-swizzling
step between data accesses and the CPU's data bus, which is always LE
for ARMv7+.)

Note that RAM accessible to a KVM guest is always NATIVE_ENDIAN
(because the guest vcpu reads it directly with the h/w cpu).

> Consider the following call tree, which implements the splitting /
> combining of an MMIO read:
>
>   memory_region_dispatch_read() [memory.c]
>     memory_region_dispatch_read1()
>       access_with_adjusted_size()
>         memory_region_big_endian()
>         for each byte in the wide access:
>           memory_region_read_accessor()
>             fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c]
>               fw_cfg_read()
>     adjust_endianness()
>       memory_region_wrong_endianness()
>       bswapXX()
>
> The function access_with_adjusted_size() always iterates over the
> MMIO address range in incrementing address order.
> However, it can calculate the shift count for
> memory_region_read_accessor() in two ways.
>
> When memory_region_big_endian() returns "true", the shift count
> decreases as the MMIO address increases.
>
> When memory_region_big_endian() returns "false", the shift count
> increases as the MMIO address increases.

Yep, because this is how the device has told us that it thinks of
accesses as being put together.

The column in your table "host value" is the 16 bit value read from the
device, ie what we have decided (based on device_endian) that it would
have returned us if it had supported a 16 bit read directly itself. BE
devices compose 16 bit values with the high byte first, LE devices with
the low byte first, and native-endian devices in the same order as
guest-endianness.

> In memory_region_read_accessor(), the shift count is used for a
> logical (ie. numeric) bitwise left shift (<<). That operator works
> with numeric values and hides (ie. accounts for) host endianness.
>
> Let's consider
> - an array of two bytes, [0xaa, 0xbb],
> - a uint16_t access made from the guest,
> - and all twelve possible cases.
>
> In the table below, the "host", "guest" and "device_endian" columns
> are input variables. The columns to the right are calculated /
> derived values.
>
> After memory_region_dispatch_read1() constructs the value that is
> visible in the "host value" column, it is passed to
> adjust_endianness(). If memory_region_wrong_endianness() returns
> "true", then the host value is byte-swapped. The result is then
> passed to the guest.
>  # host guest device_endian memory_region_big_endian() host value host repr.   memory_region_wrong_endianness() guest repr.   guest value
> -- ---- ----- ------------- -------------------------- ---------- ------------ -------------------------------- ------------- -----------
>  1 LE   LE    native        0                          0xbbaa     [0xaa, 0xbb] 0                                [0xaa, 0xbb]  0xbbaa
>  2 LE   LE    BE            1                          0xaabb     [0xbb, 0xaa] 1                                [0xaa, 0xbb]  0xbbaa
>  3 LE   LE    LE            0                          0xbbaa     [0xaa, 0xbb] 0                                [0xaa, 0xbb]  0xbbaa
>
>  4 LE   BE    native        1                          0xaabb     [0xbb, 0xaa] 0                                [0xbb, 0xaa]  0xbbaa
>  5 LE   BE    BE            1                          0xaabb     [0xbb, 0xaa] 0                                [0xbb, 0xaa]  0xbbaa
>  6 LE   BE    LE            0                          0xbbaa     [0xaa, 0xbb] 1                                [0xbb, 0xaa]  0xbbaa
>
>  7 BE   LE    native        0                          0xbbaa     [0xbb, 0xaa] 0                                [0xbb, 0xaa]  0xaabb
>  8 BE   LE    BE            1                          0xaabb     [0xaa, 0xbb] 1                                [0xbb, 0xaa]  0xaabb
>  9 BE   LE    LE            0                          0xbbaa     [0xbb, 0xaa] 0                                [0xbb, 0xaa]  0xaabb
>
> 10 BE   BE    native        1                          0xaabb     [0xaa, 0xbb] 0                                [0xaa, 0xbb]  0xaabb
> 11 BE   BE    BE            1                          0xaabb     [0xaa, 0xbb] 0                                [0xaa, 0xbb]  0xaabb
> 12 BE   BE    LE            0                          0xbbaa     [0xbb, 0xaa] 1                                [0xaa, 0xbb]  0xaabb

The column you have labelled 'guest repr' here is the returned data
from io_mem_read, whose API contract is "write the data from the device
into this host C uint16_t (or whatever) such that it is the value
returned by the device read as a native host value". It's not related
to the guest order at all.

So for instance, io_mem_read() is called by cpu_physical_memory_rw(),
which passes it a local variable "val".
So now "val" has the "guest repr" column's bytes in it, and (as a host
C variable) the value:

  1,2,3    : 0xbbaa
  4,5,6    : 0xaabb
  7,8,9    : 0xbbaa
  10,11,12 : 0xaabb

As you can see, this depends on the "guest endianness" (which is kind
of the endianness of the bus): a BE guest 16 bit access to this device
would return the 16 bit value 0xaabb, and an LE guest 0xbbaa, and we
have exactly those values in our host C variable.

If this is TCG, then we'll copy that 16 bit host value into the
CPUState struct field corresponding to the destination guest register
as-is. (TCG CPUState fields are always in host-C-order.)

However, to pass them up to KVM we have to put them into a buffer in
RAM as per the KVM_EXIT_MMIO ABI. So: cpu_physical_memory_rw() calls
stw_p(buf, val) [which is "store in target CPU endianness"], so now buf
has the bytes:

  1,2,3    : [0xaa, 0xbb]
  4,5,6    : [0xaa, 0xbb]
  7,8,9    : [0xaa, 0xbb]
  10,11,12 : [0xaa, 0xbb]

...which is the same for every case. This buffer is (for KVM) the
run->mmio.data buffer, whose semantics are "the value as it would
appear if the VCPU performed a load or store of the appropriate width
directly to the byte array". Which is what we want -- your device has
two bytes in order 0xaa, 0xbb, and we did a 16 bit load in the guest
CPU, so we should get the same answer as if we did a 16 bit load of RAM
containing 0xaa, 0xbb. That will be 0xaabb if the VCPU is big-endian,
and 0xbbaa if it is not.

I think the fact that all of these things come out to the same set of
bytes in the mmio.data buffer is the indication that all QEMU's byte
swapping is correct.

> Looking at the two rightmost columns, we must conclude:
>
> (a) The "device_endian" field has absolutely no significance wrt.
>     what the guest sees. In each triplet of cases, when we go from
>     DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to
>     DEVICE_LITTLE_ENDIAN, the guest sees the exact same value.
>
>     I don't understand this result (it makes me doubt device_endian
>     makes any sense). What did I do wrong?
I think it's because you defined your device as only supporting byte
accesses that you didn't see any difference between the various
device_endian settings. If a device presents itself as just an array of
bytes then it doesn't have an endianness, really. As Paolo says, if you
make your device support wider accesses directly and build up the value
yourself then you'll see that setting the device_endian to LE vs BE vs
native does change the values you see in the guest (and that you'll
need to set it to LE and interpret the words in the guest as LE to get
the right behaviour).

> This means that this interface is *value preserving*, not
> representation preserving. In other words: when host and guest
> endiannesses are identical, the *array* is transferred okay (the
> guest array matches the initial host array [0xaa, 0xbb]). When guest
> and host differ in endianness, the guest will see an incorrect
> *array*.

Think of a device which supports only byte accesses as being like a
lump of RAM. A big-endian guest CPU which reads 32 bits at a time will
get different values in registers to an LE guest which does the same.
*However*, if both CPUs just do "read 32 bits; write 32 bits to RAM"
(ie a kind of memcpy but with the source being the mmio register rather
than some other bit of RAM) then you should get a bytewise-correct copy
of the data in RAM.

So I *think* that would be a viable approach: have your QEMU device
code as it is now, and just make sure that if the guest is doing
wider-than-byte accesses it does them as

  do { load word from mmio register; store word to RAM; } while (count);

and doesn't try to interpret the byte order of the values while they're
in the CPU register in the middle of the loop. Paolo's suggestion would
work too, if you prefer that.

> I apologize for wasting everyone's time, but I think both results are
> very non-intuitive.

Everything around endianness handling is non-intuitive -- it's the
nature of the problem, I'm afraid.
(Some of this is also QEMU's fault for not having good documentation of
the semantics of each of the layers involved in memory accesses. I have
on my todo list the vague idea of trying to write these all up as a
blog post.)

thanks
-- PMM
On 16 December 2014 at 20:40, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Honestly neither can I. But still the automatic splitting (which is
> even tested by tests/endianness-test.c :)) assumes idempotency of the
> components and it's not entirely surprising that it somehow/sometimes
> breaks if you don't respect that.

Oh, good point. Yeah, I don't think we want to make any guarantee about
the order in which the N byte accesses hit the device if we're
assembling an N-byte access. Implementing the device as a proper
wide-access-supporting device is definitely the way to go.

-- PMM
On 16/12/2014 21:17, Laszlo Ersek wrote:
>> I can't imagine how that would happen; fw_cfg_data_mem_read()
>> ignores both "addr" and "size", and fw_cfg_read() simply advances
>> the "cur_offset" member.
>
> Ah okay, I understand your point now; you're probably saying that
> access_with_adjusted_size() traverses the offsets in the wrong order.
> ... I don't see how; the only difference in the access() param list
> is the shift count. (I don't know how it should work by design.)

I think I have figured it out.

Guest endianness affects where those bytes are placed, but not the
order in which they are fetched; and the effects of guest endianness
are always cleaned up by adjust_endianness, so ultimately they do not
matter.

Think of how you would implement the uint64_t read in fw_cfg:

  File bytes        12 34 56 78 9a bc de f0

fw_cfg_data_mem_read should read

  size==4  BE host  0x12345678
  size==4  LE host  0x78563412
  size==8  BE host  0x123456789abcdef0
  size==8  LE host  0xf0debc9a78563412

So the implementation of fw_cfg_data_mem_read must depend on host
endianness. Instead, memory.c always fills in bytes in the same order,
on the assumption that the reads are idempotent.

Paolo
On 12/16/14 22:47, Paolo Bonzini wrote:
> On 16/12/2014 21:17, Laszlo Ersek wrote:
>>>> I can't imagine how that would happen; fw_cfg_data_mem_read()
>>>> ignores both "addr" and "size", and fw_cfg_read() simply advances
>>>> the "cur_offset" member.
>>
>> Ah okay, I understand your point now; you're probably saying that
>> access_with_adjusted_size() traverses the offsets in the wrong
>> order. ... I don't see how; the only difference in the access()
>> param list is the shift count. (I don't know how it should work by
>> design.)
>
> I think I have figured it out.
>
> Guest endianness affects where those bytes are placed, but not the
> order in which they are fetched; and the effects of guest endianness
> are always cleaned up by adjust_endianness, so ultimately they do not
> matter.
>
> Think of how you would implement the uint64_t read in fw_cfg:
>
>   File bytes        12 34 56 78 9a bc de f0
>
> fw_cfg_data_mem_read should read
>
>   size==4  BE host  0x12345678
>   size==4  LE host  0x78563412
>   size==8  BE host  0x123456789abcdef0
>   size==8  LE host  0xf0debc9a78563412
>
> So the implementation of fw_cfg_data_mem_read must depend on host
> endianness. Instead, memory.c always fills in bytes in the same
> order, on the assumption that the reads are idempotent.

I see. Thanks!
Laszlo
On 12/16/14 21:40, Paolo Bonzini wrote:
> On 16/12/2014 21:06, Laszlo Ersek wrote:
>> You flipped the combined ops to LE in commit 6fdf98f2 (and,
>> apparently, I reviewed it). Shouldn't we do the same for the
>> standalone selector?
>
> No. The standalone selector is used as MMIO, and the BE platforms
> expect the platform to be big-endian. The combined ops are only used
> on ISA ports, where the firmware expects them to be little-endian (as
> mentioned in the commit message).
>
> That said, the standalone selector is used by BE platforms only, so
> we know that the standalone selector is always DEVICE_BIG_ENDIAN.

This series exposes the standalone selector (as MMIO) to ARM guests as
well; and in "Documentation/devicetree/bindings/arm/fw-cfg.txt" for the
kernel I'm saying that the selector is little endian. Therefore I think
that the standalone selector is not only (going to be) used by BE
platforms (or I don't understand your above statement correctly).

But, the current (and to be preserved) NATIVE_ENDIAN setting still
matches what I say in
"Documentation/devicetree/bindings/arm/fw-cfg.txt", because, Peter
said:

> NATIVE_ENDIAN means "same order as the CPU's main data bus's natural
> representation". (Note that this is not necessarily the same as "the
> endianness the CPU currently has"; on ARM you can flip the CPU
> between LE and BE at runtime, which is basically inserting a
> byte-swizzling step between data accesses and the CPU's data bus,
> which is always LE for ARMv7+.)

In other words, the standalone selector is NATIVE_ENDIAN, but in the
description of the *ARM* bindings, we can simply say that it's little
endian. Is that right?

Thanks
Laszlo

> So if you want, you can make the standalone selector and the
> standalone datum BE and swap them in the firmware. If the suggestion
> doesn't make you jump up and down, I understand that. :)
>
> Paolo
On 12/16/14 21:41, Peter Maydell wrote:
> On 16 December 2014 at 19:00, Laszlo Ersek <lersek@redhat.com> wrote:
>> The root of this question is what each of
>>
>> enum device_endian {
>>     DEVICE_NATIVE_ENDIAN,
>>     DEVICE_BIG_ENDIAN,
>>     DEVICE_LITTLE_ENDIAN,
>> };
>>
>> means.
>
> An opening remark: endianness is a horribly confusing topic and
> support of more than one endianness is even worse. I may have made
> some inadvertent errors in this reply; if you think so please let me
> know and I'll have another stab at it.
>
> That said: the device_endian options indicate what a device's
> internal idea of its endianness is. This is most relevant if a device
> accepts accesses at wider than byte width (for instance, if you can
> read a 32-bit value from address offset 0 and also an 8-bit value
> from offset 3 then how do those values line up? If you read a 32-bit
> value then which way round is it compared to what the device's io
> read/write function return?).
>
> NATIVE_ENDIAN means "same order as the CPU's main data bus's natural
> representation". (Note that this is not necessarily the same as "the
> endianness the CPU currently has"; on ARM you can flip the CPU
> between LE and BE at runtime, which is basically inserting a
> byte-swizzling step between data accesses and the CPU's data bus,
> which is always LE for ARMv7+.)
>
> Note that RAM accessible to a KVM guest is always NATIVE_ENDIAN
> (because the guest vcpu reads it directly with the h/w cpu).
> >> Consider the following call tree, which implements the splitting / >> combining of an MMIO read: >> >> memory_region_dispatch_read() [memory.c] >> memory_region_dispatch_read1() >> access_with_adjusted_size() >> memory_region_big_endian() >> for each byte in the wide access: >> memory_region_read_accessor() >> fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c] >> fw_cfg_read() >> adjust_endianness() >> memory_region_wrong_endianness() >> bswapXX() >> >> The function access_with_adjusted_size() always iterates over the MMIO >> address range in incrementing address order. However, it can calculate >> the shift count for memory_region_read_accessor() in two ways. >> >> When memory_region_big_endian() returns "true", the shift count >> decreases as the MMIO address increases. >> >> When memory_region_big_endian() returns "false", the shift count >> increases as the MMIO address increases. > > Yep, because this is how the device has told us that it thinks > of accesses as being put together. > > The column in your table "host value" is the 16 bit value read from > the device, ie what we have decided (based on device_endian) that > it would have returned us if it had supported a 16 bit read directly > itself. BE devices compose 16 bit values with the high byte first, > LE devices with the low byte first, and native-endian devices > in the same order as guest-endianness. > >> In memory_region_read_accessor(), the shift count is used for a logical >> (ie. numeric) bitwise left shift (<<). That operator works with numeric >> values and hides (ie. accounts for) host endianness. >> >> Let's consider >> - an array of two bytes, [0xaa, 0xbb], >> - a uint16_t access made from the guest, >> - and all twelve possible cases. >> >> In the table below, the "host", "guest" and "device_endian" columns are >> input variables. The columns to the right are calculated / derived >> values. The arrows above the columns show the data dependencies. 
>>
>> After memory_region_dispatch_read1() constructs the value that is
>> visible in the "host value" column, it is passed to adjust_endianness().
>> If memory_region_wrong_endianness() returns "true", then the host value
>> is byte-swapped. The result is then passed to the guest.
>>
>> (arrows showing the data dependencies between columns omitted)
>>
>>  # host guest device_endian memory_region_big_endian() host value host repr.   memory_region_wrong_endianness() guest repr.  guest value
>> -- ---- ----- ------------- -------------------------- ---------- ------------ -------------------------------- ------------ -----------
>>  1 LE   LE    native        0                          0xbbaa     [0xaa, 0xbb] 0                                [0xaa, 0xbb] 0xbbaa
>>  2 LE   LE    BE            1                          0xaabb     [0xbb, 0xaa] 1                                [0xaa, 0xbb] 0xbbaa
>>  3 LE   LE    LE            0                          0xbbaa     [0xaa, 0xbb] 0                                [0xaa, 0xbb] 0xbbaa
>>
>>  4 LE   BE    native        1                          0xaabb     [0xbb, 0xaa] 0                                [0xbb, 0xaa] 0xbbaa
>>  5 LE   BE    BE            1                          0xaabb     [0xbb, 0xaa] 0                                [0xbb, 0xaa] 0xbbaa
>>  6 LE   BE    LE            0                          0xbbaa     [0xaa, 0xbb] 1                                [0xbb, 0xaa] 0xbbaa
>>
>>  7 BE   LE    native        0                          0xbbaa     [0xbb, 0xaa] 0                                [0xbb, 0xaa] 0xaabb
>>  8 BE   LE    BE            1                          0xaabb     [0xaa, 0xbb] 1                                [0xbb, 0xaa] 0xaabb
>>  9 BE   LE    LE            0                          0xbbaa     [0xbb, 0xaa] 0                                [0xbb, 0xaa] 0xaabb
>>
>> 10 BE   BE    native        1                          0xaabb     [0xaa, 0xbb] 0                                [0xaa, 0xbb] 0xaabb
>> 11 BE   BE    BE            1                          0xaabb     [0xaa, 0xbb] 0                                [0xaa, 0xbb] 0xaabb
>> 12 BE   BE    LE            0                          0xbbaa     [0xbb, 0xaa] 1                                [0xaa, 0xbb] 0xaabb
>
> The column you have labelled 'guest repr' here is the returned data
> from io_mem_read, whose API contract is "write the data from the
> device into this host C uint16_t (or
whatever) such that it is the > value returned by the device read as a native host value". It's > not related to the guest order at all. > > So for instance, io_mem_read() is called by cpu_physical_memory_rw(), > which passes it a local variable "val". So now "val" has the > "guest repr" column's bytes in it, and (as a host C variable) the > value: > 1,2,3 : 0xbbaa > 4,5,6 : 0xaabb > 7,8,9 : 0xbbaa > 10,11,12 : 0xaabb > > As you can see, this depends on the "guest endianness" (which is > kind of the endianness of the bus): a BE guest 16 bit access to > this device would return the 16 bit value 0xaabb, and an LE guest > 0xbbaa, and we have exactly those values in our host C variable. > If this is TCG, then we'll copy that 16 bit host value into the > CPUState struct field corresponding to the destination guest > register as-is. (TCG CPUState fields are always in host-C-order.) > > However, to pass them up to KVM we have to put them into a > buffer in RAM as per the KVM_EXIT_MMIO ABI. So: > cpu_physical_memory_rw() calls stw_p(buf, val) [which is "store > in target CPU endianness"], so now buf has the bytes: > 1,2,3 : [0xaa, 0xbb] > 4,5,6 : [0xaa, 0xbb] > 7,8,9 : [0xaa, 0xbb] > 10,11,12 : [0xaa, 0xbb] > > ...which is the same for every case. > > This buffer is (for KVM) the run->mmio.data buffer, whose semantics > are "the value as it would appear if the VCPU performed a load or store > of the appropriate width directly to the byte array". Which is what we > want -- your device has two bytes in order 0xaa, 0xbb, and we did > a 16 bit load in the guest CPU, so we should get the same answer as if > we did a 16 bit load of RAM containing 0xaa, 0xbb. That will be > 0xaabb if the VCPU is big-endian, and 0xbbaa if it is not. > > I think the fact that all of these things come out to the same > set of bytes in the mmio.data buffer is the indication that all > QEMU's byte swapping is correct. 
> >> Looking at the two rightmost columns, we must conclude: >> >> (a) The "device_endian" field has absolutely no significance wrt. what >> the guest sees. In each triplet of cases, when we go from >> DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN, >> the guest sees the exact same value. >> >> I don't understand this result (it makes me doubt device_endian >> makes any sense). What did I do wrong? > > I think it's because you defined your device as only supporting > byte accesses that you didn't see any difference between the > various device_endian settings. If a device presents itself as > just an array of bytes then it doesn't have an endianness, really. > > As Paolo says, if you make your device support wider accesses > directly and build up the value yourself then you'll see that > setting the device_endian to LE vs BE vs native does change > the values you see in the guest (and that you'll need to set it > to LE and interpret the words in the guest as LE to get the > right behaviour). > >> This means that this interface is *value preserving*, not >> representation preserving. In other words: when host and guest >> endiannesses are identical, the *array* is transferred okay (the >> guest array matches the initial host array [0xaa, 0xbb]). When guest >> and host differ in endianness, the guest will see an incorrect >> *array*. > > Think of a device which supports only byte accesses as being > like a lump of RAM. A big-endian guest CPU which reads 32 bits > at a time will get different values in registers to an LE guest > which does the same. > > *However*, if both CPUs just do "read 32 bits; write 32 bits to > RAM" (ie a kind of memcpy but with the source being the mmio > register rather than some other bit of RAM) then you should get > a bytewise-correct copy of the data in RAM. 
> > So I *think* that would be a viable approach: have your QEMU > device code as it is now, and just make sure that if the guest > is doing wider-than-byte accesses it does them as > do { > load word from mmio register; > store word to RAM; > } while (count); > > and doesn't try to interpret the byte order of the values while > they're in the CPU register in the middle of the loop. > > Paolo's suggestion would work too, if you prefer that. > >> I apologize for wasting everyone's time, but I think both results are >> very non-intuitive. > > Everything around endianness handling is non-intuitive -- > it's the nature of the problem, I'm afraid. (Some of this is > also QEMU's fault for not having good documentation of the > semantics of each of the layers involved in memory accesses. > I have on my todo list the vague idea of trying to write these > all up as a blog post.) Thanks for taking the time to write this up. My analysis must have missed at least two important things then: - the device's read/write function needs to consider address & size, and return values that match host byte order. fw_cfg doesn't conform ATM. - there's one more layer outside the call tree that I checked that can perform endianness conversion. I'll try to implement Paolo's suggestion (ie. support wide accesses in fw_cfg internally, not relying on splitting/combining by memory.c). Thanks Laszlo
Am 17.12.2014 um 08:13 schrieb Laszlo Ersek <lersek@redhat.com>:

> On 12/16/14 21:41, Peter Maydell wrote:
>> [snip: Peter's explanation of device_endian and the splitting of wide
>> accesses, quoted in full in the message above]
>
> Thanks for taking the time to write this up. My analysis must have
> missed at least two important things then:
> - the device's read/write function needs to consider address & size, and
>   return values that match host byte order. fw_cfg doesn't conform ATM.
> - there's one more layer outside the call tree that I checked that can
>   perform endianness conversion.
>
> I'll try to implement Paolo's suggestion (ie. support wide accesses in
> fw_cfg internally, not relying on splitting/combining by memory.c).

Awesome :). Please define it as device little endian while you go as
well. That should give us fewer headaches if we want to support wide
access on ppc.

Alex
On 12/17/14 09:28, Alexander Graf wrote:
> Am 17.12.2014 um 08:13 schrieb Laszlo Ersek <lersek@redhat.com>:
>> [snip: the full exchange with Peter, quoted in full above]
>>
>> I'll try to implement Paolo's suggestion (ie. support wide accesses in
>> fw_cfg internally, not relying on splitting/combining by memory.c).
>
> Awesome :). Please define it as device little endian while you go as
> well. That should give us fewer headaches if we want to support wide
> access on ppc.

Definitely; that was actually the first part of Paolo's suggestion. :)

Thanks!
Laszlo
On 17/12/2014 06:06, Laszlo Ersek wrote: >> > That said, the standalone selector is used by BE platforms only, so we >> > know that the standalone selector is always DEVICE_BIG_ENDIAN. > This series exposes the standalone selector (as MMIO) to ARM guests as > well; and in "Documentation/devicetree/bindings/arm/fw-cfg.txt" for the > kernel I'm saying that the selector is little endian. > > Therefore I think that the standalone selector is not only (going to be) > used by BE platforms (or I don't understand your above statement > correctly). It's not going to be used only by BE platforms. But so far it is used by BE platforms only. If you want to keep everything consistent, you can make it (and the wide datum) BE; and swizzle data in the firmware. Otherwise, NATIVE_ENDIAN is fine. > In other words, the standalone selector is NATIVE_ENDIAN, but in the > description of the *ARM* bindings, we can simply say that it's little > endian. Yes. Paolo
On 17.12.14 10:23, Paolo Bonzini wrote:
>
> On 17/12/2014 06:06, Laszlo Ersek wrote:
>>>> That said, the standalone selector is used by BE platforms only, so we
>>>> know that the standalone selector is always DEVICE_BIG_ENDIAN.
>> This series exposes the standalone selector (as MMIO) to ARM guests as
>> well; and in "Documentation/devicetree/bindings/arm/fw-cfg.txt" for the
>> kernel I'm saying that the selector is little endian.
>>
>> Therefore I think that the standalone selector is not only (going to be)
>> used by BE platforms (or I don't understand your above statement
>> correctly).
>
> It's not going to be used only by BE platforms. But so far it is used
> by BE platforms only. If you want to keep everything consistent, you
> can make it (and the wide datum) BE; and swizzle data in the firmware.
> Otherwise, NATIVE_ENDIAN is fine.

For the sake of keeping myself sane, I would really prefer not to have
NATIVE_ENDIAN. Once you start thinking about little endian guests on
PPC sharing code with little endian guests on x86 your head will start
to explode otherwise.

If code becomes simpler by making the mmio accessors BIG_ENDIAN, that's
fine with me as well.

Alex
diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
index c4b78ed..7f6031c 100644
--- a/hw/nvram/fw_cfg.c
+++ b/hw/nvram/fw_cfg.c
@@ -30,9 +30,8 @@
 #include "qemu/error-report.h"
 #include "qemu/config-file.h"
 
 #define FW_CFG_SIZE 2
-#define FW_CFG_DATA_SIZE 1
 #define TYPE_FW_CFG "fw_cfg"
 #define FW_CFG_NAME "fw_cfg"
 #define FW_CFG_PATH "/machine/" FW_CFG_NAME
 #define FW_CFG(obj) OBJECT_CHECK(FWCfgState, (obj), TYPE_FW_CFG)
@@ -323,8 +322,9 @@ static const MemoryRegionOps fw_cfg_data_mem_ops = {
     .valid = {
         .min_access_size = 1,
         .max_access_size = 1,
     },
+    .impl.max_access_size = 1,
 };
 
 static const MemoryRegionOps fw_cfg_comb_mem_ops = {
     .read = fw_cfg_comb_read,
@@ -608,9 +608,10 @@ static void fw_cfg_initfn(Object *obj)
     memory_region_init_io(&s->ctl_iomem, OBJECT(s), &fw_cfg_ctl_mem_ops, s,
                           "fwcfg.ctl", FW_CFG_SIZE);
     sysbus_init_mmio(sbd, &s->ctl_iomem);
     memory_region_init_io(&s->data_iomem, OBJECT(s), &fw_cfg_data_mem_ops, s,
-                          "fwcfg.data", FW_CFG_DATA_SIZE);
+                          "fwcfg.data",
+                          fw_cfg_data_mem_ops.valid.max_access_size);
     sysbus_init_mmio(sbd, &s->data_iomem);
     /* In case ctl and data overlap: */
     memory_region_init_io(&s->comb_iomem, OBJECT(s), &fw_cfg_comb_mem_ops, s,
                           "fwcfg", FW_CFG_SIZE);
Make it clear that the maximum access size to the MMIO data register
determines the full size of the memory region.

Currently the max access size is 1. Ensure that if a larger size were
used in "fw_cfg_data_mem_ops.valid.max_access_size", the memory
subsystem would split the access to byte-sized accesses internally, in
increasing address order.

fw_cfg_data_mem_read() and fw_cfg_data_mem_write() don't care about
"address" or "size"; they just call the sequential fw_cfg_read() and
fw_cfg_write() functions, correspondingly. Therefore the automatic
splitting is just right. (The endianness of "fw_cfg_data_mem_ops" is
native.)

This patch doesn't change behavior.

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---

Notes:
    v4:
    - unchanged
    
    v3:
    - new in v3 [Drew Jones]

 hw/nvram/fw_cfg.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)