Message ID | 20241024102813.9855-3-phil@philjordan.eu |
---|---|
State | New |
Series | macOS PV Graphics and new vmapple machine type |
On 2024/10/24 19:28, Phil Dennis-Jordan wrote: > MacOS provides a framework (library) that allows any vmm to implement a > paravirtualized 3d graphics passthrough to the host metal stack called > ParavirtualizedGraphics.Framework (PVG). The library abstracts away > almost every aspect of the paravirtualized device model and only provides > and receives callbacks on MMIO access as well as to share memory address > space between the VM and PVG. > > This patch implements a QEMU device that drives PVG for the VMApple > variant of it. > > Signed-off-by: Alexander Graf <graf@amazon.com> > Co-authored-by: Alexander Graf <graf@amazon.com> > > Subsequent changes: > > * Cherry-pick/rebase conflict fixes, API use updates. > * Moved from hw/vmapple/ (useful outside that machine type) > * Overhaul of threading model, many thread safety improvements. > * Asynchronous rendering. > * Memory and object lifetime fixes. > * Refactoring to split generic and (vmapple) MMIO variant specific > code. > > Signed-off-by: Phil Dennis-Jordan <phil@philjordan.eu> > --- > > v2: > > * Cherry-pick/rebase conflict fixes > * BQL function renaming > * Moved from hw/vmapple/ (useful outside that machine type) > * Code review comments: Switched to DEFINE_TYPES macro & little endian > MMIO. > * Removed some dead/superfluous code > * Mad set_mode thread & memory safe > * Added migration blocker due to lack of (de-)serialisation. > * Fixes to ObjC refcounting and autorelease pool usage. > * Fixed ObjC new/init misuse > * Switched to ObjC category extension for private property. > * Simplified task memory mapping and made it thread safe. > * Refactoring to split generic and vmapple MMIO variant specific > code. > * Switched to asynchronous MMIO writes on x86-64 > * Rendering and graphics update are now done asynchronously > * Fixed cursor handling > * Coding convention fixes > * Removed software cursor compositing > > v3: > > * Rebased on latest upstream, fixed breakages including switching to Resettable methods. > * Squashed patches dealing with dGPUs, MMIO area size, and GPU picking. > * Allow re-entrant MMIO; this simplifies the code and solves the divergence > between x86-64 and arm64 variants. > > v4: > > * Renamed '-vmapple' device variant to '-mmio' > * MMIO device type now requires aarch64 host and guest > * Complete overhaul of the glue code for making Qemu's and > ParavirtualizedGraphics.framework's threading and synchronisation models > work together. Calls into PVG are from dispatch queues while the > BQL-holding initiating thread processes AIO context events; callbacks from > PVG are scheduled as BHs on the BQL/main AIO context, awaiting completion > where necessary. > * Guest frame rendering state is covered by the BQL, with only the PVG calls > outside the lock, and serialised on the named render_queue. > * Simplified logic for dropping frames in-flight during mode changes, fixed > bug in pending frames logic. > * Addressed smaller code review notes such as: function naming, object type > declarations, type names/declarations/casts, code formatting, #include > order, over-cautious ObjC retain/release, what goes in init vs realize, > etc. 
> > > hw/display/Kconfig | 9 + > hw/display/apple-gfx-mmio.m | 284 ++++++++++++++ > hw/display/apple-gfx.h | 58 +++ > hw/display/apple-gfx.m | 713 ++++++++++++++++++++++++++++++++++++ > hw/display/meson.build | 4 + > hw/display/trace-events | 26 ++ > meson.build | 4 + > 7 files changed, 1098 insertions(+) > create mode 100644 hw/display/apple-gfx-mmio.m > create mode 100644 hw/display/apple-gfx.h > create mode 100644 hw/display/apple-gfx.m > > diff --git a/hw/display/Kconfig b/hw/display/Kconfig > index 2250c740078..6a9b7b19ada 100644 > --- a/hw/display/Kconfig > +++ b/hw/display/Kconfig > @@ -140,3 +140,12 @@ config XLNX_DISPLAYPORT > > config DM163 > bool > + > +config MAC_PVG > + bool > + default y > + > +config MAC_PVG_MMIO > + bool > + depends on MAC_PVG && AARCH64 > + > diff --git a/hw/display/apple-gfx-mmio.m b/hw/display/apple-gfx-mmio.m > new file mode 100644 > index 00000000000..06131bc23f1 > --- /dev/null > +++ b/hw/display/apple-gfx-mmio.m > @@ -0,0 +1,284 @@ > +/* > + * QEMU Apple ParavirtualizedGraphics.framework device, MMIO (arm64) variant > + * > + * Copyright © 2023 Amazon.com, Inc. or its affiliates. All Rights Reserved. > + * > + * This work is licensed under the terms of the GNU GPL, version 2 or later. > + * See the COPYING file in the top-level directory. Use SPDX-License-Identifier. You can find some examples with grep. > + * > + * ParavirtualizedGraphics.framework is a set of libraries that macOS provides > + * which implements 3d graphics passthrough to the host as well as a > + * proprietary guest communication channel to drive it. This device model > + * implements support to drive that library from within QEMU as an MMIO-based > + * system device for macOS on arm64 VMs. > + */ > + > +#include "qemu/osdep.h" > +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> > +#include "apple-gfx.h" > +#include "monitor/monitor.h" > +#include "hw/sysbus.h" > +#include "hw/irq.h" > +#include "trace.h" > + > +OBJECT_DECLARE_SIMPLE_TYPE(AppleGFXMMIOState, APPLE_GFX_MMIO) > + > +/* > + * ParavirtualizedGraphics.Framework only ships header files for the PCI > + * variant which does not include IOSFC descriptors and host devices. We add > + * their definitions here so that we can also work with the ARM version. 
> + */ > +typedef bool(^IOSFCRaiseInterrupt)(uint32_t vector); > +typedef bool(^IOSFCUnmapMemory)( > + void *, void *, void *, void *, void *, void *); > +typedef bool(^IOSFCMapMemory)( > + uint64_t phys, uint64_t len, bool ro, void **va, void *, void *); > + > +@interface PGDeviceDescriptor (IOSurfaceMapper) > +@property (readwrite, nonatomic) bool usingIOSurfaceMapper; > +@end > + > +@interface PGIOSurfaceHostDeviceDescriptor : NSObject > +-(PGIOSurfaceHostDeviceDescriptor *)init; > +@property (readwrite, nonatomic, copy, nullable) IOSFCMapMemory mapMemory; > +@property (readwrite, nonatomic, copy, nullable) IOSFCUnmapMemory unmapMemory; > +@property (readwrite, nonatomic, copy, nullable) IOSFCRaiseInterrupt raiseInterrupt; > +@end > + > +@interface PGIOSurfaceHostDevice : NSObject > +-(instancetype)initWithDescriptor:(PGIOSurfaceHostDeviceDescriptor *)desc; > +-(uint32_t)mmioReadAtOffset:(size_t)offset; > +-(void)mmioWriteAtOffset:(size_t)offset value:(uint32_t)value; > +@end > + > +struct AppleGFXMapSurfaceMemoryJob; > +struct AppleGFXMMIOState { > + SysBusDevice parent_obj; > + > + AppleGFXState common; > + > + qemu_irq irq_gfx; > + qemu_irq irq_iosfc; > + MemoryRegion iomem_iosfc; > + PGIOSurfaceHostDevice *pgiosfc; > +}; > + > +typedef struct AppleGFXMMIOJob { > + AppleGFXMMIOState *state; > + uint64_t offset; > + uint64_t value; > + bool completed; > +} AppleGFXMMIOJob; > + > +static void iosfc_do_read(void *opaque) > +{ > + AppleGFXMMIOJob *job = opaque; > + job->value = [job->state->pgiosfc mmioReadAtOffset:job->offset]; > + qatomic_set(&job->completed, true); > + aio_wait_kick(); > +} > + > +static uint64_t iosfc_read(void *opaque, hwaddr offset, unsigned size) > +{ > + AppleGFXMMIOJob job = { > + .state = opaque, > + .offset = offset, > + .completed = false, > + }; > + AioContext *context = qemu_get_aio_context(); > + dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); > + > + dispatch_async_f(queue, &job, iosfc_do_read); > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); Pass NULL as the first argument of AIO_WAIT_WHILE(). 
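For example, a minimal sketch of that change (the rest of iosfc_read() stays the same, and the now-unused "context" local can be dropped):

    dispatch_async_f(queue, &job, iosfc_do_read);
    AIO_WAIT_WHILE(NULL, !qatomic_read(&job.completed));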
> + > + trace_apple_gfx_mmio_iosfc_read(offset, job.value); > + return job.value; > +} > + > +static void iosfc_do_write(void *opaque) > +{ > + AppleGFXMMIOJob *job = opaque; > + [job->state->pgiosfc mmioWriteAtOffset:job->offset value:job->value]; > + qatomic_set(&job->completed, true); > + aio_wait_kick(); > +} > + > +static void iosfc_write(void *opaque, hwaddr offset, uint64_t val, > + unsigned size) > +{ > + AppleGFXMMIOJob job = { > + .state = opaque, > + .offset = offset, > + .value = val, > + .completed = false, > + }; > + AioContext *context = qemu_get_aio_context(); > + dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); > + > + dispatch_async_f(queue, &job, iosfc_do_write); > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > + > + trace_apple_gfx_mmio_iosfc_write(offset, val); > +} > + > +static const MemoryRegionOps apple_iosfc_ops = { > + .read = iosfc_read, > + .write = iosfc_write, > + .endianness = DEVICE_LITTLE_ENDIAN, > + .valid = { > + .min_access_size = 4, > + .max_access_size = 8, > + }, > + .impl = { > + .min_access_size = 4, > + .max_access_size = 8, > + }, > +}; > + > +static void raise_iosfc_irq(void *opaque) > +{ > + AppleGFXMMIOState *s = opaque; > + > + qemu_irq_pulse(s->irq_iosfc); > +} > + > +typedef struct AppleGFXMapSurfaceMemoryJob { > + uint64_t guest_physical_address; > + uint64_t guest_physical_length; > + void *result_mem; > + AppleGFXMMIOState *state; > + bool read_only; > + bool success; > + bool done; > +} AppleGFXMapSurfaceMemoryJob; > + > +static void apple_gfx_mmio_map_surface_memory(void *opaque) > +{ > + AppleGFXMapSurfaceMemoryJob *job = opaque; > + AppleGFXMMIOState *s = job->state; > + mach_vm_address_t mem; > + > + mem = apple_gfx_host_address_for_gpa_range(job->guest_physical_address, > + job->guest_physical_length, > + job->read_only); > + > + qemu_mutex_lock(&s->common.job_mutex); > + job->result_mem = (void*)mem; nit: write as (void *). > + job->success = mem != 0; > + job->done = true; > + qemu_cond_broadcast(&s->common.job_cond); > + qemu_mutex_unlock(&s->common.job_mutex); > +} > + > +static PGIOSurfaceHostDevice *apple_gfx_prepare_iosurface_host_device( > + AppleGFXMMIOState *s) > +{ > + PGIOSurfaceHostDeviceDescriptor *iosfc_desc = > + [PGIOSurfaceHostDeviceDescriptor new]; > + PGIOSurfaceHostDevice *iosfc_host_dev = nil; > + > + iosfc_desc.mapMemory = > + ^bool(uint64_t phys, uint64_t len, bool ro, void **va, void *e, void *f) { > + AppleGFXMapSurfaceMemoryJob job = { > + .guest_physical_address = phys, .guest_physical_length = len, > + .read_only = ro, .state = s, > + }; > + > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > + apple_gfx_mmio_map_surface_memory, &job); > + apple_gfx_await_bh_job(&s->common, &job.done); > + > + *va = job.result_mem; > + > + trace_apple_gfx_iosfc_map_memory(phys, len, ro, va, e, f, *va, > + job.success); > + > + return job.success; > + }; > + > + iosfc_desc.unmapMemory = > + ^bool(void *a, void *b, void *c, void *d, void *e, void *f) { > + trace_apple_gfx_iosfc_unmap_memory(a, b, c, d, e, f); > + return true; > + }; > + > + iosfc_desc.raiseInterrupt = ^bool(uint32_t vector) { > + trace_apple_gfx_iosfc_raise_irq(vector); > + aio_bh_schedule_oneshot(qemu_get_aio_context(), raise_iosfc_irq, s); Let's pass s->irq_iosfc here to unify raise_iosfc_irq() and raise_gfx_irq(). 
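As a rough, untested sketch, both IRQ paths could share a single BH callback that takes a pointer to the qemu_irq field as its opaque value, so it pulses whatever IRQ is wired up by the time the BH runs (the helper name is only illustrative):

    static void raise_irq_bh(void *opaque)
    {
        qemu_irq *irq = opaque;

        qemu_irq_pulse(*irq);
    }

    /* ... */
    aio_bh_schedule_oneshot(qemu_get_aio_context(), raise_irq_bh, &s->irq_iosfc);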
> + return true; > + }; > + > + iosfc_host_dev = > + [[PGIOSurfaceHostDevice alloc] initWithDescriptor:iosfc_desc]; > + [iosfc_desc release]; > + return iosfc_host_dev; > +} > + > +static void raise_gfx_irq(void *opaque) > +{ > + AppleGFXMMIOState *s = opaque; > + > + qemu_irq_pulse(s->irq_gfx); > +} > + > +static void apple_gfx_mmio_realize(DeviceState *dev, Error **errp) > +{ > + @autoreleasepool { > + AppleGFXMMIOState *s = APPLE_GFX_MMIO(dev); > + PGDeviceDescriptor *desc = [PGDeviceDescriptor new]; > + > + desc.raiseInterrupt = ^(uint32_t vector) { > + trace_apple_gfx_raise_irq(vector); > + aio_bh_schedule_oneshot(qemu_get_aio_context(), raise_gfx_irq, s); > + }; > + > + desc.usingIOSurfaceMapper = true; > + s->pgiosfc = apple_gfx_prepare_iosurface_host_device(s); > + > + apple_gfx_common_realize(&s->common, desc, errp); > + [desc release]; > + desc = nil; > + } > +} > + > +static void apple_gfx_mmio_init(Object *obj) > +{ > + AppleGFXMMIOState *s = APPLE_GFX_MMIO(obj); > + > + apple_gfx_common_init(obj, &s->common, TYPE_APPLE_GFX_MMIO); > + > + sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->common.iomem_gfx); > + memory_region_init_io(&s->iomem_iosfc, obj, &apple_iosfc_ops, s, > + TYPE_APPLE_GFX_MMIO, 0x10000); > + sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->iomem_iosfc); > + sysbus_init_irq(SYS_BUS_DEVICE(s), &s->irq_gfx); > + sysbus_init_irq(SYS_BUS_DEVICE(s), &s->irq_iosfc); > +} > + > +static void apple_gfx_mmio_reset(Object *obj, ResetType type) > +{ > + AppleGFXMMIOState *s = APPLE_GFX_MMIO(obj); > + [s->common.pgdev reset]; > +} > + > + > +static void apple_gfx_mmio_class_init(ObjectClass *klass, void *data) > +{ > + DeviceClass *dc = DEVICE_CLASS(klass); > + ResettableClass *rc = RESETTABLE_CLASS(klass); > + > + rc->phases.hold = apple_gfx_mmio_reset; > + dc->hotpluggable = false; > + dc->realize = apple_gfx_mmio_realize; > +} > + > +static TypeInfo apple_gfx_mmio_types[] = { > + { > + .name = TYPE_APPLE_GFX_MMIO, > + .parent = TYPE_SYS_BUS_DEVICE, > + .instance_size = sizeof(AppleGFXMMIOState), > + .class_init = apple_gfx_mmio_class_init, > + .instance_init = apple_gfx_mmio_init, > + } > +}; > +DEFINE_TYPES(apple_gfx_mmio_types) > diff --git a/hw/display/apple-gfx.h b/hw/display/apple-gfx.h > new file mode 100644 > index 00000000000..39931fba65a > --- /dev/null > +++ b/hw/display/apple-gfx.h > @@ -0,0 +1,58 @@ > +#ifndef QEMU_APPLE_GFX_H > +#define QEMU_APPLE_GFX_H > + > +#define TYPE_APPLE_GFX_MMIO "apple-gfx-mmio" > +#define TYPE_APPLE_GFX_PCI "apple-gfx-pci" > + > +#include "qemu/osdep.h" > +#include <dispatch/dispatch.h> > +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> > +#include "qemu/typedefs.h" > +#include "exec/memory.h" > +#include "ui/surface.h" > + > +@class PGDeviceDescriptor; > +@protocol PGDevice; > +@protocol PGDisplay; > +@protocol MTLDevice; > +@protocol MTLTexture; > +@protocol MTLCommandQueue; > + > +typedef QTAILQ_HEAD(, PGTask_s) PGTaskList; > + > +struct AppleGFXMapMemoryJob; Probably this declaration of AppleGFXMapMemoryJob is unnecessary. 
> +typedef struct AppleGFXState { > + MemoryRegion iomem_gfx; > + id<PGDevice> pgdev; > + id<PGDisplay> pgdisp; > + PGTaskList tasks; > + QemuConsole *con; > + id<MTLDevice> mtl; > + id<MTLCommandQueue> mtl_queue; > + bool cursor_show; > + QEMUCursor *cursor; > + > + /* For running PVG memory-mapping requests in the AIO context */ > + QemuCond job_cond; > + QemuMutex job_mutex; Use: QemuEvent > + > + dispatch_queue_t render_queue; > + /* The following fields should only be accessed from the BQL: */ Perhaps it may be better to document fields that can be accessed *without* the BQL; most things in QEMU implicitly require the BQL. > + bool gfx_update_requested; > + bool new_frame_ready; > + bool using_managed_texture_storage; > +} AppleGFXState; > + > +void apple_gfx_common_init(Object *obj, AppleGFXState *s, const char* obj_name); > +void apple_gfx_common_realize(AppleGFXState *s, PGDeviceDescriptor *desc, > + Error **errp); > +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t guest_physical, > + uint64_t length, bool read_only); > +void apple_gfx_await_bh_job(AppleGFXState *s, bool *job_done_flag); > + > +#endif > + > diff --git a/hw/display/apple-gfx.m b/hw/display/apple-gfx.m > new file mode 100644 > index 00000000000..46be9957f69 > --- /dev/null > +++ b/hw/display/apple-gfx.m > @@ -0,0 +1,713 @@ > +/* > + * QEMU Apple ParavirtualizedGraphics.framework device > + * > + * Copyright © 2023 Amazon.com, Inc. or its affiliates. All Rights Reserved. > + * > + * This work is licensed under the terms of the GNU GPL, version 2 or later. > + * See the COPYING file in the top-level directory. > + * > + * ParavirtualizedGraphics.framework is a set of libraries that macOS provides > + * which implements 3d graphics passthrough to the host as well as a > + * proprietary guest communication channel to drive it. This device model > + * implements support to drive that library from within QEMU. > + */ > + > +#include "qemu/osdep.h" > +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> > +#include <mach/mach_vm.h> > +#include "apple-gfx.h" > +#include "trace.h" > +#include "qemu-main.h" > +#include "exec/address-spaces.h" > +#include "migration/blocker.h" > +#include "monitor/monitor.h" > +#include "qemu/main-loop.h" > +#include "qemu/cutils.h" > +#include "qemu/log.h" > +#include "qapi/visitor.h" > +#include "qapi/error.h" > +#include "ui/console.h" > + > +static const PGDisplayCoord_t apple_gfx_modes[] = { > + { .x = 1440, .y = 1080 }, > + { .x = 1280, .y = 1024 }, > +}; > + > +/* This implements a type defined in <ParavirtualizedGraphics/PGDevice.h> > + * which is opaque from the framework's point of view. Typedef PGTask_t already > + * exists in the framework headers. */ > +struct PGTask_s { > + QTAILQ_ENTRY(PGTask_s) node; > + mach_vm_address_t address; > + uint64_t len; > +}; > + > +static Error *apple_gfx_mig_blocker; This does not have to be a static variable. > + > +static void apple_gfx_render_frame_completed(AppleGFXState *s, > + uint32_t width, uint32_t height); > + > +static inline dispatch_queue_t get_background_queue(void) Don't add inline. The only effect for modern compilers of inline is to suppress the unused function warnings. 
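Coming back to the QemuEvent suggestion above, a rough sketch of what a per-job event could look like in place of the done flag plus cond/mutex pair (the done_event field and its placement are only illustrative):

    typedef struct AppleGFXMapSurfaceMemoryJob {
        /* ... existing fields ... */
        QemuEvent done_event;
    } AppleGFXMapSurfaceMemoryJob;

    /* submitting side */
    qemu_event_init(&job.done_event, false);
    aio_bh_schedule_oneshot(qemu_get_aio_context(),
                            apple_gfx_mmio_map_surface_memory, &job);
    qemu_event_wait(&job.done_event);
    qemu_event_destroy(&job.done_event);

    /* BH side, once job->result_mem and job->success are filled in */
    qemu_event_set(&job->done_event);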
> +{ > + return dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); > +} > + > +static PGTask_t *apple_gfx_new_task(AppleGFXState *s, uint64_t len) > +{ > + mach_vm_address_t task_mem; > + PGTask_t *task; > + kern_return_t r; > + > + r = mach_vm_allocate(mach_task_self(), &task_mem, len, VM_FLAGS_ANYWHERE); > + if (r != KERN_SUCCESS || task_mem == 0) { Let's remove the check for task_mem == 0. We have no reason to reject it if the platform insists it allocated a memory at address 0 though such a situation should never happen in practice. > + return NULL; > + } > + > + task = g_new0(PGTask_t, 1); > + > + task->address = task_mem; > + task->len = len; > + QTAILQ_INSERT_TAIL(&s->tasks, task, node); > + > + return task; > +} > + > +typedef struct AppleGFXIOJob { > + AppleGFXState *state; > + uint64_t offset; > + uint64_t value; > + bool completed; > +} AppleGFXIOJob; > + > +static void apple_gfx_do_read(void *opaque) > +{ > + AppleGFXIOJob *job = opaque; > + job->value = [job->state->pgdev mmioReadAtOffset:job->offset]; > + qatomic_set(&job->completed, true); > + aio_wait_kick(); > +} > + > +static uint64_t apple_gfx_read(void *opaque, hwaddr offset, unsigned size) > +{ > + AppleGFXIOJob job = { > + .state = opaque, > + .offset = offset, > + .completed = false, > + }; > + AioContext *context = qemu_get_aio_context(); > + dispatch_queue_t queue = get_background_queue(); > + > + dispatch_async_f(queue, &job, apple_gfx_do_read); > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > + > + trace_apple_gfx_read(offset, job.value); > + return job.value; > +} > + > +static void apple_gfx_do_write(void *opaque) > +{ > + AppleGFXIOJob *job = opaque; > + [job->state->pgdev mmioWriteAtOffset:job->offset value:job->value]; > + qatomic_set(&job->completed, true); > + aio_wait_kick(); > +} > + > +static void apple_gfx_write(void *opaque, hwaddr offset, uint64_t val, > + unsigned size) > +{ > + /* The methods mmioReadAtOffset: and especially mmioWriteAtOffset: can > + * trigger and block on operations on other dispatch queues, which in turn > + * may call back out on one or more of the callback blocks. For this reason, > + * and as we are holding the BQL, we invoke the I/O methods on a pool > + * thread and handle AIO tasks while we wait. Any work in the callbacks > + * requiring the BQL will in turn schedule BHs which this thread will > + * process while waiting. 
*/ > + AppleGFXIOJob job = { > + .state = opaque, > + .offset = offset, > + .value = val, > + .completed = false, > + }; > + AioContext *context = qemu_get_current_aio_context(); > + dispatch_queue_t queue = get_background_queue(); > + > + dispatch_async_f(queue, &job, apple_gfx_do_write); > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > + > + trace_apple_gfx_write(offset, val); > +} > + > +static const MemoryRegionOps apple_gfx_ops = { > + .read = apple_gfx_read, > + .write = apple_gfx_write, > + .endianness = DEVICE_LITTLE_ENDIAN, > + .valid = { > + .min_access_size = 4, > + .max_access_size = 8, > + }, > + .impl = { > + .min_access_size = 4, > + .max_access_size = 4, > + }, > +}; > + > +static void apple_gfx_render_new_frame_bql_unlock(AppleGFXState *s) > +{ > + BOOL r; > + uint32_t width = surface_width(s->surface); > + uint32_t height = surface_height(s->surface); > + MTLRegion region = MTLRegionMake2D(0, 0, width, height); > + id<MTLCommandBuffer> command_buffer = [s->mtl_queue commandBuffer]; > + id<MTLTexture> texture = s->texture; > + > + assert(bql_locked()); > + [texture retain]; > + > + bql_unlock(); > + > + /* This is not safe to call from the BQL due to PVG-internal locks causing > + * deadlocks. */ > + r = [s->pgdisp encodeCurrentFrameToCommandBuffer:command_buffer > + texture:texture > + region:region]; > + if (!r) { > + [texture release]; > + bql_lock(); > + --s->pending_frames; > + bql_unlock(); > + qemu_log_mask(LOG_GUEST_ERROR, "apple_gfx_render_new_frame_bql_unlock: " Use: __func__ > + "encodeCurrentFrameToCommandBuffer:texture:region: failed\n"); > + return; > + } > + > + if (s->using_managed_texture_storage) { > + /* "Managed" textures exist in both VRAM and RAM and must be synced. */ > + id<MTLBlitCommandEncoder> blit = [command_buffer blitCommandEncoder]; > + [blit synchronizeResource:texture]; > + [blit endEncoding]; > + } > + [texture release]; > + [command_buffer addCompletedHandler: > + ^(id<MTLCommandBuffer> cb) > + { > + dispatch_async(s->render_queue, ^{ > + apple_gfx_render_frame_completed(s, width, height); > + }); > + }]; > + [command_buffer commit]; > +} > + > +static void copy_mtl_texture_to_surface_mem(id<MTLTexture> texture, void *vram) > +{ > + /* TODO: Skip this entirely on a pure Metal or headless/guest-only > + * rendering path, else use a blit command encoder? Needs careful > + * (double?) buffering design. */ > + size_t width = texture.width, height = texture.height; > + MTLRegion region = MTLRegionMake2D(0, 0, width, height); > + [texture getBytes:vram > + bytesPerRow:(width * 4) > + bytesPerImage:(width * height * 4) > + fromRegion:region > + mipmapLevel:0 > + slice:0]; > +}copy_mtl_texture_to_surface_mem > + > +static void apple_gfx_render_frame_completed(AppleGFXState *s, > + uint32_t width, uint32_t height) > +{ > + bql_lock(); > + --s->pending_frames; > + assert(s->pending_frames >= 0); > + > + /* Only update display if mode hasn't changed since we started rendering. 
*/ > + if (width == surface_width(s->surface) && > + height == surface_height(s->surface)) { > + copy_mtl_texture_to_surface_mem(s->texture, s->vram); > + if (s->gfx_update_requested) { > + s->gfx_update_requested = false; > + dpy_gfx_update_full(s->con); > + graphic_hw_update_done(s->con); > + s->new_frame_ready = false; > + } else { > + s->new_frame_ready = true; > + } > + } > + if (s->pending_frames > 0) { > + apple_gfx_render_new_frame_bql_unlock(s); > + } else { > + bql_unlock(); > + } > +} > + > +static void apple_gfx_fb_update_display(void *opaque) > +{ > + AppleGFXState *s = opaque; > + > + assert(bql_locked()); > + if (s->new_frame_ready) { > + dpy_gfx_update_full(s->con); > + s->new_frame_ready = false; > + graphic_hw_update_done(s->con); > + } else if (s->pending_frames > 0) { > + s->gfx_update_requested = true; > + } else { > + graphic_hw_update_done(s->con); > + } > +} > + > +static const GraphicHwOps apple_gfx_fb_ops = { > + .gfx_update = apple_gfx_fb_update_display, > + .gfx_update_async = true, > +}; > + > +static void update_cursor(AppleGFXState *s) > +{ > + assert(bql_locked()); > + dpy_mouse_set(s->con, s->pgdisp.cursorPosition.x, > + s->pgdisp.cursorPosition.y, s->cursor_show); > +} > + > +static void set_mode(AppleGFXState *s, uint32_t width, uint32_t height) > +{ > + MTLTextureDescriptor *textureDescriptor; > + > + if (s->surface && > + width == surface_width(s->surface) && > + height == surface_height(s->surface)) { > + return; > + } > + > + g_free(s->vram); > + [s->texture release]; > + > + s->vram = g_malloc0_n(width * height, 4); > + s->surface = qemu_create_displaysurface_from(width, height, PIXMAN_LE_a8r8g8b8, > + width * 4, s->vram);> +> + @autoreleasepool { > + textureDescriptor = > + [MTLTextureDescriptor > + texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm > + width:width > + height:height > + mipmapped:NO]; > + textureDescriptor.usage = s->pgdisp.minimumTextureUsage; > + s->texture = [s->mtl newTextureWithDescriptor:textureDescriptor]; What about creating pixman_image_t from s->texture.buffer.contents? This should save memory usage by removing the duplication of texture. > + } > + > + s->using_managed_texture_storage = > + (s->texture.storageMode == MTLStorageModeManaged); > + dpy_gfx_replace_surface(s->con, s->surface); > +} > + > +static void create_fb(AppleGFXState *s) > +{ > + s->con = graphic_console_init(NULL, 0, &apple_gfx_fb_ops, s); > + set_mode(s, 1440, 1080); > + > + s->cursor_show = true; > +} > + > +static size_t apple_gfx_get_default_mmio_range_size(void) > +{ > + size_t mmio_range_size; > + @autoreleasepool { > + PGDeviceDescriptor *desc = [PGDeviceDescriptor new]; > + mmio_range_size = desc.mmioLength; > + [desc release]; > + } > + return mmio_range_size; > +} > + > +void apple_gfx_common_init(Object *obj, AppleGFXState *s, const char* obj_name) > +{ > + size_t mmio_range_size = apple_gfx_get_default_mmio_range_size(); > + > + trace_apple_gfx_common_init(obj_name, mmio_range_size); > + memory_region_init_io(&s->iomem_gfx, obj, &apple_gfx_ops, s, obj_name, > + mmio_range_size); > + > + /* TODO: PVG framework supports serialising device state: integrate it! 
*/ > +} > + > +typedef struct AppleGFXMapMemoryJob { > + AppleGFXState *state; > + PGTask_t *task; > + uint64_t virtual_offset; > + PGPhysicalMemoryRange_t *ranges; > + uint32_t range_count; > + bool read_only; > + bool success; > + bool done; > +} AppleGFXMapMemoryJob; > + > +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t guest_physical, > + uint64_t length, bool read_only) > +{ > + MemoryRegion *ram_region; > + uintptr_t host_address; > + hwaddr ram_region_offset = 0; > + hwaddr ram_region_length = length; > + > + ram_region = address_space_translate(&address_space_memory, > + guest_physical, > + &ram_region_offset, > + &ram_region_length, !read_only, > + MEMTXATTRS_UNSPECIFIED); Call memory_region_ref() so that it won't go away. > + > + if (!ram_region || ram_region_length < length || > + !memory_access_is_direct(ram_region, !read_only)) { > + return 0; > + } > + > + host_address = (mach_vm_address_t)memory_region_get_ram_ptr(ram_region); host_address is typed as uintptr_t, not mach_vm_address_t. > + if (host_address == 0) { > + return 0; > + } > + host_address += ram_region_offset; > + > + return host_address; > +} > + > +static void apple_gfx_map_memory(void *opaque) > +{ > + AppleGFXMapMemoryJob *job = opaque; > + AppleGFXState *s = job->state; > + PGTask_t *task = job->task; > + uint32_t range_count = job->range_count; > + uint64_t virtual_offset = job->virtual_offset; > + PGPhysicalMemoryRange_t *ranges = job->ranges; > + bool read_only = job->read_only; > + kern_return_t r; > + mach_vm_address_t target, source; > + vm_prot_t cur_protection, max_protection; > + bool success = true; > + > + g_assert(bql_locked()); > + > + trace_apple_gfx_map_memory(task, range_count, virtual_offset, read_only); > + for (int i = 0; i < range_count; i++) { > + PGPhysicalMemoryRange_t *range = &ranges[i]; > + > + target = task->address + virtual_offset; > + virtual_offset += range->physicalLength; > + > + trace_apple_gfx_map_memory_range(i, range->physicalAddress, > + range->physicalLength); > + > + source = apple_gfx_host_address_for_gpa_range(range->physicalAddress, > + range->physicalLength, > + read_only); > + if (source == 0) { > + success = false; > + continue; > + } > + > + MemoryRegion* alt_mr = NULL; > + mach_vm_address_t alt_source = (mach_vm_address_t)gpa2hva(&alt_mr, range->physicalAddress, range->physicalLength, NULL); > + g_assert(alt_source == source); Remove this; I guess this is for debugging. 
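On the memory_region_ref() and uintptr_t points above, a rough sketch of the shape apple_gfx_host_address_for_gpa_range() could take; note that the matching memory_region_unref() would need to happen when the mapping is torn down, which probably means handing the MemoryRegion back to the caller as well (not shown here):

    ram_region = address_space_translate(&address_space_memory, guest_physical,
                                         &ram_region_offset, &ram_region_length,
                                         !read_only, MEMTXATTRS_UNSPECIFIED);
    if (!ram_region || ram_region_length < length ||
        !memory_access_is_direct(ram_region, !read_only)) {
        return 0;
    }

    memory_region_ref(ram_region);  /* keep the RAM region alive while mapped */
    host_address = (uintptr_t)memory_region_get_ram_ptr(ram_region);
    if (host_address == 0) {
        memory_region_unref(ram_region);
        return 0;
    }
    return host_address + ram_region_offset;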
> + > + cur_protection = 0; > + max_protection = 0; > + // Map guest RAM at range->physicalAddress into PG task memory range > + r = mach_vm_remap(mach_task_self(), > + &target, range->physicalLength, vm_page_size - 1, > + VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE, > + mach_task_self(), > + source, false /* shared mapping, no copy */, > + &cur_protection, &max_protection, > + VM_INHERIT_COPY); > + trace_apple_gfx_remap(r, source, target); > + g_assert(r == KERN_SUCCESS); > + } > + > + qemu_mutex_lock(&s->job_mutex); > + job->success = success; > + job->done = true; > + qemu_cond_broadcast(&s->job_cond); > + qemu_mutex_unlock(&s->job_mutex); > +} > + > +void apple_gfx_await_bh_job(AppleGFXState *s, bool *job_done_flag) > +{ > + qemu_mutex_lock(&s->job_mutex); > + while (!*job_done_flag) { > + qemu_cond_wait(&s->job_cond, &s->job_mutex); > + } > + qemu_mutex_unlock(&s->job_mutex); > +} > + > +typedef struct AppleGFXReadMemoryJob { > + AppleGFXState *s; > + hwaddr physical_address; > + uint64_t length; > + void *dst; > + bool done; > +} AppleGFXReadMemoryJob; > + > +static void apple_gfx_do_read_memory(void *opaque) > +{ > + AppleGFXReadMemoryJob *job = opaque; > + AppleGFXState *s = job->s; > + > + cpu_physical_memory_read(job->physical_address, job->dst, job->length); Use: dma_memory_read() > + > + qemu_mutex_lock(&s->job_mutex); > + job->done = true; > + qemu_cond_broadcast(&s->job_cond); > + qemu_mutex_unlock(&s->job_mutex); > +} > + > +static void apple_gfx_read_memory(AppleGFXState *s, hwaddr physical_address, > + uint64_t length, void *dst) > +{ > + AppleGFXReadMemoryJob job = { > + s, physical_address, length, dst > + }; > + > + trace_apple_gfx_read_memory(physical_address, length, dst); > + > + /* Traversing the memory map requires RCU/BQL, so do it in a BH. */ > + aio_bh_schedule_oneshot(qemu_get_aio_context(), apple_gfx_do_read_memory, > + &job); > + apple_gfx_await_bh_job(s, &job.done); > +} > + > +static void apple_gfx_register_task_mapping_handlers(AppleGFXState *s, > + PGDeviceDescriptor *desc) > +{ > + desc.createTask = ^(uint64_t vmSize, void * _Nullable * _Nonnull baseAddress) { > + PGTask_t *task = apple_gfx_new_task(s, vmSize); > + *baseAddress = (void *)task->address; > + trace_apple_gfx_create_task(vmSize, *baseAddress); > + return task; > + }; > + > + desc.destroyTask = ^(PGTask_t * _Nonnull task) { > + trace_apple_gfx_destroy_task(task); > + QTAILQ_REMOVE(&s->tasks, task, node); > + mach_vm_deallocate(mach_task_self(), task->address, task->len); > + g_free(task); > + }; > + > + desc.mapMemory = ^bool(PGTask_t * _Nonnull task, uint32_t range_count, > + uint64_t virtual_offset, bool read_only, > + PGPhysicalMemoryRange_t * _Nonnull ranges) { > + AppleGFXMapMemoryJob job = { > + .state = s, > + .task = task, .ranges = ranges, .range_count = range_count, > + .read_only = read_only, .virtual_offset = virtual_offset, > + .done = false, .success = true, > + }; > + if (range_count > 0) { > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > + apple_gfx_map_memory, &job); > + apple_gfx_await_bh_job(s, &job.done); > + } > + return job.success; > + }; > + > + desc.unmapMemory = ^bool(PGTask_t * _Nonnull task, uint64_t virtualOffset, > + uint64_t length) { > + kern_return_t r; > + mach_vm_address_t range_address; > + > + trace_apple_gfx_unmap_memory(task, virtualOffset, length); > + > + /* Replace task memory range with fresh pages, undoing the mapping > + * from guest RAM. 
*/ > + range_address = task->address + virtualOffset; > + r = mach_vm_allocate(mach_task_self(), &range_address, length, > + VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE); > + g_assert(r == KERN_SUCCESS);error_setg An extra error_setg > + > + return true; > + }; > + > + desc.readMemory = ^bool(uint64_t physical_address, uint64_t length, > + void * _Nonnull dst) { > + apple_gfx_read_memory(s, physical_address, length, dst); > + return true; > + }; > +} > + > +static PGDisplayDescriptor *apple_gfx_prepare_display_descriptor(AppleGFXState *s) > +{ > + PGDisplayDescriptor *disp_desc = [PGDisplayDescriptor new]; > + > + disp_desc.name = @"QEMU display"; > + disp_desc.sizeInMillimeters = NSMakeSize(400., 300.); /* A 20" display */ > + disp_desc.queue = dispatch_get_main_queue(); > + disp_desc.newFrameEventHandler = ^(void) { > + trace_apple_gfx_new_frame(); > + dispatch_async(s->render_queue, ^{ > + /* Drop frames if we get too far ahead. */ > + bql_lock(); > + if (s->pending_frames >= 2) { > + bql_unlock(); > + return; > + } > + ++s->pending_frames; > + if (s->pending_frames > 1) { > + bql_unlock(); > + return; > + } > + @autoreleasepool { > + apple_gfx_render_new_frame_bql_unlock(s); > + } > + }); > + }; > + disp_desc.modeChangeHandler = ^(PGDisplayCoord_t sizeInPixels, > + OSType pixelFormat) { > + trace_apple_gfx_mode_change(sizeInPixels.x, sizeInPixels.y); > + > + BQL_LOCK_GUARD(); > + set_mode(s, sizeInPixels.x, sizeInPixels.y); > + }; > + disp_desc.cursorGlyphHandler = ^(NSBitmapImageRep *glyph, > + PGDisplayCoord_t hotSpot) { > + [glyph retain]; > + dispatch_async(get_background_queue(), ^{ > + BQL_LOCK_GUARD(); > + uint32_t bpp = glyph.bitsPerPixel; > + size_t width = glyph.pixelsWide; > + size_t height = glyph.pixelsHigh; > + size_t padding_bytes_per_row = glyph.bytesPerRow - width * 4; > + const uint8_t* px_data = glyph.bitmapData; > + > + trace_apple_gfx_cursor_set(bpp, width, height); > + > + if (s->cursor) { > + cursor_unref(s->cursor); > + s->cursor = NULL; > + } > + > + if (bpp == 32) { /* Shouldn't be anything else, but just to be safe...*/ > + s->cursor = cursor_alloc(width, height); > + s->cursor->hot_x = hotSpot.x; > + s->cursor->hot_y = hotSpot.y; > + > + uint32_t *dest_px = s->cursor->data; > + > + for (size_t y = 0; y < height; ++y) { > + for (size_t x = 0; x < width; ++x) { > + /* NSBitmapImageRep's red & blue channels are swapped > + * compared to QEMUCursor's. 
*/ > + *dest_px = > + (px_data[0] << 16u) | > + (px_data[1] << 8u) | > + (px_data[2] << 0u) | > + (px_data[3] << 24u); > + ++dest_px; > + px_data += 4; > + } > + px_data += padding_bytes_per_row; > + } > + dpy_cursor_define(s->con, s->cursor); > + update_cursor(s); > + } > + [glyph release]; > + }); > + }; > + disp_desc.cursorShowHandler = ^(BOOL show) { > + dispatch_async(get_background_queue(), ^{ > + BQL_LOCK_GUARD(); > + trace_apple_gfx_cursor_show(show); > + s->cursor_show = show; > + update_cursor(s); > + }); > + }; > + disp_desc.cursorMoveHandler = ^(void) { > + dispatch_async(get_background_queue(), ^{ > + BQL_LOCK_GUARD(); > + trace_apple_gfx_cursor_move(); > + update_cursor(s); > + }); > + }; > + > + return disp_desc; > +} > + > +static NSArray<PGDisplayMode*>* apple_gfx_prepare_display_mode_array(void) > +{ > + PGDisplayMode *modes[ARRAY_SIZE(apple_gfx_modes)]; > + NSArray<PGDisplayMode*>* mode_array = nil; > + int i; > + > + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { > + modes[i] = > + [[PGDisplayMode alloc] initWithSizeInPixels:apple_gfx_modes[i] refreshRateInHz:60.]; > + } > + > + mode_array = [NSArray arrayWithObjects:modes count:ARRAY_SIZE(apple_gfx_modes)]; > + > + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { > + [modes[i] release]; > + modes[i] = nil; > + } > + > + return mode_array; > +} > + > +static id<MTLDevice> copy_suitable_metal_device(void) > +{ > + id<MTLDevice> dev = nil; > + NSArray<id<MTLDevice>> *devs = MTLCopyAllDevices(); > + > + /* Prefer a unified memory GPU. Failing that, pick a non-removable GPU. */ > + for (size_t i = 0; i < devs.count; ++i) { > + if (devs[i].hasUnifiedMemory) { > + dev = devs[i]; > + break; > + } > + if (!devs[i].removable) { > + dev = devs[i]; > + } > + } > + > + if (dev != nil) { > + [dev retain]; > + } else { > + dev = MTLCreateSystemDefaultDevice(); > + } > + [devs release]; > + > + return dev; > +} > + > +void apple_gfx_common_realize(AppleGFXState *s, PGDeviceDescriptor *desc, > + Error **errp) > +{ > + PGDisplayDescriptor *disp_desc = nil; > + > + if (apple_gfx_mig_blocker == NULL) { > + error_setg(&apple_gfx_mig_blocker, > + "Migration state blocked by apple-gfx display device"); > + if (migrate_add_blocker(&apple_gfx_mig_blocker, errp) < 0) { > + return; > + } > + } > + > + QTAILQ_INIT(&s->tasks); > + s->render_queue = dispatch_queue_create("apple-gfx.render", > + DISPATCH_QUEUE_SERIAL); > + s->mtl = copy_suitable_metal_device(); > + s->mtl_queue = [s->mtl newCommandQueue]; > + > + desc.device = s->mtl; > + > + apple_gfx_register_task_mapping_handlers(s, desc); > + > + s->pgdev = PGNewDeviceWithDescriptor(desc); > + > + disp_desc = apple_gfx_prepare_display_descriptor(s); > + s->pgdisp = [s->pgdev newDisplayWithDescriptor:disp_desc > + port:0 serialNum:1234]; > + [disp_desc release]; > + s->pgdisp.modeList = apple_gfx_prepare_display_mode_array(); > + > + create_fb(s); > + > + qemu_mutex_init(&s->job_mutex); > + qemu_cond_init(&s->job_cond); > +} > diff --git a/hw/display/meson.build b/hw/display/meson.build > index 20a94973fa2..619e642905a 100644 > --- a/hw/display/meson.build > +++ b/hw/display/meson.build > @@ -61,6 +61,10 @@ system_ss.add(when: 'CONFIG_ARTIST', if_true: files('artist.c')) > > system_ss.add(when: 'CONFIG_ATI_VGA', if_true: [files('ati.c', 'ati_2d.c', 'ati_dbg.c'), pixman]) > > +system_ss.add(when: 'CONFIG_MAC_PVG', if_true: [files('apple-gfx.m'), pvg, metal]) > +if cpu == 'aarch64' > + system_ss.add(when: 'CONFIG_MAC_PVG_MMIO', if_true: [files('apple-gfx-mmio.m'), pvg, metal]) > +endif > > 
if config_all_devices.has_key('CONFIG_VIRTIO_GPU') > virtio_gpu_ss = ss.source_set() > diff --git a/hw/display/trace-events b/hw/display/trace-events > index 781f8a33203..214998312b9 100644 > --- a/hw/display/trace-events > +++ b/hw/display/trace-events > @@ -191,3 +191,29 @@ dm163_bits_ppi(unsigned dest_width) "dest_width : %u" > dm163_leds(int led, uint32_t value) "led %d: 0x%x" > dm163_channels(int channel, uint8_t value) "channel %d: 0x%x" > dm163_refresh_rate(uint32_t rr) "refresh rate %d" > + > +# apple-gfx.m > +apple_gfx_read(uint64_t offset, uint64_t res) "offset=0x%"PRIx64" res=0x%"PRIx64 > +apple_gfx_write(uint64_t offset, uint64_t val) "offset=0x%"PRIx64" val=0x%"PRIx64 > +apple_gfx_create_task(uint32_t vm_size, void *va) "vm_size=0x%x base_addr=%p" > +apple_gfx_destroy_task(void *task) "task=%p" > +apple_gfx_map_memory(void *task, uint32_t range_count, uint64_t virtual_offset, uint32_t read_only) "task=%p range_count=0x%x virtual_offset=0x%"PRIx64" read_only=%d" > +apple_gfx_map_memory_range(uint32_t i, uint64_t phys_addr, uint64_t phys_len) "[%d] phys_addr=0x%"PRIx64" phys_len=0x%"PRIx64 > +apple_gfx_remap(uint64_t retval, uint64_t source, uint64_t target) "retval=%"PRId64" source=0x%"PRIx64" target=0x%"PRIx64 > +apple_gfx_unmap_memory(void *task, uint64_t virtual_offset, uint64_t length) "task=%p virtual_offset=0x%"PRIx64" length=0x%"PRIx64 > +apple_gfx_read_memory(uint64_t phys_address, uint64_t length, void *dst) "phys_addr=0x%"PRIx64" length=0x%"PRIx64" dest=%p" > +apple_gfx_raise_irq(uint32_t vector) "vector=0x%x" > +apple_gfx_new_frame(void) "" > +apple_gfx_mode_change(uint64_t x, uint64_t y) "x=%"PRId64" y=%"PRId64 > +apple_gfx_cursor_set(uint32_t bpp, uint64_t width, uint64_t height) "bpp=%d width=%"PRId64" height=0x%"PRId64 > +apple_gfx_cursor_show(uint32_t show) "show=%d" > +apple_gfx_cursor_move(void) "" > +apple_gfx_common_init(const char *device_name, size_t mmio_size) "device: %s; MMIO size: %zu bytes" > + > +# apple-gfx-mmio.m > +apple_gfx_mmio_iosfc_read(uint64_t offset, uint64_t res) "offset=0x%"PRIx64" res=0x%"PRIx64 > +apple_gfx_mmio_iosfc_write(uint64_t offset, uint64_t val) "offset=0x%"PRIx64" val=0x%"PRIx64 > +apple_gfx_iosfc_map_memory(uint64_t phys, uint64_t len, uint32_t ro, void *va, void *e, void *f, void* va_result, int success) "phys=0x%"PRIx64" len=0x%"PRIx64" ro=%d va=%p e=%p f=%p -> *va=%p, success = %d" > +apple_gfx_iosfc_unmap_memory(void *a, void *b, void *c, void *d, void *e, void *f) "a=%p b=%p c=%p d=%p e=%p f=%p" > +apple_gfx_iosfc_raise_irq(uint32_t vector) "vector=0x%x" > + > diff --git a/meson.build b/meson.build > index d26690ce204..0e124eff13f 100644 > --- a/meson.build > +++ b/meson.build > @@ -761,6 +761,8 @@ socket = [] > version_res = [] > coref = [] > iokit = [] > +pvg = [] > +metal = [] > emulator_link_args = [] > midl = not_found > widl = not_found > @@ -782,6 +784,8 @@ elif host_os == 'darwin' > coref = dependency('appleframeworks', modules: 'CoreFoundation') > iokit = dependency('appleframeworks', modules: 'IOKit', required: false) > host_dsosuf = '.dylib' > + pvg = dependency('appleframeworks', modules: 'ParavirtualizedGraphics') > + metal = dependency('appleframeworks', modules: 'Metal') > elif host_os == 'sunos' > socket = [cc.find_library('socket'), > cc.find_library('nsl'),
On Fri, 25 Oct 2024 at 08:03, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > On 2024/10/24 19:28, Phil Dennis-Jordan wrote: > > MacOS provides a framework (library) that allows any vmm to implement a > > paravirtualized 3d graphics passthrough to the host metal stack called > > ParavirtualizedGraphics.Framework (PVG). The library abstracts away > > almost every aspect of the paravirtualized device model and only provides > > and receives callbacks on MMIO access as well as to share memory address > > space between the VM and PVG. > > > > This patch implements a QEMU device that drives PVG for the VMApple > > variant of it. > > > > Signed-off-by: Alexander Graf <graf@amazon.com> > > Co-authored-by: Alexander Graf <graf@amazon.com> > > > > Subsequent changes: > > > > * Cherry-pick/rebase conflict fixes, API use updates. > > * Moved from hw/vmapple/ (useful outside that machine type) > > * Overhaul of threading model, many thread safety improvements. > > * Asynchronous rendering. > > * Memory and object lifetime fixes. > > * Refactoring to split generic and (vmapple) MMIO variant specific > > code. > > > > Signed-off-by: Phil Dennis-Jordan <phil@philjordan.eu> > > --- > > > > v2: > > > > * Cherry-pick/rebase conflict fixes > > * BQL function renaming > > * Moved from hw/vmapple/ (useful outside that machine type) > > * Code review comments: Switched to DEFINE_TYPES macro & little endian > > MMIO. > > * Removed some dead/superfluous code > > * Mad set_mode thread & memory safe > > * Added migration blocker due to lack of (de-)serialisation. > > * Fixes to ObjC refcounting and autorelease pool usage. > > * Fixed ObjC new/init misuse > > * Switched to ObjC category extension for private property. > > * Simplified task memory mapping and made it thread safe. > > * Refactoring to split generic and vmapple MMIO variant specific > > code. > > * Switched to asynchronous MMIO writes on x86-64 > > * Rendering and graphics update are now done asynchronously > > * Fixed cursor handling > > * Coding convention fixes > > * Removed software cursor compositing > > > > v3: > > > > * Rebased on latest upstream, fixed breakages including switching to > Resettable methods. > > * Squashed patches dealing with dGPUs, MMIO area size, and GPU picking. > > * Allow re-entrant MMIO; this simplifies the code and solves the > divergence > > between x86-64 and arm64 variants. > > > > v4: > > > > * Renamed '-vmapple' device variant to '-mmio' > > * MMIO device type now requires aarch64 host and guest > > * Complete overhaul of the glue code for making Qemu's and > > ParavirtualizedGraphics.framework's threading and synchronisation > models > > work together. Calls into PVG are from dispatch queues while the > > BQL-holding initiating thread processes AIO context events; > callbacks from > > PVG are scheduled as BHs on the BQL/main AIO context, awaiting > completion > > where necessary. > > * Guest frame rendering state is covered by the BQL, with only the PVG > calls > > outside the lock, and serialised on the named render_queue. > > * Simplified logic for dropping frames in-flight during mode changes, > fixed > > bug in pending frames logic. > > * Addressed smaller code review notes such as: function naming, object > type > > declarations, type names/declarations/casts, code formatting, > #include > > order, over-cautious ObjC retain/release, what goes in init vs > realize, > > etc. 
> > > > > > hw/display/Kconfig | 9 + > > hw/display/apple-gfx-mmio.m | 284 ++++++++++++++ > > hw/display/apple-gfx.h | 58 +++ > > hw/display/apple-gfx.m | 713 ++++++++++++++++++++++++++++++++++++ > > hw/display/meson.build | 4 + > > hw/display/trace-events | 26 ++ > > meson.build | 4 + > > 7 files changed, 1098 insertions(+) > > create mode 100644 hw/display/apple-gfx-mmio.m > > create mode 100644 hw/display/apple-gfx.h > > create mode 100644 hw/display/apple-gfx.m > > > > diff --git a/hw/display/Kconfig b/hw/display/Kconfig > > index 2250c740078..6a9b7b19ada 100644 > > --- a/hw/display/Kconfig > > +++ b/hw/display/Kconfig > > @@ -140,3 +140,12 @@ config XLNX_DISPLAYPORT > > > > config DM163 > > bool > > + > > +config MAC_PVG > > + bool > > + default y > > + > > +config MAC_PVG_MMIO > > + bool > > + depends on MAC_PVG && AARCH64 > > + > > diff --git a/hw/display/apple-gfx-mmio.m b/hw/display/apple-gfx-mmio.m > > new file mode 100644 > > index 00000000000..06131bc23f1 > > --- /dev/null > > +++ b/hw/display/apple-gfx-mmio.m > > @@ -0,0 +1,284 @@ > > +/* > > + * QEMU Apple ParavirtualizedGraphics.framework device, MMIO (arm64) > variant > > + * > > + * Copyright © 2023 Amazon.com, Inc. or its affiliates. All Rights > Reserved. > > + * > > + * This work is licensed under the terms of the GNU GPL, version 2 or > later. > > + * See the COPYING file in the top-level directory. > > Use SPDX-License-Identifier. You can find some examples with grep. > This was already part of the file when I took it over from Alex. I've used SPDX on any new files I've started from scratch. (I can of course *add* the SPDX line here too.) > > + * > > + * ParavirtualizedGraphics.framework is a set of libraries that macOS > provides > > + * which implements 3d graphics passthrough to the host as well as a > > + * proprietary guest communication channel to drive it. This device > model > > + * implements support to drive that library from within QEMU as an > MMIO-based > > + * system device for macOS on arm64 VMs. > > + */ > > + > > +#include "qemu/osdep.h" > > +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> > > +#include "apple-gfx.h" > > +#include "monitor/monitor.h" > > +#include "hw/sysbus.h" > > +#include "hw/irq.h" > > +#include "trace.h" > > + > > +OBJECT_DECLARE_SIMPLE_TYPE(AppleGFXMMIOState, APPLE_GFX_MMIO) > > + > > +/* > > + * ParavirtualizedGraphics.Framework only ships header files for the PCI > > + * variant which does not include IOSFC descriptors and host devices. > We add > > + * their definitions here so that we can also work with the ARM version. 
> > + */ > > +typedef bool(^IOSFCRaiseInterrupt)(uint32_t vector); > > +typedef bool(^IOSFCUnmapMemory)( > > + void *, void *, void *, void *, void *, void *); > > +typedef bool(^IOSFCMapMemory)( > > + uint64_t phys, uint64_t len, bool ro, void **va, void *, void *); > > + > > +@interface PGDeviceDescriptor (IOSurfaceMapper) > > +@property (readwrite, nonatomic) bool usingIOSurfaceMapper; > > +@end > > + > > +@interface PGIOSurfaceHostDeviceDescriptor : NSObject > > +-(PGIOSurfaceHostDeviceDescriptor *)init; > > +@property (readwrite, nonatomic, copy, nullable) IOSFCMapMemory > mapMemory; > > +@property (readwrite, nonatomic, copy, nullable) IOSFCUnmapMemory > unmapMemory; > > +@property (readwrite, nonatomic, copy, nullable) IOSFCRaiseInterrupt > raiseInterrupt; > > +@end > > + > > +@interface PGIOSurfaceHostDevice : NSObject > > +-(instancetype)initWithDescriptor:(PGIOSurfaceHostDeviceDescriptor > *)desc; > > +-(uint32_t)mmioReadAtOffset:(size_t)offset; > > +-(void)mmioWriteAtOffset:(size_t)offset value:(uint32_t)value; > > +@end > > + > > +struct AppleGFXMapSurfaceMemoryJob; > > +struct AppleGFXMMIOState { > > + SysBusDevice parent_obj; > > + > > + AppleGFXState common; > > + > > + qemu_irq irq_gfx; > > + qemu_irq irq_iosfc; > > + MemoryRegion iomem_iosfc; > > + PGIOSurfaceHostDevice *pgiosfc; > > +}; > > + > > +typedef struct AppleGFXMMIOJob { > > + AppleGFXMMIOState *state; > > + uint64_t offset; > > + uint64_t value; > > + bool completed; > > +} AppleGFXMMIOJob; > > + > > +static void iosfc_do_read(void *opaque) > > +{ > > + AppleGFXMMIOJob *job = opaque; > > + job->value = [job->state->pgiosfc mmioReadAtOffset:job->offset]; > > + qatomic_set(&job->completed, true); > > + aio_wait_kick(); > > +} > > + > > +static uint64_t iosfc_read(void *opaque, hwaddr offset, unsigned size) > > +{ > > + AppleGFXMMIOJob job = { > > + .state = opaque, > > + .offset = offset, > > + .completed = false, > > + }; > > + AioContext *context = qemu_get_aio_context(); > > + dispatch_queue_t queue = > dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); > > + > > + dispatch_async_f(queue, &job, iosfc_do_read); > > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > > Pass NULL as the first argument of AIO_WAIT_WHILE(). 
> > > + > > + trace_apple_gfx_mmio_iosfc_read(offset, job.value); > > + return job.value; > > +} > > + > > +static void iosfc_do_write(void *opaque) > > +{ > > + AppleGFXMMIOJob *job = opaque; > > + [job->state->pgiosfc mmioWriteAtOffset:job->offset > value:job->value]; > > + qatomic_set(&job->completed, true); > > + aio_wait_kick(); > > +} > > + > > +static void iosfc_write(void *opaque, hwaddr offset, uint64_t val, > > + unsigned size) > > +{ > > + AppleGFXMMIOJob job = { > > + .state = opaque, > > + .offset = offset, > > + .value = val, > > + .completed = false, > > + }; > > + AioContext *context = qemu_get_aio_context(); > > + dispatch_queue_t queue = > dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); > > + > > + dispatch_async_f(queue, &job, iosfc_do_write); > > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > > + > > + trace_apple_gfx_mmio_iosfc_write(offset, val); > > +} > > + > > +static const MemoryRegionOps apple_iosfc_ops = { > > + .read = iosfc_read, > > + .write = iosfc_write, > > + .endianness = DEVICE_LITTLE_ENDIAN, > > + .valid = { > > + .min_access_size = 4, > > + .max_access_size = 8, > > + }, > > + .impl = { > > + .min_access_size = 4, > > + .max_access_size = 8, > > + }, > > +}; > > + > > +static void raise_iosfc_irq(void *opaque) > > +{ > > + AppleGFXMMIOState *s = opaque; > > + > > + qemu_irq_pulse(s->irq_iosfc); > > +} > > + > > +typedef struct AppleGFXMapSurfaceMemoryJob { > > + uint64_t guest_physical_address; > > + uint64_t guest_physical_length; > > + void *result_mem; > > + AppleGFXMMIOState *state; > > + bool read_only; > > + bool success; > > + bool done; > > +} AppleGFXMapSurfaceMemoryJob; > > + > > +static void apple_gfx_mmio_map_surface_memory(void *opaque) > > +{ > > + AppleGFXMapSurfaceMemoryJob *job = opaque; > > + AppleGFXMMIOState *s = job->state; > > + mach_vm_address_t mem; > > + > > + mem = > apple_gfx_host_address_for_gpa_range(job->guest_physical_address, > > + > job->guest_physical_length, > > + job->read_only); > > + > > + qemu_mutex_lock(&s->common.job_mutex); > > + job->result_mem = (void*)mem; > > nit: write as (void *). 
> > > + job->success = mem != 0; > > + job->done = true; > > + qemu_cond_broadcast(&s->common.job_cond); > > + qemu_mutex_unlock(&s->common.job_mutex); > > +} > > + > > +static PGIOSurfaceHostDevice *apple_gfx_prepare_iosurface_host_device( > > + AppleGFXMMIOState *s) > > +{ > > + PGIOSurfaceHostDeviceDescriptor *iosfc_desc = > > + [PGIOSurfaceHostDeviceDescriptor new]; > > + PGIOSurfaceHostDevice *iosfc_host_dev = nil; > > + > > + iosfc_desc.mapMemory = > > + ^bool(uint64_t phys, uint64_t len, bool ro, void **va, void *e, > void *f) { > > + AppleGFXMapSurfaceMemoryJob job = { > > + .guest_physical_address = phys, .guest_physical_length > = len, > > + .read_only = ro, .state = s, > > + }; > > + > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > > + apple_gfx_mmio_map_surface_memory, > &job); > > + apple_gfx_await_bh_job(&s->common, &job.done); > > + > > + *va = job.result_mem; > > + > > + trace_apple_gfx_iosfc_map_memory(phys, len, ro, va, e, f, > *va, > > + job.success); > > + > > + return job.success; > > + }; > > + > > + iosfc_desc.unmapMemory = > > + ^bool(void *a, void *b, void *c, void *d, void *e, void *f) { > > + trace_apple_gfx_iosfc_unmap_memory(a, b, c, d, e, f); > > + return true; > > + }; > > + > > + iosfc_desc.raiseInterrupt = ^bool(uint32_t vector) { > > + trace_apple_gfx_iosfc_raise_irq(vector); > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > raise_iosfc_irq, s); > > Let's pass s->irq_iosfc here to unify raise_iosfc_irq() and > raise_gfx_irq(). > > > + return true; > > + }; > > + > > + iosfc_host_dev = > > + [[PGIOSurfaceHostDevice alloc] initWithDescriptor:iosfc_desc]; > > + [iosfc_desc release]; > > + return iosfc_host_dev; > > +} > > + > > +static void raise_gfx_irq(void *opaque) > > +{ > > + AppleGFXMMIOState *s = opaque; > > + > > + qemu_irq_pulse(s->irq_gfx); > > +} > > + > > +static void apple_gfx_mmio_realize(DeviceState *dev, Error **errp) > > +{ > > + @autoreleasepool { > > + AppleGFXMMIOState *s = APPLE_GFX_MMIO(dev); > > + PGDeviceDescriptor *desc = [PGDeviceDescriptor new]; > > + > > + desc.raiseInterrupt = ^(uint32_t vector) { > > + trace_apple_gfx_raise_irq(vector); > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > raise_gfx_irq, s); > > + }; > > + > > + desc.usingIOSurfaceMapper = true; > > + s->pgiosfc = apple_gfx_prepare_iosurface_host_device(s); > > + > > + apple_gfx_common_realize(&s->common, desc, errp); > > + [desc release]; > > + desc = nil; > > + } > > +} > > + > > +static void apple_gfx_mmio_init(Object *obj) > > +{ > > + AppleGFXMMIOState *s = APPLE_GFX_MMIO(obj); > > + > > + apple_gfx_common_init(obj, &s->common, TYPE_APPLE_GFX_MMIO); > > + > > + sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->common.iomem_gfx); > > + memory_region_init_io(&s->iomem_iosfc, obj, &apple_iosfc_ops, s, > > + TYPE_APPLE_GFX_MMIO, 0x10000); > > + sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->iomem_iosfc); > > + sysbus_init_irq(SYS_BUS_DEVICE(s), &s->irq_gfx); > > + sysbus_init_irq(SYS_BUS_DEVICE(s), &s->irq_iosfc); > > +} > > + > > +static void apple_gfx_mmio_reset(Object *obj, ResetType type) > > +{ > > + AppleGFXMMIOState *s = APPLE_GFX_MMIO(obj); > > + [s->common.pgdev reset]; > > +} > > + > > + > > +static void apple_gfx_mmio_class_init(ObjectClass *klass, void *data) > > +{ > > + DeviceClass *dc = DEVICE_CLASS(klass); > > + ResettableClass *rc = RESETTABLE_CLASS(klass); > > + > > + rc->phases.hold = apple_gfx_mmio_reset; > > + dc->hotpluggable = false; > > + dc->realize = apple_gfx_mmio_realize; > > +} > > + > > +static TypeInfo apple_gfx_mmio_types[] = { > > 
+ { > > + .name = TYPE_APPLE_GFX_MMIO, > > + .parent = TYPE_SYS_BUS_DEVICE, > > + .instance_size = sizeof(AppleGFXMMIOState), > > + .class_init = apple_gfx_mmio_class_init, > > + .instance_init = apple_gfx_mmio_init, > > + } > > +}; > > +DEFINE_TYPES(apple_gfx_mmio_types) > > diff --git a/hw/display/apple-gfx.h b/hw/display/apple-gfx.h > > new file mode 100644 > > index 00000000000..39931fba65a > > --- /dev/null > > +++ b/hw/display/apple-gfx.h > > @@ -0,0 +1,58 @@ > > +#ifndef QEMU_APPLE_GFX_H > > +#define QEMU_APPLE_GFX_H > > + > > +#define TYPE_APPLE_GFX_MMIO "apple-gfx-mmio" > > +#define TYPE_APPLE_GFX_PCI "apple-gfx-pci" > > + > > +#include "qemu/osdep.h" > > +#include <dispatch/dispatch.h> > > +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> > > +#include "qemu/typedefs.h" > > +#include "exec/memory.h" > > +#include "ui/surface.h" > > + > > +@class PGDeviceDescriptor; > > +@protocol PGDevice; > > +@protocol PGDisplay; > > +@protocol MTLDevice; > > +@protocol MTLTexture; > > +@protocol MTLCommandQueue; > > + > > +typedef QTAILQ_HEAD(, PGTask_s) PGTaskList; > > + > > +struct AppleGFXMapMemoryJob; > > Probably this declaration of AppleGFXMapMemoryJob is unnecessary. > > > +typedef struct AppleGFXState { > > + MemoryRegion iomem_gfx; > > + id<PGDevice> pgdev; > > + id<PGDisplay> pgdisp; > > + PGTaskList tasks; > > + QemuConsole *con; > > + id<MTLDevice> mtl; > > + id<MTLCommandQueue> mtl_queue; > > + bool cursor_show; > > + QEMUCursor *cursor; > > + > > + /* For running PVG memory-mapping requests in the AIO context */ > > + QemuCond job_cond; > > + QemuMutex job_mutex; > > Use: QemuEvent > Hmm. I think if we were to use that, we would need to create a new QemuEvent for every job and destroy it afterward, which seems expensive. We can't rule out multiple concurrent jobs being submitted, and the QemuEvent system only supports a single producer as far as I can tell. You can probably sort of hack around it with just one QemuEvent by putting the qemu_event_wait into a loop and turning the job.done flag into an atomic (because it would now need to be checked outside the lock) but this all seems unnecessarily complicated considering the QemuEvent uses the same mechanism QemuCond/QemuMutex internally on macOS (the only platform relevant here), except we can use it as intended with QemuCond/QemuMutex rather than having to work against the abstraction. > > + > > + dispatch_queue_t render_queue; > > + /* The following fields should only be accessed from the BQL: */ > > Perhaps it may be better to document fields that can be accessed > *without* the BQL; most things in QEMU implicitly require the BQL. > > > + bool gfx_update_requested; > > + bool new_frame_ready; > > + bool using_managed_texture_storage; > > +} AppleGFXState; > > + > > +void apple_gfx_common_init(Object *obj, AppleGFXState *s, const char* > obj_name); > > +void apple_gfx_common_realize(AppleGFXState *s, PGDeviceDescriptor > *desc, > > + Error **errp); > > +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t guest_physical, > > + uint64_t length, bool > read_only); > > +void apple_gfx_await_bh_job(AppleGFXState *s, bool *job_done_flag); > > + > > +#endif > > + > > diff --git a/hw/display/apple-gfx.m b/hw/display/apple-gfx.m > > new file mode 100644 > > index 00000000000..46be9957f69 > > --- /dev/null > > +++ b/hw/display/apple-gfx.m > > @@ -0,0 +1,713 @@ > > +/* > > + * QEMU Apple ParavirtualizedGraphics.framework device > > + * > > + * Copyright © 2023 Amazon.com, Inc. or its affiliates. All Rights > Reserved. 
> > + * > > + * This work is licensed under the terms of the GNU GPL, version 2 or > later. > > + * See the COPYING file in the top-level directory. > > + * > > + * ParavirtualizedGraphics.framework is a set of libraries that macOS > provides > > + * which implements 3d graphics passthrough to the host as well as a > > + * proprietary guest communication channel to drive it. This device > model > > + * implements support to drive that library from within QEMU. > > + */ > > + > > +#include "qemu/osdep.h" > > +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> > > +#include <mach/mach_vm.h> > > +#include "apple-gfx.h" > > +#include "trace.h" > > +#include "qemu-main.h" > > +#include "exec/address-spaces.h" > > +#include "migration/blocker.h" > > +#include "monitor/monitor.h" > > +#include "qemu/main-loop.h" > > +#include "qemu/cutils.h" > > +#include "qemu/log.h" > > +#include "qapi/visitor.h" > > +#include "qapi/error.h" > > +#include "ui/console.h" > > + > > +static const PGDisplayCoord_t apple_gfx_modes[] = { > > + { .x = 1440, .y = 1080 }, > > + { .x = 1280, .y = 1024 }, > > +}; > > + > > +/* This implements a type defined in > <ParavirtualizedGraphics/PGDevice.h> > > + * which is opaque from the framework's point of view. Typedef PGTask_t > already > > + * exists in the framework headers. */ > > +struct PGTask_s { > > + QTAILQ_ENTRY(PGTask_s) node; > > + mach_vm_address_t address; > > + uint64_t len; > > +}; > > + > > +static Error *apple_gfx_mig_blocker; > > This does not have to be a static variable. > Hmm, the first 5 or so examples of migration blockers in other devices etc. I could find were all declared in this way. What are you suggesting as the alternative? And why not use the same pattern as in most of the rest of the code base? > > + > > +static void apple_gfx_render_frame_completed(AppleGFXState *s, > > + uint32_t width, uint32_t > height); > > + > > +static inline dispatch_queue_t get_background_queue(void) > > Don't add inline. The only effect for modern compilers of inline is to > suppress the unused function warnings. > > > +{ > > + return dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, > 0); > > +} > > + > > +static PGTask_t *apple_gfx_new_task(AppleGFXState *s, uint64_t len) > > +{ > > + mach_vm_address_t task_mem; > > + PGTask_t *task; > > + kern_return_t r; > > + > > + r = mach_vm_allocate(mach_task_self(), &task_mem, len, > VM_FLAGS_ANYWHERE); > > + if (r != KERN_SUCCESS || task_mem == 0) { > > Let's remove the check for task_mem == 0. We have no reason to reject it > if the platform insists it allocated a memory at address 0 though such a > situation should never happen in practice. 
> > > + return NULL; > > + } > > + > > + task = g_new0(PGTask_t, 1); > > + > > + task->address = task_mem; > > + task->len = len; > > + QTAILQ_INSERT_TAIL(&s->tasks, task, node); > > + > > + return task; > > +} > > + > > +typedef struct AppleGFXIOJob { > > + AppleGFXState *state; > > + uint64_t offset; > > + uint64_t value; > > + bool completed; > > +} AppleGFXIOJob; > > + > > +static void apple_gfx_do_read(void *opaque) > > +{ > > + AppleGFXIOJob *job = opaque; > > + job->value = [job->state->pgdev mmioReadAtOffset:job->offset]; > > + qatomic_set(&job->completed, true); > > + aio_wait_kick(); > > +} > > + > > +static uint64_t apple_gfx_read(void *opaque, hwaddr offset, unsigned > size) > > +{ > > + AppleGFXIOJob job = { > > + .state = opaque, > > + .offset = offset, > > + .completed = false, > > + }; > > + AioContext *context = qemu_get_aio_context(); > > + dispatch_queue_t queue = get_background_queue(); > > + > > + dispatch_async_f(queue, &job, apple_gfx_do_read); > > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > > + > > + trace_apple_gfx_read(offset, job.value); > > + return job.value; > > +} > > + > > +static void apple_gfx_do_write(void *opaque) > > +{ > > + AppleGFXIOJob *job = opaque; > > + [job->state->pgdev mmioWriteAtOffset:job->offset value:job->value]; > > + qatomic_set(&job->completed, true); > > + aio_wait_kick(); > > +} > > + > > +static void apple_gfx_write(void *opaque, hwaddr offset, uint64_t val, > > + unsigned size) > > +{ > > + /* The methods mmioReadAtOffset: and especially mmioWriteAtOffset: > can > > + * trigger and block on operations on other dispatch queues, which > in turn > > + * may call back out on one or more of the callback blocks. For > this reason, > > + * and as we are holding the BQL, we invoke the I/O methods on a > pool > > + * thread and handle AIO tasks while we wait. Any work in the > callbacks > > + * requiring the BQL will in turn schedule BHs which this thread > will > > + * process while waiting. */ > > + AppleGFXIOJob job = { > > + .state = opaque, > > + .offset = offset, > > + .value = val, > > + .completed = false, > > + }; > > + AioContext *context = qemu_get_current_aio_context(); > > + dispatch_queue_t queue = get_background_queue(); > > + > > + dispatch_async_f(queue, &job, apple_gfx_do_write); > > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > > + > > + trace_apple_gfx_write(offset, val); > > +} > > + > > +static const MemoryRegionOps apple_gfx_ops = { > > + .read = apple_gfx_read, > > + .write = apple_gfx_write, > > + .endianness = DEVICE_LITTLE_ENDIAN, > > + .valid = { > > + .min_access_size = 4, > > + .max_access_size = 8, > > + }, > > + .impl = { > > + .min_access_size = 4, > > + .max_access_size = 4, > > + }, > > +}; > > + > > +static void apple_gfx_render_new_frame_bql_unlock(AppleGFXState *s) > > +{ > > + BOOL r; > > + uint32_t width = surface_width(s->surface); > > + uint32_t height = surface_height(s->surface); > > + MTLRegion region = MTLRegionMake2D(0, 0, width, height); > > + id<MTLCommandBuffer> command_buffer = [s->mtl_queue commandBuffer]; > > + id<MTLTexture> texture = s->texture; > > + > > + assert(bql_locked()); > > + [texture retain]; > > + > > + bql_unlock(); > > + > > + /* This is not safe to call from the BQL due to PVG-internal locks > causing > > + * deadlocks. 
*/ > > + r = [s->pgdisp encodeCurrentFrameToCommandBuffer:command_buffer > > + texture:texture > > + region:region]; > > + if (!r) { > > + [texture release]; > > + bql_lock(); > > + --s->pending_frames; > > + bql_unlock(); > > + qemu_log_mask(LOG_GUEST_ERROR, > "apple_gfx_render_new_frame_bql_unlock: " > > Use: __func__ > > > + > "encodeCurrentFrameToCommandBuffer:texture:region: failed\n"); > > + return; > > + } > > + > > + if (s->using_managed_texture_storage) { > > + /* "Managed" textures exist in both VRAM and RAM and must be > synced. */ > > + id<MTLBlitCommandEncoder> blit = [command_buffer > blitCommandEncoder]; > > + [blit synchronizeResource:texture]; > > + [blit endEncoding]; > > + } > > + [texture release]; > > + [command_buffer addCompletedHandler: > > + ^(id<MTLCommandBuffer> cb) > > + { > > + dispatch_async(s->render_queue, ^{ > > + apple_gfx_render_frame_completed(s, width, height); > > + }); > > + }]; > > + [command_buffer commit]; > > +} > > + > > +static void copy_mtl_texture_to_surface_mem(id<MTLTexture> texture, > void *vram) > > +{ > > + /* TODO: Skip this entirely on a pure Metal or headless/guest-only > > + * rendering path, else use a blit command encoder? Needs careful > > + * (double?) buffering design. */ > > + size_t width = texture.width, height = texture.height; > > + MTLRegion region = MTLRegionMake2D(0, 0, width, height); > > + [texture getBytes:vram > > + bytesPerRow:(width * 4) > > + bytesPerImage:(width * height * 4) > > + fromRegion:region > > + mipmapLevel:0 > > + slice:0]; > > +}copy_mtl_texture_to_surface_mem > > + > > +static void apple_gfx_render_frame_completed(AppleGFXState *s, > > + uint32_t width, uint32_t > height) > > +{ > > + bql_lock(); > > + --s->pending_frames; > > + assert(s->pending_frames >= 0); > > + > > + /* Only update display if mode hasn't changed since we started > rendering. 
*/ > > + if (width == surface_width(s->surface) && > > + height == surface_height(s->surface)) { > > + copy_mtl_texture_to_surface_mem(s->texture, s->vram); > > + if (s->gfx_update_requested) { > > + s->gfx_update_requested = false; > > + dpy_gfx_update_full(s->con); > > + graphic_hw_update_done(s->con); > > + s->new_frame_ready = false; > > + } else { > > + s->new_frame_ready = true; > > + } > > + } > > + if (s->pending_frames > 0) { > > + apple_gfx_render_new_frame_bql_unlock(s); > > + } else { > > + bql_unlock(); > > + } > > +} > > + > > +static void apple_gfx_fb_update_display(void *opaque) > > +{ > > + AppleGFXState *s = opaque; > > + > > + assert(bql_locked()); > > + if (s->new_frame_ready) { > > + dpy_gfx_update_full(s->con); > > + s->new_frame_ready = false; > > + graphic_hw_update_done(s->con); > > + } else if (s->pending_frames > 0) { > > + s->gfx_update_requested = true; > > + } else { > > + graphic_hw_update_done(s->con); > > + } > > +} > > + > > +static const GraphicHwOps apple_gfx_fb_ops = { > > + .gfx_update = apple_gfx_fb_update_display, > > + .gfx_update_async = true, > > +}; > > + > > +static void update_cursor(AppleGFXState *s) > > +{ > > + assert(bql_locked()); > > + dpy_mouse_set(s->con, s->pgdisp.cursorPosition.x, > > + s->pgdisp.cursorPosition.y, s->cursor_show); > > +} > > + > > +static void set_mode(AppleGFXState *s, uint32_t width, uint32_t height) > > +{ > > + MTLTextureDescriptor *textureDescriptor; > > + > > + if (s->surface && > > + width == surface_width(s->surface) && > > + height == surface_height(s->surface)) { > > + return; > > + } > > + > > + g_free(s->vram); > > + [s->texture release]; > > + > > + s->vram = g_malloc0_n(width * height, 4); > > + s->surface = qemu_create_displaysurface_from(width, height, > PIXMAN_LE_a8r8g8b8, > > + width * 4, s->vram);> > +> + @autoreleasepool { > > + textureDescriptor = > > + [MTLTextureDescriptor > > + > texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm > > + width:width > > + height:height > > + mipmapped:NO]; > > + textureDescriptor.usage = s->pgdisp.minimumTextureUsage; > > + s->texture = [s->mtl > newTextureWithDescriptor:textureDescriptor]; > > > What about creating pixman_image_t from s->texture.buffer.contents? This > should save memory usage by removing the duplication of texture. > We need explicit control over when the GPU vs when the CPU may access the texture - only one of them may access them at a time. As far as I can tell, we can't control when the rest of Qemu might access the pixman_image used in the console surface? 
> > + } > > + > > + s->using_managed_texture_storage = > > + (s->texture.storageMode == MTLStorageModeManaged); > > + dpy_gfx_replace_surface(s->con, s->surface); > > +} > > + > > +static void create_fb(AppleGFXState *s) > > +{ > > + s->con = graphic_console_init(NULL, 0, &apple_gfx_fb_ops, s); > > + set_mode(s, 1440, 1080); > > + > > + s->cursor_show = true; > > +} > > + > > +static size_t apple_gfx_get_default_mmio_range_size(void) > > +{ > > + size_t mmio_range_size; > > + @autoreleasepool { > > + PGDeviceDescriptor *desc = [PGDeviceDescriptor new]; > > + mmio_range_size = desc.mmioLength; > > + [desc release]; > > + } > > + return mmio_range_size; > > +} > > + > > +void apple_gfx_common_init(Object *obj, AppleGFXState *s, const char* > obj_name) > > +{ > > + size_t mmio_range_size = apple_gfx_get_default_mmio_range_size(); > > + > > + trace_apple_gfx_common_init(obj_name, mmio_range_size); > > + memory_region_init_io(&s->iomem_gfx, obj, &apple_gfx_ops, s, > obj_name, > > + mmio_range_size); > > + > > + /* TODO: PVG framework supports serialising device state: integrate > it! */ > > +} > > + > > +typedef struct AppleGFXMapMemoryJob { > > + AppleGFXState *state; > > + PGTask_t *task; > > + uint64_t virtual_offset; > > + PGPhysicalMemoryRange_t *ranges; > > + uint32_t range_count; > > + bool read_only; > > + bool success; > > + bool done; > > +} AppleGFXMapMemoryJob; > > + > > +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t guest_physical, > > + uint64_t length, bool > read_only) > > +{ > > + MemoryRegion *ram_region; > > + uintptr_t host_address; > > + hwaddr ram_region_offset = 0; > > + hwaddr ram_region_length = length; > > + > > + ram_region = address_space_translate(&address_space_memory, > > + guest_physical, > > + &ram_region_offset, > > + &ram_region_length, !read_only, > > + MEMTXATTRS_UNSPECIFIED); > > Call memory_region_ref() so that it won't go away. > > > + > > + if (!ram_region || ram_region_length < length || > > + !memory_access_is_direct(ram_region, !read_only)) { > > + return 0; > > + } > > + > > + host_address = > (mach_vm_address_t)memory_region_get_ram_ptr(ram_region); > > host_address is typed as uintptr_t, not mach_vm_address_t. 
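(Presumably the fix is just to drop the mach_vm_address_t cast, i.e. something like

    host_address = (uintptr_t)memory_region_get_ram_ptr(ram_region);

given that memory_region_get_ram_ptr() returns a plain void pointer anyway.)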
> > > + if (host_address == 0) { > > + return 0; > > + } > > + host_address += ram_region_offset; > > + > > + return host_address; > > +} > > + > > +static void apple_gfx_map_memory(void *opaque) > > +{ > > + AppleGFXMapMemoryJob *job = opaque; > > + AppleGFXState *s = job->state; > > + PGTask_t *task = job->task; > > + uint32_t range_count = job->range_count; > > + uint64_t virtual_offset = job->virtual_offset; > > + PGPhysicalMemoryRange_t *ranges = job->ranges; > > + bool read_only = job->read_only; > > + kern_return_t r; > > + mach_vm_address_t target, source; > > + vm_prot_t cur_protection, max_protection; > > + bool success = true; > > + > > + g_assert(bql_locked()); > > + > > + trace_apple_gfx_map_memory(task, range_count, virtual_offset, > read_only); > > + for (int i = 0; i < range_count; i++) { > > + PGPhysicalMemoryRange_t *range = &ranges[i]; > > + > > + target = task->address + virtual_offset; > > + virtual_offset += range->physicalLength; > > + > > + trace_apple_gfx_map_memory_range(i, range->physicalAddress, > > + range->physicalLength); > > + > > + source = > apple_gfx_host_address_for_gpa_range(range->physicalAddress, > > + > range->physicalLength, > > + read_only); > > + if (source == 0) { > > + success = false; > > + continue; > > + } > > + > > + MemoryRegion* alt_mr = NULL; > > + mach_vm_address_t alt_source = > (mach_vm_address_t)gpa2hva(&alt_mr, range->physicalAddress, > range->physicalLength, NULL); > > + g_assert(alt_source == source); > > Remove this; I guess this is for debugging. > > > + > > + cur_protection = 0; > > + max_protection = 0; > > + // Map guest RAM at range->physicalAddress into PG task memory > range > > + r = mach_vm_remap(mach_task_self(), > > + &target, range->physicalLength, vm_page_size > - 1, > > + VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE, > > + mach_task_self(), > > + source, false /* shared mapping, no copy */, > > + &cur_protection, &max_protection, > > + VM_INHERIT_COPY); > > + trace_apple_gfx_remap(r, source, target); > > + g_assert(r == KERN_SUCCESS); > > + } > > + > > + qemu_mutex_lock(&s->job_mutex); > > + job->success = success; > > + job->done = true; > > + qemu_cond_broadcast(&s->job_cond); > > + qemu_mutex_unlock(&s->job_mutex); > > +} > > + > > +void apple_gfx_await_bh_job(AppleGFXState *s, bool *job_done_flag) > > +{ > > + qemu_mutex_lock(&s->job_mutex); > > + while (!*job_done_flag) { > > + qemu_cond_wait(&s->job_cond, &s->job_mutex); > > + } > > + qemu_mutex_unlock(&s->job_mutex); > > +} > > + > > +typedef struct AppleGFXReadMemoryJob { > > + AppleGFXState *s; > > + hwaddr physical_address; > > + uint64_t length; > > + void *dst; > > + bool done; > > +} AppleGFXReadMemoryJob; > > + > > +static void apple_gfx_do_read_memory(void *opaque) > > +{ > > + AppleGFXReadMemoryJob *job = opaque; > > + AppleGFXState *s = job->s; > > + > > + cpu_physical_memory_read(job->physical_address, job->dst, > job->length); > > Use: dma_memory_read() > > > + > > + qemu_mutex_lock(&s->job_mutex); > > + job->done = true; > > + qemu_cond_broadcast(&s->job_cond); > > + qemu_mutex_unlock(&s->job_mutex); > > +} > > + > > +static void apple_gfx_read_memory(AppleGFXState *s, hwaddr > physical_address, > > + uint64_t length, void *dst) > > +{ > > + AppleGFXReadMemoryJob job = { > > + s, physical_address, length, dst > > + }; > > + > > + trace_apple_gfx_read_memory(physical_address, length, dst); > > + > > + /* Traversing the memory map requires RCU/BQL, so do it in a BH. 
*/ > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > apple_gfx_do_read_memory, > > + &job); > > + apple_gfx_await_bh_job(s, &job.done); > > +} > > + > > +static void apple_gfx_register_task_mapping_handlers(AppleGFXState *s, > > + PGDeviceDescriptor > *desc) > > +{ > > + desc.createTask = ^(uint64_t vmSize, void * _Nullable * _Nonnull > baseAddress) { > > + PGTask_t *task = apple_gfx_new_task(s, vmSize); > > + *baseAddress = (void *)task->address; > > + trace_apple_gfx_create_task(vmSize, *baseAddress); > > + return task; > > + }; > > + > > + desc.destroyTask = ^(PGTask_t * _Nonnull task) { > > + trace_apple_gfx_destroy_task(task); > > + QTAILQ_REMOVE(&s->tasks, task, node); > > + mach_vm_deallocate(mach_task_self(), task->address, task->len); > > + g_free(task); > > + }; > > + > > + desc.mapMemory = ^bool(PGTask_t * _Nonnull task, uint32_t > range_count, > > + uint64_t virtual_offset, bool read_only, > > + PGPhysicalMemoryRange_t * _Nonnull ranges) { > > + AppleGFXMapMemoryJob job = { > > + .state = s, > > + .task = task, .ranges = ranges, .range_count = range_count, > > + .read_only = read_only, .virtual_offset = virtual_offset, > > + .done = false, .success = true, > > + }; > > + if (range_count > 0) { > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > > + apple_gfx_map_memory, &job); > > + apple_gfx_await_bh_job(s, &job.done); > > + } > > + return job.success; > > + }; > > + > > + desc.unmapMemory = ^bool(PGTask_t * _Nonnull task, uint64_t > virtualOffset, > > + uint64_t length) { > > + kern_return_t r; > > + mach_vm_address_t range_address; > > + > > + trace_apple_gfx_unmap_memory(task, virtualOffset, length); > > + > > + /* Replace task memory range with fresh pages, undoing the > mapping > > + * from guest RAM. */ > > + range_address = task->address + virtualOffset; > > + r = mach_vm_allocate(mach_task_self(), &range_address, length, > > + VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE); > > + g_assert(r == KERN_SUCCESS);error_setg > > An extra error_setg > > > + > > + return true; > > + }; > > + > > + desc.readMemory = ^bool(uint64_t physical_address, uint64_t length, > > + void * _Nonnull dst) { > > + apple_gfx_read_memory(s, physical_address, length, dst); > > + return true; > > + }; > > +} > > + > > +static PGDisplayDescriptor > *apple_gfx_prepare_display_descriptor(AppleGFXState *s) > > +{ > > + PGDisplayDescriptor *disp_desc = [PGDisplayDescriptor new]; > > + > > + disp_desc.name = @"QEMU display"; > > + disp_desc.sizeInMillimeters = NSMakeSize(400., 300.); /* A 20" > display */ > > + disp_desc.queue = dispatch_get_main_queue(); > > + disp_desc.newFrameEventHandler = ^(void) { > > + trace_apple_gfx_new_frame(); > > + dispatch_async(s->render_queue, ^{ > > + /* Drop frames if we get too far ahead. 
*/ > > + bql_lock(); > > + if (s->pending_frames >= 2) { > > + bql_unlock(); > > + return; > > + } > > + ++s->pending_frames; > > + if (s->pending_frames > 1) { > > + bql_unlock(); > > + return; > > + } > > + @autoreleasepool { > > + apple_gfx_render_new_frame_bql_unlock(s); > > + } > > + }); > > + }; > > + disp_desc.modeChangeHandler = ^(PGDisplayCoord_t sizeInPixels, > > + OSType pixelFormat) { > > + trace_apple_gfx_mode_change(sizeInPixels.x, sizeInPixels.y); > > + > > + BQL_LOCK_GUARD(); > > + set_mode(s, sizeInPixels.x, sizeInPixels.y); > > + }; > > + disp_desc.cursorGlyphHandler = ^(NSBitmapImageRep *glyph, > > + PGDisplayCoord_t hotSpot) { > > + [glyph retain]; > > + dispatch_async(get_background_queue(), ^{ > > + BQL_LOCK_GUARD(); > > + uint32_t bpp = glyph.bitsPerPixel; > > + size_t width = glyph.pixelsWide; > > + size_t height = glyph.pixelsHigh; > > + size_t padding_bytes_per_row = glyph.bytesPerRow - width * > 4; > > + const uint8_t* px_data = glyph.bitmapData; > > + > > + trace_apple_gfx_cursor_set(bpp, width, height); > > + > > + if (s->cursor) { > > + cursor_unref(s->cursor); > > + s->cursor = NULL; > > + } > > + > > + if (bpp == 32) { /* Shouldn't be anything else, but just to > be safe...*/ > > + s->cursor = cursor_alloc(width, height); > > + s->cursor->hot_x = hotSpot.x; > > + s->cursor->hot_y = hotSpot.y; > > + > > + uint32_t *dest_px = s->cursor->data; > > + > > + for (size_t y = 0; y < height; ++y) { > > + for (size_t x = 0; x < width; ++x) { > > + /* NSBitmapImageRep's red & blue channels are > swapped > > + * compared to QEMUCursor's. */ > > + *dest_px = > > + (px_data[0] << 16u) | > > + (px_data[1] << 8u) | > > + (px_data[2] << 0u) | > > + (px_data[3] << 24u); > > + ++dest_px; > > + px_data += 4; > > + } > > + px_data += padding_bytes_per_row; > > + } > > + dpy_cursor_define(s->con, s->cursor); > > + update_cursor(s); > > + } > > + [glyph release]; > > + }); > > + }; > > + disp_desc.cursorShowHandler = ^(BOOL show) { > > + dispatch_async(get_background_queue(), ^{ > > + BQL_LOCK_GUARD(); > > + trace_apple_gfx_cursor_show(show); > > + s->cursor_show = show; > > + update_cursor(s); > > + }); > > + }; > > + disp_desc.cursorMoveHandler = ^(void) { > > + dispatch_async(get_background_queue(), ^{ > > + BQL_LOCK_GUARD(); > > + trace_apple_gfx_cursor_move(); > > + update_cursor(s); > > + }); > > + }; > > + > > + return disp_desc; > > +} > > + > > +static NSArray<PGDisplayMode*>* > apple_gfx_prepare_display_mode_array(void) > > +{ > > + PGDisplayMode *modes[ARRAY_SIZE(apple_gfx_modes)]; > > + NSArray<PGDisplayMode*>* mode_array = nil; > > + int i; > > + > > + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { > > + modes[i] = > > + [[PGDisplayMode alloc] > initWithSizeInPixels:apple_gfx_modes[i] refreshRateInHz:60.]; > > + } > > + > > + mode_array = [NSArray arrayWithObjects:modes > count:ARRAY_SIZE(apple_gfx_modes)]; > > + > > + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { > > + [modes[i] release]; > > + modes[i] = nil; > > + } > > + > > + return mode_array; > > +} > > + > > +static id<MTLDevice> copy_suitable_metal_device(void) > > +{ > > + id<MTLDevice> dev = nil; > > + NSArray<id<MTLDevice>> *devs = MTLCopyAllDevices(); > > + > > + /* Prefer a unified memory GPU. Failing that, pick a non-removable > GPU. 
*/ > > + for (size_t i = 0; i < devs.count; ++i) { > > + if (devs[i].hasUnifiedMemory) { > > + dev = devs[i]; > > + break; > > + } > > + if (!devs[i].removable) { > > + dev = devs[i]; > > + } > > + } > > + > > + if (dev != nil) { > > + [dev retain]; > > + } else { > > + dev = MTLCreateSystemDefaultDevice(); > > + } > > + [devs release]; > > + > > + return dev; > > +} > > + > > +void apple_gfx_common_realize(AppleGFXState *s, PGDeviceDescriptor > *desc, > > + Error **errp) > > +{ > > + PGDisplayDescriptor *disp_desc = nil; > > + > > + if (apple_gfx_mig_blocker == NULL) { > > + error_setg(&apple_gfx_mig_blocker, > > + "Migration state blocked by apple-gfx display > device"); > > + if (migrate_add_blocker(&apple_gfx_mig_blocker, errp) < 0) { > > + return; > > + } > > + } > > + > > + QTAILQ_INIT(&s->tasks); > > + s->render_queue = dispatch_queue_create("apple-gfx.render", > > + DISPATCH_QUEUE_SERIAL); > > + s->mtl = copy_suitable_metal_device(); > > + s->mtl_queue = [s->mtl newCommandQueue]; > > + > > + desc.device = s->mtl; > > + > > + apple_gfx_register_task_mapping_handlers(s, desc); > > + > > + s->pgdev = PGNewDeviceWithDescriptor(desc); > > + > > + disp_desc = apple_gfx_prepare_display_descriptor(s); > > + s->pgdisp = [s->pgdev newDisplayWithDescriptor:disp_desc > > + port:0 serialNum:1234]; > > + [disp_desc release]; > > + s->pgdisp.modeList = apple_gfx_prepare_display_mode_array(); > > + > > + create_fb(s); > > + > > + qemu_mutex_init(&s->job_mutex); > > + qemu_cond_init(&s->job_cond); > > +} > > diff --git a/hw/display/meson.build b/hw/display/meson.build > > index 20a94973fa2..619e642905a 100644 > > --- a/hw/display/meson.build > > +++ b/hw/display/meson.build > > @@ -61,6 +61,10 @@ system_ss.add(when: 'CONFIG_ARTIST', if_true: > files('artist.c')) > > > > system_ss.add(when: 'CONFIG_ATI_VGA', if_true: [files('ati.c', > 'ati_2d.c', 'ati_dbg.c'), pixman]) > > > > +system_ss.add(when: 'CONFIG_MAC_PVG', if_true: > [files('apple-gfx.m'), pvg, metal]) > > +if cpu == 'aarch64' > > + system_ss.add(when: 'CONFIG_MAC_PVG_MMIO', if_true: > [files('apple-gfx-mmio.m'), pvg, metal]) > > +endif > > > > if config_all_devices.has_key('CONFIG_VIRTIO_GPU') > > virtio_gpu_ss = ss.source_set() > > diff --git a/hw/display/trace-events b/hw/display/trace-events > > index 781f8a33203..214998312b9 100644 > > --- a/hw/display/trace-events > > +++ b/hw/display/trace-events > > @@ -191,3 +191,29 @@ dm163_bits_ppi(unsigned dest_width) "dest_width : > %u" > > dm163_leds(int led, uint32_t value) "led %d: 0x%x" > > dm163_channels(int channel, uint8_t value) "channel %d: 0x%x" > > dm163_refresh_rate(uint32_t rr) "refresh rate %d" > > + > > +# apple-gfx.m > > +apple_gfx_read(uint64_t offset, uint64_t res) "offset=0x%"PRIx64" > res=0x%"PRIx64 > > +apple_gfx_write(uint64_t offset, uint64_t val) "offset=0x%"PRIx64" > val=0x%"PRIx64 > > +apple_gfx_create_task(uint32_t vm_size, void *va) "vm_size=0x%x > base_addr=%p" > > +apple_gfx_destroy_task(void *task) "task=%p" > > +apple_gfx_map_memory(void *task, uint32_t range_count, uint64_t > virtual_offset, uint32_t read_only) "task=%p range_count=0x%x > virtual_offset=0x%"PRIx64" read_only=%d" > > +apple_gfx_map_memory_range(uint32_t i, uint64_t phys_addr, uint64_t > phys_len) "[%d] phys_addr=0x%"PRIx64" phys_len=0x%"PRIx64 > > +apple_gfx_remap(uint64_t retval, uint64_t source, uint64_t target) > "retval=%"PRId64" source=0x%"PRIx64" target=0x%"PRIx64 > > +apple_gfx_unmap_memory(void *task, uint64_t virtual_offset, uint64_t > length) "task=%p virtual_offset=0x%"PRIx64" 
length=0x%"PRIx64 > > +apple_gfx_read_memory(uint64_t phys_address, uint64_t length, void > *dst) "phys_addr=0x%"PRIx64" length=0x%"PRIx64" dest=%p" > > +apple_gfx_raise_irq(uint32_t vector) "vector=0x%x" > > +apple_gfx_new_frame(void) "" > > +apple_gfx_mode_change(uint64_t x, uint64_t y) "x=%"PRId64" y=%"PRId64 > > +apple_gfx_cursor_set(uint32_t bpp, uint64_t width, uint64_t height) > "bpp=%d width=%"PRId64" height=0x%"PRId64 > > +apple_gfx_cursor_show(uint32_t show) "show=%d" > > +apple_gfx_cursor_move(void) "" > > +apple_gfx_common_init(const char *device_name, size_t mmio_size) > "device: %s; MMIO size: %zu bytes" > > + > > +# apple-gfx-mmio.m > > +apple_gfx_mmio_iosfc_read(uint64_t offset, uint64_t res) > "offset=0x%"PRIx64" res=0x%"PRIx64 > > +apple_gfx_mmio_iosfc_write(uint64_t offset, uint64_t val) > "offset=0x%"PRIx64" val=0x%"PRIx64 > > +apple_gfx_iosfc_map_memory(uint64_t phys, uint64_t len, uint32_t ro, > void *va, void *e, void *f, void* va_result, int success) "phys=0x%"PRIx64" > len=0x%"PRIx64" ro=%d va=%p e=%p f=%p -> *va=%p, success = %d" > > +apple_gfx_iosfc_unmap_memory(void *a, void *b, void *c, void *d, void > *e, void *f) "a=%p b=%p c=%p d=%p e=%p f=%p" > > +apple_gfx_iosfc_raise_irq(uint32_t vector) "vector=0x%x" > > + > > diff --git a/meson.build b/meson.build > > index d26690ce204..0e124eff13f 100644 > > --- a/meson.build > > +++ b/meson.build > > @@ -761,6 +761,8 @@ socket = [] > > version_res = [] > > coref = [] > > iokit = [] > > +pvg = [] > > +metal = [] > > emulator_link_args = [] > > midl = not_found > > widl = not_found > > @@ -782,6 +784,8 @@ elif host_os == 'darwin' > > coref = dependency('appleframeworks', modules: 'CoreFoundation') > > iokit = dependency('appleframeworks', modules: 'IOKit', required: > false) > > host_dsosuf = '.dylib' > > + pvg = dependency('appleframeworks', modules: > 'ParavirtualizedGraphics') > > + metal = dependency('appleframeworks', modules: 'Metal') > > elif host_os == 'sunos' > > socket = [cc.find_library('socket'), > > cc.find_library('nsl'), > >
On 2024/10/26 4:43, Phil Dennis-Jordan wrote: > > > On Fri, 25 Oct 2024 at 08:03, Akihiko Odaki <akihiko.odaki@daynix.com > <mailto:akihiko.odaki@daynix.com>> wrote: > > On 2024/10/24 19:28, Phil Dennis-Jordan wrote: > > MacOS provides a framework (library) that allows any vmm to > implement a > > paravirtualized 3d graphics passthrough to the host metal stack > called > > ParavirtualizedGraphics.Framework (PVG). The library abstracts away > > almost every aspect of the paravirtualized device model and only > provides > > and receives callbacks on MMIO access as well as to share memory > address > > space between the VM and PVG. > > > > This patch implements a QEMU device that drives PVG for the VMApple > > variant of it. > > > > Signed-off-by: Alexander Graf <graf@amazon.com > <mailto:graf@amazon.com>> > > Co-authored-by: Alexander Graf <graf@amazon.com > <mailto:graf@amazon.com>> > > > > Subsequent changes: > > > > * Cherry-pick/rebase conflict fixes, API use updates. > > * Moved from hw/vmapple/ (useful outside that machine type) > > * Overhaul of threading model, many thread safety improvements. > > * Asynchronous rendering. > > * Memory and object lifetime fixes. > > * Refactoring to split generic and (vmapple) MMIO variant specific > > code. > > > > Signed-off-by: Phil Dennis-Jordan <phil@philjordan.eu > <mailto:phil@philjordan.eu>> > > --- > > > > v2: > > > > * Cherry-pick/rebase conflict fixes > > * BQL function renaming > > * Moved from hw/vmapple/ (useful outside that machine type) > > * Code review comments: Switched to DEFINE_TYPES macro & little > endian > > MMIO. > > * Removed some dead/superfluous code > > * Mad set_mode thread & memory safe > > * Added migration blocker due to lack of (de-)serialisation. > > * Fixes to ObjC refcounting and autorelease pool usage. > > * Fixed ObjC new/init misuse > > * Switched to ObjC category extension for private property. > > * Simplified task memory mapping and made it thread safe. > > * Refactoring to split generic and vmapple MMIO variant specific > > code. > > * Switched to asynchronous MMIO writes on x86-64 > > * Rendering and graphics update are now done asynchronously > > * Fixed cursor handling > > * Coding convention fixes > > * Removed software cursor compositing > > > > v3: > > > > * Rebased on latest upstream, fixed breakages including > switching to Resettable methods. > > * Squashed patches dealing with dGPUs, MMIO area size, and GPU > picking. > > * Allow re-entrant MMIO; this simplifies the code and solves > the divergence > > between x86-64 and arm64 variants. > > > > v4: > > > > * Renamed '-vmapple' device variant to '-mmio' > > * MMIO device type now requires aarch64 host and guest > > * Complete overhaul of the glue code for making Qemu's and > > ParavirtualizedGraphics.framework's threading and > synchronisation models > > work together. Calls into PVG are from dispatch queues while the > > BQL-holding initiating thread processes AIO context events; > callbacks from > > PVG are scheduled as BHs on the BQL/main AIO context, > awaiting completion > > where necessary. > > * Guest frame rendering state is covered by the BQL, with only > the PVG calls > > outside the lock, and serialised on the named render_queue. > > * Simplified logic for dropping frames in-flight during mode > changes, fixed > > bug in pending frames logic. 
> > * Addressed smaller code review notes such as: function naming, > object type > > declarations, type names/declarations/casts, code formatting, > #include > > order, over-cautious ObjC retain/release, what goes in init > vs realize, > > etc. > > > > > > hw/display/Kconfig | 9 + > > hw/display/apple-gfx-mmio.m | 284 ++++++++++++++ > > hw/display/apple-gfx.h | 58 +++ > > hw/display/apple-gfx.m | 713 +++++++++++++++++++++++++++++ > +++++++ > > hw/display/meson.build | 4 + > > hw/display/trace-events | 26 ++ > > meson.build | 4 + > > 7 files changed, 1098 insertions(+) > > create mode 100644 hw/display/apple-gfx-mmio.m > > create mode 100644 hw/display/apple-gfx.h > > create mode 100644 hw/display/apple-gfx.m > > > > diff --git a/hw/display/Kconfig b/hw/display/Kconfig > > index 2250c740078..6a9b7b19ada 100644 > > --- a/hw/display/Kconfig > > +++ b/hw/display/Kconfig > > @@ -140,3 +140,12 @@ config XLNX_DISPLAYPORT > > > > config DM163 > > bool > > + > > +config MAC_PVG > > + bool > > + default y > > + > > +config MAC_PVG_MMIO > > + bool > > + depends on MAC_PVG && AARCH64 > > + > > diff --git a/hw/display/apple-gfx-mmio.m b/hw/display/apple-gfx- > mmio.m > > new file mode 100644 > > index 00000000000..06131bc23f1 > > --- /dev/null > > +++ b/hw/display/apple-gfx-mmio.m > > @@ -0,0 +1,284 @@ > > +/* > > + * QEMU Apple ParavirtualizedGraphics.framework device, MMIO > (arm64) variant > > + * > > + * Copyright © 2023 Amazon.com, Inc. or its affiliates. All > Rights Reserved. > > + * > > + * This work is licensed under the terms of the GNU GPL, version > 2 or later. > > + * See the COPYING file in the top-level directory. > > Use SPDX-License-Identifier. You can find some examples with grep. > > This was already part of the file when I took it over from Alex. I've > used SPDX on any new files I've started from scratch. (I can of course / > add/ the SPDX line here too.) Please add it here too. > > > + * > > + * ParavirtualizedGraphics.framework is a set of libraries that > macOS provides > > + * which implements 3d graphics passthrough to the host as well as a > > + * proprietary guest communication channel to drive it. This > device model > > + * implements support to drive that library from within QEMU as > an MMIO-based > > + * system device for macOS on arm64 VMs. > > + */ > > + > > +#include "qemu/osdep.h" > > +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> > > +#include "apple-gfx.h" > > +#include "monitor/monitor.h" > > +#include "hw/sysbus.h" > > +#include "hw/irq.h" > > +#include "trace.h" > > + > > +OBJECT_DECLARE_SIMPLE_TYPE(AppleGFXMMIOState, APPLE_GFX_MMIO) > > + > > +/* > > + * ParavirtualizedGraphics.Framework only ships header files for > the PCI > > + * variant which does not include IOSFC descriptors and host > devices. We add > > + * their definitions here so that we can also work with the ARM > version. 
> > + */ > > +typedef bool(^IOSFCRaiseInterrupt)(uint32_t vector); > > +typedef bool(^IOSFCUnmapMemory)( > > + void *, void *, void *, void *, void *, void *); > > +typedef bool(^IOSFCMapMemory)( > > + uint64_t phys, uint64_t len, bool ro, void **va, void *, > void *); > > + > > +@interface PGDeviceDescriptor (IOSurfaceMapper) > > +@property (readwrite, nonatomic) bool usingIOSurfaceMapper; > > +@end > > + > > +@interface PGIOSurfaceHostDeviceDescriptor : NSObject > > +-(PGIOSurfaceHostDeviceDescriptor *)init; > > +@property (readwrite, nonatomic, copy, nullable) IOSFCMapMemory > mapMemory; > > +@property (readwrite, nonatomic, copy, nullable) > IOSFCUnmapMemory unmapMemory; > > +@property (readwrite, nonatomic, copy, nullable) > IOSFCRaiseInterrupt raiseInterrupt; > > +@end > > + > > +@interface PGIOSurfaceHostDevice : NSObject > > +-(instancetype)initWithDescriptor: > (PGIOSurfaceHostDeviceDescriptor *)desc; > > +-(uint32_t)mmioReadAtOffset:(size_t)offset; > > +-(void)mmioWriteAtOffset:(size_t)offset value:(uint32_t)value; > > +@end > > + > > +struct AppleGFXMapSurfaceMemoryJob; > > +struct AppleGFXMMIOState { > > + SysBusDevice parent_obj; > > + > > + AppleGFXState common; > > + > > + qemu_irq irq_gfx; > > + qemu_irq irq_iosfc; > > + MemoryRegion iomem_iosfc; > > + PGIOSurfaceHostDevice *pgiosfc; > > +}; > > + > > +typedef struct AppleGFXMMIOJob { > > + AppleGFXMMIOState *state; > > + uint64_t offset; > > + uint64_t value; > > + bool completed; > > +} AppleGFXMMIOJob; > > + > > +static void iosfc_do_read(void *opaque) > > +{ > > + AppleGFXMMIOJob *job = opaque; > > + job->value = [job->state->pgiosfc mmioReadAtOffset:job->offset]; > > + qatomic_set(&job->completed, true); > > + aio_wait_kick(); > > +} > > + > > +static uint64_t iosfc_read(void *opaque, hwaddr offset, unsigned > size) > > +{ > > + AppleGFXMMIOJob job = { > > + .state = opaque, > > + .offset = offset, > > + .completed = false, > > + }; > > + AioContext *context = qemu_get_aio_context(); > > + dispatch_queue_t queue = > dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); > > + > > + dispatch_async_f(queue, &job, iosfc_do_read); > > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > > Pass NULL as the first argument of AIO_WAIT_WHILE(). 
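(For illustration, the suggested change would make the wait look roughly like this, assuming AIO_WAIT_WHILE() accepts a NULL context when called from the main loop thread:

    dispatch_async_f(queue, &job, iosfc_do_read);
    AIO_WAIT_WHILE(NULL, !qatomic_read(&job.completed));

and the local `context` variable could then be dropped.)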
> > > + > > + trace_apple_gfx_mmio_iosfc_read(offset, job.value); > > + return job.value; > > +} > > + > > +static void iosfc_do_write(void *opaque) > > +{ > > + AppleGFXMMIOJob *job = opaque; > > + [job->state->pgiosfc mmioWriteAtOffset:job->offset > value:job->value]; > > + qatomic_set(&job->completed, true); > > + aio_wait_kick(); > > +} > > + > > +static void iosfc_write(void *opaque, hwaddr offset, uint64_t val, > > + unsigned size) > > +{ > > + AppleGFXMMIOJob job = { > > + .state = opaque, > > + .offset = offset, > > + .value = val, > > + .completed = false, > > + }; > > + AioContext *context = qemu_get_aio_context(); > > + dispatch_queue_t queue = > dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); > > + > > + dispatch_async_f(queue, &job, iosfc_do_write); > > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > > + > > + trace_apple_gfx_mmio_iosfc_write(offset, val); > > +} > > + > > +static const MemoryRegionOps apple_iosfc_ops = { > > + .read = iosfc_read, > > + .write = iosfc_write, > > + .endianness = DEVICE_LITTLE_ENDIAN, > > + .valid = { > > + .min_access_size = 4, > > + .max_access_size = 8, > > + }, > > + .impl = { > > + .min_access_size = 4, > > + .max_access_size = 8, > > + }, > > +}; > > + > > +static void raise_iosfc_irq(void *opaque) > > +{ > > + AppleGFXMMIOState *s = opaque; > > + > > + qemu_irq_pulse(s->irq_iosfc); > > +} > > + > > +typedef struct AppleGFXMapSurfaceMemoryJob { > > + uint64_t guest_physical_address; > > + uint64_t guest_physical_length; > > + void *result_mem; > > + AppleGFXMMIOState *state; > > + bool read_only; > > + bool success; > > + bool done; > > +} AppleGFXMapSurfaceMemoryJob; > > + > > +static void apple_gfx_mmio_map_surface_memory(void *opaque) > > +{ > > + AppleGFXMapSurfaceMemoryJob *job = opaque; > > + AppleGFXMMIOState *s = job->state; > > + mach_vm_address_t mem; > > + > > + mem = apple_gfx_host_address_for_gpa_range(job- > >guest_physical_address, > > + job- > >guest_physical_length, > > + job->read_only); > > + > > + qemu_mutex_lock(&s->common.job_mutex); > > + job->result_mem = (void*)mem; > > nit: write as (void *). 
> > > + job->success = mem != 0; > > + job->done = true; > > + qemu_cond_broadcast(&s->common.job_cond); > > + qemu_mutex_unlock(&s->common.job_mutex); > > +} > > + > > +static PGIOSurfaceHostDevice > *apple_gfx_prepare_iosurface_host_device( > > + AppleGFXMMIOState *s) > > +{ > > + PGIOSurfaceHostDeviceDescriptor *iosfc_desc = > > + [PGIOSurfaceHostDeviceDescriptor new]; > > + PGIOSurfaceHostDevice *iosfc_host_dev = nil; > > + > > + iosfc_desc.mapMemory = > > + ^bool(uint64_t phys, uint64_t len, bool ro, void **va, > void *e, void *f) { > > + AppleGFXMapSurfaceMemoryJob job = { > > + .guest_physical_address = > phys, .guest_physical_length = len, > > + .read_only = ro, .state = s, > > + }; > > + > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > > + > apple_gfx_mmio_map_surface_memory, &job); > > + apple_gfx_await_bh_job(&s->common, &job.done); > > + > > + *va = job.result_mem; > > + > > + trace_apple_gfx_iosfc_map_memory(phys, len, ro, va, > e, f, *va, > > + job.success); > > + > > + return job.success; > > + }; > > + > > + iosfc_desc.unmapMemory = > > + ^bool(void *a, void *b, void *c, void *d, void *e, void > *f) { > > + trace_apple_gfx_iosfc_unmap_memory(a, b, c, d, e, f); > > + return true; > > + }; > > + > > + iosfc_desc.raiseInterrupt = ^bool(uint32_t vector) { > > + trace_apple_gfx_iosfc_raise_irq(vector); > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > raise_iosfc_irq, s); > > Let's pass s->irq_iosfc here to unify raise_iosfc_irq() and > raise_gfx_irq(). > > > + return true; > > + }; > > + > > + iosfc_host_dev = > > + [[PGIOSurfaceHostDevice alloc] > initWithDescriptor:iosfc_desc]; > > + [iosfc_desc release]; > > + return iosfc_host_dev; > > +} > > + > > +static void raise_gfx_irq(void *opaque) > > +{ > > + AppleGFXMMIOState *s = opaque; > > + > > + qemu_irq_pulse(s->irq_gfx); > > +} > > + > > +static void apple_gfx_mmio_realize(DeviceState *dev, Error **errp) > > +{ > > + @autoreleasepool { > > + AppleGFXMMIOState *s = APPLE_GFX_MMIO(dev); > > + PGDeviceDescriptor *desc = [PGDeviceDescriptor new]; > > + > > + desc.raiseInterrupt = ^(uint32_t vector) { > > + trace_apple_gfx_raise_irq(vector); > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > raise_gfx_irq, s); > > + }; > > + > > + desc.usingIOSurfaceMapper = true; > > + s->pgiosfc = apple_gfx_prepare_iosurface_host_device(s); > > + > > + apple_gfx_common_realize(&s->common, desc, errp); > > + [desc release]; > > + desc = nil; > > + } > > +} > > + > > +static void apple_gfx_mmio_init(Object *obj) > > +{ > > + AppleGFXMMIOState *s = APPLE_GFX_MMIO(obj); > > + > > + apple_gfx_common_init(obj, &s->common, TYPE_APPLE_GFX_MMIO); > > + > > + sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->common.iomem_gfx); > > + memory_region_init_io(&s->iomem_iosfc, obj, &apple_iosfc_ops, s, > > + TYPE_APPLE_GFX_MMIO, 0x10000); > > + sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->iomem_iosfc); > > + sysbus_init_irq(SYS_BUS_DEVICE(s), &s->irq_gfx); > > + sysbus_init_irq(SYS_BUS_DEVICE(s), &s->irq_iosfc); > > +} > > + > > +static void apple_gfx_mmio_reset(Object *obj, ResetType type) > > +{ > > + AppleGFXMMIOState *s = APPLE_GFX_MMIO(obj); > > + [s->common.pgdev reset]; > > +} > > + > > + > > +static void apple_gfx_mmio_class_init(ObjectClass *klass, void > *data) > > +{ > > + DeviceClass *dc = DEVICE_CLASS(klass); > > + ResettableClass *rc = RESETTABLE_CLASS(klass); > > + > > + rc->phases.hold = apple_gfx_mmio_reset; > > + dc->hotpluggable = false; > > + dc->realize = apple_gfx_mmio_realize; > > +} > > + > > +static TypeInfo apple_gfx_mmio_types[] 
= { > > + { > > + .name = TYPE_APPLE_GFX_MMIO, > > + .parent = TYPE_SYS_BUS_DEVICE, > > + .instance_size = sizeof(AppleGFXMMIOState), > > + .class_init = apple_gfx_mmio_class_init, > > + .instance_init = apple_gfx_mmio_init, > > + } > > +}; > > +DEFINE_TYPES(apple_gfx_mmio_types) > > diff --git a/hw/display/apple-gfx.h b/hw/display/apple-gfx.h > > new file mode 100644 > > index 00000000000..39931fba65a > > --- /dev/null > > +++ b/hw/display/apple-gfx.h > > @@ -0,0 +1,58 @@ > > +#ifndef QEMU_APPLE_GFX_H > > +#define QEMU_APPLE_GFX_H > > + > > +#define TYPE_APPLE_GFX_MMIO "apple-gfx-mmio" > > +#define TYPE_APPLE_GFX_PCI "apple-gfx-pci" > > + > > +#include "qemu/osdep.h" > > +#include <dispatch/dispatch.h> > > +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> > > +#include "qemu/typedefs.h" > > +#include "exec/memory.h" > > +#include "ui/surface.h" > > + > > +@class PGDeviceDescriptor; > > +@protocol PGDevice; > > +@protocol PGDisplay; > > +@protocol MTLDevice; > > +@protocol MTLTexture; > > +@protocol MTLCommandQueue; > > + > > +typedef QTAILQ_HEAD(, PGTask_s) PGTaskList; > > + > > +struct AppleGFXMapMemoryJob; > > Probably this declaration of AppleGFXMapMemoryJob is unnecessary. > > > +typedef struct AppleGFXState { > > + MemoryRegion iomem_gfx; > > + id<PGDevice> pgdev; > > + id<PGDisplay> pgdisp; > > + PGTaskList tasks; > > + QemuConsole *con; > > + id<MTLDevice> mtl; > > + id<MTLCommandQueue> mtl_queue; > > + bool cursor_show; > > + QEMUCursor *cursor; > > + > > + /* For running PVG memory-mapping requests in the AIO context */ > > + QemuCond job_cond; > > + QemuMutex job_mutex; > > Use: QemuEvent > > > Hmm. I think if we were to use that, we would need to create a new > QemuEvent for every job and destroy it afterward, which seems expensive. > We can't rule out multiple concurrent jobs being submitted, and the > QemuEvent system only supports a single producer as far as I can tell. > > You can probably sort of hack around it with just one QemuEvent by > putting the qemu_event_wait into a loop and turning the job.done flag > into an atomic (because it would now need to be checked outside the > lock) but this all seems unnecessarily complicated considering the > QemuEvent uses the same mechanism QemuCond/QemuMutex internally on macOS > (the only platform relevant here), except we can use it as intended with > QemuCond/QemuMutex rather than having to work against the abstraction. I don't think it's going to be used concurrently. It would be difficult to reason even for the framework if it performs memory unmapping/mapping/reading operations concurrently. PGDevice.h also notes raiseInterrupt needs to be thread-safe while it doesn't make such notes for memory operations. This actually makes sense. If it's ever going to be used concurrently, it's better to have QemuEvent for each job to avoid the thundering herd problem. > > > + > > + dispatch_queue_t render_queue; > > + /* The following fields should only be accessed from the BQL: */ > > Perhaps it may be better to document fields that can be accessed > *without* the BQL; most things in QEMU implicitly require the BQL. 
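(For reference, the per-job QemuEvent variant discussed above might look roughly like the following sketch; the done_event field name is illustrative only and error handling is omitted:

    typedef struct AppleGFXMapMemoryJob {
        /* ... existing fields ... */
        QemuEvent done_event;
    } AppleGFXMapMemoryJob;

    /* submitting side */
    qemu_event_init(&job.done_event, false);
    aio_bh_schedule_oneshot(qemu_get_aio_context(), apple_gfx_map_memory, &job);
    qemu_event_wait(&job.done_event);
    qemu_event_destroy(&job.done_event);

    /* at the end of the BH */
    qemu_event_set(&job->done_event);

This drops the shared mutex/condvar at the cost of initialising and destroying one event per request.)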
> > > + bool gfx_update_requested; > > + bool new_frame_ready; > > + bool using_managed_texture_storage; > > +} AppleGFXState; > > + > > +void apple_gfx_common_init(Object *obj, AppleGFXState *s, const > char* obj_name); > > +void apple_gfx_common_realize(AppleGFXState *s, > PGDeviceDescriptor *desc, > > + Error **errp); > > +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t > guest_physical, > > + uint64_t length, > bool read_only); > > +void apple_gfx_await_bh_job(AppleGFXState *s, bool *job_done_flag); > > + > > +#endif > > + > > diff --git a/hw/display/apple-gfx.m b/hw/display/apple-gfx.m > > new file mode 100644 > > index 00000000000..46be9957f69 > > --- /dev/null > > +++ b/hw/display/apple-gfx.m > > @@ -0,0 +1,713 @@ > > +/* > > + * QEMU Apple ParavirtualizedGraphics.framework device > > + * > > + * Copyright © 2023 Amazon.com, Inc. or its affiliates. All > Rights Reserved. > > + * > > + * This work is licensed under the terms of the GNU GPL, version > 2 or later. > > + * See the COPYING file in the top-level directory. > > + * > > + * ParavirtualizedGraphics.framework is a set of libraries that > macOS provides > > + * which implements 3d graphics passthrough to the host as well as a > > + * proprietary guest communication channel to drive it. This > device model > > + * implements support to drive that library from within QEMU. > > + */ > > + > > +#include "qemu/osdep.h" > > +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> > > +#include <mach/mach_vm.h> > > +#include "apple-gfx.h" > > +#include "trace.h" > > +#include "qemu-main.h" > > +#include "exec/address-spaces.h" > > +#include "migration/blocker.h" > > +#include "monitor/monitor.h" > > +#include "qemu/main-loop.h" > > +#include "qemu/cutils.h" > > +#include "qemu/log.h" > > +#include "qapi/visitor.h" > > +#include "qapi/error.h" > > +#include "ui/console.h" > > + > > +static const PGDisplayCoord_t apple_gfx_modes[] = { > > + { .x = 1440, .y = 1080 }, > > + { .x = 1280, .y = 1024 }, > > +}; > > + > > +/* This implements a type defined in <ParavirtualizedGraphics/ > PGDevice.h> > > + * which is opaque from the framework's point of view. Typedef > PGTask_t already > > + * exists in the framework headers. */ > > +struct PGTask_s { > > + QTAILQ_ENTRY(PGTask_s) node; > > + mach_vm_address_t address; > > + uint64_t len; > > +}; > > + > > +static Error *apple_gfx_mig_blocker; > > This does not have to be a static variable. > > > Hmm, the first 5 or so examples of migration blockers in other devices > etc. I could find were all declared in this way. What are you suggesting > as the alternative? And why not use the same pattern as in most of the > rest of the code base? I was wrong. This is better to be a static variable to ensure we won't add the same blocker in case we have two device instances. > > > + > > +static void apple_gfx_render_frame_completed(AppleGFXState *s, > > + uint32_t width, > uint32_t height); > > + > > +static inline dispatch_queue_t get_background_queue(void) > > Don't add inline. The only effect for modern compilers of inline is to > suppress the unused function warnings. 
> > > +{ > > + return > dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); > > +} > > + > > +static PGTask_t *apple_gfx_new_task(AppleGFXState *s, uint64_t len) > > +{ > > + mach_vm_address_t task_mem; > > + PGTask_t *task; > > + kern_return_t r; > > + > > + r = mach_vm_allocate(mach_task_self(), &task_mem, len, > VM_FLAGS_ANYWHERE); > > + if (r != KERN_SUCCESS || task_mem == 0) { > > Let's remove the check for task_mem == 0. We have no reason to > reject it > if the platform insists it allocated a memory at address 0 though > such a > situation should never happen in practice. > > > + return NULL; > > + } > > + > > + task = g_new0(PGTask_t, 1); > > + > > + task->address = task_mem; > > + task->len = len; > > + QTAILQ_INSERT_TAIL(&s->tasks, task, node); > > + > > + return task; > > +} > > + > > +typedef struct AppleGFXIOJob { > > + AppleGFXState *state; > > + uint64_t offset; > > + uint64_t value; > > + bool completed; > > +} AppleGFXIOJob; > > + > > +static void apple_gfx_do_read(void *opaque) > > +{ > > + AppleGFXIOJob *job = opaque; > > + job->value = [job->state->pgdev mmioReadAtOffset:job->offset]; > > + qatomic_set(&job->completed, true); > > + aio_wait_kick(); > > +} > > + > > +static uint64_t apple_gfx_read(void *opaque, hwaddr offset, > unsigned size) > > +{ > > + AppleGFXIOJob job = { > > + .state = opaque, > > + .offset = offset, > > + .completed = false, > > + }; > > + AioContext *context = qemu_get_aio_context(); > > + dispatch_queue_t queue = get_background_queue(); > > + > > + dispatch_async_f(queue, &job, apple_gfx_do_read); > > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > > + > > + trace_apple_gfx_read(offset, job.value); > > + return job.value; > > +} > > + > > +static void apple_gfx_do_write(void *opaque) > > +{ > > + AppleGFXIOJob *job = opaque; > > + [job->state->pgdev mmioWriteAtOffset:job->offset value:job- > >value]; > > + qatomic_set(&job->completed, true); > > + aio_wait_kick(); > > +} > > + > > +static void apple_gfx_write(void *opaque, hwaddr offset, > uint64_t val, > > + unsigned size) > > +{ > > + /* The methods mmioReadAtOffset: and especially > mmioWriteAtOffset: can > > + * trigger and block on operations on other dispatch queues, > which in turn > > + * may call back out on one or more of the callback blocks. > For this reason, > > + * and as we are holding the BQL, we invoke the I/O methods > on a pool > > + * thread and handle AIO tasks while we wait. Any work in > the callbacks > > + * requiring the BQL will in turn schedule BHs which this > thread will > > + * process while waiting. 
*/ > > + AppleGFXIOJob job = { > > + .state = opaque, > > + .offset = offset, > > + .value = val, > > + .completed = false, > > + }; > > + AioContext *context = qemu_get_current_aio_context(); > > + dispatch_queue_t queue = get_background_queue(); > > + > > + dispatch_async_f(queue, &job, apple_gfx_do_write); > > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > > + > > + trace_apple_gfx_write(offset, val); > > +} > > + > > +static const MemoryRegionOps apple_gfx_ops = { > > + .read = apple_gfx_read, > > + .write = apple_gfx_write, > > + .endianness = DEVICE_LITTLE_ENDIAN, > > + .valid = { > > + .min_access_size = 4, > > + .max_access_size = 8, > > + }, > > + .impl = { > > + .min_access_size = 4, > > + .max_access_size = 4, > > + }, > > +}; > > + > > +static void apple_gfx_render_new_frame_bql_unlock(AppleGFXState *s) > > +{ > > + BOOL r; > > + uint32_t width = surface_width(s->surface); > > + uint32_t height = surface_height(s->surface); > > + MTLRegion region = MTLRegionMake2D(0, 0, width, height); > > + id<MTLCommandBuffer> command_buffer = [s->mtl_queue > commandBuffer]; > > + id<MTLTexture> texture = s->texture; > > + > > + assert(bql_locked()); > > + [texture retain]; > > + > > + bql_unlock(); > > + > > + /* This is not safe to call from the BQL due to PVG-internal > locks causing > > + * deadlocks. */ > > + r = [s->pgdisp encodeCurrentFrameToCommandBuffer:command_buffer > > + texture:texture > > + region:region]; > > + if (!r) { > > + [texture release]; > > + bql_lock(); > > + --s->pending_frames; > > + bql_unlock(); > > + qemu_log_mask(LOG_GUEST_ERROR, > "apple_gfx_render_new_frame_bql_unlock: " > > Use: __func__ > > > + > "encodeCurrentFrameToCommandBuffer:texture:region: failed\n"); > > + return; > > + } > > + > > + if (s->using_managed_texture_storage) { > > + /* "Managed" textures exist in both VRAM and RAM and > must be synced. */ > > + id<MTLBlitCommandEncoder> blit = [command_buffer > blitCommandEncoder]; > > + [blit synchronizeResource:texture]; > > + [blit endEncoding]; > > + } > > + [texture release]; > > + [command_buffer addCompletedHandler: > > + ^(id<MTLCommandBuffer> cb) > > + { > > + dispatch_async(s->render_queue, ^{ > > + apple_gfx_render_frame_completed(s, width, height); > > + }); > > + }]; > > + [command_buffer commit]; > > +} > > + > > +static void copy_mtl_texture_to_surface_mem(id<MTLTexture> > texture, void *vram) > > +{ > > + /* TODO: Skip this entirely on a pure Metal or headless/ > guest-only > > + * rendering path, else use a blit command encoder? Needs > careful > > + * (double?) buffering design. */ > > + size_t width = texture.width, height = texture.height; > > + MTLRegion region = MTLRegionMake2D(0, 0, width, height); > > + [texture getBytes:vram > > + bytesPerRow:(width * 4) > > + bytesPerImage:(width * height * 4) > > + fromRegion:region > > + mipmapLevel:0 > > + slice:0]; > > +}copy_mtl_texture_to_surface_mem > > + > > +static void apple_gfx_render_frame_completed(AppleGFXState *s, > > + uint32_t width, > uint32_t height) > > +{ > > + bql_lock(); > > + --s->pending_frames; > > + assert(s->pending_frames >= 0); > > + > > + /* Only update display if mode hasn't changed since we > started rendering. 
*/ > > + if (width == surface_width(s->surface) && > > + height == surface_height(s->surface)) { > > + copy_mtl_texture_to_surface_mem(s->texture, s->vram); > > + if (s->gfx_update_requested) { > > + s->gfx_update_requested = false; > > + dpy_gfx_update_full(s->con); > > + graphic_hw_update_done(s->con); > > + s->new_frame_ready = false; > > + } else { > > + s->new_frame_ready = true; > > + } > > + } > > + if (s->pending_frames > 0) { > > + apple_gfx_render_new_frame_bql_unlock(s); > > + } else { > > + bql_unlock(); > > + } > > +} > > + > > +static void apple_gfx_fb_update_display(void *opaque) > > +{ > > + AppleGFXState *s = opaque; > > + > > + assert(bql_locked()); > > + if (s->new_frame_ready) { > > + dpy_gfx_update_full(s->con); > > + s->new_frame_ready = false; > > + graphic_hw_update_done(s->con); > > + } else if (s->pending_frames > 0) { > > + s->gfx_update_requested = true; > > + } else { > > + graphic_hw_update_done(s->con); > > + } > > +} > > + > > +static const GraphicHwOps apple_gfx_fb_ops = { > > + .gfx_update = apple_gfx_fb_update_display, > > + .gfx_update_async = true, > > +}; > > + > > +static void update_cursor(AppleGFXState *s) > > +{ > > + assert(bql_locked()); > > + dpy_mouse_set(s->con, s->pgdisp.cursorPosition.x, > > + s->pgdisp.cursorPosition.y, s->cursor_show); > > +} > > + > > +static void set_mode(AppleGFXState *s, uint32_t width, uint32_t > height) > > +{ > > + MTLTextureDescriptor *textureDescriptor; > > + > > + if (s->surface && > > + width == surface_width(s->surface) && > > + height == surface_height(s->surface)) { > > + return; > > + } > > + > > + g_free(s->vram); > > + [s->texture release]; > > + > > + s->vram = g_malloc0_n(width * height, 4); > > + s->surface = qemu_create_displaysurface_from(width, height, > PIXMAN_LE_a8r8g8b8, > > + width * 4, s- > >vram);> +> + @autoreleasepool { > > + textureDescriptor = > > + [MTLTextureDescriptor > > + > texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm > > + width:width > > + height:height > > + mipmapped:NO]; > > + textureDescriptor.usage = s->pgdisp.minimumTextureUsage; > > + s->texture = [s->mtl > newTextureWithDescriptor:textureDescriptor]; > > > What about creating pixman_image_t from s->texture.buffer.contents? > This > should save memory usage by removing the duplication of texture. > > > We need explicit control over when the GPU vs when the CPU may access > the texture - only one of them may access them at a time. As far as I > can tell, we can't control when the rest of Qemu might access the > pixman_image used in the console surface? You are right; we need to have duplicate buffers. We can still avoid copying by using two MTLTextures for double-buffering instead of having a MTLTexture and a pixman_image and copying between them for MTLStorageModeManaged. 
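(A rough, untested sketch of that double-buffering idea, glossing over how the console surface would then be backed: keep two textures and flip the index once the command buffer for a frame has completed, e.g.

    id<MTLTexture> textures[2];
    unsigned back_idx;   /* index of the texture currently being rendered into */

    /* when encoding a new frame */
    r = [s->pgdisp encodeCurrentFrameToCommandBuffer:command_buffer
                                             texture:textures[back_idx]
                                              region:region];

    /* in the command buffer's completion handler */
    back_idx ^= 1;    /* the finished frame becomes the front buffer */

so no getBytes: copy into a separate pixman buffer would be needed.)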
> > > + } > > + > > + s->using_managed_texture_storage = > > + (s->texture.storageMode == MTLStorageModeManaged); > > + dpy_gfx_replace_surface(s->con, s->surface); > > +} > > + > > +static void create_fb(AppleGFXState *s) > > +{ > > + s->con = graphic_console_init(NULL, 0, &apple_gfx_fb_ops, s); > > + set_mode(s, 1440, 1080); > > + > > + s->cursor_show = true; > > +} > > + > > +static size_t apple_gfx_get_default_mmio_range_size(void) > > +{ > > + size_t mmio_range_size; > > + @autoreleasepool { > > + PGDeviceDescriptor *desc = [PGDeviceDescriptor new]; > > + mmio_range_size = desc.mmioLength; > > + [desc release]; > > + } > > + return mmio_range_size; > > +} > > + > > +void apple_gfx_common_init(Object *obj, AppleGFXState *s, const > char* obj_name) > > +{ > > + size_t mmio_range_size = > apple_gfx_get_default_mmio_range_size(); > > + > > + trace_apple_gfx_common_init(obj_name, mmio_range_size); > > + memory_region_init_io(&s->iomem_gfx, obj, &apple_gfx_ops, s, > obj_name, > > + mmio_range_size); > > + > > + /* TODO: PVG framework supports serialising device state: > integrate it! */ > > +} > > + > > +typedef struct AppleGFXMapMemoryJob { > > + AppleGFXState *state; > > + PGTask_t *task; > > + uint64_t virtual_offset; > > + PGPhysicalMemoryRange_t *ranges; > > + uint32_t range_count; > > + bool read_only; > > + bool success; > > + bool done; > > +} AppleGFXMapMemoryJob; > > + > > +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t > guest_physical, > > + uint64_t length, > bool read_only) > > +{ > > + MemoryRegion *ram_region; > > + uintptr_t host_address; > > + hwaddr ram_region_offset = 0; > > + hwaddr ram_region_length = length; > > + > > + ram_region = address_space_translate(&address_space_memory, > > + guest_physical, > > + &ram_region_offset, > > + &ram_region_length, ! > read_only, > > + MEMTXATTRS_UNSPECIFIED); > > Call memory_region_ref() so that it won't go away. > > > + > > + if (!ram_region || ram_region_length < length || > > + !memory_access_is_direct(ram_region, !read_only)) { > > + return 0; > > + } > > + > > + host_address = > (mach_vm_address_t)memory_region_get_ram_ptr(ram_region); > > host_address is typed as uintptr_t, not mach_vm_address_t. 
> > > + if (host_address == 0) { > > + return 0; > > + } > > + host_address += ram_region_offset; > > + > > + return host_address; > > +} > > + > > +static void apple_gfx_map_memory(void *opaque) > > +{ > > + AppleGFXMapMemoryJob *job = opaque; > > + AppleGFXState *s = job->state; > > + PGTask_t *task = job->task; > > + uint32_t range_count = job->range_count; > > + uint64_t virtual_offset = job->virtual_offset; > > + PGPhysicalMemoryRange_t *ranges = job->ranges; > > + bool read_only = job->read_only; > > + kern_return_t r; > > + mach_vm_address_t target, source; > > + vm_prot_t cur_protection, max_protection; > > + bool success = true; > > + > > + g_assert(bql_locked()); > > + > > + trace_apple_gfx_map_memory(task, range_count, > virtual_offset, read_only); > > + for (int i = 0; i < range_count; i++) { > > + PGPhysicalMemoryRange_t *range = &ranges[i]; > > + > > + target = task->address + virtual_offset; > > + virtual_offset += range->physicalLength; > > + > > + trace_apple_gfx_map_memory_range(i, range->physicalAddress, > > + range->physicalLength); > > + > > + source = apple_gfx_host_address_for_gpa_range(range- > >physicalAddress, > > + range- > >physicalLength, > > + read_only); > > + if (source == 0) { > > + success = false; > > + continue; > > + } > > + > > + MemoryRegion* alt_mr = NULL; > > + mach_vm_address_t alt_source = > (mach_vm_address_t)gpa2hva(&alt_mr, range->physicalAddress, range- > >physicalLength, NULL); > > + g_assert(alt_source == source); > > Remove this; I guess this is for debugging. > > > + > > + cur_protection = 0; > > + max_protection = 0; > > + // Map guest RAM at range->physicalAddress into PG task > memory range > > + r = mach_vm_remap(mach_task_self(), > > + &target, range->physicalLength, > vm_page_size - 1, > > + VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE, > > + mach_task_self(), > > + source, false /* shared mapping, no > copy */, > > + &cur_protection, &max_protection, > > + VM_INHERIT_COPY); > > + trace_apple_gfx_remap(r, source, target); > > + g_assert(r == KERN_SUCCESS); > > + } > > + > > + qemu_mutex_lock(&s->job_mutex); > > + job->success = success; > > + job->done = true; > > + qemu_cond_broadcast(&s->job_cond); > > + qemu_mutex_unlock(&s->job_mutex); > > +} > > + > > +void apple_gfx_await_bh_job(AppleGFXState *s, bool *job_done_flag) > > +{ > > + qemu_mutex_lock(&s->job_mutex); > > + while (!*job_done_flag) { > > + qemu_cond_wait(&s->job_cond, &s->job_mutex); > > + } > > + qemu_mutex_unlock(&s->job_mutex); > > +} > > + > > +typedef struct AppleGFXReadMemoryJob { > > + AppleGFXState *s; > > + hwaddr physical_address; > > + uint64_t length; > > + void *dst; > > + bool done; > > +} AppleGFXReadMemoryJob; > > + > > +static void apple_gfx_do_read_memory(void *opaque) > > +{ > > + AppleGFXReadMemoryJob *job = opaque; > > + AppleGFXState *s = job->s; > > + > > + cpu_physical_memory_read(job->physical_address, job->dst, > job->length); > > Use: dma_memory_read() > > > + > > + qemu_mutex_lock(&s->job_mutex); > > + job->done = true; > > + qemu_cond_broadcast(&s->job_cond); > > + qemu_mutex_unlock(&s->job_mutex); > > +} > > + > > +static void apple_gfx_read_memory(AppleGFXState *s, hwaddr > physical_address, > > + uint64_t length, void *dst) > > +{ > > + AppleGFXReadMemoryJob job = { > > + s, physical_address, length, dst > > + }; > > + > > + trace_apple_gfx_read_memory(physical_address, length, dst); > > + > > + /* Traversing the memory map requires RCU/BQL, so do it in a > BH. 
*/ > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > apple_gfx_do_read_memory, > > + &job); > > + apple_gfx_await_bh_job(s, &job.done); > > +} > > + > > +static void > apple_gfx_register_task_mapping_handlers(AppleGFXState *s, > > + > PGDeviceDescriptor *desc) > > +{ > > + desc.createTask = ^(uint64_t vmSize, void * _Nullable * > _Nonnull baseAddress) { > > + PGTask_t *task = apple_gfx_new_task(s, vmSize); > > + *baseAddress = (void *)task->address; > > + trace_apple_gfx_create_task(vmSize, *baseAddress); > > + return task; > > + }; > > + > > + desc.destroyTask = ^(PGTask_t * _Nonnull task) { > > + trace_apple_gfx_destroy_task(task); > > + QTAILQ_REMOVE(&s->tasks, task, node); > > + mach_vm_deallocate(mach_task_self(), task->address, > task->len); > > + g_free(task); > > + }; > > + > > + desc.mapMemory = ^bool(PGTask_t * _Nonnull task, uint32_t > range_count, > > + uint64_t virtual_offset, bool read_only, > > + PGPhysicalMemoryRange_t * _Nonnull ranges) { > > + AppleGFXMapMemoryJob job = { > > + .state = s, > > + .task = task, .ranges = ranges, .range_count = > range_count, > > + .read_only = read_only, .virtual_offset = > virtual_offset, > > + .done = false, .success = true, > > + }; > > + if (range_count > 0) { > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > > + apple_gfx_map_memory, &job); > > + apple_gfx_await_bh_job(s, &job.done); > > + } > > + return job.success; > > + }; > > + > > + desc.unmapMemory = ^bool(PGTask_t * _Nonnull task, uint64_t > virtualOffset, > > + uint64_t length) { > > + kern_return_t r; > > + mach_vm_address_t range_address; > > + > > + trace_apple_gfx_unmap_memory(task, virtualOffset, length); > > + > > + /* Replace task memory range with fresh pages, undoing > the mapping > > + * from guest RAM. */ > > + range_address = task->address + virtualOffset; > > + r = mach_vm_allocate(mach_task_self(), &range_address, > length, > > + VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE); > > + g_assert(r == KERN_SUCCESS);error_setg > > An extra error_setg > > > + > > + return true; > > + }; > > + > > + desc.readMemory = ^bool(uint64_t physical_address, uint64_t > length, > > + void * _Nonnull dst) { > > + apple_gfx_read_memory(s, physical_address, length, dst); > > + return true; > > + }; > > +} > > + > > +static PGDisplayDescriptor > *apple_gfx_prepare_display_descriptor(AppleGFXState *s) > > +{ > > + PGDisplayDescriptor *disp_desc = [PGDisplayDescriptor new]; > > + > > + disp_desc.name <http://disp_desc.name> = @"QEMU display"; > > + disp_desc.sizeInMillimeters = NSMakeSize(400., 300.); /* A > 20" display */ > > + disp_desc.queue = dispatch_get_main_queue(); > > + disp_desc.newFrameEventHandler = ^(void) { > > + trace_apple_gfx_new_frame(); > > + dispatch_async(s->render_queue, ^{ > > + /* Drop frames if we get too far ahead. 
*/ > > + bql_lock(); > > + if (s->pending_frames >= 2) { > > + bql_unlock(); > > + return; > > + } > > + ++s->pending_frames; > > + if (s->pending_frames > 1) { > > + bql_unlock(); > > + return; > > + } > > + @autoreleasepool { > > + apple_gfx_render_new_frame_bql_unlock(s); > > + } > > + }); > > + }; > > + disp_desc.modeChangeHandler = ^(PGDisplayCoord_t sizeInPixels, > > + OSType pixelFormat) { > > + trace_apple_gfx_mode_change(sizeInPixels.x, sizeInPixels.y); > > + > > + BQL_LOCK_GUARD(); > > + set_mode(s, sizeInPixels.x, sizeInPixels.y); > > + }; > > + disp_desc.cursorGlyphHandler = ^(NSBitmapImageRep *glyph, > > + PGDisplayCoord_t hotSpot) { > > + [glyph retain]; > > + dispatch_async(get_background_queue(), ^{ > > + BQL_LOCK_GUARD(); > > + uint32_t bpp = glyph.bitsPerPixel; > > + size_t width = glyph.pixelsWide; > > + size_t height = glyph.pixelsHigh; > > + size_t padding_bytes_per_row = glyph.bytesPerRow - > width * 4; > > + const uint8_t* px_data = glyph.bitmapData; > > + > > + trace_apple_gfx_cursor_set(bpp, width, height); > > + > > + if (s->cursor) { > > + cursor_unref(s->cursor); > > + s->cursor = NULL; > > + } > > + > > + if (bpp == 32) { /* Shouldn't be anything else, but > just to be safe...*/ > > + s->cursor = cursor_alloc(width, height); > > + s->cursor->hot_x = hotSpot.x; > > + s->cursor->hot_y = hotSpot.y; > > + > > + uint32_t *dest_px = s->cursor->data; > > + > > + for (size_t y = 0; y < height; ++y) { > > + for (size_t x = 0; x < width; ++x) { > > + /* NSBitmapImageRep's red & blue > channels are swapped > > + * compared to QEMUCursor's. */ > > + *dest_px = > > + (px_data[0] << 16u) | > > + (px_data[1] << 8u) | > > + (px_data[2] << 0u) | > > + (px_data[3] << 24u); > > + ++dest_px; > > + px_data += 4; > > + } > > + px_data += padding_bytes_per_row; > > + } > > + dpy_cursor_define(s->con, s->cursor); > > + update_cursor(s); > > + } > > + [glyph release]; > > + }); > > + }; > > + disp_desc.cursorShowHandler = ^(BOOL show) { > > + dispatch_async(get_background_queue(), ^{ > > + BQL_LOCK_GUARD(); > > + trace_apple_gfx_cursor_show(show); > > + s->cursor_show = show; > > + update_cursor(s); > > + }); > > + }; > > + disp_desc.cursorMoveHandler = ^(void) { > > + dispatch_async(get_background_queue(), ^{ > > + BQL_LOCK_GUARD(); > > + trace_apple_gfx_cursor_move(); > > + update_cursor(s); > > + }); > > + }; > > + > > + return disp_desc; > > +} > > + > > +static NSArray<PGDisplayMode*>* > apple_gfx_prepare_display_mode_array(void) > > +{ > > + PGDisplayMode *modes[ARRAY_SIZE(apple_gfx_modes)]; > > + NSArray<PGDisplayMode*>* mode_array = nil; > > + int i; > > + > > + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { > > + modes[i] = > > + [[PGDisplayMode alloc] > initWithSizeInPixels:apple_gfx_modes[i] refreshRateInHz:60.]; > > + } > > + > > + mode_array = [NSArray arrayWithObjects:modes > count:ARRAY_SIZE(apple_gfx_modes)]; > > + > > + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { > > + [modes[i] release]; > > + modes[i] = nil; > > + } > > + > > + return mode_array; > > +} > > + > > +static id<MTLDevice> copy_suitable_metal_device(void) > > +{ > > + id<MTLDevice> dev = nil; > > + NSArray<id<MTLDevice>> *devs = MTLCopyAllDevices(); > > + > > + /* Prefer a unified memory GPU. Failing that, pick a non- > removable GPU. 
*/ > > + for (size_t i = 0; i < devs.count; ++i) { > > + if (devs[i].hasUnifiedMemory) { > > + dev = devs[i]; > > + break; > > + } > > + if (!devs[i].removable) { > > + dev = devs[i]; > > + } > > + } > > + > > + if (dev != nil) { > > + [dev retain]; > > + } else { > > + dev = MTLCreateSystemDefaultDevice(); > > + } > > + [devs release]; > > + > > + return dev; > > +} > > + > > +void apple_gfx_common_realize(AppleGFXState *s, > PGDeviceDescriptor *desc, > > + Error **errp) > > +{ > > + PGDisplayDescriptor *disp_desc = nil; > > + > > + if (apple_gfx_mig_blocker == NULL) { > > + error_setg(&apple_gfx_mig_blocker, > > + "Migration state blocked by apple-gfx display > device"); > > + if (migrate_add_blocker(&apple_gfx_mig_blocker, errp) < 0) { > > + return; > > + } > > + } > > + > > + QTAILQ_INIT(&s->tasks); > > + s->render_queue = dispatch_queue_create("apple-gfx.render", > > + DISPATCH_QUEUE_SERIAL); > > + s->mtl = copy_suitable_metal_device(); > > + s->mtl_queue = [s->mtl newCommandQueue]; > > + > > + desc.device = s->mtl; > > + > > + apple_gfx_register_task_mapping_handlers(s, desc); > > + > > + s->pgdev = PGNewDeviceWithDescriptor(desc); > > + > > + disp_desc = apple_gfx_prepare_display_descriptor(s); > > + s->pgdisp = [s->pgdev newDisplayWithDescriptor:disp_desc > > + port:0 > serialNum:1234]; > > + [disp_desc release]; > > + s->pgdisp.modeList = apple_gfx_prepare_display_mode_array(); > > + > > + create_fb(s); > > + > > + qemu_mutex_init(&s->job_mutex); > > + qemu_cond_init(&s->job_cond); > > +} > > diff --git a/hw/display/meson.build b/hw/display/meson.build > > index 20a94973fa2..619e642905a 100644 > > --- a/hw/display/meson.build > > +++ b/hw/display/meson.build > > @@ -61,6 +61,10 @@ system_ss.add(when: 'CONFIG_ARTIST', if_true: > files('artist.c')) > > > > system_ss.add(when: 'CONFIG_ATI_VGA', if_true: [files('ati.c', > 'ati_2d.c', 'ati_dbg.c'), pixman]) > > > > +system_ss.add(when: 'CONFIG_MAC_PVG', if_true: > [files('apple-gfx.m'), pvg, metal]) > > +if cpu == 'aarch64' > > + system_ss.add(when: 'CONFIG_MAC_PVG_MMIO', if_true: > [files('apple-gfx-mmio.m'), pvg, metal]) > > +endif > > > > if config_all_devices.has_key('CONFIG_VIRTIO_GPU') > > virtio_gpu_ss = ss.source_set() > > diff --git a/hw/display/trace-events b/hw/display/trace-events > > index 781f8a33203..214998312b9 100644 > > --- a/hw/display/trace-events > > +++ b/hw/display/trace-events > > @@ -191,3 +191,29 @@ dm163_bits_ppi(unsigned dest_width) > "dest_width : %u" > > dm163_leds(int led, uint32_t value) "led %d: 0x%x" > > dm163_channels(int channel, uint8_t value) "channel %d: 0x%x" > > dm163_refresh_rate(uint32_t rr) "refresh rate %d" > > + > > +# apple-gfx.m > > +apple_gfx_read(uint64_t offset, uint64_t res) > "offset=0x%"PRIx64" res=0x%"PRIx64 > > +apple_gfx_write(uint64_t offset, uint64_t val) > "offset=0x%"PRIx64" val=0x%"PRIx64 > > +apple_gfx_create_task(uint32_t vm_size, void *va) "vm_size=0x%x > base_addr=%p" > > +apple_gfx_destroy_task(void *task) "task=%p" > > +apple_gfx_map_memory(void *task, uint32_t range_count, uint64_t > virtual_offset, uint32_t read_only) "task=%p range_count=0x%x > virtual_offset=0x%"PRIx64" read_only=%d" > > +apple_gfx_map_memory_range(uint32_t i, uint64_t phys_addr, > uint64_t phys_len) "[%d] phys_addr=0x%"PRIx64" phys_len=0x%"PRIx64 > > +apple_gfx_remap(uint64_t retval, uint64_t source, uint64_t > target) "retval=%"PRId64" source=0x%"PRIx64" target=0x%"PRIx64 > > +apple_gfx_unmap_memory(void *task, uint64_t virtual_offset, > uint64_t length) "task=%p virtual_offset=0x%"PRIx64" 
length=0x%"PRIx64 > > +apple_gfx_read_memory(uint64_t phys_address, uint64_t length, > void *dst) "phys_addr=0x%"PRIx64" length=0x%"PRIx64" dest=%p" > > +apple_gfx_raise_irq(uint32_t vector) "vector=0x%x" > > +apple_gfx_new_frame(void) "" > > +apple_gfx_mode_change(uint64_t x, uint64_t y) "x=%"PRId64" > y=%"PRId64 > > +apple_gfx_cursor_set(uint32_t bpp, uint64_t width, uint64_t > height) "bpp=%d width=%"PRId64" height=0x%"PRId64 > > +apple_gfx_cursor_show(uint32_t show) "show=%d" > > +apple_gfx_cursor_move(void) "" > > +apple_gfx_common_init(const char *device_name, size_t mmio_size) > "device: %s; MMIO size: %zu bytes" > > + > > +# apple-gfx-mmio.m > > +apple_gfx_mmio_iosfc_read(uint64_t offset, uint64_t res) > "offset=0x%"PRIx64" res=0x%"PRIx64 > > +apple_gfx_mmio_iosfc_write(uint64_t offset, uint64_t val) > "offset=0x%"PRIx64" val=0x%"PRIx64 > > +apple_gfx_iosfc_map_memory(uint64_t phys, uint64_t len, uint32_t > ro, void *va, void *e, void *f, void* va_result, int success) > "phys=0x%"PRIx64" len=0x%"PRIx64" ro=%d va=%p e=%p f=%p -> *va=%p, > success = %d" > > +apple_gfx_iosfc_unmap_memory(void *a, void *b, void *c, void *d, > void *e, void *f) "a=%p b=%p c=%p d=%p e=%p f=%p" > > +apple_gfx_iosfc_raise_irq(uint32_t vector) "vector=0x%x" > > + > > diff --git a/meson.build b/meson.build > > index d26690ce204..0e124eff13f 100644 > > --- a/meson.build > > +++ b/meson.build > > @@ -761,6 +761,8 @@ socket = [] > > version_res = [] > > coref = [] > > iokit = [] > > +pvg = [] > > +metal = [] > > emulator_link_args = [] > > midl = not_found > > widl = not_found > > @@ -782,6 +784,8 @@ elif host_os == 'darwin' > > coref = dependency('appleframeworks', modules: 'CoreFoundation') > > iokit = dependency('appleframeworks', modules: 'IOKit', > required: false) > > host_dsosuf = '.dylib' > > + pvg = dependency('appleframeworks', modules: > 'ParavirtualizedGraphics') > > + metal = dependency('appleframeworks', modules: 'Metal') > > elif host_os == 'sunos' > > socket = [cc.find_library('socket'), > > cc.find_library('nsl'), >
On Sat, 26 Oct 2024 at 06:40, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > On 2024/10/26 4:43, Phil Dennis-Jordan wrote: > > > > > > On Fri, 25 Oct 2024 at 08:03, Akihiko Odaki <akihiko.odaki@daynix.com > > <mailto:akihiko.odaki@daynix.com>> wrote: > > > > On 2024/10/24 19:28, Phil Dennis-Jordan wrote: > > > + /* For running PVG memory-mapping requests in the AIO > context */ > > > + QemuCond job_cond; > > > + QemuMutex job_mutex; > > > > Use: QemuEvent > > > > > > Hmm. I think if we were to use that, we would need to create a new > > QemuEvent for every job and destroy it afterward, which seems expensive. > > We can't rule out multiple concurrent jobs being submitted, and the > > QemuEvent system only supports a single producer as far as I can tell. > > > > You can probably sort of hack around it with just one QemuEvent by > > putting the qemu_event_wait into a loop and turning the job.done flag > > into an atomic (because it would now need to be checked outside the > > lock) but this all seems unnecessarily complicated considering the > > QemuEvent uses the same mechanism QemuCond/QemuMutex internally on macOS > > (the only platform relevant here), except we can use it as intended with > > QemuCond/QemuMutex rather than having to work against the abstraction. > > I don't think it's going to be used concurrently. It would be difficult > to reason even for the framework if it performs memory > unmapping/mapping/reading operations concurrently. I've just performed a very quick test by wrapping the job submission/wait in the 2 mapMemory callbacks and the 1 readMemory callback with atomic counters and logging whenever a counter went above 1. * Overall, concurrent callbacks across all types were common (many per second when the VM is busy). It's not exactly a "thundering herd" (I never saw >2) but it's probably not a bad idea to use a separate condition variable for each job type. (task map, surface map, memory read) * While I did not observe any concurrent memory mapping operations *within* a type of memory map (2 task mappings or 2 surface mappings) I did see very occasional concurrent memory *read* callbacks. These would, as far as I can tell, not be safe with QemuEvents, unless we placed the event inside the job struct and init/destroyed it on every callback (which seems like excessive overhead). My recommendation would be to split it up into 3 pairs of mutex/cond; this will almost entirely remove any contention, but continue to be safe for when it does occur. I don't think QemuEvent is a realistic option (too tricky to get right) for the observed-concurrent readMemory callback. I'm nervous about assuming the mapMemory callbacks will NEVER be called concurrently, but at a push I'll acquiesce to switching those to QemuEvent in the absence of evidence of concurrency. > PGDevice.h also notes > raiseInterrupt needs to be thread-safe while it doesn't make such notes > for memory operations. This actually makes sense. > > If it's ever going to be used concurrently, it's better to have > QemuEvent for each job to avoid the thundering herd problem. > > > > > > + > > > + dispatch_queue_t render_queue; > > > + /* The following fields should only be accessed from the > BQL: */ > > > > Perhaps it may be better to document fields that can be accessed > > *without* the BQL; most things in QEMU implicitly require the BQL. 
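To make the "3 pairs of mutex/cond" recommendation above concrete, a minimal sketch of what a per-job-type waiter could look like; the names are made up for illustration, and the single job_mutex/job_cond pair currently in AppleGFXState would become three instances of this (task map, surface map, memory read).

    typedef struct AppleGFXJobWaiter {
        QemuMutex mutex;
        QemuCond cond;
    } AppleGFXJobWaiter;

    static void apple_gfx_job_waiter_init(AppleGFXJobWaiter *w)
    {
        qemu_mutex_init(&w->mutex);
        qemu_cond_init(&w->cond);
    }

    /* BH side: mark the job done and wake only waiters of this job type. */
    static void apple_gfx_job_signal_done(AppleGFXJobWaiter *w, bool *done_flag)
    {
        qemu_mutex_lock(&w->mutex);
        *done_flag = true;
        qemu_cond_broadcast(&w->cond);
        qemu_mutex_unlock(&w->mutex);
    }

    /* PVG callback side: block until the BH has completed this job type. */
    static void apple_gfx_job_wait(AppleGFXJobWaiter *w, bool *done_flag)
    {
        qemu_mutex_lock(&w->mutex);
        while (!*done_flag) {
            qemu_cond_wait(&w->cond, &w->mutex);
        }
        qemu_mutex_unlock(&w->mutex);
    }
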
> > > > > + bool gfx_update_requested; > > > + bool new_frame_ready; > > > + bool using_managed_texture_storage; > > > +} AppleGFXState; > > > + > > > +void apple_gfx_common_init(Object *obj, AppleGFXState *s, const > > char* obj_name); > > > +void apple_gfx_common_realize(AppleGFXState *s, > > PGDeviceDescriptor *desc, > > > + Error **errp); > > > +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t > > guest_physical, > > > + uint64_t length, > > bool read_only); > > > +void apple_gfx_await_bh_job(AppleGFXState *s, bool > *job_done_flag); > > > + > > > +#endif > > > + > > > diff --git a/hw/display/apple-gfx.m b/hw/display/apple-gfx.m > > > new file mode 100644 > > > index 00000000000..46be9957f69 > > > --- /dev/null > > > +++ b/hw/display/apple-gfx.m > > > @@ -0,0 +1,713 @@ > > > +/* > > > + * QEMU Apple ParavirtualizedGraphics.framework device > > > + * > > > + * Copyright © 2023 Amazon.com, Inc. or its affiliates. All > > Rights Reserved. > > > + * > > > + * This work is licensed under the terms of the GNU GPL, version > > 2 or later. > > > + * See the COPYING file in the top-level directory. > > > + * > > > + * ParavirtualizedGraphics.framework is a set of libraries that > > macOS provides > > > + * which implements 3d graphics passthrough to the host as well > as a > > > + * proprietary guest communication channel to drive it. This > > device model > > > + * implements support to drive that library from within QEMU. > > > + */ > > > + > > > +#include "qemu/osdep.h" > > > +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> > > > +#include <mach/mach_vm.h> > > > +#include "apple-gfx.h" > > > +#include "trace.h" > > > +#include "qemu-main.h" > > > +#include "exec/address-spaces.h" > > > +#include "migration/blocker.h" > > > +#include "monitor/monitor.h" > > > +#include "qemu/main-loop.h" > > > +#include "qemu/cutils.h" > > > +#include "qemu/log.h" > > > +#include "qapi/visitor.h" > > > +#include "qapi/error.h" > > > +#include "ui/console.h" > > > + > > > +static const PGDisplayCoord_t apple_gfx_modes[] = { > > > + { .x = 1440, .y = 1080 }, > > > + { .x = 1280, .y = 1024 }, > > > +}; > > > + > > > +/* This implements a type defined in <ParavirtualizedGraphics/ > > PGDevice.h> > > > + * which is opaque from the framework's point of view. Typedef > > PGTask_t already > > > + * exists in the framework headers. */ > > > +struct PGTask_s { > > > + QTAILQ_ENTRY(PGTask_s) node; > > > + mach_vm_address_t address; > > > + uint64_t len; > > > +}; > > > + > > > +static Error *apple_gfx_mig_blocker; > > > > This does not have to be a static variable. > > > > > > Hmm, the first 5 or so examples of migration blockers in other devices > > etc. I could find were all declared in this way. What are you suggesting > > as the alternative? And why not use the same pattern as in most of the > > rest of the code base? > > I was wrong. This is better to be a static variable to ensure we won't > add the same blocker in case we have two device instances. > > > > > > + > > > +static void apple_gfx_render_frame_completed(AppleGFXState *s, > > > + uint32_t width, > > uint32_t height); > > > + > > > +static inline dispatch_queue_t get_background_queue(void) > > > > Don't add inline. The only effect for modern compilers of inline is > to > > suppress the unused function warnings. 
> > > > > +{ > > > + return > > dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); > > > +} > > > + > > > +static PGTask_t *apple_gfx_new_task(AppleGFXState *s, uint64_t > len) > > > +{ > > > + mach_vm_address_t task_mem; > > > + PGTask_t *task; > > > + kern_return_t r; > > > + > > > + r = mach_vm_allocate(mach_task_self(), &task_mem, len, > > VM_FLAGS_ANYWHERE); > > > + if (r != KERN_SUCCESS || task_mem == 0) { > > > > Let's remove the check for task_mem == 0. We have no reason to > > reject it > > if the platform insists it allocated a memory at address 0 though > > such a > > situation should never happen in practice. > > > > > + return NULL; > > > + } > > > + > > > + task = g_new0(PGTask_t, 1); > > > + > > > + task->address = task_mem; > > > + task->len = len; > > > + QTAILQ_INSERT_TAIL(&s->tasks, task, node); > > > + > > > + return task; > > > +} > > > + > > > +typedef struct AppleGFXIOJob { > > > + AppleGFXState *state; > > > + uint64_t offset; > > > + uint64_t value; > > > + bool completed; > > > +} AppleGFXIOJob; > > > + > > > +static void apple_gfx_do_read(void *opaque) > > > +{ > > > + AppleGFXIOJob *job = opaque; > > > + job->value = [job->state->pgdev > mmioReadAtOffset:job->offset]; > > > + qatomic_set(&job->completed, true); > > > + aio_wait_kick(); > > > +} > > > + > > > +static uint64_t apple_gfx_read(void *opaque, hwaddr offset, > > unsigned size) > > > +{ > > > + AppleGFXIOJob job = { > > > + .state = opaque, > > > + .offset = offset, > > > + .completed = false, > > > + }; > > > + AioContext *context = qemu_get_aio_context(); > > > + dispatch_queue_t queue = get_background_queue(); > > > + > > > + dispatch_async_f(queue, &job, apple_gfx_do_read); > > > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > > > + > > > + trace_apple_gfx_read(offset, job.value); > > > + return job.value; > > > +} > > > + > > > +static void apple_gfx_do_write(void *opaque) > > > +{ > > > + AppleGFXIOJob *job = opaque; > > > + [job->state->pgdev mmioWriteAtOffset:job->offset value:job- > > >value]; > > > + qatomic_set(&job->completed, true); > > > + aio_wait_kick(); > > > +} > > > + > > > +static void apple_gfx_write(void *opaque, hwaddr offset, > > uint64_t val, > > > + unsigned size) > > > +{ > > > + /* The methods mmioReadAtOffset: and especially > > mmioWriteAtOffset: can > > > + * trigger and block on operations on other dispatch queues, > > which in turn > > > + * may call back out on one or more of the callback blocks. > > For this reason, > > > + * and as we are holding the BQL, we invoke the I/O methods > > on a pool > > > + * thread and handle AIO tasks while we wait. Any work in > > the callbacks > > > + * requiring the BQL will in turn schedule BHs which this > > thread will > > > + * process while waiting. 
*/ > > > + AppleGFXIOJob job = { > > > + .state = opaque, > > > + .offset = offset, > > > + .value = val, > > > + .completed = false, > > > + }; > > > + AioContext *context = qemu_get_current_aio_context(); > > > + dispatch_queue_t queue = get_background_queue(); > > > + > > > + dispatch_async_f(queue, &job, apple_gfx_do_write); > > > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > > > + > > > + trace_apple_gfx_write(offset, val); > > > +} > > > + > > > +static const MemoryRegionOps apple_gfx_ops = { > > > + .read = apple_gfx_read, > > > + .write = apple_gfx_write, > > > + .endianness = DEVICE_LITTLE_ENDIAN, > > > + .valid = { > > > + .min_access_size = 4, > > > + .max_access_size = 8, > > > + }, > > > + .impl = { > > > + .min_access_size = 4, > > > + .max_access_size = 4, > > > + }, > > > +}; > > > + > > > +static void apple_gfx_render_new_frame_bql_unlock(AppleGFXState > *s) > > > +{ > > > + BOOL r; > > > + uint32_t width = surface_width(s->surface); > > > + uint32_t height = surface_height(s->surface); > > > + MTLRegion region = MTLRegionMake2D(0, 0, width, height); > > > + id<MTLCommandBuffer> command_buffer = [s->mtl_queue > > commandBuffer]; > > > + id<MTLTexture> texture = s->texture; > > > + > > > + assert(bql_locked()); > > > + [texture retain]; > > > + > > > + bql_unlock(); > > > + > > > + /* This is not safe to call from the BQL due to PVG-internal > > locks causing > > > + * deadlocks. */ > > > + r = [s->pgdisp > encodeCurrentFrameToCommandBuffer:command_buffer > > > + texture:texture > > > + region:region]; > > > + if (!r) { > > > + [texture release]; > > > + bql_lock(); > > > + --s->pending_frames; > > > + bql_unlock(); > > > + qemu_log_mask(LOG_GUEST_ERROR, > > "apple_gfx_render_new_frame_bql_unlock: " > > > > Use: __func__ > > > > > + > > "encodeCurrentFrameToCommandBuffer:texture:region: failed\n"); > > > + return; > > > + } > > > + > > > + if (s->using_managed_texture_storage) { > > > + /* "Managed" textures exist in both VRAM and RAM and > > must be synced. */ > > > + id<MTLBlitCommandEncoder> blit = [command_buffer > > blitCommandEncoder]; > > > + [blit synchronizeResource:texture]; > > > + [blit endEncoding]; > > > + } > > > + [texture release]; > > > + [command_buffer addCompletedHandler: > > > + ^(id<MTLCommandBuffer> cb) > > > + { > > > + dispatch_async(s->render_queue, ^{ > > > + apple_gfx_render_frame_completed(s, width, > height); > > > + }); > > > + }]; > > > + [command_buffer commit]; > > > +} > > > + > > > +static void copy_mtl_texture_to_surface_mem(id<MTLTexture> > > texture, void *vram) > > > +{ > > > + /* TODO: Skip this entirely on a pure Metal or headless/ > > guest-only > > > + * rendering path, else use a blit command encoder? Needs > > careful > > > + * (double?) buffering design. */ > > > + size_t width = texture.width, height = texture.height; > > > + MTLRegion region = MTLRegionMake2D(0, 0, width, height); > > > + [texture getBytes:vram > > > + bytesPerRow:(width * 4) > > > + bytesPerImage:(width * height * 4) > > > + fromRegion:region > > > + mipmapLevel:0 > > > + slice:0]; > > > +}copy_mtl_texture_to_surface_mem > > > + > > > +static void apple_gfx_render_frame_completed(AppleGFXState *s, > > > + uint32_t width, > > uint32_t height) > > > +{ > > > + bql_lock(); > > > + --s->pending_frames; > > > + assert(s->pending_frames >= 0); > > > + > > > + /* Only update display if mode hasn't changed since we > > started rendering. 
*/ > > > + if (width == surface_width(s->surface) && > > > + height == surface_height(s->surface)) { > > > + copy_mtl_texture_to_surface_mem(s->texture, s->vram); > > > + if (s->gfx_update_requested) { > > > + s->gfx_update_requested = false; > > > + dpy_gfx_update_full(s->con); > > > + graphic_hw_update_done(s->con); > > > + s->new_frame_ready = false; > > > + } else { > > > + s->new_frame_ready = true; > > > + } > > > + } > > > + if (s->pending_frames > 0) { > > > + apple_gfx_render_new_frame_bql_unlock(s); > > > + } else { > > > + bql_unlock(); > > > + } > > > +} > > > + > > > +static void apple_gfx_fb_update_display(void *opaque) > > > +{ > > > + AppleGFXState *s = opaque; > > > + > > > + assert(bql_locked()); > > > + if (s->new_frame_ready) { > > > + dpy_gfx_update_full(s->con); > > > + s->new_frame_ready = false; > > > + graphic_hw_update_done(s->con); > > > + } else if (s->pending_frames > 0) { > > > + s->gfx_update_requested = true; > > > + } else { > > > + graphic_hw_update_done(s->con); > > > + } > > > +} > > > + > > > +static const GraphicHwOps apple_gfx_fb_ops = { > > > + .gfx_update = apple_gfx_fb_update_display, > > > + .gfx_update_async = true, > > > +}; > > > + > > > +static void update_cursor(AppleGFXState *s) > > > +{ > > > + assert(bql_locked()); > > > + dpy_mouse_set(s->con, s->pgdisp.cursorPosition.x, > > > + s->pgdisp.cursorPosition.y, s->cursor_show); > > > +} > > > + > > > +static void set_mode(AppleGFXState *s, uint32_t width, uint32_t > > height) > > > +{ > > > + MTLTextureDescriptor *textureDescriptor; > > > + > > > + if (s->surface && > > > + width == surface_width(s->surface) && > > > + height == surface_height(s->surface)) { > > > + return; > > > + } > > > + > > > + g_free(s->vram); > > > + [s->texture release]; > > > + > > > + s->vram = g_malloc0_n(width * height, 4); > > > + s->surface = qemu_create_displaysurface_from(width, height, > > PIXMAN_LE_a8r8g8b8, > > > + width * 4, s- > > >vram);> +> + @autoreleasepool { > > > + textureDescriptor = > > > + [MTLTextureDescriptor > > > + > > texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm > > > + width:width > > > + height:height > > > + mipmapped:NO]; > > > + textureDescriptor.usage = s->pgdisp.minimumTextureUsage; > > > + s->texture = [s->mtl > > newTextureWithDescriptor:textureDescriptor]; > > > > > > What about creating pixman_image_t from s->texture.buffer.contents? > > This > > should save memory usage by removing the duplication of texture. > > > > > > We need explicit control over when the GPU vs when the CPU may access > > the texture - only one of them may access them at a time. As far as I > > can tell, we can't control when the rest of Qemu might access the > > pixman_image used in the console surface? > > You are right; we need to have duplicate buffers. We can still avoid > copying by using two MTLTextures for double-buffering instead of having > a MTLTexture and a pixman_image and copying between them for > MTLStorageModeManaged. > Do I understand correctly that you intend to swap the surface->image on every frame, or even the surface->image->data? If so, it's my understanding from reading the source of a bunch of UI implementations a few weeks ago that this is neither supported nor safe, as some implementations take long-lived references to these internal data structures until a dpy_gfx_switch callback. And the implementations for those callbacks are in turn very expensive in some cases. 
This is why my conclusion in the v4 thread was that double-buffering was infeasible with the current architecture. > > > > > + } > > > + > > > + s->using_managed_texture_storage = > > > + (s->texture.storageMode == MTLStorageModeManaged); > > > + dpy_gfx_replace_surface(s->con, s->surface); > > > +} > > > + > > > +static void create_fb(AppleGFXState *s) > > > +{ > > > + s->con = graphic_console_init(NULL, 0, &apple_gfx_fb_ops, s); > > > + set_mode(s, 1440, 1080); > > > + > > > + s->cursor_show = true; > > > +} > > > + > > > +static size_t apple_gfx_get_default_mmio_range_size(void) > > > +{ > > > + size_t mmio_range_size; > > > + @autoreleasepool { > > > + PGDeviceDescriptor *desc = [PGDeviceDescriptor new]; > > > + mmio_range_size = desc.mmioLength; > > > + [desc release]; > > > + } > > > + return mmio_range_size; > > > +} > > > + > > > +void apple_gfx_common_init(Object *obj, AppleGFXState *s, const > > char* obj_name) > > > +{ > > > + size_t mmio_range_size = > > apple_gfx_get_default_mmio_range_size(); > > > + > > > + trace_apple_gfx_common_init(obj_name, mmio_range_size); > > > + memory_region_init_io(&s->iomem_gfx, obj, &apple_gfx_ops, s, > > obj_name, > > > + mmio_range_size); > > > + > > > + /* TODO: PVG framework supports serialising device state: > > integrate it! */ > > > +} > > > + > > > +typedef struct AppleGFXMapMemoryJob { > > > + AppleGFXState *state; > > > + PGTask_t *task; > > > + uint64_t virtual_offset; > > > + PGPhysicalMemoryRange_t *ranges; > > > + uint32_t range_count; > > > + bool read_only; > > > + bool success; > > > + bool done; > > > +} AppleGFXMapMemoryJob; > > > + > > > +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t > > guest_physical, > > > + uint64_t length, > > bool read_only) > > > +{ > > > + MemoryRegion *ram_region; > > > + uintptr_t host_address; > > > + hwaddr ram_region_offset = 0; > > > + hwaddr ram_region_length = length; > > > + > > > + ram_region = address_space_translate(&address_space_memory, > > > + guest_physical, > > > + &ram_region_offset, > > > + &ram_region_length, ! > > read_only, > > > + MEMTXATTRS_UNSPECIFIED); > > > > Call memory_region_ref() so that it won't go away. > > > > > + > > > + if (!ram_region || ram_region_length < length || > > > + !memory_access_is_direct(ram_region, !read_only)) { > > > + return 0; > > > + } > > > + > > > + host_address = > > (mach_vm_address_t)memory_region_get_ram_ptr(ram_region); > > > > host_address is typed as uintptr_t, not mach_vm_address_t. 
> > > > > + if (host_address == 0) { > > > + return 0; > > > + } > > > + host_address += ram_region_offset; > > > + > > > + return host_address; > > > +} > > > + > > > +static void apple_gfx_map_memory(void *opaque) > > > +{ > > > + AppleGFXMapMemoryJob *job = opaque; > > > + AppleGFXState *s = job->state; > > > + PGTask_t *task = job->task; > > > + uint32_t range_count = job->range_count; > > > + uint64_t virtual_offset = job->virtual_offset; > > > + PGPhysicalMemoryRange_t *ranges = job->ranges; > > > + bool read_only = job->read_only; > > > + kern_return_t r; > > > + mach_vm_address_t target, source; > > > + vm_prot_t cur_protection, max_protection; > > > + bool success = true; > > > + > > > + g_assert(bql_locked()); > > > + > > > + trace_apple_gfx_map_memory(task, range_count, > > virtual_offset, read_only); > > > + for (int i = 0; i < range_count; i++) { > > > + PGPhysicalMemoryRange_t *range = &ranges[i]; > > > + > > > + target = task->address + virtual_offset; > > > + virtual_offset += range->physicalLength; > > > + > > > + trace_apple_gfx_map_memory_range(i, > range->physicalAddress, > > > + range->physicalLength); > > > + > > > + source = apple_gfx_host_address_for_gpa_range(range- > > >physicalAddress, > > > + range- > > >physicalLength, > > > + read_only); > > > + if (source == 0) { > > > + success = false; > > > + continue; > > > + } > > > + > > > + MemoryRegion* alt_mr = NULL; > > > + mach_vm_address_t alt_source = > > (mach_vm_address_t)gpa2hva(&alt_mr, range->physicalAddress, range- > > >physicalLength, NULL); > > > + g_assert(alt_source == source); > > > > Remove this; I guess this is for debugging. > > > > > + > > > + cur_protection = 0; > > > + max_protection = 0; > > > + // Map guest RAM at range->physicalAddress into PG task > > memory range > > > + r = mach_vm_remap(mach_task_self(), > > > + &target, range->physicalLength, > > vm_page_size - 1, > > > + VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE, > > > + mach_task_self(), > > > + source, false /* shared mapping, no > > copy */, > > > + &cur_protection, &max_protection, > > > + VM_INHERIT_COPY); > > > + trace_apple_gfx_remap(r, source, target); > > > + g_assert(r == KERN_SUCCESS); > > > + } > > > + > > > + qemu_mutex_lock(&s->job_mutex); > > > + job->success = success; > > > + job->done = true; > > > + qemu_cond_broadcast(&s->job_cond); > > > + qemu_mutex_unlock(&s->job_mutex); > > > +} > > > + > > > +void apple_gfx_await_bh_job(AppleGFXState *s, bool > *job_done_flag) > > > +{ > > > + qemu_mutex_lock(&s->job_mutex); > > > + while (!*job_done_flag) { > > > + qemu_cond_wait(&s->job_cond, &s->job_mutex); > > > + } > > > + qemu_mutex_unlock(&s->job_mutex); > > > +} > > > + > > > +typedef struct AppleGFXReadMemoryJob { > > > + AppleGFXState *s; > > > + hwaddr physical_address; > > > + uint64_t length; > > > + void *dst; > > > + bool done; > > > +} AppleGFXReadMemoryJob; > > > + > > > +static void apple_gfx_do_read_memory(void *opaque) > > > +{ > > > + AppleGFXReadMemoryJob *job = opaque; > > > + AppleGFXState *s = job->s; > > > + > > > + cpu_physical_memory_read(job->physical_address, job->dst, > > job->length); > > > > Use: dma_memory_read() > > > > > + > > > + qemu_mutex_lock(&s->job_mutex); > > > + job->done = true; > > > + qemu_cond_broadcast(&s->job_cond); > > > + qemu_mutex_unlock(&s->job_mutex); > > > +} > > > + > > > +static void apple_gfx_read_memory(AppleGFXState *s, hwaddr > > physical_address, > > > + uint64_t length, void *dst) > > > +{ > > > + AppleGFXReadMemoryJob job = { > > > + s, physical_address, length, dst > > 
> + }; > > > + > > > + trace_apple_gfx_read_memory(physical_address, length, dst); > > > + > > > + /* Traversing the memory map requires RCU/BQL, so do it in a > > BH. */ > > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > > apple_gfx_do_read_memory, > > > + &job); > > > + apple_gfx_await_bh_job(s, &job.done); > > > +} > > > + > > > +static void > > apple_gfx_register_task_mapping_handlers(AppleGFXState *s, > > > + > > PGDeviceDescriptor *desc) > > > +{ > > > + desc.createTask = ^(uint64_t vmSize, void * _Nullable * > > _Nonnull baseAddress) { > > > + PGTask_t *task = apple_gfx_new_task(s, vmSize); > > > + *baseAddress = (void *)task->address; > > > + trace_apple_gfx_create_task(vmSize, *baseAddress); > > > + return task; > > > + }; > > > + > > > + desc.destroyTask = ^(PGTask_t * _Nonnull task) { > > > + trace_apple_gfx_destroy_task(task); > > > + QTAILQ_REMOVE(&s->tasks, task, node); > > > + mach_vm_deallocate(mach_task_self(), task->address, > > task->len); > > > + g_free(task); > > > + }; > > > + > > > + desc.mapMemory = ^bool(PGTask_t * _Nonnull task, uint32_t > > range_count, > > > + uint64_t virtual_offset, bool read_only, > > > + PGPhysicalMemoryRange_t * _Nonnull > ranges) { > > > + AppleGFXMapMemoryJob job = { > > > + .state = s, > > > + .task = task, .ranges = ranges, .range_count = > > range_count, > > > + .read_only = read_only, .virtual_offset = > > virtual_offset, > > > + .done = false, .success = true, > > > + }; > > > + if (range_count > 0) { > > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > > > + apple_gfx_map_memory, &job); > > > + apple_gfx_await_bh_job(s, &job.done); > > > + } > > > + return job.success; > > > + }; > > > + > > > + desc.unmapMemory = ^bool(PGTask_t * _Nonnull task, uint64_t > > virtualOffset, > > > + uint64_t length) { > > > + kern_return_t r; > > > + mach_vm_address_t range_address; > > > + > > > + trace_apple_gfx_unmap_memory(task, virtualOffset, > length); > > > + > > > + /* Replace task memory range with fresh pages, undoing > > the mapping > > > + * from guest RAM. */ > > > + range_address = task->address + virtualOffset; > > > + r = mach_vm_allocate(mach_task_self(), &range_address, > > length, > > > + VM_FLAGS_FIXED | > VM_FLAGS_OVERWRITE); > > > + g_assert(r == KERN_SUCCESS);error_setg > > > > An extra error_setg > > > > > + > > > + return true; > > > + }; > > > + > > > + desc.readMemory = ^bool(uint64_t physical_address, uint64_t > > length, > > > + void * _Nonnull dst) { > > > + apple_gfx_read_memory(s, physical_address, length, dst); > > > + return true; > > > + }; > > > +} > > > + > > > +static PGDisplayDescriptor > > *apple_gfx_prepare_display_descriptor(AppleGFXState *s) > > > +{ > > > + PGDisplayDescriptor *disp_desc = [PGDisplayDescriptor new]; > > > + > > > + disp_desc.name <http://disp_desc.name> = @"QEMU display"; > > > + disp_desc.sizeInMillimeters = NSMakeSize(400., 300.); /* A > > 20" display */ > > > + disp_desc.queue = dispatch_get_main_queue(); > > > + disp_desc.newFrameEventHandler = ^(void) { > > > + trace_apple_gfx_new_frame(); > > > + dispatch_async(s->render_queue, ^{ > > > + /* Drop frames if we get too far ahead. 
*/ > > > + bql_lock(); > > > + if (s->pending_frames >= 2) { > > > + bql_unlock(); > > > + return; > > > + } > > > + ++s->pending_frames; > > > + if (s->pending_frames > 1) { > > > + bql_unlock(); > > > + return; > > > + } > > > + @autoreleasepool { > > > + apple_gfx_render_new_frame_bql_unlock(s); > > > + } > > > + }); > > > + }; > > > + disp_desc.modeChangeHandler = ^(PGDisplayCoord_t > sizeInPixels, > > > + OSType pixelFormat) { > > > + trace_apple_gfx_mode_change(sizeInPixels.x, > sizeInPixels.y); > > > + > > > + BQL_LOCK_GUARD(); > > > + set_mode(s, sizeInPixels.x, sizeInPixels.y); > > > + }; > > > + disp_desc.cursorGlyphHandler = ^(NSBitmapImageRep *glyph, > > > + PGDisplayCoord_t hotSpot) { > > > + [glyph retain]; > > > + dispatch_async(get_background_queue(), ^{ > > > + BQL_LOCK_GUARD(); > > > + uint32_t bpp = glyph.bitsPerPixel; > > > + size_t width = glyph.pixelsWide; > > > + size_t height = glyph.pixelsHigh; > > > + size_t padding_bytes_per_row = glyph.bytesPerRow - > > width * 4; > > > + const uint8_t* px_data = glyph.bitmapData; > > > + > > > + trace_apple_gfx_cursor_set(bpp, width, height); > > > + > > > + if (s->cursor) { > > > + cursor_unref(s->cursor); > > > + s->cursor = NULL; > > > + } > > > + > > > + if (bpp == 32) { /* Shouldn't be anything else, but > > just to be safe...*/ > > > + s->cursor = cursor_alloc(width, height); > > > + s->cursor->hot_x = hotSpot.x; > > > + s->cursor->hot_y = hotSpot.y; > > > + > > > + uint32_t *dest_px = s->cursor->data; > > > + > > > + for (size_t y = 0; y < height; ++y) { > > > + for (size_t x = 0; x < width; ++x) { > > > + /* NSBitmapImageRep's red & blue > > channels are swapped > > > + * compared to QEMUCursor's. */ > > > + *dest_px = > > > + (px_data[0] << 16u) | > > > + (px_data[1] << 8u) | > > > + (px_data[2] << 0u) | > > > + (px_data[3] << 24u); > > > + ++dest_px; > > > + px_data += 4; > > > + } > > > + px_data += padding_bytes_per_row; > > > + } > > > + dpy_cursor_define(s->con, s->cursor); > > > + update_cursor(s); > > > + } > > > + [glyph release]; > > > + }); > > > + }; > > > + disp_desc.cursorShowHandler = ^(BOOL show) { > > > + dispatch_async(get_background_queue(), ^{ > > > + BQL_LOCK_GUARD(); > > > + trace_apple_gfx_cursor_show(show); > > > + s->cursor_show = show; > > > + update_cursor(s); > > > + }); > > > + }; > > > + disp_desc.cursorMoveHandler = ^(void) { > > > + dispatch_async(get_background_queue(), ^{ > > > + BQL_LOCK_GUARD(); > > > + trace_apple_gfx_cursor_move(); > > > + update_cursor(s); > > > + }); > > > + }; > > > + > > > + return disp_desc; > > > +} > > > + > > > +static NSArray<PGDisplayMode*>* > > apple_gfx_prepare_display_mode_array(void) > > > +{ > > > + PGDisplayMode *modes[ARRAY_SIZE(apple_gfx_modes)]; > > > + NSArray<PGDisplayMode*>* mode_array = nil; > > > + int i; > > > + > > > + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { > > > + modes[i] = > > > + [[PGDisplayMode alloc] > > initWithSizeInPixels:apple_gfx_modes[i] refreshRateInHz:60.]; > > > + } > > > + > > > + mode_array = [NSArray arrayWithObjects:modes > > count:ARRAY_SIZE(apple_gfx_modes)]; > > > + > > > + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { > > > + [modes[i] release]; > > > + modes[i] = nil; > > > + } > > > + > > > + return mode_array; > > > +} > > > + > > > +static id<MTLDevice> copy_suitable_metal_device(void) > > > +{ > > > + id<MTLDevice> dev = nil; > > > + NSArray<id<MTLDevice>> *devs = MTLCopyAllDevices(); > > > + > > > + /* Prefer a unified memory GPU. Failing that, pick a non- > > removable GPU. 
*/ > > > + for (size_t i = 0; i < devs.count; ++i) { > > > + if (devs[i].hasUnifiedMemory) { > > > + dev = devs[i]; > > > + break; > > > + } > > > + if (!devs[i].removable) { > > > + dev = devs[i]; > > > + } > > > + } > > > + > > > + if (dev != nil) { > > > + [dev retain]; > > > + } else { > > > + dev = MTLCreateSystemDefaultDevice(); > > > + } > > > + [devs release]; > > > + > > > + return dev; > > > +} > > > + > > > +void apple_gfx_common_realize(AppleGFXState *s, > > PGDeviceDescriptor *desc, > > > + Error **errp) > > > +{ > > > + PGDisplayDescriptor *disp_desc = nil; > > > + > > > + if (apple_gfx_mig_blocker == NULL) { > > > + error_setg(&apple_gfx_mig_blocker, > > > + "Migration state blocked by apple-gfx display > > device"); > > > + if (migrate_add_blocker(&apple_gfx_mig_blocker, errp) < > 0) { > > > + return; > > > + } > > > + } > > > + > > > + QTAILQ_INIT(&s->tasks); > > > + s->render_queue = dispatch_queue_create("apple-gfx.render", > > > + > DISPATCH_QUEUE_SERIAL); > > > + s->mtl = copy_suitable_metal_device(); > > > + s->mtl_queue = [s->mtl newCommandQueue]; > > > + > > > + desc.device = s->mtl; > > > + > > > + apple_gfx_register_task_mapping_handlers(s, desc); > > > + > > > + s->pgdev = PGNewDeviceWithDescriptor(desc); > > > + > > > + disp_desc = apple_gfx_prepare_display_descriptor(s); > > > + s->pgdisp = [s->pgdev newDisplayWithDescriptor:disp_desc > > > + port:0 > > serialNum:1234]; > > > + [disp_desc release]; > > > + s->pgdisp.modeList = apple_gfx_prepare_display_mode_array(); > > > + > > > + create_fb(s); > > > + > > > + qemu_mutex_init(&s->job_mutex); > > > + qemu_cond_init(&s->job_cond); > > > +} > > > diff --git a/hw/display/meson.build b/hw/display/meson.build > > > index 20a94973fa2..619e642905a 100644 > > > --- a/hw/display/meson.build > > > +++ b/hw/display/meson.build > > > @@ -61,6 +61,10 @@ system_ss.add(when: 'CONFIG_ARTIST', if_true: > > files('artist.c')) > > > > > > system_ss.add(when: 'CONFIG_ATI_VGA', if_true: [files('ati.c', > > 'ati_2d.c', 'ati_dbg.c'), pixman]) > > > > > > +system_ss.add(when: 'CONFIG_MAC_PVG', if_true: > > [files('apple-gfx.m'), pvg, metal]) > > > +if cpu == 'aarch64' > > > + system_ss.add(when: 'CONFIG_MAC_PVG_MMIO', if_true: > > [files('apple-gfx-mmio.m'), pvg, metal]) > > > +endif > > > > > > if config_all_devices.has_key('CONFIG_VIRTIO_GPU') > > > virtio_gpu_ss = ss.source_set() > > > diff --git a/hw/display/trace-events b/hw/display/trace-events > > > index 781f8a33203..214998312b9 100644 > > > --- a/hw/display/trace-events > > > +++ b/hw/display/trace-events > > > @@ -191,3 +191,29 @@ dm163_bits_ppi(unsigned dest_width) > > "dest_width : %u" > > > dm163_leds(int led, uint32_t value) "led %d: 0x%x" > > > dm163_channels(int channel, uint8_t value) "channel %d: 0x%x" > > > dm163_refresh_rate(uint32_t rr) "refresh rate %d" > > > + > > > +# apple-gfx.m > > > +apple_gfx_read(uint64_t offset, uint64_t res) > > "offset=0x%"PRIx64" res=0x%"PRIx64 > > > +apple_gfx_write(uint64_t offset, uint64_t val) > > "offset=0x%"PRIx64" val=0x%"PRIx64 > > > +apple_gfx_create_task(uint32_t vm_size, void *va) "vm_size=0x%x > > base_addr=%p" > > > +apple_gfx_destroy_task(void *task) "task=%p" > > > +apple_gfx_map_memory(void *task, uint32_t range_count, uint64_t > > virtual_offset, uint32_t read_only) "task=%p range_count=0x%x > > virtual_offset=0x%"PRIx64" read_only=%d" > > > +apple_gfx_map_memory_range(uint32_t i, uint64_t phys_addr, > > uint64_t phys_len) "[%d] phys_addr=0x%"PRIx64" phys_len=0x%"PRIx64 > > > +apple_gfx_remap(uint64_t retval, uint64_t 
source, uint64_t > > target) "retval=%"PRId64" source=0x%"PRIx64" target=0x%"PRIx64 > > > +apple_gfx_unmap_memory(void *task, uint64_t virtual_offset, > > uint64_t length) "task=%p virtual_offset=0x%"PRIx64" > length=0x%"PRIx64 > > > +apple_gfx_read_memory(uint64_t phys_address, uint64_t length, > > void *dst) "phys_addr=0x%"PRIx64" length=0x%"PRIx64" dest=%p" > > > +apple_gfx_raise_irq(uint32_t vector) "vector=0x%x" > > > +apple_gfx_new_frame(void) "" > > > +apple_gfx_mode_change(uint64_t x, uint64_t y) "x=%"PRId64" > > y=%"PRId64 > > > +apple_gfx_cursor_set(uint32_t bpp, uint64_t width, uint64_t > > height) "bpp=%d width=%"PRId64" height=0x%"PRId64 > > > +apple_gfx_cursor_show(uint32_t show) "show=%d" > > > +apple_gfx_cursor_move(void) "" > > > +apple_gfx_common_init(const char *device_name, size_t mmio_size) > > "device: %s; MMIO size: %zu bytes" > > > + > > > +# apple-gfx-mmio.m > > > +apple_gfx_mmio_iosfc_read(uint64_t offset, uint64_t res) > > "offset=0x%"PRIx64" res=0x%"PRIx64 > > > +apple_gfx_mmio_iosfc_write(uint64_t offset, uint64_t val) > > "offset=0x%"PRIx64" val=0x%"PRIx64 > > > +apple_gfx_iosfc_map_memory(uint64_t phys, uint64_t len, uint32_t > > ro, void *va, void *e, void *f, void* va_result, int success) > > "phys=0x%"PRIx64" len=0x%"PRIx64" ro=%d va=%p e=%p f=%p -> *va=%p, > > success = %d" > > > +apple_gfx_iosfc_unmap_memory(void *a, void *b, void *c, void *d, > > void *e, void *f) "a=%p b=%p c=%p d=%p e=%p f=%p" > > > +apple_gfx_iosfc_raise_irq(uint32_t vector) "vector=0x%x" > > > + > > > diff --git a/meson.build b/meson.build > > > index d26690ce204..0e124eff13f 100644 > > > --- a/meson.build > > > +++ b/meson.build > > > @@ -761,6 +761,8 @@ socket = [] > > > version_res = [] > > > coref = [] > > > iokit = [] > > > +pvg = [] > > > +metal = [] > > > emulator_link_args = [] > > > midl = not_found > > > widl = not_found > > > @@ -782,6 +784,8 @@ elif host_os == 'darwin' > > > coref = dependency('appleframeworks', modules: > 'CoreFoundation') > > > iokit = dependency('appleframeworks', modules: 'IOKit', > > required: false) > > > host_dsosuf = '.dylib' > > > + pvg = dependency('appleframeworks', modules: > > 'ParavirtualizedGraphics') > > > + metal = dependency('appleframeworks', modules: 'Metal') > > > elif host_os == 'sunos' > > > socket = [cc.find_library('socket'), > > > cc.find_library('nsl'), > > > >
On 2024/10/26 19:24, Phil Dennis-Jordan wrote: > > > On Sat, 26 Oct 2024 at 06:40, Akihiko Odaki <akihiko.odaki@daynix.com > <mailto:akihiko.odaki@daynix.com>> wrote: > > On 2024/10/26 4:43, Phil Dennis-Jordan wrote: > > > > > > On Fri, 25 Oct 2024 at 08:03, Akihiko Odaki > <akihiko.odaki@daynix.com <mailto:akihiko.odaki@daynix.com> > > <mailto:akihiko.odaki@daynix.com > <mailto:akihiko.odaki@daynix.com>>> wrote: > > > > On 2024/10/24 19:28, Phil Dennis-Jordan wrote: > > > + /* For running PVG memory-mapping requests in the AIO > context */ > > > + QemuCond job_cond; > > > + QemuMutex job_mutex; > > > > Use: QemuEvent > > > > > > Hmm. I think if we were to use that, we would need to create a new > > QemuEvent for every job and destroy it afterward, which seems > expensive. > > We can't rule out multiple concurrent jobs being submitted, and the > > QemuEvent system only supports a single producer as far as I can > tell. > > > > You can probably sort of hack around it with just one QemuEvent by > > putting the qemu_event_wait into a loop and turning the job.done > flag > > into an atomic (because it would now need to be checked outside the > > lock) but this all seems unnecessarily complicated considering the > > QemuEvent uses the same mechanism QemuCond/QemuMutex internally > on macOS > > (the only platform relevant here), except we can use it as > intended with > > QemuCond/QemuMutex rather than having to work against the > abstraction. > > I don't think it's going to be used concurrently. It would be difficult > to reason even for the framework if it performs memory > unmapping/mapping/reading operations concurrently. > > > I've just performed a very quick test by wrapping the job submission/ > wait in the 2 mapMemory callbacks and the 1 readMemory callback with > atomic counters and logging whenever a counter went above 1. > > * Overall, concurrent callbacks across all types were common (many per > second when the VM is busy). It's not exactly a "thundering herd" (I > never saw >2) but it's probably not a bad idea to use a separate > condition variable for each job type. (task map, surface map, memory read) > * While I did not observe any concurrent memory mapping operations > *within* a type of memory map (2 task mappings or 2 surface mappings) I > did see very occasional concurrent memory *read* callbacks. These would, > as far as I can tell, not be safe with QemuEvents, unless we placed the > event inside the job struct and init/destroyed it on every callback > (which seems like excessive overhead). I think we can tolerate that overhead. init/destroy essentially sets the fields in the data structure and I estimate its total size is about 100 bytes. It is probably better than waking an irrelevant thread up. I also hope that keeps the code simple; it's not worthwhile adding code to optimize this. > > My recommendation would be to split it up into 3 pairs of mutex/cond; > this will almost entirely remove any contention, but continue to be safe > for when it does occur. I don't think QemuEvent is a realistic option > (too tricky to get right) for the observed-concurrent readMemory > callback. I'm nervous about assuming the mapMemory callbacks will NEVER > be called concurrently, but at a push I'll acquiesce to switching those > to QemuEvent in the absence of evidence of concurrency.> > PGDevice.h also notes > raiseInterrupt needs to be thread-safe while it doesn't make such notes > for memory operations. This actually makes sense. 
> > If it's ever going to be used concurrently, it's better to have > QemuEvent for each job to avoid the thundering herd problem. > > > > > > + > > > + dispatch_queue_t render_queue; > > > + /* The following fields should only be accessed from > the BQL: */ > > > > Perhaps it may be better to document fields that can be accessed > > *without* the BQL; most things in QEMU implicitly require the > BQL. > > > > > + bool gfx_update_requested; > > > + bool new_frame_ready; > > > + bool using_managed_texture_storage; > > > +} AppleGFXState; > > > + > > > +void apple_gfx_common_init(Object *obj, AppleGFXState *s, > const > > char* obj_name); > > > +void apple_gfx_common_realize(AppleGFXState *s, > > PGDeviceDescriptor *desc, > > > + Error **errp); > > > +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t > > guest_physical, > > > + uint64_t > length, > > bool read_only); > > > +void apple_gfx_await_bh_job(AppleGFXState *s, bool > *job_done_flag); > > > + > > > +#endif > > > + > > > diff --git a/hw/display/apple-gfx.m b/hw/display/apple-gfx.m > > > new file mode 100644 > > > index 00000000000..46be9957f69 > > > --- /dev/null > > > +++ b/hw/display/apple-gfx.m > > > @@ -0,0 +1,713 @@ > > > +/* > > > + * QEMU Apple ParavirtualizedGraphics.framework device > > > + * > > > + * Copyright © 2023 Amazon.com, Inc. or its affiliates. All > > Rights Reserved. > > > + * > > > + * This work is licensed under the terms of the GNU GPL, > version > > 2 or later. > > > + * See the COPYING file in the top-level directory. > > > + * > > > + * ParavirtualizedGraphics.framework is a set of > libraries that > > macOS provides > > > + * which implements 3d graphics passthrough to the host > as well as a > > > + * proprietary guest communication channel to drive it. This > > device model > > > + * implements support to drive that library from within QEMU. > > > + */ > > > + > > > +#include "qemu/osdep.h" > > > +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> > > > +#include <mach/mach_vm.h> > > > +#include "apple-gfx.h" > > > +#include "trace.h" > > > +#include "qemu-main.h" > > > +#include "exec/address-spaces.h" > > > +#include "migration/blocker.h" > > > +#include "monitor/monitor.h" > > > +#include "qemu/main-loop.h" > > > +#include "qemu/cutils.h" > > > +#include "qemu/log.h" > > > +#include "qapi/visitor.h" > > > +#include "qapi/error.h" > > > +#include "ui/console.h" > > > + > > > +static const PGDisplayCoord_t apple_gfx_modes[] = { > > > + { .x = 1440, .y = 1080 }, > > > + { .x = 1280, .y = 1024 }, > > > +}; > > > + > > > +/* This implements a type defined in > <ParavirtualizedGraphics/ > > PGDevice.h> > > > + * which is opaque from the framework's point of view. > Typedef > > PGTask_t already > > > + * exists in the framework headers. */ > > > +struct PGTask_s { > > > + QTAILQ_ENTRY(PGTask_s) node; > > > + mach_vm_address_t address; > > > + uint64_t len; > > > +}; > > > + > > > +static Error *apple_gfx_mig_blocker; > > > > This does not have to be a static variable. > > > > > > Hmm, the first 5 or so examples of migration blockers in other > devices > > etc. I could find were all declared in this way. What are you > suggesting > > as the alternative? And why not use the same pattern as in most > of the > > rest of the code base? > > I was wrong. This is better to be a static variable to ensure we won't > add the same blocker in case we have two device instances. 
> > > > > > + > > > +static void > apple_gfx_render_frame_completed(AppleGFXState *s, > > > + uint32_t width, > > uint32_t height); > > > + > > > +static inline dispatch_queue_t get_background_queue(void) > > > > Don't add inline. The only effect for modern compilers of > inline is to > > suppress the unused function warnings. > > > > > +{ > > > + return > > dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); > > > +} > > > + > > > +static PGTask_t *apple_gfx_new_task(AppleGFXState *s, > uint64_t len) > > > +{ > > > + mach_vm_address_t task_mem; > > > + PGTask_t *task; > > > + kern_return_t r; > > > + > > > + r = mach_vm_allocate(mach_task_self(), &task_mem, len, > > VM_FLAGS_ANYWHERE); > > > + if (r != KERN_SUCCESS || task_mem == 0) { > > > > Let's remove the check for task_mem == 0. We have no reason to > > reject it > > if the platform insists it allocated a memory at address 0 though > > such a > > situation should never happen in practice. > > > > > + return NULL; > > > + } > > > + > > > + task = g_new0(PGTask_t, 1); > > > + > > > + task->address = task_mem; > > > + task->len = len; > > > + QTAILQ_INSERT_TAIL(&s->tasks, task, node); > > > + > > > + return task; > > > +} > > > + > > > +typedef struct AppleGFXIOJob { > > > + AppleGFXState *state; > > > + uint64_t offset; > > > + uint64_t value; > > > + bool completed; > > > +} AppleGFXIOJob; > > > + > > > +static void apple_gfx_do_read(void *opaque) > > > +{ > > > + AppleGFXIOJob *job = opaque; > > > + job->value = [job->state->pgdev mmioReadAtOffset:job- > >offset]; > > > + qatomic_set(&job->completed, true); > > > + aio_wait_kick(); > > > +} > > > + > > > +static uint64_t apple_gfx_read(void *opaque, hwaddr offset, > > unsigned size) > > > +{ > > > + AppleGFXIOJob job = { > > > + .state = opaque, > > > + .offset = offset, > > > + .completed = false, > > > + }; > > > + AioContext *context = qemu_get_aio_context(); > > > + dispatch_queue_t queue = get_background_queue(); > > > + > > > + dispatch_async_f(queue, &job, apple_gfx_do_read); > > > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > > > + > > > + trace_apple_gfx_read(offset, job.value); > > > + return job.value; > > > +} > > > + > > > +static void apple_gfx_do_write(void *opaque) > > > +{ > > > + AppleGFXIOJob *job = opaque; > > > + [job->state->pgdev mmioWriteAtOffset:job->offset > value:job- > > >value]; > > > + qatomic_set(&job->completed, true); > > > + aio_wait_kick(); > > > +} > > > + > > > +static void apple_gfx_write(void *opaque, hwaddr offset, > > uint64_t val, > > > + unsigned size) > > > +{ > > > + /* The methods mmioReadAtOffset: and especially > > mmioWriteAtOffset: can > > > + * trigger and block on operations on other dispatch > queues, > > which in turn > > > + * may call back out on one or more of the callback > blocks. > > For this reason, > > > + * and as we are holding the BQL, we invoke the I/O > methods > > on a pool > > > + * thread and handle AIO tasks while we wait. Any work in > > the callbacks > > > + * requiring the BQL will in turn schedule BHs which this > > thread will > > > + * process while waiting. 
*/ > > > + AppleGFXIOJob job = { > > > + .state = opaque, > > > + .offset = offset, > > > + .value = val, > > > + .completed = false, > > > + }; > > > + AioContext *context = qemu_get_current_aio_context(); > > > + dispatch_queue_t queue = get_background_queue(); > > > + > > > + dispatch_async_f(queue, &job, apple_gfx_do_write); > > > + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); > > > + > > > + trace_apple_gfx_write(offset, val); > > > +} > > > + > > > +static const MemoryRegionOps apple_gfx_ops = { > > > + .read = apple_gfx_read, > > > + .write = apple_gfx_write, > > > + .endianness = DEVICE_LITTLE_ENDIAN, > > > + .valid = { > > > + .min_access_size = 4, > > > + .max_access_size = 8, > > > + }, > > > + .impl = { > > > + .min_access_size = 4, > > > + .max_access_size = 4, > > > + }, > > > +}; > > > + > > > +static void > apple_gfx_render_new_frame_bql_unlock(AppleGFXState *s) > > > +{ > > > + BOOL r; > > > + uint32_t width = surface_width(s->surface); > > > + uint32_t height = surface_height(s->surface); > > > + MTLRegion region = MTLRegionMake2D(0, 0, width, height); > > > + id<MTLCommandBuffer> command_buffer = [s->mtl_queue > > commandBuffer]; > > > + id<MTLTexture> texture = s->texture; > > > + > > > + assert(bql_locked()); > > > + [texture retain]; > > > + > > > + bql_unlock(); > > > + > > > + /* This is not safe to call from the BQL due to PVG- > internal > > locks causing > > > + * deadlocks. */ > > > + r = [s->pgdisp > encodeCurrentFrameToCommandBuffer:command_buffer > > > + texture:texture > > > + region:region]; > > > + if (!r) { > > > + [texture release]; > > > + bql_lock(); > > > + --s->pending_frames; > > > + bql_unlock(); > > > + qemu_log_mask(LOG_GUEST_ERROR, > > "apple_gfx_render_new_frame_bql_unlock: " > > > > Use: __func__ > > > > > + > > "encodeCurrentFrameToCommandBuffer:texture:region: failed\n"); > > > + return; > > > + } > > > + > > > + if (s->using_managed_texture_storage) { > > > + /* "Managed" textures exist in both VRAM and RAM and > > must be synced. */ > > > + id<MTLBlitCommandEncoder> blit = [command_buffer > > blitCommandEncoder]; > > > + [blit synchronizeResource:texture]; > > > + [blit endEncoding]; > > > + } > > > + [texture release]; > > > + [command_buffer addCompletedHandler: > > > + ^(id<MTLCommandBuffer> cb) > > > + { > > > + dispatch_async(s->render_queue, ^{ > > > + apple_gfx_render_frame_completed(s, > width, height); > > > + }); > > > + }]; > > > + [command_buffer commit]; > > > +} > > > + > > > +static void copy_mtl_texture_to_surface_mem(id<MTLTexture> > > texture, void *vram) > > > +{ > > > + /* TODO: Skip this entirely on a pure Metal or headless/ > > guest-only > > > + * rendering path, else use a blit command encoder? Needs > > careful > > > + * (double?) buffering design. */ > > > + size_t width = texture.width, height = texture.height; > > > + MTLRegion region = MTLRegionMake2D(0, 0, width, height); > > > + [texture getBytes:vram > > > + bytesPerRow:(width * 4) > > > + bytesPerImage:(width * height * 4) > > > + fromRegion:region > > > + mipmapLevel:0 > > > + slice:0]; > > > +}copy_mtl_texture_to_surface_mem > > > + > > > +static void > apple_gfx_render_frame_completed(AppleGFXState *s, > > > + uint32_t width, > > uint32_t height) > > > +{ > > > + bql_lock(); > > > + --s->pending_frames; > > > + assert(s->pending_frames >= 0); > > > + > > > + /* Only update display if mode hasn't changed since we > > started rendering. 
*/ > > > + if (width == surface_width(s->surface) && > > > + height == surface_height(s->surface)) { > > > + copy_mtl_texture_to_surface_mem(s->texture, s->vram); > > > + if (s->gfx_update_requested) { > > > + s->gfx_update_requested = false; > > > + dpy_gfx_update_full(s->con); > > > + graphic_hw_update_done(s->con); > > > + s->new_frame_ready = false; > > > + } else { > > > + s->new_frame_ready = true; > > > + } > > > + } > > > + if (s->pending_frames > 0) { > > > + apple_gfx_render_new_frame_bql_unlock(s); > > > + } else { > > > + bql_unlock(); > > > + } > > > +} > > > + > > > +static void apple_gfx_fb_update_display(void *opaque) > > > +{ > > > + AppleGFXState *s = opaque; > > > + > > > + assert(bql_locked()); > > > + if (s->new_frame_ready) { > > > + dpy_gfx_update_full(s->con); > > > + s->new_frame_ready = false; > > > + graphic_hw_update_done(s->con); > > > + } else if (s->pending_frames > 0) { > > > + s->gfx_update_requested = true; > > > + } else { > > > + graphic_hw_update_done(s->con); > > > + } > > > +} > > > + > > > +static const GraphicHwOps apple_gfx_fb_ops = { > > > + .gfx_update = apple_gfx_fb_update_display, > > > + .gfx_update_async = true, > > > +}; > > > + > > > +static void update_cursor(AppleGFXState *s) > > > +{ > > > + assert(bql_locked()); > > > + dpy_mouse_set(s->con, s->pgdisp.cursorPosition.x, > > > + s->pgdisp.cursorPosition.y, s- > >cursor_show); > > > +} > > > + > > > +static void set_mode(AppleGFXState *s, uint32_t width, > uint32_t > > height) > > > +{ > > > + MTLTextureDescriptor *textureDescriptor; > > > + > > > + if (s->surface && > > > + width == surface_width(s->surface) && > > > + height == surface_height(s->surface)) { > > > + return; > > > + } > > > + > > > + g_free(s->vram); > > > + [s->texture release]; > > > + > > > + s->vram = g_malloc0_n(width * height, 4); > > > + s->surface = qemu_create_displaysurface_from(width, > height, > > PIXMAN_LE_a8r8g8b8, > > > + width * > 4, s- > > >vram);> +> + @autoreleasepool { > > > + textureDescriptor = > > > + [MTLTextureDescriptor > > > + > > texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm > > > + width:width > > > + height:height > > > + mipmapped:NO]; > > > + textureDescriptor.usage = s- > >pgdisp.minimumTextureUsage; > > > + s->texture = [s->mtl > > newTextureWithDescriptor:textureDescriptor]; > > > > > > What about creating pixman_image_t from s- > >texture.buffer.contents? > > This > > should save memory usage by removing the duplication of texture. > > > > > > We need explicit control over when the GPU vs when the CPU may > access > > the texture - only one of them may access them at a time. As far > as I > > can tell, we can't control when the rest of Qemu might access the > > pixman_image used in the console surface? > > You are right; we need to have duplicate buffers. We can still avoid > copying by using two MTLTextures for double-buffering instead of having > a MTLTexture and a pixman_image and copying between them for > MTLStorageModeManaged. > > Do I understand correctly that you intend to swap the surface->image on > every frame, or even the surface->image->data? If so, it's my > understanding from reading the source of a bunch of UI implementations a > few weeks ago that this is neither supported nor safe, as some > implementations take long-lived references to these internal data > structures until a dpy_gfx_switch callback. And the implementations for > those callbacks are in turn very expensive in some cases. 
This is why my > conclusion in the v4 thread was that double-buffering was infeasible > with the current architecture. By the way, can't we take the BQL after encodeCurrentFrameToCommandBuffer and keep it until the completion handler? PVG requires the BQL unlocked for forward progress due to the bottom half usage in callbacks, but Metal doesn't. > > > > > > + } > > > + > > > + s->using_managed_texture_storage = > > > + (s->texture.storageMode == MTLStorageModeManaged); > > > + dpy_gfx_replace_surface(s->con, s->surface); > > > +} > > > + > > > +static void create_fb(AppleGFXState *s) > > > +{ > > > + s->con = graphic_console_init(NULL, 0, > &apple_gfx_fb_ops, s); > > > + set_mode(s, 1440, 1080); > > > + > > > + s->cursor_show = true; > > > +} > > > + > > > +static size_t apple_gfx_get_default_mmio_range_size(void) > > > +{ > > > + size_t mmio_range_size; > > > + @autoreleasepool { > > > + PGDeviceDescriptor *desc = [PGDeviceDescriptor new]; > > > + mmio_range_size = desc.mmioLength; > > > + [desc release]; > > > + } > > > + return mmio_range_size; > > > +} > > > + > > > +void apple_gfx_common_init(Object *obj, AppleGFXState *s, > const > > char* obj_name) > > > +{ > > > + size_t mmio_range_size = > > apple_gfx_get_default_mmio_range_size(); > > > + > > > + trace_apple_gfx_common_init(obj_name, mmio_range_size); > > > + memory_region_init_io(&s->iomem_gfx, obj, > &apple_gfx_ops, s, > > obj_name, > > > + mmio_range_size); > > > + > > > + /* TODO: PVG framework supports serialising device state: > > integrate it! */ > > > +} > > > + > > > +typedef struct AppleGFXMapMemoryJob { > > > + AppleGFXState *state; > > > + PGTask_t *task; > > > + uint64_t virtual_offset; > > > + PGPhysicalMemoryRange_t *ranges; > > > + uint32_t range_count; > > > + bool read_only; > > > + bool success; > > > + bool done; > > > +} AppleGFXMapMemoryJob; > > > + > > > +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t > > guest_physical, > > > + uint64_t > length, > > bool read_only) > > > +{ > > > + MemoryRegion *ram_region; > > > + uintptr_t host_address; > > > + hwaddr ram_region_offset = 0; > > > + hwaddr ram_region_length = length; > > > + > > > + ram_region = > address_space_translate(&address_space_memory, > > > + guest_physical, > > > + &ram_region_offset, > > > + > &ram_region_length, ! > > read_only, > > > + > MEMTXATTRS_UNSPECIFIED); > > > > Call memory_region_ref() so that it won't go away. > > > > > + > > > + if (!ram_region || ram_region_length < length || > > > + !memory_access_is_direct(ram_region, !read_only)) { > > > + return 0; > > > + } > > > + > > > + host_address = > > (mach_vm_address_t)memory_region_get_ram_ptr(ram_region); > > > > host_address is typed as uintptr_t, not mach_vm_address_t. 
> > > > > + if (host_address == 0) { > > > + return 0; > > > + } > > > + host_address += ram_region_offset; > > > + > > > + return host_address; > > > +} > > > + > > > +static void apple_gfx_map_memory(void *opaque) > > > +{ > > > + AppleGFXMapMemoryJob *job = opaque; > > > + AppleGFXState *s = job->state; > > > + PGTask_t *task = job->task; > > > + uint32_t range_count = job->range_count; > > > + uint64_t virtual_offset = job->virtual_offset; > > > + PGPhysicalMemoryRange_t *ranges = job->ranges; > > > + bool read_only = job->read_only; > > > + kern_return_t r; > > > + mach_vm_address_t target, source; > > > + vm_prot_t cur_protection, max_protection; > > > + bool success = true; > > > + > > > + g_assert(bql_locked()); > > > + > > > + trace_apple_gfx_map_memory(task, range_count, > > virtual_offset, read_only); > > > + for (int i = 0; i < range_count; i++) { > > > + PGPhysicalMemoryRange_t *range = &ranges[i]; > > > + > > > + target = task->address + virtual_offset; > > > + virtual_offset += range->physicalLength; > > > + > > > + trace_apple_gfx_map_memory_range(i, range- > >physicalAddress, > > > + range- > >physicalLength); > > > + > > > + source = apple_gfx_host_address_for_gpa_range(range- > > >physicalAddress, > > > + range- > > >physicalLength, > > > + > read_only); > > > + if (source == 0) { > > > + success = false; > > > + continue; > > > + } > > > + > > > + MemoryRegion* alt_mr = NULL; > > > + mach_vm_address_t alt_source = > > (mach_vm_address_t)gpa2hva(&alt_mr, range->physicalAddress, > range- > > >physicalLength, NULL); > > > + g_assert(alt_source == source); > > > > Remove this; I guess this is for debugging. > > > > > + > > > + cur_protection = 0; > > > + max_protection = 0; > > > + // Map guest RAM at range->physicalAddress into > PG task > > memory range > > > + r = mach_vm_remap(mach_task_self(), > > > + &target, range->physicalLength, > > vm_page_size - 1, > > > + VM_FLAGS_FIXED | > VM_FLAGS_OVERWRITE, > > > + mach_task_self(), > > > + source, false /* shared mapping, no > > copy */, > > > + &cur_protection, &max_protection, > > > + VM_INHERIT_COPY); > > > + trace_apple_gfx_remap(r, source, target); > > > + g_assert(r == KERN_SUCCESS); > > > + } > > > + > > > + qemu_mutex_lock(&s->job_mutex); > > > + job->success = success; > > > + job->done = true; > > > + qemu_cond_broadcast(&s->job_cond); > > > + qemu_mutex_unlock(&s->job_mutex); > > > +} > > > + > > > +void apple_gfx_await_bh_job(AppleGFXState *s, bool > *job_done_flag) > > > +{ > > > + qemu_mutex_lock(&s->job_mutex); > > > + while (!*job_done_flag) { > > > + qemu_cond_wait(&s->job_cond, &s->job_mutex); > > > + } > > > + qemu_mutex_unlock(&s->job_mutex); > > > +} > > > + > > > +typedef struct AppleGFXReadMemoryJob { > > > + AppleGFXState *s; > > > + hwaddr physical_address; > > > + uint64_t length; > > > + void *dst; > > > + bool done; > > > +} AppleGFXReadMemoryJob; > > > + > > > +static void apple_gfx_do_read_memory(void *opaque) > > > +{ > > > + AppleGFXReadMemoryJob *job = opaque; > > > + AppleGFXState *s = job->s; > > > + > > > + cpu_physical_memory_read(job->physical_address, job->dst, > > job->length); > > > > Use: dma_memory_read() > > > > > + > > > + qemu_mutex_lock(&s->job_mutex); > > > + job->done = true; > > > + qemu_cond_broadcast(&s->job_cond); > > > + qemu_mutex_unlock(&s->job_mutex); > > > +} > > > + > > > +static void apple_gfx_read_memory(AppleGFXState *s, hwaddr > > physical_address, > > > + uint64_t length, void *dst) > > > +{ > > > + AppleGFXReadMemoryJob job = { > > > + s, physical_address, 
length, dst > > > + }; > > > + > > > + trace_apple_gfx_read_memory(physical_address, length, > dst); > > > + > > > + /* Traversing the memory map requires RCU/BQL, so do > it in a > > BH. */ > > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > > apple_gfx_do_read_memory, > > > + &job); > > > + apple_gfx_await_bh_job(s, &job.done); > > > +} > > > + > > > +static void > > apple_gfx_register_task_mapping_handlers(AppleGFXState *s, > > > + > > PGDeviceDescriptor *desc) > > > +{ > > > + desc.createTask = ^(uint64_t vmSize, void * _Nullable * > > _Nonnull baseAddress) { > > > + PGTask_t *task = apple_gfx_new_task(s, vmSize); > > > + *baseAddress = (void *)task->address; > > > + trace_apple_gfx_create_task(vmSize, *baseAddress); > > > + return task; > > > + }; > > > + > > > + desc.destroyTask = ^(PGTask_t * _Nonnull task) { > > > + trace_apple_gfx_destroy_task(task); > > > + QTAILQ_REMOVE(&s->tasks, task, node); > > > + mach_vm_deallocate(mach_task_self(), task->address, > > task->len); > > > + g_free(task); > > > + }; > > > + > > > + desc.mapMemory = ^bool(PGTask_t * _Nonnull task, uint32_t > > range_count, > > > + uint64_t virtual_offset, bool > read_only, > > > + PGPhysicalMemoryRange_t * _Nonnull > ranges) { > > > + AppleGFXMapMemoryJob job = { > > > + .state = s, > > > + .task = task, .ranges = ranges, .range_count = > > range_count, > > > + .read_only = read_only, .virtual_offset = > > virtual_offset, > > > + .done = false, .success = true, > > > + }; > > > + if (range_count > 0) { > > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > > > + apple_gfx_map_memory, > &job); > > > + apple_gfx_await_bh_job(s, &job.done); > > > + } > > > + return job.success; > > > + }; > > > + > > > + desc.unmapMemory = ^bool(PGTask_t * _Nonnull task, > uint64_t > > virtualOffset, > > > + uint64_t length) { > > > + kern_return_t r; > > > + mach_vm_address_t range_address; > > > + > > > + trace_apple_gfx_unmap_memory(task, virtualOffset, > length); > > > + > > > + /* Replace task memory range with fresh pages, > undoing > > the mapping > > > + * from guest RAM. */ > > > + range_address = task->address + virtualOffset; > > > + r = mach_vm_allocate(mach_task_self(), > &range_address, > > length, > > > + VM_FLAGS_FIXED | > VM_FLAGS_OVERWRITE); > > > + g_assert(r == KERN_SUCCESS);error_setg > > > > An extra error_setg > > > > > + > > > + return true; > > > + }; > > > + > > > + desc.readMemory = ^bool(uint64_t physical_address, > uint64_t > > length, > > > + void * _Nonnull dst) { > > > + apple_gfx_read_memory(s, physical_address, > length, dst); > > > + return true; > > > + }; > > > +} > > > + > > > +static PGDisplayDescriptor > > *apple_gfx_prepare_display_descriptor(AppleGFXState *s) > > > +{ > > > + PGDisplayDescriptor *disp_desc = [PGDisplayDescriptor > new]; > > > + > > > + disp_desc.name <http://disp_desc.name> <http:// > disp_desc.name <http://disp_desc.name>> = @"QEMU display"; > > > + disp_desc.sizeInMillimeters = NSMakeSize(400., > 300.); /* A > > 20" display */ > > > + disp_desc.queue = dispatch_get_main_queue(); > > > + disp_desc.newFrameEventHandler = ^(void) { > > > + trace_apple_gfx_new_frame(); > > > + dispatch_async(s->render_queue, ^{ > > > + /* Drop frames if we get too far ahead. 
*/ > > > + bql_lock(); > > > + if (s->pending_frames >= 2) { > > > + bql_unlock(); > > > + return; > > > + } > > > + ++s->pending_frames; > > > + if (s->pending_frames > 1) { > > > + bql_unlock(); > > > + return; > > > + } > > > + @autoreleasepool { > > > + apple_gfx_render_new_frame_bql_unlock(s); > > > + } > > > + }); > > > + }; > > > + disp_desc.modeChangeHandler = ^(PGDisplayCoord_t > sizeInPixels, > > > + OSType pixelFormat) { > > > + trace_apple_gfx_mode_change(sizeInPixels.x, > sizeInPixels.y); > > > + > > > + BQL_LOCK_GUARD(); > > > + set_mode(s, sizeInPixels.x, sizeInPixels.y); > > > + }; > > > + disp_desc.cursorGlyphHandler = ^(NSBitmapImageRep *glyph, > > > + PGDisplayCoord_t > hotSpot) { > > > + [glyph retain]; > > > + dispatch_async(get_background_queue(), ^{ > > > + BQL_LOCK_GUARD(); > > > + uint32_t bpp = glyph.bitsPerPixel; > > > + size_t width = glyph.pixelsWide; > > > + size_t height = glyph.pixelsHigh; > > > + size_t padding_bytes_per_row = > glyph.bytesPerRow - > > width * 4; > > > + const uint8_t* px_data = glyph.bitmapData; > > > + > > > + trace_apple_gfx_cursor_set(bpp, width, height); > > > + > > > + if (s->cursor) { > > > + cursor_unref(s->cursor); > > > + s->cursor = NULL; > > > + } > > > + > > > + if (bpp == 32) { /* Shouldn't be anything > else, but > > just to be safe...*/ > > > + s->cursor = cursor_alloc(width, height); > > > + s->cursor->hot_x = hotSpot.x; > > > + s->cursor->hot_y = hotSpot.y; > > > + > > > + uint32_t *dest_px = s->cursor->data; > > > + > > > + for (size_t y = 0; y < height; ++y) { > > > + for (size_t x = 0; x < width; ++x) { > > > + /* NSBitmapImageRep's red & blue > > channels are swapped > > > + * compared to QEMUCursor's. */ > > > + *dest_px = > > > + (px_data[0] << 16u) | > > > + (px_data[1] << 8u) | > > > + (px_data[2] << 0u) | > > > + (px_data[3] << 24u); > > > + ++dest_px; > > > + px_data += 4; > > > + } > > > + px_data += padding_bytes_per_row; > > > + } > > > + dpy_cursor_define(s->con, s->cursor); > > > + update_cursor(s); > > > + } > > > + [glyph release]; > > > + }); > > > + }; > > > + disp_desc.cursorShowHandler = ^(BOOL show) { > > > + dispatch_async(get_background_queue(), ^{ > > > + BQL_LOCK_GUARD(); > > > + trace_apple_gfx_cursor_show(show); > > > + s->cursor_show = show; > > > + update_cursor(s); > > > + }); > > > + }; > > > + disp_desc.cursorMoveHandler = ^(void) { > > > + dispatch_async(get_background_queue(), ^{ > > > + BQL_LOCK_GUARD(); > > > + trace_apple_gfx_cursor_move(); > > > + update_cursor(s); > > > + }); > > > + }; > > > + > > > + return disp_desc; > > > +} > > > + > > > +static NSArray<PGDisplayMode*>* > > apple_gfx_prepare_display_mode_array(void) > > > +{ > > > + PGDisplayMode *modes[ARRAY_SIZE(apple_gfx_modes)]; > > > + NSArray<PGDisplayMode*>* mode_array = nil; > > > + int i; > > > + > > > + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { > > > + modes[i] = > > > + [[PGDisplayMode alloc] > > initWithSizeInPixels:apple_gfx_modes[i] refreshRateInHz:60.]; > > > + } > > > + > > > + mode_array = [NSArray arrayWithObjects:modes > > count:ARRAY_SIZE(apple_gfx_modes)]; > > > + > > > + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { > > > + [modes[i] release]; > > > + modes[i] = nil; > > > + } > > > + > > > + return mode_array; > > > +} > > > + > > > +static id<MTLDevice> copy_suitable_metal_device(void) > > > +{ > > > + id<MTLDevice> dev = nil; > > > + NSArray<id<MTLDevice>> *devs = MTLCopyAllDevices(); > > > + > > > + /* Prefer a unified memory GPU. Failing that, pick a non- > > removable GPU. 
*/ > > > + for (size_t i = 0; i < devs.count; ++i) { > > > + if (devs[i].hasUnifiedMemory) { > > > + dev = devs[i]; > > > + break; > > > + } > > > + if (!devs[i].removable) { > > > + dev = devs[i]; > > > + } > > > + } > > > + > > > + if (dev != nil) { > > > + [dev retain]; > > > + } else { > > > + dev = MTLCreateSystemDefaultDevice(); > > > + } > > > + [devs release]; > > > + > > > + return dev; > > > +} > > > + > > > +void apple_gfx_common_realize(AppleGFXState *s, > > PGDeviceDescriptor *desc, > > > + Error **errp) > > > +{ > > > + PGDisplayDescriptor *disp_desc = nil; > > > + > > > + if (apple_gfx_mig_blocker == NULL) { > > > + error_setg(&apple_gfx_mig_blocker, > > > + "Migration state blocked by apple-gfx > display > > device"); > > > + if (migrate_add_blocker(&apple_gfx_mig_blocker, > errp) < 0) { > > > + return; > > > + } > > > + } > > > + > > > + QTAILQ_INIT(&s->tasks); > > > + s->render_queue = dispatch_queue_create("apple- > gfx.render", > > > + > DISPATCH_QUEUE_SERIAL); > > > + s->mtl = copy_suitable_metal_device(); > > > + s->mtl_queue = [s->mtl newCommandQueue]; > > > + > > > + desc.device = s->mtl; > > > + > > > + apple_gfx_register_task_mapping_handlers(s, desc); > > > + > > > + s->pgdev = PGNewDeviceWithDescriptor(desc); > > > + > > > + disp_desc = apple_gfx_prepare_display_descriptor(s); > > > + s->pgdisp = [s->pgdev newDisplayWithDescriptor:disp_desc > > > + port:0 > > serialNum:1234]; > > > + [disp_desc release]; > > > + s->pgdisp.modeList = > apple_gfx_prepare_display_mode_array(); > > > + > > > + create_fb(s); > > > + > > > + qemu_mutex_init(&s->job_mutex); > > > + qemu_cond_init(&s->job_cond); > > > +} > > > diff --git a/hw/display/meson.build b/hw/display/meson.build > > > index 20a94973fa2..619e642905a 100644 > > > --- a/hw/display/meson.build > > > +++ b/hw/display/meson.build > > > @@ -61,6 +61,10 @@ system_ss.add(when: 'CONFIG_ARTIST', > if_true: > > files('artist.c')) > > > > > > system_ss.add(when: 'CONFIG_ATI_VGA', if_true: > [files('ati.c', > > 'ati_2d.c', 'ati_dbg.c'), pixman]) > > > > > > +system_ss.add(when: 'CONFIG_MAC_PVG', if_true: > > [files('apple-gfx.m'), pvg, metal]) > > > +if cpu == 'aarch64' > > > + system_ss.add(when: 'CONFIG_MAC_PVG_MMIO', if_true: > > [files('apple-gfx-mmio.m'), pvg, metal]) > > > +endif > > > > > > if config_all_devices.has_key('CONFIG_VIRTIO_GPU') > > > virtio_gpu_ss = ss.source_set() > > > diff --git a/hw/display/trace-events b/hw/display/trace-events > > > index 781f8a33203..214998312b9 100644 > > > --- a/hw/display/trace-events > > > +++ b/hw/display/trace-events > > > @@ -191,3 +191,29 @@ dm163_bits_ppi(unsigned dest_width) > > "dest_width : %u" > > > dm163_leds(int led, uint32_t value) "led %d: 0x%x" > > > dm163_channels(int channel, uint8_t value) "channel %d: > 0x%x" > > > dm163_refresh_rate(uint32_t rr) "refresh rate %d" > > > + > > > +# apple-gfx.m > > > +apple_gfx_read(uint64_t offset, uint64_t res) > > "offset=0x%"PRIx64" res=0x%"PRIx64 > > > +apple_gfx_write(uint64_t offset, uint64_t val) > > "offset=0x%"PRIx64" val=0x%"PRIx64 > > > +apple_gfx_create_task(uint32_t vm_size, void *va) > "vm_size=0x%x > > base_addr=%p" > > > +apple_gfx_destroy_task(void *task) "task=%p" > > > +apple_gfx_map_memory(void *task, uint32_t range_count, > uint64_t > > virtual_offset, uint32_t read_only) "task=%p range_count=0x%x > > virtual_offset=0x%"PRIx64" read_only=%d" > > > +apple_gfx_map_memory_range(uint32_t i, uint64_t phys_addr, > > uint64_t phys_len) "[%d] phys_addr=0x%"PRIx64" > phys_len=0x%"PRIx64 > > > 
+apple_gfx_remap(uint64_t retval, uint64_t source, uint64_t > > target) "retval=%"PRId64" source=0x%"PRIx64" target=0x%"PRIx64 > > > +apple_gfx_unmap_memory(void *task, uint64_t virtual_offset, > > uint64_t length) "task=%p virtual_offset=0x%"PRIx64" > length=0x%"PRIx64 > > > +apple_gfx_read_memory(uint64_t phys_address, uint64_t length, > > void *dst) "phys_addr=0x%"PRIx64" length=0x%"PRIx64" dest=%p" > > > +apple_gfx_raise_irq(uint32_t vector) "vector=0x%x" > > > +apple_gfx_new_frame(void) "" > > > +apple_gfx_mode_change(uint64_t x, uint64_t y) "x=%"PRId64" > > y=%"PRId64 > > > +apple_gfx_cursor_set(uint32_t bpp, uint64_t width, uint64_t > > height) "bpp=%d width=%"PRId64" height=0x%"PRId64 > > > +apple_gfx_cursor_show(uint32_t show) "show=%d" > > > +apple_gfx_cursor_move(void) "" > > > +apple_gfx_common_init(const char *device_name, size_t > mmio_size) > > "device: %s; MMIO size: %zu bytes" > > > + > > > +# apple-gfx-mmio.m > > > +apple_gfx_mmio_iosfc_read(uint64_t offset, uint64_t res) > > "offset=0x%"PRIx64" res=0x%"PRIx64 > > > +apple_gfx_mmio_iosfc_write(uint64_t offset, uint64_t val) > > "offset=0x%"PRIx64" val=0x%"PRIx64 > > > +apple_gfx_iosfc_map_memory(uint64_t phys, uint64_t len, > uint32_t > > ro, void *va, void *e, void *f, void* va_result, int success) > > "phys=0x%"PRIx64" len=0x%"PRIx64" ro=%d va=%p e=%p f=%p -> > *va=%p, > > success = %d" > > > +apple_gfx_iosfc_unmap_memory(void *a, void *b, void *c, > void *d, > > void *e, void *f) "a=%p b=%p c=%p d=%p e=%p f=%p" > > > +apple_gfx_iosfc_raise_irq(uint32_t vector) "vector=0x%x" > > > + > > > diff --git a/meson.build b/meson.build > > > index d26690ce204..0e124eff13f 100644 > > > --- a/meson.build > > > +++ b/meson.build > > > @@ -761,6 +761,8 @@ socket = [] > > > version_res = [] > > > coref = [] > > > iokit = [] > > > +pvg = [] > > > +metal = [] > > > emulator_link_args = [] > > > midl = not_found > > > widl = not_found > > > @@ -782,6 +784,8 @@ elif host_os == 'darwin' > > > coref = dependency('appleframeworks', modules: > 'CoreFoundation') > > > iokit = dependency('appleframeworks', modules: 'IOKit', > > required: false) > > > host_dsosuf = '.dylib' > > > + pvg = dependency('appleframeworks', modules: > > 'ParavirtualizedGraphics') > > > + metal = dependency('appleframeworks', modules: 'Metal') > > > elif host_os == 'sunos' > > > socket = [cc.find_library('socket'), > > > cc.find_library('nsl'), > > >
On Mon, 28 Oct 2024 at 08:42, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > On 2024/10/26 19:24, Phil Dennis-Jordan wrote: > > > > > > On Sat, 26 Oct 2024 at 06:40, Akihiko Odaki <akihiko.odaki@daynix.com > > <mailto:akihiko.odaki@daynix.com>> wrote: > > > > On 2024/10/26 4:43, Phil Dennis-Jordan wrote: > > > > > > > > > On Fri, 25 Oct 2024 at 08:03, Akihiko Odaki > > <akihiko.odaki@daynix.com <mailto:akihiko.odaki@daynix.com> > > > <mailto:akihiko.odaki@daynix.com > > <mailto:akihiko.odaki@daynix.com>>> wrote: > > > > > > On 2024/10/24 19:28, Phil Dennis-Jordan wrote: > > > > + /* For running PVG memory-mapping requests in the AIO > > context */ > > > > + QemuCond job_cond; > > > > + QemuMutex job_mutex; > > > > > > Use: QemuEvent > > > > > > > > > Hmm. I think if we were to use that, we would need to create a new > > > QemuEvent for every job and destroy it afterward, which seems > > expensive. > > > We can't rule out multiple concurrent jobs being submitted, and > the > > > QemuEvent system only supports a single producer as far as I can > > tell. > > > > > > You can probably sort of hack around it with just one QemuEvent by > > > putting the qemu_event_wait into a loop and turning the job.done > > flag > > > into an atomic (because it would now need to be checked outside > the > > > lock) but this all seems unnecessarily complicated considering the > > > QemuEvent uses the same mechanism QemuCond/QemuMutex internally > > on macOS > > > (the only platform relevant here), except we can use it as > > intended with > > > QemuCond/QemuMutex rather than having to work against the > > abstraction. > > > > I don't think it's going to be used concurrently. It would be > difficult > > to reason even for the framework if it performs memory > > unmapping/mapping/reading operations concurrently. > > > > > > I've just performed a very quick test by wrapping the job submission/ > > wait in the 2 mapMemory callbacks and the 1 readMemory callback with > > atomic counters and logging whenever a counter went above 1. > > > > * Overall, concurrent callbacks across all types were common (many per > > second when the VM is busy). It's not exactly a "thundering herd" (I > > never saw >2) but it's probably not a bad idea to use a separate > > condition variable for each job type. (task map, surface map, memory > read) > > * While I did not observe any concurrent memory mapping operations > > *within* a type of memory map (2 task mappings or 2 surface mappings) I > > did see very occasional concurrent memory *read* callbacks. These would, > > as far as I can tell, not be safe with QemuEvents, unless we placed the > > event inside the job struct and init/destroyed it on every callback > > (which seems like excessive overhead). > > I think we can tolerate that overhead. init/destroy essentially sets the > fields in the data structure and I estimate its total size is about 100 > bytes. It is probably better than waking an irrelevant thread up. I also > hope that keeps the code simple; it's not worthwhile adding code to > optimize this. > At least pthread_cond_{init,destroy} and pthread_mutex_{init,destroy} don't make any syscalls, so yeah it's probably an acceptable overhead. > > > > My recommendation would be to split it up into 3 pairs of mutex/cond; > > this will almost entirely remove any contention, but continue to be safe > > for when it does occur. I don't think QemuEvent is a realistic option > > (too tricky to get right) for the observed-concurrent readMemory > > callback. 
I'm nervous about assuming the mapMemory callbacks will NEVER > > be called concurrently, but at a push I'll acquiesce to switching those > > to QemuEvent in the absence of evidence of concurrency.> > > PGDevice.h also notes > > raiseInterrupt needs to be thread-safe while it doesn't make such > notes > > for memory operations. This actually makes sense. > > > > If it's ever going to be used concurrently, it's better to have > > QemuEvent for each job to avoid the thundering herd problem. > > > > > > > > + > > > > + dispatch_queue_t render_queue; > > > > + /* The following fields should only be accessed from > > the BQL: */ > > > > > > Perhaps it may be better to document fields that can be > accessed > > > *without* the BQL; most things in QEMU implicitly require the > > BQL. > > > > > > > + bool gfx_update_requested; > > > > + bool new_frame_ready; > > > > + bool using_managed_texture_storage; > > > > +} AppleGFXState; > > > > + > > > > +void apple_gfx_common_init(Object *obj, AppleGFXState *s, > > const > > > char* obj_name); > > > > +void apple_gfx_common_realize(AppleGFXState *s, > > > PGDeviceDescriptor *desc, > > > > + Error **errp); > > > > +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t > > > guest_physical, > > > > + uint64_t > > length, > > > bool read_only); > > > > +void apple_gfx_await_bh_job(AppleGFXState *s, bool > > *job_done_flag); > > > > + > > > > +#endif > > > > + > > > > diff --git a/hw/display/apple-gfx.m > b/hw/display/apple-gfx.m > > > > new file mode 100644 > > > > index 00000000000..46be9957f69 > > > > --- /dev/null > > > > +++ b/hw/display/apple-gfx.m > > > > @@ -0,0 +1,713 @@ > > > > +/* > > > > + * QEMU Apple ParavirtualizedGraphics.framework device > > > > + * > > > > + * Copyright © 2023 Amazon.com, Inc. or its affiliates. > All > > > Rights Reserved. > > > > + * > > > > + * This work is licensed under the terms of the GNU GPL, > > version > > > 2 or later. > > > > + * See the COPYING file in the top-level directory. > > > > + * > > > > + * ParavirtualizedGraphics.framework is a set of > > libraries that > > > macOS provides > > > > + * which implements 3d graphics passthrough to the host > > as well as a > > > > + * proprietary guest communication channel to drive it. > This > > > device model > > > > + * implements support to drive that library from within > QEMU. > > > > + */ > > > > + > > > > +#include "qemu/osdep.h" > > > > +#import > <ParavirtualizedGraphics/ParavirtualizedGraphics.h> > > > > +#include <mach/mach_vm.h> > > > > +#include "apple-gfx.h" > > > > +#include "trace.h" > > > > +#include "qemu-main.h" > > > > +#include "exec/address-spaces.h" > > > > +#include "migration/blocker.h" > > > > +#include "monitor/monitor.h" > > > > +#include "qemu/main-loop.h" > > > > +#include "qemu/cutils.h" > > > > +#include "qemu/log.h" > > > > +#include "qapi/visitor.h" > > > > +#include "qapi/error.h" > > > > +#include "ui/console.h" > > > > + > > > > +static const PGDisplayCoord_t apple_gfx_modes[] = { > > > > + { .x = 1440, .y = 1080 }, > > > > + { .x = 1280, .y = 1024 }, > > > > +}; > > > > + > > > > +/* This implements a type defined in > > <ParavirtualizedGraphics/ > > > PGDevice.h> > > > > + * which is opaque from the framework's point of view. > > Typedef > > > PGTask_t already > > > > + * exists in the framework headers. 
*/ > > > > +struct PGTask_s { > > > > + QTAILQ_ENTRY(PGTask_s) node; > > > > + mach_vm_address_t address; > > > > + uint64_t len; > > > > +}; > > > > + > > > > +static Error *apple_gfx_mig_blocker; > > > > > > This does not have to be a static variable. > > > > > > > > > Hmm, the first 5 or so examples of migration blockers in other > > devices > > > etc. I could find were all declared in this way. What are you > > suggesting > > > as the alternative? And why not use the same pattern as in most > > of the > > > rest of the code base? > > > > I was wrong. This is better to be a static variable to ensure we > won't > > add the same blocker in case we have two device instances. > > > > > > > > > + > > > > +static void > > apple_gfx_render_frame_completed(AppleGFXState *s, > > > > + uint32_t > width, > > > uint32_t height); > > > > + > > > > +static inline dispatch_queue_t get_background_queue(void) > > > > > > Don't add inline. The only effect for modern compilers of > > inline is to > > > suppress the unused function warnings. > > > > > > > +{ > > > > + return > > > dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); > > > > +} > > > > + > > > > +static PGTask_t *apple_gfx_new_task(AppleGFXState *s, > > uint64_t len) > > > > +{ > > > > + mach_vm_address_t task_mem; > > > > + PGTask_t *task; > > > > + kern_return_t r; > > > > + > > > > + r = mach_vm_allocate(mach_task_self(), &task_mem, len, > > > VM_FLAGS_ANYWHERE); > > > > + if (r != KERN_SUCCESS || task_mem == 0) { > > > > > > Let's remove the check for task_mem == 0. We have no reason to > > > reject it > > > if the platform insists it allocated a memory at address 0 > though > > > such a > > > situation should never happen in practice. > > > > > > > + return NULL; > > > > + } > > > > + > > > > + task = g_new0(PGTask_t, 1); > > > > + > > > > + task->address = task_mem; > > > > + task->len = len; > > > > + QTAILQ_INSERT_TAIL(&s->tasks, task, node); > > > > + > > > > + return task; > > > > +} > > > > + > > > > +typedef struct AppleGFXIOJob { > > > > + AppleGFXState *state; > > > > + uint64_t offset; > > > > + uint64_t value; > > > > + bool completed; > > > > +} AppleGFXIOJob; > > > > + > > > > +static void apple_gfx_do_read(void *opaque) > > > > +{ > > > > + AppleGFXIOJob *job = opaque; > > > > + job->value = [job->state->pgdev mmioReadAtOffset:job- > > >offset]; > > > > + qatomic_set(&job->completed, true); > > > > + aio_wait_kick(); > > > > +} > > > > + > > > > +static uint64_t apple_gfx_read(void *opaque, hwaddr > offset, > > > unsigned size) > > > > +{ > > > > + AppleGFXIOJob job = { > > > > + .state = opaque, > > > > + .offset = offset, > > > > + .completed = false, > > > > + }; > > > > + AioContext *context = qemu_get_aio_context(); > > > > + dispatch_queue_t queue = get_background_queue(); > > > > + > > > > + dispatch_async_f(queue, &job, apple_gfx_do_read); > > > > + AIO_WAIT_WHILE(context, > !qatomic_read(&job.completed)); > > > > + > > > > + trace_apple_gfx_read(offset, job.value); > > > > + return job.value; > > > > +} > > > > + > > > > +static void apple_gfx_do_write(void *opaque) > > > > +{ > > > > + AppleGFXIOJob *job = opaque; > > > > + [job->state->pgdev mmioWriteAtOffset:job->offset > > value:job- > > > >value]; > > > > + qatomic_set(&job->completed, true); > > > > + aio_wait_kick(); > > > > +} > > > > + > > > > +static void apple_gfx_write(void *opaque, hwaddr offset, > > > uint64_t val, > > > > + unsigned size) > > > > +{ > > > > + /* The methods mmioReadAtOffset: and especially > > > mmioWriteAtOffset: can > > 
> > + * trigger and block on operations on other dispatch > > queues, > > > which in turn > > > > + * may call back out on one or more of the callback > > blocks. > > > For this reason, > > > > + * and as we are holding the BQL, we invoke the I/O > > methods > > > on a pool > > > > + * thread and handle AIO tasks while we wait. Any > work in > > > the callbacks > > > > + * requiring the BQL will in turn schedule BHs which > this > > > thread will > > > > + * process while waiting. */ > > > > + AppleGFXIOJob job = { > > > > + .state = opaque, > > > > + .offset = offset, > > > > + .value = val, > > > > + .completed = false, > > > > + }; > > > > + AioContext *context = qemu_get_current_aio_context(); > > > > + dispatch_queue_t queue = get_background_queue(); > > > > + > > > > + dispatch_async_f(queue, &job, apple_gfx_do_write); > > > > + AIO_WAIT_WHILE(context, > !qatomic_read(&job.completed)); > > > > + > > > > + trace_apple_gfx_write(offset, val); > > > > +} > > > > + > > > > +static const MemoryRegionOps apple_gfx_ops = { > > > > + .read = apple_gfx_read, > > > > + .write = apple_gfx_write, > > > > + .endianness = DEVICE_LITTLE_ENDIAN, > > > > + .valid = { > > > > + .min_access_size = 4, > > > > + .max_access_size = 8, > > > > + }, > > > > + .impl = { > > > > + .min_access_size = 4, > > > > + .max_access_size = 4, > > > > + }, > > > > +}; > > > > + > > > > +static void > > apple_gfx_render_new_frame_bql_unlock(AppleGFXState *s) > > > > +{ > > > > + BOOL r; > > > > + uint32_t width = surface_width(s->surface); > > > > + uint32_t height = surface_height(s->surface); > > > > + MTLRegion region = MTLRegionMake2D(0, 0, width, > height); > > > > + id<MTLCommandBuffer> command_buffer = [s->mtl_queue > > > commandBuffer]; > > > > + id<MTLTexture> texture = s->texture; > > > > + > > > > + assert(bql_locked()); > > > > + [texture retain]; > > > > + > > > > + bql_unlock(); > > > > + > > > > + /* This is not safe to call from the BQL due to PVG- > > internal > > > locks causing > > > > + * deadlocks. */ > > > > + r = [s->pgdisp > > encodeCurrentFrameToCommandBuffer:command_buffer > > > > + > texture:texture > > > > + > region:region]; > > > > + if (!r) { > > > > + [texture release]; > > > > + bql_lock(); > > > > + --s->pending_frames; > > > > + bql_unlock(); > > > > + qemu_log_mask(LOG_GUEST_ERROR, > > > "apple_gfx_render_new_frame_bql_unlock: " > > > > > > Use: __func__ > > > > > > > + > > > "encodeCurrentFrameToCommandBuffer:texture:region: failed\n"); > > > > + return; > > > > + } > > > > + > > > > + if (s->using_managed_texture_storage) { > > > > + /* "Managed" textures exist in both VRAM and RAM > and > > > must be synced. */ > > > > + id<MTLBlitCommandEncoder> blit = [command_buffer > > > blitCommandEncoder]; > > > > + [blit synchronizeResource:texture]; > > > > + [blit endEncoding]; > > > > + } > > > > + [texture release]; > > > > + [command_buffer addCompletedHandler: > > > > + ^(id<MTLCommandBuffer> cb) > > > > + { > > > > + dispatch_async(s->render_queue, ^{ > > > > + apple_gfx_render_frame_completed(s, > > width, height); > > > > + }); > > > > + }]; > > > > + [command_buffer commit]; > > > > +} > > > > + > > > > +static void copy_mtl_texture_to_surface_mem(id<MTLTexture> > > > texture, void *vram) > > > > +{ > > > > + /* TODO: Skip this entirely on a pure Metal or > headless/ > > > guest-only > > > > + * rendering path, else use a blit command encoder? > Needs > > > careful > > > > + * (double?) buffering design. 
*/ > > > > + size_t width = texture.width, height = texture.height; > > > > + MTLRegion region = MTLRegionMake2D(0, 0, width, > height); > > > > + [texture getBytes:vram > > > > + bytesPerRow:(width * 4) > > > > + bytesPerImage:(width * height * 4) > > > > + fromRegion:region > > > > + mipmapLevel:0 > > > > + slice:0]; > > > > +}copy_mtl_texture_to_surface_mem > > > > + > > > > +static void > > apple_gfx_render_frame_completed(AppleGFXState *s, > > > > + uint32_t > width, > > > uint32_t height) > > > > +{ > > > > + bql_lock(); > > > > + --s->pending_frames; > > > > + assert(s->pending_frames >= 0); > > > > + > > > > + /* Only update display if mode hasn't changed since we > > > started rendering. */ > > > > + if (width == surface_width(s->surface) && > > > > + height == surface_height(s->surface)) { > > > > + copy_mtl_texture_to_surface_mem(s->texture, > s->vram); > > > > + if (s->gfx_update_requested) { > > > > + s->gfx_update_requested = false; > > > > + dpy_gfx_update_full(s->con); > > > > + graphic_hw_update_done(s->con); > > > > + s->new_frame_ready = false; > > > > + } else { > > > > + s->new_frame_ready = true; > > > > + } > > > > + } > > > > + if (s->pending_frames > 0) { > > > > + apple_gfx_render_new_frame_bql_unlock(s); > > > > + } else { > > > > + bql_unlock(); > > > > + } > > > > +} > > > > + > > > > +static void apple_gfx_fb_update_display(void *opaque) > > > > +{ > > > > + AppleGFXState *s = opaque; > > > > + > > > > + assert(bql_locked()); > > > > + if (s->new_frame_ready) { > > > > + dpy_gfx_update_full(s->con); > > > > + s->new_frame_ready = false; > > > > + graphic_hw_update_done(s->con); > > > > + } else if (s->pending_frames > 0) { > > > > + s->gfx_update_requested = true; > > > > + } else { > > > > + graphic_hw_update_done(s->con); > > > > + } > > > > +} > > > > + > > > > +static const GraphicHwOps apple_gfx_fb_ops = { > > > > + .gfx_update = apple_gfx_fb_update_display, > > > > + .gfx_update_async = true, > > > > +}; > > > > + > > > > +static void update_cursor(AppleGFXState *s) > > > > +{ > > > > + assert(bql_locked()); > > > > + dpy_mouse_set(s->con, s->pgdisp.cursorPosition.x, > > > > + s->pgdisp.cursorPosition.y, s- > > >cursor_show); > > > > +} > > > > + > > > > +static void set_mode(AppleGFXState *s, uint32_t width, > > uint32_t > > > height) > > > > +{ > > > > + MTLTextureDescriptor *textureDescriptor; > > > > + > > > > + if (s->surface && > > > > + width == surface_width(s->surface) && > > > > + height == surface_height(s->surface)) { > > > > + return; > > > > + } > > > > + > > > > + g_free(s->vram); > > > > + [s->texture release]; > > > > + > > > > + s->vram = g_malloc0_n(width * height, 4); > > > > + s->surface = qemu_create_displaysurface_from(width, > > height, > > > PIXMAN_LE_a8r8g8b8, > > > > + width * > > 4, s- > > > >vram);> +> + @autoreleasepool { > > > > + textureDescriptor = > > > > + [MTLTextureDescriptor > > > > + > > > texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm > > > > + width:width > > > > + height:height > > > > + mipmapped:NO]; > > > > + textureDescriptor.usage = s- > > >pgdisp.minimumTextureUsage; > > > > + s->texture = [s->mtl > > > newTextureWithDescriptor:textureDescriptor]; > > > > > > > > > What about creating pixman_image_t from s- > > >texture.buffer.contents? > > > This > > > should save memory usage by removing the duplication of > texture. > > > > > > > > > We need explicit control over when the GPU vs when the CPU may > > access > > > the texture - only one of them may access them at a time. 
As far > > as I > > > can tell, we can't control when the rest of Qemu might access the > > > pixman_image used in the console surface? > > > > You are right; we need to have duplicate buffers. We can still avoid > > copying by using two MTLTextures for double-buffering instead of > having > > a MTLTexture and a pixman_image and copying between them for > > MTLStorageModeManaged. > > > > Do I understand correctly that you intend to swap the surface->image on > > every frame, or even the surface->image->data? If so, it's my > > understanding from reading the source of a bunch of UI implementations a > > few weeks ago that this is neither supported nor safe, as some > > implementations take long-lived references to these internal data > > structures until a dpy_gfx_switch callback. And the implementations for > > those callbacks are in turn very expensive in some cases. This is why my > > conclusion in the v4 thread was that double-buffering was infeasible > > with the current architecture. > > By the way, can't we take the BQL after > encodeCurrentFrameToCommandBuffer and keep it until the completion > handler? PVG requires the BQL unlocked for forward progress due to the > bottom half usage in callbacks, but Metal doesn't. > What would be the advantage of this? Also, if you're suggesting unlocking the BQL *inside* the completion handler: I'm pretty sure locking on one thread and unlocking on another is not supported. > > > > > > > > > + } > > > > + > > > > + s->using_managed_texture_storage = > > > > + (s->texture.storageMode == MTLStorageModeManaged); > > > > + dpy_gfx_replace_surface(s->con, s->surface); > > > > +} > > > > + > > > > +static void create_fb(AppleGFXState *s) > > > > +{ > > > > + s->con = graphic_console_init(NULL, 0, > > &apple_gfx_fb_ops, s); > > > > + set_mode(s, 1440, 1080); > > > > + > > > > + s->cursor_show = true; > > > > +} > > > > + > > > > +static size_t apple_gfx_get_default_mmio_range_size(void) > > > > +{ > > > > + size_t mmio_range_size; > > > > + @autoreleasepool { > > > > + PGDeviceDescriptor *desc = [PGDeviceDescriptor > new]; > > > > + mmio_range_size = desc.mmioLength; > > > > + [desc release]; > > > > + } > > > > + return mmio_range_size; > > > > +} > > > > + > > > > +void apple_gfx_common_init(Object *obj, AppleGFXState *s, > > const > > > char* obj_name) > > > > +{ > > > > + size_t mmio_range_size = > > > apple_gfx_get_default_mmio_range_size(); > > > > + > > > > + trace_apple_gfx_common_init(obj_name, > mmio_range_size); > > > > + memory_region_init_io(&s->iomem_gfx, obj, > > &apple_gfx_ops, s, > > > obj_name, > > > > + mmio_range_size); > > > > + > > > > + /* TODO: PVG framework supports serialising device > state: > > > integrate it! 
*/ > > > > +} > > > > + > > > > +typedef struct AppleGFXMapMemoryJob { > > > > + AppleGFXState *state; > > > > + PGTask_t *task; > > > > + uint64_t virtual_offset; > > > > + PGPhysicalMemoryRange_t *ranges; > > > > + uint32_t range_count; > > > > + bool read_only; > > > > + bool success; > > > > + bool done; > > > > +} AppleGFXMapMemoryJob; > > > > + > > > > +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t > > > guest_physical, > > > > + uint64_t > > length, > > > bool read_only) > > > > +{ > > > > + MemoryRegion *ram_region; > > > > + uintptr_t host_address; > > > > + hwaddr ram_region_offset = 0; > > > > + hwaddr ram_region_length = length; > > > > + > > > > + ram_region = > > address_space_translate(&address_space_memory, > > > > + guest_physical, > > > > + > &ram_region_offset, > > > > + > > &ram_region_length, ! > > > read_only, > > > > + > > MEMTXATTRS_UNSPECIFIED); > > > > > > Call memory_region_ref() so that it won't go away. > > > > > > > + > > > > + if (!ram_region || ram_region_length < length || > > > > + !memory_access_is_direct(ram_region, !read_only)) > { > > > > + return 0; > > > > + } > > > > + > > > > + host_address = > > > (mach_vm_address_t)memory_region_get_ram_ptr(ram_region); > > > > > > host_address is typed as uintptr_t, not mach_vm_address_t. > > > > > > > + if (host_address == 0) { > > > > + return 0; > > > > + } > > > > + host_address += ram_region_offset; > > > > + > > > > + return host_address; > > > > +} > > > > + > > > > +static void apple_gfx_map_memory(void *opaque) > > > > +{ > > > > + AppleGFXMapMemoryJob *job = opaque; > > > > + AppleGFXState *s = job->state; > > > > + PGTask_t *task = job->task; > > > > + uint32_t range_count = job->range_count; > > > > + uint64_t virtual_offset = job->virtual_offset; > > > > + PGPhysicalMemoryRange_t *ranges = job->ranges; > > > > + bool read_only = job->read_only; > > > > + kern_return_t r; > > > > + mach_vm_address_t target, source; > > > > + vm_prot_t cur_protection, max_protection; > > > > + bool success = true; > > > > + > > > > + g_assert(bql_locked()); > > > > + > > > > + trace_apple_gfx_map_memory(task, range_count, > > > virtual_offset, read_only); > > > > + for (int i = 0; i < range_count; i++) { > > > > + PGPhysicalMemoryRange_t *range = &ranges[i]; > > > > + > > > > + target = task->address + virtual_offset; > > > > + virtual_offset += range->physicalLength; > > > > + > > > > + trace_apple_gfx_map_memory_range(i, range- > > >physicalAddress, > > > > + range- > > >physicalLength); > > > > + > > > > + source = > apple_gfx_host_address_for_gpa_range(range- > > > >physicalAddress, > > > > + > range- > > > >physicalLength, > > > > + > > read_only); > > > > + if (source == 0) { > > > > + success = false; > > > > + continue; > > > > + } > > > > + > > > > + MemoryRegion* alt_mr = NULL; > > > > + mach_vm_address_t alt_source = > > > (mach_vm_address_t)gpa2hva(&alt_mr, range->physicalAddress, > > range- > > > >physicalLength, NULL); > > > > + g_assert(alt_source == source); > > > > > > Remove this; I guess this is for debugging. 
> > > > > > > + > > > > + cur_protection = 0; > > > > + max_protection = 0; > > > > + // Map guest RAM at range->physicalAddress into > > PG task > > > memory range > > > > + r = mach_vm_remap(mach_task_self(), > > > > + &target, range->physicalLength, > > > vm_page_size - 1, > > > > + VM_FLAGS_FIXED | > > VM_FLAGS_OVERWRITE, > > > > + mach_task_self(), > > > > + source, false /* shared > mapping, no > > > copy */, > > > > + &cur_protection, > &max_protection, > > > > + VM_INHERIT_COPY); > > > > + trace_apple_gfx_remap(r, source, target); > > > > + g_assert(r == KERN_SUCCESS); > > > > + } > > > > + > > > > + qemu_mutex_lock(&s->job_mutex); > > > > + job->success = success; > > > > + job->done = true; > > > > + qemu_cond_broadcast(&s->job_cond); > > > > + qemu_mutex_unlock(&s->job_mutex); > > > > +} > > > > + > > > > +void apple_gfx_await_bh_job(AppleGFXState *s, bool > > *job_done_flag) > > > > +{ > > > > + qemu_mutex_lock(&s->job_mutex); > > > > + while (!*job_done_flag) { > > > > + qemu_cond_wait(&s->job_cond, &s->job_mutex); > > > > + } > > > > + qemu_mutex_unlock(&s->job_mutex); > > > > +} > > > > + > > > > +typedef struct AppleGFXReadMemoryJob { > > > > + AppleGFXState *s; > > > > + hwaddr physical_address; > > > > + uint64_t length; > > > > + void *dst; > > > > + bool done; > > > > +} AppleGFXReadMemoryJob; > > > > + > > > > +static void apple_gfx_do_read_memory(void *opaque) > > > > +{ > > > > + AppleGFXReadMemoryJob *job = opaque; > > > > + AppleGFXState *s = job->s; > > > > + > > > > + cpu_physical_memory_read(job->physical_address, > job->dst, > > > job->length); > > > > > > Use: dma_memory_read() > > > > > > > + > > > > + qemu_mutex_lock(&s->job_mutex); > > > > + job->done = true; > > > > + qemu_cond_broadcast(&s->job_cond); > > > > + qemu_mutex_unlock(&s->job_mutex); > > > > +} > > > > + > > > > +static void apple_gfx_read_memory(AppleGFXState *s, hwaddr > > > physical_address, > > > > + uint64_t length, void > *dst) > > > > +{ > > > > + AppleGFXReadMemoryJob job = { > > > > + s, physical_address, length, dst > > > > + }; > > > > + > > > > + trace_apple_gfx_read_memory(physical_address, length, > > dst); > > > > + > > > > + /* Traversing the memory map requires RCU/BQL, so do > > it in a > > > BH. 
*/ > > > > + aio_bh_schedule_oneshot(qemu_get_aio_context(), > > > apple_gfx_do_read_memory, > > > > + &job); > > > > + apple_gfx_await_bh_job(s, &job.done); > > > > +} > > > > + > > > > +static void > > > apple_gfx_register_task_mapping_handlers(AppleGFXState *s, > > > > + > > > PGDeviceDescriptor *desc) > > > > +{ > > > > + desc.createTask = ^(uint64_t vmSize, void * _Nullable > * > > > _Nonnull baseAddress) { > > > > + PGTask_t *task = apple_gfx_new_task(s, vmSize); > > > > + *baseAddress = (void *)task->address; > > > > + trace_apple_gfx_create_task(vmSize, *baseAddress); > > > > + return task; > > > > + }; > > > > + > > > > + desc.destroyTask = ^(PGTask_t * _Nonnull task) { > > > > + trace_apple_gfx_destroy_task(task); > > > > + QTAILQ_REMOVE(&s->tasks, task, node); > > > > + mach_vm_deallocate(mach_task_self(), > task->address, > > > task->len); > > > > + g_free(task); > > > > + }; > > > > + > > > > + desc.mapMemory = ^bool(PGTask_t * _Nonnull task, > uint32_t > > > range_count, > > > > + uint64_t virtual_offset, bool > > read_only, > > > > + PGPhysicalMemoryRange_t * _Nonnull > > ranges) { > > > > + AppleGFXMapMemoryJob job = { > > > > + .state = s, > > > > + .task = task, .ranges = ranges, .range_count = > > > range_count, > > > > + .read_only = read_only, .virtual_offset = > > > virtual_offset, > > > > + .done = false, .success = true, > > > > + }; > > > > + if (range_count > 0) { > > > > + > aio_bh_schedule_oneshot(qemu_get_aio_context(), > > > > + apple_gfx_map_memory, > > &job); > > > > + apple_gfx_await_bh_job(s, &job.done); > > > > + } > > > > + return job.success; > > > > + }; > > > > + > > > > + desc.unmapMemory = ^bool(PGTask_t * _Nonnull task, > > uint64_t > > > virtualOffset, > > > > + uint64_t length) { > > > > + kern_return_t r; > > > > + mach_vm_address_t range_address; > > > > + > > > > + trace_apple_gfx_unmap_memory(task, virtualOffset, > > length); > > > > + > > > > + /* Replace task memory range with fresh pages, > > undoing > > > the mapping > > > > + * from guest RAM. */ > > > > + range_address = task->address + virtualOffset; > > > > + r = mach_vm_allocate(mach_task_self(), > > &range_address, > > > length, > > > > + VM_FLAGS_FIXED | > > VM_FLAGS_OVERWRITE); > > > > + g_assert(r == KERN_SUCCESS);error_setg > > > > > > An extra error_setg > > > > > > > + > > > > + return true; > > > > + }; > > > > + > > > > + desc.readMemory = ^bool(uint64_t physical_address, > > uint64_t > > > length, > > > > + void * _Nonnull dst) { > > > > + apple_gfx_read_memory(s, physical_address, > > length, dst); > > > > + return true; > > > > + }; > > > > +} > > > > + > > > > +static PGDisplayDescriptor > > > *apple_gfx_prepare_display_descriptor(AppleGFXState *s) > > > > +{ > > > > + PGDisplayDescriptor *disp_desc = [PGDisplayDescriptor > > new]; > > > > + > > > > + disp_desc.name <http://disp_desc.name> <http:// > > disp_desc.name <http://disp_desc.name>> = @"QEMU display"; > > > > + disp_desc.sizeInMillimeters = NSMakeSize(400., > > 300.); /* A > > > 20" display */ > > > > + disp_desc.queue = dispatch_get_main_queue(); > > > > + disp_desc.newFrameEventHandler = ^(void) { > > > > + trace_apple_gfx_new_frame(); > > > > + dispatch_async(s->render_queue, ^{ > > > > + /* Drop frames if we get too far ahead. 
*/ > > > > + bql_lock(); > > > > + if (s->pending_frames >= 2) { > > > > + bql_unlock(); > > > > + return; > > > > + } > > > > + ++s->pending_frames; > > > > + if (s->pending_frames > 1) { > > > > + bql_unlock(); > > > > + return; > > > > + } > > > > + @autoreleasepool { > > > > + apple_gfx_render_new_frame_bql_unlock(s); > > > > + } > > > > + }); > > > > + }; > > > > + disp_desc.modeChangeHandler = ^(PGDisplayCoord_t > > sizeInPixels, > > > > + OSType pixelFormat) { > > > > + trace_apple_gfx_mode_change(sizeInPixels.x, > > sizeInPixels.y); > > > > + > > > > + BQL_LOCK_GUARD(); > > > > + set_mode(s, sizeInPixels.x, sizeInPixels.y); > > > > + }; > > > > + disp_desc.cursorGlyphHandler = ^(NSBitmapImageRep > *glyph, > > > > + PGDisplayCoord_t > > hotSpot) { > > > > + [glyph retain]; > > > > + dispatch_async(get_background_queue(), ^{ > > > > + BQL_LOCK_GUARD(); > > > > + uint32_t bpp = glyph.bitsPerPixel; > > > > + size_t width = glyph.pixelsWide; > > > > + size_t height = glyph.pixelsHigh; > > > > + size_t padding_bytes_per_row = > > glyph.bytesPerRow - > > > width * 4; > > > > + const uint8_t* px_data = glyph.bitmapData; > > > > + > > > > + trace_apple_gfx_cursor_set(bpp, width, > height); > > > > + > > > > + if (s->cursor) { > > > > + cursor_unref(s->cursor); > > > > + s->cursor = NULL; > > > > + } > > > > + > > > > + if (bpp == 32) { /* Shouldn't be anything > > else, but > > > just to be safe...*/ > > > > + s->cursor = cursor_alloc(width, height); > > > > + s->cursor->hot_x = hotSpot.x; > > > > + s->cursor->hot_y = hotSpot.y; > > > > + > > > > + uint32_t *dest_px = s->cursor->data; > > > > + > > > > + for (size_t y = 0; y < height; ++y) { > > > > + for (size_t x = 0; x < width; ++x) { > > > > + /* NSBitmapImageRep's red & blue > > > channels are swapped > > > > + * compared to QEMUCursor's. 
*/ > > > > + *dest_px = > > > > + (px_data[0] << 16u) | > > > > + (px_data[1] << 8u) | > > > > + (px_data[2] << 0u) | > > > > + (px_data[3] << 24u); > > > > + ++dest_px; > > > > + px_data += 4; > > > > + } > > > > + px_data += padding_bytes_per_row; > > > > + } > > > > + dpy_cursor_define(s->con, s->cursor); > > > > + update_cursor(s); > > > > + } > > > > + [glyph release]; > > > > + }); > > > > + }; > > > > + disp_desc.cursorShowHandler = ^(BOOL show) { > > > > + dispatch_async(get_background_queue(), ^{ > > > > + BQL_LOCK_GUARD(); > > > > + trace_apple_gfx_cursor_show(show); > > > > + s->cursor_show = show; > > > > + update_cursor(s); > > > > + }); > > > > + }; > > > > + disp_desc.cursorMoveHandler = ^(void) { > > > > + dispatch_async(get_background_queue(), ^{ > > > > + BQL_LOCK_GUARD(); > > > > + trace_apple_gfx_cursor_move(); > > > > + update_cursor(s); > > > > + }); > > > > + }; > > > > + > > > > + return disp_desc; > > > > +} > > > > + > > > > +static NSArray<PGDisplayMode*>* > > > apple_gfx_prepare_display_mode_array(void) > > > > +{ > > > > + PGDisplayMode *modes[ARRAY_SIZE(apple_gfx_modes)]; > > > > + NSArray<PGDisplayMode*>* mode_array = nil; > > > > + int i; > > > > + > > > > + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { > > > > + modes[i] = > > > > + [[PGDisplayMode alloc] > > > initWithSizeInPixels:apple_gfx_modes[i] refreshRateInHz:60.]; > > > > + } > > > > + > > > > + mode_array = [NSArray arrayWithObjects:modes > > > count:ARRAY_SIZE(apple_gfx_modes)]; > > > > + > > > > + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { > > > > + [modes[i] release]; > > > > + modes[i] = nil; > > > > + } > > > > + > > > > + return mode_array; > > > > +} > > > > + > > > > +static id<MTLDevice> copy_suitable_metal_device(void) > > > > +{ > > > > + id<MTLDevice> dev = nil; > > > > + NSArray<id<MTLDevice>> *devs = MTLCopyAllDevices(); > > > > + > > > > + /* Prefer a unified memory GPU. Failing that, pick a > non- > > > removable GPU. 
*/ > > > > + for (size_t i = 0; i < devs.count; ++i) { > > > > + if (devs[i].hasUnifiedMemory) { > > > > + dev = devs[i]; > > > > + break; > > > > + } > > > > + if (!devs[i].removable) { > > > > + dev = devs[i]; > > > > + } > > > > + } > > > > + > > > > + if (dev != nil) { > > > > + [dev retain]; > > > > + } else { > > > > + dev = MTLCreateSystemDefaultDevice(); > > > > + } > > > > + [devs release]; > > > > + > > > > + return dev; > > > > +} > > > > + > > > > +void apple_gfx_common_realize(AppleGFXState *s, > > > PGDeviceDescriptor *desc, > > > > + Error **errp) > > > > +{ > > > > + PGDisplayDescriptor *disp_desc = nil; > > > > + > > > > + if (apple_gfx_mig_blocker == NULL) { > > > > + error_setg(&apple_gfx_mig_blocker, > > > > + "Migration state blocked by apple-gfx > > display > > > device"); > > > > + if (migrate_add_blocker(&apple_gfx_mig_blocker, > > errp) < 0) { > > > > + return; > > > > + } > > > > + } > > > > + > > > > + QTAILQ_INIT(&s->tasks); > > > > + s->render_queue = dispatch_queue_create("apple- > > gfx.render", > > > > + > > DISPATCH_QUEUE_SERIAL); > > > > + s->mtl = copy_suitable_metal_device(); > > > > + s->mtl_queue = [s->mtl newCommandQueue]; > > > > + > > > > + desc.device = s->mtl; > > > > + > > > > + apple_gfx_register_task_mapping_handlers(s, desc); > > > > + > > > > + s->pgdev = PGNewDeviceWithDescriptor(desc); > > > > + > > > > + disp_desc = apple_gfx_prepare_display_descriptor(s); > > > > + s->pgdisp = [s->pgdev > newDisplayWithDescriptor:disp_desc > > > > + port:0 > > > serialNum:1234]; > > > > + [disp_desc release]; > > > > + s->pgdisp.modeList = > > apple_gfx_prepare_display_mode_array(); > > > > + > > > > + create_fb(s); > > > > + > > > > + qemu_mutex_init(&s->job_mutex); > > > > + qemu_cond_init(&s->job_cond); > > > > +} > > > > diff --git a/hw/display/meson.build > b/hw/display/meson.build > > > > index 20a94973fa2..619e642905a 100644 > > > > --- a/hw/display/meson.build > > > > +++ b/hw/display/meson.build > > > > @@ -61,6 +61,10 @@ system_ss.add(when: 'CONFIG_ARTIST', > > if_true: > > > files('artist.c')) > > > > > > > > system_ss.add(when: 'CONFIG_ATI_VGA', if_true: > > [files('ati.c', > > > 'ati_2d.c', 'ati_dbg.c'), pixman]) > > > > > > > > +system_ss.add(when: 'CONFIG_MAC_PVG', if_true: > > > [files('apple-gfx.m'), pvg, metal]) > > > > +if cpu == 'aarch64' > > > > + system_ss.add(when: 'CONFIG_MAC_PVG_MMIO', if_true: > > > [files('apple-gfx-mmio.m'), pvg, metal]) > > > > +endif > > > > > > > > if config_all_devices.has_key('CONFIG_VIRTIO_GPU') > > > > virtio_gpu_ss = ss.source_set() > > > > diff --git a/hw/display/trace-events > b/hw/display/trace-events > > > > index 781f8a33203..214998312b9 100644 > > > > --- a/hw/display/trace-events > > > > +++ b/hw/display/trace-events > > > > @@ -191,3 +191,29 @@ dm163_bits_ppi(unsigned dest_width) > > > "dest_width : %u" > > > > dm163_leds(int led, uint32_t value) "led %d: 0x%x" > > > > dm163_channels(int channel, uint8_t value) "channel %d: > > 0x%x" > > > > dm163_refresh_rate(uint32_t rr) "refresh rate %d" > > > > + > > > > +# apple-gfx.m > > > > +apple_gfx_read(uint64_t offset, uint64_t res) > > > "offset=0x%"PRIx64" res=0x%"PRIx64 > > > > +apple_gfx_write(uint64_t offset, uint64_t val) > > > "offset=0x%"PRIx64" val=0x%"PRIx64 > > > > +apple_gfx_create_task(uint32_t vm_size, void *va) > > "vm_size=0x%x > > > base_addr=%p" > > > > +apple_gfx_destroy_task(void *task) "task=%p" > > > > +apple_gfx_map_memory(void *task, uint32_t range_count, > > uint64_t > > > virtual_offset, uint32_t read_only) "task=%p 
range_count=0x%x > > > virtual_offset=0x%"PRIx64" read_only=%d" > > > > +apple_gfx_map_memory_range(uint32_t i, uint64_t phys_addr, > > > uint64_t phys_len) "[%d] phys_addr=0x%"PRIx64" > > phys_len=0x%"PRIx64 > > > > +apple_gfx_remap(uint64_t retval, uint64_t source, uint64_t > > > target) "retval=%"PRId64" source=0x%"PRIx64" target=0x%"PRIx64 > > > > +apple_gfx_unmap_memory(void *task, uint64_t > virtual_offset, > > > uint64_t length) "task=%p virtual_offset=0x%"PRIx64" > > length=0x%"PRIx64 > > > > +apple_gfx_read_memory(uint64_t phys_address, uint64_t > length, > > > void *dst) "phys_addr=0x%"PRIx64" length=0x%"PRIx64" dest=%p" > > > > +apple_gfx_raise_irq(uint32_t vector) "vector=0x%x" > > > > +apple_gfx_new_frame(void) "" > > > > +apple_gfx_mode_change(uint64_t x, uint64_t y) "x=%"PRId64" > > > y=%"PRId64 > > > > +apple_gfx_cursor_set(uint32_t bpp, uint64_t width, > uint64_t > > > height) "bpp=%d width=%"PRId64" height=0x%"PRId64 > > > > +apple_gfx_cursor_show(uint32_t show) "show=%d" > > > > +apple_gfx_cursor_move(void) "" > > > > +apple_gfx_common_init(const char *device_name, size_t > > mmio_size) > > > "device: %s; MMIO size: %zu bytes" > > > > + > > > > +# apple-gfx-mmio.m > > > > +apple_gfx_mmio_iosfc_read(uint64_t offset, uint64_t res) > > > "offset=0x%"PRIx64" res=0x%"PRIx64 > > > > +apple_gfx_mmio_iosfc_write(uint64_t offset, uint64_t val) > > > "offset=0x%"PRIx64" val=0x%"PRIx64 > > > > +apple_gfx_iosfc_map_memory(uint64_t phys, uint64_t len, > > uint32_t > > > ro, void *va, void *e, void *f, void* va_result, int success) > > > "phys=0x%"PRIx64" len=0x%"PRIx64" ro=%d va=%p e=%p f=%p -> > > *va=%p, > > > success = %d" > > > > +apple_gfx_iosfc_unmap_memory(void *a, void *b, void *c, > > void *d, > > > void *e, void *f) "a=%p b=%p c=%p d=%p e=%p f=%p" > > > > +apple_gfx_iosfc_raise_irq(uint32_t vector) "vector=0x%x" > > > > + > > > > diff --git a/meson.build b/meson.build > > > > index d26690ce204..0e124eff13f 100644 > > > > --- a/meson.build > > > > +++ b/meson.build > > > > @@ -761,6 +761,8 @@ socket = [] > > > > version_res = [] > > > > coref = [] > > > > iokit = [] > > > > +pvg = [] > > > > +metal = [] > > > > emulator_link_args = [] > > > > midl = not_found > > > > widl = not_found > > > > @@ -782,6 +784,8 @@ elif host_os == 'darwin' > > > > coref = dependency('appleframeworks', modules: > > 'CoreFoundation') > > > > iokit = dependency('appleframeworks', modules: 'IOKit', > > > required: false) > > > > host_dsosuf = '.dylib' > > > > + pvg = dependency('appleframeworks', modules: > > > 'ParavirtualizedGraphics') > > > > + metal = dependency('appleframeworks', modules: 'Metal') > > > > elif host_os == 'sunos' > > > > socket = [cc.find_library('socket'), > > > > cc.find_library('nsl'), > > > > > > >
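As a point of reference for the readMemory review note above ("Use: dma_memory_read()"), the BH could plausibly look something like the following. This is only a sketch: the success field on the read job is an invented detail and not part of the v4 patch.

#include "qemu/osdep.h"
#include "sysemu/dma.h"
#include "exec/address-spaces.h"

/* Illustrative only: readMemory BH using dma_memory_read() instead of
 * cpu_physical_memory_read(), per the review comment above. */
static void apple_gfx_do_read_memory(void *opaque)
{
    AppleGFXReadMemoryJob *job = opaque;
    AppleGFXState *s = job->s;
    MemTxResult res;

    res = dma_memory_read(&address_space_memory, job->physical_address,
                          job->dst, job->length, MEMTXATTRS_UNSPECIFIED);
    job->success = (res == MEMTX_OK); /* hypothetical field; the v4 read
                                         job only carries a 'done' flag */

    qemu_mutex_lock(&s->job_mutex);
    job->done = true;
    qemu_cond_broadcast(&s->job_cond);
    qemu_mutex_unlock(&s->job_mutex);
}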
On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <phil@philjordan.eu> wrote: > > > > >> > > Hmm. I think if we were to use that, we would need to create a >> new >> > > QemuEvent for every job and destroy it afterward, which seems >> > expensive. >> > > We can't rule out multiple concurrent jobs being submitted, and >> the >> > > QemuEvent system only supports a single producer as far as I can >> > tell. >> > > >> > > You can probably sort of hack around it with just one QemuEvent >> by >> > > putting the qemu_event_wait into a loop and turning the job.done >> > flag >> > > into an atomic (because it would now need to be checked outside >> the >> > > lock) but this all seems unnecessarily complicated considering >> the >> > > QemuEvent uses the same mechanism QemuCond/QemuMutex internally >> > on macOS >> > > (the only platform relevant here), except we can use it as >> > intended with >> > > QemuCond/QemuMutex rather than having to work against the >> > abstraction. >> > >> > I don't think it's going to be used concurrently. It would be >> difficult >> > to reason even for the framework if it performs memory >> > unmapping/mapping/reading operations concurrently. >> > >> > >> > I've just performed a very quick test by wrapping the job submission/ >> > wait in the 2 mapMemory callbacks and the 1 readMemory callback with >> > atomic counters and logging whenever a counter went above 1. >> > >> > * Overall, concurrent callbacks across all types were common (many >> per >> > second when the VM is busy). It's not exactly a "thundering herd" (I >> > never saw >2) but it's probably not a bad idea to use a separate >> > condition variable for each job type. (task map, surface map, memory >> read) >> > * While I did not observe any concurrent memory mapping operations >> > *within* a type of memory map (2 task mappings or 2 surface mappings) I >> > did see very occasional concurrent memory *read* callbacks. These >> would, >> > as far as I can tell, not be safe with QemuEvents, unless we placed the >> > event inside the job struct and init/destroyed it on every callback >> > (which seems like excessive overhead). >> >> I think we can tolerate that overhead. init/destroy essentially sets the >> fields in the data structure and I estimate its total size is about 100 >> bytes. It is probably better than waking an irrelevant thread up. I also >> hope that keeps the code simple; it's not worthwhile adding code to >> optimize this. >> > > At least pthread_cond_{init,destroy} and pthread_mutex_{init,destroy} > don't make any syscalls, so yeah it's probably an acceptable overhead. > I've just experimented with QemuEvents created on-demand and ran into some weird deadlocks, which then made me sit down and think about it some more. I've come to the conclusion that creating (and crucially, destroying) QemuEvents on demand in this way is not safe. Specifically, you must not call qemu_event_destroy() - which transitively destroys the mutex and condition variable - unless you can guarantee that the qemu_event_set() call on that event object has completed. In qemu_event_set, the event object's value is atomically set to EV_SET. If the previous value was EV_BUSY, qemu_futex_wake() is called. All of this is outside any mutex, however, so apart from memory coherence (there are barriers) this can race with the waiting thread. qemu_event_wait() reads the event's value. If EV_FREE, it's atomically set to EV_BUSY. 
Then the mutex is locked, the value is checked again, and if it's still EV_BUSY, it waits for the condition variable, otherwise the mutex is immediately unlocked again. If the trigger thread's qemu_event_set() flip to EV_SET occurs between the waiting thread's two atomic reads of the value, the waiting thread will never wait for the condition variable, but the trigger thread WILL try to acquire the mutex and signal the condition variable in qemu_futex_wake(), by which time the waiting thread may have advanced outside of qemu_event_wait(). This is all fine usually, BUT if you destroy the QemuEvent immediately after the qemu_event_wait() call, qemu_futex_wake() may try to lock a mutex that has been destroyed, or signal a condition variable which has been destroyed. I don't see a reasonable way of making this safe other than using long-lived mutexes and condition variables. And anyway, we have much, MUCH bigger contention/performance issues coming from almost everything being covered by the BQL. (If waking these callbacks can even be considered an issue: I haven't seen it show up in profiling, whereas BQL contention very much does.) I'll submit v5 of this patch set with separate condition variables for each job type. This should make the occurrence of waking the wrong thread quite rare, while reasoning about correctness is pretty straightforward. I think that's good enough.
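As a rough picture of what "separate condition variables for each job type" could look like (all names below are hypothetical, not taken from the actual v5 series): the job mutex stays shared, each job category gets its own QemuCond, and callers pass the matching cond into the wait/complete helpers.

/* Hypothetical layout, one condition variable per job type: */
struct AppleGFXState {
    /* ... existing fields ... */
    QemuMutex job_mutex;
    QemuCond task_map_job_cond;   /* mapMemory (task) jobs */
    QemuCond iosfc_map_job_cond;  /* surface mapping jobs  */
    QemuCond mem_read_job_cond;   /* readMemory jobs       */
};

static void apple_gfx_await_bh_job(AppleGFXState *s, QemuCond *job_cond,
                                   bool *job_done_flag)
{
    qemu_mutex_lock(&s->job_mutex);
    while (!*job_done_flag) {
        qemu_cond_wait(job_cond, &s->job_mutex);
    }
    qemu_mutex_unlock(&s->job_mutex);
}

static void apple_gfx_complete_job(AppleGFXState *s, QemuCond *job_cond,
                                   bool *job_done_flag)
{
    qemu_mutex_lock(&s->job_mutex);
    *job_done_flag = true;
    qemu_cond_broadcast(job_cond);
    qemu_mutex_unlock(&s->job_mutex);
}

With this shape, a broadcast on, say, the read cond only wakes threads waiting for a memory read, so an unnecessary wake-up can only occur when two jobs of the same kind are in flight at once, which is the rare case measured above.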
On 2024/10/28 22:31, Phil Dennis-Jordan wrote: > > > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <phil@philjordan.eu > <mailto:phil@philjordan.eu>> wrote: > > > > > > > > Hmm. I think if we were to use that, we would need to > create a new > > > QemuEvent for every job and destroy it afterward, > which seems > > expensive. > > > We can't rule out multiple concurrent jobs being > submitted, and the > > > QemuEvent system only supports a single producer as > far as I can > > tell. > > > > > > You can probably sort of hack around it with just one > QemuEvent by > > > putting the qemu_event_wait into a loop and turning > the job.done > > flag > > > into an atomic (because it would now need to be > checked outside the > > > lock) but this all seems unnecessarily complicated > considering the > > > QemuEvent uses the same mechanism QemuCond/QemuMutex > internally > > on macOS > > > (the only platform relevant here), except we can use it as > > intended with > > > QemuCond/QemuMutex rather than having to work against the > > abstraction. > > > > I don't think it's going to be used concurrently. It > would be difficult > > to reason even for the framework if it performs memory > > unmapping/mapping/reading operations concurrently. > > > > > > I've just performed a very quick test by wrapping the job > submission/ > > wait in the 2 mapMemory callbacks and the 1 readMemory > callback with > > atomic counters and logging whenever a counter went above 1. > > > > * Overall, concurrent callbacks across all types were > common (many per > > second when the VM is busy). It's not exactly a "thundering > herd" (I > > never saw >2) but it's probably not a bad idea to use a separate > > condition variable for each job type. (task map, surface map, > memory read) > > * While I did not observe any concurrent memory mapping > operations > > *within* a type of memory map (2 task mappings or 2 surface > mappings) I > > did see very occasional concurrent memory *read* callbacks. > These would, > > as far as I can tell, not be safe with QemuEvents, unless we > placed the > > event inside the job struct and init/destroyed it on every > callback > > (which seems like excessive overhead). > > I think we can tolerate that overhead. init/destroy essentially > sets the > fields in the data structure and I estimate its total size is > about 100 > bytes. It is probably better than waking an irrelevant thread > up. I also > hope that keeps the code simple; it's not worthwhile adding code to > optimize this. > > > At least pthread_cond_{init,destroy} and > pthread_mutex_{init,destroy} don't make any syscalls, so yeah it's > probably an acceptable overhead. > > > I've just experimented with QemuEvents created on-demand and ran into > some weird deadlocks, which then made me sit down and think about it > some more. I've come to the conclusion that creating (and crucially, > destroying) QemuEvents on demand in this way is not safe. > > Specifically, you must not call qemu_event_destroy() - which > transitively destroys the mutex and condition variable - unless you can > guarantee that the qemu_event_set() call on that event object has completed. > > In qemu_event_set, the event object's value is atomically set to EV_SET. > If the previous value was EV_BUSY, qemu_futex_wake() is called. All of > this is outside any mutex, however, so apart from memory coherence > (there are barriers) this can race with the waiting thread. > qemu_event_wait() reads the event's value. If EV_FREE, it's atomically > set to EV_BUSY. 
Then the mutex is locked, the value is checked again, > and if it's still EV_BUSY, it waits for the condition variable, > otherwise the mutex is immediately unlocked again. If the trigger > thread's qemu_event_set() flip to EV_SET occurs between the waiting > thread's two atomic reads of the value, the waiting thread will never > wait for the condition variable, but the trigger thread WILL try to > acquire the mutex and signal the condition variable in > qemu_futex_wake(), by which time the waiting thread may have advanced > outside of qemu_event_wait(). Sorry if I'm making a mistake again, but the waiting thread won't set to EV_BUSY unless the value is EV_FREE on the second read so the trigger thread will not call qemu_futex_wake() if it manages to set to EV_SET before the second read, will it? > > This is all fine usually, BUT if you destroy the QemuEvent immediately > after the qemu_event_wait() call, qemu_futex_wake() may try to lock a > mutex that has been destroyed, or signal a condition variable which has > been destroyed. I don't see a reasonable way of making this safe other > than using long-lived mutexes and condition variables. And anyway, we > have much, MUCH bigger contention/performance issues coming from almost > everything being covered by the BQL. (If waking these callbacks can even > be considered an issue: I haven't seen it show up in profiling, whereas > BQL contention very much does.) > > I'll submit v5 of this patch set with separate condition variables for > each job type. This should make the occurrence of waking the wrong > thread quite rare, while reasoning about correctness is pretty > straightforward. I think that's good enough. > > >
On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > On 2024/10/28 22:31, Phil Dennis-Jordan wrote: > > > > > > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <phil@philjordan.eu > > <mailto:phil@philjordan.eu>> wrote: > > > > > > > > > > > > Hmm. I think if we were to use that, we would need to > > create a new > > > > QemuEvent for every job and destroy it afterward, > > which seems > > > expensive. > > > > We can't rule out multiple concurrent jobs being > > submitted, and the > > > > QemuEvent system only supports a single producer as > > far as I can > > > tell. > > > > > > > > You can probably sort of hack around it with just one > > QemuEvent by > > > > putting the qemu_event_wait into a loop and turning > > the job.done > > > flag > > > > into an atomic (because it would now need to be > > checked outside the > > > > lock) but this all seems unnecessarily complicated > > considering the > > > > QemuEvent uses the same mechanism QemuCond/QemuMutex > > internally > > > on macOS > > > > (the only platform relevant here), except we can use > it as > > > intended with > > > > QemuCond/QemuMutex rather than having to work against > the > > > abstraction. > > > > > > I don't think it's going to be used concurrently. It > > would be difficult > > > to reason even for the framework if it performs memory > > > unmapping/mapping/reading operations concurrently. > > > > > > > > > I've just performed a very quick test by wrapping the job > > submission/ > > > wait in the 2 mapMemory callbacks and the 1 readMemory > > callback with > > > atomic counters and logging whenever a counter went above 1. > > > > > > * Overall, concurrent callbacks across all types were > > common (many per > > > second when the VM is busy). It's not exactly a "thundering > > herd" (I > > > never saw >2) but it's probably not a bad idea to use a > separate > > > condition variable for each job type. (task map, surface map, > > memory read) > > > * While I did not observe any concurrent memory mapping > > operations > > > *within* a type of memory map (2 task mappings or 2 surface > > mappings) I > > > did see very occasional concurrent memory *read* callbacks. > > These would, > > > as far as I can tell, not be safe with QemuEvents, unless we > > placed the > > > event inside the job struct and init/destroyed it on every > > callback > > > (which seems like excessive overhead). > > > > I think we can tolerate that overhead. init/destroy essentially > > sets the > > fields in the data structure and I estimate its total size is > > about 100 > > bytes. It is probably better than waking an irrelevant thread > > up. I also > > hope that keeps the code simple; it's not worthwhile adding code > to > > optimize this. > > > > > > At least pthread_cond_{init,destroy} and > > pthread_mutex_{init,destroy} don't make any syscalls, so yeah it's > > probably an acceptable overhead. > > > > > > I've just experimented with QemuEvents created on-demand and ran into > > some weird deadlocks, which then made me sit down and think about it > > some more. I've come to the conclusion that creating (and crucially, > > destroying) QemuEvents on demand in this way is not safe. > > > > Specifically, you must not call qemu_event_destroy() - which > > transitively destroys the mutex and condition variable - unless you can > > guarantee that the qemu_event_set() call on that event object has > completed. > > > > In qemu_event_set, the event object's value is atomically set to EV_SET. 
> > If the previous value was EV_BUSY, qemu_futex_wake() is called. All of > > this is outside any mutex, however, so apart from memory coherence > > (there are barriers) this can race with the waiting thread. > > qemu_event_wait() reads the event's value. If EV_FREE, it's atomically > > set to EV_BUSY. Then the mutex is locked, the value is checked again, > > and if it's still EV_BUSY, it waits for the condition variable, > > otherwise the mutex is immediately unlocked again. If the trigger > > thread's qemu_event_set() flip to EV_SET occurs between the waiting > > thread's two atomic reads of the value, the waiting thread will never > > wait for the condition variable, but the trigger thread WILL try to > > acquire the mutex and signal the condition variable in > > qemu_futex_wake(), by which time the waiting thread may have advanced > > outside of qemu_event_wait(). > > Sorry if I'm making a mistake again, but the waiting thread won't set to > EV_BUSY unless the value is EV_FREE on the second read so the trigger > thread will not call qemu_futex_wake() if it manages to set to EV_SET > before the second read, will it? > This sequence of events will cause the problem: WAITER (in qemu_event_wait): value = qatomic_load_acquire(&ev->value); -> EV_FREE TRIGGER (in qemu_event_set): qatomic_read(&ev->value) != EV_SET -> EV_FREE (condition is false) WAITER: qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET -> cmpxchg returns EV_FREE, condition false. ev->value = EV_BUSY. TRIGGER: int old = qatomic_xchg(&ev->value, EV_SET); smp_mb__after_rmw(); if (old == EV_BUSY) { -> old = EV_BUSY, condition true. ev->value = EV_SET WAITER (in qemu_futex_wait(ev, EV_BUSY)): pthread_mutex_lock(&ev->lock); if (ev->value == val) { -> false, because value is EV_SET WAITER: pthread_mutex_unlock(&ev->lock); … qemu_event_destroy(&job->done_event); TRIGGER (in qemu_futex_wake(ev, INT_MAX)): pthread_mutex_lock(&ev->lock); -> hangs, because mutex has been destroyed > > > This is all fine usually, BUT if you destroy the QemuEvent immediately > > after the qemu_event_wait() call, qemu_futex_wake() may try to lock a > > mutex that has been destroyed, or signal a condition variable which has > > been destroyed. I don't see a reasonable way of making this safe other > > than using long-lived mutexes and condition variables. And anyway, we > > have much, MUCH bigger contention/performance issues coming from almost > > everything being covered by the BQL. (If waking these callbacks can even > > be considered an issue: I haven't seen it show up in profiling, whereas > > BQL contention very much does.) > > > > I'll submit v5 of this patch set with separate condition variables for > > each job type. This should make the occurrence of waking the wrong > > thread quite rare, while reasoning about correctness is pretty > > straightforward. I think that's good enough. > > > > > > > >
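To put that interleaving back into the context of the proposed on-demand events, the unsafe shape is roughly the following; this is illustrative code, not taken from any revision of the series.

#include "qemu/osdep.h"
#include "qemu/thread.h"
#include "qemu/main-loop.h"

typedef struct Job {
    QemuEvent done_event;
    /* ... job parameters ... */
} Job;

/* TRIGGER side, runs as a BH on the main AIO context. */
static void job_bh(void *opaque)
{
    Job *job = opaque;
    /* ... perform the work ... */
    qemu_event_set(&job->done_event);     /* may still be inside this call... */
}

/* WAITER side, e.g. a PVG callback thread. */
static void submit_and_wait(Job *job)
{
    qemu_event_init(&job->done_event, false);
    aio_bh_schedule_oneshot(qemu_get_aio_context(), job_bh, job);
    qemu_event_wait(&job->done_event);    /* ...when this returns, so    */
    qemu_event_destroy(&job->done_event); /* this can tear down the mutex
                                             and cond under the trigger  */
}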
On 2024/10/28 23:13, Phil Dennis-Jordan wrote: > > > On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki <akihiko.odaki@daynix.com > <mailto:akihiko.odaki@daynix.com>> wrote: > > On 2024/10/28 22:31, Phil Dennis-Jordan wrote: > > > > > > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan > <phil@philjordan.eu <mailto:phil@philjordan.eu> > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu>>> wrote: > > > > > > > > > > > > Hmm. I think if we were to use that, we would > need to > > create a new > > > > QemuEvent for every job and destroy it afterward, > > which seems > > > expensive. > > > > We can't rule out multiple concurrent jobs being > > submitted, and the > > > > QemuEvent system only supports a single producer as > > far as I can > > > tell. > > > > > > > > You can probably sort of hack around it with > just one > > QemuEvent by > > > > putting the qemu_event_wait into a loop and turning > > the job.done > > > flag > > > > into an atomic (because it would now need to be > > checked outside the > > > > lock) but this all seems unnecessarily complicated > > considering the > > > > QemuEvent uses the same mechanism QemuCond/ > QemuMutex > > internally > > > on macOS > > > > (the only platform relevant here), except we > can use it as > > > intended with > > > > QemuCond/QemuMutex rather than having to work > against the > > > abstraction. > > > > > > I don't think it's going to be used concurrently. It > > would be difficult > > > to reason even for the framework if it performs memory > > > unmapping/mapping/reading operations concurrently. > > > > > > > > > I've just performed a very quick test by wrapping the job > > submission/ > > > wait in the 2 mapMemory callbacks and the 1 readMemory > > callback with > > > atomic counters and logging whenever a counter went > above 1. > > > > > > * Overall, concurrent callbacks across all types were > > common (many per > > > second when the VM is busy). It's not exactly a > "thundering > > herd" (I > > > never saw >2) but it's probably not a bad idea to use > a separate > > > condition variable for each job type. (task map, > surface map, > > memory read) > > > * While I did not observe any concurrent memory mapping > > operations > > > *within* a type of memory map (2 task mappings or 2 > surface > > mappings) I > > > did see very occasional concurrent memory *read* > callbacks. > > These would, > > > as far as I can tell, not be safe with QemuEvents, > unless we > > placed the > > > event inside the job struct and init/destroyed it on every > > callback > > > (which seems like excessive overhead). > > > > I think we can tolerate that overhead. init/destroy > essentially > > sets the > > fields in the data structure and I estimate its total size is > > about 100 > > bytes. It is probably better than waking an irrelevant thread > > up. I also > > hope that keeps the code simple; it's not worthwhile > adding code to > > optimize this. > > > > > > At least pthread_cond_{init,destroy} and > > pthread_mutex_{init,destroy} don't make any syscalls, so yeah > it's > > probably an acceptable overhead. > > > > > > I've just experimented with QemuEvents created on-demand and ran > into > > some weird deadlocks, which then made me sit down and think about it > > some more. I've come to the conclusion that creating (and crucially, > > destroying) QemuEvents on demand in this way is not safe. 
> > > > Specifically, you must not call qemu_event_destroy() - which > > transitively destroys the mutex and condition variable - unless > you can > > guarantee that the qemu_event_set() call on that event object has > completed. > > > > In qemu_event_set, the event object's value is atomically set to > EV_SET. > > If the previous value was EV_BUSY, qemu_futex_wake() is called. > All of > > this is outside any mutex, however, so apart from memory coherence > > (there are barriers) this can race with the waiting thread. > > qemu_event_wait() reads the event's value. If EV_FREE, it's > atomically > > set to EV_BUSY. Then the mutex is locked, the value is checked > again, > > and if it's still EV_BUSY, it waits for the condition variable, > > otherwise the mutex is immediately unlocked again. If the trigger > > thread's qemu_event_set() flip to EV_SET occurs between the waiting > > thread's two atomic reads of the value, the waiting thread will > never > > wait for the condition variable, but the trigger thread WILL try to > > acquire the mutex and signal the condition variable in > > qemu_futex_wake(), by which time the waiting thread may have > advanced > > outside of qemu_event_wait(). > > Sorry if I'm making a mistake again, but the waiting thread won't > set to > EV_BUSY unless the value is EV_FREE on the second read so the trigger > thread will not call qemu_futex_wake() if it manages to set to EV_SET > before the second read, will it? > > > This sequence of events will cause the problem: > > WAITER (in qemu_event_wait): > value = qatomic_load_acquire(&ev->value); > -> EV_FREE > > TRIGGER (in qemu_event_set): > qatomic_read(&ev->value) != EV_SET > -> EV_FREE (condition is false) > > WAITER: > qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET > -> cmpxchg returns EV_FREE, condition false. > ev->value = EV_BUSY. > > TRIGGER: > int old = qatomic_xchg(&ev->value, EV_SET); > smp_mb__after_rmw(); > if (old == EV_BUSY) { > -> old = EV_BUSY, condition true. > ev->value = EV_SET > > WAITER (in qemu_futex_wait(ev, EV_BUSY)): > pthread_mutex_lock(&ev->lock); > if (ev->value == val) { > -> false, because value is EV_SET > > WAITER: > pthread_mutex_unlock(&ev->lock); > … > qemu_event_destroy(&job->done_event); > > TRIGGER (in qemu_futex_wake(ev, INT_MAX)): > pthread_mutex_lock(&ev->lock); > -> hangs, because mutex has been destroyed Thanks for clarification. This is very insightful. > > > > > This is all fine usually, BUT if you destroy the QemuEvent > immediately > > after the qemu_event_wait() call, qemu_futex_wake() may try to > lock a > > mutex that has been destroyed, or signal a condition variable > which has > > been destroyed. I don't see a reasonable way of making this safe > other > > than using long-lived mutexes and condition variables. And > anyway, we > > have much, MUCH bigger contention/performance issues coming from > almost > > everything being covered by the BQL. (If waking these callbacks > can even > > be considered an issue: I haven't seen it show up in profiling, > whereas > > BQL contention very much does.) > > > > I'll submit v5 of this patch set with separate condition > variables for > > each job type. This should make the occurrence of waking the wrong > > thread quite rare, while reasoning about correctness is pretty > > straightforward. I think that's good enough. What about using QemuSemaphore then? It does not seem to have the problem same with QemuEvent.
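For comparison, a per-job QemuSemaphore variant along the lines suggested here would look roughly like this; again purely illustrative, and whether destroying the semaphore right after the wait is genuinely safe is exactly the open question.

typedef struct Job {
    QemuSemaphore done_sem;
    /* ... job parameters ... */
} Job;

static void job_bh(void *opaque)
{
    Job *job = opaque;
    /* ... perform the work ... */
    qemu_sem_post(&job->done_sem);   /* post happens under the semaphore's
                                        internal mutex */
}

static void submit_and_wait(Job *job)
{
    qemu_sem_init(&job->done_sem, 0);
    aio_bh_schedule_oneshot(qemu_get_aio_context(), job_bh, job);
    qemu_sem_wait(&job->done_sem);
    qemu_sem_destroy(&job->done_sem);
}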
On Mon, 28 Oct 2024 at 17:06, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > On 2024/10/28 23:13, Phil Dennis-Jordan wrote: > > > > > > On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki <akihiko.odaki@daynix.com > > <mailto:akihiko.odaki@daynix.com>> wrote: > > > > On 2024/10/28 22:31, Phil Dennis-Jordan wrote: > > > > > > > > > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan > > <phil@philjordan.eu <mailto:phil@philjordan.eu> > > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu>>> wrote: > > > > > > > > > > > > > > > > Hmm. I think if we were to use that, we would > > need to > > > create a new > > > > > QemuEvent for every job and destroy it > afterward, > > > which seems > > > > expensive. > > > > > We can't rule out multiple concurrent jobs being > > > submitted, and the > > > > > QemuEvent system only supports a single > producer as > > > far as I can > > > > tell. > > > > > > > > > > You can probably sort of hack around it with > > just one > > > QemuEvent by > > > > > putting the qemu_event_wait into a loop and > turning > > > the job.done > > > > flag > > > > > into an atomic (because it would now need to be > > > checked outside the > > > > > lock) but this all seems unnecessarily > complicated > > > considering the > > > > > QemuEvent uses the same mechanism QemuCond/ > > QemuMutex > > > internally > > > > on macOS > > > > > (the only platform relevant here), except we > > can use it as > > > > intended with > > > > > QemuCond/QemuMutex rather than having to work > > against the > > > > abstraction. > > > > > > > > I don't think it's going to be used concurrently. > It > > > would be difficult > > > > to reason even for the framework if it performs > memory > > > > unmapping/mapping/reading operations concurrently. > > > > > > > > > > > > I've just performed a very quick test by wrapping the > job > > > submission/ > > > > wait in the 2 mapMemory callbacks and the 1 readMemory > > > callback with > > > > atomic counters and logging whenever a counter went > > above 1. > > > > > > > > * Overall, concurrent callbacks across all types were > > > common (many per > > > > second when the VM is busy). It's not exactly a > > "thundering > > > herd" (I > > > > never saw >2) but it's probably not a bad idea to use > > a separate > > > > condition variable for each job type. (task map, > > surface map, > > > memory read) > > > > * While I did not observe any concurrent memory > mapping > > > operations > > > > *within* a type of memory map (2 task mappings or 2 > > surface > > > mappings) I > > > > did see very occasional concurrent memory *read* > > callbacks. > > > These would, > > > > as far as I can tell, not be safe with QemuEvents, > > unless we > > > placed the > > > > event inside the job struct and init/destroyed it on > every > > > callback > > > > (which seems like excessive overhead). > > > > > > I think we can tolerate that overhead. init/destroy > > essentially > > > sets the > > > fields in the data structure and I estimate its total > size is > > > about 100 > > > bytes. It is probably better than waking an irrelevant > thread > > > up. I also > > > hope that keeps the code simple; it's not worthwhile > > adding code to > > > optimize this. > > > > > > > > > At least pthread_cond_{init,destroy} and > > > pthread_mutex_{init,destroy} don't make any syscalls, so yeah > > it's > > > probably an acceptable overhead. 
> > > > > > > > > I've just experimented with QemuEvents created on-demand and ran > > into > > > some weird deadlocks, which then made me sit down and think about > it > > > some more. I've come to the conclusion that creating (and > crucially, > > > destroying) QemuEvents on demand in this way is not safe. > > > > > > Specifically, you must not call qemu_event_destroy() - which > > > transitively destroys the mutex and condition variable - unless > > you can > > > guarantee that the qemu_event_set() call on that event object has > > completed. > > > > > > In qemu_event_set, the event object's value is atomically set to > > EV_SET. > > > If the previous value was EV_BUSY, qemu_futex_wake() is called. > > All of > > > this is outside any mutex, however, so apart from memory coherence > > > (there are barriers) this can race with the waiting thread. > > > qemu_event_wait() reads the event's value. If EV_FREE, it's > > atomically > > > set to EV_BUSY. Then the mutex is locked, the value is checked > > again, > > > and if it's still EV_BUSY, it waits for the condition variable, > > > otherwise the mutex is immediately unlocked again. If the trigger > > > thread's qemu_event_set() flip to EV_SET occurs between the > waiting > > > thread's two atomic reads of the value, the waiting thread will > > never > > > wait for the condition variable, but the trigger thread WILL try > to > > > acquire the mutex and signal the condition variable in > > > qemu_futex_wake(), by which time the waiting thread may have > > advanced > > > outside of qemu_event_wait(). > > > > Sorry if I'm making a mistake again, but the waiting thread won't > > set to > > EV_BUSY unless the value is EV_FREE on the second read so the trigger > > thread will not call qemu_futex_wake() if it manages to set to EV_SET > > before the second read, will it? > > > > > > This sequence of events will cause the problem: > > > > WAITER (in qemu_event_wait): > > value = qatomic_load_acquire(&ev->value); > > -> EV_FREE > > > > TRIGGER (in qemu_event_set): > > qatomic_read(&ev->value) != EV_SET > > -> EV_FREE (condition is false) > > > > WAITER: > > qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET > > -> cmpxchg returns EV_FREE, condition false. > > ev->value = EV_BUSY. > > > TRIGGER: > > int old = qatomic_xchg(&ev->value, EV_SET); > > smp_mb__after_rmw(); > > if (old == EV_BUSY) { > > -> old = EV_BUSY, condition true. > > ev->value = EV_SET > > > > WAITER (in qemu_futex_wait(ev, EV_BUSY)): > > pthread_mutex_lock(&ev->lock); > > if (ev->value == val) { > > -> false, because value is EV_SET > > > > WAITER: > > pthread_mutex_unlock(&ev->lock); > > … > > qemu_event_destroy(&job->done_event); > > > > TRIGGER (in qemu_futex_wake(ev, INT_MAX)): > > pthread_mutex_lock(&ev->lock); > > -> hangs, because mutex has been destroyed > > Thanks for clarification. This is very insightful. > > > > > > > > > > This is all fine usually, BUT if you destroy the QemuEvent > > immediately > > > after the qemu_event_wait() call, qemu_futex_wake() may try to > > lock a > > > mutex that has been destroyed, or signal a condition variable > > which has > > > been destroyed. I don't see a reasonable way of making this safe > > other > > > than using long-lived mutexes and condition variables. And > > anyway, we > > > have much, MUCH bigger contention/performance issues coming from > > almost > > > everything being covered by the BQL. 
(If waking these callbacks > > can even > > > be considered an issue: I haven't seen it show up in profiling, > > whereas > > > BQL contention very much does.) > > > > > > I'll submit v5 of this patch set with separate condition > > variables for > > > each job type. This should make the occurrence of waking the wrong > > > thread quite rare, while reasoning about correctness is pretty > > > straightforward. I think that's good enough. > > What about using QemuSemaphore then? It does not seem to have the > problem same with QemuEvent. >

Nothing else in the code base uses short-lived semaphores, and while I can't immediately see a risk (the mutex is held during both post and wait), there might be some non-obvious problem with the approach. Internally, the semaphores use condition variables. The solution using condition variables directly already works, is safe, relatively easy to reason about, and does not cause any performance issues. There is a tiny inefficiency in waking up a thread unnecessarily in the rare case when two callbacks of the same kind occur concurrently. In practice, it's irrelevant. Thanks to the awkward mismatch between PVGraphics.framework's libdispatch-based approach and Qemu's BQL/AIO/BH approach, we are already sending messages to other threads very frequently. This isn't ideal, but it's not fixable without drastically reducing the need to acquire the BQL across Qemu. I do not think it is worth spending even more time trying to fix this part of the code, which isn't broken in the first place.
On 2024/10/29 6:06, Phil Dennis-Jordan wrote: > > > On Mon, 28 Oct 2024 at 17:06, Akihiko Odaki <akihiko.odaki@daynix.com > <mailto:akihiko.odaki@daynix.com>> wrote: > > On 2024/10/28 23:13, Phil Dennis-Jordan wrote: > > > > > > On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki > <akihiko.odaki@daynix.com <mailto:akihiko.odaki@daynix.com> > > <mailto:akihiko.odaki@daynix.com > <mailto:akihiko.odaki@daynix.com>>> wrote: > > > > On 2024/10/28 22:31, Phil Dennis-Jordan wrote: > > > > > > > > > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan > > <phil@philjordan.eu <mailto:phil@philjordan.eu> > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu>> > > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu> > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu>>>> wrote: > > > > > > > > > > > > > > > > Hmm. I think if we were to use that, we > would > > need to > > > create a new > > > > > QemuEvent for every job and destroy it > afterward, > > > which seems > > > > expensive. > > > > > We can't rule out multiple concurrent > jobs being > > > submitted, and the > > > > > QemuEvent system only supports a single > producer as > > > far as I can > > > > tell. > > > > > > > > > > You can probably sort of hack around it with > > just one > > > QemuEvent by > > > > > putting the qemu_event_wait into a loop > and turning > > > the job.done > > > > flag > > > > > into an atomic (because it would now > need to be > > > checked outside the > > > > > lock) but this all seems unnecessarily > complicated > > > considering the > > > > > QemuEvent uses the same mechanism QemuCond/ > > QemuMutex > > > internally > > > > on macOS > > > > > (the only platform relevant here), except we > > can use it as > > > > intended with > > > > > QemuCond/QemuMutex rather than having to > work > > against the > > > > abstraction. > > > > > > > > I don't think it's going to be used > concurrently. It > > > would be difficult > > > > to reason even for the framework if it > performs memory > > > > unmapping/mapping/reading operations > concurrently. > > > > > > > > > > > > I've just performed a very quick test by > wrapping the job > > > submission/ > > > > wait in the 2 mapMemory callbacks and the 1 > readMemory > > > callback with > > > > atomic counters and logging whenever a counter went > > above 1. > > > > > > > > * Overall, concurrent callbacks across all > types were > > > common (many per > > > > second when the VM is busy). It's not exactly a > > "thundering > > > herd" (I > > > > never saw >2) but it's probably not a bad idea > to use > > a separate > > > > condition variable for each job type. (task map, > > surface map, > > > memory read) > > > > * While I did not observe any concurrent > memory mapping > > > operations > > > > *within* a type of memory map (2 task mappings or 2 > > surface > > > mappings) I > > > > did see very occasional concurrent memory *read* > > callbacks. > > > These would, > > > > as far as I can tell, not be safe with QemuEvents, > > unless we > > > placed the > > > > event inside the job struct and init/destroyed > it on every > > > callback > > > > (which seems like excessive overhead). > > > > > > I think we can tolerate that overhead. init/destroy > > essentially > > > sets the > > > fields in the data structure and I estimate its > total size is > > > about 100 > > > bytes. It is probably better than waking an > irrelevant thread > > > up. I also > > > hope that keeps the code simple; it's not worthwhile > > adding code to > > > optimize this. 
> > > > > > > > > At least pthread_cond_{init,destroy} and > > > pthread_mutex_{init,destroy} don't make any syscalls, > so yeah > > it's > > > probably an acceptable overhead. > > > > > > > > > I've just experimented with QemuEvents created on-demand > and ran > > into > > > some weird deadlocks, which then made me sit down and > think about it > > > some more. I've come to the conclusion that creating (and > crucially, > > > destroying) QemuEvents on demand in this way is not safe. > > > > > > Specifically, you must not call qemu_event_destroy() - which > > > transitively destroys the mutex and condition variable - > unless > > you can > > > guarantee that the qemu_event_set() call on that event > object has > > completed. > > > > > > In qemu_event_set, the event object's value is atomically > set to > > EV_SET. > > > If the previous value was EV_BUSY, qemu_futex_wake() is > called. > > All of > > > this is outside any mutex, however, so apart from memory > coherence > > > (there are barriers) this can race with the waiting thread. > > > qemu_event_wait() reads the event's value. If EV_FREE, it's > > atomically > > > set to EV_BUSY. Then the mutex is locked, the value is checked > > again, > > > and if it's still EV_BUSY, it waits for the condition > variable, > > > otherwise the mutex is immediately unlocked again. If the > trigger > > > thread's qemu_event_set() flip to EV_SET occurs between > the waiting > > > thread's two atomic reads of the value, the waiting thread > will > > never > > > wait for the condition variable, but the trigger thread > WILL try to > > > acquire the mutex and signal the condition variable in > > > qemu_futex_wake(), by which time the waiting thread may have > > advanced > > > outside of qemu_event_wait(). > > > > Sorry if I'm making a mistake again, but the waiting thread won't > > set to > > EV_BUSY unless the value is EV_FREE on the second read so the > trigger > > thread will not call qemu_futex_wake() if it manages to set > to EV_SET > > before the second read, will it? > > > > > > This sequence of events will cause the problem: > > > > WAITER (in qemu_event_wait): > > value = qatomic_load_acquire(&ev->value); > > -> EV_FREE > > > > TRIGGER (in qemu_event_set): > > qatomic_read(&ev->value) != EV_SET > > -> EV_FREE (condition is false) > > > > WAITER: > > qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET > > -> cmpxchg returns EV_FREE, condition false. > > ev->value = EV_BUSY. > > > TRIGGER: > > int old = qatomic_xchg(&ev->value, EV_SET); > > smp_mb__after_rmw(); > > if (old == EV_BUSY) { > > -> old = EV_BUSY, condition true. > > ev->value = EV_SET > > > > WAITER (in qemu_futex_wait(ev, EV_BUSY)): > > pthread_mutex_lock(&ev->lock); > > if (ev->value == val) { > > -> false, because value is EV_SET > > > > WAITER: > > pthread_mutex_unlock(&ev->lock); > > … > > qemu_event_destroy(&job->done_event); > > > > TRIGGER (in qemu_futex_wake(ev, INT_MAX)): > > pthread_mutex_lock(&ev->lock); > > -> hangs, because mutex has been destroyed > > Thanks for clarification. This is very insightful. > > > > > > > > > > This is all fine usually, BUT if you destroy the QemuEvent > > immediately > > > after the qemu_event_wait() call, qemu_futex_wake() may try to > > lock a > > > mutex that has been destroyed, or signal a condition variable > > which has > > > been destroyed. I don't see a reasonable way of making > this safe > > other > > > than using long-lived mutexes and condition variables. 
And > > anyway, we > > > have much, MUCH bigger contention/performance issues > coming from > > almost > > > everything being covered by the BQL. (If waking these > callbacks > > can even > > > be considered an issue: I haven't seen it show up in > profiling, > > whereas > > > BQL contention very much does.) > > > > > > I'll submit v5 of this patch set with separate condition > > variables for > > > each job type. This should make the occurrence of waking > the wrong > > > thread quite rare, while reasoning about correctness is pretty > > > straightforward. I think that's good enough. > > What about using QemuSemaphore then? It does not seem to have the > problem same with QemuEvent. > > > Nowhere else in the code base uses short-lived semaphores, and while I > can't immediately see a risk (the mutex is held during both post and > wait) there might be some non-obvious problem with the approach. > Internally, the semaphores use condition variables. The solution using > condition variables directly already works, is safe, relatively easy to > reason about, and does not cause any performance issues. There is a tiny > inefficiency about waking up a thread unnecessarily in the rare case > when two callbacks of the same kind occur concurrently. In practice, > it's irrelevant. Thanks to the awkward mismatch of the > PVGraphics.framework's libdispatch based approach and Qemu's BQL/AIO/BH > approach, we are already sending messages to other threads very > frequently. This isn't ideal, but not fixable without drastically > reducing the need to acquire the BQL across Qemu. I found several usage of ephemeral semaphores: h_random() in hw/ppc/spapr_rng.c colo_process_checkpoint() in migration/colo.c postcopy_thread_create() in migration/postcopy-ram.c I'm sure short-lived semaphores will keep working (or break migration in strange ways). > > I do not think it is worth spending even more time trying to fix this > part of the code which isn't broken in the first place. I'm sorry to bring you to this mess, which I didn't really expect. I thought combining a shared pair of conditional variable and mutex and job-specific bools is unnecessarily complex, and having one synchronization primitive for each job will be simpler and will just work. However, there was a pitfall with QemuEvent as you demonstrated and now I grep QemuEvent and QemuSemaphore to find all such ephemeral usage is written with QemuSemaphore instead of QemuEvent. I think the critical problem here is that it is not documented that qemu_event_destroy() cannot be used immediately after qemu_event_wait(). We would not run into this situation if it is written. I will write a patch to add such a documentation comment.
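One possible shape for such a comment, next to the declaration in include/qemu/thread.h; the wording below is only a guess at what that patch might say.

/*
 * qemu_event_destroy: free the resources associated with an event.
 *
 * The caller must ensure that no other thread can still be inside
 * qemu_event_set() or qemu_event_wait() on this event.  In particular,
 * a waiter must not destroy the event immediately after
 * qemu_event_wait() returns: the thread that called qemu_event_set()
 * may still be waking up waiters and can touch the event's internal
 * mutex and condition variable after the wait has already returned.
 */
void qemu_event_destroy(QemuEvent *ev);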
On Tue, 29 Oct 2024 at 08:42, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > On 2024/10/29 6:06, Phil Dennis-Jordan wrote: > > > > > > On Mon, 28 Oct 2024 at 17:06, Akihiko Odaki <akihiko.odaki@daynix.com > > <mailto:akihiko.odaki@daynix.com>> wrote: > > > > On 2024/10/28 23:13, Phil Dennis-Jordan wrote: > > > > > > > > > On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki > > <akihiko.odaki@daynix.com <mailto:akihiko.odaki@daynix.com> > > > <mailto:akihiko.odaki@daynix.com > > <mailto:akihiko.odaki@daynix.com>>> wrote: > > > > > > On 2024/10/28 22:31, Phil Dennis-Jordan wrote: > > > > > > > > > > > > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan > > > <phil@philjordan.eu <mailto:phil@philjordan.eu> > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu>> > > > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu> > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu>>>> wrote: > > > > > > > > > > > > > > > > > > > > Hmm. I think if we were to use that, we > > would > > > need to > > > > create a new > > > > > > QemuEvent for every job and destroy it > > afterward, > > > > which seems > > > > > expensive. > > > > > > We can't rule out multiple concurrent > > jobs being > > > > submitted, and the > > > > > > QemuEvent system only supports a single > > producer as > > > > far as I can > > > > > tell. > > > > > > > > > > > > You can probably sort of hack around it > with > > > just one > > > > QemuEvent by > > > > > > putting the qemu_event_wait into a loop > > and turning > > > > the job.done > > > > > flag > > > > > > into an atomic (because it would now > > need to be > > > > checked outside the > > > > > > lock) but this all seems unnecessarily > > complicated > > > > considering the > > > > > > QemuEvent uses the same mechanism > QemuCond/ > > > QemuMutex > > > > internally > > > > > on macOS > > > > > > (the only platform relevant here), > except we > > > can use it as > > > > > intended with > > > > > > QemuCond/QemuMutex rather than having to > > work > > > against the > > > > > abstraction. > > > > > > > > > > I don't think it's going to be used > > concurrently. It > > > > would be difficult > > > > > to reason even for the framework if it > > performs memory > > > > > unmapping/mapping/reading operations > > concurrently. > > > > > > > > > > > > > > > I've just performed a very quick test by > > wrapping the job > > > > submission/ > > > > > wait in the 2 mapMemory callbacks and the 1 > > readMemory > > > > callback with > > > > > atomic counters and logging whenever a counter > went > > > above 1. > > > > > > > > > > * Overall, concurrent callbacks across all > > types were > > > > common (many per > > > > > second when the VM is busy). It's not exactly a > > > "thundering > > > > herd" (I > > > > > never saw >2) but it's probably not a bad idea > > to use > > > a separate > > > > > condition variable for each job type. (task map, > > > surface map, > > > > memory read) > > > > > * While I did not observe any concurrent > > memory mapping > > > > operations > > > > > *within* a type of memory map (2 task mappings > or 2 > > > surface > > > > mappings) I > > > > > did see very occasional concurrent memory *read* > > > callbacks. > > > > These would, > > > > > as far as I can tell, not be safe with > QemuEvents, > > > unless we > > > > placed the > > > > > event inside the job struct and init/destroyed > > it on every > > > > callback > > > > > (which seems like excessive overhead). > > > > > > > > I think we can tolerate that overhead. 
init/destroy > > > essentially > > > > sets the > > > > fields in the data structure and I estimate its > > total size is > > > > about 100 > > > > bytes. It is probably better than waking an > > irrelevant thread > > > > up. I also > > > > hope that keeps the code simple; it's not > worthwhile > > > adding code to > > > > optimize this. > > > > > > > > > > > > At least pthread_cond_{init,destroy} and > > > > pthread_mutex_{init,destroy} don't make any syscalls, > > so yeah > > > it's > > > > probably an acceptable overhead. > > > > > > > > > > > > I've just experimented with QemuEvents created on-demand > > and ran > > > into > > > > some weird deadlocks, which then made me sit down and > > think about it > > > > some more. I've come to the conclusion that creating (and > > crucially, > > > > destroying) QemuEvents on demand in this way is not safe. > > > > > > > > Specifically, you must not call qemu_event_destroy() - > which > > > > transitively destroys the mutex and condition variable - > > unless > > > you can > > > > guarantee that the qemu_event_set() call on that event > > object has > > > completed. > > > > > > > > In qemu_event_set, the event object's value is atomically > > set to > > > EV_SET. > > > > If the previous value was EV_BUSY, qemu_futex_wake() is > > called. > > > All of > > > > this is outside any mutex, however, so apart from memory > > coherence > > > > (there are barriers) this can race with the waiting thread. > > > > qemu_event_wait() reads the event's value. If EV_FREE, it's > > > atomically > > > > set to EV_BUSY. Then the mutex is locked, the value is > checked > > > again, > > > > and if it's still EV_BUSY, it waits for the condition > > variable, > > > > otherwise the mutex is immediately unlocked again. If the > > trigger > > > > thread's qemu_event_set() flip to EV_SET occurs between > > the waiting > > > > thread's two atomic reads of the value, the waiting thread > > will > > > never > > > > wait for the condition variable, but the trigger thread > > WILL try to > > > > acquire the mutex and signal the condition variable in > > > > qemu_futex_wake(), by which time the waiting thread may > have > > > advanced > > > > outside of qemu_event_wait(). > > > > > > Sorry if I'm making a mistake again, but the waiting thread > won't > > > set to > > > EV_BUSY unless the value is EV_FREE on the second read so the > > trigger > > > thread will not call qemu_futex_wake() if it manages to set > > to EV_SET > > > before the second read, will it? > > > > > > > > > This sequence of events will cause the problem: > > > > > > WAITER (in qemu_event_wait): > > > value = qatomic_load_acquire(&ev->value); > > > -> EV_FREE > > > > > > TRIGGER (in qemu_event_set): > > > qatomic_read(&ev->value) != EV_SET > > > -> EV_FREE (condition is false) > > > > > > WAITER: > > > qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET > > > -> cmpxchg returns EV_FREE, condition false. > > > ev->value = EV_BUSY. > > > > TRIGGER: > > > int old = qatomic_xchg(&ev->value, EV_SET); > > > smp_mb__after_rmw(); > > > if (old == EV_BUSY) { > > > -> old = EV_BUSY, condition true. 
> > > ev->value = EV_SET > > > > > > WAITER (in qemu_futex_wait(ev, EV_BUSY)): > > > pthread_mutex_lock(&ev->lock); > > > if (ev->value == val) { > > > -> false, because value is EV_SET > > > > > > WAITER: > > > pthread_mutex_unlock(&ev->lock); > > > … > > > qemu_event_destroy(&job->done_event); > > > > > > TRIGGER (in qemu_futex_wake(ev, INT_MAX)): > > > pthread_mutex_lock(&ev->lock); > > > -> hangs, because mutex has been destroyed > > > > Thanks for clarification. This is very insightful. > > > > > > > > > > > > > > > This is all fine usually, BUT if you destroy the QemuEvent > > > immediately > > > > after the qemu_event_wait() call, qemu_futex_wake() may > try to > > > lock a > > > > mutex that has been destroyed, or signal a condition > variable > > > which has > > > > been destroyed. I don't see a reasonable way of making > > this safe > > > other > > > > than using long-lived mutexes and condition variables. And > > > anyway, we > > > > have much, MUCH bigger contention/performance issues > > coming from > > > almost > > > > everything being covered by the BQL. (If waking these > > callbacks > > > can even > > > > be considered an issue: I haven't seen it show up in > > profiling, > > > whereas > > > > BQL contention very much does.) > > > > > > > > I'll submit v5 of this patch set with separate condition > > > variables for > > > > each job type. This should make the occurrence of waking > > the wrong > > > > thread quite rare, while reasoning about correctness is > pretty > > > > straightforward. I think that's good enough. > > > > What about using QemuSemaphore then? It does not seem to have the > > problem same with QemuEvent. > > > > > > Nowhere else in the code base uses short-lived semaphores, and while I > > can't immediately see a risk (the mutex is held during both post and > > wait) there might be some non-obvious problem with the approach. > > Internally, the semaphores use condition variables. The solution using > > condition variables directly already works, is safe, relatively easy to > > reason about, and does not cause any performance issues. There is a tiny > > inefficiency about waking up a thread unnecessarily in the rare case > > when two callbacks of the same kind occur concurrently. In practice, > > it's irrelevant. Thanks to the awkward mismatch of the > > PVGraphics.framework's libdispatch based approach and Qemu's BQL/AIO/BH > > approach, we are already sending messages to other threads very > > frequently. This isn't ideal, but not fixable without drastically > > reducing the need to acquire the BQL across Qemu. > > I found several usage of ephemeral semaphores: > h_random() in hw/ppc/spapr_rng.c > colo_process_checkpoint() in migration/colo.c > postcopy_thread_create() in migration/postcopy-ram.c > > I'm sure short-lived semaphores will keep working (or break migration in > strange ways). > > > > > I do not think it is worth spending even more time trying to fix this > > part of the code which isn't broken in the first place. > > I'm sorry to bring you to this mess, which I didn't really expect. I > thought combining a shared pair of conditional variable and mutex and > job-specific bools is unnecessarily complex, and having one > synchronization primitive for each job will be simpler and will just work. > With multithreading, the devil is always in the detail! 😅 I wouldn't mind if we were seeing genuine issues with the Mutex/Cond code, but it's fine as far as I can tell. 
The QemuEvent version wasn't even really any simpler (replacing bool done; with QemuEvent done_event; and await -> init/wait/destroy gets longer while lock/broadcast/unlock -> set gets shorter), and I'd guess a QemuSemaphore version would be about the same. Relying on how an edge case is handled (destroying immediately after waiting) also risks making the code more fragile in the long term, in case implementation details change. I think we've reached the bikeshedding stage here, and I suggest any further improvements to this part, other than bug fixes, should be deferred to future patches.

> However, there was a pitfall with QemuEvent as you demonstrated, and when I
> now grep for QemuEvent and QemuSemaphore I find that all such ephemeral usage
> is written with QemuSemaphore instead of QemuEvent. I think the critical
> problem here is that it is not documented that qemu_event_destroy() cannot be
> used immediately after qemu_event_wait(). We would not have run into this
> situation if that were written down. I will write a patch to add such a
> documentation comment.

Sounds good, I'd be happy to review it (cc me).

Thanks for the in-depth reviews on all the patches in this set! You've prompted me to make some significant improvements, even if the experience with the job signalling has ended up being a frustrating one. The code is otherwise definitely better now than before. I've just posted v5 of the patch set and I hope we're pretty close to "good enough" now.

All the best,
Phil
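For illustration, here is a minimal sketch of the ephemeral-QemuEvent pattern that the race described above makes unsafe. It is not code from the patch set; the Job type and submit_and_wait() are hypothetical names, and only the QemuEvent/BH calls are real QEMU APIs.

/* Hypothetical illustration only - not code from this patch series. */
typedef struct Job {
    QemuEvent done_event;   /* ephemeral: lives on the caller's stack */
    /* ... job parameters and results ... */
} Job;

/* Trigger side: BH running on the main AIO context (holds the BQL). */
static void job_bh(void *opaque)
{
    Job *job = opaque;

    /* ... perform the work ... */
    qemu_event_set(&job->done_event);    /* may still be inside
                                          * qemu_futex_wake(), locking the
                                          * event's internal mutex, after the
                                          * waiter has already returned */
}

/* Waiter side: PVG callback running on a libdispatch queue. */
static void submit_and_wait(void)
{
    Job job;

    qemu_event_init(&job.done_event, false);
    aio_bh_schedule_oneshot(qemu_get_aio_context(), job_bh, &job);
    qemu_event_wait(&job.done_event);
    qemu_event_destroy(&job.done_event); /* racy: can destroy the mutex the
                                          * trigger thread is about to lock */
}

The long-lived mutex/condition-variable scheme (and, later, the per-job semaphore) avoids exactly this destroy-while-waking window.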
On 2024/10/30 6:16, Phil Dennis-Jordan wrote: > > > On Tue, 29 Oct 2024 at 08:42, Akihiko Odaki <akihiko.odaki@daynix.com > <mailto:akihiko.odaki@daynix.com>> wrote: > > On 2024/10/29 6:06, Phil Dennis-Jordan wrote: > > > > > > On Mon, 28 Oct 2024 at 17:06, Akihiko Odaki > <akihiko.odaki@daynix.com <mailto:akihiko.odaki@daynix.com> > > <mailto:akihiko.odaki@daynix.com > <mailto:akihiko.odaki@daynix.com>>> wrote: > > > > On 2024/10/28 23:13, Phil Dennis-Jordan wrote: > > > > > > > > > On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki > > <akihiko.odaki@daynix.com <mailto:akihiko.odaki@daynix.com> > <mailto:akihiko.odaki@daynix.com <mailto:akihiko.odaki@daynix.com>> > > > <mailto:akihiko.odaki@daynix.com > <mailto:akihiko.odaki@daynix.com> > > <mailto:akihiko.odaki@daynix.com > <mailto:akihiko.odaki@daynix.com>>>> wrote: > > > > > > On 2024/10/28 22:31, Phil Dennis-Jordan wrote: > > > > > > > > > > > > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan > > > <phil@philjordan.eu <mailto:phil@philjordan.eu> > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu>> > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu> > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu>>> > > > > <mailto:phil@philjordan.eu > <mailto:phil@philjordan.eu> <mailto:phil@philjordan.eu > <mailto:phil@philjordan.eu>> > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu> > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu>>>>> wrote: > > > > > > > > > > > > > > > > > > > > Hmm. I think if we were to use > that, we > > would > > > need to > > > > create a new > > > > > > QemuEvent for every job and > destroy it > > afterward, > > > > which seems > > > > > expensive. > > > > > > We can't rule out multiple concurrent > > jobs being > > > > submitted, and the > > > > > > QemuEvent system only supports a > single > > producer as > > > > far as I can > > > > > tell. > > > > > > > > > > > > You can probably sort of hack > around it with > > > just one > > > > QemuEvent by > > > > > > putting the qemu_event_wait into > a loop > > and turning > > > > the job.done > > > > > flag > > > > > > into an atomic (because it would now > > need to be > > > > checked outside the > > > > > > lock) but this all seems > unnecessarily > > complicated > > > > considering the > > > > > > QemuEvent uses the same mechanism > QemuCond/ > > > QemuMutex > > > > internally > > > > > on macOS > > > > > > (the only platform relevant > here), except we > > > can use it as > > > > > intended with > > > > > > QemuCond/QemuMutex rather than > having to > > work > > > against the > > > > > abstraction. > > > > > > > > > > I don't think it's going to be used > > concurrently. It > > > > would be difficult > > > > > to reason even for the framework if it > > performs memory > > > > > unmapping/mapping/reading operations > > concurrently. > > > > > > > > > > > > > > > I've just performed a very quick test by > > wrapping the job > > > > submission/ > > > > > wait in the 2 mapMemory callbacks and the 1 > > readMemory > > > > callback with > > > > > atomic counters and logging whenever a > counter went > > > above 1. > > > > > > > > > > * Overall, concurrent callbacks across all > > types were > > > > common (many per > > > > > second when the VM is busy). It's not > exactly a > > > "thundering > > > > herd" (I > > > > > never saw >2) but it's probably not a > bad idea > > to use > > > a separate > > > > > condition variable for each job type. 
> (task map, > > > surface map, > > > > memory read) > > > > > * While I did not observe any concurrent > > memory mapping > > > > operations > > > > > *within* a type of memory map (2 task > mappings or 2 > > > surface > > > > mappings) I > > > > > did see very occasional concurrent > memory *read* > > > callbacks. > > > > These would, > > > > > as far as I can tell, not be safe with > QemuEvents, > > > unless we > > > > placed the > > > > > event inside the job struct and init/ > destroyed > > it on every > > > > callback > > > > > (which seems like excessive overhead). > > > > > > > > I think we can tolerate that overhead. > init/destroy > > > essentially > > > > sets the > > > > fields in the data structure and I estimate its > > total size is > > > > about 100 > > > > bytes. It is probably better than waking an > > irrelevant thread > > > > up. I also > > > > hope that keeps the code simple; it's not > worthwhile > > > adding code to > > > > optimize this. > > > > > > > > > > > > At least pthread_cond_{init,destroy} and > > > > pthread_mutex_{init,destroy} don't make any > syscalls, > > so yeah > > > it's > > > > probably an acceptable overhead. > > > > > > > > > > > > I've just experimented with QemuEvents created on- > demand > > and ran > > > into > > > > some weird deadlocks, which then made me sit down and > > think about it > > > > some more. I've come to the conclusion that > creating (and > > crucially, > > > > destroying) QemuEvents on demand in this way is not > safe. > > > > > > > > Specifically, you must not call > qemu_event_destroy() - which > > > > transitively destroys the mutex and condition > variable - > > unless > > > you can > > > > guarantee that the qemu_event_set() call on that event > > object has > > > completed. > > > > > > > > In qemu_event_set, the event object's value is > atomically > > set to > > > EV_SET. > > > > If the previous value was EV_BUSY, qemu_futex_wake() is > > called. > > > All of > > > > this is outside any mutex, however, so apart from > memory > > coherence > > > > (there are barriers) this can race with the waiting > thread. > > > > qemu_event_wait() reads the event's value. If > EV_FREE, it's > > > atomically > > > > set to EV_BUSY. Then the mutex is locked, the value > is checked > > > again, > > > > and if it's still EV_BUSY, it waits for the condition > > variable, > > > > otherwise the mutex is immediately unlocked again. > If the > > trigger > > > > thread's qemu_event_set() flip to EV_SET occurs between > > the waiting > > > > thread's two atomic reads of the value, the waiting > thread > > will > > > never > > > > wait for the condition variable, but the trigger thread > > WILL try to > > > > acquire the mutex and signal the condition variable in > > > > qemu_futex_wake(), by which time the waiting > thread may have > > > advanced > > > > outside of qemu_event_wait(). > > > > > > Sorry if I'm making a mistake again, but the waiting > thread won't > > > set to > > > EV_BUSY unless the value is EV_FREE on the second read > so the > > trigger > > > thread will not call qemu_futex_wake() if it manages > to set > > to EV_SET > > > before the second read, will it? 
> > > > > > > > > This sequence of events will cause the problem: > > > > > > WAITER (in qemu_event_wait): > > > value = qatomic_load_acquire(&ev->value); > > > -> EV_FREE > > > > > > TRIGGER (in qemu_event_set): > > > qatomic_read(&ev->value) != EV_SET > > > -> EV_FREE (condition is false) > > > > > > WAITER: > > > qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET > > > -> cmpxchg returns EV_FREE, condition false. > > > ev->value = EV_BUSY. > > > > TRIGGER: > > > int old = qatomic_xchg(&ev->value, EV_SET); > > > smp_mb__after_rmw(); > > > if (old == EV_BUSY) { > > > -> old = EV_BUSY, condition true. > > > ev->value = EV_SET > > > > > > WAITER (in qemu_futex_wait(ev, EV_BUSY)): > > > pthread_mutex_lock(&ev->lock); > > > if (ev->value == val) { > > > -> false, because value is EV_SET > > > > > > WAITER: > > > pthread_mutex_unlock(&ev->lock); > > > … > > > qemu_event_destroy(&job->done_event); > > > > > > TRIGGER (in qemu_futex_wake(ev, INT_MAX)): > > > pthread_mutex_lock(&ev->lock); > > > -> hangs, because mutex has been destroyed > > > > Thanks for clarification. This is very insightful. > > > > > > > > > > > > > > > This is all fine usually, BUT if you destroy the > QemuEvent > > > immediately > > > > after the qemu_event_wait() call, qemu_futex_wake() > may try to > > > lock a > > > > mutex that has been destroyed, or signal a > condition variable > > > which has > > > > been destroyed. I don't see a reasonable way of making > > this safe > > > other > > > > than using long-lived mutexes and condition > variables. And > > > anyway, we > > > > have much, MUCH bigger contention/performance issues > > coming from > > > almost > > > > everything being covered by the BQL. (If waking these > > callbacks > > > can even > > > > be considered an issue: I haven't seen it show up in > > profiling, > > > whereas > > > > BQL contention very much does.) > > > > > > > > I'll submit v5 of this patch set with separate > condition > > > variables for > > > > each job type. This should make the occurrence of > waking > > the wrong > > > > thread quite rare, while reasoning about > correctness is pretty > > > > straightforward. I think that's good enough. > > > > What about using QemuSemaphore then? It does not seem to have the > > problem same with QemuEvent. > > > > > > Nowhere else in the code base uses short-lived semaphores, and > while I > > can't immediately see a risk (the mutex is held during both post and > > wait) there might be some non-obvious problem with the approach. > > Internally, the semaphores use condition variables. The solution > using > > condition variables directly already works, is safe, relatively > easy to > > reason about, and does not cause any performance issues. There is > a tiny > > inefficiency about waking up a thread unnecessarily in the rare case > > when two callbacks of the same kind occur concurrently. In practice, > > it's irrelevant. Thanks to the awkward mismatch of the > > PVGraphics.framework's libdispatch based approach and Qemu's BQL/ > AIO/BH > > approach, we are already sending messages to other threads very > > frequently. This isn't ideal, but not fixable without drastically > > reducing the need to acquire the BQL across Qemu. > > I found several usage of ephemeral semaphores: > h_random() in hw/ppc/spapr_rng.c > colo_process_checkpoint() in migration/colo.c > postcopy_thread_create() in migration/postcopy-ram.c > > I'm sure short-lived semaphores will keep working (or break > migration in > strange ways). 
> > > I do not think it is worth spending even more time trying to fix this
> > > part of the code which isn't broken in the first place.
> >
> > I'm sorry to bring you to this mess, which I didn't really expect. I
> > thought combining a shared pair of conditional variable and mutex and
> > job-specific bools is unnecessarily complex, and having one
> > synchronization primitive for each job will be simpler and will just
> > work.
>
> With multithreading, the devil is always in the detail! 😅 I wouldn't
> mind if we were seeing genuine issues with the Mutex/Cond code, but it's
> fine as far as I can tell. The QemuEvent version wasn't even really any
> simpler (replacing bool done; with QemuEvent done_event; and await ->
> init/wait/destroy gets longer while lock/broadcast/unlock -> set gets
> shorter), and I guess a QemuSemaphore version would be about the same.
> Relying on the way an edge case is handled - destroying immediately
> after waiting - in the long term potentially makes the code more fragile
> too in case implementation details change. I think we've reached a
> bikeshedding stage here, and I suggest any further improvements on this
> part other than bug fixes should be deferred to future patches.

We still have more than bikeshedding. There are two design options under discussion:

1) Whether the synchronization primitives should be localized
2) Whether a short-lived QemuSemaphore is appropriate, if 1) is true

We discussed 2) in detail, but haven't done much for 1), so there is still some room for discussion.

Even for 2), let me emphasize that avoiding an ephemeral QemuSemaphore in one device implementation is not an appropriate way to deal with its potential problems when similar usage already exists elsewhere. QEMU needs to be correct as a whole, and having a workaround in only one part of its codebase is not OK. We need to either follow existing patterns or prepare for even more discussion (and I'm for the former).

Regarding 1), I think it's easier just to show code. Below is my idea of a code change to localize the synchronization primitives. This code is not tested or even compiled, but it should be sufficient to demonstrate the idea. There are a few notable observations:

a) apple_gfx_await_bh_job() can be extended to absorb all the repetitive code of BH jobs. Such a change is probably beneficial even when the synchronization primitives are shared, but it is more beneficial when they are localized since it allows wrapping init and destroy.

b) There is no need to declare multiple condition variables and choose one of them for each job type. Instead we can have one definition and instantiate it whenever creating a BH job.

c) Localized synchronization primitives make reasoning simpler and make the comment in apple-gfx.h unnecessary. We still need the QemuEvent vs. QemuSemaphore discussion, but it will be dealt with in the common code, so apple-gfx does not need its own comment.
Regards, Akihiko Odaki --- hw/display/apple-gfx.h | 19 +++--------- hw/display/apple-gfx-mmio.m | 23 ++------------ hw/display/apple-gfx.m | 60 +++++++++++++++++-------------------- 3 files changed, 34 insertions(+), 68 deletions(-) diff --git a/hw/display/apple-gfx.h b/hw/display/apple-gfx.h index e9fef09e37ea..b5aeed4f3dcf 100644 --- a/hw/display/apple-gfx.h +++ b/hw/display/apple-gfx.h @@ -40,19 +40,6 @@ typedef struct AppleGFXState { dispatch_queue_t render_queue; struct AppleGFXDisplayMode *display_modes; uint32_t num_display_modes; - /* - * QemuMutex & QemuConds for awaiting completion of PVG memory-mapping and - * reading requests after submitting them to run in the AIO context. - * QemuCond (rather than QemuEvent) are used so multiple concurrent jobs - * can be handled safely. - * The state associated with each job is tracked in a AppleGFX*Job struct - * for each kind of job; instances are allocated on the caller's stack. - * This struct also contains the completion flag which is used in - * conjunction with the condition variable. - */ - QemuMutex job_mutex; - QemuCond task_map_job_cond; - QemuCond mem_read_job_cond; /* tasks is protected by task_mutex */ QemuMutex task_mutex; @@ -82,8 +69,10 @@ void apple_gfx_common_realize(AppleGFXState *s, PGDeviceDescriptor *desc, uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t guest_physical, uint64_t length, bool read_only, MemoryRegion **mapping_in_region); -void apple_gfx_await_bh_job(AppleGFXState *s, QemuCond *job_cond, - bool *job_done_flag); +void apple_gfx_bh_job_run_full(QEMUBHFunc *cb, void *opaque, const char *name); + +#define apple_gfx_bh_job_run(cb, opaque) \ + apple_gfx_bh_job_run_full((cb), (opaque), (stringify(cb))) extern const PropertyInfo qdev_prop_display_mode; diff --git a/hw/display/apple-gfx-mmio.m b/hw/display/apple-gfx-mmio.m index a801c5fa722e..889a23df89e9 100644 --- a/hw/display/apple-gfx-mmio.m +++ b/hw/display/apple-gfx-mmio.m @@ -60,8 +60,6 @@ -(void)mmioWriteAtOffset:(size_t)offset value:(uint32_t)value; AppleGFXState common; - QemuCond iosfc_map_job_cond; - QemuCond iosfc_unmap_job_cond; qemu_irq irq_gfx; qemu_irq irq_iosfc; MemoryRegion iomem_iosfc; @@ -154,7 +152,6 @@ static void raise_irq(void *opaque) AppleGFXMMIOState *state; bool read_only; bool success; - bool done; } AppleGFXMapSurfaceMemoryJob; typedef struct AppleGFXMMIOMappedRegion { @@ -203,18 +200,13 @@ static void apple_gfx_mmio_map_surface_memory(void *opaque) } } - qemu_mutex_lock(&s->common.job_mutex); job->result_mem = (void *)mem; job->success = mem != 0; - job->done = true; - qemu_cond_broadcast(&s->iosfc_map_job_cond); - qemu_mutex_unlock(&s->common.job_mutex); } typedef struct AppleGFXUnmapSurfaceMemoryJob { void *virtual_address; AppleGFXMMIOState *state; - bool done; } AppleGFXUnmapSurfaceMemoryJob; static AppleGFXMMIOMappedRegion *find_mapped_region_containing(GArray *regions, @@ -257,11 +249,6 @@ static void apple_gfx_mmio_unmap_surface_memory(void *opaque) __func__, job->virtual_address, regions->len); } - - qemu_mutex_lock(&s->common.job_mutex); - job->done = true; - qemu_cond_broadcast(&s->iosfc_unmap_job_cond); - qemu_mutex_unlock(&s->common.job_mutex); } static PGIOSurfaceHostDevice *apple_gfx_prepare_iosurface_host_device( @@ -278,9 +265,7 @@ static void apple_gfx_mmio_unmap_surface_memory(void *opaque) .read_only = ro, .state = s, }; - aio_bh_schedule_oneshot(qemu_get_aio_context(), - apple_gfx_mmio_map_surface_memory, &job); - apple_gfx_await_bh_job(&s->common, &s->iosfc_map_job_cond, &job.done); + 
apple_gfx_bh_job_run(apple_gfx_mmio_map_surface_memory, &job); *va = job.result_mem; @@ -295,9 +280,7 @@ static void apple_gfx_mmio_unmap_surface_memory(void *opaque) AppleGFXUnmapSurfaceMemoryJob job = { va, s }; trace_apple_gfx_iosfc_unmap_memory(va, b, c, d, e, f); - aio_bh_schedule_oneshot(qemu_get_aio_context(), - apple_gfx_mmio_unmap_surface_memory, &job); - apple_gfx_await_bh_job(&s->common, &s->iosfc_unmap_job_cond, &job.done); + apple_gfx_bh_job_run(apple_gfx_mmio_unmap_surface_memory, &job); return true; }; @@ -336,8 +319,6 @@ static void apple_gfx_mmio_realize(DeviceState *dev, Error **errp) 2 /* Usually no more RAM regions*/); apple_gfx_common_realize(&s->common, desc, errp); - qemu_cond_init(&s->iosfc_map_job_cond); - qemu_cond_init(&s->iosfc_unmap_job_cond); [desc release]; desc = nil; diff --git a/hw/display/apple-gfx.m b/hw/display/apple-gfx.m index 2e264e5561fc..4d174e766310 100644 --- a/hw/display/apple-gfx.m +++ b/hw/display/apple-gfx.m @@ -90,6 +90,31 @@ static dispatch_queue_t get_background_queue(void) return task; } +typedef struct AppleGFXJob { + QEMUBHFunc cb; + void *opaque; + QemuSemaphore sem; +} AppleGFXJob; + +void apple_gfx_bh_job_cb(void *opaque) +{ + AppleGFXJob *job = opaque; + job->cb(job->opaque); + qemu_sem_post(&job->sem); +} + +void apple_gfx_bh_job_run_full(QEMUBHFunc *cb, void *opaque, const char *name) +{ + AppleGFXJob job; + job->cb = cb; + job->opaque = opaque; + qemu_sem_init(&job->sem, 0); + aio_bh_schedule_oneshot_full(qemu_get_aio_context(), apple_gfx_bh_job_cb, + &job, name); + qemu_sem_wait(&job->sem); + qemu_sem_destroy(&job->sem); +} + typedef struct AppleGFXIOJob { AppleGFXState *state; uint64_t offset; @@ -355,7 +380,6 @@ void apple_gfx_common_init(Object *obj, AppleGFXState *s, const char* obj_name) uint32_t range_count; bool read_only; bool success; - bool done; } AppleGFXMapMemoryJob; uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t guest_physical, @@ -457,20 +481,7 @@ static void apple_gfx_map_memory(void *opaque) g_assert(r == KERN_SUCCESS); } - qemu_mutex_lock(&s->job_mutex); job->success = success; - job->done = true; - qemu_cond_broadcast(&s->task_map_job_cond); - qemu_mutex_unlock(&s->job_mutex); -} - -void apple_gfx_await_bh_job(AppleGFXState *s, QemuCond *job_cond, bool *job_done_flag) -{ - qemu_mutex_lock(&s->job_mutex); - while (!*job_done_flag) { - qemu_cond_wait(job_cond, &s->job_mutex); - } - qemu_mutex_unlock(&s->job_mutex); } typedef struct AppleGFXReadMemoryJob { @@ -478,8 +489,6 @@ void apple_gfx_await_bh_job(AppleGFXState *s, QemuCond *job_cond, bool *job_done hwaddr physical_address; uint64_t length; void *dst; - bool done; - bool success; } AppleGFXReadMemoryJob; static void apple_gfx_do_read_memory(void *opaque) @@ -491,11 +500,6 @@ static void apple_gfx_do_read_memory(void *opaque) r = dma_memory_read(&address_space_memory, job->physical_address, job->dst, job->length, MEMTXATTRS_UNSPECIFIED); job->success = r == MEMTX_OK; - - qemu_mutex_lock(&s->job_mutex); - job->done = true; - qemu_cond_broadcast(&s->mem_read_job_cond); - qemu_mutex_unlock(&s->job_mutex); } static bool apple_gfx_read_memory(AppleGFXState *s, hwaddr physical_address, @@ -508,9 +512,7 @@ static bool apple_gfx_read_memory(AppleGFXState *s, hwaddr physical_address, trace_apple_gfx_read_memory(physical_address, length, dst); /* Traversing the memory map requires RCU/BQL, so do it in a BH. 
*/ - aio_bh_schedule_oneshot(qemu_get_aio_context(), apple_gfx_do_read_memory, - &job); - apple_gfx_await_bh_job(s, &s->mem_read_job_cond, &job.done); + apple_gfx_bh_job_run(s, apple_gfx_do_read_memory, &job); return job.success; } @@ -556,12 +558,10 @@ static void apple_gfx_register_task_mapping_handlers(AppleGFXState *s, .state = s, .task = task, .ranges = ranges, .range_count = range_count, .read_only = read_only, .virtual_offset = virtual_offset, - .done = false, .success = true, + .success = true, }; if (range_count > 0) { - aio_bh_schedule_oneshot(qemu_get_aio_context(), - apple_gfx_map_memory, &job); - apple_gfx_await_bh_job(s, &s->task_map_job_cond, &job.done); + apple_gfx_bh_job_run(s, apple_gfx_map_memory, &job); } return job.success; }; @@ -780,10 +780,6 @@ void apple_gfx_common_realize(AppleGFXState *s, PGDeviceDescriptor *desc, apple_gfx_create_display_mode_array(display_modes, num_display_modes); create_fb(s); - - qemu_mutex_init(&s->job_mutex); - qemu_cond_init(&s->task_map_job_cond); - qemu_cond_init(&s->mem_read_job_cond); } static void apple_gfx_get_display_mode(Object *obj, Visitor *v,
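For reference, a compilable take on the helper proposed in the diff above might look as follows. The posted diff is explicitly untested; this sketch keeps its names but uses a function-pointer member and fixes the job->cb vs. job.cb slips. It is an illustration, not code from the series.

/* Sketch only, following the names in the untested diff above. */
typedef struct AppleGFXJob {
    QEMUBHFunc *cb;        /* function pointer, not a bare function type */
    void *opaque;
    QemuSemaphore sem;     /* one ephemeral semaphore per job */
} AppleGFXJob;

static void apple_gfx_bh_job_cb(void *opaque)
{
    AppleGFXJob *job = opaque;

    job->cb(job->opaque);
    qemu_sem_post(&job->sem);
}

void apple_gfx_bh_job_run_full(QEMUBHFunc *cb, void *opaque, const char *name)
{
    AppleGFXJob job = { .cb = cb, .opaque = opaque };

    qemu_sem_init(&job.sem, 0);
    aio_bh_schedule_oneshot_full(qemu_get_aio_context(), apple_gfx_bh_job_cb,
                                 &job, name);
    qemu_sem_wait(&job.sem);
    qemu_sem_destroy(&job.sem);
}

Call sites would then pass only the callback and its argument, e.g. apple_gfx_bh_job_run(apple_gfx_do_read_memory, &job), matching the two-argument macro rather than the three-argument calls that appear in a couple of places in the diff.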
On Thu, 31 Oct 2024 at 07:52, Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > On 2024/10/30 6:16, Phil Dennis-Jordan wrote: > > > > > > On Tue, 29 Oct 2024 at 08:42, Akihiko Odaki <akihiko.odaki@daynix.com > > <mailto:akihiko.odaki@daynix.com>> wrote: > > > > On 2024/10/29 6:06, Phil Dennis-Jordan wrote: > > > > > > > > > On Mon, 28 Oct 2024 at 17:06, Akihiko Odaki > > <akihiko.odaki@daynix.com <mailto:akihiko.odaki@daynix.com> > > > <mailto:akihiko.odaki@daynix.com > > <mailto:akihiko.odaki@daynix.com>>> wrote: > > > > > > On 2024/10/28 23:13, Phil Dennis-Jordan wrote: > > > > > > > > > > > > On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki > > > <akihiko.odaki@daynix.com <mailto:akihiko.odaki@daynix.com> > > <mailto:akihiko.odaki@daynix.com <mailto:akihiko.odaki@daynix.com>> > > > > <mailto:akihiko.odaki@daynix.com > > <mailto:akihiko.odaki@daynix.com> > > > <mailto:akihiko.odaki@daynix.com > > <mailto:akihiko.odaki@daynix.com>>>> wrote: > > > > > > > > On 2024/10/28 22:31, Phil Dennis-Jordan wrote: > > > > > > > > > > > > > > > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan > > > > <phil@philjordan.eu <mailto:phil@philjordan.eu> > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu>> > > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu> > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu>>> > > > > > <mailto:phil@philjordan.eu > > <mailto:phil@philjordan.eu> <mailto:phil@philjordan.eu > > <mailto:phil@philjordan.eu>> > > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu> > > <mailto:phil@philjordan.eu <mailto:phil@philjordan.eu>>>>> wrote: > > > > > > > > > > > > > > > > > > > > > > > > Hmm. I think if we were to use > > that, we > > > would > > > > need to > > > > > create a new > > > > > > > QemuEvent for every job and > > destroy it > > > afterward, > > > > > which seems > > > > > > expensive. > > > > > > > We can't rule out multiple > concurrent > > > jobs being > > > > > submitted, and the > > > > > > > QemuEvent system only supports a > > single > > > producer as > > > > > far as I can > > > > > > tell. > > > > > > > > > > > > > > You can probably sort of hack > > around it with > > > > just one > > > > > QemuEvent by > > > > > > > putting the qemu_event_wait into > > a loop > > > and turning > > > > > the job.done > > > > > > flag > > > > > > > into an atomic (because it would > now > > > need to be > > > > > checked outside the > > > > > > > lock) but this all seems > > unnecessarily > > > complicated > > > > > considering the > > > > > > > QemuEvent uses the same mechanism > > QemuCond/ > > > > QemuMutex > > > > > internally > > > > > > on macOS > > > > > > > (the only platform relevant > > here), except we > > > > can use it as > > > > > > intended with > > > > > > > QemuCond/QemuMutex rather than > > having to > > > work > > > > against the > > > > > > abstraction. > > > > > > > > > > > > I don't think it's going to be used > > > concurrently. It > > > > > would be difficult > > > > > > to reason even for the framework if > it > > > performs memory > > > > > > unmapping/mapping/reading operations > > > concurrently. > > > > > > > > > > > > > > > > > > I've just performed a very quick test by > > > wrapping the job > > > > > submission/ > > > > > > wait in the 2 mapMemory callbacks and > the 1 > > > readMemory > > > > > callback with > > > > > > atomic counters and logging whenever a > > counter went > > > > above 1. 
> > > > > > > > > > > > * Overall, concurrent callbacks across > all > > > types were > > > > > common (many per > > > > > > second when the VM is busy). It's not > > exactly a > > > > "thundering > > > > > herd" (I > > > > > > never saw >2) but it's probably not a > > bad idea > > > to use > > > > a separate > > > > > > condition variable for each job type. > > (task map, > > > > surface map, > > > > > memory read) > > > > > > * While I did not observe any > concurrent > > > memory mapping > > > > > operations > > > > > > *within* a type of memory map (2 task > > mappings or 2 > > > > surface > > > > > mappings) I > > > > > > did see very occasional concurrent > > memory *read* > > > > callbacks. > > > > > These would, > > > > > > as far as I can tell, not be safe with > > QemuEvents, > > > > unless we > > > > > placed the > > > > > > event inside the job struct and init/ > > destroyed > > > it on every > > > > > callback > > > > > > (which seems like excessive overhead). > > > > > > > > > > I think we can tolerate that overhead. > > init/destroy > > > > essentially > > > > > sets the > > > > > fields in the data structure and I estimate > its > > > total size is > > > > > about 100 > > > > > bytes. It is probably better than waking an > > > irrelevant thread > > > > > up. I also > > > > > hope that keeps the code simple; it's not > > worthwhile > > > > adding code to > > > > > optimize this. > > > > > > > > > > > > > > > At least pthread_cond_{init,destroy} and > > > > > pthread_mutex_{init,destroy} don't make any > > syscalls, > > > so yeah > > > > it's > > > > > probably an acceptable overhead. > > > > > > > > > > > > > > > I've just experimented with QemuEvents created on- > > demand > > > and ran > > > > into > > > > > some weird deadlocks, which then made me sit down > and > > > think about it > > > > > some more. I've come to the conclusion that > > creating (and > > > crucially, > > > > > destroying) QemuEvents on demand in this way is not > > safe. > > > > > > > > > > Specifically, you must not call > > qemu_event_destroy() - which > > > > > transitively destroys the mutex and condition > > variable - > > > unless > > > > you can > > > > > guarantee that the qemu_event_set() call on that > event > > > object has > > > > completed. > > > > > > > > > > In qemu_event_set, the event object's value is > > atomically > > > set to > > > > EV_SET. > > > > > If the previous value was EV_BUSY, > qemu_futex_wake() is > > > called. > > > > All of > > > > > this is outside any mutex, however, so apart from > > memory > > > coherence > > > > > (there are barriers) this can race with the waiting > > thread. > > > > > qemu_event_wait() reads the event's value. If > > EV_FREE, it's > > > > atomically > > > > > set to EV_BUSY. Then the mutex is locked, the value > > is checked > > > > again, > > > > > and if it's still EV_BUSY, it waits for the > condition > > > variable, > > > > > otherwise the mutex is immediately unlocked again. > > If the > > > trigger > > > > > thread's qemu_event_set() flip to EV_SET occurs > between > > > the waiting > > > > > thread's two atomic reads of the value, the waiting > > thread > > > will > > > > never > > > > > wait for the condition variable, but the trigger > thread > > > WILL try to > > > > > acquire the mutex and signal the condition variable > in > > > > > qemu_futex_wake(), by which time the waiting > > thread may have > > > > advanced > > > > > outside of qemu_event_wait(). 
> > > > > > > > Sorry if I'm making a mistake again, but the waiting > > thread won't > > > > set to > > > > EV_BUSY unless the value is EV_FREE on the second read > > so the > > > trigger > > > > thread will not call qemu_futex_wake() if it manages > > to set > > > to EV_SET > > > > before the second read, will it? > > > > > > > > > > > > This sequence of events will cause the problem: > > > > > > > > WAITER (in qemu_event_wait): > > > > value = qatomic_load_acquire(&ev->value); > > > > -> EV_FREE > > > > > > > > TRIGGER (in qemu_event_set): > > > > qatomic_read(&ev->value) != EV_SET > > > > -> EV_FREE (condition is false) > > > > > > > > WAITER: > > > > qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET > > > > -> cmpxchg returns EV_FREE, condition false. > > > > ev->value = EV_BUSY. > > > > > TRIGGER: > > > > int old = qatomic_xchg(&ev->value, EV_SET); > > > > smp_mb__after_rmw(); > > > > if (old == EV_BUSY) { > > > > -> old = EV_BUSY, condition true. > > > > ev->value = EV_SET > > > > > > > > WAITER (in qemu_futex_wait(ev, EV_BUSY)): > > > > pthread_mutex_lock(&ev->lock); > > > > if (ev->value == val) { > > > > -> false, because value is EV_SET > > > > > > > > WAITER: > > > > pthread_mutex_unlock(&ev->lock); > > > > … > > > > qemu_event_destroy(&job->done_event); > > > > > > > > TRIGGER (in qemu_futex_wake(ev, INT_MAX)): > > > > pthread_mutex_lock(&ev->lock); > > > > -> hangs, because mutex has been destroyed > > > > > > Thanks for clarification. This is very insightful. > > > > > > > > > > > > > > > > > > > > This is all fine usually, BUT if you destroy the > > QemuEvent > > > > immediately > > > > > after the qemu_event_wait() call, qemu_futex_wake() > > may try to > > > > lock a > > > > > mutex that has been destroyed, or signal a > > condition variable > > > > which has > > > > > been destroyed. I don't see a reasonable way of > making > > > this safe > > > > other > > > > > than using long-lived mutexes and condition > > variables. And > > > > anyway, we > > > > > have much, MUCH bigger contention/performance issues > > > coming from > > > > almost > > > > > everything being covered by the BQL. (If waking > these > > > callbacks > > > > can even > > > > > be considered an issue: I haven't seen it show up in > > > profiling, > > > > whereas > > > > > BQL contention very much does.) > > > > > > > > > > I'll submit v5 of this patch set with separate > > condition > > > > variables for > > > > > each job type. This should make the occurrence of > > waking > > > the wrong > > > > > thread quite rare, while reasoning about > > correctness is pretty > > > > > straightforward. I think that's good enough. > > > > > > What about using QemuSemaphore then? It does not seem to have > the > > > problem same with QemuEvent. > > > > > > > > > Nowhere else in the code base uses short-lived semaphores, and > > while I > > > can't immediately see a risk (the mutex is held during both post > and > > > wait) there might be some non-obvious problem with the approach. > > > Internally, the semaphores use condition variables. The solution > > using > > > condition variables directly already works, is safe, relatively > > easy to > > > reason about, and does not cause any performance issues. There is > > a tiny > > > inefficiency about waking up a thread unnecessarily in the rare > case > > > when two callbacks of the same kind occur concurrently. In > practice, > > > it's irrelevant. 
Thanks to the awkward mismatch of the > > > PVGraphics.framework's libdispatch based approach and Qemu's BQL/ > > AIO/BH > > > approach, we are already sending messages to other threads very > > > frequently. This isn't ideal, but not fixable without drastically > > > reducing the need to acquire the BQL across Qemu. > > > > I found several usage of ephemeral semaphores: > > h_random() in hw/ppc/spapr_rng.c > > colo_process_checkpoint() in migration/colo.c > > postcopy_thread_create() in migration/postcopy-ram.c > > > > I'm sure short-lived semaphores will keep working (or break > > migration in > > strange ways). > > > > > > > > I do not think it is worth spending even more time trying to fix > > this > > > part of the code which isn't broken in the first place. > > > > I'm sorry to bring you to this mess, which I didn't really expect. I > > thought combining a shared pair of conditional variable and mutex and > > job-specific bools is unnecessarily complex, and having one > > synchronization primitive for each job will be simpler and will just > > work. > > > > > > With multithreading, the devil is always in the detail! 😅 I wouldn't > > mind if we were seeing genuine issues with the Mutex/Cond code, but it's > > fine as far as I can tell. The QemuEvent version wasn't even really any > > simpler (replacing bool done; with QemuEvent done_event; and await -> > > init/wait/destroy gets longer while lock/broadcast/unlock -> set gets > > shorter), and I guess a QemuSemaphore version would be about the same. > > Relying on the way an edge case is handled - destroying immediately > > after waiting - in the long term potentially makes the code more fragile > > too in case implementation details change. I think we've reached a > > bikeshedding stage here, and I suggest any further improvements on this > > part other than bug fixes should be deferred to future patches. > > We still have more than bikeshedding. There are two design options > discussed: > > 1) Whether synchronization primitives should be localized > 2) Whether short-lived QemuSemaphore is appropriate if 1) is true > > We discussed 2) in details, but haven't done much for 1) so there is > some room for discussion. > > Even for 2), let me emphasize that avoiding ephemeral QemuSemaphore in > one device implementation is not appropriate as means to deal with its > potential problem when there is similar existing usage. QEMU needs to be > correct as whole, and having a workaround only part of its codebase is > not OK. We need to either follow existing patterns or prepare for even > more discussion (and I'm for the former). > > Regarding 1), I think it's easier just to show code. Below is my idea of > code change to localize synchronization primitives. This code is not > tested or even compiled, but it should be sufficient to demonstrate the > idea. There are a few notable observations: > > a) apple_gfx_await_bh_job() can be extended to absorb all repetitive > code of BH jobs. Such a change is probably beneficial even when > synchronization primitives are shared, but it is more beneficial when > synchronization primitives are localized since it allows to wrap init > and destroy. > > b) No need to declare multiple conditional variables and choosing one of > them for each job type. Instead we can have one definition and let it > instantiated whenever creating BH jobs. > > c) Localized synchronization primitives make reasoning simpler and makes > the comment in apple-gfx.h unnecessary. We still need the discussion of > QemuEvent v.s. 
QemuSemaphore, but it will be dealt in the common code so > apple-gfx does not need to have its own comment. > After moving the memory mapping/unmapping code into RCU critical sections, there's now actually only a single BH job on which we need to block in the callback. This is the readMemory/DMA job. I've implemented that using an ephemeral QemuSemaphore now, although I've skipped the extra level of indirection, helper function and macro, as they'd only be used once. I've done some testing on both x86-64 and arm64 with the QemuSemaphore and unlike the QemuEvent, I haven't run into any deadlocks/hangs with it. I've just posted v6 of the patch set with this change (and some other changes vs v5). Thanks, Phil > Regards, > Akihiko Odaki > > --- > hw/display/apple-gfx.h | 19 +++--------- > hw/display/apple-gfx-mmio.m | 23 ++------------ > hw/display/apple-gfx.m | 60 +++++++++++++++++-------------------- > 3 files changed, 34 insertions(+), 68 deletions(-) > > diff --git a/hw/display/apple-gfx.h b/hw/display/apple-gfx.h > index e9fef09e37ea..b5aeed4f3dcf 100644 > --- a/hw/display/apple-gfx.h > +++ b/hw/display/apple-gfx.h > @@ -40,19 +40,6 @@ typedef struct AppleGFXState { > dispatch_queue_t render_queue; > struct AppleGFXDisplayMode *display_modes; > uint32_t num_display_modes; > - /* > - * QemuMutex & QemuConds for awaiting completion of PVG > memory-mapping and > - * reading requests after submitting them to run in the AIO context. > - * QemuCond (rather than QemuEvent) are used so multiple concurrent > jobs > - * can be handled safely. > - * The state associated with each job is tracked in a AppleGFX*Job > struct > - * for each kind of job; instances are allocated on the caller's > stack. > - * This struct also contains the completion flag which is used in > - * conjunction with the condition variable. 
> - */ > - QemuMutex job_mutex; > - QemuCond task_map_job_cond; > - QemuCond mem_read_job_cond; > > /* tasks is protected by task_mutex */ > QemuMutex task_mutex; > @@ -82,8 +69,10 @@ void apple_gfx_common_realize(AppleGFXState *s, > PGDeviceDescriptor *desc, > uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t guest_physical, > uint64_t length, bool > read_only, > MemoryRegion > **mapping_in_region); > -void apple_gfx_await_bh_job(AppleGFXState *s, QemuCond *job_cond, > - bool *job_done_flag); > +void apple_gfx_bh_job_run_full(QEMUBHFunc *cb, void *opaque, const char > *name); > + > +#define apple_gfx_bh_job_run(cb, opaque) \ > + apple_gfx_bh_job_run_full((cb), (opaque), (stringify(cb))) > > extern const PropertyInfo qdev_prop_display_mode; > > diff --git a/hw/display/apple-gfx-mmio.m b/hw/display/apple-gfx-mmio.m > index a801c5fa722e..889a23df89e9 100644 > --- a/hw/display/apple-gfx-mmio.m > +++ b/hw/display/apple-gfx-mmio.m > @@ -60,8 +60,6 @@ -(void)mmioWriteAtOffset:(size_t)offset > value:(uint32_t)value; > > AppleGFXState common; > > - QemuCond iosfc_map_job_cond; > - QemuCond iosfc_unmap_job_cond; > qemu_irq irq_gfx; > qemu_irq irq_iosfc; > MemoryRegion iomem_iosfc; > @@ -154,7 +152,6 @@ static void raise_irq(void *opaque) > AppleGFXMMIOState *state; > bool read_only; > bool success; > - bool done; > } AppleGFXMapSurfaceMemoryJob; > > typedef struct AppleGFXMMIOMappedRegion { > @@ -203,18 +200,13 @@ static void apple_gfx_mmio_map_surface_memory(void > *opaque) > } > } > > - qemu_mutex_lock(&s->common.job_mutex); > job->result_mem = (void *)mem; > job->success = mem != 0; > - job->done = true; > - qemu_cond_broadcast(&s->iosfc_map_job_cond); > - qemu_mutex_unlock(&s->common.job_mutex); > } > > typedef struct AppleGFXUnmapSurfaceMemoryJob { > void *virtual_address; > AppleGFXMMIOState *state; > - bool done; > } AppleGFXUnmapSurfaceMemoryJob; > > static AppleGFXMMIOMappedRegion *find_mapped_region_containing(GArray > *regions, > @@ -257,11 +249,6 @@ static void > apple_gfx_mmio_unmap_surface_memory(void *opaque) > __func__, > job->virtual_address, regions->len); > } > - > - qemu_mutex_lock(&s->common.job_mutex); > - job->done = true; > - qemu_cond_broadcast(&s->iosfc_unmap_job_cond); > - qemu_mutex_unlock(&s->common.job_mutex); > } > > static PGIOSurfaceHostDevice *apple_gfx_prepare_iosurface_host_device( > @@ -278,9 +265,7 @@ static void apple_gfx_mmio_unmap_surface_memory(void > *opaque) > .read_only = ro, .state = s, > }; > > - aio_bh_schedule_oneshot(qemu_get_aio_context(), > - apple_gfx_mmio_map_surface_memory, > &job); > - apple_gfx_await_bh_job(&s->common, &s->iosfc_map_job_cond, > &job.done); > + apple_gfx_bh_job_run(apple_gfx_mmio_map_surface_memory, &job); > > *va = job.result_mem; > > @@ -295,9 +280,7 @@ static void apple_gfx_mmio_unmap_surface_memory(void > *opaque) > AppleGFXUnmapSurfaceMemoryJob job = { va, s }; > trace_apple_gfx_iosfc_unmap_memory(va, b, c, d, e, f); > > - aio_bh_schedule_oneshot(qemu_get_aio_context(), > - > apple_gfx_mmio_unmap_surface_memory, &job); > - apple_gfx_await_bh_job(&s->common, > &s->iosfc_unmap_job_cond, &job.done); > + apple_gfx_bh_job_run(apple_gfx_mmio_unmap_surface_memory, > &job); > > return true; > }; > @@ -336,8 +319,6 @@ static void apple_gfx_mmio_realize(DeviceState *dev, > Error **errp) > 2 /* Usually no more RAM regions*/); > > apple_gfx_common_realize(&s->common, desc, errp); > - qemu_cond_init(&s->iosfc_map_job_cond); > - qemu_cond_init(&s->iosfc_unmap_job_cond); > > [desc release]; > desc = nil; > diff --git 
a/hw/display/apple-gfx.m b/hw/display/apple-gfx.m > index 2e264e5561fc..4d174e766310 100644 > --- a/hw/display/apple-gfx.m > +++ b/hw/display/apple-gfx.m > @@ -90,6 +90,31 @@ static dispatch_queue_t get_background_queue(void) > return task; > } > > +typedef struct AppleGFXJob { > + QEMUBHFunc cb; > + void *opaque; > + QemuSemaphore sem; > +} AppleGFXJob; > + > +void apple_gfx_bh_job_cb(void *opaque) > +{ > + AppleGFXJob *job = opaque; > + job->cb(job->opaque); > + qemu_sem_post(&job->sem); > +} > + > +void apple_gfx_bh_job_run_full(QEMUBHFunc *cb, void *opaque, const char > *name) > +{ > + AppleGFXJob job; > + job->cb = cb; > + job->opaque = opaque; > + qemu_sem_init(&job->sem, 0); > + aio_bh_schedule_oneshot_full(qemu_get_aio_context(), > apple_gfx_bh_job_cb, > + &job, name); > + qemu_sem_wait(&job->sem); > + qemu_sem_destroy(&job->sem); > +} > + > typedef struct AppleGFXIOJob { > AppleGFXState *state; > uint64_t offset; > @@ -355,7 +380,6 @@ void apple_gfx_common_init(Object *obj, > AppleGFXState *s, const char* obj_name) > uint32_t range_count; > bool read_only; > bool success; > - bool done; > } AppleGFXMapMemoryJob; > > uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t guest_physical, > @@ -457,20 +481,7 @@ static void apple_gfx_map_memory(void *opaque) > g_assert(r == KERN_SUCCESS); > } > > - qemu_mutex_lock(&s->job_mutex); > job->success = success; > - job->done = true; > - qemu_cond_broadcast(&s->task_map_job_cond); > - qemu_mutex_unlock(&s->job_mutex); > -} > - > -void apple_gfx_await_bh_job(AppleGFXState *s, QemuCond *job_cond, bool > *job_done_flag) > -{ > - qemu_mutex_lock(&s->job_mutex); > - while (!*job_done_flag) { > - qemu_cond_wait(job_cond, &s->job_mutex); > - } > - qemu_mutex_unlock(&s->job_mutex); > } > > typedef struct AppleGFXReadMemoryJob { > @@ -478,8 +489,6 @@ void apple_gfx_await_bh_job(AppleGFXState *s, > QemuCond *job_cond, bool *job_done > hwaddr physical_address; > uint64_t length; > void *dst; > - bool done; > - bool success; > } AppleGFXReadMemoryJob; > > static void apple_gfx_do_read_memory(void *opaque) > @@ -491,11 +500,6 @@ static void apple_gfx_do_read_memory(void *opaque) > r = dma_memory_read(&address_space_memory, job->physical_address, > job->dst, job->length, MEMTXATTRS_UNSPECIFIED); > job->success = r == MEMTX_OK; > - > - qemu_mutex_lock(&s->job_mutex); > - job->done = true; > - qemu_cond_broadcast(&s->mem_read_job_cond); > - qemu_mutex_unlock(&s->job_mutex); > } > > static bool apple_gfx_read_memory(AppleGFXState *s, hwaddr > physical_address, > @@ -508,9 +512,7 @@ static bool apple_gfx_read_memory(AppleGFXState *s, > hwaddr physical_address, > trace_apple_gfx_read_memory(physical_address, length, dst); > > /* Traversing the memory map requires RCU/BQL, so do it in a BH. 
*/ > - aio_bh_schedule_oneshot(qemu_get_aio_context(), > apple_gfx_do_read_memory, > - &job); > - apple_gfx_await_bh_job(s, &s->mem_read_job_cond, &job.done); > + apple_gfx_bh_job_run(s, apple_gfx_do_read_memory, &job); > return job.success; > } > > @@ -556,12 +558,10 @@ static void > apple_gfx_register_task_mapping_handlers(AppleGFXState *s, > .state = s, > .task = task, .ranges = ranges, .range_count = range_count, > .read_only = read_only, .virtual_offset = virtual_offset, > - .done = false, .success = true, > + .success = true, > }; > if (range_count > 0) { > - aio_bh_schedule_oneshot(qemu_get_aio_context(), > - apple_gfx_map_memory, &job); > - apple_gfx_await_bh_job(s, &s->task_map_job_cond, &job.done); > + apple_gfx_bh_job_run(s, apple_gfx_map_memory, &job); > } > return job.success; > }; > @@ -780,10 +780,6 @@ void apple_gfx_common_realize(AppleGFXState *s, > PGDeviceDescriptor *desc, > apple_gfx_create_display_mode_array(display_modes, > num_display_modes); > > create_fb(s); > - > - qemu_mutex_init(&s->job_mutex); > - qemu_cond_init(&s->task_map_job_cond); > - qemu_cond_init(&s->mem_read_job_cond); > } > > static void apple_gfx_get_display_mode(Object *obj, Visitor *v, > -- > 2.47.0 >
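As a rough illustration of the v6 approach described above (the actual patch may differ in detail), the one remaining blocking callback could look something like this, with an ephemeral, stack-allocated QemuSemaphore and no extra helper indirection:

/* Sketch of the v6 change described above; not the literal patch. */
typedef struct AppleGFXReadMemoryJob {
    QemuSemaphore sem;
    hwaddr physical_address;
    uint64_t length;
    void *dst;
    bool success;
} AppleGFXReadMemoryJob;

static void apple_gfx_do_read_memory(void *opaque)
{
    AppleGFXReadMemoryJob *job = opaque;
    MemTxResult r;

    r = dma_memory_read(&address_space_memory, job->physical_address,
                        job->dst, job->length, MEMTXATTRS_UNSPECIFIED);
    job->success = (r == MEMTX_OK);

    qemu_sem_post(&job->sem);
}

static bool apple_gfx_read_memory(AppleGFXState *s, hwaddr physical_address,
                                  uint64_t length, void *dst)
{
    AppleGFXReadMemoryJob job = {
        .physical_address = physical_address, .length = length, .dst = dst,
    };

    qemu_sem_init(&job.sem, 0);
    /* Traversing the memory map requires RCU/BQL, so do it in a BH. */
    aio_bh_schedule_oneshot(qemu_get_aio_context(), apple_gfx_do_read_memory,
                            &job);
    qemu_sem_wait(&job.sem);
    qemu_sem_destroy(&job.sem);

    return job.success;
}

Unlike the QemuEvent variant, qemu_sem_post() completes entirely under the semaphore's internal mutex, so destroying the semaphore right after a successful wait does not race with the posting thread.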
diff --git a/hw/display/Kconfig b/hw/display/Kconfig index 2250c740078..6a9b7b19ada 100644 --- a/hw/display/Kconfig +++ b/hw/display/Kconfig @@ -140,3 +140,12 @@ config XLNX_DISPLAYPORT config DM163 bool + +config MAC_PVG + bool + default y + +config MAC_PVG_MMIO + bool + depends on MAC_PVG && AARCH64 + diff --git a/hw/display/apple-gfx-mmio.m b/hw/display/apple-gfx-mmio.m new file mode 100644 index 00000000000..06131bc23f1 --- /dev/null +++ b/hw/display/apple-gfx-mmio.m @@ -0,0 +1,284 @@ +/* + * QEMU Apple ParavirtualizedGraphics.framework device, MMIO (arm64) variant + * + * Copyright © 2023 Amazon.com, Inc. or its affiliates. All Rights Reserved. + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + * + * ParavirtualizedGraphics.framework is a set of libraries that macOS provides + * which implements 3d graphics passthrough to the host as well as a + * proprietary guest communication channel to drive it. This device model + * implements support to drive that library from within QEMU as an MMIO-based + * system device for macOS on arm64 VMs. + */ + +#include "qemu/osdep.h" +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> +#include "apple-gfx.h" +#include "monitor/monitor.h" +#include "hw/sysbus.h" +#include "hw/irq.h" +#include "trace.h" + +OBJECT_DECLARE_SIMPLE_TYPE(AppleGFXMMIOState, APPLE_GFX_MMIO) + +/* + * ParavirtualizedGraphics.Framework only ships header files for the PCI + * variant which does not include IOSFC descriptors and host devices. We add + * their definitions here so that we can also work with the ARM version. + */ +typedef bool(^IOSFCRaiseInterrupt)(uint32_t vector); +typedef bool(^IOSFCUnmapMemory)( + void *, void *, void *, void *, void *, void *); +typedef bool(^IOSFCMapMemory)( + uint64_t phys, uint64_t len, bool ro, void **va, void *, void *); + +@interface PGDeviceDescriptor (IOSurfaceMapper) +@property (readwrite, nonatomic) bool usingIOSurfaceMapper; +@end + +@interface PGIOSurfaceHostDeviceDescriptor : NSObject +-(PGIOSurfaceHostDeviceDescriptor *)init; +@property (readwrite, nonatomic, copy, nullable) IOSFCMapMemory mapMemory; +@property (readwrite, nonatomic, copy, nullable) IOSFCUnmapMemory unmapMemory; +@property (readwrite, nonatomic, copy, nullable) IOSFCRaiseInterrupt raiseInterrupt; +@end + +@interface PGIOSurfaceHostDevice : NSObject +-(instancetype)initWithDescriptor:(PGIOSurfaceHostDeviceDescriptor *)desc; +-(uint32_t)mmioReadAtOffset:(size_t)offset; +-(void)mmioWriteAtOffset:(size_t)offset value:(uint32_t)value; +@end + +struct AppleGFXMapSurfaceMemoryJob; +struct AppleGFXMMIOState { + SysBusDevice parent_obj; + + AppleGFXState common; + + qemu_irq irq_gfx; + qemu_irq irq_iosfc; + MemoryRegion iomem_iosfc; + PGIOSurfaceHostDevice *pgiosfc; +}; + +typedef struct AppleGFXMMIOJob { + AppleGFXMMIOState *state; + uint64_t offset; + uint64_t value; + bool completed; +} AppleGFXMMIOJob; + +static void iosfc_do_read(void *opaque) +{ + AppleGFXMMIOJob *job = opaque; + job->value = [job->state->pgiosfc mmioReadAtOffset:job->offset]; + qatomic_set(&job->completed, true); + aio_wait_kick(); +} + +static uint64_t iosfc_read(void *opaque, hwaddr offset, unsigned size) +{ + AppleGFXMMIOJob job = { + .state = opaque, + .offset = offset, + .completed = false, + }; + AioContext *context = qemu_get_aio_context(); + dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); + + dispatch_async_f(queue, &job, iosfc_do_read); + 
AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); + + trace_apple_gfx_mmio_iosfc_read(offset, job.value); + return job.value; +} + +static void iosfc_do_write(void *opaque) +{ + AppleGFXMMIOJob *job = opaque; + [job->state->pgiosfc mmioWriteAtOffset:job->offset value:job->value]; + qatomic_set(&job->completed, true); + aio_wait_kick(); +} + +static void iosfc_write(void *opaque, hwaddr offset, uint64_t val, + unsigned size) +{ + AppleGFXMMIOJob job = { + .state = opaque, + .offset = offset, + .value = val, + .completed = false, + }; + AioContext *context = qemu_get_aio_context(); + dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); + + dispatch_async_f(queue, &job, iosfc_do_write); + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); + + trace_apple_gfx_mmio_iosfc_write(offset, val); +} + +static const MemoryRegionOps apple_iosfc_ops = { + .read = iosfc_read, + .write = iosfc_write, + .endianness = DEVICE_LITTLE_ENDIAN, + .valid = { + .min_access_size = 4, + .max_access_size = 8, + }, + .impl = { + .min_access_size = 4, + .max_access_size = 8, + }, +}; + +static void raise_iosfc_irq(void *opaque) +{ + AppleGFXMMIOState *s = opaque; + + qemu_irq_pulse(s->irq_iosfc); +} + +typedef struct AppleGFXMapSurfaceMemoryJob { + uint64_t guest_physical_address; + uint64_t guest_physical_length; + void *result_mem; + AppleGFXMMIOState *state; + bool read_only; + bool success; + bool done; +} AppleGFXMapSurfaceMemoryJob; + +static void apple_gfx_mmio_map_surface_memory(void *opaque) +{ + AppleGFXMapSurfaceMemoryJob *job = opaque; + AppleGFXMMIOState *s = job->state; + mach_vm_address_t mem; + + mem = apple_gfx_host_address_for_gpa_range(job->guest_physical_address, + job->guest_physical_length, + job->read_only); + + qemu_mutex_lock(&s->common.job_mutex); + job->result_mem = (void*)mem; + job->success = mem != 0; + job->done = true; + qemu_cond_broadcast(&s->common.job_cond); + qemu_mutex_unlock(&s->common.job_mutex); +} + +static PGIOSurfaceHostDevice *apple_gfx_prepare_iosurface_host_device( + AppleGFXMMIOState *s) +{ + PGIOSurfaceHostDeviceDescriptor *iosfc_desc = + [PGIOSurfaceHostDeviceDescriptor new]; + PGIOSurfaceHostDevice *iosfc_host_dev = nil; + + iosfc_desc.mapMemory = + ^bool(uint64_t phys, uint64_t len, bool ro, void **va, void *e, void *f) { + AppleGFXMapSurfaceMemoryJob job = { + .guest_physical_address = phys, .guest_physical_length = len, + .read_only = ro, .state = s, + }; + + aio_bh_schedule_oneshot(qemu_get_aio_context(), + apple_gfx_mmio_map_surface_memory, &job); + apple_gfx_await_bh_job(&s->common, &job.done); + + *va = job.result_mem; + + trace_apple_gfx_iosfc_map_memory(phys, len, ro, va, e, f, *va, + job.success); + + return job.success; + }; + + iosfc_desc.unmapMemory = + ^bool(void *a, void *b, void *c, void *d, void *e, void *f) { + trace_apple_gfx_iosfc_unmap_memory(a, b, c, d, e, f); + return true; + }; + + iosfc_desc.raiseInterrupt = ^bool(uint32_t vector) { + trace_apple_gfx_iosfc_raise_irq(vector); + aio_bh_schedule_oneshot(qemu_get_aio_context(), raise_iosfc_irq, s); + return true; + }; + + iosfc_host_dev = + [[PGIOSurfaceHostDevice alloc] initWithDescriptor:iosfc_desc]; + [iosfc_desc release]; + return iosfc_host_dev; +} + +static void raise_gfx_irq(void *opaque) +{ + AppleGFXMMIOState *s = opaque; + + qemu_irq_pulse(s->irq_gfx); +} + +static void apple_gfx_mmio_realize(DeviceState *dev, Error **errp) +{ + @autoreleasepool { + AppleGFXMMIOState *s = APPLE_GFX_MMIO(dev); + PGDeviceDescriptor *desc = [PGDeviceDescriptor 
new]; + + desc.raiseInterrupt = ^(uint32_t vector) { + trace_apple_gfx_raise_irq(vector); + aio_bh_schedule_oneshot(qemu_get_aio_context(), raise_gfx_irq, s); + }; + + desc.usingIOSurfaceMapper = true; + s->pgiosfc = apple_gfx_prepare_iosurface_host_device(s); + + apple_gfx_common_realize(&s->common, desc, errp); + [desc release]; + desc = nil; + } +} + +static void apple_gfx_mmio_init(Object *obj) +{ + AppleGFXMMIOState *s = APPLE_GFX_MMIO(obj); + + apple_gfx_common_init(obj, &s->common, TYPE_APPLE_GFX_MMIO); + + sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->common.iomem_gfx); + memory_region_init_io(&s->iomem_iosfc, obj, &apple_iosfc_ops, s, + TYPE_APPLE_GFX_MMIO, 0x10000); + sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->iomem_iosfc); + sysbus_init_irq(SYS_BUS_DEVICE(s), &s->irq_gfx); + sysbus_init_irq(SYS_BUS_DEVICE(s), &s->irq_iosfc); +} + +static void apple_gfx_mmio_reset(Object *obj, ResetType type) +{ + AppleGFXMMIOState *s = APPLE_GFX_MMIO(obj); + [s->common.pgdev reset]; +} + + +static void apple_gfx_mmio_class_init(ObjectClass *klass, void *data) +{ + DeviceClass *dc = DEVICE_CLASS(klass); + ResettableClass *rc = RESETTABLE_CLASS(klass); + + rc->phases.hold = apple_gfx_mmio_reset; + dc->hotpluggable = false; + dc->realize = apple_gfx_mmio_realize; +} + +static TypeInfo apple_gfx_mmio_types[] = { + { + .name = TYPE_APPLE_GFX_MMIO, + .parent = TYPE_SYS_BUS_DEVICE, + .instance_size = sizeof(AppleGFXMMIOState), + .class_init = apple_gfx_mmio_class_init, + .instance_init = apple_gfx_mmio_init, + } +}; +DEFINE_TYPES(apple_gfx_mmio_types) diff --git a/hw/display/apple-gfx.h b/hw/display/apple-gfx.h new file mode 100644 index 00000000000..39931fba65a --- /dev/null +++ b/hw/display/apple-gfx.h @@ -0,0 +1,58 @@ +#ifndef QEMU_APPLE_GFX_H +#define QEMU_APPLE_GFX_H + +#define TYPE_APPLE_GFX_MMIO "apple-gfx-mmio" +#define TYPE_APPLE_GFX_PCI "apple-gfx-pci" + +#include "qemu/osdep.h" +#include <dispatch/dispatch.h> +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> +#include "qemu/typedefs.h" +#include "exec/memory.h" +#include "ui/surface.h" + +@class PGDeviceDescriptor; +@protocol PGDevice; +@protocol PGDisplay; +@protocol MTLDevice; +@protocol MTLTexture; +@protocol MTLCommandQueue; + +typedef QTAILQ_HEAD(, PGTask_s) PGTaskList; + +struct AppleGFXMapMemoryJob; +typedef struct AppleGFXState { + MemoryRegion iomem_gfx; + id<PGDevice> pgdev; + id<PGDisplay> pgdisp; + PGTaskList tasks; + QemuConsole *con; + id<MTLDevice> mtl; + id<MTLCommandQueue> mtl_queue; + bool cursor_show; + QEMUCursor *cursor; + + /* For running PVG memory-mapping requests in the AIO context */ + QemuCond job_cond; + QemuMutex job_mutex; + + dispatch_queue_t render_queue; + /* The following fields should only be accessed from the BQL: */ + bool gfx_update_requested; + bool new_frame_ready; + bool using_managed_texture_storage; + int32_t pending_frames; + void *vram; + DisplaySurface *surface; + id<MTLTexture> texture; +} AppleGFXState; + +void apple_gfx_common_init(Object *obj, AppleGFXState *s, const char* obj_name); +void apple_gfx_common_realize(AppleGFXState *s, PGDeviceDescriptor *desc, + Error **errp); +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t guest_physical, + uint64_t length, bool read_only); +void apple_gfx_await_bh_job(AppleGFXState *s, bool *job_done_flag); + +#endif + diff --git a/hw/display/apple-gfx.m b/hw/display/apple-gfx.m new file mode 100644 index 00000000000..46be9957f69 --- /dev/null +++ b/hw/display/apple-gfx.m @@ -0,0 +1,713 @@ +/* + * QEMU Apple ParavirtualizedGraphics.framework device 
+ * + * Copyright © 2023 Amazon.com, Inc. or its affiliates. All Rights Reserved. + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + * + * ParavirtualizedGraphics.framework is a set of libraries that macOS provides + * which implements 3d graphics passthrough to the host as well as a + * proprietary guest communication channel to drive it. This device model + * implements support to drive that library from within QEMU. + */ + +#include "qemu/osdep.h" +#import <ParavirtualizedGraphics/ParavirtualizedGraphics.h> +#include <mach/mach_vm.h> +#include "apple-gfx.h" +#include "trace.h" +#include "qemu-main.h" +#include "exec/address-spaces.h" +#include "migration/blocker.h" +#include "monitor/monitor.h" +#include "qemu/main-loop.h" +#include "qemu/cutils.h" +#include "qemu/log.h" +#include "qapi/visitor.h" +#include "qapi/error.h" +#include "ui/console.h" + +static const PGDisplayCoord_t apple_gfx_modes[] = { + { .x = 1440, .y = 1080 }, + { .x = 1280, .y = 1024 }, +}; + +/* This implements a type defined in <ParavirtualizedGraphics/PGDevice.h> + * which is opaque from the framework's point of view. Typedef PGTask_t already + * exists in the framework headers. */ +struct PGTask_s { + QTAILQ_ENTRY(PGTask_s) node; + mach_vm_address_t address; + uint64_t len; +}; + +static Error *apple_gfx_mig_blocker; + +static void apple_gfx_render_frame_completed(AppleGFXState *s, + uint32_t width, uint32_t height); + +static inline dispatch_queue_t get_background_queue(void) +{ + return dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); +} + +static PGTask_t *apple_gfx_new_task(AppleGFXState *s, uint64_t len) +{ + mach_vm_address_t task_mem; + PGTask_t *task; + kern_return_t r; + + r = mach_vm_allocate(mach_task_self(), &task_mem, len, VM_FLAGS_ANYWHERE); + if (r != KERN_SUCCESS || task_mem == 0) { + return NULL; + } + + task = g_new0(PGTask_t, 1); + + task->address = task_mem; + task->len = len; + QTAILQ_INSERT_TAIL(&s->tasks, task, node); + + return task; +} + +typedef struct AppleGFXIOJob { + AppleGFXState *state; + uint64_t offset; + uint64_t value; + bool completed; +} AppleGFXIOJob; + +static void apple_gfx_do_read(void *opaque) +{ + AppleGFXIOJob *job = opaque; + job->value = [job->state->pgdev mmioReadAtOffset:job->offset]; + qatomic_set(&job->completed, true); + aio_wait_kick(); +} + +static uint64_t apple_gfx_read(void *opaque, hwaddr offset, unsigned size) +{ + AppleGFXIOJob job = { + .state = opaque, + .offset = offset, + .completed = false, + }; + AioContext *context = qemu_get_aio_context(); + dispatch_queue_t queue = get_background_queue(); + + dispatch_async_f(queue, &job, apple_gfx_do_read); + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); + + trace_apple_gfx_read(offset, job.value); + return job.value; +} + +static void apple_gfx_do_write(void *opaque) +{ + AppleGFXIOJob *job = opaque; + [job->state->pgdev mmioWriteAtOffset:job->offset value:job->value]; + qatomic_set(&job->completed, true); + aio_wait_kick(); +} + +static void apple_gfx_write(void *opaque, hwaddr offset, uint64_t val, + unsigned size) +{ + /* The methods mmioReadAtOffset: and especially mmioWriteAtOffset: can + * trigger and block on operations on other dispatch queues, which in turn + * may call back out on one or more of the callback blocks. For this reason, + * and as we are holding the BQL, we invoke the I/O methods on a pool + * thread and handle AIO tasks while we wait. 
Any work in the callbacks + * requiring the BQL will in turn schedule BHs which this thread will + * process while waiting. */ + AppleGFXIOJob job = { + .state = opaque, + .offset = offset, + .value = val, + .completed = false, + }; + AioContext *context = qemu_get_current_aio_context(); + dispatch_queue_t queue = get_background_queue(); + + dispatch_async_f(queue, &job, apple_gfx_do_write); + AIO_WAIT_WHILE(context, !qatomic_read(&job.completed)); + + trace_apple_gfx_write(offset, val); +} + +static const MemoryRegionOps apple_gfx_ops = { + .read = apple_gfx_read, + .write = apple_gfx_write, + .endianness = DEVICE_LITTLE_ENDIAN, + .valid = { + .min_access_size = 4, + .max_access_size = 8, + }, + .impl = { + .min_access_size = 4, + .max_access_size = 4, + }, +}; + +static void apple_gfx_render_new_frame_bql_unlock(AppleGFXState *s) +{ + BOOL r; + uint32_t width = surface_width(s->surface); + uint32_t height = surface_height(s->surface); + MTLRegion region = MTLRegionMake2D(0, 0, width, height); + id<MTLCommandBuffer> command_buffer = [s->mtl_queue commandBuffer]; + id<MTLTexture> texture = s->texture; + + assert(bql_locked()); + [texture retain]; + + bql_unlock(); + + /* This is not safe to call from the BQL due to PVG-internal locks causing + * deadlocks. */ + r = [s->pgdisp encodeCurrentFrameToCommandBuffer:command_buffer + texture:texture + region:region]; + if (!r) { + [texture release]; + bql_lock(); + --s->pending_frames; + bql_unlock(); + qemu_log_mask(LOG_GUEST_ERROR, "apple_gfx_render_new_frame_bql_unlock: " + "encodeCurrentFrameToCommandBuffer:texture:region: failed\n"); + return; + } + + if (s->using_managed_texture_storage) { + /* "Managed" textures exist in both VRAM and RAM and must be synced. */ + id<MTLBlitCommandEncoder> blit = [command_buffer blitCommandEncoder]; + [blit synchronizeResource:texture]; + [blit endEncoding]; + } + [texture release]; + [command_buffer addCompletedHandler: + ^(id<MTLCommandBuffer> cb) + { + dispatch_async(s->render_queue, ^{ + apple_gfx_render_frame_completed(s, width, height); + }); + }]; + [command_buffer commit]; +} + +static void copy_mtl_texture_to_surface_mem(id<MTLTexture> texture, void *vram) +{ + /* TODO: Skip this entirely on a pure Metal or headless/guest-only + * rendering path, else use a blit command encoder? Needs careful + * (double?) buffering design. */ + size_t width = texture.width, height = texture.height; + MTLRegion region = MTLRegionMake2D(0, 0, width, height); + [texture getBytes:vram + bytesPerRow:(width * 4) + bytesPerImage:(width * height * 4) + fromRegion:region + mipmapLevel:0 + slice:0]; +} + +static void apple_gfx_render_frame_completed(AppleGFXState *s, + uint32_t width, uint32_t height) +{ + bql_lock(); + --s->pending_frames; + assert(s->pending_frames >= 0); + + /* Only update display if mode hasn't changed since we started rendering. 
*/ + if (width == surface_width(s->surface) && + height == surface_height(s->surface)) { + copy_mtl_texture_to_surface_mem(s->texture, s->vram); + if (s->gfx_update_requested) { + s->gfx_update_requested = false; + dpy_gfx_update_full(s->con); + graphic_hw_update_done(s->con); + s->new_frame_ready = false; + } else { + s->new_frame_ready = true; + } + } + if (s->pending_frames > 0) { + apple_gfx_render_new_frame_bql_unlock(s); + } else { + bql_unlock(); + } +} + +static void apple_gfx_fb_update_display(void *opaque) +{ + AppleGFXState *s = opaque; + + assert(bql_locked()); + if (s->new_frame_ready) { + dpy_gfx_update_full(s->con); + s->new_frame_ready = false; + graphic_hw_update_done(s->con); + } else if (s->pending_frames > 0) { + s->gfx_update_requested = true; + } else { + graphic_hw_update_done(s->con); + } +} + +static const GraphicHwOps apple_gfx_fb_ops = { + .gfx_update = apple_gfx_fb_update_display, + .gfx_update_async = true, +}; + +static void update_cursor(AppleGFXState *s) +{ + assert(bql_locked()); + dpy_mouse_set(s->con, s->pgdisp.cursorPosition.x, + s->pgdisp.cursorPosition.y, s->cursor_show); +} + +static void set_mode(AppleGFXState *s, uint32_t width, uint32_t height) +{ + MTLTextureDescriptor *textureDescriptor; + + if (s->surface && + width == surface_width(s->surface) && + height == surface_height(s->surface)) { + return; + } + + g_free(s->vram); + [s->texture release]; + + s->vram = g_malloc0_n(width * height, 4); + s->surface = qemu_create_displaysurface_from(width, height, PIXMAN_LE_a8r8g8b8, + width * 4, s->vram); + + @autoreleasepool { + textureDescriptor = + [MTLTextureDescriptor + texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm + width:width + height:height + mipmapped:NO]; + textureDescriptor.usage = s->pgdisp.minimumTextureUsage; + s->texture = [s->mtl newTextureWithDescriptor:textureDescriptor]; + } + + s->using_managed_texture_storage = + (s->texture.storageMode == MTLStorageModeManaged); + dpy_gfx_replace_surface(s->con, s->surface); +} + +static void create_fb(AppleGFXState *s) +{ + s->con = graphic_console_init(NULL, 0, &apple_gfx_fb_ops, s); + set_mode(s, 1440, 1080); + + s->cursor_show = true; +} + +static size_t apple_gfx_get_default_mmio_range_size(void) +{ + size_t mmio_range_size; + @autoreleasepool { + PGDeviceDescriptor *desc = [PGDeviceDescriptor new]; + mmio_range_size = desc.mmioLength; + [desc release]; + } + return mmio_range_size; +} + +void apple_gfx_common_init(Object *obj, AppleGFXState *s, const char* obj_name) +{ + size_t mmio_range_size = apple_gfx_get_default_mmio_range_size(); + + trace_apple_gfx_common_init(obj_name, mmio_range_size); + memory_region_init_io(&s->iomem_gfx, obj, &apple_gfx_ops, s, obj_name, + mmio_range_size); + + /* TODO: PVG framework supports serialising device state: integrate it! 
*/ +} + +typedef struct AppleGFXMapMemoryJob { + AppleGFXState *state; + PGTask_t *task; + uint64_t virtual_offset; + PGPhysicalMemoryRange_t *ranges; + uint32_t range_count; + bool read_only; + bool success; + bool done; +} AppleGFXMapMemoryJob; + +uintptr_t apple_gfx_host_address_for_gpa_range(uint64_t guest_physical, + uint64_t length, bool read_only) +{ + MemoryRegion *ram_region; + uintptr_t host_address; + hwaddr ram_region_offset = 0; + hwaddr ram_region_length = length; + + ram_region = address_space_translate(&address_space_memory, + guest_physical, + &ram_region_offset, + &ram_region_length, !read_only, + MEMTXATTRS_UNSPECIFIED); + + if (!ram_region || ram_region_length < length || + !memory_access_is_direct(ram_region, !read_only)) { + return 0; + } + + host_address = (mach_vm_address_t)memory_region_get_ram_ptr(ram_region); + if (host_address == 0) { + return 0; + } + host_address += ram_region_offset; + + return host_address; +} + +static void apple_gfx_map_memory(void *opaque) +{ + AppleGFXMapMemoryJob *job = opaque; + AppleGFXState *s = job->state; + PGTask_t *task = job->task; + uint32_t range_count = job->range_count; + uint64_t virtual_offset = job->virtual_offset; + PGPhysicalMemoryRange_t *ranges = job->ranges; + bool read_only = job->read_only; + kern_return_t r; + mach_vm_address_t target, source; + vm_prot_t cur_protection, max_protection; + bool success = true; + + g_assert(bql_locked()); + + trace_apple_gfx_map_memory(task, range_count, virtual_offset, read_only); + for (int i = 0; i < range_count; i++) { + PGPhysicalMemoryRange_t *range = &ranges[i]; + + target = task->address + virtual_offset; + virtual_offset += range->physicalLength; + + trace_apple_gfx_map_memory_range(i, range->physicalAddress, + range->physicalLength); + + source = apple_gfx_host_address_for_gpa_range(range->physicalAddress, + range->physicalLength, + read_only); + if (source == 0) { + success = false; + continue; + } + + MemoryRegion* alt_mr = NULL; + mach_vm_address_t alt_source = (mach_vm_address_t)gpa2hva(&alt_mr, range->physicalAddress, range->physicalLength, NULL); + g_assert(alt_source == source); + + cur_protection = 0; + max_protection = 0; + // Map guest RAM at range->physicalAddress into PG task memory range + r = mach_vm_remap(mach_task_self(), + &target, range->physicalLength, vm_page_size - 1, + VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE, + mach_task_self(), + source, false /* shared mapping, no copy */, + &cur_protection, &max_protection, + VM_INHERIT_COPY); + trace_apple_gfx_remap(r, source, target); + g_assert(r == KERN_SUCCESS); + } + + qemu_mutex_lock(&s->job_mutex); + job->success = success; + job->done = true; + qemu_cond_broadcast(&s->job_cond); + qemu_mutex_unlock(&s->job_mutex); +} + +void apple_gfx_await_bh_job(AppleGFXState *s, bool *job_done_flag) +{ + qemu_mutex_lock(&s->job_mutex); + while (!*job_done_flag) { + qemu_cond_wait(&s->job_cond, &s->job_mutex); + } + qemu_mutex_unlock(&s->job_mutex); +} + +typedef struct AppleGFXReadMemoryJob { + AppleGFXState *s; + hwaddr physical_address; + uint64_t length; + void *dst; + bool done; +} AppleGFXReadMemoryJob; + +static void apple_gfx_do_read_memory(void *opaque) +{ + AppleGFXReadMemoryJob *job = opaque; + AppleGFXState *s = job->s; + + cpu_physical_memory_read(job->physical_address, job->dst, job->length); + + qemu_mutex_lock(&s->job_mutex); + job->done = true; + qemu_cond_broadcast(&s->job_cond); + qemu_mutex_unlock(&s->job_mutex); +} + +static void apple_gfx_read_memory(AppleGFXState *s, hwaddr physical_address, + uint64_t 
length, void *dst) +{ + AppleGFXReadMemoryJob job = { + s, physical_address, length, dst + }; + + trace_apple_gfx_read_memory(physical_address, length, dst); + + /* Traversing the memory map requires RCU/BQL, so do it in a BH. */ + aio_bh_schedule_oneshot(qemu_get_aio_context(), apple_gfx_do_read_memory, + &job); + apple_gfx_await_bh_job(s, &job.done); +} + +static void apple_gfx_register_task_mapping_handlers(AppleGFXState *s, + PGDeviceDescriptor *desc) +{ + desc.createTask = ^(uint64_t vmSize, void * _Nullable * _Nonnull baseAddress) { + PGTask_t *task = apple_gfx_new_task(s, vmSize); + *baseAddress = (void *)task->address; + trace_apple_gfx_create_task(vmSize, *baseAddress); + return task; + }; + + desc.destroyTask = ^(PGTask_t * _Nonnull task) { + trace_apple_gfx_destroy_task(task); + QTAILQ_REMOVE(&s->tasks, task, node); + mach_vm_deallocate(mach_task_self(), task->address, task->len); + g_free(task); + }; + + desc.mapMemory = ^bool(PGTask_t * _Nonnull task, uint32_t range_count, + uint64_t virtual_offset, bool read_only, + PGPhysicalMemoryRange_t * _Nonnull ranges) { + AppleGFXMapMemoryJob job = { + .state = s, + .task = task, .ranges = ranges, .range_count = range_count, + .read_only = read_only, .virtual_offset = virtual_offset, + .done = false, .success = true, + }; + if (range_count > 0) { + aio_bh_schedule_oneshot(qemu_get_aio_context(), + apple_gfx_map_memory, &job); + apple_gfx_await_bh_job(s, &job.done); + } + return job.success; + }; + + desc.unmapMemory = ^bool(PGTask_t * _Nonnull task, uint64_t virtualOffset, + uint64_t length) { + kern_return_t r; + mach_vm_address_t range_address; + + trace_apple_gfx_unmap_memory(task, virtualOffset, length); + + /* Replace task memory range with fresh pages, undoing the mapping + * from guest RAM. */ + range_address = task->address + virtualOffset; + r = mach_vm_allocate(mach_task_self(), &range_address, length, + VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE); + g_assert(r == KERN_SUCCESS); + + return true; + }; + + desc.readMemory = ^bool(uint64_t physical_address, uint64_t length, + void * _Nonnull dst) { + apple_gfx_read_memory(s, physical_address, length, dst); + return true; + }; +} + +static PGDisplayDescriptor *apple_gfx_prepare_display_descriptor(AppleGFXState *s) +{ + PGDisplayDescriptor *disp_desc = [PGDisplayDescriptor new]; + + disp_desc.name = @"QEMU display"; + disp_desc.sizeInMillimeters = NSMakeSize(400., 300.); /* A 20" display */ + disp_desc.queue = dispatch_get_main_queue(); + disp_desc.newFrameEventHandler = ^(void) { + trace_apple_gfx_new_frame(); + dispatch_async(s->render_queue, ^{ + /* Drop frames if we get too far ahead. 
*/ + bql_lock(); + if (s->pending_frames >= 2) { + bql_unlock(); + return; + } + ++s->pending_frames; + if (s->pending_frames > 1) { + bql_unlock(); + return; + } + @autoreleasepool { + apple_gfx_render_new_frame_bql_unlock(s); + } + }); + }; + disp_desc.modeChangeHandler = ^(PGDisplayCoord_t sizeInPixels, + OSType pixelFormat) { + trace_apple_gfx_mode_change(sizeInPixels.x, sizeInPixels.y); + + BQL_LOCK_GUARD(); + set_mode(s, sizeInPixels.x, sizeInPixels.y); + }; + disp_desc.cursorGlyphHandler = ^(NSBitmapImageRep *glyph, + PGDisplayCoord_t hotSpot) { + [glyph retain]; + dispatch_async(get_background_queue(), ^{ + BQL_LOCK_GUARD(); + uint32_t bpp = glyph.bitsPerPixel; + size_t width = glyph.pixelsWide; + size_t height = glyph.pixelsHigh; + size_t padding_bytes_per_row = glyph.bytesPerRow - width * 4; + const uint8_t* px_data = glyph.bitmapData; + + trace_apple_gfx_cursor_set(bpp, width, height); + + if (s->cursor) { + cursor_unref(s->cursor); + s->cursor = NULL; + } + + if (bpp == 32) { /* Shouldn't be anything else, but just to be safe...*/ + s->cursor = cursor_alloc(width, height); + s->cursor->hot_x = hotSpot.x; + s->cursor->hot_y = hotSpot.y; + + uint32_t *dest_px = s->cursor->data; + + for (size_t y = 0; y < height; ++y) { + for (size_t x = 0; x < width; ++x) { + /* NSBitmapImageRep's red & blue channels are swapped + * compared to QEMUCursor's. */ + *dest_px = + (px_data[0] << 16u) | + (px_data[1] << 8u) | + (px_data[2] << 0u) | + (px_data[3] << 24u); + ++dest_px; + px_data += 4; + } + px_data += padding_bytes_per_row; + } + dpy_cursor_define(s->con, s->cursor); + update_cursor(s); + } + [glyph release]; + }); + }; + disp_desc.cursorShowHandler = ^(BOOL show) { + dispatch_async(get_background_queue(), ^{ + BQL_LOCK_GUARD(); + trace_apple_gfx_cursor_show(show); + s->cursor_show = show; + update_cursor(s); + }); + }; + disp_desc.cursorMoveHandler = ^(void) { + dispatch_async(get_background_queue(), ^{ + BQL_LOCK_GUARD(); + trace_apple_gfx_cursor_move(); + update_cursor(s); + }); + }; + + return disp_desc; +} + +static NSArray<PGDisplayMode*>* apple_gfx_prepare_display_mode_array(void) +{ + PGDisplayMode *modes[ARRAY_SIZE(apple_gfx_modes)]; + NSArray<PGDisplayMode*>* mode_array = nil; + int i; + + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { + modes[i] = + [[PGDisplayMode alloc] initWithSizeInPixels:apple_gfx_modes[i] refreshRateInHz:60.]; + } + + mode_array = [NSArray arrayWithObjects:modes count:ARRAY_SIZE(apple_gfx_modes)]; + + for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) { + [modes[i] release]; + modes[i] = nil; + } + + return mode_array; +} + +static id<MTLDevice> copy_suitable_metal_device(void) +{ + id<MTLDevice> dev = nil; + NSArray<id<MTLDevice>> *devs = MTLCopyAllDevices(); + + /* Prefer a unified memory GPU. Failing that, pick a non-removable GPU. 
*/ + for (size_t i = 0; i < devs.count; ++i) { + if (devs[i].hasUnifiedMemory) { + dev = devs[i]; + break; + } + if (!devs[i].removable) { + dev = devs[i]; + } + } + + if (dev != nil) { + [dev retain]; + } else { + dev = MTLCreateSystemDefaultDevice(); + } + [devs release]; + + return dev; +} + +void apple_gfx_common_realize(AppleGFXState *s, PGDeviceDescriptor *desc, + Error **errp) +{ + PGDisplayDescriptor *disp_desc = nil; + + if (apple_gfx_mig_blocker == NULL) { + error_setg(&apple_gfx_mig_blocker, + "Migration state blocked by apple-gfx display device"); + if (migrate_add_blocker(&apple_gfx_mig_blocker, errp) < 0) { + return; + } + } + + QTAILQ_INIT(&s->tasks); + s->render_queue = dispatch_queue_create("apple-gfx.render", + DISPATCH_QUEUE_SERIAL); + s->mtl = copy_suitable_metal_device(); + s->mtl_queue = [s->mtl newCommandQueue]; + + desc.device = s->mtl; + + apple_gfx_register_task_mapping_handlers(s, desc); + + s->pgdev = PGNewDeviceWithDescriptor(desc); + + disp_desc = apple_gfx_prepare_display_descriptor(s); + s->pgdisp = [s->pgdev newDisplayWithDescriptor:disp_desc + port:0 serialNum:1234]; + [disp_desc release]; + s->pgdisp.modeList = apple_gfx_prepare_display_mode_array(); + + create_fb(s); + + qemu_mutex_init(&s->job_mutex); + qemu_cond_init(&s->job_cond); +} diff --git a/hw/display/meson.build b/hw/display/meson.build index 20a94973fa2..619e642905a 100644 --- a/hw/display/meson.build +++ b/hw/display/meson.build @@ -61,6 +61,10 @@ system_ss.add(when: 'CONFIG_ARTIST', if_true: files('artist.c')) system_ss.add(when: 'CONFIG_ATI_VGA', if_true: [files('ati.c', 'ati_2d.c', 'ati_dbg.c'), pixman]) +system_ss.add(when: 'CONFIG_MAC_PVG', if_true: [files('apple-gfx.m'), pvg, metal]) +if cpu == 'aarch64' + system_ss.add(when: 'CONFIG_MAC_PVG_MMIO', if_true: [files('apple-gfx-mmio.m'), pvg, metal]) +endif if config_all_devices.has_key('CONFIG_VIRTIO_GPU') virtio_gpu_ss = ss.source_set() diff --git a/hw/display/trace-events b/hw/display/trace-events index 781f8a33203..214998312b9 100644 --- a/hw/display/trace-events +++ b/hw/display/trace-events @@ -191,3 +191,29 @@ dm163_bits_ppi(unsigned dest_width) "dest_width : %u" dm163_leds(int led, uint32_t value) "led %d: 0x%x" dm163_channels(int channel, uint8_t value) "channel %d: 0x%x" dm163_refresh_rate(uint32_t rr) "refresh rate %d" + +# apple-gfx.m +apple_gfx_read(uint64_t offset, uint64_t res) "offset=0x%"PRIx64" res=0x%"PRIx64 +apple_gfx_write(uint64_t offset, uint64_t val) "offset=0x%"PRIx64" val=0x%"PRIx64 +apple_gfx_create_task(uint32_t vm_size, void *va) "vm_size=0x%x base_addr=%p" +apple_gfx_destroy_task(void *task) "task=%p" +apple_gfx_map_memory(void *task, uint32_t range_count, uint64_t virtual_offset, uint32_t read_only) "task=%p range_count=0x%x virtual_offset=0x%"PRIx64" read_only=%d" +apple_gfx_map_memory_range(uint32_t i, uint64_t phys_addr, uint64_t phys_len) "[%d] phys_addr=0x%"PRIx64" phys_len=0x%"PRIx64 +apple_gfx_remap(uint64_t retval, uint64_t source, uint64_t target) "retval=%"PRId64" source=0x%"PRIx64" target=0x%"PRIx64 +apple_gfx_unmap_memory(void *task, uint64_t virtual_offset, uint64_t length) "task=%p virtual_offset=0x%"PRIx64" length=0x%"PRIx64 +apple_gfx_read_memory(uint64_t phys_address, uint64_t length, void *dst) "phys_addr=0x%"PRIx64" length=0x%"PRIx64" dest=%p" +apple_gfx_raise_irq(uint32_t vector) "vector=0x%x" +apple_gfx_new_frame(void) "" +apple_gfx_mode_change(uint64_t x, uint64_t y) "x=%"PRId64" y=%"PRId64 +apple_gfx_cursor_set(uint32_t bpp, uint64_t width, uint64_t height) "bpp=%d width=%"PRId64" 
height=0x%"PRId64 +apple_gfx_cursor_show(uint32_t show) "show=%d" +apple_gfx_cursor_move(void) "" +apple_gfx_common_init(const char *device_name, size_t mmio_size) "device: %s; MMIO size: %zu bytes" + +# apple-gfx-mmio.m +apple_gfx_mmio_iosfc_read(uint64_t offset, uint64_t res) "offset=0x%"PRIx64" res=0x%"PRIx64 +apple_gfx_mmio_iosfc_write(uint64_t offset, uint64_t val) "offset=0x%"PRIx64" val=0x%"PRIx64 +apple_gfx_iosfc_map_memory(uint64_t phys, uint64_t len, uint32_t ro, void *va, void *e, void *f, void* va_result, int success) "phys=0x%"PRIx64" len=0x%"PRIx64" ro=%d va=%p e=%p f=%p -> *va=%p, success = %d" +apple_gfx_iosfc_unmap_memory(void *a, void *b, void *c, void *d, void *e, void *f) "a=%p b=%p c=%p d=%p e=%p f=%p" +apple_gfx_iosfc_raise_irq(uint32_t vector) "vector=0x%x" + diff --git a/meson.build b/meson.build index d26690ce204..0e124eff13f 100644 --- a/meson.build +++ b/meson.build @@ -761,6 +761,8 @@ socket = [] version_res = [] coref = [] iokit = [] +pvg = [] +metal = [] emulator_link_args = [] midl = not_found widl = not_found @@ -782,6 +784,8 @@ elif host_os == 'darwin' coref = dependency('appleframeworks', modules: 'CoreFoundation') iokit = dependency('appleframeworks', modules: 'IOKit', required: false) host_dsosuf = '.dylib' + pvg = dependency('appleframeworks', modules: 'ParavirtualizedGraphics') + metal = dependency('appleframeworks', modules: 'Metal') elif host_os == 'sunos' socket = [cc.find_library('socket'), cc.find_library('nsl'),