From patchwork Tue Sep 22 12:16:52 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Pavel Dovgalyuk X-Patchwork-Id: 1369043 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=nongnu.org (client-ip=209.51.188.17; helo=lists.gnu.org; envelope-from=qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=ispras.ru Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4BwgSV0gh3z9sTH for ; Tue, 22 Sep 2020 22:23:06 +1000 (AEST) Received: from localhost ([::1]:45278 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1kKhK4-0007F2-30 for incoming@patchwork.ozlabs.org; Tue, 22 Sep 2020 08:23:04 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:38846) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kKhEF-0001UO-Ns for qemu-devel@nongnu.org; Tue, 22 Sep 2020 08:17:03 -0400 Received: from mail.ispras.ru ([83.149.199.84]:47730) by eggs.gnu.org with esmtps (TLS1.2:DHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kKhE9-0006V8-5X for qemu-devel@nongnu.org; Tue, 22 Sep 2020 08:17:03 -0400 Received: from [127.0.1.1] (unknown [62.118.151.149]) by mail.ispras.ru (Postfix) with ESMTPSA id 34AD841B1032; Tue, 22 Sep 2020 12:16:53 +0000 (UTC) Subject: [PATCH v5 13/15] docs: convert replay.txt to rst From: Pavel Dovgalyuk To: qemu-devel@nongnu.org Date: Tue, 22 Sep 2020 15:16:52 +0300 Message-ID: <160077701288.10249.16846150592069982759.stgit@pasha-ThinkPad-X280> In-Reply-To: <160077693745.10249.9707329107813662236.stgit@pasha-ThinkPad-X280> References: <160077693745.10249.9707329107813662236.stgit@pasha-ThinkPad-X280> User-Agent: StGit/0.17.1-dirty MIME-Version: 1.0 Received-SPF: pass client-ip=83.149.199.84; envelope-from=pavel.dovgalyuk@ispras.ru; helo=mail.ispras.ru X-detected-operating-system: by eggs.gnu.org: First seen = 2020/09/22 08:15:38 X-ACL-Warn: Detected OS = Linux 3.11 and newer [fuzzy] X-Spam_score_int: -18 X-Spam_score: -1.9 X-Spam_bar: - X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: kwolf@redhat.com, wrampazz@redhat.com, pavel.dovgalyuk@ispras.ru, ehabkost@redhat.com, alex.bennee@linaro.org, mtosatti@redhat.com, armbru@redhat.com, mreitz@redhat.com, stefanha@redhat.com, crosa@redhat.com, pbonzini@redhat.com, philmd@redhat.com, zhiwei_liu@c-sky.com, rth@twiddle.net Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: "Qemu-devel" This patch converts record/replay documentation into rst format. Signed-off-by: Pavel Dovgalyuk --- docs/replay.txt | 410 ------------------------------------------------ docs/system/index.rst | 1 docs/system/replay.rst | 410 ++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 411 insertions(+), 410 deletions(-) delete mode 100644 docs/replay.txt create mode 100644 docs/system/replay.rst diff --git a/docs/replay.txt b/docs/replay.txt deleted file mode 100644 index 39fe5e9740..0000000000 --- a/docs/replay.txt +++ /dev/null @@ -1,410 +0,0 @@ -Copyright (c) 2010-2015 Institute for System Programming - of the Russian Academy of Sciences. - -This work is licensed under the terms of the GNU GPL, version 2 or later. -See the COPYING file in the top-level directory. - -Record/replay -------------- - -Record/replay functions are used for the deterministic replay of qemu execution. -Execution recording writes a non-deterministic events log, which can be later -used for replaying the execution anywhere and for unlimited number of times. -It also supports checkpointing for faster rewind to the specific replay moment. -Execution replaying reads the log and replays all non-deterministic events -including external input, hardware clocks, and interrupts. - -Deterministic replay has the following features: - * Deterministically replays whole system execution and all contents of - the memory, state of the hardware devices, clocks, and screen of the VM. - * Writes execution log into the file for later replaying for multiple times - on different machines. - * Supports i386, x86_64, and Arm hardware platforms. - * Performs deterministic replay of all operations with keyboard and mouse - input devices. - -Usage of the record/replay: - * First, record the execution with the following command line: - qemu-system-i386 \ - -icount shift=7,rr=record,rrfile=replay.bin \ - -drive file=disk.qcow2,if=none,snapshot,id=img-direct \ - -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \ - -device ide-hd,drive=img-blkreplay \ - -netdev user,id=net1 -device rtl8139,netdev=net1 \ - -object filter-replay,id=replay,netdev=net1 - * After recording, you can replay it by using another command line: - qemu-system-i386 \ - -icount shift=7,rr=replay,rrfile=replay.bin \ - -drive file=disk.qcow2,if=none,snapshot,id=img-direct \ - -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \ - -device ide-hd,drive=img-blkreplay \ - -netdev user,id=net1 -device rtl8139,netdev=net1 \ - -object filter-replay,id=replay,netdev=net1 - The only difference with recording is changing the rr option - from record to replay. - * Block device images are not actually changed in the recording mode, - because all of the changes are written to the temporary overlay file. - This behavior is enabled by using blkreplay driver. It should be used - for every enabled block device, as described in 'Block devices' section. - * '-net none' option should be specified when network is not used, - because QEMU adds network card by default. When network is needed, - it should be configured explicitly with replay filter, as described - in 'Network devices' section. - * Interaction with audio devices and serial ports are recorded and replayed - automatically when such devices are enabled. - -Academic papers with description of deterministic replay implementation: -http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html -http://dl.acm.org/citation.cfm?id=2786805.2803179 - -Modifications of qemu include: - * wrappers for clock and time functions to save their return values in the log - * saving different asynchronous events (e.g. system shutdown) into the log - * synchronization of the bottom halves execution - * synchronization of the threads from thread pool - * recording/replaying user input (mouse, keyboard, and microphone) - * adding internal checkpoints for cpu and io synchronization - * network filter for recording and replaying the packets - * block driver for making block layer deterministic - * serial port input record and replay - * recording of random numbers obtained from the external sources - -Locking and thread synchronisation ----------------------------------- - -Previously the synchronisation of the main thread and the vCPU thread -was ensured by the holding of the BQL. However the trend has been to -reduce the time the BQL was held across the system including under TCG -system emulation. As it is important that batches of events are kept -in sequence (e.g. expiring timers and checkpoints in the main thread -while instruction checkpoints are written by the vCPU thread) we need -another lock to keep things in lock-step. This role is now handled by -the replay_mutex_lock. It used to be held only for each event being -written but now it is held for a whole execution period. This results -in a deterministic ping-pong between the two main threads. - -As the BQL is now a finer grained lock than the replay_lock it is almost -certainly a bug, and a source of deadlocks, to take the -replay_mutex_lock while the BQL is held. This is enforced by an assert. -While the unlocks are usually in the reverse order, this is not -necessary; you can drop the replay_lock while holding the BQL, without -doing a more complicated unlock_iothread/replay_unlock/lock_iothread -sequence. - -Non-deterministic events ------------------------- - -Our record/replay system is based on saving and replaying non-deterministic -events (e.g. keyboard input) and simulating deterministic ones (e.g. reading -from HDD or memory of the VM). Saving only non-deterministic events makes -log file smaller and simulation faster. - -The following non-deterministic data from peripheral devices is saved into -the log: mouse and keyboard input, network packets, audio controller input, -serial port input, and hardware clocks (they are non-deterministic -too, because their values are taken from the host machine). Inputs from -simulated hardware, memory of VM, software interrupts, and execution of -instructions are not saved into the log, because they are deterministic and -can be replayed by simulating the behavior of virtual machine starting from -initial state. - -We had to solve three tasks to implement deterministic replay: recording -non-deterministic events, replaying non-deterministic events, and checking -that there is no divergence between record and replay modes. - -We changed several parts of QEMU to make event log recording and replaying. -Devices' models that have non-deterministic input from external devices were -changed to write every external event into the execution log immediately. -E.g. network packets are written into the log when they arrive into the virtual -network adapter. - -All non-deterministic events are coming from these devices. But to -replay them we need to know at which moments they occur. We specify -these moments by counting the number of instructions executed between -every pair of consecutive events. - -Instruction counting --------------------- - -QEMU should work in icount mode to use record/replay feature. icount was -designed to allow deterministic execution in absence of external inputs -of the virtual machine. We also use icount to control the occurrence of the -non-deterministic events. The number of instructions elapsed from the last event -is written to the log while recording the execution. In replay mode we -can predict when to inject that event using the instruction counter. - -Timers ------- - -Timers are used to execute callbacks from different subsystems of QEMU -at the specified moments of time. There are several kinds of timers: - * Real time clock. Based on host time and used only for callbacks that - do not change the virtual machine state. For this reason real time - clock and timers does not affect deterministic replay at all. - * Virtual clock. These timers run only during the emulation. In icount - mode virtual clock value is calculated using executed instructions counter. - That is why it is completely deterministic and does not have to be recorded. - * Host clock. This clock is used by device models that simulate real time - sources (e.g. real time clock chip). Host clock is the one of the sources - of non-determinism. Host clock read operations should be logged to - make the execution deterministic. - * Virtual real time clock. This clock is similar to real time clock but - it is used only for increasing virtual clock while virtual machine is - sleeping. Due to its nature it is also non-deterministic as the host clock - and has to be logged too. - -Checkpoints ------------ - -Replaying of the execution of virtual machine is bound by sources of -non-determinism. These are inputs from clock and peripheral devices, -and QEMU thread scheduling. Thread scheduling affect on processing events -from timers, asynchronous input-output, and bottom halves. - -Invocations of timers are coupled with clock reads and changing the state -of the virtual machine. Reads produce non-deterministic data taken from -host clock. And VM state changes should preserve their order. Their relative -order in replay mode must replicate the order of callbacks in record mode. -To preserve this order we use checkpoints. When a specific clock is processed -in record mode we save to the log special "checkpoint" event. -Checkpoints here do not refer to virtual machine snapshots. They are just -record/replay events used for synchronization. - -QEMU in replay mode will try to invoke timers processing in random moment -of time. That's why we do not process a group of timers until the checkpoint -event will be read from the log. Such an event allows synchronizing CPU -execution and timer events. - -Two other checkpoints govern the "warping" of the virtual clock. -While the virtual machine is idle, the virtual clock increments at -1 ns per *real time* nanosecond. This is done by setting up a timer -(called the warp timer) on the virtual real time clock, so that the -timer fires at the next deadline of the virtual clock; the virtual clock -is then incremented (which is called "warping" the virtual clock) as -soon as the timer fires or the CPUs need to go out of the idle state. -Two functions are used for this purpose; because these actions change -virtual machine state and must be deterministic, each of them creates a -checkpoint. qemu_start_warp_timer checks if the CPUs are idle and if so -starts accounting real time to virtual clock. qemu_account_warp_timer -is called when the CPUs get an interrupt or when the warp timer fires, -and it warps the virtual clock by the amount of real time that has passed -since qemu_start_warp_timer. - -Bottom halves -------------- - -Disk I/O events are completely deterministic in our model, because -in both record and replay modes we start virtual machine from the same -disk state. But callbacks that virtual disk controller uses for reading and -writing the disk may occur at different moments of time in record and replay -modes. - -Reading and writing requests are created by CPU thread of QEMU. Later these -requests proceed to block layer which creates "bottom halves". Bottom -halves consist of callback and its parameters. They are processed when -main loop locks the global mutex. These locks are not synchronized with -replaying process because main loop also processes the events that do not -affect the virtual machine state (like user interaction with monitor). - -That is why we had to implement saving and replaying bottom halves callbacks -synchronously to the CPU execution. When the callback is about to execute -it is added to the queue in the replay module. This queue is written to the -log when its callbacks are executed. In replay mode callbacks are not processed -until the corresponding event is read from the events log file. - -Sometimes the block layer uses asynchronous callbacks for its internal purposes -(like reading or writing VM snapshots or disk image cluster tables). In this -case bottom halves are not marked as "replayable" and do not saved -into the log. - -Block devices -------------- - -Block devices record/replay module intercepts calls of -bdrv coroutine functions at the top of block drivers stack. -To record and replay block operations the drive must be configured -as following: - -drive file=disk.qcow2,if=none,snapshot,id=img-direct - -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay - -device ide-hd,drive=img-blkreplay - -blkreplay driver should be inserted between disk image and virtual driver -controller. Therefore all disk requests may be recorded and replayed. - -All block completion operations are added to the queue in the coroutines. -Queue is flushed at checkpoints and information about processed requests -is recorded to the log. In replay phase the queue is matched with -events read from the log. Therefore block devices requests are processed -deterministically. - -Snapshotting ------------- - -New VM snapshots may be created in replay mode. They can be used later -to recover the desired VM state. All VM states created in replay mode -are associated with the moment of time in the replay scenario. -After recovering the VM state replay will start from that position. - -Default starting snapshot name may be specified with icount field -rrsnapshot as follows: - -icount shift=7,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name - -This snapshot is created at start of recording and restored at start -of replaying. It also can be loaded while replaying to roll back -the execution. - -'snapshot' flag of the disk image must be removed to save the snapshots -in the overlay (or original image) instead of using the temporary overlay. - -drive file=disk.ovl,if=none,id=img-direct - -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay - -device ide-hd,drive=img-blkreplay - -Use QEMU monitor to create additional snapshots. 'savevm ' command -created the snapshot and 'loadvm ' restores it. To prevent corruption -of the original disk image, use overlay files linked to the original images. -Therefore all new snapshots (including the starting one) will be saved in -overlays and the original image remains unchanged. - -When you need to use snapshots with diskless virtual machine, -it must be started with 'orphan' qcow2 image. This image will be used -for storing VM snapshots. Here is the example of the command line for this: - - qemu-system-i386 -icount shift=3,rr=replay,rrfile=record.bin,rrsnapshot=init \ - -net none -drive file=empty.qcow2,if=none,id=rr - -empty.qcow2 drive does not connected to any virtual block device and used -for VM snapshots only. - -Network devices ---------------- - -Record and replay for network interactions is performed with the network filter. -Each backend must have its own instance of the replay filter as follows: - -netdev user,id=net1 -device rtl8139,netdev=net1 - -object filter-replay,id=replay,netdev=net1 - -Replay network filter is used to record and replay network packets. While -recording the virtual machine this filter puts all packets coming from -the outer world into the log. In replay mode packets from the log are -injected into the network device. All interactions with network backend -in replay mode are disabled. - -Audio devices -------------- - -Audio data is recorded and replay automatically. The command line for recording -and replaying must contain identical specifications of audio hardware, e.g.: - -soundhw ac97 - -Serial ports ------------- - -Serial ports input is recorded and replay automatically. The command lines -for recording and replaying must contain identical number of ports in record -and replay modes, but their backends may differ. -E.g., '-serial stdio' in record mode, and '-serial null' in replay mode. - -Reverse debugging ------------------ - -Reverse debugging allows "executing" the program in reverse direction. -GDB remote protocol supports "reverse step" and "reverse continue" -commands. The first one steps single instruction backwards in time, -and the second one finds the last breakpoint in the past. - -Recorded executions may be used to enable reverse debugging. QEMU can't -execute the code in backwards direction, but can load a snapshot and -replay forward to find the desired position or breakpoint. - -The following GDB commands are supported: - - reverse-stepi (or rsi) - step one instruction backwards - - reverse-continue (or rc) - find last breakpoint in the past - -Reverse step loads the nearest snapshot and replays the execution until -the required instruction is met. - -Reverse continue may include several passes of examining the execution -between the snapshots. Each of the passes include the following steps: - 1. loading the snapshot - 2. replaying to examine the breakpoints - 3. if breakpoint or watchpoint was met - - loading the snaphot again - - replaying to the required breakpoint - 4. else - - proceeding to the p.1 with the earlier snapshot - -Therefore usage of the reverse debugging requires at least one snapshot -created in advance. This can be done by omitting 'snapshot' option -for the block drives and adding 'rrsnapshot' for both record and replay -command lines. -See the "Snapshotting" section to learn more about running record/replay -and creating the snapshot in these modes. - -Replay log format ------------------ - -Record/replay log consists of the header and the sequence of execution -events. The header includes 4-byte replay version id and 8-byte reserved -field. Version is updated every time replay log format changes to prevent -using replay log created by another build of qemu. - -The sequence of the events describes virtual machine state changes. -It includes all non-deterministic inputs of VM, synchronization marks and -instruction counts used to correctly inject inputs at replay. - -Synchronization marks (checkpoints) are used for synchronizing qemu threads -that perform operations with virtual hardware. These operations may change -system's state (e.g., change some register or generate interrupt) and -therefore should execute synchronously with CPU thread. - -Every event in the log includes 1-byte event id and optional arguments. -When argument is an array, it is stored as 4-byte array length -and corresponding number of bytes with data. -Here is the list of events that are written into the log: - - - EVENT_INSTRUCTION. Instructions executed since last event. - Argument: 4-byte number of executed instructions. - - EVENT_INTERRUPT. Used to synchronize interrupt processing. - - EVENT_EXCEPTION. Used to synchronize exception handling. - - EVENT_ASYNC. This is a group of events. They are always processed - together with checkpoints. When such an event is generated, it is - stored in the queue and processed only when checkpoint occurs. - Every such event is followed by 1-byte checkpoint id and 1-byte - async event id from the following list: - - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes - callbacks that affect virtual machine state, but normally called - asynchronously. - Argument: 8-byte operation id. - - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains - parameters of keyboard and mouse input operations - (key press/release, mouse pointer movement). - Arguments: 9-16 bytes depending of input event. - - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event. - - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input - initiated by the sender. - Arguments: 1-byte character device id. - Array with bytes were read. - - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize - operations with disk and flash drives with CPU. - Argument: 8-byte operation id. - - REPLAY_ASYNC_EVENT_NET. Incoming network packet. - Arguments: 1-byte network adapter id. - 4-byte packet flags. - Array with packet bytes. - - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu, - e.g., by closing the window. - - EVENT_CHAR_WRITE. Used to synchronize character output operations. - Arguments: 4-byte output function return value. - 4-byte offset in the output array. - - EVENT_CHAR_READ_ALL. Used to synchronize character input operations, - initiated by qemu. - Argument: Array with bytes that were read. - - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation, - initiated by qemu. - Argument: 4-byte error code. - - EVENT_CLOCK + clock_id. Group of events for host clock read operations. - Argument: 8-byte clock value. - - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of - CPU, internal threads, and asynchronous input events. May be followed - by one or more EVENT_ASYNC events. - - EVENT_END. Last event in the log. diff --git a/docs/system/index.rst b/docs/system/index.rst index c0f685b818..39fe8177f5 100644 --- a/docs/system/index.rst +++ b/docs/system/index.rst @@ -27,6 +27,7 @@ Contents: vnc-security tls gdb + replay managed-startup targets security diff --git a/docs/system/replay.rst b/docs/system/replay.rst new file mode 100644 index 0000000000..d6395ab72a --- /dev/null +++ b/docs/system/replay.rst @@ -0,0 +1,410 @@ +Copyright (c) 2010-2015 Institute for System Programming + of the Russian Academy of Sciences. + +This work is licensed under the terms of the GNU GPL, version 2 or later. +See the COPYING file in the top-level directory. + +Record/replay +============= + +Record/replay functions are used for the deterministic replay of qemu execution. +Execution recording writes a non-deterministic events log, which can be later +used for replaying the execution anywhere and for unlimited number of times. +It also supports checkpointing for faster rewind to the specific replay moment. +Execution replaying reads the log and replays all non-deterministic events +including external input, hardware clocks, and interrupts. + +Deterministic replay has the following features: + * Deterministically replays whole system execution and all contents of + the memory, state of the hardware devices, clocks, and screen of the VM. + * Writes execution log into the file for later replaying for multiple times + on different machines. + * Supports i386, x86_64, and Arm hardware platforms. + * Performs deterministic replay of all operations with keyboard and mouse + input devices. + +Usage of the record/replay: + * First, record the execution with the following command line: + qemu-system-i386 \ + -icount shift=7,rr=record,rrfile=replay.bin \ + -drive file=disk.qcow2,if=none,snapshot,id=img-direct \ + -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \ + -device ide-hd,drive=img-blkreplay \ + -netdev user,id=net1 -device rtl8139,netdev=net1 \ + -object filter-replay,id=replay,netdev=net1 + * After recording, you can replay it by using another command line: + qemu-system-i386 \ + -icount shift=7,rr=replay,rrfile=replay.bin \ + -drive file=disk.qcow2,if=none,snapshot,id=img-direct \ + -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \ + -device ide-hd,drive=img-blkreplay \ + -netdev user,id=net1 -device rtl8139,netdev=net1 \ + -object filter-replay,id=replay,netdev=net1 + The only difference with recording is changing the rr option + from record to replay. + * Block device images are not actually changed in the recording mode, + because all of the changes are written to the temporary overlay file. + This behavior is enabled by using blkreplay driver. It should be used + for every enabled block device, as described in 'Block devices' section. + * '-net none' option should be specified when network is not used, + because QEMU adds network card by default. When network is needed, + it should be configured explicitly with replay filter, as described + in 'Network devices' section. + * Interaction with audio devices and serial ports are recorded and replayed + automatically when such devices are enabled. + +Academic papers with description of deterministic replay implementation: +http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html +http://dl.acm.org/citation.cfm?id=2786805.2803179 + +Modifications of qemu include: + * wrappers for clock and time functions to save their return values in the log + * saving different asynchronous events (e.g. system shutdown) into the log + * synchronization of the bottom halves execution + * synchronization of the threads from thread pool + * recording/replaying user input (mouse, keyboard, and microphone) + * adding internal checkpoints for cpu and io synchronization + * network filter for recording and replaying the packets + * block driver for making block layer deterministic + * serial port input record and replay + * recording of random numbers obtained from the external sources + +Locking and thread synchronisation +---------------------------------- + +Previously the synchronisation of the main thread and the vCPU thread +was ensured by the holding of the BQL. However the trend has been to +reduce the time the BQL was held across the system including under TCG +system emulation. As it is important that batches of events are kept +in sequence (e.g. expiring timers and checkpoints in the main thread +while instruction checkpoints are written by the vCPU thread) we need +another lock to keep things in lock-step. This role is now handled by +the replay_mutex_lock. It used to be held only for each event being +written but now it is held for a whole execution period. This results +in a deterministic ping-pong between the two main threads. + +As the BQL is now a finer grained lock than the replay_lock it is almost +certainly a bug, and a source of deadlocks, to take the +replay_mutex_lock while the BQL is held. This is enforced by an assert. +While the unlocks are usually in the reverse order, this is not +necessary; you can drop the replay_lock while holding the BQL, without +doing a more complicated unlock_iothread/replay_unlock/lock_iothread +sequence. + +Non-deterministic events +------------------------ + +Our record/replay system is based on saving and replaying non-deterministic +events (e.g. keyboard input) and simulating deterministic ones (e.g. reading +from HDD or memory of the VM). Saving only non-deterministic events makes +log file smaller and simulation faster. + +The following non-deterministic data from peripheral devices is saved into +the log: mouse and keyboard input, network packets, audio controller input, +serial port input, and hardware clocks (they are non-deterministic +too, because their values are taken from the host machine). Inputs from +simulated hardware, memory of VM, software interrupts, and execution of +instructions are not saved into the log, because they are deterministic and +can be replayed by simulating the behavior of virtual machine starting from +initial state. + +We had to solve three tasks to implement deterministic replay: recording +non-deterministic events, replaying non-deterministic events, and checking +that there is no divergence between record and replay modes. + +We changed several parts of QEMU to make event log recording and replaying. +Devices' models that have non-deterministic input from external devices were +changed to write every external event into the execution log immediately. +E.g. network packets are written into the log when they arrive into the virtual +network adapter. + +All non-deterministic events are coming from these devices. But to +replay them we need to know at which moments they occur. We specify +these moments by counting the number of instructions executed between +every pair of consecutive events. + +Instruction counting +-------------------- + +QEMU should work in icount mode to use record/replay feature. icount was +designed to allow deterministic execution in absence of external inputs +of the virtual machine. We also use icount to control the occurrence of the +non-deterministic events. The number of instructions elapsed from the last event +is written to the log while recording the execution. In replay mode we +can predict when to inject that event using the instruction counter. + +Timers +------ + +Timers are used to execute callbacks from different subsystems of QEMU +at the specified moments of time. There are several kinds of timers: + * Real time clock. Based on host time and used only for callbacks that + do not change the virtual machine state. For this reason real time + clock and timers does not affect deterministic replay at all. + * Virtual clock. These timers run only during the emulation. In icount + mode virtual clock value is calculated using executed instructions counter. + That is why it is completely deterministic and does not have to be recorded. + * Host clock. This clock is used by device models that simulate real time + sources (e.g. real time clock chip). Host clock is the one of the sources + of non-determinism. Host clock read operations should be logged to + make the execution deterministic. + * Virtual real time clock. This clock is similar to real time clock but + it is used only for increasing virtual clock while virtual machine is + sleeping. Due to its nature it is also non-deterministic as the host clock + and has to be logged too. + +Checkpoints +----------- + +Replaying of the execution of virtual machine is bound by sources of +non-determinism. These are inputs from clock and peripheral devices, +and QEMU thread scheduling. Thread scheduling affect on processing events +from timers, asynchronous input-output, and bottom halves. + +Invocations of timers are coupled with clock reads and changing the state +of the virtual machine. Reads produce non-deterministic data taken from +host clock. And VM state changes should preserve their order. Their relative +order in replay mode must replicate the order of callbacks in record mode. +To preserve this order we use checkpoints. When a specific clock is processed +in record mode we save to the log special "checkpoint" event. +Checkpoints here do not refer to virtual machine snapshots. They are just +record/replay events used for synchronization. + +QEMU in replay mode will try to invoke timers processing in random moment +of time. That's why we do not process a group of timers until the checkpoint +event will be read from the log. Such an event allows synchronizing CPU +execution and timer events. + +Two other checkpoints govern the "warping" of the virtual clock. +While the virtual machine is idle, the virtual clock increments at +1 ns per *real time* nanosecond. This is done by setting up a timer +(called the warp timer) on the virtual real time clock, so that the +timer fires at the next deadline of the virtual clock; the virtual clock +is then incremented (which is called "warping" the virtual clock) as +soon as the timer fires or the CPUs need to go out of the idle state. +Two functions are used for this purpose; because these actions change +virtual machine state and must be deterministic, each of them creates a +checkpoint. qemu_start_warp_timer checks if the CPUs are idle and if so +starts accounting real time to virtual clock. qemu_account_warp_timer +is called when the CPUs get an interrupt or when the warp timer fires, +and it warps the virtual clock by the amount of real time that has passed +since qemu_start_warp_timer. + +Bottom halves +------------- + +Disk I/O events are completely deterministic in our model, because +in both record and replay modes we start virtual machine from the same +disk state. But callbacks that virtual disk controller uses for reading and +writing the disk may occur at different moments of time in record and replay +modes. + +Reading and writing requests are created by CPU thread of QEMU. Later these +requests proceed to block layer which creates "bottom halves". Bottom +halves consist of callback and its parameters. They are processed when +main loop locks the global mutex. These locks are not synchronized with +replaying process because main loop also processes the events that do not +affect the virtual machine state (like user interaction with monitor). + +That is why we had to implement saving and replaying bottom halves callbacks +synchronously to the CPU execution. When the callback is about to execute +it is added to the queue in the replay module. This queue is written to the +log when its callbacks are executed. In replay mode callbacks are not processed +until the corresponding event is read from the events log file. + +Sometimes the block layer uses asynchronous callbacks for its internal purposes +(like reading or writing VM snapshots or disk image cluster tables). In this +case bottom halves are not marked as "replayable" and do not saved +into the log. + +Block devices +------------- + +Block devices record/replay module intercepts calls of +bdrv coroutine functions at the top of block drivers stack. +To record and replay block operations the drive must be configured +as following: + -drive file=disk.qcow2,if=none,snapshot,id=img-direct + -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay + -device ide-hd,drive=img-blkreplay + +blkreplay driver should be inserted between disk image and virtual driver +controller. Therefore all disk requests may be recorded and replayed. + +All block completion operations are added to the queue in the coroutines. +Queue is flushed at checkpoints and information about processed requests +is recorded to the log. In replay phase the queue is matched with +events read from the log. Therefore block devices requests are processed +deterministically. + +Snapshotting +------------ + +New VM snapshots may be created in replay mode. They can be used later +to recover the desired VM state. All VM states created in replay mode +are associated with the moment of time in the replay scenario. +After recovering the VM state replay will start from that position. + +Default starting snapshot name may be specified with icount field +rrsnapshot as follows: + -icount shift=7,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name + +This snapshot is created at start of recording and restored at start +of replaying. It also can be loaded while replaying to roll back +the execution. + +'snapshot' flag of the disk image must be removed to save the snapshots +in the overlay (or original image) instead of using the temporary overlay. + -drive file=disk.ovl,if=none,id=img-direct + -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay + -device ide-hd,drive=img-blkreplay + +Use QEMU monitor to create additional snapshots. 'savevm ' command +created the snapshot and 'loadvm ' restores it. To prevent corruption +of the original disk image, use overlay files linked to the original images. +Therefore all new snapshots (including the starting one) will be saved in +overlays and the original image remains unchanged. + +When you need to use snapshots with diskless virtual machine, +it must be started with 'orphan' qcow2 image. This image will be used +for storing VM snapshots. Here is the example of the command line for this: + + qemu-system-i386 -icount shift=3,rr=replay,rrfile=record.bin,rrsnapshot=init \ + -net none -drive file=empty.qcow2,if=none,id=rr + +empty.qcow2 drive does not connected to any virtual block device and used +for VM snapshots only. + +Network devices +--------------- + +Record and replay for network interactions is performed with the network filter. +Each backend must have its own instance of the replay filter as follows: + -netdev user,id=net1 -device rtl8139,netdev=net1 + -object filter-replay,id=replay,netdev=net1 + +Replay network filter is used to record and replay network packets. While +recording the virtual machine this filter puts all packets coming from +the outer world into the log. In replay mode packets from the log are +injected into the network device. All interactions with network backend +in replay mode are disabled. + +Audio devices +------------- + +Audio data is recorded and replay automatically. The command line for recording +and replaying must contain identical specifications of audio hardware, e.g.: + -soundhw ac97 + +Serial ports +------------ + +Serial ports input is recorded and replay automatically. The command lines +for recording and replaying must contain identical number of ports in record +and replay modes, but their backends may differ. +E.g., '-serial stdio' in record mode, and '-serial null' in replay mode. + +Reverse debugging +----------------- + +Reverse debugging allows "executing" the program in reverse direction. +GDB remote protocol supports "reverse step" and "reverse continue" +commands. The first one steps single instruction backwards in time, +and the second one finds the last breakpoint in the past. + +Recorded executions may be used to enable reverse debugging. QEMU can't +execute the code in backwards direction, but can load a snapshot and +replay forward to find the desired position or breakpoint. + +The following GDB commands are supported: + - reverse-stepi (or rsi) - step one instruction backwards + - reverse-continue (or rc) - find last breakpoint in the past + +Reverse step loads the nearest snapshot and replays the execution until +the required instruction is met. + +Reverse continue may include several passes of examining the execution +between the snapshots. Each of the passes include the following steps: + 1. loading the snapshot + 2. replaying to examine the breakpoints + 3. if breakpoint or watchpoint was met + - loading the snaphot again + - replaying to the required breakpoint + 4. else + - proceeding to the p.1 with the earlier snapshot + +Therefore usage of the reverse debugging requires at least one snapshot +created in advance. This can be done by omitting 'snapshot' option +for the block drives and adding 'rrsnapshot' for both record and replay +command lines. +See the "Snapshotting" section to learn more about running record/replay +and creating the snapshot in these modes. + +Replay log format +----------------- + +Record/replay log consists of the header and the sequence of execution +events. The header includes 4-byte replay version id and 8-byte reserved +field. Version is updated every time replay log format changes to prevent +using replay log created by another build of qemu. + +The sequence of the events describes virtual machine state changes. +It includes all non-deterministic inputs of VM, synchronization marks and +instruction counts used to correctly inject inputs at replay. + +Synchronization marks (checkpoints) are used for synchronizing qemu threads +that perform operations with virtual hardware. These operations may change +system's state (e.g., change some register or generate interrupt) and +therefore should execute synchronously with CPU thread. + +Every event in the log includes 1-byte event id and optional arguments. +When argument is an array, it is stored as 4-byte array length +and corresponding number of bytes with data. +Here is the list of events that are written into the log: + + - EVENT_INSTRUCTION. Instructions executed since last event. + Argument: 4-byte number of executed instructions. + - EVENT_INTERRUPT. Used to synchronize interrupt processing. + - EVENT_EXCEPTION. Used to synchronize exception handling. + - EVENT_ASYNC. This is a group of events. They are always processed + together with checkpoints. When such an event is generated, it is + stored in the queue and processed only when checkpoint occurs. + Every such event is followed by 1-byte checkpoint id and 1-byte + async event id from the following list: + - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes + callbacks that affect virtual machine state, but normally called + asynchronously. + Argument: 8-byte operation id. + - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains + parameters of keyboard and mouse input operations + (key press/release, mouse pointer movement). + Arguments: 9-16 bytes depending of input event. + - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event. + - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input + initiated by the sender. + Arguments: 1-byte character device id. + Array with bytes were read. + - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize + operations with disk and flash drives with CPU. + Argument: 8-byte operation id. + - REPLAY_ASYNC_EVENT_NET. Incoming network packet. + Arguments: 1-byte network adapter id. + 4-byte packet flags. + Array with packet bytes. + - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu, + e.g., by closing the window. + - EVENT_CHAR_WRITE. Used to synchronize character output operations. + Arguments: 4-byte output function return value. + 4-byte offset in the output array. + - EVENT_CHAR_READ_ALL. Used to synchronize character input operations, + initiated by qemu. + Argument: Array with bytes that were read. + - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation, + initiated by qemu. + Argument: 4-byte error code. + - EVENT_CLOCK + clock_id. Group of events for host clock read operations. + Argument: 8-byte clock value. + - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of + CPU, internal threads, and asynchronous input events. May be followed + by one or more EVENT_ASYNC events. + - EVENT_END. Last event in the log.