[RFC,14/15] s390-bios: Support booting from real dasd device

Message ID 1530811543-6881-15-git-send-email-jjherne@linux.ibm.com
State New
Series s390: vfio-ccw dasd ipl support

Commit Message

Jason J. Herne July 5, 2018, 5:25 p.m. UTC
From: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>

Allows guest to boot from a vfio configured real dasd device.

Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Signed-off-by: Jason J. Herne <jjherne@linux.ibm.com>
---
 docs/devel/s390-dasd-ipl.txt | 132 +++++++++++++++++++++++
 pc-bios/s390-ccw/Makefile    |   2 +-
 pc-bios/s390-ccw/dasd-ipl.c  | 249 +++++++++++++++++++++++++++++++++++++++++++
 pc-bios/s390-ccw/dasd-ipl.h  |  16 +++
 pc-bios/s390-ccw/main.c      |   4 +
 pc-bios/s390-ccw/s390-arch.h |  13 +++
 6 files changed, 415 insertions(+), 1 deletion(-)
 create mode 100644 docs/devel/s390-dasd-ipl.txt
 create mode 100644 pc-bios/s390-ccw/dasd-ipl.c
 create mode 100644 pc-bios/s390-ccw/dasd-ipl.h

Comments

David Hildenbrand July 17, 2018, 8:43 p.m. UTC | #1
On 05.07.2018 19:25, Jason J. Herne wrote:
> From: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
> 
> Allows guest to boot from a vfio configured real dasd device.
> 
> Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
> Signed-off-by: Jason J. Herne <jjherne@linux.ibm.com>
> ---
>  docs/devel/s390-dasd-ipl.txt | 132 +++++++++++++++++++++++
>  pc-bios/s390-ccw/Makefile    |   2 +-
>  pc-bios/s390-ccw/dasd-ipl.c  | 249 +++++++++++++++++++++++++++++++++++++++++++
>  pc-bios/s390-ccw/dasd-ipl.h  |  16 +++
>  pc-bios/s390-ccw/main.c      |   4 +
>  pc-bios/s390-ccw/s390-arch.h |  13 +++
>  6 files changed, 415 insertions(+), 1 deletion(-)
>  create mode 100644 docs/devel/s390-dasd-ipl.txt
>  create mode 100644 pc-bios/s390-ccw/dasd-ipl.c
>  create mode 100644 pc-bios/s390-ccw/dasd-ipl.h
> 
> diff --git a/docs/devel/s390-dasd-ipl.txt b/docs/devel/s390-dasd-ipl.txt
> new file mode 100644
> index 0000000..87aecb9
> --- /dev/null
> +++ b/docs/devel/s390-dasd-ipl.txt
> @@ -0,0 +1,132 @@
> +*****************************
> +***** s390 hardware IPL *****
> +*****************************
> +
> +The s390 hardware IPL process consists of the following steps.
> +
> +1. A READ IPL ccw is constructed in memory location 0x0.
> +    This ccw, by definition, reads the IPL1 record which is located on the disk
> +    at cylinder 0 track 0 record 1. Note that the chain flag is on in this ccw
> +    so when it is complete another ccw will be fetched and executed from memory
> +    location 0x08.
> +
> +2. Execute the Read IPL ccw at 0x00, thereby reading IPL1 data into 0x00.
> +    IPL1 data is 24 bytes in length and consists of the following pieces of
> +    information: [psw][read ccw][tic ccw]. When the machine executes the Read
> +    IPL ccw it read the 24-bytes of IPL1 to be read into memory starting at
> +    location 0x0. Then the ccw program at 0x08 which consists of a read
> +    ccw and a tic ccw is automatically executed because of the chain flag from
> +    the original READ IPL ccw. The read ccw will read the IPL2 data into memory
> +    and the TIC (Tranfer In Channel) will transfer control to the channel
> +    program contained in the IPL2 data. The TIC channel command is the
> +    equivalent of a branch/jump/goto instruction for channel programs.
> +    NOTE: The ccws in IPL1 are defined by the architecture to be format 0.
> +
> +3. Execute IPL2.
> +    The TIC ccw instruction at the end of the IPL1 channel program will begin
> +    the execution of the IPL2 channel program. IPL2 is stage-2 of the boot
> +    process and will contain a larger channel program than IPL1. The point of
> +    IPL2 is to find and load either the operating system or a small program that
> +    loads the operating system from disk. At the end of this step all or some of
> +    the real operating system is loaded into memory and we are ready to hand
> +    control over to the guest operating system. At this point the guest
> +    operating system is entirely responsible for loading any more data it might
> +    need to function. NOTE: The IPL2 channel program might read data into memory
> +    location 0 thereby overwriting the IPL1 psw and channel program. This is ok
> +    as long as the data placed in location 0 contains a psw whose instruction
> +    address points to the guest operating system code to execute at the end of
> +    the IPL/boot process.
> +    NOTE: The ccws in IPL2 are defined by the architecture to be format 0.
> +
> +4. Start executing the guest operating system.
> +    The psw that was loaded into memory location 0 as part of the ipl process
> +    should contain the needed flags for the operating system we have loaded. The
> +    psw's instruction address will point to the location in memory where we want
> +    to start executing the operating system. This psw is loaded (via LPSW
> +    instruction) causing control to be passed to the operating system code.
> +
> +In a non-virtualized environment this process, handled entirely by the hardware,
> +is kicked off by the user initiating a "Load" procedure from the hardware
> +management console. This "Load" procedure crafts a special "Read IPL" ccw in
> +memory location 0x0 that reads IPL1. It then executes this ccw thereby kicking
> +off the reading of IPL1 data. Since the channel program from IPL1 will be
> +written immediately after the special "Read IPL" ccw, the IPL1 channel program
> +will be executed immediately (the special read ccw has the chaining bit turned
> +on). The TIC at the end of the IPL1 channel program will cause the IPL2 channel
> +program to be executed automatically. After this sequence completes the "Load"
> +procedure then loads the psw from 0x0.
> +
> +*****************************************
> +***** How this all pertains to Qemu *****
> +*****************************************
> +
> +In theory we should merely have to do the following to IPL/boot a guest
> +operating system from a DASD device:
> +
> +1. Place a "Read IPL" ccw into memory location 0x0 with chaining bit on.
> +2. Execute channel program at 0x0.
> +3. LPSW 0x0.
> +
> +However, our emulation of the machine's channel program logic is missing one key
> +feature that is required for this process to work: non-prefetch of ccw data.
> +
> +When we start a channel program we pass the channel subsystem parameters via an
> +ORB (Operation Request Block). One of those parameters is a prefetch bit. If the
> +bit is on then Qemu is allowed to read the entire channel program from guest
> +memory before it starts executing it. This means that any channel commands that
> +read additional channel commands will not work as expected because the newly
> +read commands will only exist in guest memory and NOT within Qemu's channel
> +subsystem memory. Qemu's channel subsystem's implementation currently requires
> +this bit to be on for all channel programs. This is a problem because the IPL
> +process consists of transferring control from the "Read IPL" ccw immediately to
> +the IPL1 channel program that was read by "Read IPL".
> ++

I have way too little insight into channel devices and how QEMU
implements them, however I wonder what hinders us from implementing
support for !prefetch in QEMU?

What you tailored here seems impressive :) Just want to know what the
technical background of this prefetch thingy in QEMU is.

Cornelia Huck July 18, 2018, 7:40 a.m. UTC | #2
On Tue, 17 Jul 2018 22:43:27 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 05.07.2018 19:25, Jason J. Herne wrote:

> > +*****************************************
> > +***** How this all pertains to Qemu *****
> > +*****************************************
> > +
> > +In theory we should merely have to do the following to IPL/boot a guest
> > +operating system from a DASD device:
> > +
> > +1. Place a "Read IPL" ccw into memory location 0x0 with chaining bit on.
> > +2. Execute channel program at 0x0.
> > +3. LPSW 0x0.
> > +
> > +However, our emulation of the machine's channel program logic is missing one key
> > +feature that is required for this process to work: non-prefetch of ccw data.
> > +
> > +When we start a channel program we pass the channel subsystem parameters via an
> > +ORB (Operation Request Block). One of those parameters is a prefetch bit. If the
> > +bit is on then Qemu is allowed to read the entire channel program from guest
> > +memory before it starts executing it. This means that any channel commands that
> > +read additional channel commands will not work as expected because the newly
> > +read commands will only exist in guest memory and NOT within Qemu's channel
> > +subsystem memory. Qemu's channel subsystem's implementation currently requires
> > +this bit to be on for all channel programs. This is a problem because the IPL
> > +process consists of transferring control from the "Read IPL" ccw immediately to
> > +the IPL1 channel program that was read by "Read IPL".
> > ++  
> 
> I have way too little insight into channel devices and how QEMU
> implements them, however I wonder what hinders us from implementing
> support for !prefetch in QEMU?
> 
> What you tailored here seems impressive :) Just want to know what the
> technical background of this prefetch thingy in QEMU is.

This has to do with how vfio-ccw processes and translates channel
programs.

Currently, we hand over the chain of channel commands to the kernel to
be translated (guest->host addresses) and to execute ssch on the real
subchannel. However, this requires sending the channel program over in
one go, which makes it impossible for the guest to modify an in-flight
channel program (there are tricks like putting a suspend marker on a
channel command and moving that marker forward as you go which make it
possible to know that a channel command has not yet been processed;
IIRC the lcs driver in Linux does that, or at least used to do that).
Our implementation currently does not accommodate that (the Linux dasd
driver does not use that feature). It's not impossible to implement it,
but it would require some effort (and I don't think anybody currently
has spare time for that...)

David Hildenbrand July 18, 2018, 7:51 a.m. UTC | #3
On 18.07.2018 09:40, Cornelia Huck wrote:
> On Tue, 17 Jul 2018 22:43:27 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 05.07.2018 19:25, Jason J. Herne wrote:
> 
>>> +*****************************************
>>> +***** How this all pertains to Qemu *****
>>> +*****************************************
>>> +
>>> +In theory we should merely have to do the following to IPL/boot a guest
>>> +operating system from a DASD device:
>>> +
>>> +1. Place a "Read IPL" ccw into memory location 0x0 with chaining bit on.
>>> +2. Execute channel program at 0x0.
>>> +3. LPSW 0x0.
>>> +
>>> +However, our emulation of the machine's channel program logic is missing one key
>>> +feature that is required for this process to work: non-prefetch of ccw data.
>>> +
>>> +When we start a channel program we pass the channel subsystem parameters via an
>>> +ORB (Operation Request Block). One of those parameters is a prefetch bit. If the
>>> +bit is on then Qemu is allowed to read the entire channel program from guest
>>> +memory before it starts executing it. This means that any channel commands that
>>> +read additional channel commands will not work as expected because the newly
>>> +read commands will only exist in guest memory and NOT within Qemu's channel
>>> +subsystem memory. Qemu's channel subsystem's implementation currently requires
>>> +this bit to be on for all channel programs. This is a problem because the IPL
>>> +process consists of transferring control from the "Read IPL" ccw immediately to
>>> +the IPL1 channel program that was read by "Read IPL".
>>> ++  
>>
>> I have way too little insight into channel devices and how QEMU
>> implements them, however I wonder what hinders us from implementing
>> support for !prefetch in QEMU?
>>
>> What you tailored here seems impressive :) Just want to know what the
>> technical background of this prefetch thingy in QEMU is.
> 
> This has to do with how vfio-ccw processes and translates channel
> programs.
> 

Ah, okay, I thought this was *QEMU's* fault, but it actually is
vfio-ccw's fault, and QEMU can't do anything about it.

> Currently, we hand over the chain of channel commands to the kernel to
> be translated (guest->host addresses) and to execute ssch on the real
> subchannel. However, this requires sending the channel program over in
> one go, which makes it impossible for the guest to modify an in-flight
> channel program (there are tricks like putting a suspend marker on a
> channel command and moving that marker forward as you go which make it
> possible to know that a channel command has not yet been processed;
> IIRC the lcs driver in Linux does that, or at least used to do that).
> Our implementation currently does not accommodate that (the Linux dasd
> driver does not use that feature). It's not impossible to implement it,
> but it would require some effort (and I don't think anybody currently
> has spare time for that...)

Spare time, what's that? :)

Thanks for the background info!

Halil Pasic July 18, 2018, 10:55 a.m. UTC | #4
On 07/18/2018 09:40 AM, Cornelia Huck wrote:
> On Tue, 17 Jul 2018 22:43:27 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 05.07.2018 19:25, Jason J. Herne wrote:
> 
>>> +*****************************************
>>> +***** How this all pertains to Qemu *****
>>> +*****************************************
>>> +
>>> +In theory we should merely have to do the following to IPL/boot a guest
>>> +operating system from a DASD device:
>>> +
>>> +1. Place a "Read IPL" ccw into memory location 0x0 with chaining bit on.
>>> +2. Execute channel program at 0x0.
>>> +3. LPSW 0x0.
>>> +
>>> +However, our emulation of the machine's channel program logic is missing one key
>>> +feature that is required for this process to work: non-prefetch of ccw data.
>>> +
>>> +When we start a channel program we pass the channel subsystem parameters via an
>>> +ORB (Operation Request Block). One of those parameters is a prefetch bit. If the
>>> +bit is on then Qemu is allowed to read the entire channel program from guest
>>> +memory before it starts executing it. This means that any channel commands that
>>> +read additional channel commands will not work as expected because the newly
>>> +read commands will only exist in guest memory and NOT within Qemu's channel
>>> +subsystem memory. Qemu's channel subsystem's implementation currently requires
>>> +this bit to be on for all channel programs. This is a problem because the IPL
>>> +process consists of transferring control from the "Read IPL" ccw immediately to
>>> +the IPL1 channel program that was read by "Read IPL".
>>> ++
>>
>> I have way too little insight into channel devices and how QEMU
>> implements them, however I wonder what hinders us from implementing
>> support for !prefetch in QEMU?
>>
>> What you tailored here seems impressive :) Just want to know what the
>> technical background of this prefetch thingy in QEMU is.
> 
> This has to do with how vfio-ccw processes and translates channel
> programs.
> 
> Currently, we hand over the chain of channel commands to the kernel to
> be translated (guest->host addresses) and to execute ssch on the real
> subchannel. However, this requires sending the channel program over in
> one go, which makes it impossible for the guest to modify an in-flight
> channel program (there are tricks like putting a suspend marker on a
> channel command and moving that marker forward as you go which make it
> possible to know that a channel command has not yet been processed;
> IIRC the lcs driver in Linux does that, or at least used to do that).
> Our implementation currently does not accommodate that (the Linux dasd
> driver does not use that feature). It's not impossible to implement it,
> but it would require some effort (and I don't think anybody currently
> has spare time for that...)


I disagree, IMHO we can not implement generic support for !prefetch in
vfio-ccw with what we have at our disposal at the abstraction level we
are currently working on. (If we were to abandon using IO instructions
in the host but rely on lower level protocols like FC it may be possible,
I don't want to make a statement about that).

The problem is the address translation. If the channel program reads
something that is going to be used as a ccw by the same channel program
(e.g. CCW-IPL), the address within that ccw that was read is a
guest-physical address.

We, however, need to translate the addresses in the guest ccws (and in the
(m)idaws) too before these get executed as a part of a host channel
program. And because the ccws themselves need to be 31-bit addressable
in host physical memory, we actually copy the channel program from guest
memory to suitable host memory in the vfio-ccw driver.

So to translate the new stuff we would actually have to stop the channel
program and resubmit the rest (either by suspend+resume or by break
chaining+ssch). The problem with that is that an execution of a channel
program composed of four ccws A,B,C,D and an execution of a channel
program composed of ccws A,B immediately followed by an execution of a
channel program composed of the ccws C,D are not the same. I.e. it is
not generally safe to break a chain of ccws.
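
To make the translation problem concrete, here is a minimal sketch in C.
It is only a toy model under stated assumptions: struct sim_ccw is a
simplified stand-in and not the architected ccw layout, HOST_OFFSET is a
made-up constant guest-to-host delta, and translate_program() and
device_stores_ccw() are hypothetical names, not vfio-ccw functions. The
point it shows: the host copy is built once before the program starts,
so a ccw that the program itself reads into guest memory later is never
copied and never translated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified CCW model for illustration only; not the architected layout. */
struct sim_ccw {
    uint8_t  cmd;
    uint8_t  flags;
    uint16_t count;
    uint32_t cda;    /* data address */
};
static_assert(sizeof(struct sim_ccw) == 8, "toy ccw is expected to be 8 bytes");

/* Hypothetical constant guest-physical to host-physical delta. */
#define HOST_OFFSET 0x100000u

/*
 * Copy n guest CCWs into host memory and rewrite each data address.
 * This happens once, before the program is started.
 */
static void translate_program(const struct sim_ccw *guest,
                              struct sim_ccw *host, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        host[i] = guest[i];
        host[i].cda = guest[i].cda + HOST_OFFSET;
    }
}

/* Model a read CCW completing: the device stores a new CCW in guest memory. */
static void device_stores_ccw(struct sim_ccw *guest, size_t slot,
                              struct sim_ccw fresh)
{
    guest[slot] = fresh;
}
```

If slot 1 of the guest program is filled by device_stores_ccw() after
translate_program() has run, the host copy of slot 1 still holds the
stale, pre-read contents, and the fresh ccw in guest memory still
carries a guest-physical address.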

If you still think it's possible to implement !prefetch, could you please
explain your idea?

Regards,
Halil

Cornelia Huck July 18, 2018, 11:35 a.m. UTC | #5
On Wed, 18 Jul 2018 12:55:51 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:

> On 07/18/2018 09:40 AM, Cornelia Huck wrote:
> > On Tue, 17 Jul 2018 22:43:27 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >> On 05.07.2018 19:25, Jason J. Herne wrote:  
> >   
> >>> +*****************************************
> >>> +***** How this all pertains to Qemu *****
> >>> +*****************************************
> >>> +
> >>> +In theory we should merely have to do the following to IPL/boot a guest
> >>> +operating system from a DASD device:
> >>> +
> >>> +1. Place a "Read IPL" ccw into memory location 0x0 with chaining bit on.
> >>> +2. Execute channel program at 0x0.
> >>> +3. LPSW 0x0.
> >>> +
> >>> +However, our emulation of the machine's channel program logic is missing one key
> >>> +feature that is required for this process to work: non-prefetch of ccw data.
> >>> +
> >>> +When we start a channel program we pass the channel subsystem parameters via an
> >>> +ORB (Operation Request Block). One of those parameters is a prefetch bit. If the
> >>> +bit is on then Qemu is allowed to read the entire channel program from guest
> >>> +memory before it starts executing it. This means that any channel commands that
> >>> +read additional channel commands will not work as expected because the newly
> >>> +read commands will only exist in guest memory and NOT within Qemu's channel
> >>> +subsystem memory. Qemu's channel subsystem's implementation currently requires
> >>> +this bit to be on for all channel programs. This is a problem because the IPL
> >>> +process consists of transferring control from the "Read IPL" ccw immediately to
> >>> +the IPL1 channel program that was read by "Read IPL".
> >>> ++  
> >>
> >> I have way too little insight into channel devices and how QEMU
> >> implements them, however I wonder what hinders us from implementing
> >> support for !prefetch in QEMU?
> >>
> >> What you tailored here seems impressive :) Just want to know what the
> >> technical background of this prefetch thingy in QEMU is.  
> > 
> > This has to do with how vfio-ccw processes and translates channel
> > programs.
> > 
> > Currently, we hand over the chain of channel commands to the kernel to
> > be translated (guest->host addresses) and to execute ssch on the real
> > subchannel. However, this requires sending the channel program over in
> > one go, which makes it impossible for the guest to modify an in-flight
> > channel program (there are tricks like putting a suspend marker on a
> > channel command and moving that marker forward as you go which make it
> > possible to know that a channel command has not yet been processed;
> > IIRC the lcs driver in Linux does that, or at least used to do that).
> > Our implementation currently does not accommodate that (the Linux dasd
> > driver does not use that feature). It's not impossible to implement it,
> > but it would require some effort (and I don't think anybody currently
> > has spare time for that...)  
> 
> 
> I disagree, IMHO we can not implement generic support for !prefetch in
> vfio-ccw with what we have at our disposal at the abstraction level we
> are currently working on. (If we were to abandon using IO instructions
> in the host but rely on lower level protocols like FC it may be possible,
> I don't want to make a statement about that).
> 
> The problem is the address translation. If the channel program reads
> something that is going to be used as a ccw by the same (e.g. CCW-IPL)
> the address within that ccw that was read is a guest-physical address.
> 
> We however need to translate the addresses in the guest ccws (and in the
> (m)idaw's) too before these get executed as a part of a host channel
> program. And because the address of the ccws themselves needs to be
> 31 bit addressable in host physical we actually copy the channel program
> form guest memory to suitable host memory in the vfio-ccw driver.
> 
> So to translate the new stuff we would actually have to stop the channel
> program and resubmit the rest (either by suspend+resume or by break
> chaining+ssch). The problem with that an execution of a channel program
> that is composed of four ccws A,B,C,D and an execution of a channel
> programs composed of ccws A,B immediately followed by and execution
> of a channel program composed of the ccws C,D is not the same. I.e. it
> is not generally safe to break a chain of ccws.

Exploiting suspending would have been my idea. Probably combined with a
new interface that fetches ccw-by-ccw.

But I don't think it makes sense to spend time thinking about this
right now.

> 
> If you still think it's possible to implement !prefetch, could you please
> explain your idea.
> 
> Regards,
> Halil
>

Halil Pasic July 18, 2018, 11:44 a.m. UTC | #6
On 07/05/2018 07:25 PM, Jason J. Herne wrote:
> From: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
> 
> Allows guest to boot from a vfio configured real dasd device.
> 
> Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
> Signed-off-by: Jason J. Herne <jjherne@linux.ibm.com>
> ---
>   docs/devel/s390-dasd-ipl.txt | 132 +++++++++++++++++++++++
>   pc-bios/s390-ccw/Makefile    |   2 +-
>   pc-bios/s390-ccw/dasd-ipl.c  | 249 +++++++++++++++++++++++++++++++++++++++++++
>   pc-bios/s390-ccw/dasd-ipl.h  |  16 +++
>   pc-bios/s390-ccw/main.c      |   4 +
>   pc-bios/s390-ccw/s390-arch.h |  13 +++
>   6 files changed, 415 insertions(+), 1 deletion(-)
>   create mode 100644 docs/devel/s390-dasd-ipl.txt
>   create mode 100644 pc-bios/s390-ccw/dasd-ipl.c
>   create mode 100644 pc-bios/s390-ccw/dasd-ipl.h
> 
> diff --git a/docs/devel/s390-dasd-ipl.txt b/docs/devel/s390-dasd-ipl.txt
> new file mode 100644
> index 0000000..87aecb9
> --- /dev/null
> +++ b/docs/devel/s390-dasd-ipl.txt
> @@ -0,0 +1,132 @@
> +*****************************
> +***** s390 hardware IPL *****
> +*****************************
> +
> +The s390 hardware IPL process consists of the following steps.
> +
> +1. A READ IPL ccw is constructed in memory location 0x0.
> +    This ccw, by definition, reads the IPL1 record which is located on the disk
> +    at cylinder 0 track 0 record 1. Note that the chain flag is on in this ccw
> +    so when it is complete another ccw will be fetched and executed from memory
> +    location 0x08.
> +
> +2. Execute the Read IPL ccw at 0x00, thereby reading IPL1 data into 0x00.
> +    IPL1 data is 24 bytes in length and consists of the following pieces of
> +    information: [psw][read ccw][tic ccw]. When the machine executes the Read
> +    IPL ccw it read the 24-bytes of IPL1 to be read into memory starting at
> +    location 0x0. Then the ccw program at 0x08 which consists of a read
> +    ccw and a tic ccw is automatically executed because of the chain flag from
> +    the original READ IPL ccw. The read ccw will read the IPL2 data into memory
> +    and the TIC (Tranfer In Channel) will transfer control to the channel
> +    program contained in the IPL2 data. The TIC channel command is the
> +    equivalent of a branch/jump/goto instruction for channel programs.
> +    NOTE: The ccws in IPL1 are defined by the architecture to be format 0.
> +
> +3. Execute IPL2.
> +    The TIC ccw instruction at the end of the IPL1 channel program will begin
> +    the execution of the IPL2 channel program. IPL2 is stage-2 of the boot
> +    process and will contain a larger channel program than IPL1. The point of
> +    IPL2 is to find and load either the operating system or a small program that
> +    loads the operating system from disk. At the end of this step all or some of
> +    the real operating system is loaded into memory and we are ready to hand
> +    control over to the guest operating system. At this point the guest
> +    operating system is entirely responsible for loading any more data it might
> +    need to function. NOTE: The IPL2 channel program might read data into memory
> +    location 0 thereby overwriting the IPL1 psw and channel program. This is ok
> +    as long as the data placed in location 0 contains a psw whose instruction
> +    address points to the guest operating system code to execute at the end of
> +    the IPL/boot process.
> +    NOTE: The ccws in IPL2 are defined by the architecture to be format 0.
> +

I don't really like this description. It sounds like a mix between architecture stuff
and how things actually work in the wild.

For example, is there a guarantee that there is no IPL3 to which IPL2 is going to
tic?

What you describe here is IMHO much better described in the (public) PoP. Could
we just reference the PoP?
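
For orientation while reading the quoted text, the 24-byte IPL1 layout
it describes ([psw][read ccw][tic ccw], with format-0 ccws) can be
sketched in C as below. This is an illustrative sketch only: the struct
and helper names (ccw0, ipl1_record, ccw0_cda) are hypothetical and not
taken from the patch, and byte arrays are used for the multi-byte fields
so that the layout does not depend on compiler padding or host
endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CCW_CMD_TIC 0x08    /* Transfer In Channel */

/* Format-0 CCW: command code, 24-bit data address, flags, byte count. */
struct ccw0 {
    uint8_t cmd_code;
    uint8_t cda[3];      /* 24-bit data address, big-endian */
    uint8_t flags;
    uint8_t reserved;
    uint8_t count[2];    /* byte count, big-endian */
};
static_assert(sizeof(struct ccw0) == 8, "a ccw is 8 bytes");

/* Hypothetical name: the IPL1 record as it lands at address 0x0. */
struct ipl1_record {
    uint8_t     psw[8];      /* 0x00: psw loaded at the end of IPL */
    struct ccw0 read_ccw;    /* 0x08: reads IPL2 into memory */
    struct ccw0 tic_ccw;     /* 0x10: transfers to the IPL2 program */
};
static_assert(sizeof(struct ipl1_record) == 24, "IPL1 is 24 bytes");

/* Assemble the 24-bit data address of a format-0 CCW. */
static uint32_t ccw0_cda(const struct ccw0 *c)
{
    return ((uint32_t)c->cda[0] << 16) |
           ((uint32_t)c->cda[1] << 8)  |
            (uint32_t)c->cda[2];
}
```

With this layout, ccw0_cda(&rec->tic_ccw) yields the address at which
the IPL2 channel program starts.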

> +4. Start executing the guest operating system.
> +    The psw that was loaded into memory location 0 as part of the ipl process
> +    should contain the needed flags for the operating system we have loaded. The
> +    psw's instruction address will point to the location in memory where we want
> +    to start executing the operating system. This psw is loaded (via LPSW
> +    instruction) causing control to be passed to the operating system code.
> +
> +In a non-virtualized environment this process, handled entirely by the hardware,
> +is kicked off by the user initiating a "Load" procedure from the hardware
> +management console. This "Load" procedure crafts a special "Read IPL" ccw in
> +memory location 0x0 that reads IPL1. It then executes this ccw thereby kicking
> +off the reading of IPL1 data. Since the channel program from IPL1 will be
> +written immediately after the special "Read IPL" ccw, the IPL1 channel program
> +will be executed immediately (the special read ccw has the chaining bit turned
> +on). The TIC at the end of the IPL1 channel program will cause the IPL2 channel
> +program to be executed automatically. After this sequence completes the "Load"
> +procedure then loads the psw from 0x0.
> +
> +*****************************************
> +***** How this all pertains to Qemu *****
> +*****************************************
> +
> +In theory we should merely have to do the following to IPL/boot a guest
> +operating system from a DASD device:
> +
> +1. Place a "Read IPL" ccw into memory location 0x0 with chaining bit on.
> +2. Execute channel program at 0x0.
> +3. LPSW 0x0.
> +
> +However, our emulation of the machine's channel program logic is missing one key
> +feature that is required for this process to work: non-prefetch of ccw data.
> +

The next paragraph is IMHO straight misleading

> +When we start a channel program we pass the channel subsystem parameters via an
> +ORB (Operation Request Block). One of those parameters is a prefetch bit. If the
> +bit is on then Qemu is allowed to read the entire channel program from guest
> +memory before it starts executing it.

For vfio-ccw (passthrough) QEMU does not read the channel program AFAIR. For emulated,
I don't think we have a problem with the P bit not set in ORB.

> This means that any channel commands that
> +read additional channel commands will not work as expected because the newly
> +read commands will only exist in guest memory and NOT within Qemu's channel
> +subsystem memory. 

Thus this is also wrong. The actual problem is that on one hand we need to do
address translation and possibly ccw relocation in vfio-ccw, on the other hand
we are not allowed to break chains (e.g. we can't do: prepare a single ccw,
fire with chaining disabled, if needed prepare the next one all over).

> Qemu's channel subsystem's implementation currently requires
> +this bit to be on for all channel programs.

AFAIR only for passthrough.

> This is a problem because the IPL
> +process consists of transferring control from the "Read IPL" ccw immediately to
> +the IPL1 channel program that was read by "Read IPL".

That's right, there is no way vfio-ccw can translate IPL1 in advance because it
is still *not* in guest memory. (Compare to your "only exist in guest memory and NOT
within Qemu's channel subsystem memory".)

> +
> +Not being able to turn off prefetch will also prevent the TIC at the end of the
> +IPL1 channel program from transferring control to the IPL2 channel program.
> +

I would say, this is the same as IPL1.

> +Lastly, in some cases (the zipl bootloader for example) the IPL2 program also
> +tansfers control to another channel program segment immediately after reading it
> +from the disk. So we need to be able to handle this case.
> +
> +**************************
> +***** What Qemu does *****
> +**************************

Maybe say Qemu BIOS. When you say Qemu I tend to think of emulator code. But
here we are talking about code that runs in guest context.

> +
> +Since we are forced to live with prefetch we cannot use the very simple IPL
> +procedure we defined in the preceding section. So we compensate by doing the
> +following.
> +
> +1. Place "Read IPL" ccw into memory location 0x0, but turn off chaining bit.
> +2. Execute "Read IPL" at 0x0.
> +
> +   So now IPL1's psw is at 0x0 and IPL1's channel program is at 0x08.
> +
> +4. Write a custom channel program that will seek to the IPL2 record and then
> +   execute the READ and TIC ccws from IPL1.  Normamly the seek is not required

s/Normamly/Normally

> +   because after reading the IPL1 record the disk is automatically positioned
> +   to read the very next record which will be IPL2. But since we are not reading
> +   both IPL1 and IPL2 as part of the same channel program we must manually set
> +   the position.

The disk may actually be positioned as usual. That is not the point.

> +
> +5. Grab the target address of the TIC instruction from the IPL1 channel program.
> +   This address is where the IPL2 channel program starts.
> +
> +   Now IPL2 is loaded into memory somewhere, and we know the address.
> +
> +6. Execute the IPL2 channel program at the address obtained in step #5.
> +
> +   Because this channel program can be dynamic, we must use a special algorithm
> +   that detects a READ immediately followed by a TIC and breaks the ccw chain
> +   by turning off the chain bit in the READ ccw. When control is returned from
> +   the kernel/hardware to the Qemu bios code we immediately issue another start
> +   subchannel to execute the remaining TIC instruction.

Below, in the code, you seem to skip over the tic and issue an ssch with the
address in the tic as the start of the (next) channel program.
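
To make sure we are talking about the same thing, the behaviour I read
out of the code can be sketched roughly as below. This is a sketch under
assumptions, not the code from the patch: split_before_tic() and the
other names are hypothetical, the segment is treated as a flat array of
format-0 ccws, and the two read command codes (0x06 and 0x86) are my
assumption about which DASD reads matter here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CCW_CMD_TIC          0x08   /* Transfer In Channel */
#define CCW_FLAG_CC          0x40   /* command chaining */
#define CCW_CMD_DASD_READ    0x06   /* assumed: DASD read data */
#define CCW_CMD_DASD_READ_MT 0x86   /* assumed: multi-track read data */

struct ccw0 {
    uint8_t cmd_code;
    uint8_t cda[3];      /* 24-bit data address, big-endian */
    uint8_t flags;
    uint8_t reserved;
    uint8_t count[2];    /* byte count, big-endian */
};
static_assert(sizeof(struct ccw0) == 8, "a ccw is 8 bytes");

static uint32_t ccw0_cda(const struct ccw0 *c)
{
    return ((uint32_t)c->cda[0] << 16) |
           ((uint32_t)c->cda[1] << 8)  |
            (uint32_t)c->cda[2];
}

static int is_dasd_read(uint8_t cmd)
{
    return cmd == CCW_CMD_DASD_READ || cmd == CCW_CMD_DASD_READ_MT;
}

/*
 * Scan a segment of n CCWs for a chained read directly followed by a TIC.
 * If found, clear the read's chain flag so the program stops there, store
 * the TIC target in *next_cpa, and return 1. The caller would then start
 * a new channel program at *next_cpa instead of executing the TIC.
 * Return 0 if the chain ends before such a pair is found.
 */
static int split_before_tic(struct ccw0 *prog, size_t n, uint32_t *next_cpa)
{
    for (size_t i = 0; i + 1 < n; i++) {
        if (!(prog[i].flags & CCW_FLAG_CC)) {
            return 0;    /* chain ends here, nothing to split */
        }
        if (is_dasd_read(prog[i].cmd_code) &&
            prog[i + 1].cmd_code == CCW_CMD_TIC) {
            prog[i].flags &= (uint8_t)~CCW_FLAG_CC;
            *next_cpa = ccw0_cda(&prog[i + 1]);
            return 1;
        }
    }
    return 0;
}
```

Note that in this sketch the tic itself is never executed; its target
only becomes the start address of the next ssch.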

> This causes the entire
> +   channel program (starting from the TIC) and all needed data to be refetched

Are you sure it was fetched previously? It could not have been fetched at the very
beginning, as your initial channel program was a single Read IPL.

I would go with something like "This causes the next portion of the intended
channel program to be properly translated ..."

> +   thereby stepping around the limitation that would otherwise prevent this
> +   channe program from executing properly.
> +
> +   Now the operating system code is loaded somewhere in guest memory and the psw

When the entire intended channel program terminates the operating ...

> +   in memory location 0x0 will point to entry code for the guest operating
> +   system.
> +
> +7. LPSW 0x0.
> +   LPSW transfers control to the guest operating system and we're done.

Sorry, I did not read through this properly earlier. I was concerned with the
code and assumed the documentation would just reflect what is in the code.

I'm not sure if we can shorten this substantially, but I prefer no documentation
over misleading documentation.

Regards,
Halil
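
As a reading aid for the custom channel program the quoted documentation
describes, the low-memory layout that ipl1_fixup() in the quoted patch sets up
can be tabulated and checked for overlap. This is only a sketch: the addresses
come from the patch, the 8-byte size of a format-0 CCW is architectural, but
the lengths of the seek and search arguments (6 and 5 bytes) are assumptions,
and all toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One region of low memory after the fixup; lengths in bytes. */
struct toy_slot {
    const char *what;
    uint32_t addr;
    uint32_t len;
};

/*
 * Addresses are taken from ipl1_fixup() in the patch. The data lengths
 * (6 and 5) are assumptions about CcwSeekData and CcwSearchIdData.
 */
static const struct toy_slot toy_ipl1_layout[] = {
    { "IPL1 psw",                0x00, 8 },
    { "seek ccw",                0x08, 8 },
    { "search id ccw",           0x10, 8 },
    { "tic back to search",      0x18, 8 },
    { "IPL1 read ccw (moved)",   0x20, 8 },
    { "IPL1 tic ccw (moved)",    0x28, 8 },
    { "seek data (assumed 6)",   0x30, 6 },
    { "search data (assumed 5)", 0x38, 5 },
};

#define TOY_NSLOTS (sizeof(toy_ipl1_layout) / sizeof(toy_ipl1_layout[0]))

/* True if the slots are sorted by address and no slot runs into the next. */
static bool toy_slots_disjoint(const struct toy_slot *s, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        if (s[i].addr + s[i].len > s[i + 1].addr) {
            return false;
        }
    }
    return true;
}

/* End address (exclusive) of the last slot, i.e. how much low memory is used. */
static uint32_t toy_layout_end(const struct toy_slot *s, size_t n)
{
    assert(n > 0);
    return s[n - 1].addr + s[n - 1].len;
}
```

Under these assumptions the slots do not overlap and the fixup touches low
memory up to, but not including, 0x3d.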

> diff --git a/pc-bios/s390-ccw/Makefile b/pc-bios/s390-ccw/Makefile
> index 12ad9c1..a048b6b 100644
> --- a/pc-bios/s390-ccw/Makefile
> +++ b/pc-bios/s390-ccw/Makefile
> @@ -10,7 +10,7 @@ $(call set-vpath, $(SRC_PATH)/pc-bios/s390-ccw)
>   .PHONY : all clean build-all
>   
>   OBJECTS = start.o main.o bootmap.o jump2ipl.o sclp.o menu.o \
> -	  virtio.o virtio-scsi.o virtio-blkdev.o libc.o cio.o
> +	  virtio.o virtio-scsi.o virtio-blkdev.o libc.o cio.o dasd-ipl.o
>   
>   QEMU_CFLAGS := $(filter -W%, $(QEMU_CFLAGS))
>   QEMU_CFLAGS += -ffreestanding -fno-delete-null-pointer-checks -msoft-float
> diff --git a/pc-bios/s390-ccw/dasd-ipl.c b/pc-bios/s390-ccw/dasd-ipl.c
> new file mode 100644
> index 0000000..e8510f5
> --- /dev/null
> +++ b/pc-bios/s390-ccw/dasd-ipl.c
> @@ -0,0 +1,249 @@
> +/*
> + * S390 IPL (boot) from a real DASD device via vfio framework.
> + *
> + * Copyright (c) 2018 Jason J. Herne <jjherne@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or (at
> + * your option) any later version. See the COPYING file in the top-level
> + * directory.
> + */
> +
> +#include "libc.h"
> +#include "s390-ccw.h"
> +#include "s390-arch.h"
> +#include "dasd-ipl.h"
> +
> +static char prefix_page[PAGE_SIZE * 2]
> +            __attribute__((__aligned__(PAGE_SIZE * 2)));
> +
> +static void enable_prefixing(void)
> +{
> +    memcpy(&prefix_page, (void *)0, 4096);
> +    set_prefix(ptr2u32(&prefix_page));
> +}
> +
> +static void disable_prefixing(void)
> +{
> +    set_prefix(0);
> +    /* Copy io interrupt info back to low core */
> +    memcpy((void *)0xB8, prefix_page + 0xB8, 12);
> +}
> +
> +static bool is_read_tic_ccw_chain(Ccw0 *ccw)
> +{
> +    Ccw0 *next_ccw = ccw + 1;
> +
> +    return ((ccw->cmd_code == CCW_CMD_DASD_READ ||
> +            ccw->cmd_code == CCW_CMD_DASD_READ_MT) &&
> +            ccw->chain && next_ccw->cmd_code == CCW_CMD_TIC);
> +}
> +
> +static bool dynamic_cp_fixup(uint32_t ccw_addr, uint32_t  *next_cpa)
> +{
> +    Ccw0 *cur_ccw = (Ccw0 *)(uint64_t)ccw_addr;
> +    Ccw0 *tic_ccw;
> +
> +    while (true) {
> +        /* Skip over inline TIC (it might not have the chain bit on)  */
> +        if (cur_ccw->cmd_code == CCW_CMD_TIC &&
> +            cur_ccw->cda == ptr2u32(cur_ccw) - 8) {
> +            cur_ccw += 1;
> +            continue;
> +        }
> +
> +        if (!cur_ccw->chain) {
> +            break;
> +        }
> +        if (is_read_tic_ccw_chain(cur_ccw)) {
> +            /*
> +             * Breaking a chain of CCWs may alter the semantics or even the
> +             * validity of a channel program. The heuristic implemented below
> +             * seems to work well in practice for the channel programs
> +             * generated by zipl.
> +             */
> +            tic_ccw = cur_ccw + 1;
> +            *next_cpa = tic_ccw->cda;
> +            cur_ccw->chain = 0;
> +            return true;
> +        }
> +        cur_ccw += 1;
> +    }
> +    return false;
> +}
> +
> +static int run_dynamic_ccw_program(SubChannelId schid, uint32_t cpa)
> +{
> +    bool has_next;
> +    uint32_t next_cpa;
> +    int rc;
> +
> +    do {
> +        has_next = dynamic_cp_fixup(cpa, &next_cpa);
> +
> +        print_int("executing ccw chain at ", cpa);
> +        enable_prefixing();
> +        rc = do_cio(schid, cpa, CCW_FMT0);
> +        disable_prefixing();
> +
> +        if (rc) {
> +            break;
> +        }
> +        cpa = next_cpa;
> +    } while (has_next);
> +
> +    return rc;
> +}
> +
> +
> +static void make_readipl(void)
> +{
> +    Ccw0 *ccwIplRead = (Ccw0 *)0x00;
> +
> +    /* Create Read IPL ccw at address 0 */
> +    ccwIplRead->cmd_code = CCW_CMD_READ_IPL;
> +    ccwIplRead->cda = 0x00; /* Read into address 0x00 in main memory */
> +    ccwIplRead->chain = 0; /* Chain flag */
> +    ccwIplRead->count = 0x18; /* Read 0x18 bytes of data */
> +}
> +
> +static void run_readipl(SubChannelId schid)
> +{
> +    if (do_cio(schid, 0x00, CCW_FMT0)) {
> +        panic("dasd-ipl: Failed to run Read IPL channel program");
> +    }
> +}
> +
> +/*
> + * The architecture states that IPL1 data should consist of a psw followed by
> + * format-0 READ and TIC CCWs. Let's sanity check.
> + */
> +static void check_ipl1(void)
> +{
> +    Ccw0 *ccwread = (Ccw0 *)0x08;
> +    Ccw0 *ccwtic = (Ccw0 *)0x10;
> +
> +    if (ccwread->cmd_code != CCW_CMD_DASD_READ ||
> +        ccwtic->cmd_code != CCW_CMD_TIC) {
> +        panic("dasd-ipl: IPL1 data invalid. Is this disk really bootable?\n");
> +    }
> +}
> +
> +static void check_ipl2(uint32_t ipl2_addr)
> +{
> +    Ccw0 *ccw = u32toptr(ipl2_addr);
> +
> +    if (ipl2_addr == 0x00) {
> +        panic("IPL2 address invalid. Is this disk really bootable?\n");
> +    }
> +    if (ccw->cmd_code == 0x00) {
> +        panic("IPL2 ccw data invalid. Is this disk really bootable?\n");
> +    }
> +}
> +
> +static uint32_t read_ipl2_addr(void)
> +{
> +    Ccw0 *ccwtic = (Ccw0 *)0x10;
> +
> +    return ccwtic->cda;
> +}
> +
> +static void ipl1_fixup(void)
> +{
> +    Ccw0 *ccwSeek = (Ccw0 *) 0x08;
> +    Ccw0 *ccwSearchID = (Ccw0 *) 0x10;
> +    Ccw0 *ccwSearchTic = (Ccw0 *) 0x18;
> +    Ccw0 *ccwRead = (Ccw0 *) 0x20;
> +    CcwSeekData *seekData = (CcwSeekData *) 0x30;
> +    CcwSearchIdData *searchData = (CcwSearchIdData *) 0x38;
> +
> +    /* move IPL1 CCWs to make room for CCWs needed to locate record 2 */
> +    memcpy(ccwRead, (void *)0x08, 16);
> +
> +    /* Disable chaining so we don't TIC to IPL2 channel program */
> +    ccwRead->chain = 0x00;
> +
> +    ccwSeek->cmd_code = CCW_CMD_DASD_SEEK;
> +    ccwSeek->cda = ptr2u32(seekData);
> +    ccwSeek->chain = 1;
> +    ccwSeek->count = sizeof(seekData);
> +    seekData->reserved = 0x00;
> +    seekData->cyl = 0x00;
> +    seekData->head = 0x00;
> +
> +    ccwSearchID->cmd_code = CCW_CMD_DASD_SEARCH_ID_EQ;
> +    ccwSearchID->cda = ptr2u32(searchData);
> +    ccwSearchID->chain = 1;
> +    ccwSearchID->count = sizeof(searchData);
> +    searchData->cyl = 0;
> +    searchData->head = 0;
> +    searchData->record = 2;
> +
> +    /* Go back to Search CCW if correct record not yet found */
> +    ccwSearchTic->cmd_code = CCW_CMD_TIC;
> +    ccwSearchTic->cda = ptr2u32(ccwSearchID);
> +}
> +
> +static void run_ipl1(SubChannelId schid)
> + {
> +    uint32_t startAddr = 0x08;
> +
> +    if (do_cio(schid, startAddr, CCW_FMT0)) {
> +        panic("dasd-ipl: Failed to run IPL1 channel program");
> +    }
> +}
> +
> +static void run_ipl2(SubChannelId schid, uint32_t addr)
> +{
> +
> +    if (run_dynamic_ccw_program(schid, addr)) {
> +        panic("dasd-ipl: Failed to run IPL2 channel program");
> +    }
> +}
> +
> +static void lpsw(void *psw_addr)
> +{
> +    PSWLegacy *pswl = (PSWLegacy *) psw_addr;
> +
> +    pswl->mask |= PSW_MASK_EAMODE;   /* Force z-mode */
> +    pswl->addr |= PSW_MASK_BAMODE;
> +    asm volatile("  llgtr 0,0\n llgtr 1,1\n"     /* Some OS's expect to be */
> +                 "  llgtr 2,2\n llgtr 3,3\n"     /* in 32-bit mode. Clear  */
> +                 "  llgtr 4,4\n llgtr 5,5\n"     /* high part of regs to   */
> +                 "  llgtr 6,6\n llgtr 7,7\n"     /* avoid messing up       */
> +                 "  llgtr 8,8\n llgtr 9,9\n"     /* instructions that work */
> +                 "  llgtr 10,10\n llgtr 11,11\n" /* in both addressing     */
> +                 "  llgtr 12,12\n llgtr 13,13\n" /* modes, like servc.     */
> +                 "  llgtr 14,14\n llgtr 15,15\n"
> +                 "  lpsw %0\n"
> +                 : : "Q" (*pswl) : "cc");
> +}
> +
> +/*
> + * Limitations in QEMU's CCW support complicate the IPL process. Details can
> + * be found in docs/devel/s390-dasd-ipl.txt
> + */
> +void dasd_ipl(SubChannelId schid)
> +{
> +    uint32_t ipl2_addr;
> +
> +    /* Construct Read IPL CCW and run it to read IPL1 from boot disk */
> +    make_readipl();
> +    run_readipl(schid);
> +    ipl2_addr = read_ipl2_addr();
> +    check_ipl1();
> +
> +    /*
> +     * Fixup IPL1 channel program to account for QEMU limitations, then run it
> +     * to read IPL2 channel program from boot disk.
> +     */
> +    ipl1_fixup();
> +    run_ipl1(schid);
> +    check_ipl2(ipl2_addr);
> +
> +    /*
> +     * Run IPL2 channel program to read operating system code from boot disk
> +     * then transfer control to the guest operating system
> +     */
> +    run_ipl2(schid, ipl2_addr);
> +    lpsw(0);
> +}
> diff --git a/pc-bios/s390-ccw/dasd-ipl.h b/pc-bios/s390-ccw/dasd-ipl.h
> new file mode 100644
> index 0000000..56bba82
> --- /dev/null
> +++ b/pc-bios/s390-ccw/dasd-ipl.h
> @@ -0,0 +1,16 @@
> +/*
> + * S390 IPL (boot) from a real DASD device via vfio framework.
> + *
> + * Copyright (c) 2018 Jason J. Herne <jjherne@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or (at
> + * your option) any later version. See the COPYING file in the top-level
> + * directory.
> + */
> +
> +#ifndef DASD_IPL_H
> +#define DASD_IPL_H
> +
> +void dasd_ipl(SubChannelId schid);
> +
> +#endif /* DASD_IPL_H */
> diff --git a/pc-bios/s390-ccw/main.c b/pc-bios/s390-ccw/main.c
> index e4236c0..2bccfa7 100644
> --- a/pc-bios/s390-ccw/main.c
> +++ b/pc-bios/s390-ccw/main.c
> @@ -13,6 +13,7 @@
>   #include "s390-ccw.h"
>   #include "cio.h"
>   #include "virtio.h"
> +#include "dasd-ipl.h"
>   
>   char stack[PAGE_SIZE * 8] __attribute__((__aligned__(PAGE_SIZE)));
>   static SubChannelId blk_schid = { .one = 1 };
> @@ -207,6 +208,9 @@ int main(void)
>       enable_subchannel(blk_schid);
>   
>       switch (cu_type(blk_schid)) {
> +    case 0x3990:  /* Real DASD device */
> +        dasd_ipl(blk_schid); /* no return */
> +        break;
>       case 0x3832:  /* Virtio device */
>           virtio_setup();
>           zipl_load(); /* no return */
> diff --git a/pc-bios/s390-ccw/s390-arch.h b/pc-bios/s390-ccw/s390-arch.h
> index 9074ba2..f36f610 100644
> --- a/pc-bios/s390-ccw/s390-arch.h
> +++ b/pc-bios/s390-ccw/s390-arch.h
> @@ -97,4 +97,17 @@ typedef struct LowCore {
>   
>   extern LowCore *lowcore;
>   
> +static inline void set_prefix(uint32_t address)
> +{
> +    asm volatile("spx %0" : : "m" (address) : "memory");
> +}
> +
> +static inline uint32_t store_prefix(void)
> +{
> +    uint32_t address;
> +
> +    asm volatile("stpx %0" : "=m" (address));
> +    return address;
> +}
> +
>   #endif
>
Halil Pasic July 18, 2018, 11:47 a.m. UTC | #7
On 07/18/2018 01:35 PM, Cornelia Huck wrote:
>> So to translate the new stuff we would actually have to stop the channel
>> program and resubmit the rest (either by suspend+resume or by breaking
>> the chain+ssch). The problem with that is that executing a channel program
>> composed of four ccws A,B,C,D is not the same as executing a channel
>> program composed of ccws A,B immediately followed by executing a channel
>> program composed of ccws C,D. I.e. it is not generally safe to break a
>> chain of ccws.
> Exploiting suspending would have been my idea. Probably combined with a
> new interface that fetches ccw-by-ccw.
> 
> But I don't think it makes sense to spend time thinking about this
> right now.
> 

IMHO exploiting suspending won't work, because the rsch starts a new
chain. If the suspend is a part of the original program the author
of it is responsible to make sure this ain't a problem. But if we
start setting the suspend flag ourselves, we may end up in trouble.

Regards,
Halil
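
To illustrate why starting a new chain is not transparent, consider a toy
device model in which a READ is only accepted if a SEARCH succeeded earlier in
the same chain. That rule is purely an assumption made for this sketch, as are
all toy_ names; which state a real control unit keeps across chains is not
something this thread establishes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_op { TOY_SEEK, TOY_SEARCH, TOY_READ };

/*
 * Toy device: a READ is accepted only if a SEARCH succeeded earlier in the
 * SAME chain. The per-chain reset of `oriented` is the assumption that makes
 * splitting visible; real control units may differ.
 * Returns the number of CCWs executed before the chain ended or was rejected.
 */
static size_t toy_run_chain(const enum toy_op *chain, size_t n)
{
    bool oriented = false;  /* state that does not survive the end of a chain */

    for (size_t i = 0; i < n; i++) {
        if (chain[i] == TOY_SEARCH) {
            oriented = true;
        } else if (chain[i] == TOY_READ && !oriented) {
            return i;       /* rejected: no orientation in this chain */
        }
    }
    return n;
}

/* Run chain[0..split) and chain[split..n) as two separate starts. */
static size_t toy_run_split(const enum toy_op *chain, size_t n, size_t split)
{
    assert(split <= n);
    size_t done = toy_run_chain(chain, split);
    if (done < split) {
        return done;
    }
    return done + toy_run_chain(chain + split, n - split);
}
```

With A,B,C,D = SEEK, SEARCH, READ, READ, the single chain executes all four
CCWs, while A,B followed by C,D stops after two, because the second start
begins without orientation in this model.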

Patch

diff --git a/docs/devel/s390-dasd-ipl.txt b/docs/devel/s390-dasd-ipl.txt
new file mode 100644
index 0000000..87aecb9
--- /dev/null
+++ b/docs/devel/s390-dasd-ipl.txt
@@ -0,0 +1,132 @@ 
+*****************************
+***** s390 hardware IPL *****
+*****************************
+
+The s390 hardware IPL process consists of the following steps.
+
+1. A READ IPL ccw is constructed in memory location 0x0.
+    This ccw, by definition, reads the IPL1 record which is located on the disk
+    at cylinder 0 track 0 record 1. Note that the chain flag is on in this ccw
+    so when it is complete another ccw will be fetched and executed from memory
+    location 0x08.
+
+2. Execute the Read IPL ccw at 0x00, thereby reading IPL1 data into 0x00.
+    IPL1 data is 24 bytes in length and consists of the following pieces of
+    information: [psw][read ccw][tic ccw]. When the machine executes the Read
+    IPL ccw it causes the 24 bytes of IPL1 to be read into memory starting at
+    location 0x0. Then the ccw program at 0x08, which consists of a read
+    ccw and a tic ccw, is automatically executed because of the chain flag from
+    the original READ IPL ccw. The read ccw will read the IPL2 data into memory
+    and the TIC (Transfer In Channel) will transfer control to the channel
+    program contained in the IPL2 data. The TIC channel command is the
+    equivalent of a branch/jump/goto instruction for channel programs.
+    NOTE: The ccws in IPL1 are defined by the architecture to be format 0.
+
+3. Execute IPL2.
+    The TIC ccw instruction at the end of the IPL1 channel program will begin
+    the execution of the IPL2 channel program. IPL2 is stage-2 of the boot
+    process and will contain a larger channel program than IPL1. The point of
+    IPL2 is to find and load either the operating system or a small program that
+    loads the operating system from disk. At the end of this step all or some of
+    the real operating system is loaded into memory and we are ready to hand
+    control over to the guest operating system. At this point the guest
+    operating system is entirely responsible for loading any more data it might
+    need to function. NOTE: The IPL2 channel program might read data into memory
+    location 0 thereby overwriting the IPL1 psw and channel program. This is ok
+    as long as the data placed in location 0 contains a psw whose instruction
+    address points to the guest operating system code to execute at the end of
+    the IPL/boot process.
+    NOTE: The ccws in IPL2 are defined by the architecture to be format 0.
+
+4. Start executing the guest operating system.
+    The psw that was loaded into memory location 0 as part of the ipl process
+    should contain the needed flags for the operating system we have loaded. The
+    psw's instruction address will point to the location in memory where we want
+    to start executing the operating system. This psw is loaded (via LPSW
+    instruction) causing control to be passed to the operating system code.
+
+In a non-virtualized environment this process, handled entirely by the hardware,
+is kicked off by the user initiating a "Load" procedure from the hardware
+management console. This "Load" procedure crafts a special "Read IPL" ccw in
+memory location 0x0 that reads IPL1. It then executes this ccw thereby kicking
+off the reading of IPL1 data. Since the channel program from IPL1 will be
+written immediately after the special "Read IPL" ccw, the IPL1 channel program
+will be executed immediately (the special read ccw has the chaining bit turned
+on). The TIC at the end of the IPL1 channel program will cause the IPL2 channel
+program to be executed automatically. After this sequence completes the "Load"
+procedure then loads the psw from 0x0.
+
+*****************************************
+***** How this all pertains to Qemu *****
+*****************************************
+
+In theory we should merely have to do the following to IPL/boot a guest
+operating system from a DASD device:
+
+1. Place a "Read IPL" ccw into memory location 0x0 with chaining bit on.
+2. Execute channel program at 0x0.
+3. LPSW 0x0.
+
+However, our emulation of the machine's channel program logic is missing one key
+feature that is required for this process to work: non-prefetch of ccw data.
+
+When we start a channel program we pass the channel subsystem parameters via an
+ORB (Operation Request Block). One of those parameters is a prefetch bit. If the
+bit is on then Qemu is allowed to read the entire channel program from guest
+memory before it starts executing it. This means that any channel commands that
+read additional channel commands will not work as expected because the newly
+read commands will only exist in guest memory and NOT within Qemu's channel
+subsystem memory. Qemu's channel subsystem's implementation currently requires
+this bit to be on for all channel programs. This is a problem because the IPL
+process consists of transferring control from the "Read IPL" ccw immediately to
+the IPL1 channel program that was read by "Read IPL".
+
+Not being able to turn off prefetch will also prevent the TIC at the end of the
+IPL1 channel program from transferring control to the IPL2 channel program.
+
+Lastly, in some cases (the zipl bootloader for example) the IPL2 program also
+transfers control to another channel program segment immediately after reading it
+from the disk. So we need to be able to handle this case.
+
+**************************
+***** What Qemu does *****
+**************************
+
+Since we are forced to live with prefetch we cannot use the very simple IPL
+procedure we defined in the preceding section. So we compensate by doing the
+following.
+
+1. Place "Read IPL" ccw into memory location 0x0, but turn off chaining bit.
+2. Execute "Read IPL" at 0x0.
+
+   So now IPL1's psw is at 0x0 and IPL1's channel program is at 0x08.
+
+3. Write a custom channel program that will seek to the IPL2 record and then
+   execute the READ and TIC ccws from IPL1.  Normally the seek is not required
+   because after reading the IPL1 record the disk is automatically positioned
+   to read the very next record which will be IPL2. But since we are not reading
+   both IPL1 and IPL2 as part of the same channel program we must manually set
+   the position.
+
+4. Grab the target address of the TIC instruction from the IPL1 channel program.
+   This address is where the IPL2 channel program starts.
+
+   Now IPL2 is loaded into memory somewhere, and we know the address.
+
+5. Execute the IPL2 channel program at the address obtained in step #4.
+
+   Because this channel program can be dynamic, we must use a special algorithm
+   that detects a READ immediately followed by a TIC and breaks the ccw chain
+   by turning off the chain bit in the READ ccw. When control is returned from
+   the kernel/hardware to the Qemu bios code we immediately issue another start
+   subchannel to execute the remaining TIC instruction. This causes the entire
+   channel program (starting from the TIC) and all needed data to be refetched
+   thereby stepping around the limitation that would otherwise prevent this
+   channel program from executing properly.
+
+   Now the operating system code is loaded somewhere in guest memory and the psw
+   in memory location 0x0 will point to entry code for the guest operating
+   system.
+
+6. LPSW 0x0.
+   LPSW transfers control to the guest operating system and we're done.
diff --git a/pc-bios/s390-ccw/Makefile b/pc-bios/s390-ccw/Makefile
index 12ad9c1..a048b6b 100644
--- a/pc-bios/s390-ccw/Makefile
+++ b/pc-bios/s390-ccw/Makefile
@@ -10,7 +10,7 @@  $(call set-vpath, $(SRC_PATH)/pc-bios/s390-ccw)
 .PHONY : all clean build-all
 
 OBJECTS = start.o main.o bootmap.o jump2ipl.o sclp.o menu.o \
-	  virtio.o virtio-scsi.o virtio-blkdev.o libc.o cio.o
+	  virtio.o virtio-scsi.o virtio-blkdev.o libc.o cio.o dasd-ipl.o
 
 QEMU_CFLAGS := $(filter -W%, $(QEMU_CFLAGS))
 QEMU_CFLAGS += -ffreestanding -fno-delete-null-pointer-checks -msoft-float
diff --git a/pc-bios/s390-ccw/dasd-ipl.c b/pc-bios/s390-ccw/dasd-ipl.c
new file mode 100644
index 0000000..e8510f5
--- /dev/null
+++ b/pc-bios/s390-ccw/dasd-ipl.c
@@ -0,0 +1,249 @@ 
+/*
+ * S390 IPL (boot) from a real DASD device via vfio framework.
+ *
+ * Copyright (c) 2018 Jason J. Herne <jjherne@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#include "libc.h"
+#include "s390-ccw.h"
+#include "s390-arch.h"
+#include "dasd-ipl.h"
+
+static char prefix_page[PAGE_SIZE * 2]
+            __attribute__((__aligned__(PAGE_SIZE * 2)));
+
+static void enable_prefixing(void)
+{
+    memcpy(&prefix_page, (void *)0, 4096);
+    set_prefix(ptr2u32(&prefix_page));
+}
+
+static void disable_prefixing(void)
+{
+    set_prefix(0);
+    /* Copy io interrupt info back to low core */
+    memcpy((void *)0xB8, prefix_page + 0xB8, 12);
+}
+
+static bool is_read_tic_ccw_chain(Ccw0 *ccw)
+{
+    Ccw0 *next_ccw = ccw + 1;
+
+    return ((ccw->cmd_code == CCW_CMD_DASD_READ ||
+            ccw->cmd_code == CCW_CMD_DASD_READ_MT) &&
+            ccw->chain && next_ccw->cmd_code == CCW_CMD_TIC);
+}
+
+static bool dynamic_cp_fixup(uint32_t ccw_addr, uint32_t  *next_cpa)
+{
+    Ccw0 *cur_ccw = (Ccw0 *)(uint64_t)ccw_addr;
+    Ccw0 *tic_ccw;
+
+    while (true) {
+        /* Skip over inline TIC (it might not have the chain bit on)  */
+        if (cur_ccw->cmd_code == CCW_CMD_TIC &&
+            cur_ccw->cda == ptr2u32(cur_ccw) - 8) {
+            cur_ccw += 1;
+            continue;
+        }
+
+        if (!cur_ccw->chain) {
+            break;
+        }
+        if (is_read_tic_ccw_chain(cur_ccw)) {
+            /*
+             * Breaking a chain of CCWs may alter the semantics or even the
+             * validity of a channel program. The heuristic implemented below
+             * seems to work well in practice for the channel programs
+             * generated by zipl.
+             */
+            tic_ccw = cur_ccw + 1;
+            *next_cpa = tic_ccw->cda;
+            cur_ccw->chain = 0;
+            return true;
+        }
+        cur_ccw += 1;
+    }
+    return false;
+}
+
+static int run_dynamic_ccw_program(SubChannelId schid, uint32_t cpa)
+{
+    bool has_next;
+    uint32_t next_cpa;
+    int rc;
+
+    do {
+        has_next = dynamic_cp_fixup(cpa, &next_cpa);
+
+        print_int("executing ccw chain at ", cpa);
+        enable_prefixing();
+        rc = do_cio(schid, cpa, CCW_FMT0);
+        disable_prefixing();
+
+        if (rc) {
+            break;
+        }
+        cpa = next_cpa;
+    } while (has_next);
+
+    return rc;
+}
+
+
+static void make_readipl(void)
+{
+    Ccw0 *ccwIplRead = (Ccw0 *)0x00;
+
+    /* Create Read IPL ccw at address 0 */
+    ccwIplRead->cmd_code = CCW_CMD_READ_IPL;
+    ccwIplRead->cda = 0x00; /* Read into address 0x00 in main memory */
+    ccwIplRead->chain = 0; /* Chain flag off: do not chain into IPL1 ccws */
+    ccwIplRead->count = 0x18; /* Read 0x18 bytes of data */
+}
+
+static void run_readipl(SubChannelId schid)
+{
+    if (do_cio(schid, 0x00, CCW_FMT0)) {
+        panic("dasd-ipl: Failed to run Read IPL channel program");
+    }
+}
+
+/*
+ * The architecture states that IPL1 data should consist of a psw followed by
+ * format-0 READ and TIC CCWs. Let's sanity check.
+ */
+static void check_ipl1(void)
+{
+    Ccw0 *ccwread = (Ccw0 *)0x08;
+    Ccw0 *ccwtic = (Ccw0 *)0x10;
+
+    if (ccwread->cmd_code != CCW_CMD_DASD_READ ||
+        ccwtic->cmd_code != CCW_CMD_TIC) {
+        panic("dasd-ipl: IPL1 data invalid. Is this disk really bootable?\n");
+    }
+}
+
+static void check_ipl2(uint32_t ipl2_addr)
+{
+    Ccw0 *ccw = u32toptr(ipl2_addr);
+
+    if (ipl2_addr == 0x00) {
+        panic("IPL2 address invalid. Is this disk really bootable?\n");
+    }
+    if (ccw->cmd_code == 0x00) {
+        panic("IPL2 ccw data invalid. Is this disk really bootable?\n");
+    }
+}
+
+static uint32_t read_ipl2_addr(void)
+{
+    Ccw0 *ccwtic = (Ccw0 *)0x10;
+
+    return ccwtic->cda;
+}
+
+static void ipl1_fixup(void)
+{
+    Ccw0 *ccwSeek = (Ccw0 *) 0x08;
+    Ccw0 *ccwSearchID = (Ccw0 *) 0x10;
+    Ccw0 *ccwSearchTic = (Ccw0 *) 0x18;
+    Ccw0 *ccwRead = (Ccw0 *) 0x20;
+    CcwSeekData *seekData = (CcwSeekData *) 0x30;
+    CcwSearchIdData *searchData = (CcwSearchIdData *) 0x38;
+
+    /* move IPL1 CCWs to make room for CCWs needed to locate record 2 */
+    memcpy(ccwRead, (void *)0x08, 16);
+
+    /* Disable chaining so we don't TIC to IPL2 channel program */
+    ccwRead->chain = 0x00;
+
+    ccwSeek->cmd_code = CCW_CMD_DASD_SEEK;
+    ccwSeek->cda = ptr2u32(seekData);
+    ccwSeek->chain = 1;
+    ccwSeek->count = sizeof(*seekData);
+    seekData->reserved = 0x00;
+    seekData->cyl = 0x00;
+    seekData->head = 0x00;
+
+    ccwSearchID->cmd_code = CCW_CMD_DASD_SEARCH_ID_EQ;
+    ccwSearchID->cda = ptr2u32(searchData);
+    ccwSearchID->chain = 1;
+    ccwSearchID->count = sizeof(*searchData);
+    searchData->cyl = 0;
+    searchData->head = 0;
+    searchData->record = 2;
+
+    /* Go back to Search CCW if correct record not yet found */
+    ccwSearchTic->cmd_code = CCW_CMD_TIC;
+    ccwSearchTic->cda = ptr2u32(ccwSearchID);
+}
+
+static void run_ipl1(SubChannelId schid)
+{
+    uint32_t startAddr = 0x08;
+
+    if (do_cio(schid, startAddr, CCW_FMT0)) {
+        panic("dasd-ipl: Failed to run IPL1 channel program");
+    }
+}
+
+static void run_ipl2(SubChannelId schid, uint32_t addr)
+{
+
+    if (run_dynamic_ccw_program(schid, addr)) {
+        panic("dasd-ipl: Failed to run IPL2 channel program");
+    }
+}
+
+static void lpsw(void *psw_addr)
+{
+    PSWLegacy *pswl = (PSWLegacy *) psw_addr;
+
+    pswl->mask |= PSW_MASK_EAMODE;   /* Force z-mode */
+    pswl->addr |= PSW_MASK_BAMODE;
+    asm volatile("  llgtr 0,0\n llgtr 1,1\n"     /* Some OS's expect to be */
+                 "  llgtr 2,2\n llgtr 3,3\n"     /* in 32-bit mode. Clear  */
+                 "  llgtr 4,4\n llgtr 5,5\n"     /* high part of regs to   */
+                 "  llgtr 6,6\n llgtr 7,7\n"     /* avoid messing up       */
+                 "  llgtr 8,8\n llgtr 9,9\n"     /* instructions that work */
+                 "  llgtr 10,10\n llgtr 11,11\n" /* in both addressing     */
+                 "  llgtr 12,12\n llgtr 13,13\n" /* modes, like servc.     */
+                 "  llgtr 14,14\n llgtr 15,15\n"
+                 "  lpsw %0\n"
+                 : : "Q" (*pswl) : "cc");
+}
+
+/*
+ * Limitations in QEMU's CCW support complicate the IPL process. Details can
+ * be found in docs/devel/s390-dasd-ipl.txt
+ */
+void dasd_ipl(SubChannelId schid)
+{
+    uint32_t ipl2_addr;
+
+    /* Construct Read IPL CCW and run it to read IPL1 from boot disk */
+    make_readipl();
+    run_readipl(schid);
+    ipl2_addr = read_ipl2_addr();
+    check_ipl1();
+
+    /*
+     * Fixup IPL1 channel program to account for QEMU limitations, then run it
+     * to read IPL2 channel program from boot disk.
+     */
+    ipl1_fixup();
+    run_ipl1(schid);
+    check_ipl2(ipl2_addr);
+
+    /*
+     * Run IPL2 channel program to read operating system code from boot disk
+     * then transfer control to the guest operating system
+     */
+    run_ipl2(schid, ipl2_addr);
+    lpsw(0);
+}
diff --git a/pc-bios/s390-ccw/dasd-ipl.h b/pc-bios/s390-ccw/dasd-ipl.h
new file mode 100644
index 0000000..56bba82
--- /dev/null
+++ b/pc-bios/s390-ccw/dasd-ipl.h
@@ -0,0 +1,16 @@ 
+/*
+ * S390 IPL (boot) from a real DASD device via vfio framework.
+ *
+ * Copyright (c) 2018 Jason J. Herne <jjherne@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#ifndef DASD_IPL_H
+#define DASD_IPL_H
+
+void dasd_ipl(SubChannelId schid);
+
+#endif /* DASD_IPL_H */
diff --git a/pc-bios/s390-ccw/main.c b/pc-bios/s390-ccw/main.c
index e4236c0..2bccfa7 100644
--- a/pc-bios/s390-ccw/main.c
+++ b/pc-bios/s390-ccw/main.c
@@ -13,6 +13,7 @@ 
 #include "s390-ccw.h"
 #include "cio.h"
 #include "virtio.h"
+#include "dasd-ipl.h"
 
 char stack[PAGE_SIZE * 8] __attribute__((__aligned__(PAGE_SIZE)));
 static SubChannelId blk_schid = { .one = 1 };
@@ -207,6 +208,9 @@  int main(void)
     enable_subchannel(blk_schid);
 
     switch (cu_type(blk_schid)) {
+    case 0x3990:  /* Real DASD device */
+        dasd_ipl(blk_schid); /* no return */
+        break;
     case 0x3832:  /* Virtio device */
         virtio_setup();
         zipl_load(); /* no return */
diff --git a/pc-bios/s390-ccw/s390-arch.h b/pc-bios/s390-ccw/s390-arch.h
index 9074ba2..f36f610 100644
--- a/pc-bios/s390-ccw/s390-arch.h
+++ b/pc-bios/s390-ccw/s390-arch.h
@@ -97,4 +97,17 @@  typedef struct LowCore {
 
 extern LowCore *lowcore;
 
+static inline void set_prefix(uint32_t address)
+{
+    asm volatile("spx %0" : : "m" (address) : "memory");
+}
+
+static inline uint32_t store_prefix(void)
+{
+    uint32_t address;
+
+    asm volatile("stpx %0" : "=m" (address));
+    return address;
+}
+
 #endif