diff mbox series

[5/5] s390x: Enable and document boot device fallback on panic

Message ID 20240529154311.734548-6-jrossi@linux.ibm.com
State New
Headers show
Series s390x: Add Full Boot Order Support | expand

Commit Message

Jared Rossi May 29, 2024, 3:43 p.m. UTC
From: Jared Rossi <jrossi@linux.ibm.com>

On a panic during IPL (i.e. a device failed to boot) check for another device
to boot from, as indicated by the presence of an unused IPLB.

If an IPLB is successfully loaded, then jump to the start of BIOS, restarting
IPL using the updated IPLB.  Otherwise enter disabled wait.

Signed-off-by: Jared Rossi <jrossi@linux.ibm.com>
---
 docs/system/bootindex.rst         | 7 ++++---
 docs/system/s390x/bootdevices.rst | 9 ++++++---
 pc-bios/s390-ccw/s390-ccw.h       | 6 ++++++
 3 files changed, 16 insertions(+), 6 deletions(-)

Comments

Thomas Huth June 5, 2024, 1:37 p.m. UTC | #1
On 29/05/2024 17.43, jrossi@linux.ibm.com wrote:
> From: Jared Rossi <jrossi@linux.ibm.com>
> 
> On a panic during IPL (i.e. a device failed to boot) check for another device
> to boot from, as indicated by the presence of an unused IPLB.
> 
> If an IPLB is successfully loaded, then jump to the start of BIOS, restarting
> IPL using the updated IPLB.  Otherwise enter disabled wait.
> 
> Signed-off-by: Jared Rossi <jrossi@linux.ibm.com>
> ---
>   docs/system/bootindex.rst         | 7 ++++---
>   docs/system/s390x/bootdevices.rst | 9 ++++++---
>   pc-bios/s390-ccw/s390-ccw.h       | 6 ++++++
>   3 files changed, 16 insertions(+), 6 deletions(-)

Could you please split the documentation changes into a separate patch in v2 
? ... I think that would be cleaner.

> diff --git a/docs/system/bootindex.rst b/docs/system/bootindex.rst
> index 8b057f812f..de597561bd 100644
> --- a/docs/system/bootindex.rst
> +++ b/docs/system/bootindex.rst
> @@ -50,9 +50,10 @@ Limitations
>   
>   Some firmware has limitations on which devices can be considered for
>   booting.  For instance, the PC BIOS boot specification allows only one
> -disk to be bootable.  If boot from disk fails for some reason, the BIOS
> -won't retry booting from other disk.  It can still try to boot from
> -floppy or net, though.
> +disk to be bootable, except for on s390x machines. If boot from disk fails for
> +some reason, the BIOS won't retry booting from other disk.  It can still try to
> +boot from floppy or net, though.  In the case of s390x, the BIOS will try up to
> +8 total devices, any number of which may be disks.

Since the old text was already talking about "PC BIOS", I'd rather leave 
that paragraph as it is (maybe just replace "PC BIOS" with "x86 PC BIOS"), 
and add a separate paragraph afterwards about s390x instead.

> diff --git a/pc-bios/s390-ccw/s390-ccw.h b/pc-bios/s390-ccw/s390-ccw.h
> index c977a52b50..de3d1f0d5a 100644
> --- a/pc-bios/s390-ccw/s390-ccw.h
> +++ b/pc-bios/s390-ccw/s390-ccw.h
> @@ -43,6 +43,7 @@ typedef unsigned long long u64;
>   #include "iplb.h"
>   
>   /* start.s */
> +extern char _start[];
>   void disabled_wait(void) __attribute__ ((__noreturn__));
>   void consume_sclp_int(void);
>   void consume_io_int(void);
> @@ -88,6 +89,11 @@ __attribute__ ((__noreturn__))
>   static inline void panic(const char *string)
>   {
>       sclp_print(string);
> +    if (load_next_iplb()) {
> +        sclp_print("\nTrying next boot device...");
> +        jump_to_IPL_code((long)_start);
> +    }
> +
>       disabled_wait();
>   }

Honestly, I am unsure whether this is a really cool idea or a very ugly hack 
... but I think I tend towards the latter, sorry. Jumping back to the 
startup code might cause various problem, e.g. pre-initialized variables 
don't get their values reset, causing different behavior when the s390-ccw 
bios runs a function a second time this way. Thus this sounds very fragile. 
Could we please try to get things cleaned up correctly, so that functions 
return with error codes instead of panicking when we can continue with 
another boot device? Even if its more work right now, I think this will be 
much more maintainable in the future.

  Thomas
Jared Rossi June 5, 2024, 2:48 p.m. UTC | #2
>> diff --git a/pc-bios/s390-ccw/s390-ccw.h b/pc-bios/s390-ccw/s390-ccw.h
>> index c977a52b50..de3d1f0d5a 100644
>> --- a/pc-bios/s390-ccw/s390-ccw.h
>> +++ b/pc-bios/s390-ccw/s390-ccw.h
>> @@ -43,6 +43,7 @@ typedef unsigned long long u64;
>>   #include "iplb.h"
>>     /* start.s */
>> +extern char _start[];
>>   void disabled_wait(void) __attribute__ ((__noreturn__));
>>   void consume_sclp_int(void);
>>   void consume_io_int(void);
>> @@ -88,6 +89,11 @@ __attribute__ ((__noreturn__))
>>   static inline void panic(const char *string)
>>   {
>>       sclp_print(string);
>> +    if (load_next_iplb()) {
>> +        sclp_print("\nTrying next boot device...");
>> +        jump_to_IPL_code((long)_start);
>> +    }
>> +
>>       disabled_wait();
>>   }
>
> Honestly, I am unsure whether this is a really cool idea or a very 
> ugly hack ... but I think I tend towards the latter, sorry. Jumping 
> back to the startup code might cause various problem, e.g. 
> pre-initialized variables don't get their values reset, causing 
> different behavior when the s390-ccw bios runs a function a second 
> time this way. Thus this sounds very fragile. Could we please try to 
> get things cleaned up correctly, so that functions return with error 
> codes instead of panicking when we can continue with another boot 
> device? Even if its more work right now, I think this will be much 
> more maintainable in the future.
>
>  Thomas
>

Thanks Thomas, I appreciate your insight.  Your hesitation is perfectly 
understandable as well.  My initial design was like you suggest, where 
the functions return instead of panic, but the issue I ran into is that 
netboot uses a separate image, which we jump in to at the start of IPL 
from a network device (see zipl_load() in pc-bios/s390-ccw/bootmap.c).  
I wasn't able to come up with a simple way to return to the main BIOS 
code if a netboot fails other than by jumping back.  So, it seems to me 
that netboot kind of throws a monkeywrench into the basic idea of 
reworking the panics into returns.

I'm open to suggestions on a better way to recover from a failed 
netboot, and it's certainly possible I've overlooked something, but as 
far as I can tell a jump is necessary in that particular case at least.  
Netboot could perhaps be handled as a special case where the jump back 
is permitted whereas other device types return, but I don't think that 
actually solves the main issue.

What are your thoughts on this?

Thanks,

Jared Rossi
Thomas Huth June 7, 2024, 5:57 a.m. UTC | #3
On 05/06/2024 16.48, Jared Rossi wrote:
> 
>>> diff --git a/pc-bios/s390-ccw/s390-ccw.h b/pc-bios/s390-ccw/s390-ccw.h
>>> index c977a52b50..de3d1f0d5a 100644
>>> --- a/pc-bios/s390-ccw/s390-ccw.h
>>> +++ b/pc-bios/s390-ccw/s390-ccw.h
>>> @@ -43,6 +43,7 @@ typedef unsigned long long u64;
>>>   #include "iplb.h"
>>>     /* start.s */
>>> +extern char _start[];
>>>   void disabled_wait(void) __attribute__ ((__noreturn__));
>>>   void consume_sclp_int(void);
>>>   void consume_io_int(void);
>>> @@ -88,6 +89,11 @@ __attribute__ ((__noreturn__))
>>>   static inline void panic(const char *string)
>>>   {
>>>       sclp_print(string);
>>> +    if (load_next_iplb()) {
>>> +        sclp_print("\nTrying next boot device...");
>>> +        jump_to_IPL_code((long)_start);
>>> +    }
>>> +
>>>       disabled_wait();
>>>   }
>>
>> Honestly, I am unsure whether this is a really cool idea or a very ugly 
>> hack ... but I think I tend towards the latter, sorry. Jumping back to the 
>> startup code might cause various problem, e.g. pre-initialized variables 
>> don't get their values reset, causing different behavior when the s390-ccw 
>> bios runs a function a second time this way. Thus this sounds very 
>> fragile. Could we please try to get things cleaned up correctly, so that 
>> functions return with error codes instead of panicking when we can 
>> continue with another boot device? Even if its more work right now, I 
>> think this will be much more maintainable in the future.
>>
>>  Thomas
>>
> 
> Thanks Thomas, I appreciate your insight.  Your hesitation is perfectly 
> understandable as well.  My initial design was like you suggest, where the 
> functions return instead of panic, but the issue I ran into is that netboot 
> uses a separate image, which we jump in to at the start of IPL from a 
> network device (see zipl_load() in pc-bios/s390-ccw/bootmap.c). I wasn't 
> able to come up with a simple way to return to the main BIOS code if a 
> netboot fails other than by jumping back.  So, it seems to me that netboot 
> kind of throws a monkeywrench into the basic idea of reworking the panics 
> into returns.
> 
> I'm open to suggestions on a better way to recover from a failed netboot, 
> and it's certainly possible I've overlooked something, but as far as I can 
> tell a jump is necessary in that particular case at least. Netboot could 
> perhaps be handled as a special case where the jump back is permitted 
> whereas other device types return, but I don't think that actually solves 
> the main issue.
> 
> What are your thoughts on this?

Yes, I agree that jumping is currently required to get back from the netboot 
code. So if you could rework your patches in a way that limits the jumping 
to a failed netboot, that would be acceptable, I think.

Apart from that: We originally decided to put the netboot code into a 
separate binary since the required roms/SLOF module might not always have 
been checked out (it needed to be done manually), so we were not able to 
compile it in all cases. But nowadays, this is handled in a much nicer way, 
the submodule is automatically checked out once you compile the 
s390x-softmmu target and have a s390x compiler available, so I wonder 
whether we should maybe do the next step and integrate the netboot code into 
the main s390-ccw.img now? Anybody got an opinion on this?

  Thomas
Jared Rossi June 16, 2024, 11:44 p.m. UTC | #4
On 6/7/24 1:57 AM, Thomas Huth wrote:
> On 05/06/2024 16.48, Jared Rossi wrote:
>>
>>>> diff --git a/pc-bios/s390-ccw/s390-ccw.h b/pc-bios/s390-ccw/s390-ccw.h
>>>> index c977a52b50..de3d1f0d5a 100644
>>>> --- a/pc-bios/s390-ccw/s390-ccw.h
>>>> +++ b/pc-bios/s390-ccw/s390-ccw.h
>>>> @@ -43,6 +43,7 @@ typedef unsigned long long u64;
>>>>   #include "iplb.h"
>>>>     /* start.s */
>>>> +extern char _start[];
>>>>   void disabled_wait(void) __attribute__ ((__noreturn__));
>>>>   void consume_sclp_int(void);
>>>>   void consume_io_int(void);
>>>> @@ -88,6 +89,11 @@ __attribute__ ((__noreturn__))
>>>>   static inline void panic(const char *string)
>>>>   {
>>>>       sclp_print(string);
>>>> +    if (load_next_iplb()) {
>>>> +        sclp_print("\nTrying next boot device...");
>>>> +        jump_to_IPL_code((long)_start);
>>>> +    }
>>>> +
>>>>       disabled_wait();
>>>>   }
>>>
>>> Honestly, I am unsure whether this is a really cool idea or a very 
>>> ugly hack ... but I think I tend towards the latter, sorry. Jumping 
>>> back to the startup code might cause various problem, e.g. 
>>> pre-initialized variables don't get their values reset, causing 
>>> different behavior when the s390-ccw bios runs a function a second 
>>> time this way. Thus this sounds very fragile. Could we please try to 
>>> get things cleaned up correctly, so that functions return with error 
>>> codes instead of panicking when we can continue with another boot 
>>> device? Even if its more work right now, I think this will be much 
>>> more maintainable in the future.
>>>
>>>  Thomas
>>>
>>
>> Thanks Thomas, I appreciate your insight.  Your hesitation is 
>> perfectly understandable as well.  My initial design was like you 
>> suggest, where the functions return instead of panic, but the issue I 
>> ran into is that netboot uses a separate image, which we jump in to 
>> at the start of IPL from a network device (see zipl_load() in 
>> pc-bios/s390-ccw/bootmap.c). I wasn't able to come up with a simple 
>> way to return to the main BIOS code if a netboot fails other than by 
>> jumping back.  So, it seems to me that netboot kind of throws a 
>> monkeywrench into the basic idea of reworking the panics into returns.
>>
>> I'm open to suggestions on a better way to recover from a failed 
>> netboot, and it's certainly possible I've overlooked something, but 
>> as far as I can tell a jump is necessary in that particular case at 
>> least. Netboot could perhaps be handled as a special case where the 
>> jump back is permitted whereas other device types return, but I don't 
>> think that actually solves the main issue.
>>
>> What are your thoughts on this?
>
> Yes, I agree that jumping is currently required to get back from the 
> netboot code. So if you could rework your patches in a way that limits 
> the jumping to a failed netboot, that would be acceptable, I think.
>
> Apart from that: We originally decided to put the netboot code into a 
> separate binary since the required roms/SLOF module might not always 
> have been checked out (it needed to be done manually), so we were not 
> able to compile it in all cases. But nowadays, this is handled in a 
> much nicer way, the submodule is automatically checked out once you 
> compile the s390x-softmmu target and have a s390x compiler available, 
> so I wonder whether we should maybe do the next step and integrate the 
> netboot code into the main s390-ccw.img now? Anybody got an opinion on 
> this?
>
>  Thomas
>

Hi Thomas,

I would generally defer the decision about integrating the netboot code 
to someone with more insight than me, but for what it's worth, I am of 
the opinion that if we want to rework all of panics into returns, then 
it would make the most sense to also do the integration now so that we 
can avoid using jump altogether.  Unless I'm missing something simple, I 
don't think the panic/return conversion will be trivial, and actually I 
think it will be quite invasive since there are dozens of calls to panic 
and assert that will need to be changed.   It doesn't seem worthwhile to 
do all of these conversions in order to avoid using jump, but then still 
being exposed to possible problems caused by jumping due to netboot 
requiring it anyway.

Regards,

Jared Rossi
Christian Borntraeger June 17, 2024, 2:49 p.m. UTC | #5
Am 05.06.24 um 15:37 schrieb Thomas Huth:
> On 29/05/2024 17.43, jrossi@linux.ibm.com wrote:
>> From: Jared Rossi <jrossi@linux.ibm.com>
>>
>> On a panic during IPL (i.e. a device failed to boot) check for another device
>> to boot from, as indicated by the presence of an unused IPLB.
>>
>> If an IPLB is successfully loaded, then jump to the start of BIOS, restarting
>> IPL using the updated IPLB.  Otherwise enter disabled wait.
>>
>> Signed-off-by: Jared Rossi <jrossi@linux.ibm.com>
>> ---
>>   docs/system/bootindex.rst         | 7 ++++---
>>   docs/system/s390x/bootdevices.rst | 9 ++++++---
>>   pc-bios/s390-ccw/s390-ccw.h       | 6 ++++++
>>   3 files changed, 16 insertions(+), 6 deletions(-)
> 
> Could you please split the documentation changes into a separate patch in v2 ? ... I think that would be cleaner.
> 
>> diff --git a/docs/system/bootindex.rst b/docs/system/bootindex.rst
>> index 8b057f812f..de597561bd 100644
>> --- a/docs/system/bootindex.rst
>> +++ b/docs/system/bootindex.rst
>> @@ -50,9 +50,10 @@ Limitations
>>   Some firmware has limitations on which devices can be considered for
>>   booting.  For instance, the PC BIOS boot specification allows only one
>> -disk to be bootable.  If boot from disk fails for some reason, the BIOS
>> -won't retry booting from other disk.  It can still try to boot from
>> -floppy or net, though.
>> +disk to be bootable, except for on s390x machines. If boot from disk fails for
>> +some reason, the BIOS won't retry booting from other disk.  It can still try to
>> +boot from floppy or net, though.  In the case of s390x, the BIOS will try up to
>> +8 total devices, any number of which may be disks.
> 
> Since the old text was already talking about "PC BIOS", I'd rather leave that paragraph as it is (maybe just replace "PC BIOS" with "x86 PC BIOS"), and add a separate paragraph afterwards about s390x instead.
> 
>> diff --git a/pc-bios/s390-ccw/s390-ccw.h b/pc-bios/s390-ccw/s390-ccw.h
>> index c977a52b50..de3d1f0d5a 100644
>> --- a/pc-bios/s390-ccw/s390-ccw.h
>> +++ b/pc-bios/s390-ccw/s390-ccw.h
>> @@ -43,6 +43,7 @@ typedef unsigned long long u64;
>>   #include "iplb.h"
>>   /* start.s */
>> +extern char _start[];
>>   void disabled_wait(void) __attribute__ ((__noreturn__));
>>   void consume_sclp_int(void);
>>   void consume_io_int(void);
>> @@ -88,6 +89,11 @@ __attribute__ ((__noreturn__))
>>   static inline void panic(const char *string)
>>   {
>>       sclp_print(string);
>> +    if (load_next_iplb()) {
>> +        sclp_print("\nTrying next boot device...");
>> +        jump_to_IPL_code((long)_start);
>> +    }
>> +
>>       disabled_wait();
>>   }
> 
> Honestly, I am unsure whether this is a really cool idea or a very ugly hack ... but I think I tend towards the latter, sorry. Jumping back to the startup code might cause various problem, e.g. pre-initialized variables don't get their values reset, causing different behavior when the s390-ccw bios runs a function a second time this way. 

We jump back to _start and to me it looks like that this code does the resetting of bss segment.
So anything that has a zero value this should be fine. But static variables != 0 are indeed tricky.
As far as I can see we do have some of those :-(

So instead of jumping, is there a way that remember somewhere at which device we are and then trigger a re-ipl to reload the BIOS?
Thomas Huth June 20, 2024, 8:10 a.m. UTC | #6
On 17/06/2024 01.44, Jared Rossi wrote:
> 
> 
> On 6/7/24 1:57 AM, Thomas Huth wrote:
>> On 05/06/2024 16.48, Jared Rossi wrote:
>>>
>>>>> diff --git a/pc-bios/s390-ccw/s390-ccw.h b/pc-bios/s390-ccw/s390-ccw.h
>>>>> index c977a52b50..de3d1f0d5a 100644
>>>>> --- a/pc-bios/s390-ccw/s390-ccw.h
>>>>> +++ b/pc-bios/s390-ccw/s390-ccw.h
>>>>> @@ -43,6 +43,7 @@ typedef unsigned long long u64;
>>>>>   #include "iplb.h"
>>>>>     /* start.s */
>>>>> +extern char _start[];
>>>>>   void disabled_wait(void) __attribute__ ((__noreturn__));
>>>>>   void consume_sclp_int(void);
>>>>>   void consume_io_int(void);
>>>>> @@ -88,6 +89,11 @@ __attribute__ ((__noreturn__))
>>>>>   static inline void panic(const char *string)
>>>>>   {
>>>>>       sclp_print(string);
>>>>> +    if (load_next_iplb()) {
>>>>> +        sclp_print("\nTrying next boot device...");
>>>>> +        jump_to_IPL_code((long)_start);
>>>>> +    }
>>>>> +
>>>>>       disabled_wait();
>>>>>   }
>>>>
>>>> Honestly, I am unsure whether this is a really cool idea or a very ugly 
>>>> hack ... but I think I tend towards the latter, sorry. Jumping back to 
>>>> the startup code might cause various problem, e.g. pre-initialized 
>>>> variables don't get their values reset, causing different behavior when 
>>>> the s390-ccw bios runs a function a second time this way. Thus this 
>>>> sounds very fragile. Could we please try to get things cleaned up 
>>>> correctly, so that functions return with error codes instead of 
>>>> panicking when we can continue with another boot device? Even if its 
>>>> more work right now, I think this will be much more maintainable in the 
>>>> future.
>>>>
>>>>  Thomas
>>>>
>>>
>>> Thanks Thomas, I appreciate your insight.  Your hesitation is perfectly 
>>> understandable as well.  My initial design was like you suggest, where 
>>> the functions return instead of panic, but the issue I ran into is that 
>>> netboot uses a separate image, which we jump in to at the start of IPL 
>>> from a network device (see zipl_load() in pc-bios/s390-ccw/bootmap.c). I 
>>> wasn't able to come up with a simple way to return to the main BIOS code 
>>> if a netboot fails other than by jumping back.  So, it seems to me that 
>>> netboot kind of throws a monkeywrench into the basic idea of reworking 
>>> the panics into returns.
>>>
>>> I'm open to suggestions on a better way to recover from a failed netboot, 
>>> and it's certainly possible I've overlooked something, but as far as I 
>>> can tell a jump is necessary in that particular case at least. Netboot 
>>> could perhaps be handled as a special case where the jump back is 
>>> permitted whereas other device types return, but I don't think that 
>>> actually solves the main issue.
>>>
>>> What are your thoughts on this?
>>
>> Yes, I agree that jumping is currently required to get back from the 
>> netboot code. So if you could rework your patches in a way that limits the 
>> jumping to a failed netboot, that would be acceptable, I think.
>>
>> Apart from that: We originally decided to put the netboot code into a 
>> separate binary since the required roms/SLOF module might not always have 
>> been checked out (it needed to be done manually), so we were not able to 
>> compile it in all cases. But nowadays, this is handled in a much nicer 
>> way, the submodule is automatically checked out once you compile the 
>> s390x-softmmu target and have a s390x compiler available, so I wonder 
>> whether we should maybe do the next step and integrate the netboot code 
>> into the main s390-ccw.img now? Anybody got an opinion on this?
>>
>>  Thomas
>>
> 
> Hi Thomas,
> 
> I would generally defer the decision about integrating the netboot code to 
> someone with more insight than me, but for what it's worth, I am of the 
> opinion that if we want to rework all of panics into returns, then it would 
> make the most sense to also do the integration now so that we can avoid 
> using jump altogether.  Unless I'm missing something simple, I don't think 
> the panic/return conversion will be trivial, and actually I think it will be 
> quite invasive since there are dozens of calls to panic and assert that will 
> need to be changed.   It doesn't seem worthwhile to do all of these 
> conversions in order to avoid using jump, but then still being exposed to 
> possible problems caused by jumping due to netboot requiring it anyway.

Agreed, we should either do it right and merge the two binaries, or it does 
not make too much sense to only partly convert the code.

I can look into merging the two binaries, but it might also take some time. 
So for the time being, I'm fine if we include the panic-jumping hack for 
now, we can still then clean it up later.

  Thomas
Thomas Huth June 20, 2024, 8:14 a.m. UTC | #7
On 17/06/2024 16.49, Christian Borntraeger wrote:
> 
> 
> Am 05.06.24 um 15:37 schrieb Thomas Huth:
>> On 29/05/2024 17.43, jrossi@linux.ibm.com wrote:
>>> From: Jared Rossi <jrossi@linux.ibm.com>
>>>
>>> On a panic during IPL (i.e. a device failed to boot) check for another 
>>> device
>>> to boot from, as indicated by the presence of an unused IPLB.
>>>
>>> If an IPLB is successfully loaded, then jump to the start of BIOS, 
>>> restarting
>>> IPL using the updated IPLB.  Otherwise enter disabled wait.
>>>
>>> Signed-off-by: Jared Rossi <jrossi@linux.ibm.com>
>>> ---
>>>   docs/system/bootindex.rst         | 7 ++++---
>>>   docs/system/s390x/bootdevices.rst | 9 ++++++---
>>>   pc-bios/s390-ccw/s390-ccw.h       | 6 ++++++
>>>   3 files changed, 16 insertions(+), 6 deletions(-)
>>
>> Could you please split the documentation changes into a separate patch in 
>> v2 ? ... I think that would be cleaner.
>>
>>> diff --git a/docs/system/bootindex.rst b/docs/system/bootindex.rst
>>> index 8b057f812f..de597561bd 100644
>>> --- a/docs/system/bootindex.rst
>>> +++ b/docs/system/bootindex.rst
>>> @@ -50,9 +50,10 @@ Limitations
>>>   Some firmware has limitations on which devices can be considered for
>>>   booting.  For instance, the PC BIOS boot specification allows only one
>>> -disk to be bootable.  If boot from disk fails for some reason, the BIOS
>>> -won't retry booting from other disk.  It can still try to boot from
>>> -floppy or net, though.
>>> +disk to be bootable, except for on s390x machines. If boot from disk 
>>> fails for
>>> +some reason, the BIOS won't retry booting from other disk.  It can still 
>>> try to
>>> +boot from floppy or net, though.  In the case of s390x, the BIOS will 
>>> try up to
>>> +8 total devices, any number of which may be disks.
>>
>> Since the old text was already talking about "PC BIOS", I'd rather leave 
>> that paragraph as it is (maybe just replace "PC BIOS" with "x86 PC BIOS"), 
>> and add a separate paragraph afterwards about s390x instead.
>>
>>> diff --git a/pc-bios/s390-ccw/s390-ccw.h b/pc-bios/s390-ccw/s390-ccw.h
>>> index c977a52b50..de3d1f0d5a 100644
>>> --- a/pc-bios/s390-ccw/s390-ccw.h
>>> +++ b/pc-bios/s390-ccw/s390-ccw.h
>>> @@ -43,6 +43,7 @@ typedef unsigned long long u64;
>>>   #include "iplb.h"
>>>   /* start.s */
>>> +extern char _start[];
>>>   void disabled_wait(void) __attribute__ ((__noreturn__));
>>>   void consume_sclp_int(void);
>>>   void consume_io_int(void);
>>> @@ -88,6 +89,11 @@ __attribute__ ((__noreturn__))
>>>   static inline void panic(const char *string)
>>>   {
>>>       sclp_print(string);
>>> +    if (load_next_iplb()) {
>>> +        sclp_print("\nTrying next boot device...");
>>> +        jump_to_IPL_code((long)_start);
>>> +    }
>>> +
>>>       disabled_wait();
>>>   }
>>
>> Honestly, I am unsure whether this is a really cool idea or a very ugly 
>> hack ... but I think I tend towards the latter, sorry. Jumping back to the 
>> startup code might cause various problem, e.g. pre-initialized variables 
>> don't get their values reset, causing different behavior when the s390-ccw 
>> bios runs a function a second time this way. 
> 
> We jump back to _start and to me it looks like that this code does the 
> resetting of bss segment.
> So anything that has a zero value this should be fine. But static variables 
> != 0 are indeed tricky.
> As far as I can see we do have some of those :-(
> 
> So instead of jumping, is there a way that remember somewhere at which 
> device we are and then trigger a re-ipl to reload the BIOS?

If there is an easy way, this could maybe an option, but in the long run, 
I'd really prefer if we'd merge the binaries and get rid of such tricks, 
since this makes the code flow quite hard to understand and maybe also more 
difficult to debug if you run into problems later.

  Thomas
diff mbox series

Patch

diff --git a/docs/system/bootindex.rst b/docs/system/bootindex.rst
index 8b057f812f..de597561bd 100644
--- a/docs/system/bootindex.rst
+++ b/docs/system/bootindex.rst
@@ -50,9 +50,10 @@  Limitations
 
 Some firmware has limitations on which devices can be considered for
 booting.  For instance, the PC BIOS boot specification allows only one
-disk to be bootable.  If boot from disk fails for some reason, the BIOS
-won't retry booting from other disk.  It can still try to boot from
-floppy or net, though.
+disk to be bootable, except for on s390x machines. If boot from disk fails for
+some reason, the BIOS won't retry booting from other disk.  It can still try to
+boot from floppy or net, though.  In the case of s390x, the BIOS will try up to
+8 total devices, any number of which may be disks.
 
 Sometimes, firmware cannot map the device path QEMU wants firmware to
 boot from to a boot method.  It doesn't happen for devices the firmware
diff --git a/docs/system/s390x/bootdevices.rst b/docs/system/s390x/bootdevices.rst
index 1a7a18b43b..f096e1cc06 100644
--- a/docs/system/s390x/bootdevices.rst
+++ b/docs/system/s390x/bootdevices.rst
@@ -6,9 +6,7 @@  Booting with bootindex parameter
 
 For classical mainframe guests (i.e. LPAR or z/VM installations), you always
 have to explicitly specify the disk where you want to boot from (or "IPL" from,
-in s390x-speak -- IPL means "Initial Program Load"). In particular, there can
-also be only one boot device according to the architecture specification, thus
-specifying multiple boot devices is not possible (yet).
+in s390x-speak -- IPL means "Initial Program Load").
 
 So for booting an s390x guest in QEMU, you should always mark the
 device where you want to boot from with the ``bootindex`` property, for
@@ -17,6 +15,11 @@  example::
  qemu-system-s390x -drive if=none,id=dr1,file=guest.qcow2 \
                    -device virtio-blk,drive=dr1,bootindex=1
 
+Multiple devices may have a bootindex. The lowest bootindex is assigned to the
+device to IPL first.  If the IPL fails for the first, the device with the second
+lowest bootindex will be tried and so on until IPL is successful or there are no
+remaining boot devices to try.
+
 For booting from a CD-ROM ISO image (which needs to include El-Torito boot
 information in order to be bootable), it is recommended to specify a ``scsi-cd``
 device, for example like this::
diff --git a/pc-bios/s390-ccw/s390-ccw.h b/pc-bios/s390-ccw/s390-ccw.h
index c977a52b50..de3d1f0d5a 100644
--- a/pc-bios/s390-ccw/s390-ccw.h
+++ b/pc-bios/s390-ccw/s390-ccw.h
@@ -43,6 +43,7 @@  typedef unsigned long long u64;
 #include "iplb.h"
 
 /* start.s */
+extern char _start[];
 void disabled_wait(void) __attribute__ ((__noreturn__));
 void consume_sclp_int(void);
 void consume_io_int(void);
@@ -88,6 +89,11 @@  __attribute__ ((__noreturn__))
 static inline void panic(const char *string)
 {
     sclp_print(string);
+    if (load_next_iplb()) {
+        sclp_print("\nTrying next boot device...");
+        jump_to_IPL_code((long)_start);
+    }
+
     disabled_wait();
 }