diff mbox series

[v7,10/18] jobs: rename static functions called with job_mutex held

Message ID 20220616131835.2004262-11-eesposit@redhat.com
State New
Headers show
Series job: replace AioContext lock with job_mutex | expand

Commit Message

Emanuele Giuseppe Esposito June 16, 2022, 1:18 p.m. UTC
With the *nop* job_lock/unlock placed, rename the static
functions that are always under job_mutex, adding "_locked" suffix.

List of functions that get this suffix:
job_txn_ref		   job_txn_del_job
job_txn_apply		   job_state_transition
job_should_pause	   job_event_cancelled
job_event_completed	   job_event_pending
job_event_ready		   job_event_idle
job_do_yield		   job_timer_not_pending
job_do_dismiss		   job_conclude
job_update_rc		   job_commit
job_abort		   job_clean
job_finalize_single	   job_cancel_async
job_completed_txn_abort	   job_prepare
job_needs_finalize	   job_do_finalize
job_transition_to_pending  job_completed_txn_success
job_completed		   job_cancel_err
job_force_cancel_err

Note that "locked" refers to the *nop* job_lock/unlock, and not
real_job_lock/unlock.

No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 job.c | 247 +++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 141 insertions(+), 106 deletions(-)

Comments

Vladimir Sementsov-Ogievskiy June 21, 2022, 5:26 p.m. UTC | #1
On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
> With the*nop*  job_lock/unlock placed, rename the static
> functions that are always under job_mutex, adding "_locked" suffix.
> 
> List of functions that get this suffix:
> job_txn_ref		   job_txn_del_job
> job_txn_apply		   job_state_transition
> job_should_pause	   job_event_cancelled
> job_event_completed	   job_event_pending
> job_event_ready		   job_event_idle
> job_do_yield		   job_timer_not_pending
> job_do_dismiss		   job_conclude
> job_update_rc		   job_commit
> job_abort		   job_clean
> job_finalize_single	   job_cancel_async
> job_completed_txn_abort	   job_prepare
> job_needs_finalize	   job_do_finalize
> job_transition_to_pending  job_completed_txn_success
> job_completed		   job_cancel_err
> job_force_cancel_err
> 
> Note that "locked" refers to the*nop*  job_lock/unlock, and not
> real_job_lock/unlock.
> 
> No functional change intended.
> 
> Signed-off-by: Emanuele Giuseppe Esposito<eesposit@redhat.com>


Hmm. Maybe it was already discussed.. But for me it seems, that it would be simpler to review previous patches, that fix job_ API users to use locking properly, if this renaming go earlier.

Anyway, in this series, we can't update everything at once. So patch to patch, we make the code more and more correct. (yes I remember that lock() is a noop, but I should review thinking that it real, otherwise, how to review?)

So, I'm saying about formal correctness of using lock() unlock() function in connection with introduced _locked prifixes and in connection with how it should finally work.

You do:

05. introduce some _locked functions, that just duplicates, and job_pause_point_locked() is formally inconsistent, as I said.

06. Update a lot of places, to give them their final form (but not final, as some functions will be renamed to _locked, some not, hard to imagine)

07,08,09. Update some more, and even more places. very hard to track formal correctness of using locks

10-...: rename APIs.


What do you think about the following:

1. Introduce noop lock, and some internal _locked() versions, and keep formal consistency inside job.c, considering all public interfaces as unlocked:

  at this point:
   - everything correct inside job.c
   - no public interfaces with _locked prefix
   - all public interfaces take mutex internally
   - no external user take mutex by hand

We can rename all internal static functions at this step too.

2. Introduce some public _locked APIs, that we'll use in next patches

3. Now start fixing external users in several patches:
   
   - protect by mutex direct use of job fields
   - make wider locks and move to _locked APIs inside them where needed


In this scenario, every updated unit becomes formally correct after update, and after all steps everything is formally correct, and we can move to turning-on the mutex.
Emanuele Giuseppe Esposito June 22, 2022, 2:26 p.m. UTC | #2
Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:
> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
>> With the*nop*  job_lock/unlock placed, rename the static
>> functions that are always under job_mutex, adding "_locked" suffix.
>>
>> List of functions that get this suffix:
>> job_txn_ref           job_txn_del_job
>> job_txn_apply           job_state_transition
>> job_should_pause       job_event_cancelled
>> job_event_completed       job_event_pending
>> job_event_ready           job_event_idle
>> job_do_yield           job_timer_not_pending
>> job_do_dismiss           job_conclude
>> job_update_rc           job_commit
>> job_abort           job_clean
>> job_finalize_single       job_cancel_async
>> job_completed_txn_abort       job_prepare
>> job_needs_finalize       job_do_finalize
>> job_transition_to_pending  job_completed_txn_success
>> job_completed           job_cancel_err
>> job_force_cancel_err
>>
>> Note that "locked" refers to the*nop*  job_lock/unlock, and not
>> real_job_lock/unlock.
>>
>> No functional change intended.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito<eesposit@redhat.com>
> 
> 
> Hmm. Maybe it was already discussed.. But for me it seems, that it would
> be simpler to review previous patches, that fix job_ API users to use
> locking properly, if this renaming go earlier.
> 
> Anyway, in this series, we can't update everything at once. So patch to
> patch, we make the code more and more correct. (yes I remember that
> lock() is a noop, but I should review thinking that it real, otherwise,
> how to review?)
> 
> So, I'm saying about formal correctness of using lock() unlock()
> function in connection with introduced _locked prifixes and in
> connection with how it should finally work.
> 
> You do:
> 
> 05. introduce some _locked functions, that just duplicates, and
> job_pause_point_locked() is formally inconsistent, as I said.
> 
> 06. Update a lot of places, to give them their final form (but not
> final, as some functions will be renamed to _locked, some not, hard to
> imagine)
> 
> 07,08,09. Update some more, and even more places. very hard to track
> formal correctness of using locks
> 
> 10-...: rename APIs.
> 
> 
> What do you think about the following:
> 
> 1. Introduce noop lock, and some internal _locked() versions, and keep
> formal consistency inside job.c, considering all public interfaces as
> unlocked:
> 
>  at this point:
>   - everything correct inside job.c
>   - no public interfaces with _locked prefix
>   - all public interfaces take mutex internally
>   - no external user take mutex by hand
> 
> We can rename all internal static functions at this step too.
> 
> 2. Introduce some public _locked APIs, that we'll use in next patches
> 
> 3. Now start fixing external users in several patches:
>     - protect by mutex direct use of job fields
>   - make wider locks and move to _locked APIs inside them where needed
> 
> 
> In this scenario, every updated unit becomes formally correct after
> update, and after all steps everything is formally correct, and we can
> move to turning-on the mutex.
> 

I don't understand your logic also here, sorry :(

I assume you want to keep patch 1-4, then the problem is assing job_lock
and renaming functions in _locked.
So I would say the problem is in patch 5-6-10-11-12-13. All the others
should be self contained.

I understand patch 5 is a little hard to follow.

Now, I am not sure what you propose here but it seems that the end goal
is to just have the same result, but with additional intermediate steps
that are just "do this just because in the next patch will be useful".
I think the problem is that we are going to miss the "why we need the
lock" logic in the patches if we do so.

The logic I tried to convey in this order is the following:
- job.h: add _locked duplicates for job API functions called with and
without job_mutex
	Just create duplicates of functions

- jobs: protect jobs with job_lock/unlock
	QMP and monitor functions call APIs that assume lock is taken,
	drivers must take explicitly the lock

- jobs: rename static functions called with job_mutex held
- job.h: rename job API functions called with job_mutex held
- block_job: rename block_job functions called with job_mutex held
	*given* that some functions are always under lock, transform
	them in _locked. Requires the job_lock/unlock patch

- job.h: define unlocked functions
	Comments on the public functions that are not _locked


@Kevin, since you also had some feedbacks on the patch ordering, do you
agree with this ordering or you have some other ideas?

Following your suggestion, we could move patches 10-11-12-13 before
patch 6 "jobs: protect jobs with job_lock/unlock".

(Apologies for changing my mind, but being the second complain I am
starting to reconsider reordering the patches).

Thank you,
Emanuele
Vladimir Sementsov-Ogievskiy June 22, 2022, 6:38 p.m. UTC | #3
On 6/22/22 17:26, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:
>> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
>>> With the*nop*  job_lock/unlock placed, rename the static
>>> functions that are always under job_mutex, adding "_locked" suffix.
>>>
>>> List of functions that get this suffix:
>>> job_txn_ref           job_txn_del_job
>>> job_txn_apply           job_state_transition
>>> job_should_pause       job_event_cancelled
>>> job_event_completed       job_event_pending
>>> job_event_ready           job_event_idle
>>> job_do_yield           job_timer_not_pending
>>> job_do_dismiss           job_conclude
>>> job_update_rc           job_commit
>>> job_abort           job_clean
>>> job_finalize_single       job_cancel_async
>>> job_completed_txn_abort       job_prepare
>>> job_needs_finalize       job_do_finalize
>>> job_transition_to_pending  job_completed_txn_success
>>> job_completed           job_cancel_err
>>> job_force_cancel_err
>>>
>>> Note that "locked" refers to the*nop*  job_lock/unlock, and not
>>> real_job_lock/unlock.
>>>
>>> No functional change intended.
>>>
>>> Signed-off-by: Emanuele Giuseppe Esposito<eesposit@redhat.com>
>>
>>
>> Hmm. Maybe it was already discussed.. But for me it seems, that it would
>> be simpler to review previous patches, that fix job_ API users to use
>> locking properly, if this renaming go earlier.
>>
>> Anyway, in this series, we can't update everything at once. So patch to
>> patch, we make the code more and more correct. (yes I remember that
>> lock() is a noop, but I should review thinking that it real, otherwise,
>> how to review?)
>>
>> So, I'm saying about formal correctness of using lock() unlock()
>> function in connection with introduced _locked prifixes and in
>> connection with how it should finally work.
>>
>> You do:
>>
>> 05. introduce some _locked functions, that just duplicates, and
>> job_pause_point_locked() is formally inconsistent, as I said.
>>
>> 06. Update a lot of places, to give them their final form (but not
>> final, as some functions will be renamed to _locked, some not, hard to
>> imagine)
>>
>> 07,08,09. Update some more, and even more places. very hard to track
>> formal correctness of using locks
>>
>> 10-...: rename APIs.
>>
>>
>> What do you think about the following:
>>
>> 1. Introduce noop lock, and some internal _locked() versions, and keep
>> formal consistency inside job.c, considering all public interfaces as
>> unlocked:
>>
>>   at this point:
>>    - everything correct inside job.c
>>    - no public interfaces with _locked prefix
>>    - all public interfaces take mutex internally
>>    - no external user take mutex by hand
>>
>> We can rename all internal static functions at this step too.
>>
>> 2. Introduce some public _locked APIs, that we'll use in next patches
>>
>> 3. Now start fixing external users in several patches:
>>      - protect by mutex direct use of job fields
>>    - make wider locks and move to _locked APIs inside them where needed
>>
>>
>> In this scenario, every updated unit becomes formally correct after
>> update, and after all steps everything is formally correct, and we can
>> move to turning-on the mutex.
>>
> 
> I don't understand your logic also here, sorry :(
> 
> I assume you want to keep patch 1-4, then the problem is assing job_lock
> and renaming functions in _locked.
> So I would say the problem is in patch 5-6-10-11-12-13. All the others
> should be self contained.
> 
> I understand patch 5 is a little hard to follow.
> 
> Now, I am not sure what you propose here but it seems that the end goal
> is to just have the same result, but with additional intermediate steps
> that are just "do this just because in the next patch will be useful".
> I think the problem is that we are going to miss the "why we need the
> lock" logic in the patches if we do so.
> 
> The logic I tried to convey in this order is the following:
> - job.h: add _locked duplicates for job API functions called with and
> without job_mutex
> 	Just create duplicates of functions
> 
> - jobs: protect jobs with job_lock/unlock
> 	QMP and monitor functions call APIs that assume lock is taken,
> 	drivers must take explicitly the lock
> 
> - jobs: rename static functions called with job_mutex held
> - job.h: rename job API functions called with job_mutex held
> - block_job: rename block_job functions called with job_mutex held
> 	*given* that some functions are always under lock, transform
> 	them in _locked. Requires the job_lock/unlock patch
> 
> - job.h: define unlocked functions
> 	Comments on the public functions that are not _locked
> 
> 
> @Kevin, since you also had some feedbacks on the patch ordering, do you
> agree with this ordering or you have some other ideas?
> 
> Following your suggestion, we could move patches 10-11-12-13 before
> patch 6 "jobs: protect jobs with job_lock/unlock".
> 
> (Apologies for changing my mind, but being the second complain I am
> starting to reconsider reordering the patches).
> 

In two words, what I mean: let's keep the following invariant from patch to patch:

1. Function that has _locked() prefix is always called with lock held
2. Function that has _locked() prefix never calls functions that take lock by themselves so that would dead-lock
3. Function that is documented as "called with lock not held" is never called with lock held

That what I mean by "formal correctness": yes, we know that lock is noop, but still let's keep code logic to correspond function naming and comments that we add.
Emanuele Giuseppe Esposito June 23, 2022, 9:08 a.m. UTC | #4
Am 22/06/2022 um 20:38 schrieb Vladimir Sementsov-Ogievskiy:
> On 6/22/22 17:26, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:
>>> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
>>>> With the*nop*  job_lock/unlock placed, rename the static
>>>> functions that are always under job_mutex, adding "_locked" suffix.
>>>>
>>>> List of functions that get this suffix:
>>>> job_txn_ref           job_txn_del_job
>>>> job_txn_apply           job_state_transition
>>>> job_should_pause       job_event_cancelled
>>>> job_event_completed       job_event_pending
>>>> job_event_ready           job_event_idle
>>>> job_do_yield           job_timer_not_pending
>>>> job_do_dismiss           job_conclude
>>>> job_update_rc           job_commit
>>>> job_abort           job_clean
>>>> job_finalize_single       job_cancel_async
>>>> job_completed_txn_abort       job_prepare
>>>> job_needs_finalize       job_do_finalize
>>>> job_transition_to_pending  job_completed_txn_success
>>>> job_completed           job_cancel_err
>>>> job_force_cancel_err
>>>>
>>>> Note that "locked" refers to the*nop*  job_lock/unlock, and not
>>>> real_job_lock/unlock.
>>>>
>>>> No functional change intended.
>>>>
>>>> Signed-off-by: Emanuele Giuseppe Esposito<eesposit@redhat.com>
>>>
>>>
>>> Hmm. Maybe it was already discussed.. But for me it seems, that it would
>>> be simpler to review previous patches, that fix job_ API users to use
>>> locking properly, if this renaming go earlier.
>>>
>>> Anyway, in this series, we can't update everything at once. So patch to
>>> patch, we make the code more and more correct. (yes I remember that
>>> lock() is a noop, but I should review thinking that it real, otherwise,
>>> how to review?)
>>>
>>> So, I'm saying about formal correctness of using lock() unlock()
>>> function in connection with introduced _locked prifixes and in
>>> connection with how it should finally work.
>>>
>>> You do:
>>>
>>> 05. introduce some _locked functions, that just duplicates, and
>>> job_pause_point_locked() is formally inconsistent, as I said.
>>>
>>> 06. Update a lot of places, to give them their final form (but not
>>> final, as some functions will be renamed to _locked, some not, hard to
>>> imagine)
>>>
>>> 07,08,09. Update some more, and even more places. very hard to track
>>> formal correctness of using locks
>>>
>>> 10-...: rename APIs.
>>>
>>>
>>> What do you think about the following:
>>>
>>> 1. Introduce noop lock, and some internal _locked() versions, and keep
>>> formal consistency inside job.c, considering all public interfaces as
>>> unlocked:
>>>
>>>   at this point:
>>>    - everything correct inside job.c
>>>    - no public interfaces with _locked prefix
>>>    - all public interfaces take mutex internally
>>>    - no external user take mutex by hand
>>>
>>> We can rename all internal static functions at this step too.
>>>
>>> 2. Introduce some public _locked APIs, that we'll use in next patches
>>>
>>> 3. Now start fixing external users in several patches:
>>>      - protect by mutex direct use of job fields
>>>    - make wider locks and move to _locked APIs inside them where needed
>>>
>>>
>>> In this scenario, every updated unit becomes formally correct after
>>> update, and after all steps everything is formally correct, and we can
>>> move to turning-on the mutex.
>>>
>>
>> I don't understand your logic also here, sorry :(
>>
>> I assume you want to keep patch 1-4, then the problem is assing job_lock
>> and renaming functions in _locked.
>> So I would say the problem is in patch 5-6-10-11-12-13. All the others
>> should be self contained.
>>
>> I understand patch 5 is a little hard to follow.
>>
>> Now, I am not sure what you propose here but it seems that the end goal
>> is to just have the same result, but with additional intermediate steps
>> that are just "do this just because in the next patch will be useful".
>> I think the problem is that we are going to miss the "why we need the
>> lock" logic in the patches if we do so.
>>
>> The logic I tried to convey in this order is the following:
>> - job.h: add _locked duplicates for job API functions called with and
>> without job_mutex
>>     Just create duplicates of functions
>>
>> - jobs: protect jobs with job_lock/unlock
>>     QMP and monitor functions call APIs that assume lock is taken,
>>     drivers must take explicitly the lock
>>
>> - jobs: rename static functions called with job_mutex held
>> - job.h: rename job API functions called with job_mutex held
>> - block_job: rename block_job functions called with job_mutex held
>>     *given* that some functions are always under lock, transform
>>     them in _locked. Requires the job_lock/unlock patch
>>
>> - job.h: define unlocked functions
>>     Comments on the public functions that are not _locked
>>
>>
>> @Kevin, since you also had some feedbacks on the patch ordering, do you
>> agree with this ordering or you have some other ideas?
>>
>> Following your suggestion, we could move patches 10-11-12-13 before
>> patch 6 "jobs: protect jobs with job_lock/unlock".
>>
>> (Apologies for changing my mind, but being the second complain I am
>> starting to reconsider reordering the patches).
>>
> 
> In two words, what I mean: let's keep the following invariant from patch
> to patch:
> 
> 1. Function that has _locked() prefix is always called with lock held
> 2. Function that has _locked() prefix never calls functions that take
> lock by themselves so that would dead-lock
> 3. Function that is documented as "called with lock not held" is never
> called with lock held
> 
> That what I mean by "formal correctness": yes, we know that lock is
> noop, but still let's keep code logic to correspond function naming and
> comments that we add.
> 

Ok I get what you mean, but then we have useless changes for public
functions that eventually will only be _locked() like job_next_locked:

The function is always called in a loop, so it is pointless to take the
lock inside. Therefore the patch would be "incorrect" on its own anyways.

Then, we would have a patch where we add the lock guard inside, and
another one where we remove it and rename to _locked and take the lock
outside. Seems unnecessary to me.

Again, I understand it is difficult to review as it is now, but this
won't make it better IMO.

Thank you,
Emanuele
Vladimir Sementsov-Ogievskiy June 23, 2022, 11:10 a.m. UTC | #5
On 6/23/22 12:08, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 22/06/2022 um 20:38 schrieb Vladimir Sementsov-Ogievskiy:
>> On 6/22/22 17:26, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:
>>>> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
>>>>> With the*nop*  job_lock/unlock placed, rename the static
>>>>> functions that are always under job_mutex, adding "_locked" suffix.
>>>>>
>>>>> List of functions that get this suffix:
>>>>> job_txn_ref           job_txn_del_job
>>>>> job_txn_apply           job_state_transition
>>>>> job_should_pause       job_event_cancelled
>>>>> job_event_completed       job_event_pending
>>>>> job_event_ready           job_event_idle
>>>>> job_do_yield           job_timer_not_pending
>>>>> job_do_dismiss           job_conclude
>>>>> job_update_rc           job_commit
>>>>> job_abort           job_clean
>>>>> job_finalize_single       job_cancel_async
>>>>> job_completed_txn_abort       job_prepare
>>>>> job_needs_finalize       job_do_finalize
>>>>> job_transition_to_pending  job_completed_txn_success
>>>>> job_completed           job_cancel_err
>>>>> job_force_cancel_err
>>>>>
>>>>> Note that "locked" refers to the*nop*  job_lock/unlock, and not
>>>>> real_job_lock/unlock.
>>>>>
>>>>> No functional change intended.
>>>>>
>>>>> Signed-off-by: Emanuele Giuseppe Esposito<eesposit@redhat.com>
>>>>
>>>>
>>>> Hmm. Maybe it was already discussed.. But for me it seems, that it would
>>>> be simpler to review previous patches, that fix job_ API users to use
>>>> locking properly, if this renaming go earlier.
>>>>
>>>> Anyway, in this series, we can't update everything at once. So patch to
>>>> patch, we make the code more and more correct. (yes I remember that
>>>> lock() is a noop, but I should review thinking that it real, otherwise,
>>>> how to review?)
>>>>
>>>> So, I'm saying about formal correctness of using lock() unlock()
>>>> function in connection with introduced _locked prifixes and in
>>>> connection with how it should finally work.
>>>>
>>>> You do:
>>>>
>>>> 05. introduce some _locked functions, that just duplicates, and
>>>> job_pause_point_locked() is formally inconsistent, as I said.
>>>>
>>>> 06. Update a lot of places, to give them their final form (but not
>>>> final, as some functions will be renamed to _locked, some not, hard to
>>>> imagine)
>>>>
>>>> 07,08,09. Update some more, and even more places. very hard to track
>>>> formal correctness of using locks
>>>>
>>>> 10-...: rename APIs.
>>>>
>>>>
>>>> What do you think about the following:
>>>>
>>>> 1. Introduce noop lock, and some internal _locked() versions, and keep
>>>> formal consistency inside job.c, considering all public interfaces as
>>>> unlocked:
>>>>
>>>>    at this point:
>>>>     - everything correct inside job.c
>>>>     - no public interfaces with _locked prefix
>>>>     - all public interfaces take mutex internally
>>>>     - no external user take mutex by hand
>>>>
>>>> We can rename all internal static functions at this step too.
>>>>
>>>> 2. Introduce some public _locked APIs, that we'll use in next patches
>>>>
>>>> 3. Now start fixing external users in several patches:
>>>>       - protect by mutex direct use of job fields
>>>>     - make wider locks and move to _locked APIs inside them where needed
>>>>
>>>>
>>>> In this scenario, every updated unit becomes formally correct after
>>>> update, and after all steps everything is formally correct, and we can
>>>> move to turning-on the mutex.
>>>>
>>>
>>> I don't understand your logic also here, sorry :(
>>>
>>> I assume you want to keep patch 1-4, then the problem is assing job_lock
>>> and renaming functions in _locked.
>>> So I would say the problem is in patch 5-6-10-11-12-13. All the others
>>> should be self contained.
>>>
>>> I understand patch 5 is a little hard to follow.
>>>
>>> Now, I am not sure what you propose here but it seems that the end goal
>>> is to just have the same result, but with additional intermediate steps
>>> that are just "do this just because in the next patch will be useful".
>>> I think the problem is that we are going to miss the "why we need the
>>> lock" logic in the patches if we do so.
>>>
>>> The logic I tried to convey in this order is the following:
>>> - job.h: add _locked duplicates for job API functions called with and
>>> without job_mutex
>>>      Just create duplicates of functions
>>>
>>> - jobs: protect jobs with job_lock/unlock
>>>      QMP and monitor functions call APIs that assume lock is taken,
>>>      drivers must take explicitly the lock
>>>
>>> - jobs: rename static functions called with job_mutex held
>>> - job.h: rename job API functions called with job_mutex held
>>> - block_job: rename block_job functions called with job_mutex held
>>>      *given* that some functions are always under lock, transform
>>>      them in _locked. Requires the job_lock/unlock patch
>>>
>>> - job.h: define unlocked functions
>>>      Comments on the public functions that are not _locked
>>>
>>>
>>> @Kevin, since you also had some feedbacks on the patch ordering, do you
>>> agree with this ordering or you have some other ideas?
>>>
>>> Following your suggestion, we could move patches 10-11-12-13 before
>>> patch 6 "jobs: protect jobs with job_lock/unlock".
>>>
>>> (Apologies for changing my mind, but being the second complain I am
>>> starting to reconsider reordering the patches).
>>>
>>
>> In two words, what I mean: let's keep the following invariant from patch
>> to patch:
>>
>> 1. Function that has _locked() prefix is always called with lock held
>> 2. Function that has _locked() prefix never calls functions that take
>> lock by themselves so that would dead-lock
>> 3. Function that is documented as "called with lock not held" is never
>> called with lock held
>>
>> That what I mean by "formal correctness": yes, we know that lock is
>> noop, but still let's keep code logic to correspond function naming and
>> comments that we add.
>>
> 
> Ok I get what you mean, but then we have useless changes for public
> functions that eventually will only be _locked() like job_next_locked:
> 
> The function is always called in a loop, so it is pointless to take the
> lock inside. Therefore the patch would be "incorrect" on its own anyways.
> 
> Then, we would have a patch where we add the lock guard inside, and
> another one where we remove it and rename to _locked and take the lock
> outside. Seems unnecessary to me.

For me it looks a bit simpler than you describe. And anyway keeping the correctness from patch to patch worth the complexity. I'll give an argument.

First what is the best practices? Best practices is when every patch is good and absolutely correct. So that you can apply any number of patches from the beginning of the series (01-NN), commit them to master and this will break neither compilation, nor tests, nor readability, nothing. This makes the review process iterable: if I'm OK with patches 01-03, I give them r-b and don't think about them. I don't have to keep in mind any tricky things. And I can review 04 several days later not rereading 01-03 (or at least I can consider applied 01-03 as a good correct base state). This way I'm sure, that if I reviewed all patches one-by-one, each one is correct, then the whole thing is correct.

A lot harder to review when we have only collective correctness: the whole series being applied make a correct thing, but we can't say it about intermediate states. In your series we can't be absolutely correct with each patch, as we have to switch from aio-context lock to mutex in one patch, that's why mutex is added as noop. That's a reasonable and (seems) unsolvable drawback. That's a thing I have to keep in mind during the whole review. But I'd prefer not add more such things, like comments and _locked suffixes that don't correspond to the code.

With the invariant that I propose, the following logic works:

If
    1. we keep the invariant from patch to patch
    AND
    2. at the end we have updated all users of the internal and external APIs, not missed some file or function
Then everything is correct at the end.

Without the invariant I can't prove that everything is correct at the end, as it is hard to follow the degree of correctness from patch to patch. In your way the only invariant that we have from patch to patch, is that mutex is noop, so all changes do nothing, and therefore they are correct. This way I can give an r-b to all such patches not thinking about details, they are noop. But when I finally have to review the patch that turns on the mutex, I'll have to recheck all internal and external API users, which is equivalent to review all the changes merged into one patch.



Consider the case with job_next. The most correct way to update it IMHO:

1. Add lock inside job_next() and add job_next_locked() - in one patch with other similar changes of job.c and job.h.

At this moment we have job_next() calls in a loop, which is not good (we want larger critical section), but that doesn't break the invariant I proposed above.

2. Update the loop: add larger critical section and switch to job_next_locked().

What is good here: we don't  need to unite updates of external API users into one patch, we can update file-by-file or subsystem-by-sybsystem.

3. Delete unused job_next() API
Emanuele Giuseppe Esposito June 23, 2022, 11:19 a.m. UTC | #6
Am 23/06/2022 um 13:10 schrieb Vladimir Sementsov-Ogievskiy:
> On 6/23/22 12:08, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 22/06/2022 um 20:38 schrieb Vladimir Sementsov-Ogievskiy:
>>> On 6/22/22 17:26, Emanuele Giuseppe Esposito wrote:
>>>>
>>>>
>>>> Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:
>>>>> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
>>>>>> With the*nop*  job_lock/unlock placed, rename the static
>>>>>> functions that are always under job_mutex, adding "_locked" suffix.
>>>>>>
>>>>>> List of functions that get this suffix:
>>>>>> job_txn_ref           job_txn_del_job
>>>>>> job_txn_apply           job_state_transition
>>>>>> job_should_pause       job_event_cancelled
>>>>>> job_event_completed       job_event_pending
>>>>>> job_event_ready           job_event_idle
>>>>>> job_do_yield           job_timer_not_pending
>>>>>> job_do_dismiss           job_conclude
>>>>>> job_update_rc           job_commit
>>>>>> job_abort           job_clean
>>>>>> job_finalize_single       job_cancel_async
>>>>>> job_completed_txn_abort       job_prepare
>>>>>> job_needs_finalize       job_do_finalize
>>>>>> job_transition_to_pending  job_completed_txn_success
>>>>>> job_completed           job_cancel_err
>>>>>> job_force_cancel_err
>>>>>>
>>>>>> Note that "locked" refers to the*nop*  job_lock/unlock, and not
>>>>>> real_job_lock/unlock.
>>>>>>
>>>>>> No functional change intended.
>>>>>>
>>>>>> Signed-off-by: Emanuele Giuseppe Esposito<eesposit@redhat.com>
>>>>>
>>>>>
>>>>> Hmm. Maybe it was already discussed.. But for me it seems, that it
>>>>> would
>>>>> be simpler to review previous patches, that fix job_ API users to use
>>>>> locking properly, if this renaming go earlier.
>>>>>
>>>>> Anyway, in this series, we can't update everything at once. So
>>>>> patch to
>>>>> patch, we make the code more and more correct. (yes I remember that
>>>>> lock() is a noop, but I should review thinking that it real,
>>>>> otherwise,
>>>>> how to review?)
>>>>>
>>>>> So, I'm saying about formal correctness of using lock() unlock()
>>>>> function in connection with introduced _locked prifixes and in
>>>>> connection with how it should finally work.
>>>>>
>>>>> You do:
>>>>>
>>>>> 05. introduce some _locked functions, that just duplicates, and
>>>>> job_pause_point_locked() is formally inconsistent, as I said.
>>>>>
>>>>> 06. Update a lot of places, to give them their final form (but not
>>>>> final, as some functions will be renamed to _locked, some not, hard to
>>>>> imagine)
>>>>>
>>>>> 07,08,09. Update some more, and even more places. very hard to track
>>>>> formal correctness of using locks
>>>>>
>>>>> 10-...: rename APIs.
>>>>>
>>>>>
>>>>> What do you think about the following:
>>>>>
>>>>> 1. Introduce noop lock, and some internal _locked() versions, and keep
>>>>> formal consistency inside job.c, considering all public interfaces as
>>>>> unlocked:
>>>>>
>>>>>    at this point:
>>>>>     - everything correct inside job.c
>>>>>     - no public interfaces with _locked prefix
>>>>>     - all public interfaces take mutex internally
>>>>>     - no external user take mutex by hand
>>>>>
>>>>> We can rename all internal static functions at this step too.
>>>>>
>>>>> 2. Introduce some public _locked APIs, that we'll use in next patches
>>>>>
>>>>> 3. Now start fixing external users in several patches:
>>>>>       - protect by mutex direct use of job fields
>>>>>     - make wider locks and move to _locked APIs inside them where
>>>>> needed
>>>>>
>>>>>
>>>>> In this scenario, every updated unit becomes formally correct after
>>>>> update, and after all steps everything is formally correct, and we can
>>>>> move to turning-on the mutex.
>>>>>
>>>>
>>>> I don't understand your logic also here, sorry :(
>>>>
>>>> I assume you want to keep patch 1-4, then the problem is assing
>>>> job_lock
>>>> and renaming functions in _locked.
>>>> So I would say the problem is in patch 5-6-10-11-12-13. All the others
>>>> should be self contained.
>>>>
>>>> I understand patch 5 is a little hard to follow.
>>>>
>>>> Now, I am not sure what you propose here but it seems that the end goal
>>>> is to just have the same result, but with additional intermediate steps
>>>> that are just "do this just because in the next patch will be useful".
>>>> I think the problem is that we are going to miss the "why we need the
>>>> lock" logic in the patches if we do so.
>>>>
>>>> The logic I tried to convey in this order is the following:
>>>> - job.h: add _locked duplicates for job API functions called with and
>>>> without job_mutex
>>>>      Just create duplicates of functions
>>>>
>>>> - jobs: protect jobs with job_lock/unlock
>>>>      QMP and monitor functions call APIs that assume lock is taken,
>>>>      drivers must take explicitly the lock
>>>>
>>>> - jobs: rename static functions called with job_mutex held
>>>> - job.h: rename job API functions called with job_mutex held
>>>> - block_job: rename block_job functions called with job_mutex held
>>>>      *given* that some functions are always under lock, transform
>>>>      them in _locked. Requires the job_lock/unlock patch
>>>>
>>>> - job.h: define unlocked functions
>>>>      Comments on the public functions that are not _locked
>>>>
>>>>
>>>> @Kevin, since you also had some feedbacks on the patch ordering, do you
>>>> agree with this ordering or you have some other ideas?
>>>>
>>>> Following your suggestion, we could move patches 10-11-12-13 before
>>>> patch 6 "jobs: protect jobs with job_lock/unlock".
>>>>
>>>> (Apologies for changing my mind, but being the second complain I am
>>>> starting to reconsider reordering the patches).
>>>>
>>>
>>> In two words, what I mean: let's keep the following invariant from patch
>>> to patch:
>>>
>>> 1. Function that has _locked() prefix is always called with lock held
>>> 2. Function that has _locked() prefix never calls functions that take
>>> lock by themselves so that would dead-lock
>>> 3. Function that is documented as "called with lock not held" is never
>>> called with lock held
>>>
>>> That what I mean by "formal correctness": yes, we know that lock is
>>> noop, but still let's keep code logic to correspond function naming and
>>> comments that we add.
>>>
>>
>> Ok I get what you mean, but then we have useless changes for public
>> functions that eventually will only be _locked() like job_next_locked:
>>
>> The function is always called in a loop, so it is pointless to take the
>> lock inside. Therefore the patch would be "incorrect" on its own anyways.
>>
>> Then, we would have a patch where we add the lock guard inside, and
>> another one where we remove it and rename to _locked and take the lock
>> outside. Seems unnecessary to me.
> 
> For me it looks a bit simpler than you describe. And anyway keeping the
> correctness from patch to patch worth the complexity. I'll give an
> argument.
> 
> First what is the best practices? Best practices is when every patch is
> good and absolutely correct. So that you can apply any number of patches
> from the beginning of the series (01-NN), commit them to master and this
> will break neither compilation, nor tests, nor readability, nothing.
> This makes the review process iterable: if I'm OK with patches 01-03, I
> give them r-b and don't think about them. I don't have to keep in mind
> any tricky things. And I can review 04 several days later not rereading
> 01-03 (or at least I can consider applied 01-03 as a good correct base
> state). This way I'm sure, that if I reviewed all patches one-by-one,
> each one is correct, then the whole thing is correct.
> 
> A lot harder to review when we have only collective correctness: the
> whole series being applied make a correct thing, but we can't say it
> about intermediate states. In your series we can't be absolutely correct
> with each patch, as we have to switch from aio-context lock to mutex in
> one patch, that's why mutex is added as noop. That's a reasonable and
> (seems) unsolvable drawback. That's a thing I have to keep in mind
> during the whole review. But I'd prefer not add more such things, like
> comments and _locked suffixes that don't correspond to the code.
> 
> With the invariant that I propose, the following logic works:
> 
> If
>    1. we keep the invariant from patch to patch
>    AND
>    2. at the end we have updated all users of the internal and external
> APIs, not missed some file or function
> Then everything is correct at the end.
> 
> Without the invariant I can't prove that everything is correct at the
> end, as it is hard to follow the degree of correctness from patch to
> patch. In your way the only invariant that we have from patch to patch,
> is that mutex is noop, so all changes do nothing, and therefore they are
> correct. This way I can give an r-b to all such patches not thinking
> about details, they are noop. But when I finally have to review the
> patch that turns on the mutex, I'll have to recheck all internal and
> external API users, which is equivalent to review all the changes merged
> into one patch.
> 
> 
> 
> Consider the case with job_next. The most correct way to update it IMHO:
> 
> 1. Add lock inside job_next() and add job_next_locked() - in one patch
> with other similar changes of job.c and job.h.
> 
> At this moment we have job_next() calls in a loop, which is not good (we
> want larger critical section), but that doesn't break the invariant I
> proposed above.

The only thing I am pointing here is that this breaks "readability",
meaning if someone bisects here it will find a very weird situation
(aside from the fact that there is a noop lock).

But I guess this is fine, as long as I write it in the commit message.

And since these patch are waiting here for more than 3 months now, I
would say if the others (Kevin?) agree I will change the order with what
you proposed here.

Emanuele

> 
> 2. Update the loop: add larger critical section and switch to
> job_next_locked().
> 
> What is good here: we don't  need to unite updates of external API users
> into one patch, we can update file-by-file or subsystem-by-sybsystem.
> 
> 3. Delete unused job_next() API
> 
>
Vladimir Sementsov-Ogievskiy June 23, 2022, 11:58 a.m. UTC | #7
On 6/23/22 14:19, Emanuele Giuseppe Esposito wrote:
> 
> Am 23/06/2022 um 13:10 schrieb Vladimir Sementsov-Ogievskiy:
>> On 6/23/22 12:08, Emanuele Giuseppe Esposito wrote:
>>>
>>> Am 22/06/2022 um 20:38 schrieb Vladimir Sementsov-Ogievskiy:
>>>> On 6/22/22 17:26, Emanuele Giuseppe Esposito wrote:
>>>>>
>>>>> Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:
>>>>>> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
>>>>>>> With the*nop*  job_lock/unlock placed, rename the static
>>>>>>> functions that are always under job_mutex, adding "_locked" suffix.
>>>>>>>
>>>>>>> List of functions that get this suffix:
>>>>>>> job_txn_ref           job_txn_del_job
>>>>>>> job_txn_apply           job_state_transition
>>>>>>> job_should_pause       job_event_cancelled
>>>>>>> job_event_completed       job_event_pending
>>>>>>> job_event_ready           job_event_idle
>>>>>>> job_do_yield           job_timer_not_pending
>>>>>>> job_do_dismiss           job_conclude
>>>>>>> job_update_rc           job_commit
>>>>>>> job_abort           job_clean
>>>>>>> job_finalize_single       job_cancel_async
>>>>>>> job_completed_txn_abort       job_prepare
>>>>>>> job_needs_finalize       job_do_finalize
>>>>>>> job_transition_to_pending  job_completed_txn_success
>>>>>>> job_completed           job_cancel_err
>>>>>>> job_force_cancel_err
>>>>>>>
>>>>>>> Note that "locked" refers to the*nop*  job_lock/unlock, and not
>>>>>>> real_job_lock/unlock.
>>>>>>>
>>>>>>> No functional change intended.
>>>>>>>
>>>>>>> Signed-off-by: Emanuele Giuseppe Esposito<eesposit@redhat.com>
>>>>>>
>>>>>> Hmm. Maybe it was already discussed.. But for me it seems, that it
>>>>>> would
>>>>>> be simpler to review previous patches, that fix job_ API users to use
>>>>>> locking properly, if this renaming go earlier.
>>>>>>
>>>>>> Anyway, in this series, we can't update everything at once. So
>>>>>> patch to
>>>>>> patch, we make the code more and more correct. (yes I remember that
>>>>>> lock() is a noop, but I should review thinking that it real,
>>>>>> otherwise,
>>>>>> how to review?)
>>>>>>
>>>>>> So, I'm saying about formal correctness of using lock() unlock()
>>>>>> function in connection with introduced _locked prifixes and in
>>>>>> connection with how it should finally work.
>>>>>>
>>>>>> You do:
>>>>>>
>>>>>> 05. introduce some _locked functions, that just duplicates, and
>>>>>> job_pause_point_locked() is formally inconsistent, as I said.
>>>>>>
>>>>>> 06. Update a lot of places, to give them their final form (but not
>>>>>> final, as some functions will be renamed to _locked, some not, hard to
>>>>>> imagine)
>>>>>>
>>>>>> 07,08,09. Update some more, and even more places. very hard to track
>>>>>> formal correctness of using locks
>>>>>>
>>>>>> 10-...: rename APIs.
>>>>>>
>>>>>>
>>>>>> What do you think about the following:
>>>>>>
>>>>>> 1. Introduce noop lock, and some internal _locked() versions, and keep
>>>>>> formal consistency inside job.c, considering all public interfaces as
>>>>>> unlocked:
>>>>>>
>>>>>>     at this point:
>>>>>>      - everything correct inside job.c
>>>>>>      - no public interfaces with _locked prefix
>>>>>>      - all public interfaces take mutex internally
>>>>>>      - no external user take mutex by hand
>>>>>>
>>>>>> We can rename all internal static functions at this step too.
>>>>>>
>>>>>> 2. Introduce some public _locked APIs, that we'll use in next patches
>>>>>>
>>>>>> 3. Now start fixing external users in several patches:
>>>>>>        - protect by mutex direct use of job fields
>>>>>>      - make wider locks and move to _locked APIs inside them where
>>>>>> needed
>>>>>>
>>>>>>
>>>>>> In this scenario, every updated unit becomes formally correct after
>>>>>> update, and after all steps everything is formally correct, and we can
>>>>>> move to turning-on the mutex.
>>>>>>
>>>>> I don't understand your logic also here, sorry:(
>>>>>
>>>>> I assume you want to keep patch 1-4, then the problem is assing
>>>>> job_lock
>>>>> and renaming functions in _locked.
>>>>> So I would say the problem is in patch 5-6-10-11-12-13. All the others
>>>>> should be self contained.
>>>>>
>>>>> I understand patch 5 is a little hard to follow.
>>>>>
>>>>> Now, I am not sure what you propose here but it seems that the end goal
>>>>> is to just have the same result, but with additional intermediate steps
>>>>> that are just "do this just because in the next patch will be useful".
>>>>> I think the problem is that we are going to miss the "why we need the
>>>>> lock" logic in the patches if we do so.
>>>>>
>>>>> The logic I tried to convey in this order is the following:
>>>>> - job.h: add _locked duplicates for job API functions called with and
>>>>> without job_mutex
>>>>>       Just create duplicates of functions
>>>>>
>>>>> - jobs: protect jobs with job_lock/unlock
>>>>>       QMP and monitor functions call APIs that assume lock is taken,
>>>>>       drivers must take explicitly the lock
>>>>>
>>>>> - jobs: rename static functions called with job_mutex held
>>>>> - job.h: rename job API functions called with job_mutex held
>>>>> - block_job: rename block_job functions called with job_mutex held
>>>>>       *given*  that some functions are always under lock, transform
>>>>>       them in _locked. Requires the job_lock/unlock patch
>>>>>
>>>>> - job.h: define unlocked functions
>>>>>       Comments on the public functions that are not _locked
>>>>>
>>>>>
>>>>> @Kevin, since you also had some feedbacks on the patch ordering, do you
>>>>> agree with this ordering or you have some other ideas?
>>>>>
>>>>> Following your suggestion, we could move patches 10-11-12-13 before
>>>>> patch 6 "jobs: protect jobs with job_lock/unlock".
>>>>>
>>>>> (Apologies for changing my mind, but being the second complain I am
>>>>> starting to reconsider reordering the patches).
>>>>>
>>>> In two words, what I mean: let's keep the following invariant from patch
>>>> to patch:
>>>>
>>>> 1. Function that has _locked() prefix is always called with lock held
>>>> 2. Function that has _locked() prefix never calls functions that take
>>>> lock by themselves so that would dead-lock
>>>> 3. Function that is documented as "called with lock not held" is never
>>>> called with lock held
>>>>
>>>> That what I mean by "formal correctness": yes, we know that lock is
>>>> noop, but still let's keep code logic to correspond function naming and
>>>> comments that we add.
>>>>
>>> Ok I get what you mean, but then we have useless changes for public
>>> functions that eventually will only be _locked() like job_next_locked:
>>>
>>> The function is always called in a loop, so it is pointless to take the
>>> lock inside. Therefore the patch would be "incorrect" on its own anyways.
>>>
>>> Then, we would have a patch where we add the lock guard inside, and
>>> another one where we remove it and rename to _locked and take the lock
>>> outside. Seems unnecessary to me.
>> For me it looks a bit simpler than you describe. And anyway keeping the
>> correctness from patch to patch worth the complexity. I'll give an
>> argument.
>>
>> First what is the best practices? Best practices is when every patch is
>> good and absolutely correct. So that you can apply any number of patches
>> from the beginning of the series (01-NN), commit them to master and this
>> will break neither compilation, nor tests, nor readability, nothing.
>> This makes the review process iterable: if I'm OK with patches 01-03, I
>> give them r-b and don't think about them. I don't have to keep in mind
>> any tricky things. And I can review 04 several days later not rereading
>> 01-03 (or at least I can consider applied 01-03 as a good correct base
>> state). This way I'm sure, that if I reviewed all patches one-by-one,
>> each one is correct, then the whole thing is correct.
>>
>> A lot harder to review when we have only collective correctness: the
>> whole series being applied make a correct thing, but we can't say it
>> about intermediate states. In your series we can't be absolutely correct
>> with each patch, as we have to switch from aio-context lock to mutex in
>> one patch, that's why mutex is added as noop. That's a reasonable and
>> (seems) unsolvable drawback. That's a thing I have to keep in mind
>> during the whole review. But I'd prefer not add more such things, like
>> comments and _locked suffixes that don't correspond to the code.
>>
>> With the invariant that I propose, the following logic works:
>>
>> If
>>     1. we keep the invariant from patch to patch
>>     AND
>>     2. at the end we have updated all users of the internal and external
>> APIs, not missed some file or function
>> Then everything is correct at the end.
>>
>> Without the invariant I can't prove that everything is correct at the
>> end, as it is hard to follow the degree of correctness from patch to
>> patch. In your way the only invariant that we have from patch to patch,
>> is that mutex is noop, so all changes do nothing, and therefore they are
>> correct. This way I can give an r-b to all such patches not thinking
>> about details, they are noop. But when I finally have to review the
>> patch that turns on the mutex, I'll have to recheck all internal and
>> external API users, which is equivalent to review all the changes merged
>> into one patch.
>>
>>
>>
>> Consider the case with job_next. The most correct way to update it IMHO:
>>
>> 1. Add lock inside job_next() and add job_next_locked() - in one patch
>> with other similar changes of job.c and job.h.
>>
>> At this moment we have job_next() calls in a loop, which is not good (we
>> want larger critical section), but that doesn't break the invariant I
>> proposed above.
> The only thing I am pointing here is that this breaks "readability",
> meaning if someone bisects here it will find a very weird situation
> (aside from the fact that there is a noop lock).

IHMO, calling job_next() in a loop, when job_next() takes mutex internally is still more readable and more correct than when we directly break _locked() prefix semantics and comments about "held / not held" that we add.

But I understand that arguing about what is more correct and what is less correct is not correct itself. In math something that is a bit incorrect is considered absolutely incorrect)

> 
> But I guess this is fine, as long as I write it in the commit message.
> 
> And since these patch are waiting here for more than 3 months now, I

I understand this :/ Feel unfair to require conceptual changes at stage of v7. If needed, I can take part in preparing v8.

> would say if the others (Kevin?) agree I will change the order with what
> you proposed here.

Yes, would be great to have more opinions
Kevin Wolf June 24, 2022, 2:29 p.m. UTC | #8
Am 23.06.2022 um 13:19 hat Emanuele Giuseppe Esposito geschrieben:
> 
> 
> Am 23/06/2022 um 13:10 schrieb Vladimir Sementsov-Ogievskiy:
> > On 6/23/22 12:08, Emanuele Giuseppe Esposito wrote:
> >>
> >>
> >> Am 22/06/2022 um 20:38 schrieb Vladimir Sementsov-Ogievskiy:
> >>> On 6/22/22 17:26, Emanuele Giuseppe Esposito wrote:
> >>>>
> >>>>
> >>>> Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:
> >>>>> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
> >>>>>> With the*nop*  job_lock/unlock placed, rename the static
> >>>>>> functions that are always under job_mutex, adding "_locked" suffix.
> >>>>>>
> >>>>>> List of functions that get this suffix:
> >>>>>> job_txn_ref           job_txn_del_job
> >>>>>> job_txn_apply           job_state_transition
> >>>>>> job_should_pause       job_event_cancelled
> >>>>>> job_event_completed       job_event_pending
> >>>>>> job_event_ready           job_event_idle
> >>>>>> job_do_yield           job_timer_not_pending
> >>>>>> job_do_dismiss           job_conclude
> >>>>>> job_update_rc           job_commit
> >>>>>> job_abort           job_clean
> >>>>>> job_finalize_single       job_cancel_async
> >>>>>> job_completed_txn_abort       job_prepare
> >>>>>> job_needs_finalize       job_do_finalize
> >>>>>> job_transition_to_pending  job_completed_txn_success
> >>>>>> job_completed           job_cancel_err
> >>>>>> job_force_cancel_err
> >>>>>>
> >>>>>> Note that "locked" refers to the*nop*  job_lock/unlock, and not
> >>>>>> real_job_lock/unlock.
> >>>>>>
> >>>>>> No functional change intended.
> >>>>>>
> >>>>>> Signed-off-by: Emanuele Giuseppe Esposito<eesposit@redhat.com>
> >>>>>
> >>>>>
> >>>>> Hmm. Maybe it was already discussed.. But for me it seems, that it
> >>>>> would
> >>>>> be simpler to review previous patches, that fix job_ API users to use
> >>>>> locking properly, if this renaming go earlier.
> >>>>>
> >>>>> Anyway, in this series, we can't update everything at once. So
> >>>>> patch to
> >>>>> patch, we make the code more and more correct. (yes I remember that
> >>>>> lock() is a noop, but I should review thinking that it real,
> >>>>> otherwise,
> >>>>> how to review?)
> >>>>>
> >>>>> So, I'm saying about formal correctness of using lock() unlock()
> >>>>> function in connection with introduced _locked prifixes and in
> >>>>> connection with how it should finally work.
> >>>>>
> >>>>> You do:
> >>>>>
> >>>>> 05. introduce some _locked functions, that just duplicates, and
> >>>>> job_pause_point_locked() is formally inconsistent, as I said.
> >>>>>
> >>>>> 06. Update a lot of places, to give them their final form (but not
> >>>>> final, as some functions will be renamed to _locked, some not, hard to
> >>>>> imagine)
> >>>>>
> >>>>> 07,08,09. Update some more, and even more places. very hard to track
> >>>>> formal correctness of using locks
> >>>>>
> >>>>> 10-...: rename APIs.
> >>>>>
> >>>>>
> >>>>> What do you think about the following:
> >>>>>
> >>>>> 1. Introduce noop lock, and some internal _locked() versions, and keep
> >>>>> formal consistency inside job.c, considering all public interfaces as
> >>>>> unlocked:
> >>>>>
> >>>>>    at this point:
> >>>>>     - everything correct inside job.c
> >>>>>     - no public interfaces with _locked prefix
> >>>>>     - all public interfaces take mutex internally
> >>>>>     - no external user take mutex by hand
> >>>>>
> >>>>> We can rename all internal static functions at this step too.
> >>>>>
> >>>>> 2. Introduce some public _locked APIs, that we'll use in next patches
> >>>>>
> >>>>> 3. Now start fixing external users in several patches:
> >>>>>       - protect by mutex direct use of job fields
> >>>>>     - make wider locks and move to _locked APIs inside them where
> >>>>> needed
> >>>>>
> >>>>>
> >>>>> In this scenario, every updated unit becomes formally correct after
> >>>>> update, and after all steps everything is formally correct, and we can
> >>>>> move to turning-on the mutex.
> >>>>>
> >>>>
> >>>> I don't understand your logic also here, sorry :(
> >>>>
> >>>> I assume you want to keep patch 1-4, then the problem is assing
> >>>> job_lock
> >>>> and renaming functions in _locked.
> >>>> So I would say the problem is in patch 5-6-10-11-12-13. All the others
> >>>> should be self contained.
> >>>>
> >>>> I understand patch 5 is a little hard to follow.
> >>>>
> >>>> Now, I am not sure what you propose here but it seems that the end goal
> >>>> is to just have the same result, but with additional intermediate steps
> >>>> that are just "do this just because in the next patch will be useful".
> >>>> I think the problem is that we are going to miss the "why we need the
> >>>> lock" logic in the patches if we do so.
> >>>>
> >>>> The logic I tried to convey in this order is the following:
> >>>> - job.h: add _locked duplicates for job API functions called with and
> >>>> without job_mutex
> >>>>      Just create duplicates of functions
> >>>>
> >>>> - jobs: protect jobs with job_lock/unlock
> >>>>      QMP and monitor functions call APIs that assume lock is taken,
> >>>>      drivers must take explicitly the lock
> >>>>
> >>>> - jobs: rename static functions called with job_mutex held
> >>>> - job.h: rename job API functions called with job_mutex held
> >>>> - block_job: rename block_job functions called with job_mutex held
> >>>>      *given* that some functions are always under lock, transform
> >>>>      them in _locked. Requires the job_lock/unlock patch
> >>>>
> >>>> - job.h: define unlocked functions
> >>>>      Comments on the public functions that are not _locked
> >>>>
> >>>>
> >>>> @Kevin, since you also had some feedbacks on the patch ordering, do you
> >>>> agree with this ordering or you have some other ideas?
> >>>>
> >>>> Following your suggestion, we could move patches 10-11-12-13 before
> >>>> patch 6 "jobs: protect jobs with job_lock/unlock".
> >>>>
> >>>> (Apologies for changing my mind, but being the second complain I am
> >>>> starting to reconsider reordering the patches).
> >>>>
> >>>
> >>> In two words, what I mean: let's keep the following invariant from patch
> >>> to patch:
> >>>
> >>> 1. Function that has _locked() prefix is always called with lock held
> >>> 2. Function that has _locked() prefix never calls functions that take
> >>> lock by themselves so that would dead-lock
> >>> 3. Function that is documented as "called with lock not held" is never
> >>> called with lock held
> >>>
> >>> That what I mean by "formal correctness": yes, we know that lock is
> >>> noop, but still let's keep code logic to correspond function naming and
> >>> comments that we add.
> >>>
> >>
> >> Ok I get what you mean, but then we have useless changes for public
> >> functions that eventually will only be _locked() like job_next_locked:
> >>
> >> The function is always called in a loop, so it is pointless to take the
> >> lock inside. Therefore the patch would be "incorrect" on its own anyways.
> >>
> >> Then, we would have a patch where we add the lock guard inside, and
> >> another one where we remove it and rename to _locked and take the lock
> >> outside. Seems unnecessary to me.
> > 
> > For me it looks a bit simpler than you describe. And anyway keeping the
> > correctness from patch to patch worth the complexity. I'll give an
> > argument.
> > 
> > First what is the best practices? Best practices is when every patch is
> > good and absolutely correct. So that you can apply any number of patches
> > from the beginning of the series (01-NN), commit them to master and this
> > will break neither compilation, nor tests, nor readability, nothing.
> > This makes the review process iterable: if I'm OK with patches 01-03, I
> > give them r-b and don't think about them. I don't have to keep in mind
> > any tricky things. And I can review 04 several days later not rereading
> > 01-03 (or at least I can consider applied 01-03 as a good correct base
> > state). This way I'm sure, that if I reviewed all patches one-by-one,
> > each one is correct, then the whole thing is correct.
> > 
> > A lot harder to review when we have only collective correctness: the
> > whole series being applied make a correct thing, but we can't say it
> > about intermediate states. In your series we can't be absolutely correct
> > with each patch, as we have to switch from aio-context lock to mutex in
> > one patch, that's why mutex is added as noop. That's a reasonable and
> > (seems) unsolvable drawback. That's a thing I have to keep in mind
> > during the whole review. But I'd prefer not add more such things, like
> > comments and _locked suffixes that don't correspond to the code.
> > 
> > With the invariant that I propose, the following logic works:
> > 
> > If
> >    1. we keep the invariant from patch to patch
> >    AND
> >    2. at the end we have updated all users of the internal and external
> > APIs, not missed some file or function
> > Then everything is correct at the end.
> > 
> > Without the invariant I can't prove that everything is correct at the
> > end, as it is hard to follow the degree of correctness from patch to
> > patch. In your way the only invariant that we have from patch to patch,
> > is that mutex is noop, so all changes do nothing, and therefore they are
> > correct. This way I can give an r-b to all such patches not thinking
> > about details, they are noop. But when I finally have to review the
> > patch that turns on the mutex, I'll have to recheck all internal and
> > external API users, which is equivalent to review all the changes merged
> > into one patch.
> > 
> > 
> > 
> > Consider the case with job_next. The most correct way to update it IMHO:
> > 
> > 1. Add lock inside job_next() and add job_next_locked() - in one patch
> > with other similar changes of job.c and job.h.
> > 
> > At this moment we have job_next() calls in a loop, which is not good (we
> > want larger critical section), but that doesn't break the invariant I
> > proposed above.
> 
> The only thing I am pointing here is that this breaks "readability",
> meaning if someone bisects here it will find a very weird situation
> (aside from the fact that there is a noop lock).
> 
> But I guess this is fine, as long as I write it in the commit message.
> 
> And since these patch are waiting here for more than 3 months now, I
> would say if the others (Kevin?) agree I will change the order with what
> you proposed here.

Yes, I think Vladimir is having the same difficulties with reading the
series as I had. And I believe his suggestion would make the
intermediate states less impossible to review. The question is how much
work it would be and whether you're willing to do this. As I said, if
reorganising is too hard, I'm okay with just ignoring the intermediate
state and reviewing the series as if it were a single patch.

Kevin
Paolo Bonzini June 24, 2022, 3:28 p.m. UTC | #9
On 6/24/22 16:29, Kevin Wolf wrote:
> Yes, I think Vladimir is having the same difficulties with reading the
> series as I had. And I believe his suggestion would make the
> intermediate states less impossible to review. The question is how much
> work it would be and whether you're willing to do this. As I said, if
> reorganising is too hard, I'm okay with just ignoring the intermediate
> state and reviewing the series as if it were a single patch.

I think we've tried different intermediate states for each of the 
previous 6 versions, and none of them were really satisfactory. :(

Paolo
Emanuele Giuseppe Esposito June 24, 2022, 5:20 p.m. UTC | #10
Am 24/06/2022 um 17:28 schrieb Paolo Bonzini:
> On 6/24/22 16:29, Kevin Wolf wrote:
>> Yes, I think Vladimir is having the same difficulties with reading the
>> series as I had. And I believe his suggestion would make the
>> intermediate states less impossible to review. The question is how much
>> work it would be and whether you're willing to do this. As I said, if
>> reorganising is too hard, I'm okay with just ignoring the intermediate
>> state and reviewing the series as if it were a single patch.
> 
> I think we've tried different intermediate states for each of the
> previous 6 versions, and none of them were really satisfactory. :(
> 

Yes. v7 in this case basically means that we tried at least 4-5 times to
reorganize patches.

Nevertheless I could give it a try. I just hope I won't regret it :)

If I don't manage, I will just give up and re-send the serie with
Vladimir's nitpicks.

But yeah, I guess we all agree that this is the last time I reorganize
this serie.

Feedback are always very well welcome, but not anymore on reordering
please ;)

Thank you,
Emanuele
Emanuele Giuseppe Esposito June 28, 2022, 7:40 a.m. UTC | #11
Am 22/06/2022 um 20:38 schrieb Vladimir Sementsov-Ogievskiy:
> On 6/22/22 17:26, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:
>>> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
>>>> With the*nop*  job_lock/unlock placed, rename the static
>>>> functions that are always under job_mutex, adding "_locked" suffix.
>>>>
>>>> List of functions that get this suffix:
>>>> job_txn_ref           job_txn_del_job
>>>> job_txn_apply           job_state_transition
>>>> job_should_pause       job_event_cancelled
>>>> job_event_completed       job_event_pending
>>>> job_event_ready           job_event_idle
>>>> job_do_yield           job_timer_not_pending
>>>> job_do_dismiss           job_conclude
>>>> job_update_rc           job_commit
>>>> job_abort           job_clean
>>>> job_finalize_single       job_cancel_async
>>>> job_completed_txn_abort       job_prepare
>>>> job_needs_finalize       job_do_finalize
>>>> job_transition_to_pending  job_completed_txn_success
>>>> job_completed           job_cancel_err
>>>> job_force_cancel_err
>>>>
>>>> Note that "locked" refers to the*nop*  job_lock/unlock, and not
>>>> real_job_lock/unlock.
>>>>
>>>> No functional change intended.
>>>>
>>>> Signed-off-by: Emanuele Giuseppe Esposito<eesposit@redhat.com>
>>>
>>>
>>> Hmm. Maybe it was already discussed.. But for me it seems, that it would
>>> be simpler to review previous patches, that fix job_ API users to use
>>> locking properly, if this renaming go earlier.
>>>
>>> Anyway, in this series, we can't update everything at once. So patch to
>>> patch, we make the code more and more correct. (yes I remember that
>>> lock() is a noop, but I should review thinking that it real, otherwise,
>>> how to review?)
>>>
>>> So, I'm saying about formal correctness of using lock() unlock()
>>> function in connection with introduced _locked prifixes and in
>>> connection with how it should finally work.
>>>
>>> You do:
>>>
>>> 05. introduce some _locked functions, that just duplicates, and
>>> job_pause_point_locked() is formally inconsistent, as I said.
>>>
>>> 06. Update a lot of places, to give them their final form (but not
>>> final, as some functions will be renamed to _locked, some not, hard to
>>> imagine)
>>>
>>> 07,08,09. Update some more, and even more places. very hard to track
>>> formal correctness of using locks
>>>
>>> 10-...: rename APIs.
>>>
>>>
>>> What do you think about the following:
>>>
>>> 1. Introduce noop lock, and some internal _locked() versions, and keep
>>> formal consistency inside job.c, considering all public interfaces as
>>> unlocked:
>>>
>>>   at this point:
>>>    - everything correct inside job.c
>>>    - no public interfaces with _locked prefix
>>>    - all public interfaces take mutex internally
>>>    - no external user take mutex by hand
>>>
>>> We can rename all internal static functions at this step too.
>>>
>>> 2. Introduce some public _locked APIs, that we'll use in next patches
>>>
>>> 3. Now start fixing external users in several patches:
>>>      - protect by mutex direct use of job fields
>>>    - make wider locks and move to _locked APIs inside them where needed
>>>
>>>
>>> In this scenario, every updated unit becomes formally correct after
>>> update, and after all steps everything is formally correct, and we can
>>> move to turning-on the mutex.
>>>
>>
>> I don't understand your logic also here, sorry :(
>>
>> I assume you want to keep patch 1-4, then the problem is assing job_lock
>> and renaming functions in _locked.
>> So I would say the problem is in patch 5-6-10-11-12-13. All the others
>> should be self contained.
>>
>> I understand patch 5 is a little hard to follow.
>>
>> Now, I am not sure what you propose here but it seems that the end goal
>> is to just have the same result, but with additional intermediate steps
>> that are just "do this just because in the next patch will be useful".
>> I think the problem is that we are going to miss the "why we need the
>> lock" logic in the patches if we do so.
>>
>> The logic I tried to convey in this order is the following:
>> - job.h: add _locked duplicates for job API functions called with and
>> without job_mutex
>>     Just create duplicates of functions
>>
>> - jobs: protect jobs with job_lock/unlock
>>     QMP and monitor functions call APIs that assume lock is taken,
>>     drivers must take explicitly the lock
>>
>> - jobs: rename static functions called with job_mutex held
>> - job.h: rename job API functions called with job_mutex held
>> - block_job: rename block_job functions called with job_mutex held
>>     *given* that some functions are always under lock, transform
>>     them in _locked. Requires the job_lock/unlock patch
>>
>> - job.h: define unlocked functions
>>     Comments on the public functions that are not _locked
>>
>>
>> @Kevin, since you also had some feedbacks on the patch ordering, do you
>> agree with this ordering or you have some other ideas?
>>
>> Following your suggestion, we could move patches 10-11-12-13 before
>> patch 6 "jobs: protect jobs with job_lock/unlock".
>>
>> (Apologies for changing my mind, but being the second complain I am
>> starting to reconsider reordering the patches).
>>
> 
> In two words, what I mean: let's keep the following invariant from patch
> to patch:
> 
> 1. Function that has _locked() prefix is always called with lock held
> 2. Function that has _locked() prefix never calls functions that take
> lock by themselves so that would dead-lock
> 3. Function that is documented as "called with lock not held" is never
> called with lock held
> 
> That what I mean by "formal correctness": yes, we know that lock is
> noop, but still let's keep code logic to correspond function naming and
> comments that we add.
> 
> 

Ok so far I did the following:

- duplicated each public function as static {function}_locked()
- made sure all functions in job.c call only _locked() functions, since
the lock is always taken internally

Now, we need to do the same also for blockjob API in blockjob.h
The only problem is that in order to use and create functions like
block_job_get_locked(), we need:
- job_get_locked() to be public, and can't be just replacing job_get()
because it is still used everywhere
- block_job_get_locked() to be public too, since it is used in other
files like blockdev.c

so we will have:
Job *job_get()
Job *job_get_locked()

BlockJob *block_job_get(const char *id)
BlockJob *block_job_get_locked(const char *id)


Therefore with this approach I need to make all _locked() functions
public, duplicating the API. Is that what you want?

Emanuele
Vladimir Sementsov-Ogievskiy June 28, 2022, 10:47 a.m. UTC | #12
On 6/28/22 10:40, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 22/06/2022 um 20:38 schrieb Vladimir Sementsov-Ogievskiy:
>> On 6/22/22 17:26, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:
>>>> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
>>>>> With the*nop*  job_lock/unlock placed, rename the static
>>>>> functions that are always under job_mutex, adding "_locked" suffix.
>>>>>
>>>>> List of functions that get this suffix:
>>>>> job_txn_ref           job_txn_del_job
>>>>> job_txn_apply           job_state_transition
>>>>> job_should_pause       job_event_cancelled
>>>>> job_event_completed       job_event_pending
>>>>> job_event_ready           job_event_idle
>>>>> job_do_yield           job_timer_not_pending
>>>>> job_do_dismiss           job_conclude
>>>>> job_update_rc           job_commit
>>>>> job_abort           job_clean
>>>>> job_finalize_single       job_cancel_async
>>>>> job_completed_txn_abort       job_prepare
>>>>> job_needs_finalize       job_do_finalize
>>>>> job_transition_to_pending  job_completed_txn_success
>>>>> job_completed           job_cancel_err
>>>>> job_force_cancel_err
>>>>>
>>>>> Note that "locked" refers to the*nop*  job_lock/unlock, and not
>>>>> real_job_lock/unlock.
>>>>>
>>>>> No functional change intended.
>>>>>
>>>>> Signed-off-by: Emanuele Giuseppe Esposito<eesposit@redhat.com>
>>>>
>>>>
>>>> Hmm. Maybe it was already discussed.. But for me it seems, that it would
>>>> be simpler to review previous patches, that fix job_ API users to use
>>>> locking properly, if this renaming go earlier.
>>>>
>>>> Anyway, in this series, we can't update everything at once. So patch to
>>>> patch, we make the code more and more correct. (yes I remember that
>>>> lock() is a noop, but I should review thinking that it real, otherwise,
>>>> how to review?)
>>>>
>>>> So, I'm saying about formal correctness of using lock() unlock()
>>>> function in connection with introduced _locked prifixes and in
>>>> connection with how it should finally work.
>>>>
>>>> You do:
>>>>
>>>> 05. introduce some _locked functions, that just duplicates, and
>>>> job_pause_point_locked() is formally inconsistent, as I said.
>>>>
>>>> 06. Update a lot of places, to give them their final form (but not
>>>> final, as some functions will be renamed to _locked, some not, hard to
>>>> imagine)
>>>>
>>>> 07,08,09. Update some more, and even more places. very hard to track
>>>> formal correctness of using locks
>>>>
>>>> 10-...: rename APIs.
>>>>
>>>>
>>>> What do you think about the following:
>>>>
>>>> 1. Introduce noop lock, and some internal _locked() versions, and keep
>>>> formal consistency inside job.c, considering all public interfaces as
>>>> unlocked:
>>>>
>>>>    at this point:
>>>>     - everything correct inside job.c
>>>>     - no public interfaces with _locked prefix
>>>>     - all public interfaces take mutex internally
>>>>     - no external user take mutex by hand
>>>>
>>>> We can rename all internal static functions at this step too.
>>>>
>>>> 2. Introduce some public _locked APIs, that we'll use in next patches
>>>>
>>>> 3. Now start fixing external users in several patches:
>>>>       - protect by mutex direct use of job fields
>>>>     - make wider locks and move to _locked APIs inside them where needed
>>>>
>>>>
>>>> In this scenario, every updated unit becomes formally correct after
>>>> update, and after all steps everything is formally correct, and we can
>>>> move to turning-on the mutex.
>>>>
>>>
>>> I don't understand your logic also here, sorry :(
>>>
>>> I assume you want to keep patch 1-4, then the problem is assing job_lock
>>> and renaming functions in _locked.
>>> So I would say the problem is in patch 5-6-10-11-12-13. All the others
>>> should be self contained.
>>>
>>> I understand patch 5 is a little hard to follow.
>>>
>>> Now, I am not sure what you propose here but it seems that the end goal
>>> is to just have the same result, but with additional intermediate steps
>>> that are just "do this just because in the next patch will be useful".
>>> I think the problem is that we are going to miss the "why we need the
>>> lock" logic in the patches if we do so.
>>>
>>> The logic I tried to convey in this order is the following:
>>> - job.h: add _locked duplicates for job API functions called with and
>>> without job_mutex
>>>      Just create duplicates of functions
>>>
>>> - jobs: protect jobs with job_lock/unlock
>>>      QMP and monitor functions call APIs that assume lock is taken,
>>>      drivers must take explicitly the lock
>>>
>>> - jobs: rename static functions called with job_mutex held
>>> - job.h: rename job API functions called with job_mutex held
>>> - block_job: rename block_job functions called with job_mutex held
>>>      *given* that some functions are always under lock, transform
>>>      them in _locked. Requires the job_lock/unlock patch
>>>
>>> - job.h: define unlocked functions
>>>      Comments on the public functions that are not _locked
>>>
>>>
>>> @Kevin, since you also had some feedbacks on the patch ordering, do you
>>> agree with this ordering or you have some other ideas?
>>>
>>> Following your suggestion, we could move patches 10-11-12-13 before
>>> patch 6 "jobs: protect jobs with job_lock/unlock".
>>>
>>> (Apologies for changing my mind, but being the second complain I am
>>> starting to reconsider reordering the patches).
>>>
>>
>> In two words, what I mean: let's keep the following invariant from patch
>> to patch:
>>
>> 1. Function that has _locked() prefix is always called with lock held
>> 2. Function that has _locked() prefix never calls functions that take
>> lock by themselves so that would dead-lock
>> 3. Function that is documented as "called with lock not held" is never
>> called with lock held
>>
>> That what I mean by "formal correctness": yes, we know that lock is
>> noop, but still let's keep code logic to correspond function naming and
>> comments that we add.
>>
>>
> 
> Ok so far I did the following:
> 
> - duplicated each public function as static {function}_locked()

They shouldn't be duplicates: function without _locked suffix should take the mutex.

> - made sure all functions in job.c call only _locked() functions, since
> the lock is always taken internally
> 
> Now, we need to do the same also for blockjob API in blockjob.h
> The only problem is that in order to use and create functions like
> block_job_get_locked(), we need:
> - job_get_locked() to be public, and can't be just replacing job_get()
> because it is still used everywhere
> - block_job_get_locked() to be public too, since it is used in other
> files like blockdev.c
> 
> so we will have:
> Job *job_get()
> Job *job_get_locked()
> 
> BlockJob *block_job_get(const char *id)
> BlockJob *block_job_get_locked(const char *id)
> 
> 
> Therefore with this approach I need to make all _locked() functions
> public, duplicating the API. Is that what you want?
> 

I don't see any problem in it. After the whole update we can drop public APIs that are unused.
Emanuele Giuseppe Esposito June 28, 2022, 1:04 p.m. UTC | #13
Am 28/06/2022 um 12:47 schrieb Vladimir Sementsov-Ogievskiy:
> On 6/28/22 10:40, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 22/06/2022 um 20:38 schrieb Vladimir Sementsov-Ogievskiy:
>>> On 6/22/22 17:26, Emanuele Giuseppe Esposito wrote:
>>>>
>>>>
>>>> Am 21/06/2022 um 19:26 schrieb Vladimir Sementsov-Ogievskiy:
>>>>> On 6/16/22 16:18, Emanuele Giuseppe Esposito wrote:
>>>>>> With the*nop*  job_lock/unlock placed, rename the static
>>>>>> functions that are always under job_mutex, adding "_locked" suffix.
>>>>>>
>>>>>> List of functions that get this suffix:
>>>>>> job_txn_ref           job_txn_del_job
>>>>>> job_txn_apply           job_state_transition
>>>>>> job_should_pause       job_event_cancelled
>>>>>> job_event_completed       job_event_pending
>>>>>> job_event_ready           job_event_idle
>>>>>> job_do_yield           job_timer_not_pending
>>>>>> job_do_dismiss           job_conclude
>>>>>> job_update_rc           job_commit
>>>>>> job_abort           job_clean
>>>>>> job_finalize_single       job_cancel_async
>>>>>> job_completed_txn_abort       job_prepare
>>>>>> job_needs_finalize       job_do_finalize
>>>>>> job_transition_to_pending  job_completed_txn_success
>>>>>> job_completed           job_cancel_err
>>>>>> job_force_cancel_err
>>>>>>
>>>>>> Note that "locked" refers to the*nop*  job_lock/unlock, and not
>>>>>> real_job_lock/unlock.
>>>>>>
>>>>>> No functional change intended.
>>>>>>
>>>>>> Signed-off-by: Emanuele Giuseppe Esposito<eesposit@redhat.com>
>>>>>
>>>>>
>>>>> Hmm. Maybe it was already discussed.. But for me it seems, that it
>>>>> would
>>>>> be simpler to review previous patches, that fix job_ API users to use
>>>>> locking properly, if this renaming go earlier.
>>>>>
>>>>> Anyway, in this series, we can't update everything at once. So
>>>>> patch to
>>>>> patch, we make the code more and more correct. (yes I remember that
>>>>> lock() is a noop, but I should review thinking that it real,
>>>>> otherwise,
>>>>> how to review?)
>>>>>
>>>>> So, I'm saying about formal correctness of using lock() unlock()
>>>>> function in connection with introduced _locked prifixes and in
>>>>> connection with how it should finally work.
>>>>>
>>>>> You do:
>>>>>
>>>>> 05. introduce some _locked functions, that just duplicates, and
>>>>> job_pause_point_locked() is formally inconsistent, as I said.
>>>>>
>>>>> 06. Update a lot of places, to give them their final form (but not
>>>>> final, as some functions will be renamed to _locked, some not, hard to
>>>>> imagine)
>>>>>
>>>>> 07,08,09. Update some more, and even more places. very hard to track
>>>>> formal correctness of using locks
>>>>>
>>>>> 10-...: rename APIs.
>>>>>
>>>>>
>>>>> What do you think about the following:
>>>>>
>>>>> 1. Introduce noop lock, and some internal _locked() versions, and keep
>>>>> formal consistency inside job.c, considering all public interfaces as
>>>>> unlocked:
>>>>>
>>>>>    at this point:
>>>>>     - everything correct inside job.c
>>>>>     - no public interfaces with _locked prefix
>>>>>     - all public interfaces take mutex internally
>>>>>     - no external user take mutex by hand
>>>>>
>>>>> We can rename all internal static functions at this step too.
>>>>>
>>>>> 2. Introduce some public _locked APIs, that we'll use in next patches
>>>>>
>>>>> 3. Now start fixing external users in several patches:
>>>>>       - protect by mutex direct use of job fields
>>>>>     - make wider locks and move to _locked APIs inside them where
>>>>> needed
>>>>>
>>>>>
>>>>> In this scenario, every updated unit becomes formally correct after
>>>>> update, and after all steps everything is formally correct, and we can
>>>>> move to turning-on the mutex.
>>>>>
>>>>
>>>> I don't understand your logic also here, sorry :(
>>>>
>>>> I assume you want to keep patch 1-4, then the problem is assing
>>>> job_lock
>>>> and renaming functions in _locked.
>>>> So I would say the problem is in patch 5-6-10-11-12-13. All the others
>>>> should be self contained.
>>>>
>>>> I understand patch 5 is a little hard to follow.
>>>>
>>>> Now, I am not sure what you propose here but it seems that the end goal
>>>> is to just have the same result, but with additional intermediate steps
>>>> that are just "do this just because in the next patch will be useful".
>>>> I think the problem is that we are going to miss the "why we need the
>>>> lock" logic in the patches if we do so.
>>>>
>>>> The logic I tried to convey in this order is the following:
>>>> - job.h: add _locked duplicates for job API functions called with and
>>>> without job_mutex
>>>>      Just create duplicates of functions
>>>>
>>>> - jobs: protect jobs with job_lock/unlock
>>>>      QMP and monitor functions call APIs that assume lock is taken,
>>>>      drivers must take explicitly the lock
>>>>
>>>> - jobs: rename static functions called with job_mutex held
>>>> - job.h: rename job API functions called with job_mutex held
>>>> - block_job: rename block_job functions called with job_mutex held
>>>>      *given* that some functions are always under lock, transform
>>>>      them in _locked. Requires the job_lock/unlock patch
>>>>
>>>> - job.h: define unlocked functions
>>>>      Comments on the public functions that are not _locked
>>>>
>>>>
>>>> @Kevin, since you also had some feedbacks on the patch ordering, do you
>>>> agree with this ordering or you have some other ideas?
>>>>
>>>> Following your suggestion, we could move patches 10-11-12-13 before
>>>> patch 6 "jobs: protect jobs with job_lock/unlock".
>>>>
>>>> (Apologies for changing my mind, but being the second complain I am
>>>> starting to reconsider reordering the patches).
>>>>
>>>
>>> In two words, what I mean: let's keep the following invariant from patch
>>> to patch:
>>>
>>> 1. Function that has _locked() prefix is always called with lock held
>>> 2. Function that has _locked() prefix never calls functions that take
>>> lock by themselves so that would dead-lock
>>> 3. Function that is documented as "called with lock not held" is never
>>> called with lock held
>>>
>>> That what I mean by "formal correctness": yes, we know that lock is
>>> noop, but still let's keep code logic to correspond function naming and
>>> comments that we add.
>>>
>>>
>>
>> Ok so far I did the following:
>>
>> - duplicated each public function as static {function}_locked()
> 
> They shouldn't be duplicates: function without _locked suffix should
> take the mutex.

By "duplicate" I mean same function name, with just _locked suffix.
Maybe a better definition?

Almost done preparing the patches!

Emanuele

> 
>> - made sure all functions in job.c call only _locked() functions, since
>> the lock is always taken internally
>>
>> Now, we need to do the same also for blockjob API in blockjob.h
>> The only problem is that in order to use and create functions like
>> block_job_get_locked(), we need:
>> - job_get_locked() to be public, and can't be just replacing job_get()
>> because it is still used everywhere
>> - block_job_get_locked() to be public too, since it is used in other
>> files like blockdev.c
>>
>> so we will have:
>> Job *job_get()
>> Job *job_get_locked()
>>
>> BlockJob *block_job_get(const char *id)
>> BlockJob *block_job_get_locked(const char *id)
>>
>>
>> Therefore with this approach I need to make all _locked() functions
>> public, duplicating the API. Is that what you want?
>>
> 
> I don't see any problem in it. After the whole update we can drop public
> APIs that are unused.
> 
>
Vladimir Sementsov-Ogievskiy June 28, 2022, 3:22 p.m. UTC | #14
On 6/28/22 16:04, Emanuele Giuseppe Esposito wrote:
>>> Ok so far I did the following:
>>>
>>> - duplicated each public function as static {function}_locked()
>> They shouldn't be duplicates: function without _locked suffix should
>> take the mutex.
> By "duplicate" I mean same function name, with just _locked suffix.
> Maybe a better definition?
> 
> Almost done preparing the patches!

Why not just add _locked version and rework the version without suffix to call _locked under mutex one in one patch, to just keep it all meaningful?
Vladimir Sementsov-Ogievskiy June 28, 2022, 3:26 p.m. UTC | #15
On 6/28/22 18:22, Vladimir Sementsov-Ogievskiy wrote:
> On 6/28/22 16:04, Emanuele Giuseppe Esposito wrote:
>>>> Ok so far I did the following:
>>>>
>>>> - duplicated each public function as static {function}_locked()
>>> They shouldn't be duplicates: function without _locked suffix should
>>> take the mutex.
>> By "duplicate" I mean same function name, with just _locked suffix.
>> Maybe a better definition?
>>
>> Almost done preparing the patches!
> 
> Why not just add _locked version and rework the version without suffix to call _locked under mutex one in one patch, to just keep it all meaningful?
> 

I mean, instead of:

patch 1: add a _locked() duplicate

   At this point we have a duplicated function that's just bad practice.

patch 2: remake version without prefix to call _locked() under mutex
  
   Now everything is correct. But we have to track the moment when something strange becomes something correct.


do just

patch 1: rename function to _locked() and add a wrapper without suffix, that calls _locked() under mutex
Emanuele Giuseppe Esposito June 28, 2022, 5:28 p.m. UTC | #16
Am 28/06/2022 um 17:26 schrieb Vladimir Sementsov-Ogievskiy:
> On 6/28/22 18:22, Vladimir Sementsov-Ogievskiy wrote:
>> On 6/28/22 16:04, Emanuele Giuseppe Esposito wrote:
>>>>> Ok so far I did the following:
>>>>>
>>>>> - duplicated each public function as static {function}_locked()
>>>> They shouldn't be duplicates: function without _locked suffix should
>>>> take the mutex.
>>> By "duplicate" I mean same function name, with just _locked suffix.
>>> Maybe a better definition?
>>>
>>> Almost done preparing the patches!
>>
>> Why not just add _locked version and rework the version without suffix
>> to call _locked under mutex one in one patch, to just keep it all
>> meaningful?
>>
> 
> I mean, instead of:
> 
> patch 1: add a _locked() duplicate
> 
>   At this point we have a duplicated function that's just bad practice.
> 
> patch 2: remake version without prefix to call _locked() under mutex
>  
>   Now everything is correct. But we have to track the moment when
> something strange becomes something correct.
> 
> 
> do just
> 
> patch 1: rename function to _locked() and add a wrapper without suffix,
> that calls _locked() under mutex
> 
> 

That's what I always intended to do. As I said, I just used the wrong word.

Emanuele
Vladimir Sementsov-Ogievskiy June 28, 2022, 7:42 p.m. UTC | #17
On 6/28/22 20:28, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 28/06/2022 um 17:26 schrieb Vladimir Sementsov-Ogievskiy:
>> On 6/28/22 18:22, Vladimir Sementsov-Ogievskiy wrote:
>>> On 6/28/22 16:04, Emanuele Giuseppe Esposito wrote:
>>>>>> Ok so far I did the following:
>>>>>>
>>>>>> - duplicated each public function as static {function}_locked()
>>>>> They shouldn't be duplicates: function without _locked suffix should
>>>>> take the mutex.
>>>> By "duplicate" I mean same function name, with just _locked suffix.
>>>> Maybe a better definition?
>>>>
>>>> Almost done preparing the patches!
>>>
>>> Why not just add _locked version and rework the version without suffix
>>> to call _locked under mutex one in one patch, to just keep it all
>>> meaningful?
>>>
>>
>> I mean, instead of:
>>
>> patch 1: add a _locked() duplicate
>>
>>    At this point we have a duplicated function that's just bad practice.
>>
>> patch 2: remake version without prefix to call _locked() under mutex
>>   
>>    Now everything is correct. But we have to track the moment when
>> something strange becomes something correct.
>>
>>
>> do just
>>
>> patch 1: rename function to _locked() and add a wrapper without suffix,
>> that calls _locked() under mutex
>>
>>
> 
> That's what I always intended to do. As I said, I just used the wrong word.
> 

Ah, OK then, I misunderstood.
diff mbox series

Patch

diff --git a/job.c b/job.c
index 55b92b2332..4f4b387625 100644
--- a/job.c
+++ b/job.c
@@ -113,7 +113,8 @@  JobTxn *job_txn_new(void)
     return txn;
 }
 
-static void job_txn_ref(JobTxn *txn)
+/* Called with job_mutex held. */
+static void job_txn_ref_locked(JobTxn *txn)
 {
     txn->refcnt++;
 }
@@ -145,10 +146,11 @@  static void job_txn_add_job(JobTxn *txn, Job *job)
     job->txn = txn;
 
     QLIST_INSERT_HEAD(&txn->jobs, job, txn_list);
-    job_txn_ref(txn);
+    job_txn_ref_locked(txn);
 }
 
-static void job_txn_del_job(Job *job)
+/* Called with job_mutex held. */
+static void job_txn_del_job_locked(Job *job)
 {
     if (job->txn) {
         QLIST_REMOVE(job, txn_list);
@@ -157,7 +159,8 @@  static void job_txn_del_job(Job *job)
     }
 }
 
-static int job_txn_apply(Job *job, int fn(Job *))
+/* Called with job_mutex held. */
+static int job_txn_apply_locked(Job *job, int fn(Job *))
 {
     AioContext *inner_ctx;
     Job *other_job, *next;
@@ -165,10 +168,10 @@  static int job_txn_apply(Job *job, int fn(Job *))
     int rc = 0;
 
     /*
-     * Similar to job_completed_txn_abort, we take each job's lock before
-     * applying fn, but since we assume that outer_ctx is held by the caller,
-     * we need to release it here to avoid holding the lock twice - which would
-     * break AIO_WAIT_WHILE from within fn.
+     * Similar to job_completed_txn_abort_locked, we take each job's lock
+     * before applying fn, but since we assume that outer_ctx is held by
+     * the caller, we need to release it here to avoid holding the lock
+     * twice - which would break AIO_WAIT_WHILE from within fn.
      */
     job_ref(job);
     aio_context_release(job->aio_context);
@@ -197,7 +200,8 @@  bool job_is_internal(Job *job)
     return (job->id == NULL);
 }
 
-static void job_state_transition(Job *job, JobStatus s1)
+/* Called with job_mutex held. */
+static void job_state_transition_locked(Job *job, JobStatus s1)
 {
     JobStatus s0 = job->status;
     assert(s1 >= 0 && s1 < JOB_STATUS__MAX);
@@ -322,7 +326,8 @@  static bool job_started(Job *job)
     return job->co;
 }
 
-static bool job_should_pause(Job *job)
+/* Called with job_mutex held. */
+static bool job_should_pause_locked(Job *job)
 {
     return job->pause_count > 0;
 }
@@ -402,7 +407,7 @@  void *job_create(const char *job_id, const JobDriver *driver, JobTxn *txn,
     notifier_list_init(&job->on_ready);
     notifier_list_init(&job->on_idle);
 
-    job_state_transition(job, JOB_STATUS_CREATED);
+    job_state_transition_locked(job, JOB_STATUS_CREATED);
     aio_timer_init(qemu_get_aio_context(), &job->sleep_timer,
                    QEMU_CLOCK_REALTIME, SCALE_NS,
                    job_sleep_timer_cb, job);
@@ -468,31 +473,36 @@  void job_progress_increase_remaining(Job *job, uint64_t delta)
 
 /**
  * To be called when a cancelled job is finalised.
+ * Called with job_mutex held.
  */
-static void job_event_cancelled(Job *job)
+static void job_event_cancelled_locked(Job *job)
 {
     notifier_list_notify(&job->on_finalize_cancelled, job);
 }
 
 /**
  * To be called when a successfully completed job is finalised.
+ * Called with job_mutex held.
  */
-static void job_event_completed(Job *job)
+static void job_event_completed_locked(Job *job)
 {
     notifier_list_notify(&job->on_finalize_completed, job);
 }
 
-static void job_event_pending(Job *job)
+/* Called with job_mutex held. */
+static void job_event_pending_locked(Job *job)
 {
     notifier_list_notify(&job->on_pending, job);
 }
 
-static void job_event_ready(Job *job)
+/* Called with job_mutex held. */
+static void job_event_ready_locked(Job *job)
 {
     notifier_list_notify(&job->on_ready, job);
 }
 
-static void job_event_idle(Job *job)
+/* Called with job_mutex held. */
+static void job_event_idle_locked(Job *job)
 {
     notifier_list_notify(&job->on_idle, job);
 }
@@ -530,20 +540,24 @@  void job_enter(Job *job)
     job_enter_cond(job, NULL);
 }
 
-/* Yield, and schedule a timer to reenter the coroutine after @ns nanoseconds.
+/*
+ * Yield, and schedule a timer to reenter the coroutine after @ns nanoseconds.
  * Reentering the job coroutine with job_enter() before the timer has expired
  * is allowed and cancels the timer.
  *
  * If @ns is (uint64_t) -1, no timer is scheduled and job_enter() must be
- * called explicitly. */
-static void coroutine_fn job_do_yield(Job *job, uint64_t ns)
+ * called explicitly.
+ *
+ * Called with job_mutex held, but releases it temporarily.
+ */
+static void coroutine_fn job_do_yield_locked(Job *job, uint64_t ns)
 {
     real_job_lock();
     if (ns != -1) {
         timer_mod(&job->sleep_timer, ns);
     }
     job->busy = false;
-    job_event_idle(job);
+    job_event_idle_locked(job);
     real_job_unlock();
     job_unlock();
     qemu_coroutine_yield();
@@ -558,7 +572,7 @@  static void coroutine_fn job_pause_point_locked(Job *job)
 {
     assert(job && job_started(job));
 
-    if (!job_should_pause(job)) {
+    if (!job_should_pause_locked(job)) {
         return;
     }
     if (job_is_cancelled_locked(job)) {
@@ -571,15 +585,15 @@  static void coroutine_fn job_pause_point_locked(Job *job)
         job_lock();
     }
 
-    if (job_should_pause(job) && !job_is_cancelled_locked(job)) {
+    if (job_should_pause_locked(job) && !job_is_cancelled_locked(job)) {
         JobStatus status = job->status;
-        job_state_transition(job, status == JOB_STATUS_READY
-                                  ? JOB_STATUS_STANDBY
-                                  : JOB_STATUS_PAUSED);
+        job_state_transition_locked(job, status == JOB_STATUS_READY
+                                    ? JOB_STATUS_STANDBY
+                                    : JOB_STATUS_PAUSED);
         job->paused = true;
-        job_do_yield(job, -1);
+        job_do_yield_locked(job, -1);
         job->paused = false;
-        job_state_transition(job, status);
+        job_state_transition_locked(job, status);
     }
 
     if (job->driver->resume) {
@@ -605,8 +619,8 @@  void job_yield(Job *job)
         return;
     }
 
-    if (!job_should_pause(job)) {
-        job_do_yield(job, -1);
+    if (!job_should_pause_locked(job)) {
+        job_do_yield_locked(job, -1);
     }
 
     job_pause_point_locked(job);
@@ -622,15 +636,15 @@  void coroutine_fn job_sleep_ns(Job *job, int64_t ns)
         return;
     }
 
-    if (!job_should_pause(job)) {
-        job_do_yield(job, qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + ns);
+    if (!job_should_pause_locked(job)) {
+        job_do_yield_locked(job, qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + ns);
     }
 
     job_pause_point_locked(job);
 }
 
 /* Assumes the job_mutex is held */
-static bool job_timer_not_pending(Job *job)
+static bool job_timer_not_pending_locked(Job *job)
 {
     return !timer_pending(&job->sleep_timer);
 }
@@ -652,7 +666,7 @@  void job_resume(Job *job)
     }
 
     /* kick only if no timer is pending */
-    job_enter_cond(job, job_timer_not_pending);
+    job_enter_cond(job, job_timer_not_pending_locked);
 }
 
 void job_user_pause(Job *job, Error **errp)
@@ -693,16 +707,17 @@  void job_user_resume(Job *job, Error **errp)
     job_resume(job);
 }
 
-static void job_do_dismiss(Job *job)
+/* Called with job_mutex held. */
+static void job_do_dismiss_locked(Job *job)
 {
     assert(job);
     job->busy = false;
     job->paused = false;
     job->deferred_to_main_loop = true;
 
-    job_txn_del_job(job);
+    job_txn_del_job_locked(job);
 
-    job_state_transition(job, JOB_STATUS_NULL);
+    job_state_transition_locked(job, JOB_STATUS_NULL);
     job_unref(job);
 }
 
@@ -715,7 +730,7 @@  void job_dismiss(Job **jobptr, Error **errp)
         return;
     }
 
-    job_do_dismiss(job);
+    job_do_dismiss_locked(job);
     *jobptr = NULL;
 }
 
@@ -723,18 +738,20 @@  void job_early_fail(Job *job)
 {
     JOB_LOCK_GUARD();
     assert(job->status == JOB_STATUS_CREATED);
-    job_do_dismiss(job);
+    job_do_dismiss_locked(job);
 }
 
-static void job_conclude(Job *job)
+/* Called with job_mutex held. */
+static void job_conclude_locked(Job *job)
 {
-    job_state_transition(job, JOB_STATUS_CONCLUDED);
+    job_state_transition_locked(job, JOB_STATUS_CONCLUDED);
     if (job->auto_dismiss || !job_started(job)) {
-        job_do_dismiss(job);
+        job_do_dismiss_locked(job);
     }
 }
 
-static void job_update_rc(Job *job)
+/* Called with job_mutex held. */
+static void job_update_rc_locked(Job *job)
 {
     if (!job->ret && job_is_cancelled_locked(job)) {
         job->ret = -ECANCELED;
@@ -743,11 +760,12 @@  static void job_update_rc(Job *job)
         if (!job->err) {
             error_setg(&job->err, "%s", strerror(-job->ret));
         }
-        job_state_transition(job, JOB_STATUS_ABORTING);
+        job_state_transition_locked(job, JOB_STATUS_ABORTING);
     }
 }
 
-static void job_commit(Job *job)
+/* Called with job_mutex held, but releases it temporarily. */
+static void job_commit_locked(Job *job)
 {
     assert(!job->ret);
     GLOBAL_STATE_CODE();
@@ -758,7 +776,8 @@  static void job_commit(Job *job)
     }
 }
 
-static void job_abort(Job *job)
+/* Called with job_mutex held, but releases it temporarily. */
+static void job_abort_locked(Job *job)
 {
     assert(job->ret);
     GLOBAL_STATE_CODE();
@@ -769,7 +788,8 @@  static void job_abort(Job *job)
     }
 }
 
-static void job_clean(Job *job)
+/* Called with job_mutex held, but releases it temporarily. */
+static void job_clean_locked(Job *job)
 {
     GLOBAL_STATE_CODE();
     if (job->driver->clean) {
@@ -779,21 +799,22 @@  static void job_clean(Job *job)
     }
 }
 
-static int job_finalize_single(Job *job)
+/* Called with job_mutex held, but releases it temporarily. */
+static int job_finalize_single_locked(Job *job)
 {
     int job_ret;
 
     assert(job_is_completed_locked(job));
 
     /* Ensure abort is called for late-transactional failures */
-    job_update_rc(job);
+    job_update_rc_locked(job);
 
     if (!job->ret) {
-        job_commit(job);
+        job_commit_locked(job);
     } else {
-        job_abort(job);
+        job_abort_locked(job);
     }
-    job_clean(job);
+    job_clean_locked(job);
 
     if (job->cb) {
         job_ret = job->ret;
@@ -805,18 +826,19 @@  static int job_finalize_single(Job *job)
     /* Emit events only if we actually started */
     if (job_started(job)) {
         if (job_is_cancelled_locked(job)) {
-            job_event_cancelled(job);
+            job_event_cancelled_locked(job);
         } else {
-            job_event_completed(job);
+            job_event_completed_locked(job);
         }
     }
 
-    job_txn_del_job(job);
-    job_conclude(job);
+    job_txn_del_job_locked(job);
+    job_conclude_locked(job);
     return 0;
 }
 
-static void job_cancel_async(Job *job, bool force)
+/* Called with job_mutex held, but releases it temporarily. */
+static void job_cancel_async_locked(Job *job, bool force)
 {
     GLOBAL_STATE_CODE();
     if (job->driver->cancel) {
@@ -854,7 +876,8 @@  static void job_cancel_async(Job *job, bool force)
     }
 }
 
-static void job_completed_txn_abort(Job *job)
+/* Called with job_mutex held. */
+static void job_completed_txn_abort_locked(Job *job)
 {
     AioContext *ctx;
     JobTxn *txn = job->txn;
@@ -867,12 +890,12 @@  static void job_completed_txn_abort(Job *job)
         return;
     }
     txn->aborting = true;
-    job_txn_ref(txn);
+    job_txn_ref_locked(txn);
 
     /*
      * We can only hold the single job's AioContext lock while calling
-     * job_finalize_single() because the finalization callbacks can involve
-     * calls of AIO_WAIT_WHILE(), which could deadlock otherwise.
+     * job_finalize_single_locked() because the finalization callbacks can
+     * involve calls of AIO_WAIT_WHILE(), which could deadlock otherwise.
      * Note that the job's AioContext may change when it is finalized.
      */
     job_ref(job);
@@ -890,7 +913,7 @@  static void job_completed_txn_abort(Job *job)
              * Therefore, pass force=true to terminate all other jobs as quickly
              * as possible.
              */
-            job_cancel_async(other_job, true);
+            job_cancel_async_locked(other_job, true);
             aio_context_release(ctx);
         }
     }
@@ -906,13 +929,13 @@  static void job_completed_txn_abort(Job *job)
             assert(job_cancel_requested_locked(other_job));
             job_finish_sync(other_job, NULL, NULL);
         }
-        job_finalize_single(other_job);
+        job_finalize_single_locked(other_job);
         aio_context_release(ctx);
     }
 
     /*
      * Use job_ref()/job_unref() so we can read the AioContext here
-     * even if the job went away during job_finalize_single().
+     * even if the job went away during job_finalize_single_locked().
      */
     aio_context_acquire(job->aio_context);
     job_unref(job);
@@ -920,7 +943,8 @@  static void job_completed_txn_abort(Job *job)
     job_txn_unref(txn);
 }
 
-static int job_prepare(Job *job)
+/* Called with job_mutex held, but releases it temporarily. */
+static int job_prepare_locked(Job *job)
 {
     int ret;
 
@@ -930,27 +954,29 @@  static int job_prepare(Job *job)
         ret = job->driver->prepare(job);
         job_lock();
         job->ret = ret;
-        job_update_rc(job);
+        job_update_rc_locked(job);
     }
     return job->ret;
 }
 
-static int job_needs_finalize(Job *job)
+/* Called with job_mutex held. */
+static int job_needs_finalize_locked(Job *job)
 {
     return !job->auto_finalize;
 }
 
-static void job_do_finalize(Job *job)
+/* Called with job_mutex held. */
+static void job_do_finalize_locked(Job *job)
 {
     int rc;
     assert(job && job->txn);
 
     /* prepare the transaction to complete */
-    rc = job_txn_apply(job, job_prepare);
+    rc = job_txn_apply_locked(job, job_prepare_locked);
     if (rc) {
-        job_completed_txn_abort(job);
+        job_completed_txn_abort_locked(job);
     } else {
-        job_txn_apply(job, job_finalize_single);
+        job_txn_apply_locked(job, job_finalize_single_locked);
     }
 }
 
@@ -960,14 +986,15 @@  void job_finalize(Job *job, Error **errp)
     if (job_apply_verb(job, JOB_VERB_FINALIZE, errp)) {
         return;
     }
-    job_do_finalize(job);
+    job_do_finalize_locked(job);
 }
 
-static int job_transition_to_pending(Job *job)
+/* Called with job_mutex held. */
+static int job_transition_to_pending_locked(Job *job)
 {
-    job_state_transition(job, JOB_STATUS_PENDING);
+    job_state_transition_locked(job, JOB_STATUS_PENDING);
     if (!job->auto_finalize) {
-        job_event_pending(job);
+        job_event_pending_locked(job);
     }
     return 0;
 }
@@ -975,16 +1002,17 @@  static int job_transition_to_pending(Job *job)
 void job_transition_to_ready(Job *job)
 {
     JOB_LOCK_GUARD();
-    job_state_transition(job, JOB_STATUS_READY);
-    job_event_ready(job);
+    job_state_transition_locked(job, JOB_STATUS_READY);
+    job_event_ready_locked(job);
 }
 
-static void job_completed_txn_success(Job *job)
+/* Called with job_mutex held. */
+static void job_completed_txn_success_locked(Job *job)
 {
     JobTxn *txn = job->txn;
     Job *other_job;
 
-    job_state_transition(job, JOB_STATUS_WAITING);
+    job_state_transition_locked(job, JOB_STATUS_WAITING);
 
     /*
      * Successful completion, see if there are other running jobs in this
@@ -997,24 +1025,25 @@  static void job_completed_txn_success(Job *job)
         assert(other_job->ret == 0);
     }
 
-    job_txn_apply(job, job_transition_to_pending);
+    job_txn_apply_locked(job, job_transition_to_pending_locked);
 
     /* If no jobs need manual finalization, automatically do so */
-    if (job_txn_apply(job, job_needs_finalize) == 0) {
-        job_do_finalize(job);
+    if (job_txn_apply_locked(job, job_needs_finalize_locked) == 0) {
+        job_do_finalize_locked(job);
     }
 }
 
-static void job_completed(Job *job)
+/* Called with job_mutex held. */
+static void job_completed_locked(Job *job)
 {
     assert(job && job->txn && !job_is_completed_locked(job));
 
-    job_update_rc(job);
+    job_update_rc_locked(job);
     trace_job_completed(job, job->ret);
     if (job->ret) {
-        job_completed_txn_abort(job);
+        job_completed_txn_abort_locked(job);
     } else {
-        job_completed_txn_success(job);
+        job_completed_txn_success_locked(job);
     }
 }
 
@@ -1036,15 +1065,16 @@  static void job_exit(void *opaque)
      * drain block nodes, and if .drained_poll still returned true, we would
      * deadlock. */
     job->busy = false;
-    job_event_idle(job);
+    job_event_idle_locked(job);
 
-    job_completed(job);
+    job_completed_locked(job);
 
     /*
-     * Note that calling job_completed can move the job to a different
-     * aio_context, so we cannot cache from above. job_txn_apply takes care of
-     * acquiring the new lock, and we ref/unref to avoid job_completed freeing
-     * the job underneath us.
+     * Note that calling job_completed_locked can move the job to a different
+     * aio_context, so we cannot cache from above.
+     * job_txn_apply_locked takes care of
+     * acquiring the new lock, and we ref/unref to avoid job_completed_locked
+     * freeing the job underneath us.
      */
     ctx = job->aio_context;
     job_unref(job);
@@ -1083,7 +1113,7 @@  void job_start(Job *job)
         job->pause_count--;
         job->busy = true;
         job->paused = false;
-        job_state_transition(job, JOB_STATUS_RUNNING);
+        job_state_transition_locked(job, JOB_STATUS_RUNNING);
     }
     aio_co_enter(job->aio_context, job->co);
 }
@@ -1091,25 +1121,25 @@  void job_start(Job *job)
 void job_cancel(Job *job, bool force)
 {
     if (job->status == JOB_STATUS_CONCLUDED) {
-        job_do_dismiss(job);
+        job_do_dismiss_locked(job);
         return;
     }
-    job_cancel_async(job, force);
+    job_cancel_async_locked(job, force);
     if (!job_started(job)) {
-        job_completed(job);
+        job_completed_locked(job);
     } else if (job->deferred_to_main_loop) {
         /*
-         * job_cancel_async() ignores soft-cancel requests for jobs
+         * job_cancel_async_locked() ignores soft-cancel requests for jobs
          * that are already done (i.e. deferred to the main loop).  We
          * have to check again whether the job is really cancelled.
          * (job_cancel_requested_locked() and job_is_cancelled_locked()
-         * are equivalent here, because job_cancel_async() will
+         * are equivalent here, because job_cancel_async_locked() will
          * make soft-cancel requests no-ops when deferred_to_main_loop is true.
          * We choose to call job_is_cancelled_locked() to show that we invoke
-         * job_completed_txn_abort() only for force-cancelled jobs.)
+         * job_completed_txn_abort_locked() only for force-cancelled jobs.)
          */
         if (job_is_cancelled_locked(job)) {
-            job_completed_txn_abort(job);
+            job_completed_txn_abort_locked(job);
         }
     } else {
         job_enter_cond(job, NULL);
@@ -1124,18 +1154,23 @@  void job_user_cancel(Job *job, bool force, Error **errp)
     job_cancel(job, force);
 }
 
-/* A wrapper around job_cancel() taking an Error ** parameter so it may be
+/*
+ * A wrapper around job_cancel() taking an Error ** parameter so it may be
  * used with job_finish_sync() without the need for (rather nasty) function
- * pointer casts there. */
-static void job_cancel_err(Job *job, Error **errp)
+ * pointer casts there.
+ *
+ * Called with job_mutex held.
+ */
+static void job_cancel_err_locked(Job *job, Error **errp)
 {
     job_cancel(job, false);
 }
 
 /**
- * Same as job_cancel_err(), but force-cancel.
+ * Same as job_cancel_err_locked(), but force-cancel.
+ * Called with job_mutex held.
  */
-static void job_force_cancel_err(Job *job, Error **errp)
+static void job_force_cancel_err_locked(Job *job, Error **errp)
 {
     job_cancel(job, true);
 }
@@ -1143,9 +1178,9 @@  static void job_force_cancel_err(Job *job, Error **errp)
 int job_cancel_sync(Job *job, bool force)
 {
     if (force) {
-        return job_finish_sync(job, &job_force_cancel_err, NULL);
+        return job_finish_sync(job, &job_force_cancel_err_locked, NULL);
     } else {
-        return job_finish_sync(job, &job_cancel_err, NULL);
+        return job_finish_sync(job, &job_cancel_err_locked, NULL);
     }
 }