diff mbox series

[iwl-net] igc: Ensure PTM request is completed before timeout has started

Message ID 20240708-igc-flush-ptm-request-before-timeout-6-10-v1-1-70e5ebec9efe@l-acoustics.com
State Superseded
Headers show
Series [iwl-net] igc: Ensure PTM request is completed before timeout has started | expand

Commit Message

Rodrigo Cataldo via B4 Relay July 8, 2024, 9:31 a.m. UTC
From: Rodrigo Cataldo <rodrigo.cadore@l-acoustics.com>

When a PTM is requested via wr32(IGC_PTM_STAT), the operation may only
be completed by the next read operation (flush). Unfortunately, the next
read operation in the PTM request loop happens after we have already
started evaluating the response timeout.

Thus, the following behavior has been observed::

  phc2sys-1655  [010]   103.233752: funcgraph_entry:                    |  igc_ptp_getcrosststamp() {
  phc2sys-1655  [010]   103.233754: funcgraph_entry:                    |    igc_phc_get_syncdevice_time() {
  phc2sys-1655  [010]   103.233755: funcgraph_entry:                    |      igc_rd32() {
  phc2sys-1655  [010]   103.233931: preempt_disable: caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
  phc2sys-1655  [010]   103.233932: local_timer_entry: vector=236
  phc2sys-1655  [010]   103.233932: hrtimer_cancel: hrtimer=0xffff8edeef526118
  phc2sys-1655  [010]   103.233933: hrtimer_expire_entry: hrtimer=0xffff8edeef526118 now=103200127876 function=tick_nohz_handler/0x0

  ... tick handler ...

  phc2sys-1655  [010]   103.233971: funcgraph_exit:       !  215.559 us |      }
  phc2sys-1655  [010]   103.233972: funcgraph_entry:                    |      igc_rd32() {
  phc2sys-1655  [010]   103.234135: funcgraph_exit:       !  164.370 us |      }
  phc2sys-1655  [010]   103.234136: funcgraph_entry:         1.942 us   |      igc_rd32();
  phc2sys-1655  [010]   103.234147: console:              igc 0000:03:00.0 enp3s0: Timeout reading IGC_PTM_STAT register

Based on the (simplified) code::

	ctrl = rd32(IGC_PTM_CTRL);
        /* simplified: multiple writes here */
	wr32(IGC_PTM_STAT, IGC_PTM_STAT_VALID);

	err = readx_poll_timeout(rd32, IGC_PTM_STAT, stat,
				 stat, IGC_PTM_STAT_SLEEP,
				 IGC_PTM_STAT_TIMEOUT);
	if (err < 0) {
		netdev_err(adapter->netdev, "Timeout reading IGC_PTM_STAT register\n");
		return err;
	}

Where readx_poll_timeout() starts the timeout evaluation before calling
the rd32() parameter (rd32() is a macro for igc_rd32()).

In the trace shown, the read operation of readx_poll_timeout() (second
igc_rd32()) took so long that the timeout (IGC_PTM_STAT_VALID) has expired
and no sleep has been performed.

With this patch, a write flush is added (which is an additional
igc_rd32() in practice) that can wait for the write before the timeout
is evaluated::

  phc2sys-1615  [010]    74.517954: funcgraph_entry:                    |  igc_ptp_getcrosststamp() {
  phc2sys-1615  [010]    74.517956: funcgraph_entry:                    |    igc_phc_get_syncdevicetime() {
  phc2sys-1615  [010]    74.517957: funcgraph_entry:                    |      igc_rd32() {
  phc2sys-1615  [010]    74.518127: preempt_disable: caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
  phc2sys-1615  [010]    74.518128: local_timer_entry: vector=236
  phc2sys-1615  [010]    74.518128: hrtimer_cancel: hrtimer=0xffff96466f526118
  phc2sys-1615  [010]    74.518128: hrtimer_expire_entry: hrtimer=0xffff96466f526118 now=74484007229 function=tick_nohz_handler/0x0

  ... tick handler ...

  phc2sys-1615  [010]    74.518180: funcgraph_exit:       !  222.282 us |      }
  phc2sys-1615  [010]    74.518181: funcgraph_entry:                    |      igc_rd32() {
  phc2sys-1615  [010]    74.518349: funcgraph_exit:       !  168.160 us |      }
  phc2sys-1615  [010]    74.518349: funcgraph_entry:         1.970 us   |      igc_rd32();
  phc2sys-1615  [010]    74.518352: hrtimer_init: hrtimer=0xffffa6f9413a3940 clockid=CLOCK_MONOTONIC mode=0x0
  phc2sys-1615  [010]    74.518352: preempt_disable: caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
  phc2sys-1615  [010]    74.518353: hrtimer_start: hrtimer=0xffffa6f9413a3940 function=hrtimer_wakeup/0x0 expires=74484232878 softexpires=74484231878

  .. hrtimer setup and return ...

  kworker/10:1-242   [010]    74.518382: sched_switch: kworker/10:1:242 [120] W ==> phc2sys:1615 [120]
  phc2sys-1615  [010]    74.518383: preempt_enable: caller=schedule+0x36 parent=schedule+0x36
  phc2sys-1615  [010]    74.518384: funcgraph_entry:      !  100.088 us |      igc_rd32();
  phc2sys-1615  [010]    74.518484: funcgraph_entry:         1.958 us   |      igc_rd32();
  phc2sys-1615  [010]    74.518488: funcgraph_entry:         2.019 us   |      igc_rd32();
  phc2sys-1615  [010]    74.518490: funcgraph_entry:         1.956 us   |      igc_rd32();
  phc2sys-1615  [010]    74.518492: funcgraph_entry:         1.980 us   |      igc_rd32();
  phc2sys-1615  [010]    74.518494: funcgraph_exit:       !  539.386 us |    }

Now the sleep is called as expected, and the operation succeeds.
Therefore, regardless of how long it will take for the write to be
completed, we will poll+sleep at least for the time specified in
IGC_PTM_STAT_TIMEOUT.

Fixes: a90ec8483732 ("igc: Add support for PTP getcrosststamp()")
Signed-off-by: Rodrigo Cataldo <rodrigo.cadore@l-acoustics.com>
---
I have been searching for the proper solution for this PTM issue for a long
time. The issue was observed on a 13700 (Raptor Lake). We also use a 8500
(Coffee Lake) that is much less susceptible for this issue, but still
happens. Both are using I225-LM cards.

For a 24 hours test, idle system, I have observed with 13700::

	number of timeouts in 86400 seconds: 2509

The same test on a 8500::

        number of timeouts in 86400 seconds: 9

Where one PTM request is sent per second. Test was done this script::

  record_multiple_timeout_param()
  {
  	local taskset_cpu=$1
  	local cur_limit=$((SECONDS + LIMIT_SECONDS ))
  	local timeouts=0
  
  	while [ $SECONDS -lt $cur_limit ]
  	do
  		REL_TO=$((cur_limit - SECONDS))
  
  		timeout $REL_TO taskset $taskset_cpu \
  			phc2sys -c $ITF_NAME -s CLOCK_REALTIME -O 37 -m 1>/dev/null
  		if [ $? -eq 255 ]; then
  			timeouts=$((timeouts + 1))
  		fi
  	done
  	printf "\tnumber of timeouts in %s seconds: %d\n" $LIMIT_SECONDS $timeouts
  }

  record_multiple_timeout_param $NON_ISOLCPU_MASK

Firmware version for the cards::

  # lshw -class network -json | jq '.[0].product,.[0].configuration.firmware'
  "Ethernet Controller I225-LM"
  "1057:8754"

  # lshw -class network -json | jq '.[2].product,.[2].configuration.firmware'
  "Ethernet Controller I225-LM"
  "1057:8754

A couple of attempts were made that did not lead to solving the
issue (disabling ASPM in kernel and boot, using periodic tick), and a couple
of solutions that worked but that were subpar:

1. The issue was not observed for a phc2sys(8) running on a fully
   isolated nohz_full core. We do not have the luxury of dedicating a a
   core for it.
2. Bumping the IGC_PTM_STAT_TIMEOUT value. Other machines may need
   another value though.
3. Retry (multiple times) readx_poll_timeout() in case of failure. This may
   significantly increase the function latency, because each
   readx_poll_timeout() can take more than 100 us.
4. Disabling preemption during the PTM request. Horrible.

For the Coffee Lake machine, the issue tends to be avoided because the
read does not take so long. Here is basically the same trace using the
Cofee Lake machine::

  phc2sys-1204  [002]  1778.220288: funcgraph_entry:                    |  igc_ptp_getcrosststamp() {
  phc2sys-1204  [002]  1778.220290: funcgraph_entry:                    |    igc_phc_get_syncdevicetime() {
  phc2sys-1204  [002]  1778.220291: funcgraph_entry:                    |      igc_rd32() {
  phc2sys-1204  [002]  1778.220373: preempt_disable: caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
  phc2sys-1204  [002]  1778.220374: local_timer_entry: vector=236
  phc2sys-1204  [002]  1778.220375: hrtimer_cancel: hrtimer=0xffff894027326118
  phc2sys-1204  [002]  1778.220376: hrtimer_expire_entry: hrtimer=0xffff894027326118 now=1778228034802 function=tick_nohz_handler/0x0

  ... tick handler ...

  phc2sys-1204  [002]  1778.220411: funcgraph_exit:       !  119.843 us |      }
  phc2sys-1204  [002]  1778.220412: funcgraph_entry:                    |      igc_rd32() {
  phc2sys-1204  [002]  1778.220492: funcgraph_exit:       + 80.094 us   |      }
  phc2sys-1204  [002]  1778.220493: funcgraph_entry:        2.951 us    |      igc_rd32();
  phc2sys-1204  [002]  1778.220497: hrtimer_init: hrtimer=0xffffa504c0d83aa0 clockid=CLOCK_MONOTONIC mode=0x0
  phc2sys-1204  [002]  1778.220498: preempt_disable: caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
  phc2sys-1204  [002]  1778.220499: hrtimer_start: hrtimer=0xffffa504c0d83aa0 function=hrtimer_wakeup/0x0 expires=1778228158866 softexpires=1778228157866

  ... timer setup ....

  phc2sys-1204  [002]  1778.220509: preempt_enable: caller=_raw_spin_unlock_irqrestore+0x2b parent=hrtimer_start_range_ns+0x12d
  phc2sys-1204  [002]  1778.220511: funcgraph_entry:        7.338 us   |      igc_rd32();
  phc2sys-1204  [002]  1778.220519: funcgraph_entry:        2.769 us   |      igc_rd32();
  phc2sys-1204  [002]  1778.220522: funcgraph_entry:        2.798 us   |      igc_rd32();
  phc2sys-1204  [002]  1778.220525: funcgraph_entry:        2.736 us   |      igc_rd32();
  phc2sys-1204  [002]  1778.220529: funcgraph_entry:        2.750 us   |      igc_rd32();
  phc2sys-1204  [002]  1778.220532: funcgraph_exit:       !  242.656 us |    }

For both machines, I observed that the first igc_rd32() after an idle
period (more than 10us) tends to take significantly more time. I assume
this is a hardware power-saving technique, but I could not find any
mention in the manuals. This is very easily observable with an idle
system running phc2sys, since it will request only once every second.

This is the typical trace of the operation::

  phc2sys-1204  [002]  1749.209397: funcgraph_entry:                   |  igc_ptp_getcrosststamp() {
  phc2sys-1204  [002]  1749.209398: funcgraph_entry:                   |    igc_phc_get_syncdevicetime() {
  phc2sys-1204  [002]  1749.209400: funcgraph_entry:      + 81.491 us  |      igc_rd32();
  phc2sys-1204  [002]  1749.209482: funcgraph_entry:        3.691 us   |      igc_rd32();
  phc2sys-1204  [002]  1749.209487: funcgraph_entry:        2.942 us   |      igc_rd32();
  phc2sys-1204  [002]  1749.209490: hrtimer_init: hrtimer=0xffffa504c0d83a00 clockid=CLOCK_MONOTONIC mode=0x0
  phc2sys-1204  [002]  1749.209491: preempt_disable: caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
  ... timer setup and it goes on like before ...

The preemption needs to happen for this issue, so that this power-saving
mode is triggered, making the igc_rd32 'slow enough' so that all of the
timeout is consumed before the card has time to answer.

I believe flushing the write solves the issue definitely, since the
write should be completed before the timeout has started. So that, even
if the timeout is consumed by a slow read operation, the write has been
received before and the card had time to process the request.
---
 drivers/net/ethernet/intel/igc/igc_ptp.c | 4 ++++
 1 file changed, 4 insertions(+)


---
base-commit: 0005b2dc43f96b93fc5b0850d7ca3f7aeac9129c
change-id: 20240705-igc-flush-ptm-request-before-timeout-6-10-f6e02c96f6d4

Best regards,

Comments

Vinicius Costa Gomes July 11, 2024, 12:48 a.m. UTC | #1
Rodrigo Cataldo via B4 Relay
<devnull+rodrigo.cadore.l-acoustics.com@kernel.org> writes:

> From: Rodrigo Cataldo <rodrigo.cadore@l-acoustics.com>
>
> When a PTM is requested via wr32(IGC_PTM_STAT), the operation may only
> be completed by the next read operation (flush). Unfortunately, the next
> read operation in the PTM request loop happens after we have already
> started evaluating the response timeout.
>
> Thus, the following behavior has been observed::
>
>   phc2sys-1655  [010]   103.233752: funcgraph_entry:                    |  igc_ptp_getcrosststamp() {
>   phc2sys-1655  [010]   103.233754: funcgraph_entry:                    |    igc_phc_get_syncdevice_time() {
>   phc2sys-1655  [010]   103.233755: funcgraph_entry:                    |      igc_rd32() {
>   phc2sys-1655  [010]   103.233931: preempt_disable: caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
>   phc2sys-1655  [010]   103.233932: local_timer_entry: vector=236
>   phc2sys-1655  [010]   103.233932: hrtimer_cancel: hrtimer=0xffff8edeef526118
>   phc2sys-1655  [010]   103.233933: hrtimer_expire_entry: hrtimer=0xffff8edeef526118 now=103200127876 function=tick_nohz_handler/0x0
>
>   ... tick handler ...
>
>   phc2sys-1655  [010]   103.233971: funcgraph_exit:       !  215.559 us |      }
>   phc2sys-1655  [010]   103.233972: funcgraph_entry:                    |      igc_rd32() {
>   phc2sys-1655  [010]   103.234135: funcgraph_exit:       !  164.370 us |      }
>   phc2sys-1655  [010]   103.234136: funcgraph_entry:         1.942 us   |      igc_rd32();
>   phc2sys-1655  [010]   103.234147: console:              igc 0000:03:00.0 enp3s0: Timeout reading IGC_PTM_STAT register
>
> Based on the (simplified) code::
>
> 	ctrl = rd32(IGC_PTM_CTRL);
>         /* simplified: multiple writes here */
> 	wr32(IGC_PTM_STAT, IGC_PTM_STAT_VALID);
>
> 	err = readx_poll_timeout(rd32, IGC_PTM_STAT, stat,
> 				 stat, IGC_PTM_STAT_SLEEP,
> 				 IGC_PTM_STAT_TIMEOUT);
> 	if (err < 0) {
> 		netdev_err(adapter->netdev, "Timeout reading IGC_PTM_STAT register\n");
> 		return err;
> 	}
>
> Where readx_poll_timeout() starts the timeout evaluation before calling
> the rd32() parameter (rd32() is a macro for igc_rd32()).
>
> In the trace shown, the read operation of readx_poll_timeout() (second
> igc_rd32()) took so long that the timeout (IGC_PTM_STAT_VALID) has expired
> and no sleep has been performed.
>
> With this patch, a write flush is added (which is an additional
> igc_rd32() in practice) that can wait for the write before the timeout
> is evaluated::
>
>   phc2sys-1615  [010]    74.517954: funcgraph_entry:                    |  igc_ptp_getcrosststamp() {
>   phc2sys-1615  [010]    74.517956: funcgraph_entry:                    |    igc_phc_get_syncdevicetime() {
>   phc2sys-1615  [010]    74.517957: funcgraph_entry:                    |      igc_rd32() {
>   phc2sys-1615  [010]    74.518127: preempt_disable: caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
>   phc2sys-1615  [010]    74.518128: local_timer_entry: vector=236
>   phc2sys-1615  [010]    74.518128: hrtimer_cancel: hrtimer=0xffff96466f526118
>   phc2sys-1615  [010]    74.518128: hrtimer_expire_entry: hrtimer=0xffff96466f526118 now=74484007229 function=tick_nohz_handler/0x0
>
>   ... tick handler ...
>
>   phc2sys-1615  [010]    74.518180: funcgraph_exit:       !  222.282 us |      }
>   phc2sys-1615  [010]    74.518181: funcgraph_entry:                    |      igc_rd32() {
>   phc2sys-1615  [010]    74.518349: funcgraph_exit:       !  168.160 us |      }
>   phc2sys-1615  [010]    74.518349: funcgraph_entry:         1.970 us   |      igc_rd32();
>   phc2sys-1615  [010]    74.518352: hrtimer_init: hrtimer=0xffffa6f9413a3940 clockid=CLOCK_MONOTONIC mode=0x0
>   phc2sys-1615  [010]    74.518352: preempt_disable: caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
>   phc2sys-1615  [010]    74.518353: hrtimer_start: hrtimer=0xffffa6f9413a3940 function=hrtimer_wakeup/0x0 expires=74484232878 softexpires=74484231878
>
>   .. hrtimer setup and return ...
>
>   kworker/10:1-242   [010]    74.518382: sched_switch: kworker/10:1:242 [120] W ==> phc2sys:1615 [120]
>   phc2sys-1615  [010]    74.518383: preempt_enable: caller=schedule+0x36 parent=schedule+0x36
>   phc2sys-1615  [010]    74.518384: funcgraph_entry:      !  100.088 us |      igc_rd32();
>   phc2sys-1615  [010]    74.518484: funcgraph_entry:         1.958 us   |      igc_rd32();
>   phc2sys-1615  [010]    74.518488: funcgraph_entry:         2.019 us   |      igc_rd32();
>   phc2sys-1615  [010]    74.518490: funcgraph_entry:         1.956 us   |      igc_rd32();
>   phc2sys-1615  [010]    74.518492: funcgraph_entry:         1.980 us   |      igc_rd32();
>   phc2sys-1615  [010]    74.518494: funcgraph_exit:       !  539.386 us |    }
>
> Now the sleep is called as expected, and the operation succeeds.
> Therefore, regardless of how long it will take for the write to be
> completed, we will poll+sleep at least for the time specified in
> IGC_PTM_STAT_TIMEOUT.
>
> Fixes: a90ec8483732 ("igc: Add support for PTP getcrosststamp()")
> Signed-off-by: Rodrigo Cataldo <rodrigo.cadore@l-acoustics.com>
> ---
> I have been searching for the proper solution for this PTM issue for a long
> time. The issue was observed on a 13700 (Raptor Lake). We also use a 8500
> (Coffee Lake) that is much less susceptible for this issue, but still
> happens. Both are using I225-LM cards.
>
> For a 24 hours test, idle system, I have observed with 13700::
>
> 	number of timeouts in 86400 seconds: 2509
>
> The same test on a 8500::
>
>         number of timeouts in 86400 seconds: 9
>
> Where one PTM request is sent per second. Test was done this script::
>
>   record_multiple_timeout_param()
>   {
>   	local taskset_cpu=$1
>   	local cur_limit=$((SECONDS + LIMIT_SECONDS ))
>   	local timeouts=0
>   
>   	while [ $SECONDS -lt $cur_limit ]
>   	do
>   		REL_TO=$((cur_limit - SECONDS))
>   
>   		timeout $REL_TO taskset $taskset_cpu \
>   			phc2sys -c $ITF_NAME -s CLOCK_REALTIME -O 37 -m 1>/dev/null
>   		if [ $? -eq 255 ]; then
>   			timeouts=$((timeouts + 1))
>   		fi
>   	done
>   	printf "\tnumber of timeouts in %s seconds: %d\n" $LIMIT_SECONDS $timeouts
>   }
>
>   record_multiple_timeout_param $NON_ISOLCPU_MASK
>
> Firmware version for the cards::
>
>   # lshw -class network -json | jq '.[0].product,.[0].configuration.firmware'
>   "Ethernet Controller I225-LM"
>   "1057:8754"
>
>   # lshw -class network -json | jq '.[2].product,.[2].configuration.firmware'
>   "Ethernet Controller I225-LM"
>   "1057:8754
>
> A couple of attempts were made that did not lead to solving the
> issue (disabling ASPM in kernel and boot, using periodic tick), and a couple
> of solutions that worked but that were subpar:
>
> 1. The issue was not observed for a phc2sys(8) running on a fully
>    isolated nohz_full core. We do not have the luxury of dedicating a a
>    core for it.

This one is interesting. Was it because the isolated CPU never got to
sleep and readx_poll_timeout() became a busy loop? I am trying to fit
this one on my mental model.

> 2. Bumping the IGC_PTM_STAT_TIMEOUT value. Other machines may need
>    another value though.

This one is not horrible as well. Some value like 400us comes to mind.

Your proposed fix looks fine as well, I was thinking a bit about it, and
even if it's more a like a timing issue than "the absolutely correct
fix", there are little chances for bad side effects.

If there aren't any other ideas, I am fine with this:

Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>

> 3. Retry (multiple times) readx_poll_timeout() in case of failure. This may
>    significantly increase the function latency, because each
>    readx_poll_timeout() can take more than 100 us.
> 4. Disabling preemption during the PTM request. Horrible.
>

I wonder if you tried moving the I225 to the graphics card PCIe slot, or
is it an onboard card? IIRC there are different PCIe controllers for the
graphics slot vs. the other ones, at least on Coffee Lake.

> For the Coffee Lake machine, the issue tends to be avoided because the
> read does not take so long. Here is basically the same trace using the
> Cofee Lake machine::
>
>   phc2sys-1204  [002]  1778.220288: funcgraph_entry:                    |  igc_ptp_getcrosststamp() {
>   phc2sys-1204  [002]  1778.220290: funcgraph_entry:                    |    igc_phc_get_syncdevicetime() {
>   phc2sys-1204  [002]  1778.220291: funcgraph_entry:                    |      igc_rd32() {
>   phc2sys-1204  [002]  1778.220373: preempt_disable: caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
>   phc2sys-1204  [002]  1778.220374: local_timer_entry: vector=236
>   phc2sys-1204  [002]  1778.220375: hrtimer_cancel: hrtimer=0xffff894027326118
>   phc2sys-1204  [002]  1778.220376: hrtimer_expire_entry: hrtimer=0xffff894027326118 now=1778228034802 function=tick_nohz_handler/0x0
>
>   ... tick handler ...
>
>   phc2sys-1204  [002]  1778.220411: funcgraph_exit:       !  119.843 us |      }
>   phc2sys-1204  [002]  1778.220412: funcgraph_entry:                    |      igc_rd32() {
>   phc2sys-1204  [002]  1778.220492: funcgraph_exit:       + 80.094 us   |      }
>   phc2sys-1204  [002]  1778.220493: funcgraph_entry:        2.951 us    |      igc_rd32();
>   phc2sys-1204  [002]  1778.220497: hrtimer_init: hrtimer=0xffffa504c0d83aa0 clockid=CLOCK_MONOTONIC mode=0x0
>   phc2sys-1204  [002]  1778.220498: preempt_disable: caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
>   phc2sys-1204  [002]  1778.220499: hrtimer_start: hrtimer=0xffffa504c0d83aa0 function=hrtimer_wakeup/0x0 expires=1778228158866 softexpires=1778228157866
>
>   ... timer setup ....
>
>   phc2sys-1204  [002]  1778.220509: preempt_enable: caller=_raw_spin_unlock_irqrestore+0x2b parent=hrtimer_start_range_ns+0x12d
>   phc2sys-1204  [002]  1778.220511: funcgraph_entry:        7.338 us   |      igc_rd32();
>   phc2sys-1204  [002]  1778.220519: funcgraph_entry:        2.769 us   |      igc_rd32();
>   phc2sys-1204  [002]  1778.220522: funcgraph_entry:        2.798 us   |      igc_rd32();
>   phc2sys-1204  [002]  1778.220525: funcgraph_entry:        2.736 us   |      igc_rd32();
>   phc2sys-1204  [002]  1778.220529: funcgraph_entry:        2.750 us   |      igc_rd32();
>   phc2sys-1204  [002]  1778.220532: funcgraph_exit:       !  242.656 us |    }
>
> For both machines, I observed that the first igc_rd32() after an idle
> period (more than 10us) tends to take significantly more time. I assume
> this is a hardware power-saving technique, but I could not find any
> mention in the manuals. This is very easily observable with an idle
> system running phc2sys, since it will request only once every second.
>
> This is the typical trace of the operation::
>
>   phc2sys-1204  [002]  1749.209397: funcgraph_entry:                   |  igc_ptp_getcrosststamp() {
>   phc2sys-1204  [002]  1749.209398: funcgraph_entry:                   |    igc_phc_get_syncdevicetime() {
>   phc2sys-1204  [002]  1749.209400: funcgraph_entry:      + 81.491 us  |      igc_rd32();
>   phc2sys-1204  [002]  1749.209482: funcgraph_entry:        3.691 us   |      igc_rd32();
>   phc2sys-1204  [002]  1749.209487: funcgraph_entry:        2.942 us   |      igc_rd32();
>   phc2sys-1204  [002]  1749.209490: hrtimer_init: hrtimer=0xffffa504c0d83a00 clockid=CLOCK_MONOTONIC mode=0x0
>   phc2sys-1204  [002]  1749.209491: preempt_disable: caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
>   ... timer setup and it goes on like before ...
>
> The preemption needs to happen for this issue, so that this power-saving
> mode is triggered, making the igc_rd32 'slow enough' so that all of the
> timeout is consumed before the card has time to answer.
>
> I believe flushing the write solves the issue definitely, since the
> write should be completed before the timeout has started. So that, even
> if the timeout is consumed by a slow read operation, the write has been
> received before and the card had time to process the request.
> ---
>  drivers/net/ethernet/intel/igc/igc_ptp.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/igc/igc_ptp.c b/drivers/net/ethernet/intel/igc/igc_ptp.c
> index 1bb026232efc..d7269e4f1a21 100644
> --- a/drivers/net/ethernet/intel/igc/igc_ptp.c
> +++ b/drivers/net/ethernet/intel/igc/igc_ptp.c
> @@ -1005,6 +1005,10 @@ static int igc_phc_get_syncdevicetime(ktime_t *device,
>  		 * VALID bit.
>  		 */
>  		wr32(IGC_PTM_STAT, IGC_PTM_STAT_VALID);
> +		/* Ensure the hardware receives the ptm request before the
> +		 * response timeout starts.
> +		 */
> +		wrfl();
>  
>  		err = readx_poll_timeout(rd32, IGC_PTM_STAT, stat,
>  					 stat, IGC_PTM_STAT_SLEEP,
>
> ---
> base-commit: 0005b2dc43f96b93fc5b0850d7ca3f7aeac9129c
> change-id: 20240705-igc-flush-ptm-request-before-timeout-6-10-f6e02c96f6d4
>
> Best regards,
> -- 
> Rodrigo Cataldo <rodrigo.cadore@l-acoustics.com>
>
>


Cheers,
Rodrigo CADORE CATALDO July 16, 2024, 8:13 a.m. UTC | #2
Vinicius Costa Gomes <vinicius.gomes@intel.com> writes: 

> From: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> 
> Rodrigo Cataldo via B4 Relay
> <devnull+rodrigo.cadore.l-acoustics.com@kernel.org> writes:
> 
> > From: Rodrigo Cataldo <rodrigo.cadore@l-acoustics.com>
> >
> > When a PTM is requested via wr32(IGC_PTM_STAT), the operation may only
> > be completed by the next read operation (flush). Unfortunately, the next
> > read operation in the PTM request loop happens after we have already
> > started evaluating the response timeout.
> >
> > Thus, the following behavior has been observed::
> >
> >   phc2sys-1655  [010]   103.233752: funcgraph_entry:                    |
> igc_ptp_getcrosststamp() {
> >   phc2sys-1655  [010]   103.233754: funcgraph_entry:                    |
> igc_phc_get_syncdevice_time() {
> >   phc2sys-1655  [010]   103.233755: funcgraph_entry:                    |      igc_rd32()
> {
> >   phc2sys-1655  [010]   103.233931: preempt_disable:
> caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
> >   phc2sys-1655  [010]   103.233932: local_timer_entry: vector=236
> >   phc2sys-1655  [010]   103.233932: hrtimer_cancel:
> hrtimer=0xffff8edeef526118
> >   phc2sys-1655  [010]   103.233933: hrtimer_expire_entry:
> hrtimer=0xffff8edeef526118 now=103200127876
> function=tick_nohz_handler/0x0
> >
> >   ... tick handler ...
> >
> >   phc2sys-1655  [010]   103.233971: funcgraph_exit:       !  215.559 us |      }
> >   phc2sys-1655  [010]   103.233972: funcgraph_entry:                    |      igc_rd32()
> {
> >   phc2sys-1655  [010]   103.234135: funcgraph_exit:       !  164.370 us |      }
> >   phc2sys-1655  [010]   103.234136: funcgraph_entry:         1.942 us   |
> igc_rd32();
> >   phc2sys-1655  [010]   103.234147: console:              igc 0000:03:00.0 enp3s0:
> Timeout reading IGC_PTM_STAT register
> >
> > Based on the (simplified) code::
> >
> >       ctrl = rd32(IGC_PTM_CTRL);
> >         /* simplified: multiple writes here */
> >       wr32(IGC_PTM_STAT, IGC_PTM_STAT_VALID);
> >
> >       err = readx_poll_timeout(rd32, IGC_PTM_STAT, stat,
> >                                stat, IGC_PTM_STAT_SLEEP,
> >                                IGC_PTM_STAT_TIMEOUT);
> >       if (err < 0) {
> >               netdev_err(adapter->netdev, "Timeout reading IGC_PTM_STAT
> register\n");
> >               return err;
> >       }
> >
> > Where readx_poll_timeout() starts the timeout evaluation before calling
> > the rd32() parameter (rd32() is a macro for igc_rd32()).
> >
> > In the trace shown, the read operation of readx_poll_timeout() (second
> > igc_rd32()) took so long that the timeout (IGC_PTM_STAT_VALID) has expired
> > and no sleep has been performed.
> >
> > With this patch, a write flush is added (which is an additional
> > igc_rd32() in practice) that can wait for the write before the timeout
> > is evaluated::
> >
> >   phc2sys-1615  [010]    74.517954: funcgraph_entry:                    |
> igc_ptp_getcrosststamp() {
> >   phc2sys-1615  [010]    74.517956: funcgraph_entry:                    |
> igc_phc_get_syncdevicetime() {
> >   phc2sys-1615  [010]    74.517957: funcgraph_entry:                    |      igc_rd32() {
> >   phc2sys-1615  [010]    74.518127: preempt_disable:
> caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
> >   phc2sys-1615  [010]    74.518128: local_timer_entry: vector=236
> >   phc2sys-1615  [010]    74.518128: hrtimer_cancel:
> hrtimer=0xffff96466f526118
> >   phc2sys-1615  [010]    74.518128: hrtimer_expire_entry:
> hrtimer=0xffff96466f526118 now=74484007229 function=tick_nohz_handler/0x0
> >
> >   ... tick handler ...
> >
> >   phc2sys-1615  [010]    74.518180: funcgraph_exit:       !  222.282 us |      }
> >   phc2sys-1615  [010]    74.518181: funcgraph_entry:                    |      igc_rd32() {
> >   phc2sys-1615  [010]    74.518349: funcgraph_exit:       !  168.160 us |      }
> >   phc2sys-1615  [010]    74.518349: funcgraph_entry:         1.970 us   |
> igc_rd32();
> >   phc2sys-1615  [010]    74.518352: hrtimer_init: hrtimer=0xffffa6f9413a3940
> clockid=CLOCK_MONOTONIC mode=0x0
> >   phc2sys-1615  [010]    74.518352: preempt_disable:
> caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
> >   phc2sys-1615  [010]    74.518353: hrtimer_start: hrtimer=0xffffa6f9413a3940
> function=hrtimer_wakeup/0x0 expires=74484232878 softexpires=74484231878
> >
> >   .. hrtimer setup and return ...
> >
> >   kworker/10:1-242   [010]    74.518382: sched_switch: kworker/10:1:242 [120]
> W ==> phc2sys:1615 [120]
> >   phc2sys-1615  [010]    74.518383: preempt_enable: caller=schedule+0x36
> parent=schedule+0x36
> >   phc2sys-1615  [010]    74.518384: funcgraph_entry:      !  100.088 us |
> igc_rd32();
> >   phc2sys-1615  [010]    74.518484: funcgraph_entry:         1.958 us   |
> igc_rd32();
> >   phc2sys-1615  [010]    74.518488: funcgraph_entry:         2.019 us   |
> igc_rd32();
> >   phc2sys-1615  [010]    74.518490: funcgraph_entry:         1.956 us   |
> igc_rd32();
> >   phc2sys-1615  [010]    74.518492: funcgraph_entry:         1.980 us   |
> igc_rd32();
> >   phc2sys-1615  [010]    74.518494: funcgraph_exit:       !  539.386 us |    }
> >
> > Now the sleep is called as expected, and the operation succeeds.
> > Therefore, regardless of how long it will take for the write to be
> > completed, we will poll+sleep at least for the time specified in
> > IGC_PTM_STAT_TIMEOUT.
> >
> > Fixes: a90ec8483732 ("igc: Add support for PTP getcrosststamp()")
> > Signed-off-by: Rodrigo Cataldo <rodrigo.cadore@l-acoustics.com>
> > ---
> > I have been searching for the proper solution for this PTM issue for a long
> > time. The issue was observed on a 13700 (Raptor Lake). We also use a 8500
> > (Coffee Lake) that is much less susceptible for this issue, but still
> > happens. Both are using I225-LM cards.
> >
> > For a 24 hours test, idle system, I have observed with 13700::
> >
> >       number of timeouts in 86400 seconds: 2509
> >
> > The same test on a 8500::
> >
> >         number of timeouts in 86400 seconds: 9
> >
> > Where one PTM request is sent per second. Test was done this script::
> >
> >   record_multiple_timeout_param()
> >   {
> >       local taskset_cpu=$1
> >       local cur_limit=$((SECONDS + LIMIT_SECONDS ))
> >       local timeouts=0
> >
> >       while [ $SECONDS -lt $cur_limit ]
> >       do
> >               REL_TO=$((cur_limit - SECONDS))
> >
> >               timeout $REL_TO taskset $taskset_cpu \
> >                       phc2sys -c $ITF_NAME -s CLOCK_REALTIME -O 37 -m 1>/dev/null
> >               if [ $? -eq 255 ]; then
> >                       timeouts=$((timeouts + 1))
> >               fi
> >       done
> >       printf "\tnumber of timeouts in %s seconds: %d\n" $LIMIT_SECONDS
> $timeouts
> >   }
> >
> >   record_multiple_timeout_param $NON_ISOLCPU_MASK
> >
> > Firmware version for the cards::
> >
> >   # lshw -class network -json | jq '.[0].product,.[0].configuration.firmware'
> >   "Ethernet Controller I225-LM"
> >   "1057:8754"
> >
> >   # lshw -class network -json | jq '.[2].product,.[2].configuration.firmware'
> >   "Ethernet Controller I225-LM"
> >   "1057:8754
> >
> > A couple of attempts were made that did not lead to solving the
> > issue (disabling ASPM in kernel and boot, using periodic tick), and a couple
> > of solutions that worked but that were subpar:
> >
> > 1. The issue was not observed for a phc2sys(8) running on a fully
> >    isolated nohz_full core. We do not have the luxury of dedicating a a
> >    core for it.
> 
> This one is interesting. Was it because the isolated CPU never got to
> sleep and readx_poll_timeout() became a busy loop? I am trying to fit
> this one on my mental model.

Hello Vinicius! thank you for the reply.
Actually, in the isolation case, it is the opposite! the CPU always sleeps.
Which is good in this scenario.. More specifically, this is because the 
request always arrives 'quickly' in the device with the rd32() after 
readx_poll_timeout(). As long as the rd32() is 'quick', you have time to sleep.

This 'quickly' reference is about the 'power-saving' hypothesis I mention
below. Since the CPU is isolated, it is never interrupted, so I do not observe
the 'slow' rd32() in readx_poll_timeout().

In the end, the idea of this patch is to insert the write flush for avoiding this 
'quickly' or 'slowly' rd32() inside readx_poll_timeout().

> 
> > 2. Bumping the IGC_PTM_STAT_TIMEOUT value. Other machines may need
> >    another value though.
> 
> This one is not horrible as well. Some value like 400us comes to mind.
> 
> Your proposed fix looks fine as well, I was thinking a bit about it, and
> even if it's more a like a timing issue than "the absolutely correct
> fix", there are little chances for bad side effects.
> 
> If there aren't any other ideas, I am fine with this:
> 
> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>

Thank you for the Ack!
I can test with a 400us. It should work fine.

(I talked with some of your intel colleagues before sending the patch, and there is 
supposed to be a patch response here with (another?) PTM solution. We will see :)

> 
> > 3. Retry (multiple times) readx_poll_timeout() in case of failure. This may
> >    significantly increase the function latency, because each
> >    readx_poll_timeout() can take more than 100 us.
> > 4. Disabling preemption during the PTM request. Horrible.
> >
> 
> I wonder if you tried moving the I225 to the graphics card PCIe slot, or
> is it an onboard card? IIRC there are different PCIe controllers for the
> graphics slot vs. the other ones, at least on Coffee Lake.
> 

I was only able to come to the lab yesterday, hence why I did not answer before.

It is true, I was not using the graphic card slot. So, I changed from the PCIe root
1c.4/1c.7 to 01 (for both machines)

I'm running the 24-hour test again.  Here are the results

For Raptor Lake (13700), PCIe root 01 (width x16, 32GT/s)::
    # dmesg | grep "Timeout reading IGC_PTM_STAT register" | wc -l
    514
    $ ps -eo user,pid,lstart,comm | grep 1797
    root        1797 Mon Jul 15 11:32:03 2024 ptm_timeout.sh
    $ date 
    Tue Jul 16 10:02:48 AM CEST 2024

514 timeouts for ~23h. The issue persists.

For Coffee Lake (8500), I was surprised to see this::
    # dmesg | grep PTM
    [    0.412210] pci 0000:00:1c.0: PTM enabled (root), 4ns granularity
    [    0.413769] pci 0000:00:1c.7: PTM enabled (root), 4ns granularity
    [    0.415250] pci 0000:00:1d.0: PTM enabled (root), 4ns granularity
    [    1.353468] igc 0000:01:00.0: PCIe PTM not supported by PCIe bus/controller

It seems PCIe root 01 does not support PTM.

There is something else I want to mention. We aim to support redundancy mode
for AVB, where we will likely use two PCIe slots; so, one of them would not use the
graphic card slot.

> > For the Coffee Lake machine, the issue tends to be avoided because the
> > read does not take so long. Here is basically the same trace using the
> > Cofee Lake machine::
> >
> >   phc2sys-1204  [002]  1778.220288: funcgraph_entry:                    |
> igc_ptp_getcrosststamp() {
> >   phc2sys-1204  [002]  1778.220290: funcgraph_entry:                    |
> igc_phc_get_syncdevicetime() {
> >   phc2sys-1204  [002]  1778.220291: funcgraph_entry:                    |      igc_rd32()
> {
> >   phc2sys-1204  [002]  1778.220373: preempt_disable:
> caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
> >   phc2sys-1204  [002]  1778.220374: local_timer_entry: vector=236
> >   phc2sys-1204  [002]  1778.220375: hrtimer_cancel:
> hrtimer=0xffff894027326118
> >   phc2sys-1204  [002]  1778.220376: hrtimer_expire_entry:
> hrtimer=0xffff894027326118 now=1778228034802
> function=tick_nohz_handler/0x0
> >
> >   ... tick handler ...
> >
> >   phc2sys-1204  [002]  1778.220411: funcgraph_exit:       !  119.843 us |      }
> >   phc2sys-1204  [002]  1778.220412: funcgraph_entry:                    |      igc_rd32()
> {
> >   phc2sys-1204  [002]  1778.220492: funcgraph_exit:       + 80.094 us   |      }
> >   phc2sys-1204  [002]  1778.220493: funcgraph_entry:        2.951 us    |
> igc_rd32();
> >   phc2sys-1204  [002]  1778.220497: hrtimer_init: hrtimer=0xffffa504c0d83aa0
> clockid=CLOCK_MONOTONIC mode=0x0
> >   phc2sys-1204  [002]  1778.220498: preempt_disable:
> caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
> >   phc2sys-1204  [002]  1778.220499: hrtimer_start:
> hrtimer=0xffffa504c0d83aa0 function=hrtimer_wakeup/0x0
> expires=1778228158866 softexpires=1778228157866
> >
> >   ... timer setup ....
> >
> >   phc2sys-1204  [002]  1778.220509: preempt_enable:
> caller=_raw_spin_unlock_irqrestore+0x2b parent=hrtimer_start_range_ns+0x12d
> >   phc2sys-1204  [002]  1778.220511: funcgraph_entry:        7.338 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1778.220519: funcgraph_entry:        2.769 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1778.220522: funcgraph_entry:        2.798 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1778.220525: funcgraph_entry:        2.736 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1778.220529: funcgraph_entry:        2.750 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1778.220532: funcgraph_exit:       !  242.656 us |    }
> >
> > For both machines, I observed that the first igc_rd32() after an idle
> > period (more than 10us) tends to take significantly more time. I assume
> > this is a hardware power-saving technique, but I could not find any
> > mention in the manuals. This is very easily observable with an idle
> > system running phc2sys, since it will request only once every second.
> >
> > This is the typical trace of the operation::
> >
> >   phc2sys-1204  [002]  1749.209397: funcgraph_entry:                   |
> igc_ptp_getcrosststamp() {
> >   phc2sys-1204  [002]  1749.209398: funcgraph_entry:                   |
> igc_phc_get_syncdevicetime() {
> >   phc2sys-1204  [002]  1749.209400: funcgraph_entry:      + 81.491 us  |
> igc_rd32();
> >   phc2sys-1204  [002]  1749.209482: funcgraph_entry:        3.691 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1749.209487: funcgraph_entry:        2.942 us   |
> igc_rd32();
> >   phc2sys-1204  [002]  1749.209490: hrtimer_init: hrtimer=0xffffa504c0d83a00
> clockid=CLOCK_MONOTONIC mode=0x0
> >   phc2sys-1204  [002]  1749.209491: preempt_disable:
> caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
> >   ... timer setup and it goes on like before ...
> >
> > The preemption needs to happen for this issue, so that this power-saving
> > mode is triggered, making the igc_rd32 'slow enough' so that all of the
> > timeout is consumed before the card has time to answer.
> >
> > I believe flushing the write solves the issue definitely, since the
> > write should be completed before the timeout has started. So that, even
> > if the timeout is consumed by a slow read operation, the write has been
> > received before and the card had time to process the request.
> > ---
> >  drivers/net/ethernet/intel/igc/igc_ptp.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/intel/igc/igc_ptp.c
> b/drivers/net/ethernet/intel/igc/igc_ptp.c
> > index 1bb026232efc..d7269e4f1a21 100644
> > --- a/drivers/net/ethernet/intel/igc/igc_ptp.c
> > +++ b/drivers/net/ethernet/intel/igc/igc_ptp.c
> > @@ -1005,6 +1005,10 @@ static int igc_phc_get_syncdevicetime(ktime_t
> *device,
> >                * VALID bit.
> >                */
> >               wr32(IGC_PTM_STAT, IGC_PTM_STAT_VALID);
> > +             /* Ensure the hardware receives the ptm request before the
> > +              * response timeout starts.
> > +              */
> > +             wrfl();
> >
> >               err = readx_poll_timeout(rd32, IGC_PTM_STAT, stat,
> >                                        stat, IGC_PTM_STAT_SLEEP,
> >
> > ---
> > base-commit: 0005b2dc43f96b93fc5b0850d7ca3f7aeac9129c
> > change-id: 20240705-igc-flush-ptm-request-before-timeout-6-10-f6e02c96f6d4
> >
> > Best regards,
> > --
> > Rodrigo Cataldo <rodrigo.cadore@l-acoustics.com>
> >
> >
> 
> 
> Cheers,
> --
> Vinicius
Kurt Kanzenbach July 24, 2024, 8:48 a.m. UTC | #3
On Mon Jul 08 2024, Rodrigo Cataldo via B4 Relay wrote:
> From: Rodrigo Cataldo <rodrigo.cadore@l-acoustics.com>
>
> When a PTM is requested via wr32(IGC_PTM_STAT), the operation may only
> be completed by the next read operation (flush). Unfortunately, the next
> read operation in the PTM request loop happens after we have already
> started evaluating the response timeout.
>
> Thus, the following behavior has been observed::
>
>   phc2sys-1655  [010]   103.233752: funcgraph_entry:                    |  igc_ptp_getcrosststamp() {
>   phc2sys-1655  [010]   103.233754: funcgraph_entry:                    |    igc_phc_get_syncdevice_time() {
>   phc2sys-1655  [010]   103.233755: funcgraph_entry:                    |      igc_rd32() {
>   phc2sys-1655  [010]   103.233931: preempt_disable: caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
>   phc2sys-1655  [010]   103.233932: local_timer_entry: vector=236
>   phc2sys-1655  [010]   103.233932: hrtimer_cancel: hrtimer=0xffff8edeef526118
>   phc2sys-1655  [010]   103.233933: hrtimer_expire_entry: hrtimer=0xffff8edeef526118 now=103200127876 function=tick_nohz_handler/0x0
>
>   ... tick handler ...
>
>   phc2sys-1655  [010]   103.233971: funcgraph_exit:       !  215.559 us |      }
>   phc2sys-1655  [010]   103.233972: funcgraph_entry:                    |      igc_rd32() {
>   phc2sys-1655  [010]   103.234135: funcgraph_exit:       !  164.370 us |      }
>   phc2sys-1655  [010]   103.234136: funcgraph_entry:         1.942 us   |      igc_rd32();
>   phc2sys-1655  [010]   103.234147: console:              igc 0000:03:00.0 enp3s0: Timeout reading IGC_PTM_STAT register
>
> Based on the (simplified) code::
>
> 	ctrl = rd32(IGC_PTM_CTRL);
>         /* simplified: multiple writes here */
> 	wr32(IGC_PTM_STAT, IGC_PTM_STAT_VALID);
>
> 	err = readx_poll_timeout(rd32, IGC_PTM_STAT, stat,
> 				 stat, IGC_PTM_STAT_SLEEP,
> 				 IGC_PTM_STAT_TIMEOUT);
> 	if (err < 0) {
> 		netdev_err(adapter->netdev, "Timeout reading IGC_PTM_STAT register\n");
> 		return err;
> 	}
>
> Where readx_poll_timeout() starts the timeout evaluation before calling
> the rd32() parameter (rd32() is a macro for igc_rd32()).
>
> In the trace shown, the read operation of readx_poll_timeout() (second
> igc_rd32()) took so long that the timeout (IGC_PTM_STAT_VALID) has expired
> and no sleep has been performed.
>
> With this patch, a write flush is added (which is an additional
> igc_rd32() in practice) that can wait for the write before the timeout
> is evaluated::
>
>   phc2sys-1615  [010]    74.517954: funcgraph_entry:                    |  igc_ptp_getcrosststamp() {
>   phc2sys-1615  [010]    74.517956: funcgraph_entry:                    |    igc_phc_get_syncdevicetime() {
>   phc2sys-1615  [010]    74.517957: funcgraph_entry:                    |      igc_rd32() {
>   phc2sys-1615  [010]    74.518127: preempt_disable: caller=irq_enter_rcu+0x14 parent=irq_enter_rcu+0x14
>   phc2sys-1615  [010]    74.518128: local_timer_entry: vector=236
>   phc2sys-1615  [010]    74.518128: hrtimer_cancel: hrtimer=0xffff96466f526118
>   phc2sys-1615  [010]    74.518128: hrtimer_expire_entry: hrtimer=0xffff96466f526118 now=74484007229 function=tick_nohz_handler/0x0
>
>   ... tick handler ...
>
>   phc2sys-1615  [010]    74.518180: funcgraph_exit:       !  222.282 us |      }
>   phc2sys-1615  [010]    74.518181: funcgraph_entry:                    |      igc_rd32() {
>   phc2sys-1615  [010]    74.518349: funcgraph_exit:       !  168.160 us |      }
>   phc2sys-1615  [010]    74.518349: funcgraph_entry:         1.970 us   |      igc_rd32();
>   phc2sys-1615  [010]    74.518352: hrtimer_init: hrtimer=0xffffa6f9413a3940 clockid=CLOCK_MONOTONIC mode=0x0
>   phc2sys-1615  [010]    74.518352: preempt_disable: caller=_raw_spin_lock_irqsave+0x28 parent=hrtimer_start_range_ns+0x56
>   phc2sys-1615  [010]    74.518353: hrtimer_start: hrtimer=0xffffa6f9413a3940 function=hrtimer_wakeup/0x0 expires=74484232878 softexpires=74484231878
>
>   .. hrtimer setup and return ...
>
>   kworker/10:1-242   [010]    74.518382: sched_switch: kworker/10:1:242 [120] W ==> phc2sys:1615 [120]
>   phc2sys-1615  [010]    74.518383: preempt_enable: caller=schedule+0x36 parent=schedule+0x36
>   phc2sys-1615  [010]    74.518384: funcgraph_entry:      !  100.088 us |      igc_rd32();
>   phc2sys-1615  [010]    74.518484: funcgraph_entry:         1.958 us   |      igc_rd32();
>   phc2sys-1615  [010]    74.518488: funcgraph_entry:         2.019 us   |      igc_rd32();
>   phc2sys-1615  [010]    74.518490: funcgraph_entry:         1.956 us   |      igc_rd32();
>   phc2sys-1615  [010]    74.518492: funcgraph_entry:         1.980 us   |      igc_rd32();
>   phc2sys-1615  [010]    74.518494: funcgraph_exit:       !  539.386 us |    }
>
> Now the sleep is called as expected, and the operation succeeds.
> Therefore, regardless of how long it will take for the write to be
> completed, we will poll+sleep at least for the time specified in
> IGC_PTM_STAT_TIMEOUT.
>
> Fixes: a90ec8483732 ("igc: Add support for PTP getcrosststamp()")
> Signed-off-by: Rodrigo Cataldo <rodrigo.cadore@l-acoustics.com>

Thanks for sending this upstream.

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
diff mbox series

Patch

diff --git a/drivers/net/ethernet/intel/igc/igc_ptp.c b/drivers/net/ethernet/intel/igc/igc_ptp.c
index 1bb026232efc..d7269e4f1a21 100644
--- a/drivers/net/ethernet/intel/igc/igc_ptp.c
+++ b/drivers/net/ethernet/intel/igc/igc_ptp.c
@@ -1005,6 +1005,10 @@  static int igc_phc_get_syncdevicetime(ktime_t *device,
 		 * VALID bit.
 		 */
 		wr32(IGC_PTM_STAT, IGC_PTM_STAT_VALID);
+		/* Ensure the hardware receives the ptm request before the
+		 * response timeout starts.
+		 */
+		wrfl();
 
 		err = readx_poll_timeout(rd32, IGC_PTM_STAT, stat,
 					 stat, IGC_PTM_STAT_SLEEP,