Message ID | 20170824174736.909-1-oohall@gmail.com |
---|---|
State | Accepted |
Headers | show |
Oliver O'Halloran <oohall@gmail.com> writes: > There's a few issues with the Host<->OCC I2C bus handshaking. First up, > skiboot is currently examining the wrong bit when checking if the OCC > is currently using the bus. Secondly, when we need to wait for the OCC > to release the bus we are scheduling a recovery timer to run zero > timebase ticks after the current moment so the recovery timeout handler > will run immediately after the bus was requested, which will in turn > re-schedule itself, etc, etc. There's also a race between the OCC > interrupt and the recovery handler which can result in an assertion > failure in the recovery thread. All of this is bad. > > This patch addresses all these issues and sets the recovery timeout to > 10ms. > > Signed-off-by: Oliver O'Halloran <oohall@gmail.com> > --- > hw/p8-i2c.c | 58 +++++++++++++++++++++++++++++++++++++--------------------- > 1 file changed, 37 insertions(+), 21 deletions(-) > --- > I've been using this script to test the changes. With this patch it's fairly > stable, but there's still intermittent data corruption issues when > when reading from the bus which I'm still investigating. > --- > #!/bin/bash -e > > while true; > do > for f in $(find /sys/devices/platform/ -name '*-????'|grep a3000 ) > do > bus="$(basename $f | awk -F - -- '{ print $1 }' )" > dev="$(basename $f | awk -F - -- '{ print $2 }' )" > > i2cdump -y $bus 0x$dev i > $bus-$dev > md5sum $bus-$dev > sleep 0.1 > done > done Pridhiviraj, could you have a look at adding something to our EEPROM tests in op-test-framework to do something like the above? That is, md5sum the result of reading it and read the eeprom in a loop a few times. (that is, if we don't already have that test there)
On 2017-08-25 10:59, Stewart Smith wrote: > Oliver O'Halloran <oohall@gmail.com> writes: >> There's a few issues with the Host<->OCC I2C bus handshaking. First >> up, >> skiboot is currently examining the wrong bit when checking if the OCC >> is currently using the bus. Secondly, when we need to wait for the OCC >> to release the bus we are scheduling a recovery timer to run zero >> timebase ticks after the current moment so the recovery timeout >> handler >> will run immediately after the bus was requested, which will in turn >> re-schedule itself, etc, etc. There's also a race between the OCC >> interrupt and the recovery handler which can result in an assertion >> failure in the recovery thread. All of this is bad. >> >> This patch addresses all these issues and sets the recovery timeout to >> 10ms. >> >> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> >> --- >> hw/p8-i2c.c | 58 >> +++++++++++++++++++++++++++++++++++++--------------------- >> 1 file changed, 37 insertions(+), 21 deletions(-) >> --- >> I've been using this script to test the changes. With this patch it's >> fairly >> stable, but there's still intermittent data corruption issues when >> when reading from the bus which I'm still investigating. >> --- >> #!/bin/bash -e >> >> while true; >> do >> for f in $(find /sys/devices/platform/ -name '*-????'|grep a3000 ) >> do >> bus="$(basename $f | awk -F - -- '{ print $1 }' )" >> dev="$(basename $f | awk -F - -- '{ print $2 }' )" >> >> i2cdump -y $bus 0x$dev i > $bus-$dev >> md5sum $bus-$dev >> sleep 0.1 >> done >> done > > > Pridhiviraj, could you have a look at adding something to our EEPROM > tests in op-test-framework to do something like the above? That is, > md5sum the result of reading it and read the eeprom in a loop a few > times. > > (that is, if we don't already have that test there) Stewart, added the above one here, Please take a look at it. https://github.com/open-power/op-test-framework/pull/156/commits/c3a1ebd41cb02d5591e929c5a1b9e1c92b9a2ec1 Thanks Pridhiviraj
ppaidipe <ppaidipe@linux.vnet.ibm.com> writes: > On 2017-08-25 10:59, Stewart Smith wrote: >> Oliver O'Halloran <oohall@gmail.com> writes: >>> There's a few issues with the Host<->OCC I2C bus handshaking. First >>> up, >>> skiboot is currently examining the wrong bit when checking if the OCC >>> is currently using the bus. Secondly, when we need to wait for the OCC >>> to release the bus we are scheduling a recovery timer to run zero >>> timebase ticks after the current moment so the recovery timeout >>> handler >>> will run immediately after the bus was requested, which will in turn >>> re-schedule itself, etc, etc. There's also a race between the OCC >>> interrupt and the recovery handler which can result in an assertion >>> failure in the recovery thread. All of this is bad. >>> >>> This patch addresses all these issues and sets the recovery timeout to >>> 10ms. >>> >>> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> >>> --- >>> hw/p8-i2c.c | 58 >>> +++++++++++++++++++++++++++++++++++++--------------------- >>> 1 file changed, 37 insertions(+), 21 deletions(-) >>> --- >>> I've been using this script to test the changes. With this patch it's >>> fairly >>> stable, but there's still intermittent data corruption issues when >>> when reading from the bus which I'm still investigating. >>> --- >>> #!/bin/bash -e >>> >>> while true; >>> do >>> for f in $(find /sys/devices/platform/ -name '*-????'|grep a3000 ) >>> do >>> bus="$(basename $f | awk -F - -- '{ print $1 }' )" >>> dev="$(basename $f | awk -F - -- '{ print $2 }' )" >>> >>> i2cdump -y $bus 0x$dev i > $bus-$dev >>> md5sum $bus-$dev >>> sleep 0.1 >>> done >>> done >> >> >> Pridhiviraj, could you have a look at adding something to our EEPROM >> tests in op-test-framework to do something like the above? That is, >> md5sum the result of reading it and read the eeprom in a loop a few >> times. >> >> (that is, if we don't already have that test there) > > Stewart, added the above one here, Please take a look at it. > > https://github.com/open-power/op-test-framework/pull/156/commits/c3a1ebd41cb02d5591e929c5a1b9e1c92b9a2ec1 Thanks for that! I merged it with a bit of a change so that it used difflib to display the difference between reads - so we get a really nice error message out.
Oliver O'Halloran <oohall@gmail.com> writes: > There's a few issues with the Host<->OCC I2C bus handshaking. First up, > skiboot is currently examining the wrong bit when checking if the OCC > is currently using the bus. Secondly, when we need to wait for the OCC > to release the bus we are scheduling a recovery timer to run zero > timebase ticks after the current moment so the recovery timeout handler > will run immediately after the bus was requested, which will in turn > re-schedule itself, etc, etc. There's also a race between the OCC > interrupt and the recovery handler which can result in an assertion > failure in the recovery thread. All of this is bad. > > This patch addresses all these issues and sets the recovery timeout to > 10ms. > > Signed-off-by: Oliver O'Halloran <oohall@gmail.com> > --- > hw/p8-i2c.c | 58 +++++++++++++++++++++++++++++++++++++--------------------- > 1 file changed, 37 insertions(+), 21 deletions(-) Thanks! Merged to master as of 201fd50f208d680cfb19c0e508ca46b4f9cc75dc > --- > I've been using this script to test the changes. With this patch it's fairly > stable, but there's still intermittent data corruption issues when > when reading from the bus which I'm still investigating. > --- > #!/bin/bash -e > > while true; > do > for f in $(find /sys/devices/platform/ -name '*-????'|grep a3000 ) > do > bus="$(basename $f | awk -F - -- '{ print $1 }' )" > dev="$(basename $f | awk -F - -- '{ print $2 }' )" > > i2cdump -y $bus 0x$dev i > $bus-$dev > md5sum $bus-$dev > sleep 0.1 > done > done how does this compare to https://github.com/open-power/op-test-framework/blob/master/testcases/AT24driver.py#L97 ? It looks like we're about functionally equivalent, except our eeprom gathering code is: https://github.com/open-power/op-test-framework/blob/master/testcases/I2C.py#L139 Patches welcome there if you don't think it looks quite right. with "current" code (op-build of a day or two ago, skiboot that I just pushed) we do get the kind of fail you mention: ====================================================================== FAIL [7.456s]: runTest (testcases.AT24driver.AT24driver) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/stewart/op-test-framework/testcases/AT24driver.py", line 118, in runTest err="repeated i2cdump read of EEPROM doesn't match") File "/home/stewart/op-test-framework/testcases/AT24driver.py", line 149, in diff_commands self.assertMultiLineEqual(''.join(r),''.join(last_r), "%s:\n%s" % (err,diff)) AssertionError: repeated i2cdump read of EEPROM doesn't match: --- i2cdump -f -y 4 0x50 w +++ i2cdump -f -y 4 0x50 w @@ -1,5 +1,5 @@ 0,8 1,9 2,a 3,b 4,c 5,d 6,e 7,f -00: 1623 1430 0000 0000 0000 3358 0000 0000 +00: 0000 1430 0000 0000 0000 3358 0000 0000 08: 0000 0000 0000 0000 0000 0000 ffff ffff 10: ffff ffff ffff ffff ffff ffff ffff ffff 18: ffff ffff ffff ffff ffff ffff ffff ffff ----------------------------------------------------------------------
diff --git a/hw/p8-i2c.c b/hw/p8-i2c.c index f4666e24953e..a6274171debb 100644 --- a/hw/p8-i2c.c +++ b/hw/p8-i2c.c @@ -1063,8 +1063,11 @@ static int occ_i2c_lock(struct p8_i2c_master *master) busflag = PPC_BIT(16 + (master->engine_id - 1) * 2); - DBG("occflags = %llx (locks = %.6llx)\n", (u64)occflags, - GETFIELD(PPC_BITMASK(16, 22), occflags)); + DBG("I2C: c%de%d: occflags = %llx (locks = %x:%x:%x)\n", + master->chip_id, master->engine_id, (u64) occflags, + (u32) GETFIELD(PPC_BITMASK(16, 17), occflags), + (u32) GETFIELD(PPC_BITMASK(18, 19), occflags), + (u32) GETFIELD(PPC_BITMASK(20, 21), occflags)); rc = xscom_write(master->chip_id, OCCFLG_SET, busflag); if (rc) { @@ -1073,8 +1076,11 @@ static int occ_i2c_lock(struct p8_i2c_master *master) } /* If the OCC also has this bus locked then wait for IRQ */ - if (occflags & (busflag << 1)) + if (occflags & (busflag >> 1)) { + DBG("I2C: c%de%d: Master in use by OCC\n", + master->chip_id, master->engine_id); return 1; + } master->occ_lock_acquired = true; @@ -1098,8 +1104,8 @@ static int occ_i2c_unlock(struct p8_i2c_master *master) busflag = PPC_BIT(16 + (master->engine_id - 1) * 2); if (!(occflags & busflag)) { - prerror("I2C: busflag for %d already cleared (flags = %.16llx)", - master->engine_id, occflags); + DBG("I2C: spurious unlock for c%de%d already cleared (flags = %.16llx)", + master->chip_id, master->engine_id, occflags); } rc = xscom_write(master->chip_id, OCCFLG_CLEAR, busflag); @@ -1161,7 +1167,7 @@ static int p8_i2c_start_request(struct p8_i2c_master *master, if (rc > 0) { /* Wait for OCC IRQ */ master->state = state_occache_dis; - schedule_timer(&master->recovery, rc); + schedule_timer(&master->recovery, msecs_to_tb(10)); return 0; } @@ -1281,29 +1287,29 @@ void p9_i2c_bus_owner_change(u32 chip_id) { struct proc_chip *chip = get_chip(chip_id); struct p8_i2c_master *master = NULL; - int rc; assert(chip); list_for_each(&chip->i2cms, master, link) { - if (master->state == state_idle || - master->state != state_occache_dis) - continue; - lock(&master->lock); + /* spurious */ + if (master->state != state_occache_dis) + goto done; + /* Can we now lock this master? */ - rc = occ_i2c_lock(master); - if (rc) { - unlock(&master->lock); - continue; - } + if (occ_i2c_lock(master)) + goto done; - /* Run the state machine */ - p8_i2c_check_status(master); + /* clear the existing wait timer */ + cancel_timer(&master->recovery); + + /* re-start the request now that we own the master */ + master->state = state_idle; - /* Check for new work */ p8_i2c_check_work(master); + p8_i2c_check_status(master); +done: unlock(&master->lock); } } @@ -1453,8 +1459,18 @@ static void p8_i2c_recover(struct timer *t __unused, void *data, struct p8_i2c_master *master = data; lock(&master->lock); - assert(master->state == state_recovery || - master->state == state_occache_dis); + + /* + * The recovery timer can race with the OCC interrupt. If the interrupt + * comes in just before this is called, then we'll get a spurious + * timeout which we need to ignore. + */ + if (master->state != state_recovery && + master->state != state_occache_dis) { + unlock(&master->lock); + return; + } + master->state = state_idle; /* We may or may not still have work pending, re-enable the sensor cache
There's a few issues with the Host<->OCC I2C bus handshaking. First up, skiboot is currently examining the wrong bit when checking if the OCC is currently using the bus. Secondly, when we need to wait for the OCC to release the bus we are scheduling a recovery timer to run zero timebase ticks after the current moment so the recovery timeout handler will run immediately after the bus was requested, which will in turn re-schedule itself, etc, etc. There's also a race between the OCC interrupt and the recovery handler which can result in an assertion failure in the recovery thread. All of this is bad. This patch addresses all these issues and sets the recovery timeout to 10ms. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> --- hw/p8-i2c.c | 58 +++++++++++++++++++++++++++++++++++++--------------------- 1 file changed, 37 insertions(+), 21 deletions(-) --- I've been using this script to test the changes. With this patch it's fairly stable, but there's still intermittent data corruption issues when when reading from the bus which I'm still investigating. --- #!/bin/bash -e while true; do for f in $(find /sys/devices/platform/ -name '*-????'|grep a3000 ) do bus="$(basename $f | awk -F - -- '{ print $1 }' )" dev="$(basename $f | awk -F - -- '{ print $2 }' )" i2cdump -y $bus 0x$dev i > $bus-$dev md5sum $bus-$dev sleep 0.1 done done ---