Message ID | 1330443135.75072.YahooMailClassic@web29019.mail.ird.yahoo.com |
---|---|
State | New, archived |
Headers | show |
> > If an Erase Suspend Command (0xb0) is issued while a Block Erase operation > (0x20, 0xd0) is in progress, Status Register bit 6 (SR.6) will be set to 1. > After an Erase Resume Command (0xd0) is issued, SR.6 will be set back to 0. > > Unfortunately when inval_cache_and_wait_for_operation() is called to wait for > a Block Erase operation to finish, it never checks that SR.6 = 0 before > returning. Consequently it can (and does) return while a Block Erase operation > is still suspended, resulting in random breakage. > > This patch ensures that when inval_cache_and_wait_for_operation() is called to > wait for a Block Erase operation to finish (chip->state == FL_ERASING), it does > not return until both SR.7 = 1 (as before) and SR.6 = 0. Interesting but I do wonder, should not the if (chip->state != chip_state) test just before the status test prevent this in the first place? > > Signed-off-by: Paul Parsons <lost.distance@yahoo.com> > --- > > I found this after switching my HP iPAQ hx4700 from using jffs2 to ubifs. It > then consistently reported "block erase error: (bad VPP)" errors. Intriguingly, > the bootloader never reporting erase or programming failures; it turned out that > the bootloader never issued Erase Suspend Commands. > > Should SR.6 = 0 also be checked after an Erase Suspend Command (0xb0) is issued? > This should preclude an unnecessary (and perhaps unpredictable) subsequent Erase > Resume Command (0xd0). > > --- clean-3.3-rc5/drivers/mtd/chips/cfi_cmdset_0001.c 2012-02-25 20:18:16.000000000 +0000 > +++ linux-3.3-rc5/drivers/mtd/chips/cfi_cmdset_0001.c 2012-02-28 03:13:30.521537614 +0000 > @@ -1211,6 +1211,7 @@ static int inval_cache_and_wait_for_oper > { > struct cfi_private *cfi = map->fldrv_priv; > map_word status, status_OK = CMD(0x80); > + map_word status_76 = (chip->state == FL_ERASING) ? CMD(0xc0) : CMD(0x80); hmm, I wonder if you can get by without the chip->state test? Just always set it to 0xc0? Jocke
> > If an Erase Suspend Command (0xb0) is issued while a > Block Erase operation > > (0x20, 0xd0) is in progress, Status Register bit 6 > (SR.6) will be set to 1. > > After an Erase Resume Command (0xd0) is issued, SR.6 > will be set back to 0. > > > > Unfortunately when inval_cache_and_wait_for_operation() > is called to wait for > > a Block Erase operation to finish, it never checks that > SR.6 = 0 before > > returning. Consequently it can (and does) return while > a Block Erase operation > > is still suspended, resulting in random breakage. > > > > This patch ensures that when > inval_cache_and_wait_for_operation() is called to > > wait for a Block Erase operation to finish > (chip->state == FL_ERASING), it does > > not return until both SR.7 = 1 (as before) and SR.6 = > 0. > > Interesting but I do wonder, should not the > if (chip->state != chip_state) > test just before the status test prevent this in the first > place? I'm not sure I understand. The (chip->state != chip_state) test forces inval_cache_and_wait_for_operation() to stop polling and sleep, pending the operation which suspended it. After it wakes it resumes polling. The test doesn't prevent inval_cache_and_wait_for_operation() returning prematurely, i.e. before the Block Erase has finished. An important point which I suspect has been overlooked is that both SR.7 and SR.6 are set when a Block Erase is suspended. Thus testing SR.7 only is not sufficient to determine that a Block Erase has completed. > > + map_word status_76 = (chip->state > == FL_ERASING) ? CMD(0xc0) : CMD(0x80); > > hmm, I wonder if you can get by without the chip->state > test? Just always set it to 0xc0? No, unfortunately not. For example a write buffer sequence will finish whether or not a Block Erase is currently suspended; thus it must test SR.7 only, not SR.6. Paul
Paul Parsons <lost.distance@yahoo.com> wrote on 2012/02/29 01:23:21: > From: Paul Parsons <lost.distance@yahoo.com> > > > > If an Erase Suspend Command (0xb0) is issued while a > > Block Erase operation > > > (0x20, 0xd0) is in progress, Status Register bit 6 > > (SR.6) will be set to 1. > > > After an Erase Resume Command (0xd0) is issued, SR.6 > > will be set back to 0. > > > > > > Unfortunately when inval_cache_and_wait_for_operation() > > is called to wait for > > > a Block Erase operation to finish, it never checks that > > SR.6 = 0 before > > > returning. Consequently it can (and does) return while > > a Block Erase operation > > > is still suspended, resulting in random breakage. > > > > > > This patch ensures that when > > inval_cache_and_wait_for_operation() is called to > > > wait for a Block Erase operation to finish > > (chip->state == FL_ERASING), it does > > > not return until both SR.7 = 1 (as before) and SR.6 = > > 0. > > > > Interesting but I do wonder, should not the > > if (chip->state != chip_state) > > test just before the status test prevent this in the first > > place? > > I'm not sure I understand. The (chip->state != chip_state) test forces > inval_cache_and_wait_for_operation() to stop polling and sleep, pending > the operation which suspended it. After it wakes it resumes polling. The > test doesn't prevent inval_cache_and_wait_for_operation() returning > prematurely, i.e. before the Block Erase has finished. Been to long since I been in this code I guess, but I still don't get it. If erase is suspended, chip->state is changed which will keep the function in the for(;;) loop. Once erase has been resumed again chip->state will change back. It seems to me that chip->state and SR.6 are mutally exclusive? I can not see how you can get to status = map_read(map, cmd_adr); if (map_word_andequal(map, status, status_OK, status_OK)) if the the erase has been suspended. > > An important point which I suspect has been overlooked is that both SR.7 > and SR.6 are set when a Block Erase is suspended. Thus testing SR.7 only > is not sufficient to determine that a Block Erase has completed. > > > > + map_word status_76 = (chip->state > > == FL_ERASING) ? CMD(0xc0) : CMD(0x80); > > > > hmm, I wonder if you can get by without the chip->state > > test? Just always set it to 0xc0? > > No, unfortunately not. For example a write buffer sequence will finish > whether or not a Block Erase is currently suspended; thus it must test > SR.7 only, not SR.6. I see, thanks. Jocke
--- On Wed, 29/2/12, Joakim Tjernlund <joakim.tjernlund@transmode.se> wrote: > If erase is suspended, chip->state is changed which will > keep the function in > the for(;;) loop. Once erase has been resumed again > chip->state will change back. > It seems to me that chip->state and SR.6 are mutally > exclusive? I can not see how > you can get to > status = map_read(map, cmd_adr); > if (map_word_andequal(map, status, > status_OK, status_OK)) > if the the erase has been suspended. OK, I understand. Let me demonstrate. I Added the following debug to the original linux-3.3-rc5 drivers/mtd/chips/cfi_cmdset_0001.c file (i.e. without my patch): -------- 797a798 > printk("%d: CRDY: about to Erase Suspend adr=0x%08x\n", current->pid, adr); 958a960 > printk("%d: PUTC: enter\n", current->pid); 974a977 > printk("%d: PUTC: about to wake_up (a)\n", current->pid); 988a992 > printk("%d: PUTC: about to wake_up (b)\n", current->pid); 1005a1010 > printk("%d: PUTC: about to Erase Resume adr=0x%08x\n", current->pid, adr); 1025a1031 > printk("%d: PUTC: about to wake_up (c)\n", current->pid); 1216a1223 > printk("%d: WAIT: call-state=%d, cmd_adr=0x%08x\n", current->pid, chip_state, cmd_adr); 1234a1242 > printk("%d: WAIT: about to suspend\n", current->pid); 1235a1244 > printk("%d: WAIT: resumed\n", current->pid); 1241a1251 > printk("%d: WAIT: status=0x%08x\n", current->pid, status.x[0]); 1257a1268 > printk("%d: WAIT: return -ETIME\n", current->pid); 1262a1274 > printk("%d: WAIT: about to sleep: sleep_time=%d\n", current->pid, sleep_time); 1281a1294 > printk("%d: WAIT: return 0\n", current->pid); 1891a1905 > printk("%d: ERAS: adr=0x%08x\n", current->pid, adr); -------- I rebooted my hx4700, saved the dmesg output, and edited and annotated what I feel to be the important messages: -------- [ 2.224667] 30: WAIT: status=0x00a000a0 [ 2.224693] 30: WAIT: return 0 PID 30 has just failed FL_ERASING of block 0x00440000 for the 3rd time: erase has failed (SR.5 = 1). [ 2.224719] block erase failed at 0x00440000: status 0xa000a0. Retrying... Confirmation of erase fail from elsewhere. [ 2.224779] 30: ERAS: adr=0x00440000 [ 2.224798] 30: WAIT: call-state=4, cmd_adr=0x00440000 [ 2.224814] 30: WAIT: status=0x00000000 [ 2.224830] 30: WAIT: about to sleep: sleep_time=512000 PID 30 is now attempting FL_ERASING of block 0x00440000 for the 4th time. [ 2.492129] 76: CRDY: about to Erase Suspend adr=0x0060f748 [ 2.492216] 76: PUTC: enter [ 2.492233] 76: PUTC: about to Erase Resume adr=0x0060f748 PID 76 has issued Erase Suspend and Erase Resume (the address is irrelevant). PID 30 is still sleeping at this point. [ 2.785235] 30: WAIT: status=0x00e800e8 [ 2.785250] 30: WAIT: return 0 PID 30 has woken up to find: bad F-VPP (SR.3 = 1), erase has failed (SR.5 = 1), Erase Suspend is in effect (SR.6 = 1). [ 2.785276] physmap-flash: block erase error: (bad VPP) Confirmation of bad VPP from elsewhere. [ 2.863384] 30: ERAS: adr=0x00440000 [ 2.863412] 30: WAIT: call-state=4, cmd_adr=0x00440000 [ 2.863431] 30: WAIT: status=0x00c000c0 [ 2.863444] 30: WAIT: return 0 PID 30 is now attempting FL_ERASING of block 0x00440000 for the 5th time: Erase Suspend is in effect (SR.6 = 1), but no errors, so the erase seemingly succeeded. [ 2.863543] 30: WAIT: call-state=8, cmd_adr=0x00440000 [ 2.863562] 30: WAIT: status=0x00c000c0 [ 2.863574] 30: WAIT: return 0 PID 30 is now attempting FL_WRITING_TO_BUFFER: Erase Suspend is in effect (SR.6 = 1), but no errors, so the FL_WRITING_TO_BUFFER seemingly succeeded. [ 2.863599] 30: WAIT: call-state=7, cmd_adr=0x00440000 [ 2.863615] 30: WAIT: status=0x00d000d0 [ 2.863628] 30: WAIT: return 0 PID 30 is now attempting FL_WRITING to block 0x00440000: Erase Suspend is in effect (SR.6 = 1), program has failed (SR.3 = 1). [ 2.863649] physmap-flash: buffer write error (status 0xd000d0) Confirmation of write error from elsewhere. [ 2.863703] UBI error: ubi_io_write: error -22 while writing 64 bytes to PEB 262:0, written 0 bytes [ 2.863728] UBI error: erase_worker: failed to erase PEB 262, error -22 [ 2.863750] UBI warning: ubi_ro_mode: switch to read-only mode [ 2.863770] UBI error: do_work: work failed with error code -22 [ 2.863791] UBI error: ubi_thread: ubi_bgt0d: work failed with error code -22 Bye bye UBI. -------- First we note that the WAIT (inval_cache_and_wait_for_operation) function never suspends itself; i.e. it never prints "WAIT: about to suspend". The message at: [ 2.863431] 30: WAIT: status=0x00c000c0 shows that WAIT found status_OK (SR.7) set and Erase Suspend in effect. Now things get complicated. I repeated with exercise with my patch applied. This time the WAIT function looped forever trying to erase the same block (0x00440000) because the status never changed from 0x00e800e8 (rather like 2.785235 above, but with my patch now applied the function cannot return). Disheartened, I thought my patch had only succeeded in livelocking the function because of a hardware error. However, remembering our version of the Heisenberg uncertainty principle - that adding debug sometimes changes the behaviour - I stripped out almost all of the debug and repeated the exercise. This time block 0x00440000 erased successfully without errors. Where do we go from here? The above debug demonstrates that inval_cache_and_wait_for_operation() can return while Erase Suspend is in effect. My patch prevents that, and fixes UBI for me. I am open to the suggestion that my hx4700 exhibits random hardware failures. This would explain the results. But I have a nagging doubt about that because my bootloader has never reported erase errors (apart from once when I broke its erase block size table). As I mentioned previously, the bootloader never suspends erase operations. I am open to the suggestion that my patch merely fixes a symptom rather than an underlying cause. So perhaps the MTD driver contains a race condition. The stripping out of debug to fix the erase error tends to support that explanation. I don't know what else to say. Paul
Did you try to validate your setup with mtd tests?
--- On Wed, 29/2/12, Artem Bityutskiy <dedekind1@gmail.com> wrote: > Did you try to validate your setup > with mtd tests? No I did not. I've just quickly tried the readtest on all mtd partitions and they seemed to pass. A torturetest on a small unused partition reported "retried 3 times, still errors, give-up", but a stresstest on the same partition passed. Was there a specific test and set of options I should try? The hx4700 has NOR flash (Intel StrataFlash) so it looks like at least 4 of the tests - pagetest, oobtest, subpagetest, nandecctest - are not appropriate.
Paul Parsons <lost.distance@yahoo.com> wrote on 2012/02/29 18:22:39: > > --- On Wed, 29/2/12, Joakim Tjernlund <joakim.tjernlund@transmode.se> wrote: > > If erase is suspended, chip->state is changed which will > > keep the function in > > the for(;;) loop. Once erase has been resumed again > > chip->state will change back. > > It seems to me that chip->state and SR.6 are mutally > > exclusive? I can not see how > > you can get to > > status = map_read(map, cmd_adr); > > if (map_word_andequal(map, status, > > status_OK, status_OK)) > > if the the erase has been suspended. > > OK, I understand. Let me demonstrate. > > I Added the following debug to the original linux-3.3-rc5 > drivers/mtd/chips/cfi_cmdset_0001.c file (i.e. without my patch): This looks like a erase resume problem, cfi driver issues a resume command but the chip fails or is slow to respond. I remember someone having a similar problem some time ago. Don't think we got to bottom of it though, look for subject: Numonyx NOR and chip->mutex bug? Try adding extra read status cmds around resume, check if SR.6 is cleared too. Enable all driver debug code and look for the first sign of trouble. That said, there seems to be something fishy going on with UBI as it is erasing the same failing block over and over again. Also, it would probably be good to add test of SR.6 in some places, just to make sure the flash does what the driver wants it to do. Jocke
On Wed, 2012-02-29 at 22:26 +0000, Paul Parsons wrote: > --- On Wed, 29/2/12, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > Did you try to validate your setup > > with mtd tests? > > No I did not. > > I've just quickly tried the readtest on all mtd partitions and they seemed > to pass. A torturetest on a small unused partition reported "retried 3 > times, still errors, give-up", but a stresstest on the same partition passed. > > Was there a specific test and set of options I should try? > The hx4700 has NOR flash (Intel StrataFlash) so it looks like at least 4 > of the tests - pagetest, oobtest, subpagetest, nandecctest - are not > appropriate. In the past all tests either worked on nor or refused to load. I thought stresstest should work as well. I do not have any specific test in mind, threw out standard suggestion I give to people :-)
--- clean-3.3-rc5/drivers/mtd/chips/cfi_cmdset_0001.c 2012-02-25 20:18:16.000000000 +0000 +++ linux-3.3-rc5/drivers/mtd/chips/cfi_cmdset_0001.c 2012-02-28 03:13:30.521537614 +0000 @@ -1211,6 +1211,7 @@ static int inval_cache_and_wait_for_oper { struct cfi_private *cfi = map->fldrv_priv; map_word status, status_OK = CMD(0x80); + map_word status_76 = (chip->state == FL_ERASING) ? CMD(0xc0) : CMD(0x80); int chip_state = chip->state; unsigned int timeo, sleep_time, reset_timeo; @@ -1239,7 +1240,7 @@ static int inval_cache_and_wait_for_oper } status = map_read(map, cmd_adr); - if (map_word_andequal(map, status, status_OK, status_OK)) + if (map_word_andequal(map, status, status_76, status_OK)) break; if (chip->erase_suspended && chip_state == FL_ERASING) {
If an Erase Suspend Command (0xb0) is issued while a Block Erase operation (0x20, 0xd0) is in progress, Status Register bit 6 (SR.6) will be set to 1. After an Erase Resume Command (0xd0) is issued, SR.6 will be set back to 0. Unfortunately when inval_cache_and_wait_for_operation() is called to wait for a Block Erase operation to finish, it never checks that SR.6 = 0 before returning. Consequently it can (and does) return while a Block Erase operation is still suspended, resulting in random breakage. This patch ensures that when inval_cache_and_wait_for_operation() is called to wait for a Block Erase operation to finish (chip->state == FL_ERASING), it does not return until both SR.7 = 1 (as before) and SR.6 = 0. Signed-off-by: Paul Parsons <lost.distance@yahoo.com> --- I found this after switching my HP iPAQ hx4700 from using jffs2 to ubifs. It then consistently reported "block erase error: (bad VPP)" errors. Intriguingly, the bootloader never reporting erase or programming failures; it turned out that the bootloader never issued Erase Suspend Commands. Should SR.6 = 0 also be checked after an Erase Suspend Command (0xb0) is issued? This should preclude an unnecessary (and perhaps unpredictable) subsequent Erase Resume Command (0xd0).