Message ID | 4EE5C824.2050704@grandegger.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Hello Wolfgang,

> Hi Wolfgang,
>
> On 12/11/2011 07:33 PM, Wolfgang Zarre wrote:
>> Hello Wolfgang,
>>> On 12/07/2011 02:42 PM, Wolfgang Grandegger wrote:
>>>> Hi Wolfgang,
>>>>
>>>> On 12/06/2011 10:08 PM, Wolfgang Zarre wrote:
> ...
>>>>> Let me know if You need more or some other tests.
>>>>
>>>> You could provoke some state changes or bus-off conditions to see if the
>>>> berr-counter shows reasonable results. I'm currently consolidating and
>>>> unifying error state and bus-off handling. Would be nice if you could do
>>>> some further tests when I have the patches ready...
>>>
>>> I just pushed the mentioned modifications to the "devel" branch of my
>>> "wg-linux-can-next" [1] repository. You can get it as shown below:
>>>
>>> $ git clone --reference=<some-recent-net-next-tree> \
>>>   git://gitorious.org/~wgrandegger/linux-can/wg-linux-can-next.git
>>> $ git checkout -b devel devel
>>>
>>> [1] https://gitorious.org/~wgrandegger/linux-can/wg-linux-can-next
>>>
>>> Wolfgang.
>>
>> OK, I was trying so far and You will find below the results.
>> Just FYI the states on the PLC side couldn't be verified because the
>> function provided by the manufacturer is not working at all and CAN
>> analyser was not available.
>>
>> We are running CANopen and therefore the PLC will send automatically a
>> heartbeat.
>>
>> I produced the bus-off state through a short circuit between L/H which was
>> working as expected.
>>
>> A bit odd was that on the second try I had to reload the module
>> because a ip down/up was not enough.
>
> Oops, not good.
>

But might be in connection with the strange behaviour of the PLC.

>> Let me know if You would need further tests or different procedure.
>
> The state changes are reported via error messages, which you can list
> with "candump -td -e any,0:0,#FFFFFFFF" with the attached patch.
>

Thanks, I'll try this with the next series of tests.

>> Producing L/H short circuit for 2 seconds
>> dmesg:
>> [ 885.409058] cc770_isa cc770_isa.0: can0: status interrupt (0x5b)
>> [ 885.420475] cc770_isa cc770_isa.0: can0: status interrupt (0xc5)
>> [ 885.420496] cc770_isa cc770_isa.0: can0: bus-off
>>
>> ip -d -s link show can0
>> 4: can0:<NO-CARRIER,NOARP,UP,ECHO> mtu 16 qdisc pfifo_fast state DOWN qlen 10
>>     link/can
>>     can state BUS-OFF (berr-counter tx 92 rx 103) restart-ms 0
>>     bitrate 500000 sample-point 0.875
>>     tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
>>     cc770: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
>>     clock 8000000
>>     re-started bus-errors arbit-lost error-warn error-pass bus-off
>>     0          0          0          1          0          1
>>     RX: bytes  packets  errors  dropped overrun mcast
>>     544        382      0       0       0       0
>>     TX: bytes  packets  errors  dropped carrier collsns
>>     30         29       0       0       0       0
>>
>> Sending and receiving stops.
>>
>> Trying to recover on PC:
>> ip link set can0 down;
>> ip -d -s link show can0
>> 4: can0:<NOARP,ECHO> mtu 16 qdisc pfifo_fast state DOWN qlen 10
>>     link/can
>>     can state STOPPED (berr-counter tx 92 rx 103) restart-ms 0
>>     bitrate 500000 sample-point 0.875
>>     tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
>>     cc770: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
>>     clock 8000000
>>     re-started bus-errors arbit-lost error-warn error-pass bus-off
>>     0          0          0          1          0          1
>>     RX: bytes  packets  errors  dropped overrun mcast
>>     544        382      0       0       0       0
>>     TX: bytes  packets  errors  dropped carrier collsns
>>     30         29       0       1       0       0
>>
>> ip link set can0 up type can bitrate 500000;
>> dmesg:
>> [ 1090.937778] cc770_isa cc770_isa.0: can0: setting BTR0=0x00 BTR1=0x1c
>> [ 1090.937869] cc770_isa cc770_isa.0: can0: Message object 15 for RX data, RTR, SFF and EFF
>> [ 1090.937885] cc770_isa cc770_isa.0: can0: Message object 11 for TX data, RTR, SFF and EFF
>> [ 1090.938050] ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
>> [ 1090.940769] cc770_isa cc770_isa.0: can0: status interrupt (0x5)
>>
>> ip -d -s link show can0
>> 4: can0:<NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP qlen 10
>>     link/can
>>     can state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>>     bitrate 500000 sample-point 0.875
>>     tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
>>     cc770: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
>>     clock 8000000
>>     re-started bus-errors arbit-lost error-warn error-pass bus-off
>>     0          0          0          1          0          1
>>     RX: bytes  packets  errors  dropped overrun mcast
>>     552        383      0       0       0       0
>>     TX: bytes  packets  errors  dropped carrier collsns
>>     30         29       0       1       0       0
>>
>> PLC in unknown state but not sending heartbeat,
>> Rebooting PLC
>
> Hm, does it work if you do the bus-off recovery manually with?
>
> # ip link set can0 up type can restart
>
> ... or automatically with?
>
> # ip link set can0 up type can restart-ms 5000

Ah, ok, good point, will try out as well with the next series of tests

>
> Anyway, rebooting/reloading should never be necessary. I will check on
> my i82572.
>
>> -----------------------------------------
>> Disconnecting cable for around 4 seconds:
>>
>> dmesg:
>> [ 2339.660283] cc770_isa cc770_isa.0: can0: status interrupt (0x5b)
>>
>> ip -d -s link show can0
>> 6: can0:<NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN qlen 10
>>     link/can
>>     can state ERROR-WARNING (berr-counter tx 128 rx 128) restart-ms 0
>>     bitrate 500000 sample-point 0.875
>>     tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
>>     cc770: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
>>     clock 8000000
>>     re-started bus-errors arbit-lost error-warn error-pass bus-off
>>     0          0          0          1          0          0
>>     RX: bytes  packets  errors  dropped overrun mcast
>>     459        298      0       0       0       0
>>     TX: bytes  packets  errors  dropped carrier collsns
>>     193        192      0       0       0       0
>
> TX and RX berr-counter are >= 128. I wonder why error passive was not
> reached.

Hmmm, that is a good question and You are right > 127 should be error-passive,
anyway, just realised now, what means then 'error-warning' because I just know
error-active, error-passive and bus-off.

>
>> Connecting again:
>> ip -d -s link show can0
>> 6: can0:<NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN qlen 10
>>     link/can
>>     can state ERROR-WARNING (berr-counter tx 120 rx 0) restart-ms 0
>>     bitrate 500000 sample-point 0.875
>>     tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
>>     cc770: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
>>     clock 8000000
>>     re-started bus-errors arbit-lost error-warn error-pass bus-off
>>     0          0          0          1          0          0
>>     RX: bytes  packets  errors  dropped overrun mcast
>>     473        311      0       0       0       0
>>     TX: bytes  packets  errors  dropped carrier collsns
>>     200        200      0       0       0       0
>>
>> After some time (around 125 seconds):
>> dmesg:
>> [ 2387.172008] cc770_isa cc770_isa.0: can0: status interrupt (0x18)
>> ip -d -s link show can0
>> 6: can0:<NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN qlen 10
>>     link/can
>>     can state ERROR-ACTIVE (berr-counter tx 29 rx 0) restart-ms 0
>>     bitrate 500000 sample-point 0.875
>>     tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
>>     cc770: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
>>     clock 8000000
>>     re-started bus-errors arbit-lost error-warn error-pass bus-off
>>     0          0          0          1          0          0
>>     RX: bytes  packets  errors  dropped overrun mcast
>>     616        447      0       0       0       0
>>     TX: bytes  packets  errors  dropped carrier collsns
>>     291        291      0       0       0       0
>
> OK, the state is back to error active (counter < 96).
>
> Thanks for testing...

You are welcome, however, I have to thank You for Your work done.

So, I'll try as soon as I can another series of tests and may be You
let me know if You have patches I should include as well.

>
> Wolfgang.
>
>

Thanks
Wolfgang
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
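On the question of what 'error-warning' means: as far as I know the CAN specification defines error-active, error-passive and bus-off, and most controllers additionally flag an "error warning" limit once either error counter reaches 96, which is the threshold referred to above as "counter < 96". A minimal sketch of that classification follows. The enum and function names are hypothetical and are not taken from the cc770 driver or the kernel; the thresholds (96, 128, above 255) are the usual ones from the CAN specification.

```c
/* Hypothetical names for illustration only; the kernel has its own enum. */
enum sketch_can_state {
	SKETCH_ERROR_ACTIVE,
	SKETCH_ERROR_WARNING,
	SKETCH_ERROR_PASSIVE,
	SKETCH_BUS_OFF
};

/*
 * Classify a node from its transmit (tec) and receive (rec) error counters.
 * Assumed thresholds: warning at >= 96, passive at >= 128, bus-off once the
 * transmit counter exceeds 255.
 */
static enum sketch_can_state classify_state(unsigned int tec, unsigned int rec)
{
	if (tec > 255)
		return SKETCH_BUS_OFF;
	if (tec >= 128 || rec >= 128)
		return SKETCH_ERROR_PASSIVE;
	if (tec >= 96 || rec >= 96)
		return SKETCH_ERROR_WARNING;
	return SKETCH_ERROR_ACTIVE;
}
```

With the counters from the test above, `classify_state(128, 128)` gives error-passive, `classify_state(120, 0)` gives error-warning and `classify_state(29, 0)` gives error-active. Real hardware often exposes only 8-bit counters and signals bus-off through a status flag, so a driver would not normally derive bus-off from the counter value alone.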
On 12/12/2011 12:18 PM, Wolfgang Zarre wrote:
> Hello Wolfgang,
>> Hi Wolfgang,
>>
>> On 12/11/2011 07:33 PM, Wolfgang Zarre wrote:
>>> Hello Wolfgang,
>>>> On 12/07/2011 02:42 PM, Wolfgang Grandegger wrote:
>>>>> Hi Wolfgang,
>>>>>
>>>>> On 12/06/2011 10:08 PM, Wolfgang Zarre wrote:
>> ...
>>>>>> Let me know if You need more or some other tests.
>>>>>
>>>>> You could provoke some state changes or bus-off conditions to see
>>>>> if the berr-counter shows reasonable results. I'm currently
>>>>> consolidating and unifying error state and bus-off handling. Would
>>>>> be nice if you could do some further tests when I have the patches
>>>>> ready...
>>>>
>>>> I just pushed the mentioned modifications to the "devel" branch of my
>>>> "wg-linux-can-next" [1] repository. You can get it as shown below:
>>>>
>>>> $ git clone --reference=<some-recent-net-next-tree> \
>>>>   git://gitorious.org/~wgrandegger/linux-can/wg-linux-can-next.git
>>>> $ git checkout -b devel devel
>>>>
>>>> [1] https://gitorious.org/~wgrandegger/linux-can/wg-linux-can-next
>>>>
>>>> Wolfgang.
>>>
>>> OK, I was trying so far and You will find below the results.
>>> Just FYI the states on the PLC side couldn't be verified because the
>>> function provided by the manufacturer is not working at all and CAN
>>> analyser was not available.
>>>
>>> We are running CANopen and therefore the PLC will send automatically a
>>> heartbeat.
>>>
>>> I produced the bus-off state through a short circuit between L/H
>>> which was working as expected.
>>>
>>> A bit odd was that on the second try I had to reload the module
>>> because a ip down/up was not enough.
>>
>> Oops, not good.
>>
>
> But might be in connection with the strange behaviour of the PLC.

It's a bug! netif_start_queue is missing at the end of the open
function. Got lost some how. I have just updated (rebased!) my
wg-linux-can-next repository.

Wolfgang.
Hello Wolfgang,

> On 12/12/2011 12:18 PM, Wolfgang Zarre wrote:
>> Hello Wolfgang,
>>> Hi Wolfgang,
>>>
>>> On 12/11/2011 07:33 PM, Wolfgang Zarre wrote:
>>>> Hello Wolfgang,
>>>>> On 12/07/2011 02:42 PM, Wolfgang Grandegger wrote:
>>>>>> Hi Wolfgang,
>>>>>>
>>>>>> On 12/06/2011 10:08 PM, Wolfgang Zarre wrote:
>>> ...
>>>>>>> Let me know if You need more or some other tests.
>>>>>>
>>>>>> You could provoke some state changes or bus-off conditions to see
>>>>>> if the berr-counter shows reasonable results. I'm currently
>>>>>> consolidating and unifying error state and bus-off handling. Would
>>>>>> be nice if you could do some further tests when I have the patches
>>>>>> ready...
>>>>>
>>>>> I just pushed the mentioned modifications to the "devel" branch of my
>>>>> "wg-linux-can-next" [1] repository. You can get it as shown below:
>>>>>
>>>>> $ git clone --reference=<some-recent-net-next-tree> \
>>>>>   git://gitorious.org/~wgrandegger/linux-can/wg-linux-can-next.git
>>>>> $ git checkout -b devel devel
>>>>>
>>>>> [1] https://gitorious.org/~wgrandegger/linux-can/wg-linux-can-next
>>>>>
>>>>> Wolfgang.
>>>>
>>>> OK, I was trying so far and You will find below the results.
>>>> Just FYI the states on the PLC side couldn't be verified because the
>>>> function provided by the manufacturer is not working at all and CAN
>>>> analyser was not available.
>>>>
>>>> We are running CANopen and therefore the PLC will send automatically a
>>>> heartbeat.
>>>>
>>>> I produced the bus-off state through a short circuit between L/H
>>>> which was working as expected.
>>>>
>>>> A bit odd was that on the second try I had to reload the module
>>>> because a ip down/up was not enough.
>>>
>>> Oops, not good.
>>>
>>
>> But might be in connection with the strange behaviour of the PLC.
>
> It's a bug! netif_start_queue is missing at the end of the open
> function. Got lost some how. I have just updated (rebased!) my
> wg-linux-can-next repository.

Ok, I was checking out last week and since I'm running one test series
after the other.

There are several odd issues I could found and I'm trying to trace them
down beside some other work.

Even with an assumed correct configuration like I was using with the lincan
driver I'm loosing telegrams so around 1 till 2 in 500000 but might be a
different sample-point at the PLC which is opaque due the predefined
setting.

For the next test I'll set the BTR's directly.

Further sometimes I can find one in dropped but mostly not.

But more odd is that after an undefined time the transmission gets
stuck followed by a buffer overrun but can receive.

No error messages nor changes in ip -d -s link show can0.

Additional it seems that neither the automatic restart nor
the manual one works.

ip link set can0 up type can restart gives me 'RTNETLINK answers: Invalid
argument' and ip link set can0 up type can bitrate 500000 restart a
RTNETLINK answers: Device or resource busy but nothing connected to can0.

So I have to perform per example ip link set can0 down;ip link set can0 up
type can bitrate 500000 restart-ms 2000 sample-point 0.75
but this is emptying the buffer and these telegrams are lost then as well.

I was comparing with my lincan driver which was running so far ok also
to confirm a proper working PLC.

First I assumed that maybe the set_reset_mode procedure is responsible for
that misbehaviour because according to the cc770 manual we should wait for
a zero of bit 7 RstST of the CPU interface register but when the
transmission gets stuck there was no call for set_reset_mode.

Maybe it's ending up somehow recessive.

Anyway, I might compare the registers of both drivers just to figure out
what's going on but maybe You have an idea as well.

Problem is just it runs always quite some time until the issues happen
otherwise it would be more easy.

>
> Wolfgang.

Wolfgang
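Since setting the BTRs directly and the sample point both come up above, here is a small sketch of how the numbers printed by `ip -d -s link show can0` relate to each other and to the `BTR0=0x00 BTR1=0x1c` line in dmesg. The struct and function names are made up for illustration, and the register layout (BRP and SJW in BTR0, TSEG1 and TSEG2 in BTR1) is an assumption that happens to reproduce the dmesg values; please check it against the cc770 manual before relying on it.

```c
/* Hypothetical container; field names mirror the `ip` output. */
struct sketch_bittiming {
	unsigned int clock_hz;   /* CAN clock, e.g. 8000000 */
	unsigned int tq_ns;      /* time quantum in nanoseconds */
	unsigned int prop_seg;
	unsigned int phase_seg1;
	unsigned int phase_seg2;
	unsigned int sjw;
};

/* One bit = sync segment (1 tq) + prop_seg + phase_seg1 + phase_seg2. */
static unsigned int tq_per_bit(const struct sketch_bittiming *bt)
{
	return 1 + bt->prop_seg + bt->phase_seg1 + bt->phase_seg2;
}

static unsigned int bitrate_bps(const struct sketch_bittiming *bt)
{
	return 1000000000u / (tq_per_bit(bt) * bt->tq_ns);
}

/* Sample point in 1/1000: the sample is taken at the end of phase_seg1. */
static unsigned int sample_point_permille(const struct sketch_bittiming *bt)
{
	return 1000 * (1 + bt->prop_seg + bt->phase_seg1) / tq_per_bit(bt);
}

/* Bit-rate prescaler: number of clock cycles per time quantum. */
static unsigned int brp(const struct sketch_bittiming *bt)
{
	return (unsigned int)((unsigned long long)bt->clock_hz * bt->tq_ns
			      / 1000000000ull);
}

/* Assumed layout: BTR0 = SJW-1 in bits 7..6, BRP-1 in bits 5..0. */
static unsigned char btr0(const struct sketch_bittiming *bt)
{
	return (unsigned char)(((brp(bt) - 1) & 0x3f) |
			       (((bt->sjw - 1) & 0x3) << 6));
}

/* Assumed layout: BTR1 = TSEG2-1 in bits 6..4, TSEG1-1 in bits 3..0. */
static unsigned char btr1(const struct sketch_bittiming *bt)
{
	return (unsigned char)(((bt->prop_seg + bt->phase_seg1 - 1) & 0xf) |
			       (((bt->phase_seg2 - 1) & 0x7) << 4));
}
```

For the first configuration in this thread (clock 8000000, tq 125, prop-seg 6, phase-seg1 7, phase-seg2 2, sjw 1) this yields 16 tq per bit, 500000 bit/s, a sample point of 0.875, and BTR0=0x00 BTR1=0x1c. Requesting sample-point 0.75 only shifts time quanta between the segments; the bitrate stays the same.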
Hi Wolfgang,

On 12/21/2011 07:32 PM, Wolfgang Zarre wrote:
> Hello Wolfgang,
...

>> It's a bug! netif_start_queue is missing at the end of the open
>> function. Got lost some how. I have just updated (rebased!) my
>> wg-linux-can-next repository.
>
> Ok, I was checking out last week and since I'm running one test series
> after the other.
>
> There are several odd issues I could found and I'm trying to trace them
> down beside some other work.
>
> Even with an assumed correct configuration like I was using with the lincan
> driver I'm loosing telegrams so around 1 till 2 in 500000 but might be a
> different sample-point at the PLC which is opaque due the predefined
> setting.

In principle, messages can be lost because the cc770 does buffer only up
to two messages in hardware. If they are not read out quickly enough,
message loss will happen. The CAN statistics should list such overruns,
though.

> For the next test I'll set the BTR's directly.

OK, if you do not see bus errors, everything should be fine.

> Further sometimes I can find one in dropped but mostly not.
>
> But more odd is that after an undefined time the transmission gets
> stuck followed by a buffer overrun but can receive.

I recently found a bug. Please try this fix:

http://marc.info/?l=linux-can&m=132370253713701&w=4

Did you realize related error messages in the dmesg output?

> No error messages nor changes in ip -d -s link show can0.
>
> Additional it seems that neither the automatic restart nor
> the manual one works.

What version are you using. I think this problem has been fixed by
adding the missing netif_start_queue() at the end of the open
function, as mentioned above. Do you have that in your driver?

> ip link set can0 up type can restart gives me 'RTNETLINK answers: Invalid
> argument' and ip link set can0 up type can bitrate 500000 restart a
> RTNETLINK answers: Device or resource busy but nothing connected to can0.

The error message is shown because you try to set bitrate when the
device is up. For the restart after bus-off just type:

# ip link set can0 type can restart

Anyway, if you run into a bus-off, then it's likely that you have
electrical problems on the CAN bus, e.g. termination, mismatching
bit-timing parameters.

> So I have to perform per example ip link set can0 down;ip link set can0 up
> type can bitrate 500000 restart-ms 2000 sample-point 0.75
> but this is emptying the buffer and these telegrams are lost then as well.
>
> I was comparing with my lincan driver which was running so far ok also
> to confirm a proper working PLC.
>
> First I assumed that maybe the set_reset_mode procedure is responsible for
> that misbehaviour because according to the cc770 manual we should wait for
> a zero of bit 7 RstST of the CPU interface register but when the
> transmission gets stuck there was no call for set_reset_mode.
>
> Maybe it's ending up somehow recessive.
>
> Anyway, I might compare the registers of both drivers just to figure out
> what's going on but maybe You have an idea as well.
>
> Problem is just it runs always quite some time until the issues happen
> otherwise it would be more easy.

Again, please check if you have netif_start_queue() at the end of the
open function.

Wolfgang.
Hello Wolfgang,

> Hi Wolfgang,
>
> On 12/21/2011 07:32 PM, Wolfgang Zarre wrote:
>> Hello Wolfgang,
> ...
>
>>> It's a bug! netif_start_queue is missing at the end of the open
>>> function. Got lost some how. I have just updated (rebased!) my
>>> wg-linux-can-next repository.
>>
>> Ok, I was checking out last week and since I'm running one test series
>> after the other.
>>
>> There are several odd issues I could found and I'm trying to trace them
>> down beside some other work.
>>
>> Even with an assumed correct configuration like I was using with the lincan
>> driver I'm loosing telegrams so around 1 till 2 in 500000 but might be a
>> different sample-point at the PLC which is opaque due the predefined
>> setting.
>
> In principle, messages can be lost because the cc770 does buffer only up
> to two messages in hardware. If they are not read out quickly enough,
> message loss will happen. The CAN statistics should list such overruns,
> though.
>

Actually I loose them on transmission, not reception, but as mentioned one
time we traced with a second PC and there the telegrams are not lost which
means they are really going over the bus physically.

So maybe just a timing issue but for now secondary.

However the telegrams are sent with 5ms space parallel to the heartbeat.

>> For the next test I'll set the BTR's directly.
>
> OK, if you do not see bus errors, everything should be fine.
>

The test with BTR's set was not working out due the fact that the software
for coding the PLC doesn't allow, I'm loving it.

>> Further sometimes I can find one in dropped but mostly not.
>>
>> But more odd is that after an undefined time the transmission gets
>> stuck followed by a buffer overrun but can receive.
>
> I recently found a bug. Please try this fix:
>
> http://marc.info/?l=linux-can&m=132370253713701&w=4

The fix is already included as checked out.

>
> Did you realize related error messages in the dmesg output?

Nothing at all, as mentioned.

>
>> No error messages nor changes in ip -d -s link show can0.
>>
>> Additional it seems that neither the automatic restart nor
>> the manual one works.
>
> What version are you using. I think this problem has been fixed by
> adding the missing netif_start_queue() at the end of the open
> function, as mentioned above. Do you have that in your driver?
>

Yes, is already included as well, I'm using commit
eec921ac28fde243456078a557768808d93d94a3

>> ip link set can0 up type can restart gives me 'RTNETLINK answers: Invalid
>> argument' and ip link set can0 up type can bitrate 500000 restart a
>> RTNETLINK answers: Device or resource busy but nothing connected to can0.
>
> The error message is shown because you try to set bitrate when the
> device is up. For the restart after bus-off just type:
>
> # ip link set can0 type can restart

Actually I tried it when it's get stuck but is anyway a hint that
the device is still up,

>
> Anyway, if you run into a bus-off, then it's likely that you have
> electrical problems on the CAN bus, e.g. termination, mismatching
> bit-timing parameters.

As said I have no indication of any kind of problem:

5: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN qlen 10
    link/can
    can state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 2000
    bitrate 500000 sample-point 0.750
    tq 125 prop-seg 5 phase-seg1 6 phase-seg2 4 sjw 1
    cc770: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
    clock 16000000
    re-started bus-errors arbit-lost error-warn error-pass bus-off
    0          0          0          0          0          0
    RX: bytes  packets  errors  dropped overrun mcast
    76506      74616    0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    2450703    616355   0       0       0       0

>
>> So I have to perform per example ip link set can0 down;ip link set can0 up
>> type can bitrate 500000 restart-ms 2000 sample-point 0.75
>> but this is emptying the buffer and these telegrams are lost then as well.
>>
>> I was comparing with my lincan driver which was running so far ok also
>> to confirm a proper working PLC.
>>
>> First I assumed that maybe the set_reset_mode procedure is responsible for
>> that misbehaviour because according to the cc770 manual we should wait for
>> a zero of bit 7 RstST of the CPU interface register but when the
>> transmission gets stuck there was no call for set_reset_mode.
>>
>> Maybe it's ending up somehow recessive.
>>
>> Anyway, I might compare the registers of both drivers just to figure out
>> what's going on but maybe You have an idea as well.
>>
>> Problem is just it runs always quite some time until the issues happen
>> otherwise it would be more easy.
>
> Again, please check if you have netif_start_queue() at the end of the
> open function.
>

As said I'm using eec921ac28fde243456078a557768808d93d94a3

However, I'll try further to investigate that issue due the fact
having it running with my lincan without problems and therefore it should
be possible to find the problem.

> Wolfgang.

Wolfgang

From e7b36500c9491ab026bd3c16dfca2ca4338524ac Mon Sep 17 00:00:00 2001
From: Wolfgang Grandegger <wg@grandegger.com>
Date: Mon, 12 Dec 2011 10:09:22 +0100
Subject: [PATCH] candump: add support for error states going backward

Signed-off-by: Wolfgang Grandegger <wg@grandegger.com>
---
 lib.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/lib.c b/lib.c
index a8ed2fe..7f810b9 100644
--- a/lib.c
+++ b/lib.c
@@ -318,6 +318,7 @@ static const char *error_classes[] = {
 	"bus-off",
 	"bus-error",
 	"restarted-after-bus-off",
+	"state-change",
 };
 
 static const char *controller_problems[] = {
@@ -327,6 +328,7 @@ static const char *controller_problems[] = {
 	"tx-error-warning",
 	"rx-error-passive",
 	"tx-error-passive",
+	"back-to-error-active",
 };
 
 static const char *protocol_violation_types[] = {
@@ -471,6 +473,8 @@ void snprintf_can_error_frame(char *buf, size_t len, struct can_frame *cf,
 			if (mask == CAN_ERR_LOSTARB)
 				n += snprintf_error_lostarb(buf + n, len - n, cf);
+			if (mask == CAN_ERR_STATE_CHANGE)
+				n += snprintf_error_ctrl(buf + n, len - n, cf);
 			if (mask == CAN_ERR_CRTL)
 				n += snprintf_error_ctrl(buf + n, len - n, cf);
 			if (mask == CAN_ERR_PROT)
-- 
1.7.4.1
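
For readers who want to see what the extra table entry in the patch amounts to, here is a rough, self-contained sketch of turning an error-class bit mask into a list of names. It is not the code from lib.c. The function name is hypothetical, the first six strings are assumptions modelled on the order of the kernel's CAN_ERR_* class flags, and bit 9 for "state-change" simply follows from its position in the table after this patch.

```c
#include <stdio.h>
#include <string.h>

/* Bit i of the class mask maps to entry i; the last four names appear in the patch. */
static const char *const class_names[] = {
	"tx-timeout",               /* assumed name, bit 0 */
	"lost-arbitration",         /* assumed name, bit 1 */
	"controller-problem",       /* assumed name, bit 2 */
	"protocol-violation",       /* assumed name, bit 3 */
	"transceiver-status",       /* assumed name, bit 4 */
	"no-acknowledgement-on-tx", /* assumed name, bit 5 */
	"bus-off",
	"bus-error",
	"restarted-after-bus-off",
	"state-change",
};

/*
 * Write the comma-separated names of all set class bits into buf.
 * Returns the number of characters written (excluding the terminator);
 * on truncation the output is cut short but stays NUL-terminated.
 */
static int format_error_classes(char *buf, size_t len, unsigned int class_mask)
{
	size_t i, n = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';
	for (i = 0; i < sizeof(class_names) / sizeof(class_names[0]); i++) {
		int w;

		if (!(class_mask & (1u << i)))
			continue;
		w = snprintf(buf + n, len - n, "%s%s", n ? "," : "",
			     class_names[i]);
		if (w < 0 || (size_t)w >= len - n)
			return (int)strlen(buf); /* truncated: stop here */
		n += (size_t)w;
	}
	return (int)n;
}
```

Called with a mask of 0x240, this sketch would produce "bus-off,state-change". The real candump code additionally prints class-specific details, which is what the `snprintf_error_ctrl()` call added by the patch does for state changes.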