Discussion:
[PIC] MLA TCP/IP Stack Recovering from Ethernet Error
Harold Hallikainen
2017-09-23 18:57:49 UTC
In the Microchip Library for Applications ENCX24J600.c, there is the
following code:

// Validate the data returned from the ENC624J600 Family device. Random
// data corruption, such as if a single SPI/PSP bit error occurs while
// communicating or a momentary power glitch could cause this to occur
// in rare circumstances. Also, certain hardware bugs such as violations
// of the absolute maximum electrical specs can cause this. For example,
// if an MCU with a high slew rate were to access the interface, parasitic
// inductance in the traces could cause excessive voltage undershoot.
// If the voltage goes too far below ground, the ENCx24J600's internal
// ESD structure may activate and disrupt the communication. To prevent
// this, ensure that you have a clean board layout and consider adding
// resistors in series with the MCU output pins to limit the slew rate
// of signals going to the ENCx24J600. 100 Ohm resistors is a good value
// to start testing with.
if(header.NextPacketPointer > RXSTOP ||
   ((BYTE_VAL*)(&header.NextPacketPointer))->bits.b0 ||
   header.StatusVector.bits.Zero || header.StatusVector.bits.ZeroH ||
   header.StatusVector.bits.CRCError ||
   header.StatusVector.bits.ByteCount > 1522u ||
   !header.StatusVector.bits.ReceiveOk)
{
    Reset();
}


In several products using both parallel and SPI interfaces to the ENC,
I've found that this condition occurs, causing a system reset, when the
network is very busy. Some of these products are on very small PCBs where
I do not believe overshoot or undershoot is an issue. What I'd like is a
graceful way of recovering from this condition without rebooting the
system. I don't really know which of the ORed conditions is causing the
reset. Is there a way of recovering without resetting the system, maybe
without even breaking an existing TCP connection?

Thanks!

Harold
--
FCC Rules Updated Daily at http://www.hallikainen.com
Not sent from an iPhone.
--
http://www.piclist.com/techref/piclist PIC/SX FAQ & list archive
View/change your membership options at
http://mailman.mit.edu/mailman/listinfo/piclist
William Westfield
2017-09-23 21:27:44 UTC
Post by Harold Hallikainen
// Validate the data returned from the ENC624J600 Family device.
What a load of CYA crap. “Sometimes the packet is invalid; it must be a layout problem! Or maybe Alpha Particles!” That “CRCError” is an actual ethernet packet CRC error, as far as I can tell. While those are supposed to be pretty rare in modern real ethernets
Post by Harold Hallikainen
if(header.NextPacketPointer > RXSTOP ||
((BYTE_VAL*)(&header.NextPacketPointer))->bits.b0 ||
header.StatusVector.bits.Zero || header.StatusVector.bits.ZeroH ||
header.StatusVector.bits.CRCError ||
header.StatusVector.bits.ByteCount > 1522u ||
!header.StatusVector.bits.ReceiveOk)
{
Reset();
}
I've found that this condition occurs causing a system reset when the
network is very busy.
Is there a way of recovering without resetting the system, maybe
without even breaking an existing TCP connection?
You can try just dropping the packet. They seem to be assuming “the hardware is in a weird state, we’d better reset everything”, and I don’t think that’s true of all of the errors. A more complete fix might be to separate out the individual errors, count them, and only do the reset when there is “unacceptable frequency of an error.”
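A minimal sketch of the counting idea above, assuming the driver can sort each bad header into one cause before deciding what to do. The names (rx_fault_note, struct rx_fault_stats), the 256-packet window, and the 64-fault tolerance are all hypothetical and not part of the MLA. The real driver would still have to discard the bad packet itself, and a corrupt NextPacketPointer may not be safely skippable at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical; none of this is MLA API. */
enum rx_fault {
    RX_FAULT_POINTER,   /* NextPacketPointer out of range or odd */
    RX_FAULT_STATUS,    /* Zero/ZeroH set, or ReceiveOk clear */
    RX_FAULT_CRC,       /* CRCError set */
    RX_FAULT_LENGTH,    /* ByteCount > 1522 */
    RX_FAULT_KINDS
};

#define WINDOW_PACKETS        256u  /* arbitrary window length, in packets */
#define MAX_FAULTS_PER_WINDOW  64u  /* arbitrary tolerance per cause */

struct rx_fault_stats {
    uint32_t total[RX_FAULT_KINDS];      /* lifetime count, for logging */
    uint16_t in_window[RX_FAULT_KINDS];  /* count within current window */
    uint16_t packets_in_window;
};

/* Call once per received header. Pass faulty=false for a good packet
 * (cause is then ignored). Returns true only when one cause has exceeded
 * its tolerance within the current window, i.e. when escalating to a
 * reset might be justified. */
bool rx_fault_note(struct rx_fault_stats *s, bool faulty, enum rx_fault cause)
{
    bool escalate = false;

    assert(s != NULL);
    if (faulty) {
        assert(cause < RX_FAULT_KINDS);
        s->total[cause]++;
        s->in_window[cause]++;
        escalate = s->in_window[cause] > MAX_FAULTS_PER_WINDOW;
    }
    /* Start a fresh window every WINDOW_PACKETS headers. */
    if (++s->packets_in_window >= WINDOW_PACKETS) {
        memset(s->in_window, 0, sizeof s->in_window);
        s->packets_in_window = 0;
    }
    return escalate;
}
```

In the receive path, the unconditional Reset() could then become something like "if (rx_fault_note(&stats, true, RX_FAULT_CRC)) Reset();", with the packet dropped otherwise. The lifetime totals are kept separately so they can go into the error log.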

BillW
smplx
2017-09-23 23:03:51 UTC
Post by Harold Hallikainen
// Validate the data returned from the ENC624J600 Family device.
What a load of CYA crap. “Sometimes the packet is invalid; it must be a layout problem! Or maybe Alpha Particles!” That “CRCError” is an actual ethernet packet CRC error, as far as I can tell. While those are supposed to be pretty rare in modern real ethernets
Post by Harold Hallikainen
if(header.NextPacketPointer > RXSTOP ||
((BYTE_VAL*)(&header.NextPacketPointer))->bits.b0 ||
header.StatusVector.bits.Zero || header.StatusVector.bits.ZeroH ||
header.StatusVector.bits.CRCError ||
header.StatusVector.bits.ByteCount > 1522u ||
!header.StatusVector.bits.ReceiveOk)
{
Reset();
}
I've found that this condition occurs causing a system reset when the
network is very busy.
Is there a way of recovering without resetting the system, maybe
without even breaking an existing TCP connection?
You can try just dropping the packet. They seem to be assuming “the
hardware is in a weird state, we’d better reset everything”, and I don’t
think that’s true of all of the errors. A more complete fix might be to
separate out the individual errors, count them, and only do the reset
when there is “unacceptable frequency of an error.”
Perhaps resetting on a "long burst" would be better than "frequency"?

What if 2 out of every 3 packets are being dropped? Something useful is
still getting through. Would you still want to reset? If the reset did not
cure the problem then nothing would get through (endlessly resetting). If,
however, you dropped lots of packets but limped along, then at least you
might be able to diagnose the fault.
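The "long burst" policy could be sketched roughly as follows, assuming the receive path can report each header as good or bad. The names and the limit of 32 consecutive bad headers are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BURST_LIMIT 32u  /* arbitrary: consecutive bad headers tolerated */

/* Hypothetical bookkeeping; not part of the MLA. */
struct rx_burst {
    uint16_t consecutive;  /* bad headers since the last good one */
    uint16_t longest;      /* longest run seen so far, for logging */
};

/* Call once per received header. Returns true only when the current run
 * of bad headers has reached BURST_LIMIT. Any good header ends the run,
 * so a link that loses 2 packets out of 3 never triggers a reset. */
bool rx_burst_note(struct rx_burst *b, bool faulty)
{
    assert(b != NULL);
    if (!faulty) {
        b->consecutive = 0;
        return false;
    }
    if (b->consecutive < UINT16_MAX)
        b->consecutive++;
    if (b->consecutive > b->longest)
        b->longest = b->consecutive;
    return b->consecutive >= BURST_LIMIT;
}
```

Compared with a frequency threshold, this keeps a degraded but working link alive, and the longest-run figure gives something concrete to log while diagnosing the fault.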

Regards
Sergio Masci
Harold Hallikainen
2017-09-24 05:44:39 UTC
Post by Harold Hallikainen
// Validate the data returned from the ENC624J600 Family device.
What a load of CYA crap. “Sometimes the packet is invalid; it must be a
layout problem! Or maybe Alpha Particles!” That “CRCError” is an actual
ethernet packet CRC error, as far as I can tell. While those are supposed
to be pretty rare in modern real ethernets
Post by Harold Hallikainen
if(header.NextPacketPointer > RXSTOP ||
((BYTE_VAL*)(&header.NextPacketPointer))->bits.b0 ||
header.StatusVector.bits.Zero || header.StatusVector.bits.ZeroH ||
header.StatusVector.bits.CRCError ||
header.StatusVector.bits.ByteCount > 1522u ||
!header.StatusVector.bits.ReceiveOk)
{
Reset();
}
I've found that this condition occurs causing a system reset when the
network is very busy.
Is there a way of recovering without resetting the system, maybe
without even breaking an existing TCP connection?
You can try just dropping the packet. They seem to be assuming “the
hardware is in a weird state, we’d better reset everything”, and I don’t
think that’s true of all of the errors. A more complete fix might be to
separate out the individual errors, count them, and only do the reset when
there is “unacceptable frequency of an error.”
BillW
Thanks! Catching this with a debugger has proved very difficult. It only
seems to happen in the field and on maybe 20 out of 10,000 systems on very
busy networks. I do log the errors, so I can split the test up and log
them separately to see what is going wrong. I've tried reinitializing
stuff by calling MacInit and TcpStackInit (or whatever they are called),
but that does not seem to recover the system. I'll start with more
detailed logging.
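One way to split the test up for logging is to turn the ORed conditions into a bitmask, so a single log entry records every condition that fired. This is only a sketch: struct rx_header_view is a simplified stand-in for the driver's real header type, the flag names are hypothetical, and rx_stop is passed in rather than taken from the driver's RXSTOP definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical view of the receive header. The field names
 * follow the quoted check; the layout is an assumption. */
struct rx_header_view {
    uint16_t next_packet_pointer;
    uint16_t byte_count;
    bool zero, zero_h, crc_error, receive_ok;
};

/* One bit per condition in the original if() statement. */
enum {
    RXF_PTR_RANGE = 1 << 0,  /* NextPacketPointer > RXSTOP */
    RXF_PTR_ODD   = 1 << 1,  /* NextPacketPointer bit 0 set */
    RXF_ZERO      = 1 << 2,  /* StatusVector Zero bit set */
    RXF_ZERO_H    = 1 << 3,  /* StatusVector ZeroH bit set */
    RXF_CRC       = 1 << 4,  /* CRCError set */
    RXF_LENGTH    = 1 << 5,  /* ByteCount > 1522 */
    RXF_NOT_OK    = 1 << 6   /* ReceiveOk clear */
};

/* Returns 0 for a header that passes every check, otherwise the OR of
 * the flags for each failed check. */
uint8_t rx_header_faults(const struct rx_header_view *h, uint16_t rx_stop)
{
    uint8_t f = 0;

    assert(h != NULL);
    if (h->next_packet_pointer > rx_stop) f |= RXF_PTR_RANGE;
    if (h->next_packet_pointer & 1u)      f |= RXF_PTR_ODD;
    if (h->zero)                          f |= RXF_ZERO;
    if (h->zero_h)                        f |= RXF_ZERO_H;
    if (h->crc_error)                     f |= RXF_CRC;
    if (h->byte_count > 1522u)            f |= RXF_LENGTH;
    if (!h->receive_ok)                   f |= RXF_NOT_OK;
    return f;
}
```

Logging the returned byte just before the existing Reset() would show whether field failures are CRC or length errors, which look like ordinary bad frames, or pointer and Zero-bit errors, which look more like corrupted reads from the chip.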

Thanks!

Harold
--
FCC Rules Updated Daily at http://www.hallikainen.com
Not sent from an iPhone.
James Cameron
2017-09-24 07:42:01 UTC
I don't know the peripheral, but on ethernet both ByteCount above 1522
and CRCError can be a collision or interference, and device reset in
response seems a bit eager.

CRCError can also be a fault in a sending device.

I remember having to split networks with hubs or switches to keep
numbers down to what the customer wanted.
[...] I've tried reinitializing stuff by calling MacInit and
TcpStackInit (or whatever they are called), but that does not seem
to recover the system. [...]
Suggests it isn't CRCError or large packets. I'd expect an ethernet
peripheral to keep going quite cheerfully after either of these
events, without anything additional other than what is normally done
to prepare the peripheral for the next transfer.
--
James Cameron
http://quozl.netrek.org/
Harold Hallikainen
2017-09-24 16:07:21 UTC
Post by James Cameron
I don't know the peripheral, but on ethernet both ByteCount above 1522
and CRCError can be a collision or interference, and device reset in
response seems a bit eager.
CRCError can also be a fault in a sending device.
I remember having to split networks with hubs or switches to keep
numbers down to what the customer wanted.
Thanks! I have a Wireshark capture from a customer site where the network
gets flooded with TCP communications between a couple of other devices. I
don't know why my system is seeing all that traffic. It SEEMS like the
switch should send it to the receiving device only based on the MAC
address.

Harold
--
FCC Rules Updated Daily at http://www.hallikainen.com
Not sent from an iPhone.
James Cameron
2017-09-24 22:28:55 UTC
Post by Harold Hallikainen
Post by James Cameron
I don't know the peripheral, but on ethernet both ByteCount above 1522
and CRCError can be a collision or interference, and device reset in
response seems a bit eager.
CRCError can also be a fault in a sending device.
I remember having to split networks with hubs or switches to keep
numbers down to what the customer wanted.
Thanks! I have a Wireshark capture from a customer site where the
network gets flooded with TCP communications between a couple of other
devices. I don't know why my system is seeing all that traffic. It
SEEMS like the switch should send it to the receiving device only
based on the MAC address.
Possibly. A switch is supposed to segregate traffic by MAC address.
Consult the switch vendor. It also depends on how it was captured.

Please excuse excess detail; I'm ignorant of your knowledge.

If Wireshark was used on a host that is not your embedded system:

Ask for the MAC addresses of the host that was running Wireshark.
There may be more than one. Label this "List A".

Use Wireshark to get the MAC addresses of the foreign TCP stream. There
should be at least two. Label this "List B".

If any address in List B also appears in List A, then ignore those
packets: they are a side effect of running Wireshark on that host,
which always captures packets to or from the host itself.
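The List A / List B comparison is small enough to do by eye, but as a sketch it amounts to checking whether the two lists share any address. The mac_addr type and function name here are hypothetical, and the addresses would have to be typed in from the capture by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 6-octet MAC address type. */
typedef struct { unsigned char octet[6]; } mac_addr;

/* Returns true when at least one address appears in both lists.
 * List A: MACs of the host that ran Wireshark.
 * List B: MACs seen as endpoints of the foreign TCP stream.
 * A true result suggests the capture host is itself an endpoint, so the
 * captured stream says nothing about what the switch forwards to other
 * ports. */
bool mac_lists_overlap(const mac_addr *a, size_t na,
                       const mac_addr *b, size_t nb)
{
    size_t i, j;

    assert(a != NULL && b != NULL);
    for (i = 0; i < na; i++)
        for (j = 0; j < nb; j++)
            if (memcmp(a[i].octet, b[j].octet, sizeof a[i].octet) == 0)
                return true;
    return false;
}
```

If the lists do not overlap, the capture host really was seeing traffic between two other devices, which points back at the switch or at how the capture was taken.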

--

In general, I've found byte counters and packet counters to be a very
effective way to locate network misconfiguration.
--
James Cameron
http://quozl.netrek.org/