diff mbox

[ovs-dev,v3] ovn-northd: logical router icmp response should not care about inport

Message ID 1464280776-5446-1-git-send-email-flavio@flaviof.com
State Superseded
Headers show

Commit Message

Flaviof May 26, 2016, 4:39 p.m. UTC
When responding to icmp echo requests (aka ping) packets, the logical
router should not restrict responses based on the inport.

Example diagram:

vm: IP1.1 (subnet1)
logical_router: IP1.2 (subnet1) and IP2.2 (subnet2)

   vm -------[subnet1]------- logical_router -------[subnet2]
   <IP1.1>                <IP1.2>        <IP2.2>

vm should be able to ping <IP2.2>, even though it is an address
of a subnet that can only be reached through L3 routing.

Reference to the mailing list thread:
http://openvswitch.org/pipermail/discuss/2016-May/021172.html

---
Changes v1->v2:
  - Add unit test.

Changes v2->v3:
  - Code review feedback from Russell Bryant
  - Fix comment section on how vm can ping ip2.2

To be discussed:

One caveat here is that ttl should be decremented before vm
reaches <IP2.2>, assuming the ping packet is coming from
a remote subnet.

In short, we ideally would have the match augmented to look
like this.

  "(inport == %s || ip.ttl > 1) && (ip4.dst == "IP_FMT" || ip4.dst == "IP_FMT") && ...

Afer talking to Russel [1] it seems that we have either a bug or
a known limitation on how a match on ip.ttl can be used with the
greater than operator. I tried a couple of tweaks [2] against a
test ping with ttl 10 and concluded that both
negation and greater than are not working as expected. Assuming
ip.ttl should be supported in the match, there are 3 issues that
should be followed upon, which may be outside the scope of this
patch:
  1) fix/support negation and greater than operator for ip.ttl
  2) update doc on supported matches (ovn-sb);
  3) expand unit tests to verify ip.ttl in match

[1]: https://gist.github.com/725798db0ec7c607c648e3e85d338aa7

[2]:
* Works when sending test ping from non-local subnet and a ttl of 10:

  "(inport == %s || ip.ttl == 10) && (ip4.dst == "IP_FMT" || ip4.dst == "IP_FMT") && " ...
  "(inport == %s || ip.ttl == {2, 3, 4, 5, 6, 7, 8, 9, 10}) && (ip4.dst == "IP_FMT" || ip4.dst == "IP_FMT") && " ...

* Does NOT work when sending ping from non-local subnet and a ttl of 10:

  "(inport == %s || ip.ttl > 1) && (ip4.dst == "IP_FMT" || ip4.dst == "IP_FMT") && " ...
  "(inport == %s || ip.ttl != {2, 3, 4, 5, 6, 7, 8, 9}) && (ip4.dst == "IP_FMT" || ip4.dst == "IP_FMT") && " ...
  "!(inport != %s && ip.ttl == {0,1}) && (ip4.dst == "IP_FMT" || ip4.dst == "IP_FMT") && " ...
  "(inport == %s || !(ip.ttl == {0, 1})) && (ip4.dst == "IP_FMT" || ip4.dst == "IP_FMT") && " ...

Signed-off-by: Flavio Fernandes <flavio@flaviof.com>
---
 ovn/northd/ovn-northd.c |   6 +-
 tests/ovn.at            | 173 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 177 insertions(+), 2 deletions(-)

Comments

Ben Pfaff May 26, 2016, 6:05 p.m. UTC | #1
On Thu, May 26, 2016 at 12:39:36PM -0400, Flavio Fernandes wrote:
> To be discussed:
> 
> One caveat here is that ttl should be decremented before vm
> reaches <IP2.2>, assuming the ping packet is coming from
> a remote subnet.
> 
> In short, we ideally would have the match augmented to look
> like this.
> 
>   "(inport == %s || ip.ttl > 1) && (ip4.dst == "IP_FMT" || ip4.dst == "IP_FMT") && ...
> 
> Afer talking to Russel [1] it seems that we have either a bug or
> a known limitation on how a match on ip.ttl can be used with the
> greater than operator. I tried a couple of tweaks [2] against a
> test ping with ttl 10 and concluded that both
> negation and greater than are not working as expected. Assuming
> ip.ttl should be supported in the match, there are 3 issues that
> should be followed upon, which may be outside the scope of this
> patch:
>   1) fix/support negation and greater than operator for ip.ttl
>   2) update doc on supported matches (ovn-sb);
>   3) expand unit tests to verify ip.ttl in match

Inequalities are intentionally unsupported for ip.ttl.  You can see this
pretty clearly from ovstest output:

    $ echo 'ip.ttl == 1' | tests/ovstest test-ovn parse-expr
    ip.ttl == 0x1
    $ echo 'ip.ttl > 1' | tests/ovstest test-ovn parse-expr
    Only == and != operators may be used with nominal field ip.ttl.

The proximate reason for this is that ip.ttl is a nominal field.  That's
because the OpenFlow field for IP TTL (NXM_NX_IP_TTL) is nonmaskable;
that is, at the OpenFlow layer, OVS rejects attempts to match on a
subset of bits in the TTL.

It would be easy to change both of these decisions.  Making
NXM_NX_IP_TTL maskable is a one-line change to a comment (!), and making
ip.ttl ordinal would be equally trivial.

But I made NXM_NX_IP_TTL nonmaskable for good reason.  That is because
inequality matches are expensive.  The comparison "ip.ttl > 1" needs 7
OpenFlow flows, each of which tests only one bit.  This can be a big
deal because each of those flows ends up in a separate hash table in the
classifier, adding 7 separate hash table lookups to any classifier
search, which is inefficient.

That's why the existing ip.ttl checks instead use "ip.ttl == {0,1}",
which only uses 2 OpenFlow flows, both of which can fit in the same
classifier lookup table.

So, it's better if we can use a design that can use a higher-priority
flow to handle too-low TTLs as a special case, then handle the common
case with a lower priority flow.

I may be microoptimizing.  I am open to that discussion.
Flaviof May 26, 2016, 8:25 p.m. UTC | #2
On Thu, May 26, 2016 at 2:05 PM, Ben Pfaff <blp@ovn.org> wrote:

> On Thu, May 26, 2016 at 12:39:36PM -0400, Flavio Fernandes wrote:
> > To be discussed:
> >
> > One caveat here is that ttl should be decremented before vm
> > reaches <IP2.2>, assuming the ping packet is coming from
> > a remote subnet.
> >
> > In short, we ideally would have the match augmented to look
> > like this.
> >
> >   "(inport == %s || ip.ttl > 1) && (ip4.dst == "IP_FMT" || ip4.dst ==
> "IP_FMT") && ...
> >
> > Afer talking to Russel [1] it seems that we have either a bug or
> > a known limitation on how a match on ip.ttl can be used with the
> > greater than operator. I tried a couple of tweaks [2] against a
> > test ping with ttl 10 and concluded that both
> > negation and greater than are not working as expected. Assuming
> > ip.ttl should be supported in the match, there are 3 issues that
> > should be followed upon, which may be outside the scope of this
> > patch:
> >   1) fix/support negation and greater than operator for ip.ttl
> >   2) update doc on supported matches (ovn-sb);
> >   3) expand unit tests to verify ip.ttl in match
>
> Inequalities are intentionally unsupported for ip.ttl.  You can see this
> pretty clearly from ovstest output:
>
>     $ echo 'ip.ttl == 1' | tests/ovstest test-ovn parse-expr
>     ip.ttl == 0x1
>     $ echo 'ip.ttl > 1' | tests/ovstest test-ovn parse-expr
>     Only == and != operators may be used with nominal field ip.ttl.
>
>
Very cool! Thanks for mentioning about 'tests/ovstest test-ovn parse-expr'.
Do you think it would be a good idea to have this ttl expression added to
ovn.at ? If so, I can do it in a separate patch.



> The proximate reason for this is that ip.ttl is a nominal field.  That's
> because the OpenFlow field for IP TTL (NXM_NX_IP_TTL) is nonmaskable;
> that is, at the OpenFlow layer, OVS rejects attempts to match on a
> subset of bits in the TTL.
>
> It would be easy to change both of these decisions.  Making
> NXM_NX_IP_TTL maskable is a one-line change to a comment (!), and making
> ip.ttl ordinal would be equally trivial.
>
> But I made NXM_NX_IP_TTL nonmaskable for good reason.  That is because
> inequality matches are expensive.  The comparison "ip.ttl > 1" needs 7
> OpenFlow flows, each of which tests only one bit.  This can be a big
> deal because each of those flows ends up in a separate hash table in the
> classifier, adding 7 separate hash table lookups to any classifier
> search, which is inefficient.
>
> That's why the existing ip.ttl checks instead use "ip.ttl == {0,1}",
> which only uses 2 OpenFlow flows, both of which can fit in the same
> classifier lookup table.
>
>
Make sense. Just for giggles I did a little experiment using my vanilla
linux distro
as a router to see if it cared about ttl when responging to pings in a
similar
situation.

To my surprise -- kinda not -- the linux router code was happy to reply to
that
ping without prejudice. ;)  See below [pingTest] for the steps and output.

With that, I'm now proposing that we skip doing anything
beyond the removal of the inport from the logical flow match. If you do not
agree, please make it known on the next patch I submit for this.


> So, it's better if we can use a design that can use a higher-priority
> flow to handle too-low TTLs as a special case, then handle the common
> case with a lower priority flow.
>

And you already added that [ttldis]. My only point was that this is done
later
in the router datapath processing, so the packets destined to the router
are the
only ones that do not have the ttl check.

** update **: Darrell pointed out the excerpt in RFC that indicates that
ttl checking
is not to happen when packet is destined to the router. With that, there is
no
further discussion required on this particular topic.


> I may be microoptimizing.  I am open to that discussion.
>

Not at all, blp! And this is all very good info. I really appreciate it.

Best,

-- flaviof

[ttldis]:
https://github.com/openvswitch/ovs/commit/9975d7beb36ab3aadfd07c9d566f8d3d1d340fc4#diff-2c35162acf6ad144624954fdc4c3d9f4R1348

[pingTest]:
https://gist.github.com/bee26f82113ff9dd2715c6451e19718c

vm ---(192.168.50.0/24)--- rtr ---(192.168.111.0/24)

vm: 192.168.50.200
rtr (port1): 192.168.50.254
rtr (port2): 192.168.111.254

vagrant@ubuntu:~$ sudo ip route 192.168.111.254 255.255.255.255
192.168.50.254

vagrant@ubuntu:~$ sudo tcpdump -n -vvv -i eth1 &
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 65535
bytes

vagrant@ubuntu:~$ ping -t 1 -c 1 -I eth1 192.168.111.254 ; echo "--X-X--"
PING 192.168.111.254 (192.168.111.254) from 192.168.50.200 eth1: 56(84)
bytes of data.
64 bytes from 192.168.111.254: icmp_seq=1 ttl=64 time=0.526 ms

--- 192.168.111.254 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.526/0.526/0.526/0.000 ms
--X-X--
vagrant@ubuntu:~$ 19:53:21.606126 IP (tos 0x0, ttl 1, id 21870, offset 0,
flags [DF], proto ICMP (1), length 84)
    192.168.50.200 > 192.168.111.254: ICMP echo request, id 3656, seq 1,
length 64
19:53:21.606632 IP (tos 0x0, ttl 64, id 47178, offset 0, flags [none],
proto ICMP (1), length 84)
    192.168.111.254 > 192.168.50.200: ICMP echo reply, id 3656, seq 1,
length 64
diff mbox

Patch

diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
index 44e9430..f295052 100644
--- a/ovn/northd/ovn-northd.c
+++ b/ovn/northd/ovn-northd.c
@@ -1892,9 +1892,11 @@  build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
         free(match);
 
         /* ICMP echo reply.  These flows reply to ICMP echo requests
-         * received for the router's IP address. */
+         * received for the router's IP address. Since packets only
+         * get here as part of the logical router datapath, the inport
+         * (i.e. the incoming locally attached net) does not matter. */
         match = xasprintf(
-            "inport == %s && (ip4.dst == "IP_FMT" || ip4.dst == "IP_FMT") && "
+            "(inport == %s || ip.ttl > 1) && (ip4.dst == "IP_FMT" || ip4.dst == "IP_FMT") && "
             "icmp4.type == 8 && icmp4.code == 0",
             op->json_key, IP_ARGS(op->ip), IP_ARGS(op->bcast));
         char *actions = xasprintf(
diff --git a/tests/ovn.at b/tests/ovn.at
index e6ac1d7..8cd3677 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -2611,3 +2611,176 @@  OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 AT_CLEANUP
+
+AT_SETUP([ovn -- icmp_reply: 1 HVs, 2 LSs, 1 lport/LS, 1 LR])
+AT_KEYWORDS([router-icmp-reply])
+AT_SKIP_IF([test $HAVE_PYTHON = no])
+ovn_start
+
+# Logical network:
+# One LR - R1 has switch ls1 (191.168.1.0/24) connected to it,
+# and has switch ls2 (172.16.1.0/24) connected to it.
+
+ovn-nbctl create Logical_Router name=R1
+
+ovn-nbctl lswitch-add ls1
+ovn-nbctl lswitch-add ls2
+
+# Connect ls1 to R1
+ovn-nbctl -- --id=@lrp create Logical_Router_port name=ls1 \
+network=192.168.1.1/24 mac=\"00:00:00:01:02:f1\" -- add Logical_Router R1 \
+ports @lrp -- lport-add ls1 rp-ls1
+
+ovn-nbctl set Logical_port rp-ls1 type=router options:router-port=ls1 \
+addresses=\"00:00:00:01:02:f1\"
+
+# Connect ls2 to R1
+ovn-nbctl -- --id=@lrp create Logical_Router_port name=ls2 \
+network=172.16.1.1/24 mac=\"00:00:00:01:02:f2\" -- add Logical_Router R1 \
+ports @lrp -- lport-add ls2 rp-ls2
+
+ovn-nbctl set Logical_port rp-ls2 type=router options:router-port=ls2 \
+addresses=\"00:00:00:01:02:f2\"
+
+# Create logical port ls1-lp1 in ls1
+ovn-nbctl lport-add ls1 ls1-lp1 \
+-- lport-set-addresses ls1-lp1 "00:00:00:01:02:03 192.168.1.2"
+
+# Create logical port ls2-lp1 in ls2
+ovn-nbctl lport-add ls2 ls2-lp1 \
+-- lport-set-addresses ls2-lp1 "00:00:00:01:02:04 172.16.1.2"
+
+# Create one hypervisor and create OVS ports corresponding to logical ports.
+net_add n1
+
+sim_add hv1
+as hv1
+ovs-vsctl add-br br-phys
+ovn_attach n1 br-phys 192.168.0.1
+ovs-vsctl -- add-port br-int vif1 -- \
+    set interface vif1 external-ids:iface-id=ls1-lp1 \
+    options:tx_pcap=hv1/vif1-tx.pcap \
+    options:rxq_pcap=hv1/vif1-rx.pcap \
+    ofport-request=1
+
+ovs-vsctl -- add-port br-int vif2 -- \
+    set interface vif2 external-ids:iface-id=ls2-lp1 \
+    options:tx_pcap=hv1/vif2-tx.pcap \
+    options:rxq_pcap=hv1/vif2-rx.pcap \
+    ofport-request=1
+
+
+# Allow some time for ovn-northd and ovn-controller to catch up.
+# XXX This should be more systematic.
+sleep 1
+
+
+ip_to_hex() {
+    printf "%02x%02x%02x%02x" "$@"
+}
+trim_zeros() {
+    sed 's/\(00\)\{1,\}$//'
+}
+for i in 1 2; do
+    : > vif$i.expected
+done
+# test_ipv4_icmp_request INPORT ETH_SRC ETH_DST IPV4_SRC IPV4_DST IP_CHKSUM ICMP_CHKSUM [EXP_IP_CHKSUM EXP_ICMP_CHKSUM]
+#
+# Causes a packet to be received on INPORT.  The packet is an ICMPv4
+# request with ETH_SRC, ETH_DST, IPV4_SRC and IPV4_DST as specified.  If EXP_IP_CHKSUM
+# is provided, then it should be the ip checksum of the packet responded;
+# otherwise no reply is expected.
+# In the absence of an ip checksum calculation helpers, we will rely on caller to provide the checksums for the ip
+# and the icmp headers.
+# XXX This should be more systematic.
+#
+# INPORT is an lport number, e.g. 11 for vif11.
+# ETH_SRC and ETH_DST are each 12 hex digits.
+# IPV4_SRC and IPV4_DST are each 8 hex digits.
+# IP_CHSUM and ICMP_CHKSUM are each 4 hex digits.
+# EXP_IP_CHSUM and EXP_ICMP_CHKSUM are each 4 hex digits.
+test_ipv4_icmp_request() {
+    local inport=$1 eth_src=$2 eth_dst=$3 ipv4_src=$4 ipv4_dst=$5 ip_chksum=$6 icmp_chksum=$7
+    local exp_ip_chksum=$8 exp_icmp_chksum=$9
+    shift; shift; shift; shift; shift; shift; shift
+    shift; shift
+
+    local ip_ttl=0a
+    local icmp_id=5fbf
+    local icmp_seq=0001
+    local icmp_data=$(seq 1 56 | xargs printf "%02x")
+    local icmp_type_code_request=0800
+    local icmp_payload=${icmp_type_code_request}${icmp_chksum}${icmp_id}${icmp_seq}${icmp_data}
+    local packet=${eth_dst}${eth_src}08004500005400004000${ip_ttl}01${ip_chksum}${ipv4_src}${ipv4_dst}${icmp_payload}
+
+    as hv1 ovs-appctl netdev-dummy/receive vif$inport $packet
+    if test X$exp_ip_chksum != X; then
+        # Expect to receive the reply, if any. In same port where packet was sent.
+        # Note: src and dst are expected to be reversed.
+        local icmp_type_code_response=0000
+        local reply_icmp_ttl=fe
+        local reply_icmp_payload=${icmp_type_code_response}${exp_icmp_chksum}${icmp_id}${icmp_seq}${icmp_data}
+        local reply=${eth_src}${eth_dst}08004500005400004000${reply_icmp_ttl}01${exp_ip_chksum}${ipv4_dst}${ipv4_src}${reply_icmp_payload}
+        echo $reply >> vif$inport.expected
+    fi
+}
+
+# send ping packet to router's ip addresses, from each of the 2 logical ports.
+rtr_l1_ip=$(ip_to_hex 192 168 1 1)
+rtr_l2_ip=$(ip_to_hex 172 16 1 1)
+l1_ip=$(ip_to_hex 192 168 1 2)
+l2_ip=$(ip_to_hex 172 16 1 2)
+
+# ping router ip address that is on same subnet as the logical port
+test_ipv4_icmp_request 1 000000010203 0000000102f1 $l1_ip $rtr_l1_ip 0000 8510 0bff 8d10
+test_ipv4_icmp_request 2 000000010204 0000000102f2 $l2_ip $rtr_l2_ip 0000 8510 0bff 8d10
+
+# ping router ip address that is on the other side of the logical ports
+test_ipv4_icmp_request 1 000000010203 0000000102f1 $l1_ip $rtr_l2_ip 0000 8510 0bff 8d10
+test_ipv4_icmp_request 2 000000010204 0000000102f2 $l2_ip $rtr_l1_ip 0000 8510 0bff 8d10
+
+echo "---------NB dump-----"
+ovn-nbctl show
+echo "---------------------"
+ovn-nbctl list logical_router
+echo "---------------------"
+ovn-nbctl list logical_router_port
+echo "---------------------"
+
+echo "---------SB dump-----"
+ovn-sbctl list datapath_binding
+echo "---------------------"
+ovn-sbctl list logical_flow
+echo "---------------------"
+
+echo "------ hv1 dump ----------"
+as hv1 ovs-ofctl dump-flows br-int
+
+# check received packets against expected
+for inport in 1 2; do
+    file=hv1/vif${inport}-tx.pcap
+    echo $file
+    $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $file | trim_zeros > received.packets
+    cat vif$inport.expected | trim_zeros > expout
+    AT_CHECK([cat received.packets], [0], [expout])
+done
+
+as hv1
+OVS_APP_EXIT_AND_WAIT([ovn-controller])
+OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as ovn-sb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as ovn-nb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as northd
+OVS_APP_EXIT_AND_WAIT([ovn-northd])
+
+as main
+OVS_APP_EXIT_AND_WAIT([ovs-vswitchd])
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+AT_CLEANUP