[ovs-dev,v3,1/2] ofproto: Allow xlate_actions() to fail.

Message ID	A3C6B412-9531-4D16-B4B8-4E068C95294B@ovn.org
State	Accepted
Headers	show Return-Path: <dev-bounces@openvswitch.org> Received-SPF: pass (mx1-pf2.cudamail.com: SPF record at ovn.org designates 217.70.183.194 as permitted sender) sender: jarno@ovn.org) by relay2-d.mail.gandi.net (Postfix) with ESMTPSA id B82ACC5A5D; Wed, 25 Nov 2015 20:23:36 +0100 (CET) Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2104\)) ofproto: Allow xlate_actions() to fail. From: Jarno Rajahalme <jarno@ovn.org> In-Reply-To: <DC48686D-DE05-4261-94A0-816A4B7209A2@ovn.org> Date: Wed, 25 Nov 2015 11:23:35 -0800 Message-Id: <A3C6B412-9531-4D16-B4B8-4E068C95294B@ovn.org> References: <1448401261-33942-1-git-send-email-jarno@ovn.org> <CAOftzPgQED76mTBCe27V+xdYF_+QhQvozTOOO3XcqY_d7FHqJA@mail.gmail.com> <E3472633-38FF-4E6B-8A15-FEC0CF4DAA09@ovn.org> <CAOftzPjO8r7u0SBZtfhqxm5jqG=pqgA4jO_JAQE+F_rJhSSeJw@mail.gmail.com> <DC48686D-DE05-4261-94A0-816A4B7209A2@ovn.org> To: Joe Stringer <joe@ovn.org> Cc: "dev@openvswitch.org" <dev@openvswitch.org> Subject: Re: [ovs-dev] [PATCH v3 1/2] ofproto: Allow xlate_actions() to fail. Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: dev-bounces@openvswitch.org Sender: "dev" <dev-bounces@openvswitch.org>

Jarno Rajahalme Nov. 25, 2015, 7:23 p.m. UTC

> On Nov 25, 2015, at 11:11 AM, Jarno Rajahalme <jarno@ovn.org> wrote:
> 
> 
>> On Nov 25, 2015, at 10:52 AM, Joe Stringer <joe@ovn.org <mailto:joe@ovn.org>> wrote:
>> 
>> On 25 November 2015 at 10:31, Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>> wrote:
>>> 
>>>> On Nov 24, 2015, at 5:02 PM, Joe Stringer <joe@ovn.org <mailto:joe@ovn.org>> wrote:
>>>> 
>>>> On 24 November 2015 at 13:41, Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>> wrote:
>>>>> Sometimes xlate_actions() fails due to too deep recursion, too many
>>>>> MPLS labels, or missing recirculation context.  Make xlate_actions()
>>>>> clear out the produced odp actions in these cases to make it easy for
>>>>> the caller to install a drop flow (instead or installing a flow with
>>>>> partially translated actions).  Also, return a specific error code, so
>>>>> that the error can be properly propagated where meaningful.
>>>>> 
>>>>> Before this patch it was possible that the revalidation installed a
>>>>> flow with a recirculation ID with an invalid recirc ID (== 0), due to
>>>>> the introduction of in-place modification in commit 43b2f131a229
>>>>> (ofproto: Allow in-place modifications of datapath flows).
>>>>> 
>>>>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>>
>>>> 
>>>> Should this also set the error when receiving packets on a mirror port
>>>> in xlate_actions()? Or when receiving tagged VLAN traffic that doesn't
>>>> correspond to the port's vlan tag? Or when a group has no live bucket?
>>>> Are there any other cases that should also be covered? (I just scanned
>>>> across ofproto/ofproto-dpif-xlate.c looking for cases where we're
>>>> already logging that we drop the packet, but maybe there's a reasoning
>>>> behind not including these - if so, please enlighten me)
>>> 
>>> No reasoning for missing those, I just did not notice them. Thanks for pointing them out.
>> 
>> OK, I thought it may have been something like "expected errors" vs.
>> "unexpected errors".
> 
> Looking into these I noticed this to be the case. Must discern whether to fail just the individual action v.s. the whole pipeline.
> 

How about this incremental to cover two cases here (rest are “expected errors” IMO):


The last one is debatable, as setting the error fails the whole translation rather than just the normal action. But this is most likely an configuration error, so maybe failing the whole pipeline (and installing a drop flow) is the right thing to do here?

>   Jarno
>

Jarno Rajahalme Nov. 25, 2015, 9:38 p.m. UTC | #1

> On Nov 25, 2015, at 11:23 AM, Jarno Rajahalme <jarno@ovn.org> wrote:
> 
>> 
>> On Nov 25, 2015, at 11:11 AM, Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>> wrote:
>> 
>> 
>>> On Nov 25, 2015, at 10:52 AM, Joe Stringer <joe@ovn.org <mailto:joe@ovn.org>> wrote:
>>> 
>>> On 25 November 2015 at 10:31, Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>> wrote:
>>>> 
>>>>> On Nov 24, 2015, at 5:02 PM, Joe Stringer <joe@ovn.org <mailto:joe@ovn.org>> wrote:
>>>>> 
>>>>> On 24 November 2015 at 13:41, Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>> wrote:
>>>>>> Sometimes xlate_actions() fails due to too deep recursion, too many
>>>>>> MPLS labels, or missing recirculation context.  Make xlate_actions()
>>>>>> clear out the produced odp actions in these cases to make it easy for
>>>>>> the caller to install a drop flow (instead or installing a flow with
>>>>>> partially translated actions).  Also, return a specific error code, so
>>>>>> that the error can be properly propagated where meaningful.
>>>>>> 
>>>>>> Before this patch it was possible that the revalidation installed a
>>>>>> flow with a recirculation ID with an invalid recirc ID (== 0), due to
>>>>>> the introduction of in-place modification in commit 43b2f131a229
>>>>>> (ofproto: Allow in-place modifications of datapath flows).
>>>>>> 
>>>>>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>>
>>>>> 
>>>>> Should this also set the error when receiving packets on a mirror port
>>>>> in xlate_actions()? Or when receiving tagged VLAN traffic that doesn't
>>>>> correspond to the port's vlan tag? Or when a group has no live bucket?
>>>>> Are there any other cases that should also be covered? (I just scanned
>>>>> across ofproto/ofproto-dpif-xlate.c looking for cases where we're
>>>>> already logging that we drop the packet, but maybe there's a reasoning
>>>>> behind not including these - if so, please enlighten me)
>>>> 
>>>> No reasoning for missing those, I just did not notice them. Thanks for pointing them out.
>>> 
>>> OK, I thought it may have been something like "expected errors" vs.
>>> "unexpected errors".
>> 
>> Looking into these I noticed this to be the case. Must discern whether to fail just the individual action v.s. the whole pipeline.
>> 
> 
> How about this incremental to cover two cases here (rest are “expected errors” IMO):
> 
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 36a6fbc..2908339 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -336,6 +336,10 @@ const char *xlate_strerror(enum xlate_error error)
>          return "Recirculation conflict";
>      case XLATE_TOO_MANY_MPLS_LABELS:
>          return "Too many MPLS labels";
> +    case XLATE_BUCKET_CHAINING_TOO_DEEP:
> +        return "Bucket chaining too deep";
> +    case XLATE_NO_INPUT_BUNDLE:
> +        return "No input bundle";
>      }
>      return "Unknown error";
>  }
> @@ -1444,10 +1448,9 @@ bucket_is_alive(const struct xlate_ctx *ctx,
>                  struct ofputil_bucket *bucket, int depth)
>  {
>      if (depth >= MAX_LIVENESS_RECURSION) {
> -        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
> -
> -        VLOG_WARN_RL(&rl, "bucket chaining exceeded %d links",
> -                     MAX_LIVENESS_RECURSION);
> +        XLATE_REPORT_ERROR(ctx, "bucket chaining exceeded %d links",
> +                           MAX_LIVENESS_RECURSION);
> +        ctx->error = XLATE_BUCKET_CHAINING_TOO_DEEP;
>          return false;
>      }
>  
> @@ -2323,7 +2326,8 @@ xlate_normal(struct xlate_ctx *ctx)
>      in_xbundle = lookup_input_bundle(ctx->xbridge, flow->in_port.ofp_port,
>                                       ctx->xin->packet != NULL, &in_port);
>      if (!in_xbundle) {
> -        xlate_report(ctx, "no input bundle, dropping");
> +        XLATE_REPORT_ERROR(ctx, "no input bundle, dropping");
> +        ctx->error = XLATE_NO_INPUT_BUNDLE;
>          return;
>      }
>  
> 
> The last one is debatable, as setting the error fails the whole translation rather than just the normal action. But this is most likely an configuration error, so maybe failing the whole pipeline (and installing a drop flow) is the right thing to do here?
> 

I’d like to push this patch today, either with or without the above incremental, if possible, and then post a new version of the NAT series re-based on this. Ben already Acked the 2nd patch.

  Jarno

Joe Stringer Nov. 25, 2015, 10:58 p.m. UTC | #2

On 25 November 2015 at 11:23, Jarno Rajahalme <jarno@ovn.org> wrote:
>
> On Nov 25, 2015, at 11:11 AM, Jarno Rajahalme <jarno@ovn.org> wrote:
>
>
> On Nov 25, 2015, at 10:52 AM, Joe Stringer <joe@ovn.org> wrote:
>
> On 25 November 2015 at 10:31, Jarno Rajahalme <jarno@ovn.org> wrote:
>
>
> On Nov 24, 2015, at 5:02 PM, Joe Stringer <joe@ovn.org> wrote:
>
> On 24 November 2015 at 13:41, Jarno Rajahalme <jarno@ovn.org> wrote:
>
> Sometimes xlate_actions() fails due to too deep recursion, too many
> MPLS labels, or missing recirculation context.  Make xlate_actions()
> clear out the produced odp actions in these cases to make it easy for
> the caller to install a drop flow (instead or installing a flow with
> partially translated actions).  Also, return a specific error code, so
> that the error can be properly propagated where meaningful.
>
> Before this patch it was possible that the revalidation installed a
> flow with a recirculation ID with an invalid recirc ID (== 0), due to
> the introduction of in-place modification in commit 43b2f131a229
> (ofproto: Allow in-place modifications of datapath flows).
>
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
>
>
> Should this also set the error when receiving packets on a mirror port
> in xlate_actions()? Or when receiving tagged VLAN traffic that doesn't
> correspond to the port's vlan tag? Or when a group has no live bucket?
> Are there any other cases that should also be covered? (I just scanned
> across ofproto/ofproto-dpif-xlate.c looking for cases where we're
> already logging that we drop the packet, but maybe there's a reasoning
> behind not including these - if so, please enlighten me)
>
>
> No reasoning for missing those, I just did not notice them. Thanks for
> pointing them out.
>
>
> OK, I thought it may have been something like "expected errors" vs.
> "unexpected errors".
>
>
> Looking into these I noticed this to be the case. Must discern whether to
> fail just the individual action v.s. the whole pipeline.
>
>
> How about this incremental to cover two cases here (rest are “expected
> errors” IMO):
>
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 36a6fbc..2908339 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -336,6 +336,10 @@ const char *xlate_strerror(enum xlate_error error)
>          return "Recirculation conflict";
>      case XLATE_TOO_MANY_MPLS_LABELS:
>          return "Too many MPLS labels";
> +    case XLATE_BUCKET_CHAINING_TOO_DEEP:
> +        return "Bucket chaining too deep";
> +    case XLATE_NO_INPUT_BUNDLE:
> +        return "No input bundle";
>      }
>      return "Unknown error";
>  }
> @@ -1444,10 +1448,9 @@ bucket_is_alive(const struct xlate_ctx *ctx,
>                  struct ofputil_bucket *bucket, int depth)
>  {
>      if (depth >= MAX_LIVENESS_RECURSION) {
> -        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
> -
> -        VLOG_WARN_RL(&rl, "bucket chaining exceeded %d links",
> -                     MAX_LIVENESS_RECURSION);
> +        XLATE_REPORT_ERROR(ctx, "bucket chaining exceeded %d links",
> +                           MAX_LIVENESS_RECURSION);
> +        ctx->error = XLATE_BUCKET_CHAINING_TOO_DEEP;
>          return false;
>      }
>
> @@ -2323,7 +2326,8 @@ xlate_normal(struct xlate_ctx *ctx)
>      in_xbundle = lookup_input_bundle(ctx->xbridge, flow->in_port.ofp_port,
>                                       ctx->xin->packet != NULL, &in_port);
>      if (!in_xbundle) {
> -        xlate_report(ctx, "no input bundle, dropping");
> +        XLATE_REPORT_ERROR(ctx, "no input bundle, dropping");
> +        ctx->error = XLATE_NO_INPUT_BUNDLE;
>          return;
>      }
>
>
> The last one is debatable, as setting the error fails the whole translation
> rather than just the normal action. But this is most likely an configuration
> error, so maybe failing the whole pipeline (and installing a drop flow) is
> the right thing to do here?

Jarno and I discussed this offline, and I'll try to summarise here.
Broadly speaking, we're talking about the decision between failing an
individual (piece of an) action or completely failing the action
processing for the flow. And I think arguably the approach should be
that if it is a serious error such as running out of resources or an
internal conflict of recirc IDs, then we should fail the entire action
processing. In this case it will have two user-visible effects:
1) ofproto/trace will tell the user which serious condition is being
triggered that causes dropping of the flow
2) OpenFlow controllers attempting packet_out could be notified that
the error occurred (rather than silently failing like currently)

However, in the two cases in the incremental patch here, the actions
inherently have some ambiguity as to whether they successfully execute
(eg output) or not. The more obvious case is in the bucket_is_alive()
logic, where recursion will cause a bucket to be not used. If a bucket
is not live in the spec, this doesn't mean that the entire flow should
stop processing. In the case of normal, I'd argue it's very similar in
that 'normal' doesn't specifically attempt to output to a particular
port; sending packets out to different ports may fail for different
reasons, but this shouldn't prevent later actions in the actions list
from being executed.

I think the latter cases should be reported for ofproto/trace, though.

Looking back across this thread, it looks not far off your reasoning
described earlier so I think we're converging on the same view. Does
this sound like a fair approach?

--

In the mirror case, the point is moot because do_xlate_actions() isn't
even called in that case, so it's purely a matter of whether we want
to return the error up the stack or not. Maybe that should be reported
for ofproto/trace as well.

I didn't see any other cases that might need handling through this.

Jarno Rajahalme Nov. 25, 2015, 11:06 p.m. UTC | #3

> On Nov 25, 2015, at 2:58 PM, Joe Stringer <joe@ovn.org> wrote:
> 
> On 25 November 2015 at 11:23, Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>> wrote:
>> 
>> On Nov 25, 2015, at 11:11 AM, Jarno Rajahalme <jarno@ovn.org> wrote:
>> 
>> 
>> On Nov 25, 2015, at 10:52 AM, Joe Stringer <joe@ovn.org> wrote:
>> 
>> On 25 November 2015 at 10:31, Jarno Rajahalme <jarno@ovn.org> wrote:
>> 
>> 
>> On Nov 24, 2015, at 5:02 PM, Joe Stringer <joe@ovn.org> wrote:
>> 
>> On 24 November 2015 at 13:41, Jarno Rajahalme <jarno@ovn.org> wrote:
>> 
>> Sometimes xlate_actions() fails due to too deep recursion, too many
>> MPLS labels, or missing recirculation context.  Make xlate_actions()
>> clear out the produced odp actions in these cases to make it easy for
>> the caller to install a drop flow (instead or installing a flow with
>> partially translated actions).  Also, return a specific error code, so
>> that the error can be properly propagated where meaningful.
>> 
>> Before this patch it was possible that the revalidation installed a
>> flow with a recirculation ID with an invalid recirc ID (== 0), due to
>> the introduction of in-place modification in commit 43b2f131a229
>> (ofproto: Allow in-place modifications of datapath flows).
>> 
>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
>> 
>> 
>> Should this also set the error when receiving packets on a mirror port
>> in xlate_actions()? Or when receiving tagged VLAN traffic that doesn't
>> correspond to the port's vlan tag? Or when a group has no live bucket?
>> Are there any other cases that should also be covered? (I just scanned
>> across ofproto/ofproto-dpif-xlate.c looking for cases where we're
>> already logging that we drop the packet, but maybe there's a reasoning
>> behind not including these - if so, please enlighten me)
>> 
>> 
>> No reasoning for missing those, I just did not notice them. Thanks for
>> pointing them out.
>> 
>> 
>> OK, I thought it may have been something like "expected errors" vs.
>> "unexpected errors".
>> 
>> 
>> Looking into these I noticed this to be the case. Must discern whether to
>> fail just the individual action v.s. the whole pipeline.
>> 
>> 
>> How about this incremental to cover two cases here (rest are “expected
>> errors” IMO):
>> 
>> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
>> index 36a6fbc..2908339 100644
>> --- a/ofproto/ofproto-dpif-xlate.c
>> +++ b/ofproto/ofproto-dpif-xlate.c
>> @@ -336,6 +336,10 @@ const char *xlate_strerror(enum xlate_error error)
>>         return "Recirculation conflict";
>>     case XLATE_TOO_MANY_MPLS_LABELS:
>>         return "Too many MPLS labels";
>> +    case XLATE_BUCKET_CHAINING_TOO_DEEP:
>> +        return "Bucket chaining too deep";
>> +    case XLATE_NO_INPUT_BUNDLE:
>> +        return "No input bundle";
>>     }
>>     return "Unknown error";
>> }
>> @@ -1444,10 +1448,9 @@ bucket_is_alive(const struct xlate_ctx *ctx,
>>                 struct ofputil_bucket *bucket, int depth)
>> {
>>     if (depth >= MAX_LIVENESS_RECURSION) {
>> -        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
>> -
>> -        VLOG_WARN_RL(&rl, "bucket chaining exceeded %d links",
>> -                     MAX_LIVENESS_RECURSION);
>> +        XLATE_REPORT_ERROR(ctx, "bucket chaining exceeded %d links",
>> +                           MAX_LIVENESS_RECURSION);
>> +        ctx->error = XLATE_BUCKET_CHAINING_TOO_DEEP;
>>         return false;
>>     }
>> 
>> @@ -2323,7 +2326,8 @@ xlate_normal(struct xlate_ctx *ctx)
>>     in_xbundle = lookup_input_bundle(ctx->xbridge, flow->in_port.ofp_port,
>>                                      ctx->xin->packet != NULL, &in_port);
>>     if (!in_xbundle) {
>> -        xlate_report(ctx, "no input bundle, dropping");
>> +        XLATE_REPORT_ERROR(ctx, "no input bundle, dropping");
>> +        ctx->error = XLATE_NO_INPUT_BUNDLE;
>>         return;
>>     }
>> 
>> 
>> The last one is debatable, as setting the error fails the whole translation
>> rather than just the normal action. But this is most likely an configuration
>> error, so maybe failing the whole pipeline (and installing a drop flow) is
>> the right thing to do here?
> 
> Jarno and I discussed this offline, and I'll try to summarise here.
> Broadly speaking, we're talking about the decision between failing an
> individual (piece of an) action or completely failing the action
> processing for the flow. And I think arguably the approach should be
> that if it is a serious error such as running out of resources or an
> internal conflict of recirc IDs, then we should fail the entire action
> processing. In this case it will have two user-visible effects:
> 1) ofproto/trace will tell the user which serious condition is being
> triggered that causes dropping of the flow
> 2) OpenFlow controllers attempting packet_out could be notified that
> the error occurred (rather than silently failing like currently)
> 
> However, in the two cases in the incremental patch here, the actions
> inherently have some ambiguity as to whether they successfully execute
> (eg output) or not. The more obvious case is in the bucket_is_alive()
> logic, where recursion will cause a bucket to be not used. If a bucket
> is not live in the spec, this doesn't mean that the entire flow should
> stop processing. In the case of normal, I'd argue it's very similar in
> that 'normal' doesn't specifically attempt to output to a particular
> port; sending packets out to different ports may fail for different
> reasons, but this shouldn't prevent later actions in the actions list
> from being executed.
> 
> I think the latter cases should be reported for ofproto/trace, though.
> 
> Looking back across this thread, it looks not far off your reasoning
> described earlier so I think we're converging on the same view. Does
> this sound like a fair approach?
> 
> --
> 
> In the mirror case, the point is moot because do_xlate_actions() isn't
> even called in that case, so it's purely a matter of whether we want
> to return the error up the stack or not. Maybe that should be reported
> for ofproto/trace as well.
> 
> I didn't see any other cases that might need handling through this.

So the only ask I see here is that more of the cases of individual actions bailing out should have xlate_report() calls on them. To me this sounds like a different patch, not directly related to erroring out the whole translation. As such I hope to get an Ack on the original patch of this now lengthy discussion…

  Jarno

Joe Stringer Nov. 25, 2015, 11:12 p.m. UTC | #4

On 25 November 2015 at 15:06, Jarno Rajahalme <jarno@ovn.org> wrote:
>
>> On Nov 25, 2015, at 2:58 PM, Joe Stringer <joe@ovn.org> wrote:
>>
>> On 25 November 2015 at 11:23, Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>> wrote:
>>>
>>> On Nov 25, 2015, at 11:11 AM, Jarno Rajahalme <jarno@ovn.org> wrote:
>>>
>>>
>>> On Nov 25, 2015, at 10:52 AM, Joe Stringer <joe@ovn.org> wrote:
>>>
>>> On 25 November 2015 at 10:31, Jarno Rajahalme <jarno@ovn.org> wrote:
>>>
>>>
>>> On Nov 24, 2015, at 5:02 PM, Joe Stringer <joe@ovn.org> wrote:
>>>
>>> On 24 November 2015 at 13:41, Jarno Rajahalme <jarno@ovn.org> wrote:
>>>
>>> Sometimes xlate_actions() fails due to too deep recursion, too many
>>> MPLS labels, or missing recirculation context.  Make xlate_actions()
>>> clear out the produced odp actions in these cases to make it easy for
>>> the caller to install a drop flow (instead or installing a flow with
>>> partially translated actions).  Also, return a specific error code, so
>>> that the error can be properly propagated where meaningful.
>>>
>>> Before this patch it was possible that the revalidation installed a
>>> flow with a recirculation ID with an invalid recirc ID (== 0), due to
>>> the introduction of in-place modification in commit 43b2f131a229
>>> (ofproto: Allow in-place modifications of datapath flows).
>>>
>>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
>>>
>>>
>>> Should this also set the error when receiving packets on a mirror port
>>> in xlate_actions()? Or when receiving tagged VLAN traffic that doesn't
>>> correspond to the port's vlan tag? Or when a group has no live bucket?
>>> Are there any other cases that should also be covered? (I just scanned
>>> across ofproto/ofproto-dpif-xlate.c looking for cases where we're
>>> already logging that we drop the packet, but maybe there's a reasoning
>>> behind not including these - if so, please enlighten me)
>>>
>>>
>>> No reasoning for missing those, I just did not notice them. Thanks for
>>> pointing them out.
>>>
>>>
>>> OK, I thought it may have been something like "expected errors" vs.
>>> "unexpected errors".
>>>
>>>
>>> Looking into these I noticed this to be the case. Must discern whether to
>>> fail just the individual action v.s. the whole pipeline.
>>>
>>>
>>> How about this incremental to cover two cases here (rest are “expected
>>> errors” IMO):
>>>
>>> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
>>> index 36a6fbc..2908339 100644
>>> --- a/ofproto/ofproto-dpif-xlate.c
>>> +++ b/ofproto/ofproto-dpif-xlate.c
>>> @@ -336,6 +336,10 @@ const char *xlate_strerror(enum xlate_error error)
>>>         return "Recirculation conflict";
>>>     case XLATE_TOO_MANY_MPLS_LABELS:
>>>         return "Too many MPLS labels";
>>> +    case XLATE_BUCKET_CHAINING_TOO_DEEP:
>>> +        return "Bucket chaining too deep";
>>> +    case XLATE_NO_INPUT_BUNDLE:
>>> +        return "No input bundle";
>>>     }
>>>     return "Unknown error";
>>> }
>>> @@ -1444,10 +1448,9 @@ bucket_is_alive(const struct xlate_ctx *ctx,
>>>                 struct ofputil_bucket *bucket, int depth)
>>> {
>>>     if (depth >= MAX_LIVENESS_RECURSION) {
>>> -        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
>>> -
>>> -        VLOG_WARN_RL(&rl, "bucket chaining exceeded %d links",
>>> -                     MAX_LIVENESS_RECURSION);
>>> +        XLATE_REPORT_ERROR(ctx, "bucket chaining exceeded %d links",
>>> +                           MAX_LIVENESS_RECURSION);
>>> +        ctx->error = XLATE_BUCKET_CHAINING_TOO_DEEP;
>>>         return false;
>>>     }
>>>
>>> @@ -2323,7 +2326,8 @@ xlate_normal(struct xlate_ctx *ctx)
>>>     in_xbundle = lookup_input_bundle(ctx->xbridge, flow->in_port.ofp_port,
>>>                                      ctx->xin->packet != NULL, &in_port);
>>>     if (!in_xbundle) {
>>> -        xlate_report(ctx, "no input bundle, dropping");
>>> +        XLATE_REPORT_ERROR(ctx, "no input bundle, dropping");
>>> +        ctx->error = XLATE_NO_INPUT_BUNDLE;
>>>         return;
>>>     }
>>>
>>>
>>> The last one is debatable, as setting the error fails the whole translation
>>> rather than just the normal action. But this is most likely an configuration
>>> error, so maybe failing the whole pipeline (and installing a drop flow) is
>>> the right thing to do here?
>>
>> Jarno and I discussed this offline, and I'll try to summarise here.
>> Broadly speaking, we're talking about the decision between failing an
>> individual (piece of an) action or completely failing the action
>> processing for the flow. And I think arguably the approach should be
>> that if it is a serious error such as running out of resources or an
>> internal conflict of recirc IDs, then we should fail the entire action
>> processing. In this case it will have two user-visible effects:
>> 1) ofproto/trace will tell the user which serious condition is being
>> triggered that causes dropping of the flow
>> 2) OpenFlow controllers attempting packet_out could be notified that
>> the error occurred (rather than silently failing like currently)
>>
>> However, in the two cases in the incremental patch here, the actions
>> inherently have some ambiguity as to whether they successfully execute
>> (eg output) or not. The more obvious case is in the bucket_is_alive()
>> logic, where recursion will cause a bucket to be not used. If a bucket
>> is not live in the spec, this doesn't mean that the entire flow should
>> stop processing. In the case of normal, I'd argue it's very similar in
>> that 'normal' doesn't specifically attempt to output to a particular
>> port; sending packets out to different ports may fail for different
>> reasons, but this shouldn't prevent later actions in the actions list
>> from being executed.
>>
>> I think the latter cases should be reported for ofproto/trace, though.
>>
>> Looking back across this thread, it looks not far off your reasoning
>> described earlier so I think we're converging on the same view. Does
>> this sound like a fair approach?
>>
>> --
>>
>> In the mirror case, the point is moot because do_xlate_actions() isn't
>> even called in that case, so it's purely a matter of whether we want
>> to return the error up the stack or not. Maybe that should be reported
>> for ofproto/trace as well.
>>
>> I didn't see any other cases that might need handling through this.
>
> So the only ask I see here is that more of the cases of individual actions bailing out should have xlate_report() calls on them. To me this sounds like a different patch, not directly related to erroring out the whole translation. As such I hope to get an Ack on the original patch of this now lengthy discussion…

I agree.

Acked-by: Joe Stringer <joe@ovn.org>

Joe Stringer Nov. 25, 2015, 11:14 p.m. UTC | #5

On 25 November 2015 at 15:12, Joe Stringer <joestringer@nicira.com> wrote:
> On 25 November 2015 at 15:06, Jarno Rajahalme <jarno@ovn.org> wrote:
>>
>>> On Nov 25, 2015, at 2:58 PM, Joe Stringer <joe@ovn.org> wrote:
>>>
>>> On 25 November 2015 at 11:23, Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>> wrote:
>>>>
>>>> On Nov 25, 2015, at 11:11 AM, Jarno Rajahalme <jarno@ovn.org> wrote:
>>>>
>>>>
>>>> On Nov 25, 2015, at 10:52 AM, Joe Stringer <joe@ovn.org> wrote:
>>>>
>>>> On 25 November 2015 at 10:31, Jarno Rajahalme <jarno@ovn.org> wrote:
>>>>
>>>>
>>>> On Nov 24, 2015, at 5:02 PM, Joe Stringer <joe@ovn.org> wrote:
>>>>
>>>> On 24 November 2015 at 13:41, Jarno Rajahalme <jarno@ovn.org> wrote:
>>>>
>>>> Sometimes xlate_actions() fails due to too deep recursion, too many
>>>> MPLS labels, or missing recirculation context.  Make xlate_actions()
>>>> clear out the produced odp actions in these cases to make it easy for
>>>> the caller to install a drop flow (instead or installing a flow with
>>>> partially translated actions).  Also, return a specific error code, so
>>>> that the error can be properly propagated where meaningful.
>>>>
>>>> Before this patch it was possible that the revalidation installed a
>>>> flow with a recirculation ID with an invalid recirc ID (== 0), due to
>>>> the introduction of in-place modification in commit 43b2f131a229
>>>> (ofproto: Allow in-place modifications of datapath flows).
>>>>
>>>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
>>>>
>>>>
>>>> Should this also set the error when receiving packets on a mirror port
>>>> in xlate_actions()? Or when receiving tagged VLAN traffic that doesn't
>>>> correspond to the port's vlan tag? Or when a group has no live bucket?
>>>> Are there any other cases that should also be covered? (I just scanned
>>>> across ofproto/ofproto-dpif-xlate.c looking for cases where we're
>>>> already logging that we drop the packet, but maybe there's a reasoning
>>>> behind not including these - if so, please enlighten me)
>>>>
>>>>
>>>> No reasoning for missing those, I just did not notice them. Thanks for
>>>> pointing them out.
>>>>
>>>>
>>>> OK, I thought it may have been something like "expected errors" vs.
>>>> "unexpected errors".
>>>>
>>>>
>>>> Looking into these I noticed this to be the case. Must discern whether to
>>>> fail just the individual action v.s. the whole pipeline.
>>>>
>>>>
>>>> How about this incremental to cover two cases here (rest are “expected
>>>> errors” IMO):
>>>>
>>>> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
>>>> index 36a6fbc..2908339 100644
>>>> --- a/ofproto/ofproto-dpif-xlate.c
>>>> +++ b/ofproto/ofproto-dpif-xlate.c
>>>> @@ -336,6 +336,10 @@ const char *xlate_strerror(enum xlate_error error)
>>>>         return "Recirculation conflict";
>>>>     case XLATE_TOO_MANY_MPLS_LABELS:
>>>>         return "Too many MPLS labels";
>>>> +    case XLATE_BUCKET_CHAINING_TOO_DEEP:
>>>> +        return "Bucket chaining too deep";
>>>> +    case XLATE_NO_INPUT_BUNDLE:
>>>> +        return "No input bundle";
>>>>     }
>>>>     return "Unknown error";
>>>> }
>>>> @@ -1444,10 +1448,9 @@ bucket_is_alive(const struct xlate_ctx *ctx,
>>>>                 struct ofputil_bucket *bucket, int depth)
>>>> {
>>>>     if (depth >= MAX_LIVENESS_RECURSION) {
>>>> -        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
>>>> -
>>>> -        VLOG_WARN_RL(&rl, "bucket chaining exceeded %d links",
>>>> -                     MAX_LIVENESS_RECURSION);
>>>> +        XLATE_REPORT_ERROR(ctx, "bucket chaining exceeded %d links",
>>>> +                           MAX_LIVENESS_RECURSION);
>>>> +        ctx->error = XLATE_BUCKET_CHAINING_TOO_DEEP;
>>>>         return false;
>>>>     }
>>>>
>>>> @@ -2323,7 +2326,8 @@ xlate_normal(struct xlate_ctx *ctx)
>>>>     in_xbundle = lookup_input_bundle(ctx->xbridge, flow->in_port.ofp_port,
>>>>                                      ctx->xin->packet != NULL, &in_port);
>>>>     if (!in_xbundle) {
>>>> -        xlate_report(ctx, "no input bundle, dropping");
>>>> +        XLATE_REPORT_ERROR(ctx, "no input bundle, dropping");
>>>> +        ctx->error = XLATE_NO_INPUT_BUNDLE;
>>>>         return;
>>>>     }
>>>>
>>>>
>>>> The last one is debatable, as setting the error fails the whole translation
>>>> rather than just the normal action. But this is most likely an configuration
>>>> error, so maybe failing the whole pipeline (and installing a drop flow) is
>>>> the right thing to do here?
>>>
>>> Jarno and I discussed this offline, and I'll try to summarise here.
>>> Broadly speaking, we're talking about the decision between failing an
>>> individual (piece of an) action or completely failing the action
>>> processing for the flow. And I think arguably the approach should be
>>> that if it is a serious error such as running out of resources or an
>>> internal conflict of recirc IDs, then we should fail the entire action
>>> processing. In this case it will have two user-visible effects:
>>> 1) ofproto/trace will tell the user which serious condition is being
>>> triggered that causes dropping of the flow
>>> 2) OpenFlow controllers attempting packet_out could be notified that
>>> the error occurred (rather than silently failing like currently)
>>>
>>> However, in the two cases in the incremental patch here, the actions
>>> inherently have some ambiguity as to whether they successfully execute
>>> (eg output) or not. The more obvious case is in the bucket_is_alive()
>>> logic, where recursion will cause a bucket to be not used. If a bucket
>>> is not live in the spec, this doesn't mean that the entire flow should
>>> stop processing. In the case of normal, I'd argue it's very similar in
>>> that 'normal' doesn't specifically attempt to output to a particular
>>> port; sending packets out to different ports may fail for different
>>> reasons, but this shouldn't prevent later actions in the actions list
>>> from being executed.
>>>
>>> I think the latter cases should be reported for ofproto/trace, though.
>>>
>>> Looking back across this thread, it looks not far off your reasoning
>>> described earlier so I think we're converging on the same view. Does
>>> this sound like a fair approach?
>>>
>>> --
>>>
>>> In the mirror case, the point is moot because do_xlate_actions() isn't
>>> even called in that case, so it's purely a matter of whether we want
>>> to return the error up the stack or not. Maybe that should be reported
>>> for ofproto/trace as well.
>>>
>>> I didn't see any other cases that might need handling through this.
>>
>> So the only ask I see here is that more of the cases of individual actions bailing out should have xlate_report() calls on them. To me this sounds like a different patch, not directly related to erroring out the whole translation. As such I hope to get an Ack on the original patch of this now lengthy discussion…
>
> I agree.
>
> Acked-by: Joe Stringer <joe@ovn.org>

Ah, mismatch between my sender email an the ack. Here:

Acked-by: Joe Stringer <joe@ovn.org>

Jarno Rajahalme Nov. 25, 2015, 11:33 p.m. UTC | #6

> On Nov 25, 2015, at 3:14 PM, Joe Stringer <joe@ovn.org> wrote:
> 
> On 25 November 2015 at 15:12, Joe Stringer <joestringer@nicira.com> wrote:
>> On 25 November 2015 at 15:06, Jarno Rajahalme <jarno@ovn.org> wrote:
>>> 
>>>> On Nov 25, 2015, at 2:58 PM, Joe Stringer <joe@ovn.org> wrote:
>>>> 
>>>> On 25 November 2015 at 11:23, Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>> wrote:
>>>>> 
>>>>> On Nov 25, 2015, at 11:11 AM, Jarno Rajahalme <jarno@ovn.org> wrote:
>>>>> 
>>>>> 
>>>>> On Nov 25, 2015, at 10:52 AM, Joe Stringer <joe@ovn.org> wrote:
>>>>> 
>>>>> On 25 November 2015 at 10:31, Jarno Rajahalme <jarno@ovn.org> wrote:
>>>>> 
>>>>> 
>>>>> On Nov 24, 2015, at 5:02 PM, Joe Stringer <joe@ovn.org> wrote:
>>>>> 
>>>>> On 24 November 2015 at 13:41, Jarno Rajahalme <jarno@ovn.org> wrote:
>>>>> 
>>>>> Sometimes xlate_actions() fails due to too deep recursion, too many
>>>>> MPLS labels, or missing recirculation context.  Make xlate_actions()
>>>>> clear out the produced odp actions in these cases to make it easy for
>>>>> the caller to install a drop flow (instead or installing a flow with
>>>>> partially translated actions).  Also, return a specific error code, so
>>>>> that the error can be properly propagated where meaningful.
>>>>> 
>>>>> Before this patch it was possible that the revalidation installed a
>>>>> flow with a recirculation ID with an invalid recirc ID (== 0), due to
>>>>> the introduction of in-place modification in commit 43b2f131a229
>>>>> (ofproto: Allow in-place modifications of datapath flows).
>>>>> 
>>>>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
>>>>> 
>>>>> 
>>>>> Should this also set the error when receiving packets on a mirror port
>>>>> in xlate_actions()? Or when receiving tagged VLAN traffic that doesn't
>>>>> correspond to the port's vlan tag? Or when a group has no live bucket?
>>>>> Are there any other cases that should also be covered? (I just scanned
>>>>> across ofproto/ofproto-dpif-xlate.c looking for cases where we're
>>>>> already logging that we drop the packet, but maybe there's a reasoning
>>>>> behind not including these - if so, please enlighten me)
>>>>> 
>>>>> 
>>>>> No reasoning for missing those, I just did not notice them. Thanks for
>>>>> pointing them out.
>>>>> 
>>>>> 
>>>>> OK, I thought it may have been something like "expected errors" vs.
>>>>> "unexpected errors".
>>>>> 
>>>>> 
>>>>> Looking into these I noticed this to be the case. Must discern whether to
>>>>> fail just the individual action v.s. the whole pipeline.
>>>>> 
>>>>> 
>>>>> How about this incremental to cover two cases here (rest are “expected
>>>>> errors” IMO):
>>>>> 
>>>>> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
>>>>> index 36a6fbc..2908339 100644
>>>>> --- a/ofproto/ofproto-dpif-xlate.c
>>>>> +++ b/ofproto/ofproto-dpif-xlate.c
>>>>> @@ -336,6 +336,10 @@ const char *xlate_strerror(enum xlate_error error)
>>>>>        return "Recirculation conflict";
>>>>>    case XLATE_TOO_MANY_MPLS_LABELS:
>>>>>        return "Too many MPLS labels";
>>>>> +    case XLATE_BUCKET_CHAINING_TOO_DEEP:
>>>>> +        return "Bucket chaining too deep";
>>>>> +    case XLATE_NO_INPUT_BUNDLE:
>>>>> +        return "No input bundle";
>>>>>    }
>>>>>    return "Unknown error";
>>>>> }
>>>>> @@ -1444,10 +1448,9 @@ bucket_is_alive(const struct xlate_ctx *ctx,
>>>>>                struct ofputil_bucket *bucket, int depth)
>>>>> {
>>>>>    if (depth >= MAX_LIVENESS_RECURSION) {
>>>>> -        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
>>>>> -
>>>>> -        VLOG_WARN_RL(&rl, "bucket chaining exceeded %d links",
>>>>> -                     MAX_LIVENESS_RECURSION);
>>>>> +        XLATE_REPORT_ERROR(ctx, "bucket chaining exceeded %d links",
>>>>> +                           MAX_LIVENESS_RECURSION);
>>>>> +        ctx->error = XLATE_BUCKET_CHAINING_TOO_DEEP;
>>>>>        return false;
>>>>>    }
>>>>> 
>>>>> @@ -2323,7 +2326,8 @@ xlate_normal(struct xlate_ctx *ctx)
>>>>>    in_xbundle = lookup_input_bundle(ctx->xbridge, flow->in_port.ofp_port,
>>>>>                                     ctx->xin->packet != NULL, &in_port);
>>>>>    if (!in_xbundle) {
>>>>> -        xlate_report(ctx, "no input bundle, dropping");
>>>>> +        XLATE_REPORT_ERROR(ctx, "no input bundle, dropping");
>>>>> +        ctx->error = XLATE_NO_INPUT_BUNDLE;
>>>>>        return;
>>>>>    }
>>>>> 
>>>>> 
>>>>> The last one is debatable, as setting the error fails the whole translation
>>>>> rather than just the normal action. But this is most likely an configuration
>>>>> error, so maybe failing the whole pipeline (and installing a drop flow) is
>>>>> the right thing to do here?
>>>> 
>>>> Jarno and I discussed this offline, and I'll try to summarise here.
>>>> Broadly speaking, we're talking about the decision between failing an
>>>> individual (piece of an) action or completely failing the action
>>>> processing for the flow. And I think arguably the approach should be
>>>> that if it is a serious error such as running out of resources or an
>>>> internal conflict of recirc IDs, then we should fail the entire action
>>>> processing. In this case it will have two user-visible effects:
>>>> 1) ofproto/trace will tell the user which serious condition is being
>>>> triggered that causes dropping of the flow
>>>> 2) OpenFlow controllers attempting packet_out could be notified that
>>>> the error occurred (rather than silently failing like currently)
>>>> 
>>>> However, in the two cases in the incremental patch here, the actions
>>>> inherently have some ambiguity as to whether they successfully execute
>>>> (eg output) or not. The more obvious case is in the bucket_is_alive()
>>>> logic, where recursion will cause a bucket to be not used. If a bucket
>>>> is not live in the spec, this doesn't mean that the entire flow should
>>>> stop processing. In the case of normal, I'd argue it's very similar in
>>>> that 'normal' doesn't specifically attempt to output to a particular
>>>> port; sending packets out to different ports may fail for different
>>>> reasons, but this shouldn't prevent later actions in the actions list
>>>> from being executed.
>>>> 
>>>> I think the latter cases should be reported for ofproto/trace, though.
>>>> 
>>>> Looking back across this thread, it looks not far off your reasoning
>>>> described earlier so I think we're converging on the same view. Does
>>>> this sound like a fair approach?
>>>> 
>>>> --
>>>> 
>>>> In the mirror case, the point is moot because do_xlate_actions() isn't
>>>> even called in that case, so it's purely a matter of whether we want
>>>> to return the error up the stack or not. Maybe that should be reported
>>>> for ofproto/trace as well.
>>>> 
>>>> I didn't see any other cases that might need handling through this.
>>> 
>>> So the only ask I see here is that more of the cases of individual actions bailing out should have xlate_report() calls on them. To me this sounds like a different patch, not directly related to erroring out the whole translation. As such I hope to get an Ack on the original patch of this now lengthy discussion…
>> 
>> I agree.
>> 
>> Acked-by: Joe Stringer <joe@ovn.org>
> 
> Ah, mismatch between my sender email an the ack. Here:
> 
> Acked-by: Joe Stringer <joe@ovn.org>

Thanks for the thorough review, pushed to master.

  Jarno

[ovs-dev,v3,1/2] ofproto: Allow xlate_actions() to fail.

Commit Message

Comments

Patch