From patchwork Wed Nov 30 17:01:24 2016
X-Patchwork-Submitter: Bernd Edlinger
X-Patchwork-Id: 701090
From: Bernd Edlinger
To: Wilco Dijkstra, Ramana Radhakrishnan
Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw, nd
Subject: Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)
Date: Wed, 30 Nov 2016 17:01:24 +0000
On 11/30/16 13:01, Wilco Dijkstra wrote:
> Bernd Edlinger wrote:
>> On 11/29/16 16:06, Wilco Dijkstra wrote:
>>> Bernd Edlinger wrote:
>>>
>>> -   "TARGET_32BIT && reload_completed
>>> +   "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
>>>      && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
>>>
>>> This is equivalent to "&& (!TARGET_IWMMXT || reload_completed)" since we're
>>> already excluding NEON.
>>
>> Ahem, no.  That would split the adddi3_neon insn before it is clear
>> whether the reload pass will assign a VFP register.
>
> Hmm, that's strange... This instruction shouldn't be used to also split some
> random Neon pattern - for example arm_subdi3 doesn't do the same.  To
> understand and reason about any of these complex patterns they should all
> work in the same way...
>

I was a bit surprised as well when I saw that happen.

But subdi3 is different:

  "TARGET_32BIT && !TARGET_NEON"
  "#"  ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2"
  "&& reload_completed"

so it never splits anything if TARGET_NEON.

But adddi3 cannot be expanded if TARGET_NEON, yet its pattern looks
exactly like adddi3_neon:

(define_insn_and_split "*arm_adddi3"
  [(set (match_operand:DI          0 "s_register_operand" "=&r,&r,&r,&r,&r")
        (plus:DI (match_operand:DI 1 "s_register_operand" "%0, 0, r, 0, r")
                 (match_operand:DI 2 "arm_adddi_operand"  "r,  0, r, Dd, Dd")))
   (clobber (reg:CC CC_REGNUM))]
  "TARGET_32BIT && !TARGET_NEON"
  "#"
  "TARGET_32BIT && reload_completed
   && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"

(define_insn "adddi3_neon"
  [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?&r,?w,?&r,?&r,?&r")
        (plus:DI (match_operand:DI 1 "s_register_operand" "%w,0,0,w,r,0,r")
                 (match_operand:DI 2 "arm_adddi_operand"  "w,r,0,w,r,Dd,Dd")))
   (clobber (reg:CC CC_REGNUM))]
  "TARGET_NEON"
{
  switch (which_alternative)
    {
    case 0: /* fall through */
    case 3: return "vadd.i64\t%P0, %P1, %P2";
    case 1: return "#";
    case 2: return "#";
    case 4: return "#";
    case 5: return "#";
    case 6: return "#";
    default: gcc_unreachable ();
    }

Even the return "#" explicitly invokes the former pattern, so I think
the author knew that and did it on purpose.

>> But when I make *arm_cmpdi_insn split early, it ICEs:
>
> (insn 4870 4869 1636 87 (set (scratch:SI)
>         (minus:SI (minus:SI (subreg:SI (reg:DI 2261) 4)
>                 (subreg:SI (reg:DI 473 [ X$14 ]) 4))
>             (ltu:SI (reg:CC_C 100 cc)
>                 (const_int 0 [0])))) "pr77308-2.c":140 -1
>      (nil))
>
> That's easy, we don't have a "sbcs <scratch>, r1, r2" pattern.  A quick
> workaround is to create a temporary for operands[2] (if before reload) so it
> will match the standard sbcs pattern, and then the split works fine.
>
>> So it is certainly possible, but not really simple to improve the
>> stack size even further.  But I would prefer to do that in a
>> separate patch.
>
> Yes, separate patches would be fine.  However there is a lot of scope to
> improve this further.  For example, after your patch shifts and logical
> operations are expanded at expand time, add/sub are split in split1 after
> combine runs, and everything else is split after reload.  It doesn't make
> sense to split different operations at different times - it means you're
> still going to get the bad DImode subregs and miss lots of optimization
> opportunities due to the mix of partly split and partly not-yet-split
> operations.
>

Yes.  I did the add/sub differently because it was easier that way, and
it was simply sufficient to make the existing test cases happy.

Also, the biggest benefit was IIRC from the very early splitting of the
anddi/iordi/xordi patterns, because they have completely separate data
flow in the low and high parts.  That is not the case for the arithmetic
patterns, but they can still be optimized, preferably once a new test
case is found that demonstrates an improvement.

I am not sure why the cmpdi pattern has an influence at all, because
from the data flow you need all 64 bits of both sides.  Nevertheless it
is a fact: with the modified test case I get a 264-byte frame size,
where it was 1920 bytes before.

I have attached the completely untested follow-up patch now, but I would
like to post it again for review after I have applied my current patch,
which is still waiting for final review (please feel pinged!).

This is really exciting...


Thanks
Bernd.
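
To illustrate the data-flow point above, here is a minimal C sketch (not
part of the patch; the di_pair type and the helper names are made up for
illustration).  It shows why a DImode logical operation decomposes into
two completely independent SImode halves, while an add or an ordered
compare ties the halves together through a carry/flags dependency:

#include <stdint.h>

typedef struct { uint32_t lo, hi; } di_pair;  /* a DImode value as two SImode halves */

/* anddi3: the halves never interact - two independent SImode ANDs.  */
di_pair and_di (di_pair a, di_pair b)
{
  di_pair r = { a.lo & b.lo, a.hi & b.hi };
  return r;
}

/* adddi3: the high half consumes the carry produced by the low half
   (adds/adc), so the halves cannot be treated independently.  */
di_pair add_di (di_pair a, di_pair b)
{
  di_pair r;
  r.lo = a.lo + b.lo;
  uint32_t carry = r.lo < a.lo;  /* carry out of the low word */
  r.hi = a.hi + b.hi + carry;
  return r;
}

/* cmpdi: an ordered compare needs all 64 bits of both operands
   (cmp/sbcs on ARM), so there are no independent halves to split.  */
int less_than_di (di_pair a, di_pair b)
{
  return a.hi != b.hi ? a.hi < b.hi : a.lo < b.lo;
}

Splitting the purely logical cases early lets the register allocator
treat each 32-bit half separately, which is presumably where most of the
stack-usage win comes from.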
--- gcc/config/arm/arm.md.orig	2016-11-27 09:22:41.794790123 +0100
+++ gcc/config/arm/arm.md	2016-11-30 16:40:30.140532737 +0100
@@ -4738,7 +4738,7 @@
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_ARM"
   "#"   ; "rsbs\\t%Q0, %Q1, #0\;rsc\\t%R0, %R1, #0"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (const_int 0) (match_dup 1)))
 	      (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))])
@@ -7432,7 +7432,7 @@
    (clobber (match_scratch:SI 2 "=r"))]
   "TARGET_32BIT"
   "#"   ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
 	(compare:CC (match_dup 0) (match_dup 1)))
    (parallel [(set (reg:CC CC_REGNUM)
@@ -7456,7 +7456,10 @@
 	operands[5] = gen_rtx_MINUS (SImode, operands[3], operands[4]);
       }
     operands[1] = gen_lowpart (SImode, operands[1]);
-    operands[2] = gen_lowpart (SImode, operands[2]);
+    if (can_create_pseudo_p ())
+      operands[2] = gen_reg_rtx (SImode);
+    else
+      operands[2] = gen_lowpart (SImode, operands[2]);
   }
   [(set_attr "conds" "set")
    (set_attr "length" "8")
@@ -7470,7 +7473,7 @@
   "TARGET_32BIT"
   "#"   ; "cmp\\t%R0, %R1\;it eq\;cmpeq\\t%Q0, %Q1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
 	(compare:CC (match_dup 2) (match_dup 3)))
    (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0))
--- gcc/config/arm/thumb2.md.orig	2016-11-30 16:57:44.760589624 +0100
+++ gcc/config/arm/thumb2.md	2016-11-30 16:58:05.310590754 +0100
@@ -132,7 +132,7 @@
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_THUMB2"
   "#"   ; negs\\t%Q0, %Q1\;sbc\\t%R0, %R1, %R1, lsl #1
-  "&& reload_completed"
+  "&& (!TARGET_NEON || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (const_int 0) (match_dup 1)))
 	      (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))])
--- /dev/null	2016-11-30 15:23:46.779473644 +0100
+++ gcc/testsuite/gcc.target/arm/pr77308-2.c	2016-11-30 17:05:21.021614711 +0100
@@ -0,0 +1,169 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -Wstack-usage=2500" } */
+
+/* This is a modified algorithm with 64bit cmp and neg at the Sigma-blocks.
+   It improves the test coverage of cmpdi and negdi2 patterns.
+   Unlike the original test case these insns can reach the reload pass,
+   which may result in large stack usage.  */
+
+#define SHA_LONG64 unsigned long long
+#define U64(C) C##ULL
+
+#define SHA_LBLOCK 16
+#define SHA512_CBLOCK (SHA_LBLOCK*8)
+
+typedef struct SHA512state_st {
+    SHA_LONG64 h[8];
+    SHA_LONG64 Nl, Nh;
+    union {
+        SHA_LONG64 d[SHA_LBLOCK];
+        unsigned char p[SHA512_CBLOCK];
+    } u;
+    unsigned int num, md_len;
+} SHA512_CTX;
+
+static const SHA_LONG64 K512[80] = {
+    U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd),
+    U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc),
+    U64(0x3956c25bf348b538), U64(0x59f111f1b605d019),
+    U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118),
+    U64(0xd807aa98a3030242), U64(0x12835b0145706fbe),
+    U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2),
+    U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1),
+    U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694),
+    U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3),
+    U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65),
+    U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483),
+    U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5),
+    U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210),
+    U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4),
+    U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725),
+    U64(0x06ca6351e003826f), U64(0x142929670a0e6e70),
+    U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926),
+    U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df),
+    U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8),
+    U64(0x81c2c92e47edaee6), U64(0x92722c851482353b),
+    U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001),
+    U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30),
+    U64(0xd192e819d6ef5218), U64(0xd69906245565a910),
+    U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8),
+    U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53),
+    U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8),
+    U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb),
+    U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3),
+    U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60),
+    U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec),
+    U64(0x90befffa23631e28), U64(0xa4506cebde82bde9),
+    U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b),
+    U64(0xca273eceea26619c), U64(0xd186b8c721c0c207),
+    U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178),
+    U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6),
+    U64(0x113f9804bef90dae), U64(0x1b710b35131c471b),
+    U64(0x28db77f523047d84), U64(0x32caab7b40c72493),
+    U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c),
+    U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a),
+    U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817)
+};
+
+#define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
+#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
+#define ROTR(x,s) (((x)>>s) | (x)<<(64-s))
+#define Sigma0(x) (ROTR((x),28) ^ ROTR((x),34) ^ (ROTR((x),39) == (x)) ? -(x) : (x))
+#define Sigma1(x) (ROTR((x),14) ^ ROTR(-(x),18) ^ ((long long)ROTR((x),41) < (long long)(x)) ? -(x) : (x))
+#define sigma0(x) (ROTR((x),1) ^ ROTR((x),8) ^ (((x)>>7) > (x)) ? -(x) : (x))
+#define sigma1(x) (ROTR((x),19) ^ ROTR((x),61) ^ ((long long)((x)>>6) < (long long)(x)) ? -(x) : (x))
+#define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z)))
+#define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
+
+#define ROUND_00_15(i,a,b,c,d,e,f,g,h) do {             \
+        T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i];      \
+        h = Sigma0(a) + Maj(a,b,c);                     \
+        d += T1;        h += T1;        } while (0)
+#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X) do {         \
+        s0 = X[(j+1)&0x0f];     s0 = sigma0(s0);        \
+        s1 = X[(j+14)&0x0f];    s1 = sigma1(s1);        \
+        T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f];    \
+        ROUND_00_15(i+j,a,b,c,d,e,f,g,h);       } while (0)
+void sha512_block_data_order(SHA512_CTX *ctx, const void *in,
+                             unsigned int num)
+{
+    const SHA_LONG64 *W = in;
+    SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1;
+    SHA_LONG64 X[16];
+    int i;
+
+    while (num--) {
+
+        a = ctx->h[0];
+        b = ctx->h[1];
+        c = ctx->h[2];
+        d = ctx->h[3];
+        e = ctx->h[4];
+        f = ctx->h[5];
+        g = ctx->h[6];
+        h = ctx->h[7];
+
+        T1 = X[0] = PULL64(W[0]);
+        ROUND_00_15(0, a, b, c, d, e, f, g, h);
+        T1 = X[1] = PULL64(W[1]);
+        ROUND_00_15(1, h, a, b, c, d, e, f, g);
+        T1 = X[2] = PULL64(W[2]);
+        ROUND_00_15(2, g, h, a, b, c, d, e, f);
+        T1 = X[3] = PULL64(W[3]);
+        ROUND_00_15(3, f, g, h, a, b, c, d, e);
+        T1 = X[4] = PULL64(W[4]);
+        ROUND_00_15(4, e, f, g, h, a, b, c, d);
+        T1 = X[5] = PULL64(W[5]);
+        ROUND_00_15(5, d, e, f, g, h, a, b, c);
+        T1 = X[6] = PULL64(W[6]);
+        ROUND_00_15(6, c, d, e, f, g, h, a, b);
+        T1 = X[7] = PULL64(W[7]);
+        ROUND_00_15(7, b, c, d, e, f, g, h, a);
+        T1 = X[8] = PULL64(W[8]);
+        ROUND_00_15(8, a, b, c, d, e, f, g, h);
+        T1 = X[9] = PULL64(W[9]);
+        ROUND_00_15(9, h, a, b, c, d, e, f, g);
+        T1 = X[10] = PULL64(W[10]);
+        ROUND_00_15(10, g, h, a, b, c, d, e, f);
+        T1 = X[11] = PULL64(W[11]);
+        ROUND_00_15(11, f, g, h, a, b, c, d, e);
+        T1 = X[12] = PULL64(W[12]);
+        ROUND_00_15(12, e, f, g, h, a, b, c, d);
+        T1 = X[13] = PULL64(W[13]);
+        ROUND_00_15(13, d, e, f, g, h, a, b, c);
+        T1 = X[14] = PULL64(W[14]);
+        ROUND_00_15(14, c, d, e, f, g, h, a, b);
+        T1 = X[15] = PULL64(W[15]);
+        ROUND_00_15(15, b, c, d, e, f, g, h, a);
+
+        for (i = 16; i < 80; i += 16) {
+            ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X);
+            ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X);
+        }
+
+        ctx->h[0] += a;
+        ctx->h[1] += b;
+        ctx->h[2] += c;
+        ctx->h[3] += d;
+        ctx->h[4] += e;
+        ctx->h[5] += f;
+        ctx->h[6] += g;
+        ctx->h[7] += h;
+
+        W += SHA_LBLOCK;
+    }
+}
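
For reference, the 64-bit compare-and-negate shape that the modified
Sigma/sigma macros introduce (and which exercises the cmpdi and negdi2
patterns) boils down to something like the following reduction.  This is
illustrative only, not part of the patch, and the function name is made
up:

unsigned long long
sigma_shape (unsigned long long x, unsigned long long y)
{
  /* A full DImode comparison selects between x and its DImode negation,
     so both a cmpdi and a negdi2 can reach the reload pass, as described
     in the testcase comment above.  */
  return (y < x) ? -x : x;
}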