Channel: Raspberry Pi Forums

SDK • Anyone got assembly code for 32x32 => 64 that works?

Sorry to wimp out, but I'm struggling because I'm unfamiliar with the instruction set and assembler syntax. I'm trying to get the fastest 32x32 multiply into a 64-bit result that I can. The version here is around 25% faster than the C version, but I can't get it to work. If anyone can help I'd be dead chuffed. I'm sure it's just something dumb I'm doing, but I've spent all day failing to find it and life is too short :)

The C is just this :

Code:

#include <stdint.h>

inline int64_t mul32x32(const int32_t a, const int32_t b)
{
    return a * (int64_t)b;
}
but the above compiles to 6 multiplies and the use case I want - 32 and 32 in, 64 out - only needs 4.

Code:

mul32x32_64 :
.global mul32x32_64
// We want
// signed   ahi*bhi
// signed   ahi * blo
// signed   alo * bhi
// unsigned alo * blo => but it just gives us a 32-bit pattern, so unsigned doesn't matter
                             // here r0 = a r1 = b
    uxth    r2,r1            // alo => r2 = a & 0xffff
    lsrs    r1,r1,#16        // ahi => r0 = a >> 16
    lsrs    r3,r0,#16        // bhi => r3 = b >> 16
    uxth    r0,r0            // blo => r1 = b & 0xffff
    push    {r4}
    movs    r4,r0            // r4 = bhi why
    muls    r0,r2            // lolo => r0 = blo * alo - that's why, we corrupt r0
    muls    r4,r1            // x1   => r4 = ahi * blo
    muls    r1,r3            // hihi => r1 = ahi * bhi
    muls    r3,r2            // x2   => r3 = bhi * alo
    lsls    r2,r4,#16        // r2 = (x1 << 16)
    lsrs    r4,r4,#16        // r4 = x1 >> 16
    adds    r0,r4,#0
    adcs    r1,r2
    pop     {r4}
    lsls    r2,r3,#16        // r2 = x2 << 16
    lsrs    r3,r3,#16        // r3 = x2 >> 16
    adds    r0,r3,#0
    adcs    r1,r2
    bx      lr
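For checking the assembly against, here is a minimal C sketch of the four-multiply split that the comments above describe. It is only a reference model, not a fix for the assembly and not the poster's code. The name `mul32x32_ref` is hypothetical, and the sketch assumes a two's-complement target where `>>` on a negative `int32_t` is an arithmetic shift (true of GCC and Clang on Arm, but implementation-defined in C). Each cross term `x` contributes `x << 16` to the low word and the sign-extended `x >> 16`, plus the carry out of the low word, to the high word.

```c
#include <stdint.h>

/* Hypothetical reference model: signed 32x32 -> 64 using four 16x16 products. */
int64_t mul32x32_ref(int32_t a, int32_t b)
{
    /* Split each operand into a signed high half and an unsigned low half. */
    int32_t  ahi = a >> 16;                 /* assumes arithmetic right shift */
    uint32_t alo = (uint32_t)a & 0xffffu;
    int32_t  bhi = b >> 16;                 /* assumes arithmetic right shift */
    uint32_t blo = (uint32_t)b & 0xffffu;

    /* The four partial products; each one fits in 32 bits. */
    uint32_t lolo = alo * blo;              /* unsigned * unsigned */
    int32_t  x1   = ahi * (int32_t)blo;     /* signed   * unsigned */
    int32_t  x2   = (int32_t)alo * bhi;     /* unsigned * signed   */
    int32_t  hihi = ahi * bhi;              /* signed   * signed   */

    /* 64-bit accumulator held as two 32-bit words, like r1:r0. */
    uint32_t lo = lolo;
    uint32_t hi = (uint32_t)hihi;
    uint32_t t;

    /* Add x1 * 2^16: low word gets x1 << 16, high word gets x1 >> 16 + carry. */
    t   = (uint32_t)x1 << 16;
    lo += t;
    hi += (uint32_t)(x1 >> 16) + (lo < t);  /* (lo < t) is the carry out */

    /* Add x2 * 2^16 the same way. */
    t   = (uint32_t)x2 << 16;
    lo += t;
    hi += (uint32_t)(x2 >> 16) + (lo < t);

    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

Comparing `mul32x32_ref(a, b)` with `(int64_t)a * b` over a range of inputs, including negative values and `INT32_MIN`, gives a baseline that the assembly's `r1:r0` result can then be checked against.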

Statistics: Posted by omenie — Thu Sep 11, 2025 4:50 pm — Replies 3 — Views 262


