Sorry to wimp out, but I'm struggling with my unfamiliarity with the instruction set and assembler syntax. I'm trying to get the fastest 32x32 multiply into a 64-bit result that I can. The version here is around 25% faster than the C version, but I can't get it to work. If anyone can help I'd be dead chuffed. I'm sure it's just something dumb I'm doing, but I have failed all day to find it and life is too short :)
The C version compiles to 6 multiplies, but the use case I want (32 and 32 in, 64 out) only needs 4.
The C is just this :
Code:
inline int64_t mul32x32(const int32_t a, const int32_t b) { return a * (int64_t) b; }
Code:
mul32x32_64 :
.global mul32x32_64
// We want
// signed ahi*bhi
// signed ahi * blo
// signed alo * bhi
// unsigned alo * blo => but it just gives us a 32-bit pattern, so unsigned doesn't matter
// here r0 = a r1 = b
    uxth r2,r1      // alo => r2 = a & 0xffff
    lsrs r1,r1,#16  // ahi => r0 = a >> 16
    lsrs r3,r0,#16  // bhi => r3 = b >> 16
    uxth r0,r0      // blo => r1 = b & 0xffff
    push {r4}
    movs r4,r0      // r4 = bhi why
    muls r0,r2      // lolo => r0 = blo * alo - that's why, we corrupt r0
    muls r4,r1      // x1 => r4 = ahi * blo
    muls r1,r3      // hihi => r1 = ahi * bhi
    muls r3,r2      // x2 => r3 = bhi * alo
    lsls r2,r4,#16  // r2 = (x1 << 16)
    lsrs r4,r4,#16  // r4 = x1 >> 16
    adds r0,r4,#0
    adcs r1,r2
    pop {r4}
    lsls r2,r3,#16  // r2 = x2 << 16
    lsrs r3,r3,#16  // r3 = x2 >> 16
    adds r0,r3,#0
    adcs r1,r2
    bx lr

Statistics: Posted by omenie — Thu Sep 11, 2025 4:50 pm — Replies 3 — Views 262
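For anyone wanting something to test an assembler version against, here is a minimal C sketch of how the four partial products named in the assembler comments (hihi, x1, x2, lolo) can be combined into a 64-bit result one 32-bit word at a time. It is not the poster's routine: the name `mul32x32_ref` and the variable names are made up for illustration, and it assumes that `>>` on a negative `int32_t` is an arithmetic shift (implementation-defined in C, but what GCC and Clang do).

```c
#include <stdint.h>

/* Hypothetical reference model, not the routine from the post.
 * Signed 32x32 -> 64 multiply from four 16x16 partial products.
 * Assumption: >> on a negative int32_t is an arithmetic shift. */
int64_t mul32x32_ref(int32_t a, int32_t b)
{
    int32_t  ahi = a >> 16;                 /* signed high halves   */
    int32_t  bhi = b >> 16;
    uint32_t alo = (uint32_t)a & 0xffffu;   /* unsigned low halves  */
    uint32_t blo = (uint32_t)b & 0xffffu;

    uint32_t lolo = alo * blo;              /* weight 2^0, unsigned */
    int32_t  x1   = ahi * (int32_t)blo;     /* weight 2^16, signed  */
    int32_t  x2   = bhi * (int32_t)alo;     /* weight 2^16, signed  */
    int32_t  hihi = ahi * bhi;              /* weight 2^32, signed  */

    uint32_t lo = lolo;                     /* low result word      */
    uint32_t hi = (uint32_t)hihi;           /* high result word     */

    /* Each cross term is worth x * 2^16: its low 16 bits land in the
     * top of the low word, and its sign-extended upper bits go into
     * the high word together with the carry out of the low word. */
    const int32_t cross[2] = { x1, x2 };
    for (int i = 0; i < 2; i++) {
        uint32_t add_lo = (uint32_t)cross[i] << 16;
        uint32_t add_hi = (uint32_t)(cross[i] >> 16);
        lo += add_lo;
        hi += add_hi + (lo < add_lo);       /* (lo < add_lo) is the carry */
    }
    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

Comparing this against `(int64_t)a * b` for inputs with mixed signs and set low halves gives a quick check of the word-level arithmetic before moving it into registers.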