Efficiently compute two dissimilar numbers in arm neon

I have an array of 16 integers and I'd like to find pair of ints from this array that have max dissimilarity between each other.
dissimilarity could be computed with this (pseudo) code:

int diss(uint32_t x, uint32_t y) { // it could do square for each byte of the number instead. return abs(((x >> 24) & 0xFF) - ((y >> 24) & 0xFF)) + abs(((x >> 16) & 0xFF) - ((y >> 16) & 0xFF)) + abs(((x >> 8) & 0xFF) - ((y >> 8) & 0xFF)) + abs(((x >> 0) & 0xFF) - ((y >> 0) & 0xFF)); } void findDissimilar(uint32_t buf[16], uint32_t& x, uint32_t& y) { int maxDiss = 0; for (int i=0; i<16; ++i) { for (int j=0; j<16; ++j) { int d = diss(buf[i], bud[j]); if (d > maxDiss) { maxDiss = d; x = buf[i]; y = buf[j]; } } } }

On input buf is already in neon registers if that matters. On output I should get two ints (in neon reg perhaps it's better).
How can I do that efficiently in arm neon, what approaches should I try? Just to clarify, the point of the question is about optimizing findDissimilar.

buf

findDissimilar

So you're doing 4-byte SAD (sum of absolute differences), and looking for the pair that has the maximum SAD?
– Peter Cordes
Jul 3 at 0:11

Yes, something like that. I think I've seen that somewhere in ffmpeg or x264, I'll try to grep to see if they have something similar.
– Pavel
Jul 3 at 0:12

Yes, x264 should have some 4x4 SAD motion search, including an exhaustive one. But it will be looking at different colour planes separately, for blocks of 4x4 pixels. So the motion-search will check every byte offset, not just in steps of 4 bytes. But you might get an idea of some useful instructions for byte-wise SAD with NEON, if you can separate that out from shuffling / unaligned loads.
– Peter Cordes
Jul 3 at 0:16

x264 has multiple search patterns to look for a good match quickly, not necessarily the minimum SAD: dia (diagonal), hex (hexagon), umh (uneven multi-hexagon), and esa (exhaustive, but not a brute-force check of every alignment). tesa finds the minimum-cost DCT-transformed residual, instead of the minimum SAD. Also note that x264 will have 8x8 and 16x16 SADs for motion-searching on larger blocks.
– Peter Cordes
Jul 3 at 0:17

dia

hex

umh

esa

tesa

How are you using the result? Do you want a SIMD vector of results for multiple 16-element buffers, or is this only ever used on one 16-byte vector at once? (Not sure if that makes a difference, but it easily might. And there might be a throughput vs. latency tradeoff.) And just to confirm, you only want the numbers themselves, not the position?
– Peter Cordes
Jul 3 at 0:18

1 Answer
1

diss is trivial to compute in neon, it could possibly implemented this way (untested code):

diss

uint32x4_t diss_x4(uint32x4_t x4, uint32x4_t y4) { uint8x16_t diff = vabdq_u8(vreinterpretq_u8_u32(x4), vreinterpretq_u8_u32(x4)); uint16x8_t m0 = vmull_u8(vget_low_u8(diff), vget_low_u8(diff)); uint16x8_t m1 = vmull_u8(vget_high_u8(diff), vget_high_u8(diff)); uint16x4_t s0 = vpadd_u16(vget_low_u8(m0), vget_high_u8(m0)); uint16x4_t s1 = vpadd_u16(vget_low_u8(m1), vget_high_u8(m1)); uint16x4_t sx = vpadd_u16(s0, s1); return vmovl_u16(sx); }

but not so trivial to findDissimilar. I think the best approach is to do the following:
- load all 16 ints in 4 q-registers {q0, q1, q2, q3}.
- first q0 reg contains { buf[0], buf[1], buf[2], buf[3] }
- then I can loop 15 times and extract from the 4 input q regs into a qext values such as vextq_u32(q0, q1, 1) for the first iteration and so on.
- compute minimums between q0 and qext.

findDissimilar

{q0, q1, q2, q3}

{ buf[0], buf[1], buf[2], buf[3] }

then the same process should be done for q1, q2, q3.

Perhaps if I deinterleave {q0, q1, q2, q3} by byte it could be optimized even better.

{q0, q1, q2, q3}

So NEON has a SAD instruction? You should at least say what it is if you want to post this as an answer. You can't just do byte-element subtraction and take the signed absolute value, because that only works for absolute unsigned differences of 128 and smaller.
– Peter Cordes
Jul 3 at 22:15

@PeterCordes not sure exactly what you mean. The dissimilarity calculation is not relevant here (I updated with sample code), the findDissimilarity is complex to do in neon and I described possible solution. Didn't try to implement it though
– Pavel
Jul 4 at 0:42

findDissimilarity

You said it was trivial, so I assumed there must be some direct hardware support. Anyway, yes vabdq_u8 gives you absolute differences; that's what I was wondering about. The rest of the function (2 multiplies and 3 adds) doesn't look very cheap, and obviously it took you some work to come up with it, so I'd hardly call it trivial.
– Peter Cordes
Jul 4 at 0:48

vabdq_u8

Would vabal_u8 (absolute diff and accumulate) be useful? It apparently accumulates horizontally into a uint16x8_t vector.
– Peter Cordes
Jul 4 at 0:53

vabal_u8

uint16x8_t

I wouldn't use NEON in the first place on ARMv7, or even on ARMv6. usad8 is the one powerful instruction that isn't available on ARMv8.
– Jake 'Alquimista' LEE
Jul 4 at 14:22

usad8

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Fjhtyj