Hacker News

Bitset Decoding on Apple’s A12(lemire.me)

83 points | posted by ingve a year ago | 24 comments
TickleSteve said a year ago:

"Apple A12 in my iPhone is limited to 2.5 GHz, so I multiply the result by 2.5 to get the number of cycles."

I can guarantee that unless he was going out his way to force it, the processor was not running at 2.5GHz... it would have been scaled way back as mobile processors are very aggressively tuned for power-saving.

Unless he can show how he managed to disable frequency scaling on the iPhone, the results are kind of meaningless.

If he wants the cycle count of the instructions, he doesn't have to run them; he needs the datasheet from the manufacturer (which, admittedly, may be difficult to source).

jcranmer said a year ago:

> If he wants to get the cycle count of the instructions, you don't have to run it, you need the datasheet from the manufacturer (which may be difficult to source admittedly).

A12 is an out-of-order superscalar core. Counting the exact cycle counts of a sequence of instructions for such architectures by hand is basically a fool's errand, especially if you start running into issues such as exhausting register ports or issue stations.

zeusk said a year ago:

I'm quite sure Apple has a tool for profiling iOS applications. ARM has a PMU (Performance Monitoring Unit) as part of the architecture, for exactly this sort of measurement.

saagarjha said a year ago:

Instruments probably has this available; I know they do for Intel processors.

Edit: Just checked, and my A10 shows a small but standard set of PMCs, many of which show up here: https://opensource.apple.com/source/xnu/xnu-4903.221.2/osfmk...

TickleSteve said a year ago:

Fair point. The point I was trying to make is that it's an overly simplistic way of measuring performance.

astrange said a year ago:

As long as the phone is plugged in, it doesn't particularly need to power save. You can also set the thread priority higher.

mort96 said a year ago:

Doesn't mean it will run at its max speed. Thermals are a huge issue in phone CPUs, and that's even worse when you add the heat generated by charging the battery. Maybe it's always running at 2.5GHz when plugged in, I don't know, but it seems extremely likely that it'll try to use less power just to reduce heat output.

londons_explore said a year ago:

Indeed, but it will hit 2.5Ghz occasionally, or it wouldn't be worth designing the CPU to go that fast.

My guess is it's designed for fast interactive use, so as long as you do your work in bursts of 300ms or less, you will probably get full speed.

Beyond that, and you'll be power or thermally throttled.

mort96 said a year ago:

You have absolutely nothing to base that guess on though? What if Apple made the CPU to usually peak at, say, 2GHz and only go up to 2.5 under very special circumstances? What if the author has a, say, iPhone Xs, and the Xs's CPU is limited to never exceed 2.3GHz while the same CPU in the Xs Max reaches 2.5GHz? Or any number of other factors which I couldn't even start guessing at because the CPUs in smartphones and the software and firmware which controls them is an extremely complicated matter which I'd bet neither of us has any first-hand experience with.

londons_explore said a year ago:

I have experience with production CPU governing in some Android devices.

It is more art than science - you want to get the best possible user experience while not using too much power and not exceeding thermal and power limits.

Typically, governors are written as linux kernel modules. They get to choose the clock speed of groups of cores, and which cores are enabled. Some are in trustzone applications or android userland apps though.

Some hardware has extra constraints. For example, the autofocus mechanism on one device I worked on used a lot of power, so while the autofocus was initializing, the CPU had to be scaled down to the slowest speed for a few milliseconds, and the radio transmit disabled. The code ends up a mess because you need to pipe that info from all the other device drivers with no proper interface. A large number of devices do similar shenanigans when the battery is <50% because the battery voltage drops at high loads, and if it's already quite low, then you'll trip a brownout detector and reboot the device.

Thermal constraints are the main one - typically represented as some exponential curves. Ie. you may be at 2.0Ghz for 100ms, 1.8Ghz for 1 second, 1.6Ghz for 5 seconds, 1.4 Ghz for 30 seconds, etc. Depending on the history of the currently running process, the governor would try to decide if it was best to go to 2.0Ghz because this was going to be a quick task, or to only start at 1.6Ghz because it was going to be a long running thing.

Hopefully that's some insight! Check out the XDA forums, where some devices have a community of people into modding governors for different use cases.

wolf550e said a year ago:

I'm guessing at the numbers, but the phone body can dissipate something like 1W, while the chip can output 5W of heat when fully utilized at max clock. The difference accumulates until the chip shuts down to protect itself from melting. You need active cooling to run it at max clock, and I don't know if the firmware will let you even if the temp sensor says it's fine.

max76 said a year ago:

This comment makes me want a liquid cooled and overclocked ipad pro.

gok said a year ago:

So you think it uses even fewer cycles then?

nwallin said a year ago:

> On recent x64 processors, we find that it is beneficial to decode in bulk: e.g., assume that there are at least 4 set bits, decode them without checking whether there are four of them. The code might look as follow: [...]

I find this assertion surprising, and so I decided to test it. I wrote a naive algorithm, and found the generated assembly even more surprising. https://godbolt.org/z/ah3msM

    int bits(unsigned long n, char* b) {
        int retval = 0;
        while(n) {
            *b++ = __builtin_ctzll(n);
            n &= n-1;
            retval++;
        }
        return retval;
    }
First of all, the business with incrementing retval is discarded entirely. It simply performs popcnt on n and returns that. Second, it performs the calculation in two sections. First it does the naive, simple loop until the remaining bits are a multiple of 8. Then it unrolls the loop into groups of 8. It unrolled the loop even with just -O2 set. In essence, the compiler does precisely what his code does, but with groups of 8 instead of 4.

Clang seems to have changed behavior between versions 6 and 7; version 6 and below use the naive loop, as does GCC. Clang 7 and 8 unroll it. I'm curious if this had any effect on his results; his results for the A12 presumably are using Clang, while his linux box presumably uses GCC.

There's not really a whole lot of room for improvement in that unless you go whole hog and unroll the entire thing and table jump into it: https://godbolt.org/z/qWzfHl

    int bits(unsigned long n, char* b) {
        const int retval = __builtin_popcountll(n);
        switch(retval) {
            default: for(char i = 64; i--;) b[i] = i; return 64;
            case 63: b[62] = __builtin_ctzll(n); n &= n-1;
            case 62: b[61] = __builtin_ctzll(n); n &= n-1;
            case 61: b[60] = __builtin_ctzll(n); n &= n-1;
            // etc
            case 2: b[1] = __builtin_ctzll(n); n &= n-1;
            case 1: b[0] = __builtin_ctzll(n);
            case 0: return retval;
        }
    }
Which has its own macabre beauty to it.

Interesting distinction between Intel and AMD CPUs: the blsr instruction has a latency of 1 on Intel and 2 on AMD, while tzcnt has a latency of 3 on Intel and 2 on AMD. But the loop is limited by the blsr dependency chain, so Intel processes 1 bit per cycle while AMD can only manage 1 bit per two cycles.

nwallin said a year ago:

I looked into it a little more, and coded up some benchmarks, and it turns out I was wrong. His solution is like this:

    int bits(unsigned long n, char* b) {
        int retval = __builtin_popcountll(n);
        while(n) {
            b[0] = __builtin_ctzll(n); n &= n-1;
            b[1] = __builtin_ctzll(n); n &= n-1;
            b[2] = __builtin_ctzll(n); n &= n-1;
            b[3] = __builtin_ctzll(n); n &= n-1;
            b += 4;
        }
        return retval;
    }
Unrolled four times, this is faster than the naive code unrolled by the compiler, but slower than the jump table. Unrolled eight times, that is ~10% faster than the jump table, which really surprised me. How can a loop be faster than non-branching code? The crucial difference is that his version begins filling the pipeline immediately. The compiler unrolled version needs to wait on the popcnt before it can begin working, and the jump table needs to wait on both the popcnt and the subsequent jump. Despite the fact that the jump table is 100% branchless after the initial jump, its speed advantage down the stretch can't make up for the author's early start, nor can it make up for the author's eventual branch misprediction and subsequent pipeline stall. Very strange. Cool, but strange.

I think the lesson here is to not jump to conclusions and post your faulty misconceptions about performance on the internet before you benchmark it.

kazinator said a year ago:

What sort of application shows a noticeable speedup if we manage to make bitset decoding, say, ten times faster?

pierrebai said a year ago:

In this case, Mr Lemire previously designed, with collaborators, a fast JSON parser (GB/s) that requires exactly this kind of bitset decoding. You can find the papers on his blog.

booblik said a year ago:

Prof Lemire

astrange said a year ago:

Video playback has been moved to dedicated hardware, but it and other kinds of decompression would help. I actually didn't know ARM didn't have popcount.

londons_explore said a year ago:

A good chunk of video playback is still in software.

As well as application developers using silly formats (Hello GIF!), new video formats tend to start getting use before hardware is ready, and some DRM schemes can't or won't use hardware decoding.

subprotocol said a year ago:

It's used everywhere in analytics applications, and it's particularly noticeable at scale. For a specific example: we use roaring bitmaps (also from Lemire) in an inverted index to make count queries instant (depending on the query and encoding, top-n queries process 20B events per second per core in our application).

davidgould said a year ago:

Could you elaborate on this a little? I'm aware of roaring bitmaps, but not understanding the problem and solution you are describing. Thanks!

tveita said a year ago:

Could a lookup table be faster even for fairly sparse bits? I'm imagining something like

  uint8_t *result;
  uint8_t numpositions[256];
  uint64_t positions0[256];
  uint64_t positions1[256];

  *(uint64_t*) result = positions0[byte0];
  result += numpositions[byte0];
  *(uint64_t*) result = positions1[byte1];
  result += numpositions[byte1];
Or probably accumulating results in a register until you get 8.

mcbain said a year ago:

Even if those arrays are entirely in cache they require more cycles per lookup than the computation from the article.

That's been true of lookup tables for a long while now.