The first thing is to have settable threshold parameters, and that's the object of Kernel-nice.1234. The settings can be changed manually through Preferences interface, but then it's tedious to tune by hand.
So the next thing is to have a strategy for auto-tuning. I have chosen this one:
- set the threshold for Karatsuba and Toom-3 multiplication Mul22 and Mul33 at very high values so as to have unaccelerated arithmetic
- gradually increase the Mul22 threshold starting at 16 bytes, and doubling until it beats reference un-accelerated performance
- once found, try smaller threshold by setting the next two bits (if threshold is 512, test for 320, 384 and 448)
- repeat the procedure for Mul33, starting at Mul22 threshold
I did the same for squaring with S1/2 and S1/4, but multiplication tuning must be performed first, because asymetrical squaring uses that.
For division, it's a bit different, since the recursive division did not procure an immediate gain at 1x threshold, I decided to engage digitDivSplit: at twice the threshold so as to benefit from at least 2 recursions, so I must start testing at twice the threshold and finally used series of byte length 2x, 3x and 4x threshold.
I have published this auto-tuning code in STEM repository in the LargeArithmeticBench package in its version -nice.11 at this time of writing.
To perform the tuning, doIt:
LargeArithmeticBench new autotuneAcceleratedArithmeticThresholds.
With this strategy, the tuning is relatively fast, about 30 seconds on my MBP. But we are not that patient nowadays, so I reported the progress in the Transcript - which is a very basic interface, but works.
The autotuned thresholds now are these ones on 64 bits Spur, but they somtimes vary a bit from 1 run to another:
best Mul22=448
best Mul33=1536
best S1/2=320
best S1/4=640
best D21=448
I also refactored benchReport to have constant ratios M(n,2n) and D(n,2n), and reduced the number of runs for large bit lengths. Performance is not much changed versus hardcoded thresholds, except for division in range 2kbits to 10kbits.
Beware, the D(10n,n) now tries the very large integers, and the primitive for digitDiv:neg: crashes the un-accelerated image. This does not happen with accelerated arithmetic, because we split in small chunks (if 224 bytes is small). That's a long time since I did not put my fingers on VM code, that's an occasion...
I believe that it's a tiny bit better to recompose a longer least significant chunk with a shorter most significant chunk, and that I marginally improved the performance by rounding up to next multiple of 4 digits, rather than down to with bitClear: 2r11, but I launched so many tests that this would need confirmation. I put those changes in Squeak inbox/Kernel-nice.1235. If someone wants to try, it's better to re-tune the thresholds for each code version.
LargeArithmeticBench new autotuneAcceleratedArithmeticThresholds; benchReport.
Here are the updated curves:
No comments:
Post a Comment