Sunday, May 5, 2019

Profiling code in Smalltalk

The goal is to benchmark the accelerated Large Integer arithmetic, and track the costs by profiling. But let's first blog about the profiling facility itself.

This will be illustrated in Squeak, but applies to many Smalltalk dialects.

Squeak by default has a MessageTally class that is really easy to use, just evaluate code like this (my ongoing work on exponentiation):

MessageTally spyOn: [10000 timesRepeat: [103 raisedToInteger: 8000]].

How does it work? By setting a timer that will periodically interrupt the profiled Process, scanning the interrupted Process stack, and gather statistics about the calling sequence tree (it basically count the number of time that each method in the call tree has been interrupted). It's a gift from original Smalltalk-80, and is completely written at image side in pure Smalltalk, coexisting in the lively environment with the profiled code, and with other essential tools like Debugger. This is amazing enough and demonstrates the remarkable reflexion facilities of Smalltalk upon itself (introspection). I invite everyone to read that well written piece of code.

It also works remarkably well for analyzing pure Smalltalk code.

But for code which is dominated by long running primitives like Large Integer arithmetic, it does not work so well, because execution of primitives are atomic un-interruptible operations from the Smalltalk image point of view. The resut is that timer-based interruption is delayed and the profile cost attributed to the wrong message, the one following the primitive, and when profiling several processes, it's worse, see this mailing list message from Eliot Miranda, the main contributor of opensmalltalk-vm, or should I say the deus ex virtual machina.

Effectively, see the leaves at end of the report, it indicates that exponentiation time is dominated by addition! (no, we do not implement multiplication by repetitive additions).

 - 4631 tallies, 4644 msec.

**Tree**
--------------------------------
Process: (40) 58662: nil
--------------------------------
99.5% {4622ms} MessageTally class>>spyOn:reportOtherProcesses:
  99.5% {4622ms} MessageTally>>spyEvery:on:
    99.5% {4622ms} BlockClosure>>ensure:
      99.5% {4622ms} [] UndefinedObject>>DoIt
        99.5% {4622ms} SmallInteger(Integer)>>timesRepeat:
          99.5% {4620ms} [[]] UndefinedObject>>DoIt
            99.4% {4618ms} SmallInteger(Integer)>>raisedToInteger:
              99.4% {4618ms} SmallInteger(Integer)>>ternaryBinaryExponentationOf:
                96.8% {4495ms} LargePositiveInteger>>squared
                  |94.4% {4385ms} LargePositiveInteger>>squaredByFourth
                  |  |37.0% {1718ms} LargePositiveInteger>>*
                  |  |  |34.7% {1614ms} LargePositiveInteger(Integer)>>*
                  |  |  |  |34.7% {1614ms} LargePositiveInteger>>multiplyByInteger:
                  |  |  |  |  26.0% {1205ms} LargePositiveInteger>>digitMul22:
                  |  |  |  |    |17.3% {803ms} LargePositiveInteger>>+
                  |  |  |  |    |6.8% {316ms} LargePositiveInteger>>multiplyByInteger:
                  |  |  |  |  8.7% {405ms} LargeNegativeInteger(LargePositiveInteger)>>multiplyByInteger:
                  |  |  |  |    8.7% {405ms} LargeNegativeInteger(LargePositiveInteger)>>digitMul22:
                  |  |  |  |      5.7% {263ms} LargeNegativeInteger(LargePositiveInteger)>>+
                  |  |  |  |      2.2% {102ms} LargeNegativeInteger(LargePositiveInteger)>>multiplyByInteger:
                  |  |  |2.2% {104ms} primitives
                  |  |31.4% {1460ms} LargePositiveInteger>>squared
                  |  |  |24.9% {1155ms} LargePositiveInteger>>squaredByFourth
                  |  |  |  |9.1% {421ms} LargeNegativeInteger(Integer)>>bitShift:
                  |  |  |  |  |9.0% {419ms} primitives
                  |  |  |  |7.8% {363ms} LargePositiveInteger>>+
                  |  |  |  |4.5% {209ms} LargeNegativeInteger(LargePositiveInteger)>>*
                  |  |  |  |  |4.3% {201ms} primitives
                  |  |  |  |2.2% {100ms} LargePositiveInteger>>-
                  |  |  |  |  2.0% {94ms} primitives
                  |  |  |6.0% {278ms} LargePositiveInteger>>squaredByHalf
                  |  |  |  2.4% {110ms} LargePositiveInteger>>inplaceAddNonOverlapping:digitShiftBy:
                  |  |  |  1.7% {77ms} LargePositiveInteger>>squared
                  |  |  |    |1.6% {76ms} primitives
                  |  |  |  1.5% {69ms} LargePositiveInteger>>multiplyByInteger:
                  |  |14.6% {679ms} LargePositiveInteger(Integer)>>bitShift:
                  |  |8.6% {397ms} LargePositiveInteger>>+
                  |  |2.2% {104ms} LargePositiveInteger>>-
                  |2.2% {103ms} LargePositiveInteger>>squaredByHalf
                2.1% {98ms} primitives

**Leaves**
39.9% {1853ms} LargePositiveInteger>>+
23.6% {1098ms} LargeNegativeInteger(Integer)>>bitShift:
11.4% {529ms} LargePositiveInteger>>multiplyByInteger:
7.0% {325ms} LargeNegativeInteger(LargePositiveInteger)>>*
5.2% {243ms} LargePositiveInteger>>-
4.0% {186ms} LargePositiveInteger>>inplaceAddNonOverlapping:digitShiftBy:
3.6% {167ms} LargePositiveInteger>>squared
2.2% {102ms} SmallInteger(Integer)>>ternaryBinaryExponentationOf:


Since primitives are opaque at image side, it is necessary to introduce additional VM support for measuring such primitive cost. The simple idea for intercepting primitives is to instrument the VM at time of method activation. The timer-based interrupt is replaced by a polling of high resolution clock inside the VM. If the method is a primitive, and the sample interval is expired, then the currently active Process is recorded into profileProcess inst. var of the Interpreter. and the activated method (newMethod inst. var.) is recorded in the profileMethod inst. var., then the Semaphore of the profiler Process is signalled. This profileMethod and profileProcess can then be queried by the profiler thru primitives. The polling is also performed at each interrupt check - that is at each and every potential Process interruption point (method activation and long jump byte code). Since the active Process is recorded in profileProcess, before eventual Process switch, this avoid attribution of profiling cost to the newly running Process.

This support already exists and is covered by a set of four primitives accessible from the image:
  • primitiveProfileSemaphore for setting the Semaphore to signal when a sample is ready (the profiler will wait on that semaphore).
  • primitiveProfileStart for starting/stopping the profiling and setting the number of high resolution clock ticks between two sample
  • primitiveProfilePrimitive for querying the profileMethod
  • primitiveProfileSample for querying the profileProcess
The generated C code for these primitives and other support routines is available at opensmalltalk-vm, but it's better to browse the corresponding VMMaker Smalltalk source code from a Squeak image, because most of the VM is written in Smalltalk. For the curious, browse implementors of checkForInterrupts and internalPrimitiveResponse, or above mentionned inst. var. references.

The rest of profiling magic happens at image side, like with MessageTally. The profiler is not included in trunk images, but easily available at squeaksource3, or via SqueakMap.

It is named AndreasSystemProfiler in memory of Andreas Raab which was one of the major developer and inspiring leader of Squeak and one of the authors of this piece of code among many others. This time it's a gift from Ron Teitelbaum which agreed to release under MIT license in memory of Andreas (such re-license request was one of his suggestions).

The use of AndreasSystemProfiler is equally simple:

AndreasSystemProfiler spyOn: [10000 timesRepeat: [103 raisedToInteger: 8000]].

The report obtained is below:

Reporting - 45,077 tallies, 5,101 msec.

**Tree**
... snip ...

[                              99.94 (5,098)  UndefinedObject DoIt
[                                99.94 (5,098)  AndreasSystemProfiler class spyOn:
[                                  99.94 (5,098)  BlockClosure ensure:
[                                    99.94 (5,098)  [] AndreasSystemProfiler class spyOn:
[                                      99.94 (5,098)  AndreasSystemProfiler spyOn:
[                                        99.94 (5,098)  BlockClosure ensure:
[                                          99.94 (5,098)  [] UndefinedObject DoIt
[                                            99.94 (5,098)  SmallInteger(Integer) timesRepeat:
[                                              99.94 (5,098)  [[]] UndefinedObject DoIt
[                                                99.94 (5,098)  SmallInteger(Integer) raisedToInteger:
[                                                  99.94 (5,098)  SmallInteger(Integer) ternaryBinaryExponentationOf:
[                                                    96.92 (4,944)  LargePositiveInteger squared
[                                                      |94.24 (4,807)  LargePositiveInteger squaredByFourth
[                                                      |  |40.45 (2,063)  LargePositiveInteger *
[                                                      |  |  |37.78 (1,927)  LargePositiveInteger(Integer) *
[                                                      |  |  |  |37.78 (1,927)  LargePositiveInteger multiplyByInteger:
[                                                      |  |  |  |  29.27 (1,493)  LargePositiveInteger digitMul22:
[                                                      |  |  |  |    |15.47 (789)  LargePositiveInteger +
[                                                      |  |  |  |    |  |13.62 (695)  Integer digitMultiply:neg:
[                                                      |  |  |  |    |10.74 (548)  LargePositiveInteger multiplyByInteger:
[                                                      |  |  |  |    |  |9.12 (465)  Integer digitMultiply:neg:
[                                                      |  |  |  |    |1.11 (57)  LargePositiveInteger inplaceAddNonOverlapping:digitShiftBy:
[                                                      |  |  |  |  8.5 (434)  LargeNegativeInteger(LargePositiveInteger) multiplyByInteger:
[                                                      |  |  |  |    8.5 (434)  LargeNegativeInteger(LargePositiveInteger) digitMul22:
[                                                      |  |  |  |      4.16 (212)  LargeNegativeInteger(LargePositiveInteger) multiplyByInteger:
[                                                      |  |  |  |        |4.11 (210)  Integer digitMultiply:neg:
[                                                      |  |  |  |      4.08 (208)  LargeNegativeInteger(LargePositiveInteger) +
[                                                      |  |  |  |        3.9 (199)  Integer digitMultiply:neg:
[                                                      |  |  |1.12 (57)  Integer digitMultiply:neg:
[                                                      |  |26.74 (1,364)  LargePositiveInteger squared
[                                                      |  |  |19.34 (987)  LargePositiveInteger squaredByFourth
[                                                      |  |  |  |6.94 (354)  LargePositiveInteger +
[                                                      |  |  |  |  |2.77 (142)  Integer digitMultiply:neg:
[                                                      |  |  |  |  |1.41 (72)  Integer digitAdd:
[                                                      |  |  |  |4.78 (244)  LargeNegativeInteger(LargePositiveInteger) *
[                                                      |  |  |  |  |4.01 (205)  Integer digitMultiply:neg:
[                                                      |  |  |  |4.76 (243)  LargePositiveInteger(Integer) bitShift:
[                                                      |  |  |  |  |2.46 (125)  Integer digitMultiply:neg:
[                                                      |  |  |  |  |1.32 (68)  Integer digitSubtract:
[                                                      |  |  |  |1.27 (65)  LargePositiveInteger -
[                                                      |  |  |  |1.26 (64)  LargePositiveInteger squared
[                                                      |  |  |6.99 (357)  LargePositiveInteger squaredByHalf
[                                                      |  |  |  3.68 (188)  LargePositiveInteger inplaceAddNonOverlapping:digitShiftBy:
[                                                      |  |  |    |3.42 (175)  Integer digitMultiply:neg:
[                                                      |  |  |  1.82 (93)  LargePositiveInteger squared
[                                                      |  |  |    |1.59 (81)  Integer digitMultiply:neg:
[                                                      |  |  |  1.2 (61)  LargePositiveInteger multiplyByInteger:
[                                                      |  |  |    1.14 (58)  Integer digitMultiply:neg:
[                                                      |  |14.42 (736)  LargePositiveInteger(Integer) bitShift:
[                                                      |  |  |10.68 (545)  Integer digitMultiply:neg:
[                                                      |  |  |2.36 (121)  Integer digitAdd:
[                                                      |  |7.4 (377)  LargePositiveInteger +
[                                                      |  |  |2.51 (128)  Integer digitMultiply:neg:
[                                                      |  |  |2.14 (109)  Integer bitShiftMagnitude:
[                                                      |  |3.9 (199)  LargePositiveInteger -
[                                                      |  |  1.67 (85)  Integer digitSubtract:
[                                                      |2.66 (136)  LargePositiveInteger squaredByHalf
[                                                      |  2.19 (112)  LargePositiveInteger multiplyByInteger:
[                                                      |    2.18 (111)  Integer digitMultiply:neg:
[                                                    2.43 (124)  Integer digitAdd:

**Leaves**
63.12 (3,220)  Integer digitMultiply:neg:
8.28 (423)  Integer digitAdd:
6.68 (341)  Integer digitSubtract:
6.61 (337)  Integer bitShiftMagnitude:



The tree is more detailed because sample interval has been reduced to about 100 µs, which is not possible with image-side timer-based polling. We don't get exactly 10 samples per millisecond (45 000 tallies for 5 000 ms), because long primitives execution time may cause some pause in polling, and drift the sampling interval I presume. Fortunately, divide and conquer split the time spent in primitives in tiny fragments, which is good for this profiling strategy.
The tree is still not perfect because the primitives sometimes get attached to the wrong caller (like + sending digitMultiply:neg: ???).
But the leaves are much better: we see that the VM is effectively spending most time in multiplication, even if time spent in addition/subtraction/shifting is not neglectable. That's more realistic!
 

No comments:

Post a Comment