Sorting functions by their sizes in bytes of assembly

Back in the post on defining your own for-loop macro, I hurried over something like a performance comparison of our newly built for-loop and two built-in for-loop macros in Common Lisp. In that, I used the ‘time macro which, though useful for rough estimates, produces wildly varying outputs based on the machine calling it and the other processes currently being run on said machine.

I was talking with a good friend of mine about miniKanren, and the conversation meandered towards the idea of how we’d want to sort procedurally generated functions by their performances. There are a few ways to procedurally generate many (even all) possible functions to satisfy a set of unit-tests. Once we have a gorillion functions that are all correct, we would then want to know which of those functions were the most performant. [1]

The ‘time macro prints out, among other things, how many processor-cycles passed during the evaluation of a body of code. The results are almost always different, even on the same machine and during the same coding session, for the same code body. A more objective, reproducible and portable interpretation of the performance of a function would be to inspect the assembly our compiler would compile that function into.

We can print out the assembly our compiler would produce when compiling a function in Common Lisp using the #’disassemble function. We pass a function or symbol bound to a function, to #’disassemble, and it prints to the *standard-output* stream an explanation and listing of the given function’s assembly code.

Here we disassemble the ‘+ function:

ELO> (disassemble '+)
; disassembly for +
; Size: 135 bytes. Origin: #x100060C598                       ; +
; 598:       840425F8FF1020   TEST AL, [#x2010FFF8]           ; safepoint
; 59F:       4D85E4           TEST R12, R12
; 5A2:       0F8473000000     JEQ L5
; 5A8:       498B36           MOV RSI, [R14]
; 5AB:       8D5EF1           LEA EBX, [RSI-15]
; 5AE:       F6C301           TEST BL, 1
; 5B1:       7510             JNE L0
; 5B3:       80FB0A           CMP BL, 10
; 5B6:       740B             JEQ L0
; 5B8:       F6C30F           TEST BL, 15
; 5BB:       755B             JNE L4
; 5BD:       807EF129         CMP BYTE PTR [RSI-15], 41
; 5C1:       7755             JNBE L4
; 5C3: L0:   BB02000000       MOV EBX, 2
; 5C8:       EB39             JMP L2
; 5CA:       660F1F440000     NOP
; 5D0: L1:   48895DF8         MOV [RBP-8], RBX
; 5D4:       4C8965F0         MOV [RBP-16], R12
; 5D8:       4C8975E8         MOV [RBP-24], R14
; 5DC:       488BFB           MOV RDI, RBX
; 5DF:       48F7DF           NEG RDI
; 5E2:       498B3CBE         MOV RDI, [R14+RDI*4]
; 5E6:       488BD6           MOV RDX, RSI
; 5E9:       FF1425B0000020   CALL QWORD PTR [#x200000B0]     ; GENERIC-+
; 5F0:       488BF2           MOV RSI, RDX
; 5F3:       4C8B75E8         MOV R14, [RBP-24]
; 5F7:       4C8B65F0         MOV R12, [RBP-16]
; 5FB:       488B5DF8         MOV RBX, [RBP-8]
; 5FF:       4883C302         ADD RBX, 2
; 603: L2:   840425F8FF1020   TEST AL, [#x2010FFF8]           ; safepoint
; 60A:       4C39E3           CMP RBX, R12
; 60D:       7CC1             JL L1
; 60F:       488BD6           MOV RDX, RSI
; 612: L3:   488BE5           MOV RSP, RBP
; 615:       F8               CLC
; 616:       5D               POP RBP
; 617:       C3               RET
; 618: L4:   CC57             INT3 87                         ; OBJECT-NOT-NUMBER-ERROR
; 61A:       18               BYTE #X18                       ; RSI
; 61B: L5:   31D2             XOR EDX, EDX
; 61D:       EBF3             JMP L3
NIL

And here we disassemble the #’disassemble function itself!

ELO> (disassemble #'disassemble)
; disassembly for DISASSEMBLE
; Size: 583 bytes. Origin: #x1000780ED6                       ; DISASSEMBLE
; 0ED6:       840425F8FF1020   TEST AL, [#x2010FFF8]          ; safepoint
; 0EDD:       48837DE000       CMP QWORD PTR [RBP-32], 0
; 0EE2:       0F8416020000     JEQ L13
; 0EE8:       488B45E8         MOV RAX, [RBP-24]
; 0EEC:       448D60FD         LEA R12D, [RAX-3]
; 0EF0:       41F6C40F         TEST R12B, 15
; 0EF4:       0F84FA010000     JEQ L12
; 0EFA:       488B45E8         MOV RAX, [RBP-24]
; 0EFE:       448D60F5         LEA R12D, [RAX-11]
; 0F02:       41F6C40F         TEST R12B, 15
; 0F06:       750A             JNE L0
; 0F08:       8078F539         CMP BYTE PTR [RAX-11], 57
; 0F0C:       0F8479010000     JEQ L8
; 0F12: L0:   48817DE817001120 CMP QWORD PTR [RBP-24], #x20110017  ; NIL
; 0F1A:       0F8563010000     JNE L7
; 0F20: L1:   4C8B75E8         MOV R14, [RBP-24]
; 0F24: L2:   4C8975F8         MOV [RBP-8], R14
; 0F28:       488B45F0         MOV RAX, [RBP-16]
; 0F2C:       448D60F1         LEA R12D, [RAX-15]
; 0F30:       41F6C40F         TEST R12B, 15
; 0F34:       7558             JNE L4
; 0F36:       8078F12D         CMP BYTE PTR [RAX-15], 45
; 0F3A:       7552             JNE L4
; 0F3C:       4C8D6424F0       LEA R12, [RSP-16]
; 0F41:       4883EC28         SUB RSP, 40
; 0F45:       488B55F0         MOV RDX, [RBP-16]
; 0F49:       488B3D50FDFFFF   MOV RDI, [RIP-688]             ; :STREAM
; 0F50:       488B75F8         MOV RSI, [RBP-8]
; 0F54:       488B054DFDFFFF   MOV RAX, [RIP-691]             ; :USE-LABELS
; 0F5B:       49894424F0       MOV [R12-16], RAX
; 0F60:       488B45D8         MOV RAX, [RBP-40]
; 0F64:       49894424E8       MOV [R12-24], RAX
; 0F69:       488B0560FDFFFF   MOV RAX, [RIP-672]             ; #<SB-KERNEL:FDEFN SB-DISASSEM:DISASSEMBLE-CODE-COMPONENT>
; 0F70:       B90A000000       MOV ECX, 10
; 0F75:       49892C24         MOV [R12], RBP
; 0F79:       498BEC           MOV RBP, R12
; 0F7C:       FF5009           CALL QWORD PTR [RAX+9]
; 0F7F:       480F42E3         CMOVB RSP, RBX
; 0F83: L3:   BA17001120       MOV EDX, #x20110017            ; NIL
; 0F88:       488BE5           MOV RSP, RBP
; 0F8B:       F8               CLC
; 0F8C:       5D               POP RBP
; 0F8D:       C3               RET
; 0F8E: L4:   4883EC10         SUB RSP, 16
; 0F92:       488B55F0         MOV RDX, [RBP-16]
; 0F96:       488B053BFDFFFF   MOV RAX, [RIP-709]             ; #<SB-KERNEL:FDEFN SB-DISASSEM::GET-COMPILED-FUNS>
; 0F9D:       B902000000       MOV ECX, 2
; 0FA2:       48892C24         MOV [RSP], RBP
; 0FA6:       488BEC           MOV RBP, RSP
; 0FA9:       FF5009           CALL QWORD PTR [RAX+9]
; 0FAC:       4C8BF2           MOV R14, RDX
; 0FAF:       E9B7000000       JMP L6
; 0FB4:       660F1F840000000000 NOP
; 0FBD:       0F1F00           NOP
; 0FC0: L5:   4981FE17001120   CMP R14, #x20110017            ; NIL
; 0FC7:       74BA             JEQ L3
; 0FC9:       4D8B7EF9         MOV R15, [R14-7]
; 0FCD:       498BD7           MOV RDX, R15
; 0FD0:       4C8975D0         MOV [RBP-48], R14
; 0FD4:       4C897DC8         MOV [RBP-56], R15
; 0FD8:       4883EC10         SUB RSP, 16
; 0FDC:       488B05FDFCFFFF   MOV RAX, [RIP-771]             ; #<SB-KERNEL:FDEFN SB-KERNEL:%FUN-NAME>
; 0FE3:       B902000000       MOV ECX, 2
; 0FE8:       48892C24         MOV [RSP], RBP
; 0FEC:       488BEC           MOV RBP, RSP
; 0FEF:       FF5009           CALL QWORD PTR [RAX+9]
; 0FF2:       488BF2           MOV RSI, RDX
; 0FF5:       4883EC10         SUB RSP, 16
; 0FF9:       488B55F8         MOV RDX, [RBP-8]
; 0FFD:       488B3DE4FCFFFF   MOV RDI, [RIP-796]             ; "~&; disassembly for ~S"
; 1004:       488B05E5FCFFFF   MOV RAX, [RIP-795]             ; #<SB-KERNEL:FDEFN FORMAT>
; 100B:       B906000000       MOV ECX, 6
; 1010:       48892C24         MOV [RSP], RBP
; 1014:       488BEC           MOV RBP, RSP
; 1017:       FF5009           CALL QWORD PTR [RAX+9]
; 101A:       4C8B7DC8         MOV R15, [RBP-56]
; 101E:       4C8D6424F0       LEA R12, [RSP-16]
; 1023:       4883EC28         SUB RSP, 40
; 1027:       498BD7           MOV RDX, R15
; 102A:       488B3D6FFCFFFF   MOV RDI, [RIP-913]             ; :STREAM
; 1031:       488B75F8         MOV RSI, [RBP-8]
; 1035:       488B056CFCFFFF   MOV RAX, [RIP-916]             ; :USE-LABELS
; 103C:       49894424F0       MOV [R12-16], RAX
; 1041:       488B45D8         MOV RAX, [RBP-40]
; 1045:       49894424E8       MOV [R12-24], RAX
; 104A:       488B05A7FCFFFF   MOV RAX, [RIP-857]             ; #<SB-KERNEL:FDEFN SB-DISASSEM:DISASSEMBLE-FUN>
; 1051:       B90A000000       MOV ECX, 10
; 1056:       49892C24         MOV [R12], RBP
; 105A:       498BEC           MOV RBP, R12
; 105D:       FF5009           CALL QWORD PTR [RAX+9]
; 1060:       4C8B75D0         MOV R14, [RBP-48]
; 1064:       4D8B6601         MOV R12, [R14+1]
; 1068:       4D8BF4           MOV R14, R12
; 106B: L6:   840425F8FF1020   TEST AL, [#x2010FFF8]          ; safepoint
; 1072:       458D66F9         LEA R12D, [R14-7]
; 1076:       41F6C40F         TEST R12B, 15
; 107A:       0F8440FFFFFF     JEQ L5
; 1080:       CC49             INT3 73                        ; OBJECT-NOT-LIST-ERROR
; 1082:       38               BYTE #X38                      ; R14
; 1083: L7:   488B45E8         MOV RAX, [RBP-24]
; 1087:       CC20             INT3 32                        ; OBJECT-NOT-TYPE-ERROR
; 1089:       00               BYTE #X00                      ; RAX
; 108A:       53               BYTE #X53                      ; '(OR STREAM
                                                              ;      NULL)
; 108B: L8:   488B45E8         MOV RAX, [RBP-24]
; 108F:       4C8B600D         MOV R12, [RAX+13]
; 1093: L9:   4D8B742425       MOV R14, [R12+37]
; 1098:       4981FE17001120   CMP R14, #x20110017            ; NIL
; 109F:       742C             JEQ L10
; 10A1:       4883EC10         SUB RSP, 16
; 10A5:       488B55E8         MOV RDX, [RBP-24]
; 10A9:       488B3D58FCFFFF   MOV RDI, [RIP-936]             ; #<SB-KERNEL:LAYOUT for STREAM {10000B5CD3}>
; 10B0:       488B0559FCFFFF   MOV RAX, [RIP-935]             ; #<SB-KERNEL:FDEFN SB-KERNEL:UPDATE-OBJECT-LAYOUT-OR-INVALID>
; 10B7:       B904000000       MOV ECX, 4
; 10BC:       48892C24         MOV [RSP], RBP
; 10C0:       488BEC           MOV RBP, RSP
; 10C3:       FF5009           CALL QWORD PTR [RAX+9]
; 10C6:       480F42E3         CMOVB RSP, RBX
; 10CA:       4C8BE2           MOV R12, RDX
; 10CD: L10:  4D8B74242D       MOV R14, [R12+45]
; 10D2:       4D8B7EF9         MOV R15, [R14-7]
; 10D6:       4983FF04         CMP R15, 4
; 10DA:       7E04             JLE L11
; 10DC:       4D8B6611         MOV R12, [R14+17]
; 10E0: L11:  4C3B2521FCFFFF   CMP R12, [RIP-991]             ; #<SB-KERNEL:LAYOUT for STREAM {10000B5CD3}>
; 10E7:       0F8433FEFFFF     JEQ L1
; 10ED:       E920FEFFFF       JMP L0
; 10F2:       EBEC             JMP L11
; 10F4: L12:  488B45E8         MOV RAX, [RBP-24]
; 10F8:       4C8B6005         MOV R12, [RAX+5]
; 10FC:       EB95             JMP L9
; 10FE: L13:  4D8BB568020000   MOV R14, [R13+616]             ; tls: *STANDARD-OUTPUT*
; 1105:       4C8B250CFCFFFF   MOV R12, [RIP-1012]            ; '*STANDARD-OUTPUT*
; 110C:       4183FE61         CMP R14D, 97
; 1110:       4D0F44742401     CMOVEQ R14, [R12+1]
; 1116:       E909FEFFFF       JMP L2
; 111B:       CC10             INT3 16                        ; Invalid argument count trap
NIL

We see that #’+ and #’disassemble are compiled into 135 and 583 bytes of assembly, respectively, using our implementation of Common Lisp.

As long as we use the same implementation, the same version, and we haven’t modified that implementation, we should get the same assembly code back as if it were a pure function. The outputs are the same regardless of the environment or time the call to #’disassemble is made, if the inputs are the same.

Size in bytes of assembly is… a more… mmmm… it’s a less mercurial measure of a function’s performance than if we were to simply compare processor-cycles measured by the ‘time macro.

Now. We have a list of functions, potentially a gorillion of them and all of them procedurally generated, and they all pass the same unit tests. We want to grab just the most performant of these functions, and we’re going to compare performance by comparing the size in bytes of the disassembled functions.

#’disassemble prints out a bunch of things to the *standard-ouput* stream, but it just returns nil. We need to capture the data being sent to *standard-output*, and parse it as a string to grab the useful part: “Size: 583 bytes“.

To do this, my friend wrote a function to hijack the *standard-output* stream’s incoming data and feed it into a string, while disassembling the given function:

(defun disassemble-to-string (func)
  (let ((s (make-string-output-stream)))
    (disassemble func :stream s)
    (get-output-stream-string s)))

#’disassemble-to-string returns what #’disassemble prints, as a string, instead of nil. Said friend wrote a VERY simple function which then would take a list of functions and sort them by the length of the string returned. That’s good and all, except the string has comments, and it contains the name of the function! The length of the name of a function can now influence its supposed performance, under that quick sorting algorithm, so I hacked together a generic method to grab just the size-in-bytes part of the disassembly-string, and to parse it as an integer:

(defmethod func-size-in-bytes (func)
  (let ((str (disassemble-to-string func)))
    (parse-integer
     (subseq str (+ 6 (search "Size: " str)) (search " bytes" str)))))

===================
ELO> (func-size-in-bytes '+)
135

ELO> (func-size-in-bytes 'disassemble)
583

ELO> (func-size-in-bytes 'func-size-in-bytes)
478

Now we just get the actual number of bytes in assembly a function is made of.[2]

Note, though, that I defined #’func-size-in-bytes not as a normal function with ‘defun, no, but rather as a generic method using ‘defmethod. Why? Well, we want to pass a list of functions and get back a list of numbers-of-bytes. I stumbled upon something I find fun, a while back, and I wanted an excuse to show it to you all here.

We have a function, it accepts an object. We want to also be able to call that function on a list of objects, and get back a list of results, as if we had mapped over that list with that function. You can do this by defining a new function and having it, obviously, map over the given list of objects with our original function. That means defining a new function with a new name, though, and I have the memory of a tropical fruit. Maybe even less. I want to just remember one function name, and I don’t want to have to modify my existing function to check if it received a list or not, and to then branch itself based off the listiness of the data given to it.

Enter generic functions.

‘defmethod allows us to define a generic function with the same name, but that dispatches itself based off the data passed to it when called.

Watch:

(defmethod func-size-in-bytes (func)
  (let ((str (disassemble-to-string func)))
    (parse-integer
     (subseq str (+ 6 (search "Size: " str)) (search " bytes" str)))))

(defmethod func-size-in-bytes ((func list))
  (mapcar #'func-size-in-bytes func))

We have our original #’func-size-in-bytes, which accepts anything as the arg ‘func. We then have a new, more specific #’func-size-in-bytes which is only called when the given func is of type “list”. Common Lisp and it’s object system, CLOS, dispatches from most-specific to least-specific when calling generic functions. What all this means is that if I pass a list to #’func-size-in-bytes, the new function will be called, but if I pass something that isn’t a list, the first, original function will be called.

And what does the new #’func-size-in-bytes do? It maps over the given list by calling itself over each element in that list! We can now pass just one function to #’func-size-in-bytes, or we can pass a list of functions, and we get back nice yummy happy good-good results either way!

ELO> (func-size-in-bytes '+)
135

ELO> (func-size-in-bytes (list '+ '- '* '/))
(135 102 138 142)

But wait, there’s more!

Because our generic function is calling itself based on whether the passed arg is a list or not, this means that we can not only pass individual objects, not only can we pass lists of objects, but we can pass entire TREES of objects and get back trees of appropriate results back! We’ve implemented tree-walking by just making a generic function call itself when it receives a list!

ELO> (func-size-in-bytes (list '+ (list '- (list '* '/) (list 'disassemble) (list 'func-size-in-bytes (list 'mapcar 'parse-integer))) 'time))

(135 (102 (138 142) (583) (478 (145 1602))) 311)

We pass a tree of functions or symbols bound to functions, and we receive back a tree with identical structure where each func/name has been replaced by the result of calling our func on an individual object, without having modified our original function at all!

This is the kind of thing I get excited about, if you’ve ever wondered.

Moving on; we can get a list of size-in-bytes, now we just need to sort it. That part’s dead easy so here:

(defun sort-by-size-in-bytes (funcs)
  (sort funcs (lambda (x y)
                (< (func-size-in-bytes x)
                   (func-size-in-bytes y)))))

And now we can call #’sort-by-size-in-bytes, passing a list of func/names, and getting back a list of the given functions sorted by our notion of performance:

ELO> (func-size-in-bytes (list '+ '- '/ '*))
(135 102 142 138)

ELO> (sort-by-size-in-bytes (list '+ '- '/ '*))
(- + * /)

And there we have it. We now know that, of the four elementary arithmetic functions in SteelBank Common Lisp, ‘- is the smallest in assembly, followed by ‘+, ‘* and finally with ‘/.

Now, finally, let’s compare a function using our “loop-for” with functions using the built-in “dotimes” and “loop” macros:

ELO> (sort-by-size-in-bytes (list
                              (lambda () (loop-for (i 0) (> i 300000000) (incf i) (sqrt i)))
                              (lambda () (dotimes (i 300000000) (sqrt i)))
                              (lambda () (loop for i below 300000000 do (sqrt i)))))

(#<FUNCTION (LAMBDA ()) {10030523AB}> 
 #<FUNCTION (LAMBDA ()) {100305246B}>
 #<FUNCTION (LAMBDA ()) {10030522EB}>)

Behold! We have sorted the three loops by size in assembly! This sorted list’s great ‘n all but I don’t actually know which of those functions corresponds to which macro, so, let’s just get an unsorted list for our own sake:

ELO> (func-size-in-bytes (list
                          (lambda () (loop-for (i 0) (> i 300000000) (incf i) (sqrt i)))
                          (lambda () (dotimes (i 300000000) (sqrt i)))
                          (lambda () (loop for i below 300000000 do (sqrt i)))))
(63 61 61)

Results:

  • ‘loop-for = 63 bytes of assembly
  • ‘dotimes = 61 bytes of assembly
  • ‘loop = 61 bytes of assembly

Conclusion: If we define performance as the size of a function in assembly, then our ‘loop-for is less performant than the standard looping forms.

‘dotimes & ‘loop both compile down to the same assembly, and ‘loop-for compiles to the same except for two extra bytes: [CMP RAX, 600000000] and [JNLE L2].

[1] Superoptimization is, essentially, the practice of finding the smallest amount of assembly required to specify a function that satisfies all given tests. The usual way of superoptimizing a function is to generate the smallest amount of assembly possible, run it against the tests, and if it doesn’t pass them all then we generate the next smallest amount of assembly possible and repeat. That relational languages such as miniKanren explore the latent space of code, the whole space of code that satisfies a given set of tests, means we can (hypothetically) generate every possible function that satisfies all given tests. If we had to sort that list of functions by their assembly-size as we do here, then the function(s) with the smallest assembly-size in that list will be the superoptimized, canonical representation(s) of that function.[1a][1b] It’s a different approach to superoptimization that, if we have a cutoff to make the returned list of possible functions to satisfy our tests finite after proving that we’ve explored the entire space of functions smaller than all remaining functions, could be fun to investigate. But miniKanren and the use of various heuristics and NN’s to search the latent-space of code is the very edge of the field as of writing, in my opinion, so don’t expect anything more insightful on this website than you’d find by listening to WIll Byrd and friends, directly.

[1a] I’m no mathematician so I have no idea if there could be two canonical representations of something, hence the (s)’s.

[1b] At least it’d be superoptimised in terms of our compiler. Just because our compiler compiles a function into a given amount of assembly, doesn’t mean that it’s the smallest amount of assembly needed to represent that function. It could, however, be the smallest amount of code in the host-language needed to represent that function.

[2] #’func-size-in-bytes reads the string output by SBCL 2.0.0, and as such isn’t portable. The output of #’disassemble isn’t standardised, so #’func-size-in-bytes wouldn’t work in, say, cLisp.

Leave a Reply

Your email address will not be published. Required fields are marked *