Back in the post on defining your own for-loop macro, I hurried over something like a performance comparison of our newly built for-loop and two built-in for-loop macros in Common Lisp. In that, I used the ‘time macro which, though useful for rough estimates, produces wildly varying outputs based on the machine calling it and the other processes currently being run on said machine.
I was talking with a good friend of mine about miniKanren, and the conversation meandered towards the idea of how we’d want to sort procedurally generated functions by their performances. There are a few ways to procedurally generate many (even all) possible functions to satisfy a set of unit-tests. Once we have a gorillion functions that are all correct, we would then want to know which of those functions were the most performant. [1]
The ‘time macro prints out, among other things, how many processor-cycles passed during the evaluation of a body of code. The results are almost always different, even on the same machine and during the same coding session, for the same code body. A more objective, reproducible and portable interpretation of the performance of a function would be to inspect the assembly our compiler would compile that function into.
We can print out the assembly our compiler would produce when compiling a function in Common Lisp using the #’disassemble function. We pass a function or symbol bound to a function, to #’disassemble, and it prints to the *standard-output* stream an explanation and listing of the given function’s assembly code.
Here we disassemble the ‘+ function:
ELO> (disassemble '+)
; disassembly for +
; Size: 135 bytes. Origin: #x100060C598 ; +
; 598: 840425F8FF1020 TEST AL, [#x2010FFF8] ; safepoint
; 59F: 4D85E4 TEST R12, R12
; 5A2: 0F8473000000 JEQ L5
; 5A8: 498B36 MOV RSI, [R14]
; 5AB: 8D5EF1 LEA EBX, [RSI-15]
; 5AE: F6C301 TEST BL, 1
; 5B1: 7510 JNE L0
; 5B3: 80FB0A CMP BL, 10
; 5B6: 740B JEQ L0
; 5B8: F6C30F TEST BL, 15
; 5BB: 755B JNE L4
; 5BD: 807EF129 CMP BYTE PTR [RSI-15], 41
; 5C1: 7755 JNBE L4
; 5C3: L0: BB02000000 MOV EBX, 2
; 5C8: EB39 JMP L2
; 5CA: 660F1F440000 NOP
; 5D0: L1: 48895DF8 MOV [RBP-8], RBX
; 5D4: 4C8965F0 MOV [RBP-16], R12
; 5D8: 4C8975E8 MOV [RBP-24], R14
; 5DC: 488BFB MOV RDI, RBX
; 5DF: 48F7DF NEG RDI
; 5E2: 498B3CBE MOV RDI, [R14+RDI*4]
; 5E6: 488BD6 MOV RDX, RSI
; 5E9: FF1425B0000020 CALL QWORD PTR [#x200000B0] ; GENERIC-+
; 5F0: 488BF2 MOV RSI, RDX
; 5F3: 4C8B75E8 MOV R14, [RBP-24]
; 5F7: 4C8B65F0 MOV R12, [RBP-16]
; 5FB: 488B5DF8 MOV RBX, [RBP-8]
; 5FF: 4883C302 ADD RBX, 2
; 603: L2: 840425F8FF1020 TEST AL, [#x2010FFF8] ; safepoint
; 60A: 4C39E3 CMP RBX, R12
; 60D: 7CC1 JL L1
; 60F: 488BD6 MOV RDX, RSI
; 612: L3: 488BE5 MOV RSP, RBP
; 615: F8 CLC
; 616: 5D POP RBP
; 617: C3 RET
; 618: L4: CC57 INT3 87 ; OBJECT-NOT-NUMBER-ERROR
; 61A: 18 BYTE #X18 ; RSI
; 61B: L5: 31D2 XOR EDX, EDX
; 61D: EBF3 JMP L3
NIL
And here we disassemble the #’disassemble function itself!
ELO> (disassemble #'disassemble)
; disassembly for DISASSEMBLE
; Size: 583 bytes. Origin: #x1000780ED6 ; DISASSEMBLE
; 0ED6: 840425F8FF1020 TEST AL, [#x2010FFF8] ; safepoint
; 0EDD: 48837DE000 CMP QWORD PTR [RBP-32], 0
; 0EE2: 0F8416020000 JEQ L13
; 0EE8: 488B45E8 MOV RAX, [RBP-24]
; 0EEC: 448D60FD LEA R12D, [RAX-3]
; 0EF0: 41F6C40F TEST R12B, 15
; 0EF4: 0F84FA010000 JEQ L12
; 0EFA: 488B45E8 MOV RAX, [RBP-24]
; 0EFE: 448D60F5 LEA R12D, [RAX-11]
; 0F02: 41F6C40F TEST R12B, 15
; 0F06: 750A JNE L0
; 0F08: 8078F539 CMP BYTE PTR [RAX-11], 57
; 0F0C: 0F8479010000 JEQ L8
; 0F12: L0: 48817DE817001120 CMP QWORD PTR [RBP-24], #x20110017 ; NIL
; 0F1A: 0F8563010000 JNE L7
; 0F20: L1: 4C8B75E8 MOV R14, [RBP-24]
; 0F24: L2: 4C8975F8 MOV [RBP-8], R14
; 0F28: 488B45F0 MOV RAX, [RBP-16]
; 0F2C: 448D60F1 LEA R12D, [RAX-15]
; 0F30: 41F6C40F TEST R12B, 15
; 0F34: 7558 JNE L4
; 0F36: 8078F12D CMP BYTE PTR [RAX-15], 45
; 0F3A: 7552 JNE L4
; 0F3C: 4C8D6424F0 LEA R12, [RSP-16]
; 0F41: 4883EC28 SUB RSP, 40
; 0F45: 488B55F0 MOV RDX, [RBP-16]
; 0F49: 488B3D50FDFFFF MOV RDI, [RIP-688] ; :STREAM
; 0F50: 488B75F8 MOV RSI, [RBP-8]
; 0F54: 488B054DFDFFFF MOV RAX, [RIP-691] ; :USE-LABELS
; 0F5B: 49894424F0 MOV [R12-16], RAX
; 0F60: 488B45D8 MOV RAX, [RBP-40]
; 0F64: 49894424E8 MOV [R12-24], RAX
; 0F69: 488B0560FDFFFF MOV RAX, [RIP-672] ; #<SB-KERNEL:FDEFN SB-DISASSEM:DISASSEMBLE-CODE-COMPONENT>
; 0F70: B90A000000 MOV ECX, 10
; 0F75: 49892C24 MOV [R12], RBP
; 0F79: 498BEC MOV RBP, R12
; 0F7C: FF5009 CALL QWORD PTR [RAX+9]
; 0F7F: 480F42E3 CMOVB RSP, RBX
; 0F83: L3: BA17001120 MOV EDX, #x20110017 ; NIL
; 0F88: 488BE5 MOV RSP, RBP
; 0F8B: F8 CLC
; 0F8C: 5D POP RBP
; 0F8D: C3 RET
; 0F8E: L4: 4883EC10 SUB RSP, 16
; 0F92: 488B55F0 MOV RDX, [RBP-16]
; 0F96: 488B053BFDFFFF MOV RAX, [RIP-709] ; #<SB-KERNEL:FDEFN SB-DISASSEM::GET-COMPILED-FUNS>
; 0F9D: B902000000 MOV ECX, 2
; 0FA2: 48892C24 MOV [RSP], RBP
; 0FA6: 488BEC MOV RBP, RSP
; 0FA9: FF5009 CALL QWORD PTR [RAX+9]
; 0FAC: 4C8BF2 MOV R14, RDX
; 0FAF: E9B7000000 JMP L6
; 0FB4: 660F1F840000000000 NOP
; 0FBD: 0F1F00 NOP
; 0FC0: L5: 4981FE17001120 CMP R14, #x20110017 ; NIL
; 0FC7: 74BA JEQ L3
; 0FC9: 4D8B7EF9 MOV R15, [R14-7]
; 0FCD: 498BD7 MOV RDX, R15
; 0FD0: 4C8975D0 MOV [RBP-48], R14
; 0FD4: 4C897DC8 MOV [RBP-56], R15
; 0FD8: 4883EC10 SUB RSP, 16
; 0FDC: 488B05FDFCFFFF MOV RAX, [RIP-771] ; #<SB-KERNEL:FDEFN SB-KERNEL:%FUN-NAME>
; 0FE3: B902000000 MOV ECX, 2
; 0FE8: 48892C24 MOV [RSP], RBP
; 0FEC: 488BEC MOV RBP, RSP
; 0FEF: FF5009 CALL QWORD PTR [RAX+9]
; 0FF2: 488BF2 MOV RSI, RDX
; 0FF5: 4883EC10 SUB RSP, 16
; 0FF9: 488B55F8 MOV RDX, [RBP-8]
; 0FFD: 488B3DE4FCFFFF MOV RDI, [RIP-796] ; "~&; disassembly for ~S"
; 1004: 488B05E5FCFFFF MOV RAX, [RIP-795] ; #<SB-KERNEL:FDEFN FORMAT>
; 100B: B906000000 MOV ECX, 6
; 1010: 48892C24 MOV [RSP], RBP
; 1014: 488BEC MOV RBP, RSP
; 1017: FF5009 CALL QWORD PTR [RAX+9]
; 101A: 4C8B7DC8 MOV R15, [RBP-56]
; 101E: 4C8D6424F0 LEA R12, [RSP-16]
; 1023: 4883EC28 SUB RSP, 40
; 1027: 498BD7 MOV RDX, R15
; 102A: 488B3D6FFCFFFF MOV RDI, [RIP-913] ; :STREAM
; 1031: 488B75F8 MOV RSI, [RBP-8]
; 1035: 488B056CFCFFFF MOV RAX, [RIP-916] ; :USE-LABELS
; 103C: 49894424F0 MOV [R12-16], RAX
; 1041: 488B45D8 MOV RAX, [RBP-40]
; 1045: 49894424E8 MOV [R12-24], RAX
; 104A: 488B05A7FCFFFF MOV RAX, [RIP-857] ; #<SB-KERNEL:FDEFN SB-DISASSEM:DISASSEMBLE-FUN>
; 1051: B90A000000 MOV ECX, 10
; 1056: 49892C24 MOV [R12], RBP
; 105A: 498BEC MOV RBP, R12
; 105D: FF5009 CALL QWORD PTR [RAX+9]
; 1060: 4C8B75D0 MOV R14, [RBP-48]
; 1064: 4D8B6601 MOV R12, [R14+1]
; 1068: 4D8BF4 MOV R14, R12
; 106B: L6: 840425F8FF1020 TEST AL, [#x2010FFF8] ; safepoint
; 1072: 458D66F9 LEA R12D, [R14-7]
; 1076: 41F6C40F TEST R12B, 15
; 107A: 0F8440FFFFFF JEQ L5
; 1080: CC49 INT3 73 ; OBJECT-NOT-LIST-ERROR
; 1082: 38 BYTE #X38 ; R14
; 1083: L7: 488B45E8 MOV RAX, [RBP-24]
; 1087: CC20 INT3 32 ; OBJECT-NOT-TYPE-ERROR
; 1089: 00 BYTE #X00 ; RAX
; 108A: 53 BYTE #X53 ; '(OR STREAM
; NULL)
; 108B: L8: 488B45E8 MOV RAX, [RBP-24]
; 108F: 4C8B600D MOV R12, [RAX+13]
; 1093: L9: 4D8B742425 MOV R14, [R12+37]
; 1098: 4981FE17001120 CMP R14, #x20110017 ; NIL
; 109F: 742C JEQ L10
; 10A1: 4883EC10 SUB RSP, 16
; 10A5: 488B55E8 MOV RDX, [RBP-24]
; 10A9: 488B3D58FCFFFF MOV RDI, [RIP-936] ; #<SB-KERNEL:LAYOUT for STREAM {10000B5CD3}>
; 10B0: 488B0559FCFFFF MOV RAX, [RIP-935] ; #<SB-KERNEL:FDEFN SB-KERNEL:UPDATE-OBJECT-LAYOUT-OR-INVALID>
; 10B7: B904000000 MOV ECX, 4
; 10BC: 48892C24 MOV [RSP], RBP
; 10C0: 488BEC MOV RBP, RSP
; 10C3: FF5009 CALL QWORD PTR [RAX+9]
; 10C6: 480F42E3 CMOVB RSP, RBX
; 10CA: 4C8BE2 MOV R12, RDX
; 10CD: L10: 4D8B74242D MOV R14, [R12+45]
; 10D2: 4D8B7EF9 MOV R15, [R14-7]
; 10D6: 4983FF04 CMP R15, 4
; 10DA: 7E04 JLE L11
; 10DC: 4D8B6611 MOV R12, [R14+17]
; 10E0: L11: 4C3B2521FCFFFF CMP R12, [RIP-991] ; #<SB-KERNEL:LAYOUT for STREAM {10000B5CD3}>
; 10E7: 0F8433FEFFFF JEQ L1
; 10ED: E920FEFFFF JMP L0
; 10F2: EBEC JMP L11
; 10F4: L12: 488B45E8 MOV RAX, [RBP-24]
; 10F8: 4C8B6005 MOV R12, [RAX+5]
; 10FC: EB95 JMP L9
; 10FE: L13: 4D8BB568020000 MOV R14, [R13+616] ; tls: *STANDARD-OUTPUT*
; 1105: 4C8B250CFCFFFF MOV R12, [RIP-1012] ; '*STANDARD-OUTPUT*
; 110C: 4183FE61 CMP R14D, 97
; 1110: 4D0F44742401 CMOVEQ R14, [R12+1]
; 1116: E909FEFFFF JMP L2
; 111B: CC10 INT3 16 ; Invalid argument count trap
NIL
We see that #’+ and #’disassemble are compiled into 135 and 583 bytes of assembly, respectively, using our implementation of Common Lisp.
As long as we use the same implementation, the same version, and we haven’t modified that implementation, we should get the same assembly code back as if it were a pure function. The outputs are the same regardless of the environment or time the call to #’disassemble is made, if the inputs are the same.
Size in bytes of assembly is… a more… mmmm… it’s a less mercurial measure of a function’s performance than if we were to simply compare processor-cycles measured by the ‘time macro.
Now. We have a list of functions, potentially a gorillion of them and all of them procedurally generated, and they all pass the same unit tests. We want to grab just the most performant of these functions, and we’re going to compare performance by comparing the size in bytes of the disassembled functions.
#’disassemble prints out a bunch of things to the *standard-ouput* stream, but it just returns nil. We need to capture the data being sent to *standard-output*, and parse it as a string to grab the useful part: “Size: 583 bytes“.
To do this, my friend wrote a function to hijack the *standard-output* stream’s incoming data and feed it into a string, while disassembling the given function:
(defun disassemble-to-string (func)
(let ((s (make-string-output-stream)))
(disassemble func :stream s)
(get-output-stream-string s)))
#’disassemble-to-string returns what #’disassemble prints, as a string, instead of nil. Said friend wrote a VERY simple function which then would take a list of functions and sort them by the length of the string returned. That’s good and all, except the string has comments, and it contains the name of the function! The length of the name of a function can now influence its supposed performance, under that quick sorting algorithm, so I hacked together a generic method to grab just the size-in-bytes part of the disassembly-string, and to parse it as an integer:
(defmethod func-size-in-bytes (func)
(let ((str (disassemble-to-string func)))
(parse-integer
(subseq str (+ 6 (search "Size: " str)) (search " bytes" str)))))
===================
ELO> (func-size-in-bytes '+)
135
ELO> (func-size-in-bytes 'disassemble)
583
ELO> (func-size-in-bytes 'func-size-in-bytes)
478
Now we just get the actual number of bytes in assembly a function is made of.[2]
Note, though, that I defined #’func-size-in-bytes not as a normal function with ‘defun, no, but rather as a generic method using ‘defmethod. Why? Well, we want to pass a list of functions and get back a list of numbers-of-bytes. I stumbled upon something I find fun, a while back, and I wanted an excuse to show it to you all here.
We have a function, it accepts an object. We want to also be able to call that function on a list of objects, and get back a list of results, as if we had mapped over that list with that function. You can do this by defining a new function and having it, obviously, map over the given list of objects with our original function. That means defining a new function with a new name, though, and I have the memory of a tropical fruit. Maybe even less. I want to just remember one function name, and I don’t want to have to modify my existing function to check if it received a list or not, and to then branch itself based off the listiness of the data given to it.
Enter generic functions.
‘defmethod allows us to define a generic function with the same name, but that dispatches itself based off the data passed to it when called.
Watch:
(defmethod func-size-in-bytes (func)
(let ((str (disassemble-to-string func)))
(parse-integer
(subseq str (+ 6 (search "Size: " str)) (search " bytes" str)))))
(defmethod func-size-in-bytes ((func list))
(mapcar #'func-size-in-bytes func))
We have our original #’func-size-in-bytes, which accepts anything as the arg ‘func. We then have a new, more specific #’func-size-in-bytes which is only called when the given func is of type “list”. Common Lisp and it’s object system, CLOS, dispatches from most-specific to least-specific when calling generic functions. What all this means is that if I pass a list to #’func-size-in-bytes, the new function will be called, but if I pass something that isn’t a list, the first, original function will be called.
And what does the new #’func-size-in-bytes do? It maps over the given list by calling itself over each element in that list! We can now pass just one function to #’func-size-in-bytes, or we can pass a list of functions, and we get back nice yummy happy good-good results either way!
ELO> (func-size-in-bytes '+)
135
ELO> (func-size-in-bytes (list '+ '- '* '/))
(135 102 138 142)
But wait, there’s more!
Because our generic function is calling itself based on whether the passed arg is a list or not, this means that we can not only pass individual objects, not only can we pass lists of objects, but we can pass entire TREES of objects and get back trees of appropriate results back! We’ve implemented tree-walking by just making a generic function call itself when it receives a list!
ELO> (func-size-in-bytes (list '+ (list '- (list '* '/) (list 'disassemble) (list 'func-size-in-bytes (list 'mapcar 'parse-integer))) 'time))
(135 (102 (138 142) (583) (478 (145 1602))) 311)
We pass a tree of functions or symbols bound to functions, and we receive back a tree with identical structure where each func/name has been replaced by the result of calling our func on an individual object, without having modified our original function at all!
This is the kind of thing I get excited about, if you’ve ever wondered.
Moving on; we can get a list of size-in-bytes, now we just need to sort it. That part’s dead easy so here:
(defun sort-by-size-in-bytes (funcs)
(sort funcs (lambda (x y)
(< (func-size-in-bytes x)
(func-size-in-bytes y)))))
And now we can call #’sort-by-size-in-bytes, passing a list of func/names, and getting back a list of the given functions sorted by our notion of performance:
ELO> (func-size-in-bytes (list '+ '- '/ '*))
(135 102 142 138)
ELO> (sort-by-size-in-bytes (list '+ '- '/ '*))
(- + * /)
And there we have it. We now know that, of the four elementary arithmetic functions in SteelBank Common Lisp, ‘- is the smallest in assembly, followed by ‘+, ‘* and finally with ‘/.
Now, finally, let’s compare a function using our “loop-for” with functions using the built-in “dotimes” and “loop” macros:
ELO> (sort-by-size-in-bytes (list
(lambda () (loop-for (i 0) (> i 300000000) (incf i) (sqrt i)))
(lambda () (dotimes (i 300000000) (sqrt i)))
(lambda () (loop for i below 300000000 do (sqrt i)))))
(#<FUNCTION (LAMBDA ()) {10030523AB}>
#<FUNCTION (LAMBDA ()) {100305246B}>
#<FUNCTION (LAMBDA ()) {10030522EB}>)
Behold! We have sorted the three loops by size in assembly! This sorted list’s great ‘n all but I don’t actually know which of those functions corresponds to which macro, so, let’s just get an unsorted list for our own sake:
ELO> (func-size-in-bytes (list
(lambda () (loop-for (i 0) (> i 300000000) (incf i) (sqrt i)))
(lambda () (dotimes (i 300000000) (sqrt i)))
(lambda () (loop for i below 300000000 do (sqrt i)))))
(63 61 61)
Results:
- ‘loop-for = 63 bytes of assembly
- ‘dotimes = 61 bytes of assembly
- ‘loop = 61 bytes of assembly
Conclusion: If we define performance as the size of a function in assembly, then our ‘loop-for is less performant than the standard looping forms.
‘dotimes & ‘loop both compile down to the same assembly, and ‘loop-for compiles to the same except for two extra bytes: [CMP RAX, 600000000] and [JNLE L2].
[1] Superoptimization is, essentially, the practice of finding the smallest amount of assembly required to specify a function that satisfies all given tests. The usual way of superoptimizing a function is to generate the smallest amount of assembly possible, run it against the tests, and if it doesn’t pass them all then we generate the next smallest amount of assembly possible and repeat. That relational languages such as miniKanren explore the latent space of code, the whole space of code that satisfies a given set of tests, means we can (hypothetically) generate every possible function that satisfies all given tests. If we had to sort that list of functions by their assembly-size as we do here, then the function(s) with the smallest assembly-size in that list will be the superoptimized, canonical representation(s) of that function.[1a][1b] It’s a different approach to superoptimization that, if we have a cutoff to make the returned list of possible functions to satisfy our tests finite after proving that we’ve explored the entire space of functions smaller than all remaining functions, could be fun to investigate. But miniKanren and the use of various heuristics and NN’s to search the latent-space of code is the very edge of the field as of writing, in my opinion, so don’t expect anything more insightful on this website than you’d find by listening to WIll Byrd and friends, directly.
[1a] I’m no mathematician so I have no idea if there could be two canonical representations of something, hence the (s)’s.
[1b] At least it’d be superoptimised in terms of our compiler. Just because our compiler compiles a function into a given amount of assembly, doesn’t mean that it’s the smallest amount of assembly needed to represent that function. It could, however, be the smallest amount of code in the host-language needed to represent that function.
[2] #’func-size-in-bytes reads the string output by SBCL 2.0.0, and as such isn’t portable. The output of #’disassemble isn’t standardised, so #’func-size-in-bytes wouldn’t work in, say, cLisp.