log☇︎
83 entries in 0.697s
asciilifeform: mp_en_viaje: i recall finding that comba died right when was writing comba's algo.
asciilifeform: ( properly speaking, comba aint even a mult algo in the usual mathematical sense, but merely a 'sideways' restatement of the traditional schoolboy O(n^2) mult algo to behave smoother on pc and other irons where cache is in play )
mp_en_viaje: would be kinda interesting if comba ends up snipped.
bvt: in case only asm is used, plain comba may be fast enough as well
asciilifeform: bvt: what i meant is, the optimal asmistic base case is not necessarily comba. could be 'micro' karatsuba, or even straight schoolbook mul. for so long as asm.
asciilifeform: bvt: it isn't actually a given that comba is even needed at all, with asmistic 64x64 mul
bvt: re ~2.5x speedup: for 8-indeces-threshold asm comba makes up only ~25% of runtime, rest is spent in karatsuba squaring
bvt: i agree. i played a bit with the code, arrived at slightly cleaner (and sub-1% faster) code for multiplication, seems that the performance is close to the hw limit (at least at 32 Indices Comba-Karatsuba threshold)
lobbesbot: phf: Sent 17 hours and 38 minutes ago: <asciilifeform> plox to snarf sig in http://bvt-trace.net/2019/05/unrolled-assembly-comba/ ( and diana_coman 's earlier ) at your earliest , into btcbase, ty
asciilifeform: !Q later tell phf plox to snarf sig in http://bvt-trace.net/2019/05/unrolled-assembly-comba/ ( and diana_coman 's earlier ) at your earliest , into btcbase, ty
asciilifeform: http://bvt-trace.net/2019/05/unrolled-assembly-comba/#selection-93.193-97.1
bvt: asciilifeform: depends on whether there will be a usecase for such item, but in any case shifts/adds should be simpler than comba
feedbot: http://bvt-trace.net/2019/05/unrolled-assembly-comba/ << bvt's backtrace -- Unrolled x86_64 Assembly Comba for Ch. 11 FFA
a111: Logged on 2019-04-15 21:30 bvt: the only smaller item in my pipeline is unrolling comba for ch10 multiplication; and i'd like to give a shot a 'kill /dev/random and /dev/urandom' experiment, replacing them with buffers than can be directly fed from userspace, but this - even later.
bvt: the only smaller item in my pipeline is unrolling comba for ch10 multiplication; and i'd like to give a shot a 'kill /dev/random and /dev/urandom' experiment, replacing them with buffers than can be directly fed from userspace, but this - even later. ☟︎
a111: Logged on 2019-03-10 02:44 mircea_popescu: http://bvt-trace.net/2019/03/ffa-chapter-9-homework-comba-in-x86_64-assembly/#selection-15.295-19.176 << why this, specifically ? is there no ada asm calling method besides this ?
lobbesbot: bvt: Sent 22 hours and 22 minutes ago: <asciilifeform> http://bvt-trace.net/2019/03/ffa-chapter-9-homework-comba-in-x86_64-assembly/comment-page-1/#comment-12
a111: Logged on 2019-03-10 02:44 mircea_popescu: http://bvt-trace.net/2019/03/ffa-chapter-9-homework-comba-in-x86_64-assembly/#selection-15.295-19.176 << why this, specifically ? is there no ada asm calling method besides this ?
mircea_popescu: http://bvt-trace.net/2019/03/ffa-chapter-9-homework-comba-in-x86_64-assembly/#selection-15.295-19.176 << why this, specifically ? is there no ada asm calling method besides this ? ☟︎☟︎
asciilifeform: !Q later tell bvt http://bvt-trace.net/2019/03/ffa-chapter-9-homework-comba-in-x86_64-assembly/comment-page-1/#comment-12
feedbot: http://bvt-trace.net/2019/03/ffa-chapter-9-homework-comba-in-x86_64-assembly/ << bvt's backtrace -- FFA Chapter 9 Homework: Comba in X86_64 Assembly
asciilifeform: i found last yr, for instance, that unrolled comba ( still in ada ) gives 20-25% speedup.
asciilifeform: a correct asmism would simply read the carry flag and put the value where it belongs (e.g. in fz_add, into the next addition, in comba -- into the accumulator; etc)
asciilifeform: in a hypothetical asmistic branch of ffa, you'd want to implement whole comba in asm, rather than merely word mul
asciilifeform: the remaining virgin land for speed revvup, is asmism for base cases. ( and ~possibly~ in combination with unrolled comba )
asciilifeform: bvt: comba itself ended up on the list of things that made it in by very small margin ( it wins perhaps 10%, on most pc iron , vs straight word*word mul as base case, try it yourself by turning the threshhold knob )
bvt: asciilifeform: a better attempt at this algorithm can involve using comba at the lower level.
asciilifeform: bvt: there's a long list of things that asciilifeform considered and (for time being) rejected from ffa, on acct of costing substantial complexity for very small saving of cpu cost. e.g. unrolled comba.
bvt: yes, actually in this karatsuba/toom algo one can embed comba's trick, but the code would become even gnarlier
asciilifeform: 'locality' effect is actually why e.g. comba wins over ordinary 'kindergarten' o(n^2) mult
asciilifeform: errything after, will be linear (e.g. asm mul, in the end, unrolled comba, etc)
asciilifeform: mircea_popescu: iirc just comba (basecase mul) alone, if asmified, is 8x speedup , even if all else left alone
asciilifeform: i had example of this back in august, of comba.
asciilifeform: the use of comba , and in fact of ANY multiplier that compiles into having a x86 MUL instruction anywhere in it, is unsafe on intel celeron. and ppc32. and possibly ppc64.
asciilifeform: mircea_popescu: apparently not. hence why unrolled comba was iirc 2x+ speedup
asciilifeform: i got a megatonne of those . ( e.g. unrolled comba )
asciilifeform: ... pretty depressing, and just about obscene, you could just about fit msdos mega-game DIGGER.COM in the space taken by gcc's notion of ffa_mul_comba !
mircea_popescu: And in this here FFa post we will be taking Comba Mult version x from y date and together with last week's X, Y and Z, and make this pile
asciilifeform: trinque: but let's say i take a routine from earlier ( e.g. unrolled comba mult ) and the rest -- from last week's
asciilifeform: ( i already have, and posted, an unrolled comba -- see august log. )
asciilifeform: i'ma also try unrolled comba, to compare, but do not want to spend eons on this type of massage, it can happen after release of p.
mod6: with comba, for instance, as with others (ex. karatsuba): i tend to go and read the papers on it, then review the code and see how ffa is doing it. etc.
mod6: and that's working pretty great. just now digging into comba and other higher-level ones. not writing tests on those yet -- just studying, reading.
a111: Logged on 2017-10-14 18:39 apeloyee: besides, "bernsteinan karatsuba" requres carry-save arithmetic, otherwise it likely wins nothing. so not separate from comba rewrite.
apeloyee: besides, "bernsteinan karatsuba" requres carry-save arithmetic, otherwise it likely wins nothing. so not separate from comba rewrite. ☟︎
a111: Logged on 2017-08-10 02:43 asciilifeform: for simplicity, tested the case that actually happens in practice: on a 64bit box, any ffa width over 512 bits gives a strictly 8-wide comba mult ocurrence
asciilifeform: unrolled comba is here: http://btcbase.org/log/2017-08-10#1696236 ( http://archive.is/iMI4W ) ☝︎
asciilifeform: i still think that it makes sense to do this only after every other bolt is as tight as physically possible -- bernsteinian karatsuba, unrolled comba, etc
asciilifeform: ( unrolled comba would have explicit unrolled cases for 1,2,...,8-word operands )
asciilifeform: for instance unrolled comba wins 20-25% speed, but i did not use it in place of the generic because it is longer and harder to read.
apeloyee: 2 half products out of 3 on the first level of recursion, 4 of 9 on second, and 8 of 27 on third, assuming 64-bit words and unrealistic 2-fold speedup of comba for half-multiply, and no overhead in karatsuba,
apeloyee: and most products for which the comba is called, are full products, not half products
apeloyee: see, it does three recursive calls, meaning the speedup is wholly dependent on the speedup of comba for half-multiply
asciilifeform: incidentally ~95% of the work ffa does in modexp, now, is multiplication. which means that there is further 20-25% speedup waiting to be had when i get bernsteinian optimization for karatsuba ( haven't yet figured it out, he buried it deep in a paper , as if he were an alchemist, quite cryptically ) and another 10-20% optimization if we move to unrolled comba ( see august thread. )
asciilifeform: will need asymmetric comba, too...
asciilifeform: it was about 100x slower than comba ( current basecase ) and ~150x slower than karatsuba+comba ( currently what we use )
asciilifeform: the way the comba multer works
asciilifeform: ( one answer is that at the very bottom of the barrel, where karatsuba hits the basecase comba mult, we permit division )
asciilifeform: also lit search came back with an equally obscure recipe for a 10% or so faster comba ( tldr : compute the square (diagonal) digits in a separate walk, and add them to the result of the nondiagonal pass , at the end )
ave1: asciiliform: BTW maybe I missed it but why do comba when L <= 8? does this come from speed testing?
asciilifeform: forcing 2^x = W, x in Z , also simplifies comba
asciilifeform: we definitely don't need any case of comba above 8 tho
asciilifeform: for simplicity, tested the case that actually happens in practice: on a 64bit box, any ffa width over 512 bits gives a strictly 8-wide comba mult ocurrence ☟︎
a111: Logged on 2017-08-08 23:51 asciilifeform: it thereby follows that i could unroll comba into explicit cases from 1 to 8 words
a111: Logged on 2017-08-08 21:28 asciilifeform: in other noose, mod6 , phf , et al : http://btcbase.org/log/2017-07-10#1681208 nao 1.5s . ( this with karasbuba-squaring used in exp, and comba-squaring used as base case in the former. )
asciilifeform: now wouldja believe i spent 3 wks trying to eliminate the if N mod 2 = 0 ... condition in Square_Comba
asciilifeform: it thereby follows that i could unroll comba into explicit cases from 1 to 8 words ☟︎
asciilifeform: in other noose, mod6 , phf , et al : http://btcbase.org/log/2017-07-10#1681208 nao 1.5s . ( this with karasbuba-squaring used in exp, and comba-squaring used as base case in the former. ) ☝︎☟︎
ascii_modem: i also discovered how to write comba such that the only branch is the loop branch
ascii_modem: also comba-squaring worx nao.
mod6: (contains completed karatsuba, but no comba)
mod6: i need to integrate W_Mul_Comba & W_Add_D and do some of the other things suggested above.
a111: Logged on 2017-07-13 05:21 mod6: this is actually using karatsuba, haven't even integrated the new comba code yet.
asciilifeform: http://btcbase.org/log/2017-07-13#1682308 << comba is an O(N^2) multer. it beats the classical one simply because it doesn't need a scratch buffer of length N. it goes as the base case in karatsuba, not instead of it ☝︎
mod6: this is actually using karatsuba, haven't even integrated the new comba code yet. ☟︎
a111: 6 results for "comba", http://btcbase.org/log-search?q=comba
asciilifeform: !#s comba
mod6: im retarded, what is 'comba' ?
asciilifeform: ( alert reader will ask, what did i change ? answer : 1) comba 2) gcc -O2 (this keeps all bounds checks and dun do anything aggressive, just peepholes) )
a111: Logged on 2017-06-29 19:57 asciilifeform: in other noose ! nao we have comba's algo multiplier as basecase in karatsuba (currently threshold 8 words) , and http://btcbase.org/log/2017-06-21#1673165 becomes now 7.5sec
a111: Logged on 2017-05-21 00:47 asciilifeform: in other sads, it turns out that paul g. comba ( of comba's multiplication algo ) died in april.
asciilifeform: in other noose ! nao we have comba's algo multiplier as basecase in karatsuba (currently threshold 8 words) , and http://btcbase.org/log/2017-06-21#1673165 becomes now 7.5sec ☝︎☟︎
asciilifeform: in other sads, it turns out that paul g. comba ( of comba's multiplication algo ) died in april. ☟︎