83 entries in 0.77s
: mp_en_viaje: i recall finding that comba
died right when was writing comba
: ( properly speaking, comba
aint even a mult algo in the usual mathematical sense, but merely a 'sideways' restatement of the traditional schoolboy O(n^2) mult algo to behave smoother on pc and other irons where cache is in play )
: would be kinda interesting if comba
ends up snipped.
: in case only asm is used, plain comba
may be fast enough as well
: bvt: what i meant is, the optimal asmistic base case is not necessarily comba
. could be 'micro' karatsuba, or even straight schoolbook mul. for so long as asm.
: bvt: it isn't actually a given that comba
is even needed at all, with asmistic 64x64 mul
: re ~2.5x speedup: for 8-indeces-threshold asm comba
makes up only ~25% of runtime, rest is spent in karatsuba squaring
: i agree. i played a bit with the code, arrived at slightly cleaner (and sub-1% faster) code for multiplication, seems that the performance is close to the hw limit (at least at 32 Indices Comba
: asciilifeform: depends on whether there will be a usecase for such item, but in any case shifts/adds should be simpler than comba
: Logged on 2019-04-15 21:30 bvt: the only smaller item in my pipeline is unrolling comba
for ch10 multiplication; and i'd like to give a shot a 'kill /dev/random and /dev/urandom' experiment, replacing them with buffers than can be directly fed from userspace, but this - even later.
: the only smaller item in my pipeline is unrolling comba
for ch10 multiplication; and i'd like to give a shot a 'kill /dev/random and /dev/urandom' experiment, replacing them with buffers than can be directly fed from userspace, but this - even later. ☟︎
: i found last yr, for instance, that unrolled comba
( still in ada ) gives 20-25% speedup.
: a correct asmism would simply read the carry flag and put the value where it belongs (e.g. in fz_add, into the next addition, in comba
-- into the accumulator; etc)
: in a hypothetical asmistic branch of ffa, you'd want to implement whole comba
in asm, rather than merely word mul
: the remaining virgin land for speed revvup, is asmism for base cases. ( and ~possibly~ in combination with unrolled comba
: bvt: comba
itself ended up on the list of things that made it in by very small margin ( it wins perhaps 10%, on most pc iron , vs straight word*word mul as base case, try it yourself by turning the threshhold knob )
: asciilifeform: a better attempt at this algorithm can involve using comba
at the lower level.
: bvt: there's a long list of things that asciilifeform considered and (for time being) rejected from ffa, on acct of costing substantial complexity for very small saving of cpu cost. e.g. unrolled comba
: yes, actually in this karatsuba/toom algo one can embed comba
's trick, but the code would become even gnarlier
: 'locality' effect is actually why e.g. comba
wins over ordinary 'kindergarten' o(n^2) mult
: errything after, will be linear (e.g. asm mul, in the end, unrolled comba
: mircea_popescu: iirc just comba
(basecase mul) alone, if asmified, is 8x speedup , even if all else left alone
: the use of comba
, and in fact of ANY multiplier that compiles into having a x86 MUL instruction anywhere in it, is unsafe on intel celeron. and ppc32. and possibly ppc64.
: mircea_popescu: apparently not. hence why unrolled comba
was iirc 2x+ speedup
: i got a megatonne of those . ( e.g. unrolled comba
: ... pretty depressing, and just about obscene, you could just about fit msdos mega-game DIGGER.COM in the space taken by gcc's notion of ffa_mul_comba !
: And in this here FFa post we will be taking Comba
Mult version x from y date and together with last week's X, Y and Z, and make this pile
: trinque: but let's say i take a routine from earlier ( e.g. unrolled comba
mult ) and the rest -- from last week's
: ( i already have, and posted, an unrolled comba
-- see august log. )
: i'ma also try unrolled comba
, to compare, but do not want to spend eons on this type of massage, it can happen after release of p.
: with comba
, for instance, as with others (ex. karatsuba): i tend to go and read the papers on it, then review the code and see how ffa is doing it. etc.
: and that's working pretty great. just now digging into comba
and other higher-level ones. not writing tests on those yet -- just studying, reading.
: Logged on 2017-10-14 18:39 apeloyee: besides, "bernsteinan karatsuba" requres carry-save arithmetic, otherwise it likely wins nothing. so not separate from comba
: besides, "bernsteinan karatsuba" requres carry-save arithmetic, otherwise it likely wins nothing. so not separate from comba
: Logged on 2017-08-10 02:43 asciilifeform: for simplicity, tested the case that actually happens in practice: on a 64bit box, any ffa width over 512 bits gives a strictly 8-wide comba
: i still think that it makes sense to do this only after every other bolt is as tight as physically possible -- bernsteinian karatsuba, unrolled comba
: ( unrolled comba
would have explicit unrolled cases for 1,2,...,8-word operands )
: for instance unrolled comba
wins 20-25% speed, but i did not use it in place of the generic because it is longer and harder to read.
: 2 half products out of 3 on the first level of recursion, 4 of 9 on second, and 8 of 27 on third, assuming 64-bit words and unrealistic 2-fold speedup of comba
for half-multiply, and no overhead in karatsuba,
: and most products for which the comba
is called, are full products, not half products
: see, it does three recursive calls, meaning the speedup is wholly dependent on the speedup of comba
: incidentally ~95% of the work ffa does in modexp, now, is multiplication. which means that there is further 20-25% speedup waiting to be had when i get bernsteinian optimization for karatsuba ( haven't yet figured it out, he buried it deep in a paper , as if he were an alchemist, quite cryptically ) and another 10-20% optimization if we move to unrolled comba
( see august thread. )
: it was about 100x slower than comba
( current basecase ) and ~150x slower than karatsuba+comba
( currently what we use )
: ( one answer is that at the very bottom of the barrel, where karatsuba hits the basecase comba
mult, we permit division )
: also lit search came back with an equally obscure recipe for a 10% or so faster comba
( tldr : compute the square (diagonal) digits in a separate walk, and add them to the result of the nondiagonal pass , at the end )
: asciiliform: BTW maybe I missed it but why do comba
when L <= 8? does this come from speed testing?
: we definitely don't need any case of comba
above 8 tho
: for simplicity, tested the case that actually happens in practice: on a 64bit box, any ffa width over 512 bits gives a strictly 8-wide comba
mult ocurrence ☟︎
: Logged on 2017-08-08 23:51 asciilifeform: it thereby follows that i could unroll comba
into explicit cases from 1 to 8 words
: Logged on 2017-08-08 21:28 asciilifeform: in other noose, mod6 , phf , et al : http://btcbase.org/log/2017-07-10#1681208
nao 1.5s . ( this with karasbuba-squaring used in exp, and comba
-squaring used as base case in the former. )
: now wouldja believe i spent 3 wks trying to eliminate the if N mod 2 = 0 ... condition in Square_Comba
: it thereby follows that i could unroll comba
into explicit cases from 1 to 8 words ☟︎
: i also discovered how to write comba
such that the only branch is the loop branch
: (contains completed karatsuba, but no comba
: i need to integrate W_Mul_Comba & W_Add_D and do some of the other things suggested above.
: Logged on 2017-07-13 05:21 mod6: this is actually using karatsuba, haven't even integrated the new comba
: this is actually using karatsuba, haven't even integrated the new comba
code yet. ☟︎
: im retarded, what is 'comba
: ( alert reader will ask, what did i change ? answer : 1) comba
2) gcc -O2 (this keeps all bounds checks and dun do anything aggressive, just peepholes) )
: Logged on 2017-05-21 00:47 asciilifeform: in other sads, it turns out that paul g. comba
( of comba
's multiplication algo ) died in april.
: in other sads, it turns out that paul g. comba
( of comba
's multiplication algo ) died in april. ☟︎