log☇︎
158 entries in 0.473s
asciilifeform: in the flagship branch, it comes into play as base case when 8 or fewer words in multiplicand
asciilifeform: bvt: what i meant is, the optimal asmistic base case is not necessarily comba. could be 'micro' karatsuba, or even straight schoolbook mul. for so long as asm.
bvt: you mean, can use just karatsuba?
bvt: re ~2.5x speedup: for 8-indeces-threshold asm comba makes up only ~25% of runtime, rest is spent in karatsuba squaring
bvt: i agree. i played a bit with the code, arrived at slightly cleaner (and sub-1% faster) code for multiplication, seems that the performance is close to the hw limit (at least at 32 Indices Comba-Karatsuba threshold)
a111: Logged on 2019-06-22 07:09 bvt: i do expect that 'saltmine season' is over now -- i've been working on asming karatsuba squaring (ffa ch.12b) over this week, post expected tomorrow.
bvt: i do expect that 'saltmine season' is over now -- i've been working on asming karatsuba squaring (ffa ch.12b) over this week, post expected tomorrow. ☟︎
a111: Logged on 2018-11-26 15:24 asciilifeform: meanwhile, in spam lulz, http://www.eghtesadban.com/events/5560171/finite-field-arithmetic-chapter-12b-karatsuba-redux-part-2-of-2
a111: 93 results for "from:asciilifeform karatsuba", http://btcbase.org/log-search?q=from%3Aasciilifeform%20karatsuba
asciilifeform: !#s from:asciilifeform karatsuba
asciilifeform: verisimilitude: as a concrete example : you will find that ffa uses an unmoving hinge for karatsuba multiplication. consequently all numbers are required to occupy a space that is a power-of-two bits wide. but from this you get a 3-4x simpler mechanism.
a111: Logged on 2019-01-20 15:31 feedbot: http://bvt-trace.net/2019/01/experiment-n-way-split-karatsuba/ << bvt's backtrace -- Experiment: N-Way Split Karatsuba
asciilifeform: just as you can de-recursivize the karatsuba etc
asciilifeform: there was another fella (since wandered off) who posted last yr an attempt at de-recursivizing karatsuba ( it's definitely in the logs, can't find it just yet tho )
bvt: no, i mean that particular paper. i initially wanted to implement a fft-like multiplication algorithm, but got interested in this karatsuba/toom when reading the paper.
bvt: yes, actually in this karatsuba/toom algo one can embed comba's trick, but the code would become even gnarlier
asciilifeform: same effect applies to karatsuba, and errywhere else.
asciilifeform: bvt, mircea_popescu : http://bvt-trace.net/2019/01/experiment-n-way-split-karatsuba/comment-page-1/#comment-6
feedbot: http://bvt-trace.net/2019/01/experiment-n-way-split-karatsuba/ << bvt's backtrace -- Experiment: N-Way Split Karatsuba ☟︎
asciilifeform: diana_coman: where this appears, it is a concession to 1) speed 2) brevity of code . where the indexing of the array is known, because it is a local buffer, it is referred to explicitly; invocations X'First, X'Last, etc cost measurable cpu, it turned out. but where indexing is not known ( e.g. if operand may be part of a karatsuba ) , there it is unavoidably X'First etc.
asciilifeform: diana_coman: where at all possible, i tried to use ~simpler~ algo than what was given by knuth. e.g. karatsuba without moving pivots.
asciilifeform: the other thing, if one is writing a proggy where only 1 ffawidth is used, it is possible to unroll all of the loops, derecursivize karatsuba, etc. and get straight line of instructions
asciilifeform: btw it is possible to dispense with FZ_LoMul , and simply discard upper chunk of ordinary karatsuba product. this costs ~10% cpuwise, and may be justified for some proggies.
mircea_popescu: karatsuba also.
zx2c4: > At this time we will walk through the mechanics of our Karatsuba multiplier, so as to cement in the reader’s head the correctness of the routine, and lay groundwork for the optimization which is introduced in Ch. 12B.
asciilifeform: meanwhile, in spam lulz, http://www.eghtesadban.com/events/5560171/finite-field-arithmetic-chapter-12b-karatsuba-redux-part-2-of-2 ☟︎
asciilifeform: e.g. the karatsuba 'sandwiches' took coupla ~days~ of this.
deedbot: http://www.loper-os.org/?p=2798 << Loper OS - Finite Field Arithmetic. Chapter 12B: Karatsuba Redux. (Part 2 of 2)
deedbot: http://www.loper-os.org/?p=2753 << Loper OS - Finite Field Arithmetic. Chapter 12A: Karatsuba Redux. (Part 1 of 2)
asciilifeform: ( http://btcbase.org/patches/ffa_ch10_karatsuba/tree/ffa/ffacalc/cmdline.ads + http://btcbase.org/patches/ffa_ch10_karatsuba/tree/ffa/ffacalc/cmdline.adb for the curious )
asciilifeform: ( there was not an existing 'profile' corresponding to the degree of 'fascism' asciilifeform wanted , hence the bulk of http://btcbase.org/patches/ffa_ch10_karatsuba/tree/ffa/libffa/restrict.adc item )
mod6: yeah, no. it's fine. i was happy to build/run. to me this is all very exciting/important stuff. back when alf was doing the karatsuba and other things in the late summer/fall, this type of thing started to be contimplated for innerloop substitute.
asciilifeform: >> http://btcbase.org/patches/ffa_ch10_karatsuba << ty phf !
asciilifeform: and, to nobody's great surprise, i suspect, the ideal karatsuba threshhold turns out to differ for 'soft' vs 'iron' word-mul variant.
asciilifeform: this is the karatsuba ch. quite possibly the most complicated in the whole series. and no it can't be split.
asciilifeform: or for another example, take the ugliness and 'pointericity' of the traditional 'pivoting' form of karatsuba. which i killed by forcing all FZ bitnesses to be powers of 2.
apeloyee: karatsuba's saves area only, not time
asciilifeform: makes karatsuba cut cleanly.
asciilifeform: plenty of what i suspect 'kindergarten-level' (in hindsight, like karatsuba) -- unsolved.
mircea_popescu: there's outright cleverness, like that guy's method to compute inside pi ; and then there's just sheer good fucking fortune, like karatsuba.
asciilifeform: e.g. karatsuba
asciilifeform: ( multiplication parallelizes 3 ways , with karatsuba, and N ways with toom-cook )
asciilifeform: the 2 things i did not forbid (though wanted to) were recursion (can't... karatsuba) and implicit-loops (can't... array assignments)
asciilifeform: to simplify karatsuba and other algos.
mod6: with comba, for instance, as with others (ex. karatsuba): i tend to go and read the papers on it, then review the code and see how ffa is doing it. etc.
a111: Logged on 2017-10-14 18:39 apeloyee: besides, "bernsteinan karatsuba" requres carry-save arithmetic, otherwise it likely wins nothing. so not separate from comba rewrite.
apeloyee: besides, "bernsteinan karatsuba" requres carry-save arithmetic, otherwise it likely wins nothing. so not separate from comba rewrite. ☟︎
apeloyee: can has link describing " bernsteinian karatsuba"?
asciilifeform: i still think that it makes sense to do this only after every other bolt is as tight as physically possible -- bernsteinian karatsuba, unrolled comba, etc
asciilifeform: ( this is quite attainable using the ffa we have ~today~, supposing one allows karatsuba to split 3way on 3 cores of whatever chip you have )
a111: Logged on 2017-10-08 20:43 apeloyee: not if it needs 8x more temp space << and karatsuba totally doesn't.
asciilifeform: http://btcbase.org/log/2017-10-08#1723064 << gcd does not need a karatsuba, the karatsubatron can be doing something else while gcd happens ☝︎
apeloyee: not if it needs 8x more temp space << and karatsuba totally doesn't. ☟︎
asciilifeform: ( karatsuba assumes length always divisible by 2 )
asciilifeform: ( karatsuba, i will note for n00bz, parallelizes , but i deliberately omitted parallelization logic because i want ffa buildable on msdos and for machines with 1 cpu )
asciilifeform: and then bernsteinian karatsuba, possibly, and whatever else i can think of.
a111: Logged on 2017-10-07 00:38 asciilifeform: mod6: you will notice that the barrett in 'crc handbook' is more complicated : it shrinks the x and then compensates later. this relies on normalization , and constanttimeized incarnation of it would have to work as apeloyee described ( i'ma try it much later, once i see what can be had re speed strictly from having asymmetric karatsuba instead of the current mega-waste )
apeloyee: 2 half products out of 3 on the first level of recursion, 4 of 9 on second, and 8 of 27 on third, assuming 64-bit words and unrealistic 2-fold speedup of comba for half-multiply, and no overhead in karatsuba,
apeloyee: I mean, W_Mul doesn't do karatsuba
a111: Logged on 2017-10-07 21:14 apeloyee: http://btcbase.org/log/2017-10-07#1722289 << and the point of doing karatsuba is? you do 2 recursive calls to Mul_Karatsuba_TopOnly and one to Mul_Karatsuba. should've simply calculated upper_part(XLo*YHi), upper_part(YLo*XHi) and XHi*YHi
apeloyee: http://btcbase.org/log/2017-10-07#1722395 << why do karatsuba when you can just shift and add them, like in your W_Mul ☝︎
asciilifeform: ( mainly, i suspect, by recognizing masses of 0 in karatsuba and returning 0 when they get mul'd )
a111: Logged on 2017-10-07 21:14 apeloyee: http://btcbase.org/log/2017-10-07#1722289 << and the point of doing karatsuba is? you do 2 recursive calls to Mul_Karatsuba_TopOnly and one to Mul_Karatsuba. should've simply calculated upper_part(XLo*YHi), upper_part(YLo*XHi) and XHi*YHi
apeloyee: http://btcbase.org/log/2017-10-07#1722289 << and the point of doing karatsuba is? you do 2 recursive calls to Mul_Karatsuba_TopOnly and one to Mul_Karatsuba. should've simply calculated upper_part(XLo*YHi), upper_part(YLo*XHi) and XHi*YHi ☝︎☟︎☟︎
asciilifeform: ( if it isn't obvious from where the error comes : observe the 3 Karatsuba_Term additions. in ordinary K., they walk over the upper half of XYLo ( lower half of result.) but in TopOnly K. we lose XYLo, so that carryolade is lost. )
asciilifeform: now! this procrusted-karatsuba is only used for the barrettron, so theoretically could compensate for that 3 with 3 additional subtractor-muxes. and still win ~4x speedup vs last night's . but this is mega-ugly.
asciilifeform: ( Karatsuba_Term is same for both )
asciilifeform: http://wotpaste.cascadianhacker.com/pastes/TgRkm/?raw=true << ordinary karatsuba, for convenient comparison.
asciilifeform: ^ this 'upper half only' karatsuba works, but the answer is always off by 0 to 3, because the carries from the bottom halves are ( recursively ) lost. somehow gotta be finessed.
asciilifeform: mod6: you will notice that the barrett in 'crc handbook' is more complicated : it shrinks the x and then compensates later. this relies on normalization , and constanttimeized incarnation of it would have to work as apeloyee described ( i'ma try it much later, once i see what can be had re speed strictly from having asymmetric karatsuba instead of the current mega-waste ) ☟︎
asciilifeform: incidentally ~95% of the work ffa does in modexp, now, is multiplication. which means that there is further 20-25% speedup waiting to be had when i get bernsteinian optimization for karatsuba ( haven't yet figured it out, he buried it deep in a paper , as if he were an alchemist, quite cryptically ) and another 10-20% optimization if we move to unrolled comba ( see august thread. )
asciilifeform: ( barrett with 8192b barretoids, i.e. 16384bit mult via ordinary symmetric karatsuba with simple brutal slice , rather than apeloyee's shift )
asciilifeform: it was about 100x slower than comba ( current basecase ) and ~150x slower than karatsuba+comba ( currently what we use )
asciilifeform: one by jebelean, 'Exact Division with Karatsuba Complexity' , possibly ffaizable.
asciilifeform: this is why it was so tricky to implement karatsuba.
asciilifeform: karatsuba terms a+b+c ( k. squaring, for simplicity. mult. has four of'em ). we want a+b+c mod m.
asciilifeform: ( this is the 'new' karatsuba, simplified, and demands powers-of-two-bitnessed operands. should be easier to read than old.)
a111: Logged on 2017-09-12 21:22 asciilifeform: http://wotpaste.cascadianhacker.com/pastes/uLQBe/?raw=true << karatsuba, for reference
asciilifeform: what we do not have -- and don't have because it doesn't actually work -- is the karatsuba cut for modular mult.
asciilifeform: and yes it'd be 133337 if it worked, we would have a direct modular equivalent of karatsuba
asciilifeform: 1) mircea_popescu describes algo for mod. 2) turns out exactly knuths's, that is in existing ffa 3) describes 'do it to each term of a+b+c in karatsuba' 4) this dun work, if it worked we would be bragging about the new 133337 recursive modular mult algo we've got
asciilifeform: ( one answer is that at the very bottom of the barrel, where karatsuba hits the basecase comba mult, we permit division )
asciilifeform: ( nao if only modulus were a distributive operation ! then could take mod for each of the 3 addition arguments inside karatsuba, and we'd be golden )
asciilifeform: http://wotpaste.cascadianhacker.com/pastes/uLQBe/?raw=true << karatsuba, for reference ☟︎
asciilifeform: ( because if you are doing anything other than karatsuba for mult of whatever variety, you're sunk )
asciilifeform: the approach i've been (futile, so far) taking is, to find a way to interleave modularization into karatsuba
a111: Logged on 2017-08-17 20:30 asciilifeform: but in very other olds, apparently in an obscure article in '09 bernstein shows how to eliminate one of the middle-term additions of karatsuba .
asciilifeform: but in very other olds, apparently in an obscure article in '09 bernstein shows how to eliminate one of the middle-term additions of karatsuba . ☟︎
asciilifeform: but instead we can simply let Karatsuba_Term return the carry, and say c1, c2, c3 and in the end we ripple out c1+c2+(-1^subtractenable) .
asciilifeform: in other noose, how come nobody noticed that we can accumulate the carry in Karatsuba_Term, and ripple it out once instead of 3 times per firing ...
ave1: asciilifeform: but it works, and for me was a good way to review karatsuba code
ave1: asciilifeform: are you still looking for a non-recursive Karatsuba? or should it wait until it is fully crystalized?
asciilifeform: ( specifically karatsuba's 3 prongs, are independent )
mod6: just was curious if your readability issue could be resolved with pulling the paramaters of Mul_Karatsuba back out in similar fashion to the original. 'tis all.
mod6: Mul_Karatsuba(X(X'First .. X'First + K - 1),
mod6: Mul_Karatsuba(X0, Y0, P);
mod6: http://btcbase.org/log/2017-08-14#1697732 << >> <+asciilifeform> despite 50% reduction in temp space used by karatsuba mult and square ☝︎
asciilifeform: ^ for readers who wondered why karatsuba is the 1 routine in ffa ~not~ inlined... think.
mod6: this might be extra-strength dumb, but... in your new power of 2 version, do you need to inline the Mul & Square of karatsuba?
asciilifeform: despite 50% reduction in temp space used by karatsuba mult and square