log☇︎
50900+ entries in 0.03s
asciilifeform: bvt: even the current (ch11 and after) ffa relies on a gnat with working forced-inlining
bvt: aha, i see. this would also involve lots function call inlining as well
asciilifeform: but this limitation would also be true of any hypothetical arithmetic iron.
asciilifeform: none of the lengths depend on the actual contents of the user input.
asciilifeform: just as you can de-recursivize the karatsuba etc
asciilifeform: bvt: all of the lengths are deterministically known from the primary fz width.
bvt: i guess i could do some experiments here. the immediate question is that ffa does plenty of FZ_Adds with different FZ'Length, so full unrolling would not really work (unless i miss something).
asciilifeform: the skipping itself is expensive enuff on iron with cache/branchpredictor, that he loses rather than wins from it.
asciilifeform: koch's turd, despite being implemented in c, with no bounds checks, actually loses to ch14 ffa , for inputs of same ~width~ -- despite fact that he doesn't constanttime and thereby gets to skip massive work
asciilifeform: ( consider the http://www.loper-os.org/?p=2906#selection-864.0-883.127 experiment, for instance. )
asciilifeform: i found last yr, for instance, that unrolled comba ( still in ada ) gives 20-25% speedup.
asciilifeform: bvt: i expect one would trivially get a 10-20x speedup over the ordinary ffa, esp. if the item still fits in l1
bvt: would this actually be efficient on current arches? this would stress instruction decoder a lot
asciilifeform: ideally one would unroll ~all~ of the loops ( e.g. instead of looping through the words of a bignum, would e.g. add-with-carry on immediately consecutive words with stream of add instrs, etc )
asciilifeform: ( this is not a contradiction in terms, it is possible to implement whole thing, with same constant-time algos, by hand asm )
asciilifeform: before considering to bake irons, it is worth to see what a 100%-asmic ffa would give.
asciilifeform: anyway pretty sure we had this thread, is in the logs somewhere.
asciilifeform: nao, it isn't as if the current ffa, with 2.7sec 4096-bit modexp, is immediately usable to eat packets at line rate. but that part at least theoretically parallelizes ( i.e. a rack fulla multicore boxen running ffa, can theoretically eat packets at line rate... )
asciilifeform: ( is why all previously published rsatrons , entirely unsuitable -- if there's any leakage at all via timing, enemy trivially derives yer key )
asciilifeform: helps to recall that the problem which originally prompted asciilifeform to write ffa, is a (currently hypothetical) application where rsa sigs are carried in ~individual packets~
asciilifeform: ftr i suspect that entirely ordinary algos, such as are seen in the current ffa, would already give ~line-rate~ (i.e. , 4096 modexp faster than 1G/s nic can give you new inputs to modexp on ) if implemented in iron properly.
asciilifeform: ( and defo not in constant time )
asciilifeform: bvt: there's a large market of various voodoo for dsp, but afaik it all operates 'inexactly'
bvt: are such machines even build these days? i.e. for dsp, or other use-cases? ☟︎
asciilifeform: bvt: i found a buncha these, when digging
asciilifeform: in simple o(n) bignum operations like addition, the cost of instruction decoding for each consecutive 'add' , is substantial % of the cost
bvt: actually, that fft-like multiplication algorithm was developed in APL
asciilifeform: bvt: possibly the bolix machines also ( they did it in vertical microcode, iirc, tho, and in nonconstant time unsurprisingly )
asciilifeform: bvt: notion was, you gotta stop the pipe if something ~were~ to read it
asciilifeform: could simply have optimized 'take these-here N words and those-there M words, and put bignum addition in memory starting at O, and overflow flag in P ' or similarly
bvt: i wonder how this is a valid argument even there. if nothing reads the flag, there are no pipeline dependecies.
asciilifeform: and no it dun have to have gigantic bus width, necessarily
asciilifeform: i find it interesting that -- afaik -- nobody's ever built iron that was specifically optimized for bignum
asciilifeform: misguided folx ~continue~ to build these, with the excuse given being 'pipeline'
asciilifeform: bvt: 1 of the reasons why ada doesn't offer e.g. addition-with-overflow , is that there is an abundance of sad iron where there isn't even physically a carry flag.
asciilifeform: ( ada standard btw trivially allows for types where this holds true automatically , i.e. throws exception for overflow. but this is not only massively unconstanttime but the overhead is gigantic )
bvt: agreed. on x86_64 is compiled to ADD; SETC sequence, but who knows what happens on other arch/gcc version.
asciilifeform: gcc's knob seems to be geared for scenarios where the overflow is an error condition, rather than expected.
asciilifeform: a correct asmism would simply read the carry flag and put the value where it belongs (e.g. in fz_add, into the next addition, in comba -- into the accumulator; etc)
asciilifeform: therefore that approach is completely verboten.
a111: Logged on 2019-01-20 16:30 bvt: http://p.bvulpes.com/pastes/q0ffh/?raw=true - some things can be done with gcc specific features, but i guess asming is cleaner
asciilifeform: btw, must also add to bvt's http://btcbase.org/log/2019-01-20#1888517 >> https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html admits that : 'The compiler will attempt to use hardware instructions to implement these built-in functions where possible, like conditional jump on overflow after addition, conditional jump on carry etc.' , i.e. there is not a guarantee that the thing dun introduce cond jumps. ☝︎
asciilifeform: ( '1st commandment' of ffa : thou shalt not branch on seekrit bits. '2nd commandment' -- thou shalt not index memory by seekrit bits ... )
asciilifeform: chances are that it wouldn't, tho, given how the table still has to be indexed via fz_mux in order to prevent variant (i.e. nonconstanttime) memory indexing
asciilifeform: ( in all fairness, a large -- e.g. 8bit -- window, ~could~ win, but massively multiplies the memory requirement for the thing )
asciilifeform: possibly i'ma do a writeup on the subj, once errything else is fielded.
asciilifeform: the overhead eats the winnings.
a111: Logged on 2019-01-20 16:23 asciilifeform: as i noted previously -- i do not expect to find any moar ~asymptotic~ speedups for ffa algos , such that are relevant to the sizes of numbers typically used in public key crypto
asciilifeform: i prolly oughta add to the http://btcbase.org/log/2019-01-20#1888508 thing : 1 of the items which seemed like a speedup, but in actual practice sucked, was the use of (constant-time) 2 (ditto 4) -bit windows for modexp ( iirc apeloyee suggested ) ☝︎
asciilifeform: mircea_popescu: lobe << that's still interesting - invites q of ~why~ had bad lobe
mircea_popescu: apparently 200 reqs ~per 100ms~ (that's how long it takes to complete one) is not enough to bring it down. ☟︎☟︎
mircea_popescu: meanwhile in trilema lulz : netstat | grep http | wc -l > 190.
mircea_popescu: -reindex rebuilds the index as expected. fin.
mircea_popescu: meanwhile in news : http://btcbase.org/log/2019-01-19#1888338 << mystery solved. trb uses bdb for block indexing ; the index had an incorrect lobe affecting some blocks ; whenever someone asked for one of those blocks, bdb died and trb never returned. ☝︎
asciilifeform: 1 annoying aspect of 'iron ffa'-gedankenexperiment, is that none of the available fpga ( either 'ice40' series, or the evil ones ) are anywhere near big enuff to prototype with. it'd have to be simulated a la http://www.loper-os.org/?p=2593 , slowly, and then straight to silicon.
asciilifeform has quite thick binder of curated material on subj, for the hypothetical day that we start baking irons
asciilifeform: ( the gate count, that is )
asciilifeform: note that the 'cube' observation only applies if you're going for a single-clock-cycle iron multer. otherwise it grows as square of bitness.
asciilifeform: mircea_popescu: possibly, it'll have to be tested when asciilifeform or somebody else can be arsed
mircea_popescu: but afaik the double-wide mul, "optimized"
asciilifeform: ( the 64x64 iron multer in amd/intel, possibly surprisingly, is in fact constant time, in all boxes i've tested to date )
asciilifeform: mircea_popescu: it is conceivable that the ones currently sold are constant time , i simply haven't tried'em.
asciilifeform: ( not to mention, a, e.g., 64 bit multiplication table, with extant logic density, would be approx the size of solar system )
asciilifeform: mircea_popescu: that wouldn't be constant-time...
mircea_popescu: asciilifeform ie, splits the ints in bus width chunks and only does all 4 quadrants depending on results from the others.
bvt: i'll be afk an hour or two
bvt: i.e. W_Borrow/W_Carry still cause the overhead?
mircea_popescu: i suspect double-wide mul might be implemented by lookup tables.
asciilifeform: ( otherwise you end up eating the overhead from the existing portable carry calculation mechanism for no reason )
asciilifeform: in a hypothetical asmistic branch of ffa, you'd want to implement whole comba in asm, rather than merely word mul
asciilifeform: recent pc irons offer 128x128bit iron mul. but i have not investigated it, and it specifically gotta be tested for constancy of time
asciilifeform: bvt: i'd prefer asmisms to c imports ( which not only ugly and compiler-dependent, but i suspect destroy performance with overhead )
asciilifeform: but i dun currently have such a thing.
asciilifeform: the ~other~, tho less simple, speed boost, is if one were to obtain a machine with wider multiplier !
bvt: http://p.bvulpes.com/pastes/q0ffh/?raw=true - some things can be done with gcc specific features, but i guess asming is cleaner ☟︎
asciilifeform: the other 'secret' speed boost is if one were to turn off the bounds checks; this gives ~2x speedup across the board. but i dun expect to do this for my personal uses for years , if ever -- it requires 100% certainty of correctness of the program under all possible input
asciilifeform: the only reason asmism even potentially invites itself, is that idjit compiler gives no primitive for add/sub-with-carry or full-word mul ☟︎
asciilifeform: ( and when ADD, SUB, we do it twice, where only needs once )
bvt: my understanding is that asmism would go only for lower-level ffa code, i.e. barret/modexp will remain as-is.
asciilifeform: as i noted previously -- i do not expect to find any moar ~asymptotic~ speedups for ffa algos , such that are relevant to the sizes of numbers typically used in public key crypto ☟︎
asciilifeform: bvt: comba itself ended up on the list of things that made it in by very small margin ( it wins perhaps 10%, on most pc iron , vs straight word*word mul as base case, try it yourself by turning the threshhold knob )
bvt: myeah, complexity does not go too well with both constant-time and fits-in-head.
bvt: asciilifeform: a better attempt at this algorithm can involve using comba at the lower level.
asciilifeform: it allows reader to see what is bought by adding the moving parts.
asciilifeform: sorta why, when i first started preparing the thing for publication, wound it back to the simplest known ('egyptian') variant, and walked from there.
asciilifeform: bvt: there's a long list of things that asciilifeform considered and (for time being) rejected from ffa, on acct of costing substantial complexity for very small saving of cpu cost. e.g. unrolled comba.
bvt: mircea_popescu: thanks
asciilifeform: there was another fella (since wandered off) who posted last yr an attempt at de-recursivizing karatsuba ( it's definitely in the logs, can't find it just yet tho )
mircea_popescu: there's that old joke re "i read good books twice but bad books i don't read at all" which very much applies : if only one knew before looking whether an algo is fast or slow!
bvt: no, i mean that particular paper. i initially wanted to implement a fft-like multiplication algorithm, but got interested in this karatsuba/toom when reading the paper.
a111: Logged on 2019-01-20 15:49 mircea_popescu: bvt if you think about it, it has to be slower, because operations are fast and allocations slow.
asciilifeform: http://btcbase.org/log/2019-01-20#1888467 << gotta nitpick here: it aint allocations (which in ffa planet are always done by stack frame, in O(1) ) that leads to slow, but cache eviction ( as well as linear overhead from doing moar ops in general ) ☝︎
mircea_popescu: lol, you're supposed to remember!
bvt: nope, but there were a few mentions of other algorithms in the logs, so i decided to have a look. don't remember how i arrived at this particular one
asciilifeform: i assumed bvt knew of a use, given that he dug out the ru materials on subj
bvt: re fft: i would wait until a use-case appears, to at least understand what are the requirements. do you have something in mind?
bvt: yes, actually in this karatsuba/toom algo one can embed comba's trick, but the code would become even gnarlier
asciilifeform: i saw. and imho would be interesting to have a constant-spacetime, no-floats fft
bvt: also, the linked pdf contains one FFT-like algorithm i considered to implement (using walsh transform for convolution instead of fft). but it'd be even more complex
asciilifeform: same effect applies to karatsuba, and errywhere else.