tmsr - Log 2019-01-20

BingoBoingo: Meanwhile there is a story floating around that Wikipedo does not load in Vzla after some stunt changing the President according to wikipedia to an opposition leader. At the same time girl on the ground in Vzla reports that wikipedo does indeed load.

BingoBoingo: The anglophone press is saying sophisticated SSL fuckery is involved in the block rather than the usual DNS shennanigans

BingoBoingo: ^ Yes I raised the alarm. Efforts to convince me to lower it are welcome. If you are outside my L1 you need to offer a stake and odds for your petition to be considered.

mircea_popescu: apologetic girl -- "today's the last of my hormone pills, might have something to do with it". deaf mp -- "whore moan pills ?!"

bvt: as far as code speed is concerned, i still don't know whether i fucked up something, or the algorithm is fundamentally slow after some bignum size

mircea_popescu: bvt if you think about it, it has to be slower, because operations are fast and allocations slow. ☟︎

asciilifeform: bvt, mircea_popescu : http://bvt-trace.net/2019/01/experiment-n-way-split-karatsuba/comment-page-1/#comment-6

bvt: yes, and it's also true for memory accesses. however the number of executed instructions also increased a lot, which makes me suspect that there is something i missed

asciilifeform: bvt: ty for reading, signing, and publishing experiment -- i will include your seals in ch16 article

mircea_popescu: i suppose you might say "it shouldn't be x2, my expectation would be it's x2^1/2", but w/e.

asciilifeform: mircea_popescu: i'm not surprised, considering that bounds check overhead (in ada with all safeties switched on) magnifies the 'losing' of a losing (for particular width) algo

asciilifeform: bvt: you have more than twice the # of cache evictions

asciilifeform: of course 2x slower.

asciilifeform: 'locality' effect is actually why e.g. comba wins over ordinary 'kindergarten' o(n^2) mult

asciilifeform: despite both being in same order

asciilifeform: same effect applies to karatsuba, and errywhere else.

bvt: also, the linked pdf contains one FFT-like algorithm i considered to implement (using walsh transform for convolution instead of fft). but it'd be even more complex

asciilifeform: i saw. and imho would be interesting to have a constant-spacetime, no-floats fft

asciilifeform: ( not particularly useful for ffa, but potentially elsewhere.. )

bvt: yes, actually in this karatsuba/toom algo one can embed comba's trick, but the code would become even gnarlier

bvt: re fft: i would wait until a use-case appears, to at least understand what are the requirements. do you have something in mind?

asciilifeform: bvt: i don't, which is why never bothered with fft

asciilifeform: i assumed bvt knew of a use, given that he dug out the ru materials on subj

bvt: nope, but there were a few mentions of other algorithms in the logs, so i decided to have a look. don't remember how i arrived at this particular one

asciilifeform: 'i built atomic dirigible but fughet why!111'

asciilifeform: http://btcbase.org/log/2019-01-20#1888467 << gotta nitpick here: it aint allocations (which in ffa planet are always done by stack frame, in O(1) ) that leads to slow, but cache eviction ( as well as linear overhead from doing moar ops in general ) ☝︎

a111: Logged on 2019-01-20 15:49 mircea_popescu: bvt if you think about it, it has to be slower, because operations are fast and allocations slow.

bvt: no, i mean that particular paper. i initially wanted to implement a fft-like multiplication algorithm, but got interested in this karatsuba/toom when reading the paper.

mircea_popescu: there's that old joke re "i read good books twice but bad books i don't read at all" which very much applies : if only one knew before looking whether an algo is fast or slow!

asciilifeform: there was another fella (since wandered off) who posted last yr an attempt at de-recursivizing karatsuba ( it's definitely in the logs, can't find it just yet tho )

asciilifeform: bvt: there's a long list of things that asciilifeform considered and (for time being) rejected from ffa, on acct of costing substantial complexity for very small saving of cpu cost. e.g. unrolled comba.

asciilifeform: sorta why, when i first started preparing the thing for publication, wound it back to the simplest known ('egyptian') variant, and walked from there.

asciilifeform: it allows reader to see what is bought by adding the moving parts.

bvt: asciilifeform: a better attempt at this algorithm can involve using comba at the lower level.

asciilifeform: bvt: comba itself ended up on the list of things that made it in by very small margin ( it wins perhaps 10%, on most pc iron , vs straight word*word mul as base case, try it yourself by turning the threshhold knob )

asciilifeform: as i noted previously -- i do not expect to find any moar ~asymptotic~ speedups for ffa algos , such that are relevant to the sizes of numbers typically used in public key crypto ☟︎

asciilifeform: the remaining virgin land for speed revvup, is asmism for base cases. ( and ~possibly~ in combination with unrolled comba )

asciilifeform: http://www.loper-os.org/pub/ffa/hypertext/ch15/w_mul__adb.htm#95_14 << specifically here, currently we do 4 MUL instrs for where really needs only 1

bvt: my understanding is that asmism would go only for lower-level ffa code, i.e. barret/modexp will remain as-is.

asciilifeform: ( and when ADD, SUB, we do it twice, where only needs once )

asciilifeform: bvt: correct

asciilifeform: the only reason asmism even potentially invites itself, is that idjit compiler gives no primitive for add/sub-with-carry or full-word mul ☟︎

asciilifeform: the other 'secret' speed boost is if one were to turn off the bounds checks; this gives ~2x speedup across the board. but i dun expect to do this for my personal uses for years , if ever -- it requires 100% certainty of correctness of the program under all possible input

asciilifeform: ( and even presuming 100% correctness, bounds checks still give added 'cosmic ray resistance'. i ~like~ having'em )

bvt: http://p.bvulpes.com/pastes/q0ffh/?raw=true - some things can be done with gcc specific features, but i guess asming is cleaner ☟︎

asciilifeform: the ~other~, tho less simple, speed boost, is if one were to obtain a machine with wider multiplier !

asciilifeform: but i dun currently have such a thing.

asciilifeform: bvt: i'd prefer asmisms to c imports ( which not only ugly and compiler-dependent, but i suspect destroy performance with overhead )

asciilifeform: recent pc irons offer 128x128bit iron mul. but i have not investigated it, and it specifically gotta be tested for constancy of time

asciilifeform: in a hypothetical asmistic branch of ffa, you'd want to implement whole comba in asm, rather than merely word mul

asciilifeform: same goes for add/sub.

asciilifeform: ( otherwise you end up eating the overhead from the existing portable carry calculation mechanism for no reason )

asciilifeform: mircea_popescu: expand ?

asciilifeform: bvt: correct

mircea_popescu: asciilifeform ie, splits the ints in bus width chunks and only does all 4 quadrants depending on results from the others.

asciilifeform: mircea_popescu: that wouldn't be constant-time...

asciilifeform: ( not to mention, a, e.g., 64 bit multiplication table, with extant logic density, would be approx the size of solar system )

asciilifeform: mircea_popescu: it is conceivable that the ones currently sold are constant time , i simply haven't tried'em.

mircea_popescu: nah, just, if you're multipling (for simplicity) 01 x 01, it won't do 0x1 0x0 0x1 parts

asciilifeform: ( the 64x64 iron multer in amd/intel, possibly surprisingly, is in fact constant time, in all boxes i've tested to date )

asciilifeform: the # of gates req'd, is a cube.

asciilifeform: mircea_popescu: possibly, it'll have to be tested when asciilifeform or somebody else can be arsed

asciilifeform: note that the 'cube' observation only applies if you're going for a single-clock-cycle iron multer. otherwise it grows as square of bitness.

asciilifeform: ( the gate count, that is )

asciilifeform has quite thick binder of curated material on subj, for the hypothetical day that we start baking irons

asciilifeform: 1 annoying aspect of 'iron ffa'-gedankenexperiment, is that none of the available fpga ( either 'ice40' series, or the evil ones ) are anywhere near big enuff to prototype with. it'd have to be simulated a la http://www.loper-os.org/?p=2593 , slowly, and then straight to silicon.

asciilifeform brb,teatime

mircea_popescu: meanwhile in news : http://btcbase.org/log/2019-01-19#1888338 << mystery solved. trb uses bdb for block indexing ; the index had an incorrect lobe affecting some blocks ; whenever someone asked for one of those blocks, bdb died and trb never returned. ☝︎

a111: Logged on 2019-01-19 13:50 mircea_popescu: aaand in wtf lulz, http://btcbase.org/log/2019-01-14#1886738 item died again, same exact details, same exact item, jan 15th (ie, it ran for ~day). ima finally get my ass into gear and debug-run it.

mircea_popescu: apparently 200 reqs ~per 100ms~ (that's how long it takes to complete one) is not enough to bring it down. ☟︎☟︎

asciilifeform: mircea_popescu: lobe << that's still interesting - invites q of ~why~ had bad lobe

asciilifeform: ... disk rot ?

asciilifeform: i prolly oughta add to the http://btcbase.org/log/2019-01-20#1888508 thing : 1 of the items which seemed like a speedup, but in actual practice sucked, was the use of (constant-time) 2 (ditto 4) -bit windows for modexp ( iirc apeloyee suggested ) ☝︎

a111: Logged on 2019-01-20 16:23 asciilifeform: as i noted previously -- i do not expect to find any moar ~asymptotic~ speedups for ffa algos , such that are relevant to the sizes of numbers typically used in public key crypto

asciilifeform: the overhead eats the winnings.

asciilifeform: possibly i'ma do a writeup on the subj, once errything else is fielded.

asciilifeform: ( in all fairness, a large -- e.g. 8bit -- window, ~could~ win, but massively multiplies the memory requirement for the thing )

asciilifeform: chances are that it wouldn't, tho, given how the table still has to be indexed via fz_mux in order to prevent variant (i.e. nonconstanttime) memory indexing

asciilifeform: ( '1st commandment' of ffa : thou shalt not branch on seekrit bits. '2nd commandment' -- thou shalt not index memory by seekrit bits ... )

asciilifeform: btw, must also add to bvt's http://btcbase.org/log/2019-01-20#1888517 >> https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html admits that : 'The compiler will attempt to use hardware instructions to implement these built-in functions where possible, like conditional jump on overflow after addition, conditional jump on carry etc.' , i.e. there is not a guarantee that the thing dun introduce cond jumps. ☝︎

a111: Logged on 2019-01-20 16:30 bvt: http://p.bvulpes.com/pastes/q0ffh/?raw=true - some things can be done with gcc specific features, but i guess asming is cleaner

asciilifeform: therefore that approach is completely verboten.

asciilifeform: a correct asmism would simply read the carry flag and put the value where it belongs (e.g. in fz_add, into the next addition, in comba -- into the accumulator; etc)

asciilifeform: without any jumps

asciilifeform: gcc's knob seems to be geared for scenarios where the overflow is an error condition, rather than expected.

bvt: agreed. on x86_64 is compiled to ADD; SETC sequence, but who knows what happens on other arch/gcc version.

asciilifeform: ( ada standard btw trivially allows for types where this holds true automatically , i.e. throws exception for overflow. but this is not only massively unconstanttime but the overhead is gigantic )

asciilifeform: bvt: 1 of the reasons why ada doesn't offer e.g. addition-with-overflow , is that there is an abundance of sad iron where there isn't even physically a carry flag.

asciilifeform: misguided folx ~continue~ to build these, with the excuse given being 'pipeline'

asciilifeform: !#s riscv

asciilifeform: i find it interesting that -- afaik -- nobody's ever built iron that was specifically optimized for bignum

asciilifeform: and no it dun have to have gigantic bus width, necessarily

bvt: i wonder how this is a valid argument even there. if nothing reads the flag, there are no pipeline dependecies.

asciilifeform: could simply have optimized 'take these-here N words and those-there M words, and put bignum addition in memory starting at O, and overflow flag in P ' or similarly

asciilifeform: bvt: notion was, you gotta stop the pipe if something ~were~ to read it

asciilifeform: bvt: possibly the bolix machines also ( they did it in vertical microcode, iirc, tho, and in nonconstant time unsurprisingly )

asciilifeform: in simple o(n) bignum operations like addition, the cost of instruction decoding for each consecutive 'add' , is substantial % of the cost

asciilifeform: so even vertical microcode would win

asciilifeform: bvt: i found a buncha these, when digging

asciilifeform: often called 'systolic' algos

asciilifeform: they potentially win, but only on custom iron really

asciilifeform: bvt: there's a large market of various voodoo for dsp, but afaik it all operates 'inexactly'

asciilifeform: ( and defo not in constant time )

asciilifeform: more or less entirely opposite approach from what's wanted for crypto.

asciilifeform: ftr i suspect that entirely ordinary algos, such as are seen in the current ffa, would already give ~line-rate~ (i.e. , 4096 modexp faster than 1G/s nic can give you new inputs to modexp on ) if implemented in iron properly.

asciilifeform: without any exotic systolicisms etc

asciilifeform: helps to recall that the problem which originally prompted asciilifeform to write ffa, is a (currently hypothetical) application where rsa sigs are carried in ~individual packets~

asciilifeform: ( is why all previously published rsatrons , entirely unsuitable -- if there's any leakage at all via timing, enemy trivially derives yer key )

asciilifeform: nao, it isn't as if the current ffa, with 2.7sec 4096-bit modexp, is immediately usable to eat packets at line rate. but that part at least theoretically parallelizes ( i.e. a rack fulla multicore boxen running ffa, can theoretically eat packets at line rate... )

asciilifeform: err, 1.7

asciilifeform: 1G/s link in principle delivers 262144 4096bit packets /sec ( in practice, many fewer, on acct of overhead )

asciilifeform: anyway pretty sure we had this thread, is in the logs somewhere.

asciilifeform: before considering to bake irons, it is worth to see what a 100%-asmic ffa would give.

asciilifeform: ( this is not a contradiction in terms, it is possible to implement whole thing, with same constant-time algos, by hand asm )

asciilifeform: ideally one would unroll ~all~ of the loops ( e.g. instead of looping through the words of a bignum, would e.g. add-with-carry on immediately consecutive words with stream of add instrs, etc )

bvt: would this actually be efficient on current arches? this would stress instruction decoder a lot

asciilifeform: bvt: i expect one would trivially get a 10-20x speedup over the ordinary ffa, esp. if the item still fits in l1

asciilifeform: i found last yr, for instance, that unrolled comba ( still in ada ) gives 20-25% speedup.

asciilifeform: the penalty from having any branches, of whatever kind, anywhere at all, on pc iron -- is substantial

asciilifeform: ( consider the http://www.loper-os.org/?p=2906#selection-864.0-883.127 experiment, for instance. )

asciilifeform: koch's turd, despite being implemented in c, with no bounds checks, actually loses to ch14 ffa , for inputs of same ~width~ -- despite fact that he doesn't constanttime and thereby gets to skip massive work

asciilifeform: the skipping itself is expensive enuff on iron with cache/branchpredictor, that he loses rather than wins from it.

bvt: i guess i could do some experiments here. the immediate question is that ffa does plenty of FZ_Adds with different FZ'Length, so full unrolling would not really work (unless i miss something).

asciilifeform: bvt: all of the lengths are deterministically known from the primary fz width.

asciilifeform: therefore you can unroll.

asciilifeform: just as you can de-recursivize the karatsuba etc

asciilifeform: none of the lengths depend on the actual contents of the user input.

asciilifeform: granted, an unrolled ffa would operate on a fixed width (e.g. 8192) of primary fz.

asciilifeform: but this limitation would also be true of any hypothetical arithmetic iron.

asciilifeform: bvt: even the current (ch11 and after) ffa relies on a gnat with working forced-inlining

asciilifeform: ( ave1 discovered how to guarantee working inlining, and this gave 'free' ~2x speedup )

bvt: writing such code really could use some program to generate the necessary amount of adds/subs

asciilifeform: bvt: a typical macroassembler would work for the purpose.

asciilifeform: one would still want to audit the output of any such thing tho, by hand.

bvt: ave1's fix also fixed the 'undefined references' in static lib with gcc-6.3 for me

asciilifeform: not having used gcc5+ , i never saw this bug

asciilifeform: bvt: why not ? they all will look same neh

asciilifeform: it's no moar complicated than to count bricks in a wall.

asciilifeform: and besides, the wall wouldn't even have that many bricks -- on a 64bit bus, a 4096-bit addition is 64 instructions long

asciilifeform: ( if you have an add-with-carry instr )

asciilifeform: there's an obvious limit to what you can unroll and still fit in any plausible cache tho.

asciilifeform: ( you won't be unrolling an entire 4096bit modexp, on any plausible irons... )

asciilifeform: the inner loop or 2, tho, definitely can.

asciilifeform: i'll add , for completeness of thread, that if yer ~sending~, rather than receiving, rsa packets, your bottleneck will be ~rng~ long before it could ever be the arithmetron per se

asciilifeform: ( a single FG , recall, yields ~7kB/s at room temp )

asciilifeform: a fast rsatron is important mainly in light of fast rejection of crapola sent by enemy, rather than for payload per se. ☟︎