log
▁▁▁▁▁▁⏐︎▁▁▁▁▁▁ 3359
BingoBoingo: Meanwhile in other fallout from the new penal code: https://www.reddit.com/r/uruguay/comments/ahax04/se_cans%C3%B3_de_esperar_en_banifox_y_se_rob%C3%B3_una_play/
BingoBoingo: Meanwhile there is a story floating around that Wikipedo does not load in Vzla after some stunt changing the President according to wikipedia to an opposition leader. At the same time girl on the ground in Vzla reports that wikipedo does indeed load.
BingoBoingo: The anglophone press is saying sophisticated SSL fuckery is involved in the block rather than the usual DNS shennanigans
BingoBoingo: I suspect USG.NSA.JAVAGURLZ is auditioning for sandbag duty
BingoBoingo: And that aircraft carrier mght be sinking sooner rather than later
feedbot: http://qntra.net/2019/01/increased-usg-psyops-against-maduro-government-intensify-risk-of-military-confrontation/ << Qntra -- Increased USG Psyops Against Maduro Government Intensify Risk Of Military Confrontation
BingoBoingo: ^ Yes I raised the alarm. Efforts to convince me to lower it are welcome. If you are outside my L1 you need to offer a stake and odds for your petition to be considered.
BingoBoingo: https://twitter.com/BBoingo/status/1086854906055741440 << mircea_popescu Seems I have yet to get twatter banned
mircea_popescu: apologetic girl -- "today's the last of my hormone pills, might have something to do with it". deaf mp -- "whore moan pills ?!"
mircea_popescu: meanwhile in antiqua, http://btcbase.org/log/2013-12-21#428326 ☝︎
a111: Logged on 2013-12-21 02:07 asciilifeform: http://www.usatoday.com/story/nation/2013/12/19/maj-gen-michael-carey-nuclear-missiles-alcohol/4131963/
feedbot: http://bvt-trace.net/2019/01/experiment-n-way-split-karatsuba/ << bvt's backtrace -- Experiment: N-Way Split Karatsuba ☟︎
mircea_popescu: ahaha check him out asciilifeform ! MATHEMATICAL NOTATION!
mircea_popescu: apparently, the split variant is ~slower~.
bvt: hello
bvt: re math, i found that htmling it would be unreadable, so decided to use images.
bvt: as far as code speed is concerned, i still don't know whether i fucked up something, or the algorithm is fundamentally slow after some bignum size
bvt: asciilifeform: http://bvt-trace.net/vpatches/ffa_ch1_genesis.kv.vpatch.bvt.sig http://bvt-trace.net/vpatches/ffa_ch2_logicals.kv.vpatch.bvt.sig http://bvt-trace.net/vpatches/ffa_ch3_shifts.kv.vpatch.bvt.sig
bvt: http://bvt-trace.net/vpatches/ffa_ch4_ffacalc.kv.vpatch.bvt.sig http://bvt-trace.net/vpatches/ffa_ch5_egypt.kv.vpatch.bvt.sig http://bvt-trace.net/vpatches/ffa_ch6_simplest_rsa.kv.vpatch.bvt.sig
bvt: will provide more seals after i work with chapters 7-9 more.
mircea_popescu: bvt if you think about it, it has to be slower, because operations are fast and allocations slow. ☟︎
asciilifeform: bvt, mircea_popescu : http://bvt-trace.net/2019/01/experiment-n-way-split-karatsuba/comment-page-1/#comment-6
bvt: yes, and it's also true for memory accesses. however the number of executed instructions also increased a lot, which makes me suspect that there is something i missed
mircea_popescu: i don't think so.
asciilifeform: bvt: ty for reading, signing, and publishing experiment -- i will include your seals in ch16 article
mircea_popescu: i suppose you might say "it shouldn't be x2, my expectation would be it's x2^1/2", but w/e.
asciilifeform: mircea_popescu: i'm not surprised, considering that bounds check overhead (in ada with all safeties switched on) magnifies the 'losing' of a losing (for particular width) algo
bvt: i would not be surprised it was 20% slower, but 2x was surprising for me
bvt: asciilifeform: yw
mircea_popescu: asciilifeform aha. kinda my thought also.
asciilifeform: bvt: you have more than twice the # of cache evictions
asciilifeform: of course 2x slower.
asciilifeform: 'locality' effect is actually why e.g. comba wins over ordinary 'kindergarten' o(n^2) mult
asciilifeform: despite both being in same order
asciilifeform: same effect applies to karatsuba, and errywhere else.
bvt: also, the linked pdf contains one FFT-like algorithm i considered to implement (using walsh transform for convolution instead of fft). but it'd be even more complex
asciilifeform: i saw. and imho would be interesting to have a constant-spacetime, no-floats fft
asciilifeform: ( not particularly useful for ffa, but potentially elsewhere.. )
bvt: yes, actually in this karatsuba/toom algo one can embed comba's trick, but the code would become even gnarlier
bvt: re fft: i would wait until a use-case appears, to at least understand what are the requirements. do you have something in mind?
asciilifeform: bvt: i don't, which is why never bothered with fft
asciilifeform: i assumed bvt knew of a use, given that he dug out the ru materials on subj
bvt: nope, but there were a few mentions of other algorithms in the logs, so i decided to have a look. don't remember how i arrived at this particular one
mircea_popescu: lol, you're supposed to remember!
asciilifeform: 'i built atomic dirigible but fughet why!111'
asciilifeform: http://btcbase.org/log/2019-01-20#1888467 << gotta nitpick here: it aint allocations (which in ffa planet are always done by stack frame, in O(1) ) that leads to slow, but cache eviction ( as well as linear overhead from doing moar ops in general ) ☝︎
a111: Logged on 2019-01-20 15:49 mircea_popescu: bvt if you think about it, it has to be slower, because operations are fast and allocations slow.
bvt: no, i mean that particular paper. i initially wanted to implement a fft-like multiplication algorithm, but got interested in this karatsuba/toom when reading the paper.
mircea_popescu: nickpit granted!
bvt: :-)
mircea_popescu: anyway, nothing wrong with it, i enjoyed reading.
mircea_popescu: there's that old joke re "i read good books twice but bad books i don't read at all" which very much applies : if only one knew before looking whether an algo is fast or slow!
asciilifeform: there was another fella (since wandered off) who posted last yr an attempt at de-recursivizing karatsuba ( it's definitely in the logs, can't find it just yet tho )
bvt: mircea_popescu: thanks
asciilifeform: bvt: there's a long list of things that asciilifeform considered and (for time being) rejected from ffa, on acct of costing substantial complexity for very small saving of cpu cost. e.g. unrolled comba.
asciilifeform: sorta why, when i first started preparing the thing for publication, wound it back to the simplest known ('egyptian') variant, and walked from there.
asciilifeform: it allows reader to see what is bought by adding the moving parts.
bvt: asciilifeform: a better attempt at this algorithm can involve using comba at the lower level.
bvt: myeah, complexity does not go too well with both constant-time and fits-in-head.
asciilifeform: bvt: comba itself ended up on the list of things that made it in by very small margin ( it wins perhaps 10%, on most pc iron , vs straight word*word mul as base case, try it yourself by turning the threshhold knob )
bvt: will do. so far i had a very brief look at it
asciilifeform: as i noted previously -- i do not expect to find any moar ~asymptotic~ speedups for ffa algos , such that are relevant to the sizes of numbers typically used in public key crypto ☟︎
asciilifeform: the remaining virgin land for speed revvup, is asmism for base cases. ( and ~possibly~ in combination with unrolled comba )
asciilifeform: http://www.loper-os.org/pub/ffa/hypertext/ch15/w_mul__adb.htm#95_14 << specifically here, currently we do 4 MUL instrs for where really needs only 1
bvt: my understanding is that asmism would go only for lower-level ffa code, i.e. barret/modexp will remain as-is.
asciilifeform: ( and when ADD, SUB, we do it twice, where only needs once )
asciilifeform: bvt: correct
asciilifeform: the only reason asmism even potentially invites itself, is that idjit compiler gives no primitive for add/sub-with-carry or full-word mul ☟︎
asciilifeform: the other 'secret' speed boost is if one were to turn off the bounds checks; this gives ~2x speedup across the board. but i dun expect to do this for my personal uses for years , if ever -- it requires 100% certainty of correctness of the program under all possible input
asciilifeform: ( and even presuming 100% correctness, bounds checks still give added 'cosmic ray resistance'. i ~like~ having'em )
bvt: http://p.bvulpes.com/pastes/q0ffh/?raw=true - some things can be done with gcc specific features, but i guess asming is cleaner ☟︎
asciilifeform: the ~other~, tho less simple, speed boost, is if one were to obtain a machine with wider multiplier !
asciilifeform: but i dun currently have such a thing.
asciilifeform: bvt: i'd prefer asmisms to c imports ( which not only ugly and compiler-dependent, but i suspect destroy performance with overhead )
asciilifeform: recent pc irons offer 128x128bit iron mul. but i have not investigated it, and it specifically gotta be tested for constancy of time
asciilifeform: in a hypothetical asmistic branch of ffa, you'd want to implement whole comba in asm, rather than merely word mul
asciilifeform: same goes for add/sub.
asciilifeform: ( otherwise you end up eating the overhead from the existing portable carry calculation mechanism for no reason )
mircea_popescu: i suspect double-wide mul might be implemented by lookup tables.
asciilifeform: mircea_popescu: expand ?
bvt: i.e. W_Borrow/W_Carry still cause the overhead?
asciilifeform: bvt: correct
bvt: i'll be afk an hour or two
mircea_popescu: asciilifeform ie, splits the ints in bus width chunks and only does all 4 quadrants depending on results from the others.
asciilifeform: mircea_popescu: that wouldn't be constant-time...
mircea_popescu: which is my point.
mircea_popescu: 128*128 bit iron mul is not ct by itself.
asciilifeform: ( not to mention, a, e.g., 64 bit multiplication table, with extant logic density, would be approx the size of solar system )
asciilifeform: mircea_popescu: it is conceivable that the ones currently sold are constant time , i simply haven't tried'em.
mircea_popescu: nah, just, if you're multipling (for simplicity) 01 x 01, it won't do 0x1 0x0 0x1 parts
asciilifeform: ( the 64x64 iron multer in amd/intel, possibly surprisingly, is in fact constant time, in all boxes i've tested to date )
mircea_popescu: cuz cheap enough.
asciilifeform: the # of gates req'd, is a cube.
mircea_popescu: but afaik the double-wide mul, "optimized"
asciilifeform: mircea_popescu: possibly, it'll have to be tested when asciilifeform or somebody else can be arsed
asciilifeform: note that the 'cube' observation only applies if you're going for a single-clock-cycle iron multer. otherwise it grows as square of bitness.
asciilifeform: ( the gate count, that is )
asciilifeform has quite thick binder of curated material on subj, for the hypothetical day that we start baking irons
asciilifeform: 1 annoying aspect of 'iron ffa'-gedankenexperiment, is that none of the available fpga ( either 'ice40' series, or the evil ones ) are anywhere near big enuff to prototype with. it'd have to be simulated a la http://www.loper-os.org/?p=2593 , slowly, and then straight to silicon.
asciilifeform brb,teatime
mircea_popescu: meanwhile in news : http://btcbase.org/log/2019-01-19#1888338 << mystery solved. trb uses bdb for block indexing ; the index had an incorrect lobe affecting some blocks ; whenever someone asked for one of those blocks, bdb died and trb never returned. ☝︎
a111: Logged on 2019-01-19 13:50 mircea_popescu: aaand in wtf lulz, http://btcbase.org/log/2019-01-14#1886738 item died again, same exact details, same exact item, jan 15th (ie, it ran for ~day). ima finally get my ass into gear and debug-run it.
mircea_popescu: -reindex rebuilds the index as expected. fin.
mircea_popescu: meanwhile in trilema lulz : netstat | grep http | wc -l > 190.
mircea_popescu: apparently 200 reqs ~per 100ms~ (that's how long it takes to complete one) is not enough to bring it down. ☟︎☟︎
asciilifeform: mircea_popescu: lobe << that's still interesting - invites q of ~why~ had bad lobe
asciilifeform: ... disk rot ?
asciilifeform: i prolly oughta add to the http://btcbase.org/log/2019-01-20#1888508 thing : 1 of the items which seemed like a speedup, but in actual practice sucked, was the use of (constant-time) 2 (ditto 4) -bit windows for modexp ( iirc apeloyee suggested ) ☝︎
a111: Logged on 2019-01-20 16:23 asciilifeform: as i noted previously -- i do not expect to find any moar ~asymptotic~ speedups for ffa algos , such that are relevant to the sizes of numbers typically used in public key crypto
asciilifeform: the overhead eats the winnings.
asciilifeform: possibly i'ma do a writeup on the subj, once errything else is fielded.
asciilifeform: ( in all fairness, a large -- e.g. 8bit -- window, ~could~ win, but massively multiplies the memory requirement for the thing )
asciilifeform: chances are that it wouldn't, tho, given how the table still has to be indexed via fz_mux in order to prevent variant (i.e. nonconstanttime) memory indexing
asciilifeform: ( '1st commandment' of ffa : thou shalt not branch on seekrit bits. '2nd commandment' -- thou shalt not index memory by seekrit bits ... )
asciilifeform: btw, must also add to bvt's http://btcbase.org/log/2019-01-20#1888517 >> https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html admits that : 'The compiler will attempt to use hardware instructions to implement these built-in functions where possible, like conditional jump on overflow after addition, conditional jump on carry etc.' , i.e. there is not a guarantee that the thing dun introduce cond jumps. ☝︎
a111: Logged on 2019-01-20 16:30 bvt: http://p.bvulpes.com/pastes/q0ffh/?raw=true - some things can be done with gcc specific features, but i guess asming is cleaner
asciilifeform: therefore that approach is completely verboten.
asciilifeform: a correct asmism would simply read the carry flag and put the value where it belongs (e.g. in fz_add, into the next addition, in comba -- into the accumulator; etc)
asciilifeform: without any jumps
asciilifeform: gcc's knob seems to be geared for scenarios where the overflow is an error condition, rather than expected.
bvt: agreed. on x86_64 is compiled to ADD; SETC sequence, but who knows what happens on other arch/gcc version.
asciilifeform: ( ada standard btw trivially allows for types where this holds true automatically , i.e. throws exception for overflow. but this is not only massively unconstanttime but the overhead is gigantic )
asciilifeform: bvt: 1 of the reasons why ada doesn't offer e.g. addition-with-overflow , is that there is an abundance of sad iron where there isn't even physically a carry flag.
asciilifeform: misguided folx ~continue~ to build these, with the excuse given being 'pipeline'
asciilifeform: e
asciilifeform: .g.,
asciilifeform: !#s riscv
a111: 11 results for "riscv", http://btcbase.org/log-search?q=riscv
asciilifeform: i find it interesting that -- afaik -- nobody's ever built iron that was specifically optimized for bignum
asciilifeform: and no it dun have to have gigantic bus width, necessarily
bvt: i wonder how this is a valid argument even there. if nothing reads the flag, there are no pipeline dependecies.
asciilifeform: could simply have optimized 'take these-here N words and those-there M words, and put bignum addition in memory starting at O, and overflow flag in P ' or similarly
asciilifeform: bvt: notion was, you gotta stop the pipe if something ~were~ to read it
bvt: perhaps some old APL machines would qualify
asciilifeform: bvt: possibly the bolix machines also ( they did it in vertical microcode, iirc, tho, and in nonconstant time unsurprisingly )
bvt: actually, that fft-like multiplication algorithm was developed in APL
asciilifeform: in simple o(n) bignum operations like addition, the cost of instruction decoding for each consecutive 'add' , is substantial % of the cost
asciilifeform: so even vertical microcode would win
bvt: https://www.dyalog.com/uploads/conference/dyalog12/presentations/U23_MultidigitAlgorithms/Optimization%20of%20parallel%20multi-digit%20algorithms.pdf
asciilifeform: bvt: i found a buncha these, when digging
asciilifeform: often called 'systolic' algos
asciilifeform: they potentially win, but only on custom iron really
bvt: are such machines even build these days? i.e. for dsp, or other use-cases? ☟︎
asciilifeform: bvt: there's a large market of various voodoo for dsp, but afaik it all operates 'inexactly'
asciilifeform: ( and defo not in constant time )
asciilifeform: more or less entirely opposite approach from what's wanted for crypto.
asciilifeform: ftr i suspect that entirely ordinary algos, such as are seen in the current ffa, would already give ~line-rate~ (i.e. , 4096 modexp faster than 1G/s nic can give you new inputs to modexp on ) if implemented in iron properly.
asciilifeform: without any exotic systolicisms etc
asciilifeform: helps to recall that the problem which originally prompted asciilifeform to write ffa, is a (currently hypothetical) application where rsa sigs are carried in ~individual packets~
asciilifeform: ( is why all previously published rsatrons , entirely unsuitable -- if there's any leakage at all via timing, enemy trivially derives yer key )
asciilifeform: nao, it isn't as if the current ffa, with 2.7sec 4096-bit modexp, is immediately usable to eat packets at line rate. but that part at least theoretically parallelizes ( i.e. a rack fulla multicore boxen running ffa, can theoretically eat packets at line rate... )
asciilifeform: err, 1.7
asciilifeform: 1G/s link in principle delivers 262144 4096bit packets /sec ( in practice, many fewer, on acct of overhead )
asciilifeform: anyway pretty sure we had this thread, is in the logs somewhere.
asciilifeform: before considering to bake irons, it is worth to see what a 100%-asmic ffa would give.
asciilifeform: ( this is not a contradiction in terms, it is possible to implement whole thing, with same constant-time algos, by hand asm )
asciilifeform: ideally one would unroll ~all~ of the loops ( e.g. instead of looping through the words of a bignum, would e.g. add-with-carry on immediately consecutive words with stream of add instrs, etc )
bvt: would this actually be efficient on current arches? this would stress instruction decoder a lot
asciilifeform: bvt: i expect one would trivially get a 10-20x speedup over the ordinary ffa, esp. if the item still fits in l1
asciilifeform: i found last yr, for instance, that unrolled comba ( still in ada ) gives 20-25% speedup.
asciilifeform: the penalty from having any branches, of whatever kind, anywhere at all, on pc iron -- is substantial
asciilifeform: ( consider the http://www.loper-os.org/?p=2906#selection-864.0-883.127 experiment, for instance. )
asciilifeform: koch's turd, despite being implemented in c, with no bounds checks, actually loses to ch14 ffa , for inputs of same ~width~ -- despite fact that he doesn't constanttime and thereby gets to skip massive work
asciilifeform: the skipping itself is expensive enuff on iron with cache/branchpredictor, that he loses rather than wins from it.
bvt: i guess i could do some experiments here. the immediate question is that ffa does plenty of FZ_Adds with different FZ'Length, so full unrolling would not really work (unless i miss something).
asciilifeform: bvt: all of the lengths are deterministically known from the primary fz width.
asciilifeform: therefore you can unroll.
asciilifeform: just as you can de-recursivize the karatsuba etc
asciilifeform: none of the lengths depend on the actual contents of the user input.
asciilifeform: granted, an unrolled ffa would operate on a fixed width (e.g. 8192) of primary fz.
asciilifeform: but this limitation would also be true of any hypothetical arithmetic iron.
bvt: aha, i see. this would also involve lots function call inlining as well
asciilifeform: bvt: even the current (ch11 and after) ffa relies on a gnat with working forced-inlining
asciilifeform: ( ave1 discovered how to guarantee working inlining, and this gave 'free' ~2x speedup )
bvt: writing such code really could use some program to generate the necessary amount of adds/subs
asciilifeform: bvt: a typical macroassembler would work for the purpose.
asciilifeform: one would still want to audit the output of any such thing tho, by hand.
bvt: ave1's fix also fixed the 'undefined references' in static lib with gcc-6.3 for me
asciilifeform: not having used gcc5+ , i never saw this bug
bvt: you can't verify output of 'write 4096 adds' by hand, though ☟︎
asciilifeform: bvt: why not ? they all will look same neh
asciilifeform: it's no moar complicated than to count bricks in a wall.
asciilifeform: and besides, the wall wouldn't even have that many bricks -- on a 64bit bus, a 4096-bit addition is 64 instructions long
asciilifeform: ( if you have an add-with-carry instr )
asciilifeform: there's an obvious limit to what you can unroll and still fit in any plausible cache tho.
asciilifeform: ( you won't be unrolling an entire 4096bit modexp, on any plausible irons... )
bvt: this is true, rigth.
asciilifeform: the inner loop or 2, tho, definitely can.
asciilifeform: i'll add , for completeness of thread, that if yer ~sending~, rather than receiving, rsa packets, your bottleneck will be ~rng~ long before it could ever be the arithmetron per se
asciilifeform: ( a single FG , recall, yields ~7kB/s at room temp )
asciilifeform: a fast rsatron is important mainly in light of fast rejection of crapola sent by enemy, rather than for payload per se. ☟︎
asciilifeform will bbl