asciilifeform: in asciilifeform's o(1) tx indexer ( will be welded to an experimental bdb once i get the mmap thing resolved ) there's a 2-level storage -- a 'write-once' o(1) index for blox of age N ( N can be 100-500 in practice ), and a much smaller rewritable one kept strictly in ram ( for 'recent' blox, where the longest chain is potentially movable )
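A minimal sketch of the 2-level arrangement described above, under loud assumptions: the class name, method names, and the use of plain Python dicts are all hypothetical, and the `frozen` dict merely stands in for the write-once on-disk index. It is only meant to show that a block's transactions reach the write-once tier exactly once, after the block is N deep, while reorgs touch nothing but the RAM tier.

```python
class TwoLevelIndex:
    """Hypothetical model: settled blocks go to a write-once tier, recent ones stay in RAM."""

    def __init__(self, depth):
        self.depth = depth        # N: how deep a block must be before it is treated as settled
        self.frozen = {}          # stand-in for the write-once index: txid -> height
        self.recent = {}          # rewritable RAM tier: height -> list of txids
        self.frozen_writes = 0    # number of writes ever made to the write-once tier

    def connect(self, height, txids):
        """Attach a new block at the tip; settle the block that is now `depth` below it."""
        self.recent[height] = list(txids)
        settled = height - self.depth
        if settled in self.recent:
            for txid in self.recent.pop(settled):
                self.frozen[txid] = settled
                self.frozen_writes += 1

    def disconnect(self, height):
        """Reorg: drop a recent block. Only the RAM tier is ever rewritten."""
        self.recent.pop(height, None)

    def lookup(self, txid):
        """Return the height of the block holding txid, or None if unknown."""
        if txid in self.frozen:
            return self.frozen[txid]
        # The RAM tier is small, so this sketch simply scans it.
        for height, txids in self.recent.items():
            if txid in txids:
                return height
        return None
```

With `depth=2`, connecting blocks 0 through 3 settles blocks 0 and 1; disconnecting and replacing block 3 afterwards changes only `recent`.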
asciilifeform: (tho, will add, the retarded design of bdb, gives early death on ssd. in sane indexer there is never any reason to touch any single block moar than once )
asciilifeform: not unless yer using rubbish ssd, at any rate
asciilifeform concurs that there's 0 point in debug of bdb, it's medicine for a corpse
asciilifeform: i'd look at the realloc counter on the disk, betcha it's started to turn.
asciilifeform: ( there -- found this twice, to date. )
asciilifeform: mircea_popescu: that's the interesting imho part. ftr i have yet to witness a corrupted-but-openable bdb on machine other than where ssd was found to have rotted.
asciilifeform: a fast rsatron is important mainly in light of fast rejection of crapola sent by enemy, rather than for payload per se.
asciilifeform: ( a single FG , recall, yields ~7kB/s at room temp )
asciilifeform: i'll add , for completeness of thread, that if yer ~sending~, rather than receiving, rsa packets, your bottleneck will be ~rng~ long before it could ever be the arithmetron per se
asciilifeform: the inner loop or 2, tho, definitely can.
asciilifeform: ( you won't be unrolling an entire 4096bit modexp, on any plausible irons... )
asciilifeform: there's an obvious limit to what you can unroll and still fit in any plausible cache tho.
asciilifeform: ( if you have an add-with-carry instr )
asciilifeform: and besides, the wall wouldn't even have that many bricks -- on a 64bit bus, a 4096-bit addition is 64 instructions long
asciilifeform: it's no moar complicated than to count bricks in a wall.
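The 'bricks in a wall' count above can be modelled as follows. This is a sketch of the data flow only: the helper names are hypothetical, and Python integers give no constant-time guarantee whatsoever. It shows that a 4096-bit addition on 64-bit words is 64 word additions chained through a carry, with no branch on the operand contents.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1

def to_limbs(x, nbits):
    """Split a non-negative integer into nbits/WORD_BITS words, least significant first."""
    return [(x >> (WORD_BITS * i)) & MASK for i in range(nbits // WORD_BITS)]

def from_limbs(limbs):
    """Reassemble an integer from words, least significant first."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(limbs))

def add_with_carry(a, b):
    """Add two equal-length word lists; return (sum_words, carry_out)."""
    carry = 0
    out = []
    for x, y in zip(a, b):
        s = x + y + carry          # one 'add-with-carry' step per word
        out.append(s & MASK)
        carry = s >> WORD_BITS     # 0 or 1, fed into the next word
    return out, carry
```

Adding 1 to the all-ones 4096-bit value walks the carry through all 64 words and leaves a carry-out of 1.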
asciilifeform: bvt: why not ? they all will look same neh
asciilifeform: not having used gcc5+ , i never saw this bug
asciilifeform: one would still want to audit the output of any such thing tho, by hand.
asciilifeform: bvt: a typical macroassembler would work for the purpose.
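As a stand-in for what a macroassembler would expand, here is a sketch that prints the unrolled instruction stream for one fixed-width addition. Everything about it is an assumption made for illustration: the function name, the Intel-syntax x86-64 text, and the choice of `rdi`/`rsi`/`rdx` as destination and operand pointers. The emitted text has not been assembled or run here, and per the remark above any such output would still want auditing by hand.

```python
def emit_unrolled_add(bits=4096, word_bits=64):
    """Return lines of x86-64 text computing dst[rdi] = a[rsi] + b[rdx], fully unrolled."""
    lines = []
    for i in range(bits // word_bits):
        off = i * (word_bits // 8)            # byte offset of word i
        op = "add" if i == 0 else "adc"       # first word has no incoming carry
        lines.append(f"mov rax, [rsi+{off}]")
        lines.append(f"{op} rax, [rdx+{off}]")
        lines.append(f"mov [rdi+{off}], rax") # mov leaves the carry flag untouched
    lines.append("setc al")                   # final carry-out lands in al
    return lines

if __name__ == "__main__":
    print("\n".join(emit_unrolled_add()))
```

For the default 4096-bit width this yields one `add` followed by 63 `adc`, with loads and stores in between and no loop counter or branch.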
asciilifeform: ( ave1 discovered how to guarantee working inlining, and this gave 'free' ~2x speedup )
asciilifeform: bvt: even the current (ch11 and after) ffa relies on a gnat with working forced-inlining
asciilifeform: but this limitation would also be true of any hypothetical arithmetic iron.
asciilifeform: granted, an unrolled ffa would operate on a fixed width (e.g. 8192) of primary fz.
asciilifeform: none of the lengths depend on the actual contents of the user input.
asciilifeform: just as you can de-recursivize the karatsuba etc
asciilifeform: bvt: all of the lengths are deterministically known from the primary fz width.
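To illustrate why the lengths are known ahead of time, the sketch below tabulates a Karatsuba recursion from the operand width alone. It assumes a variant in which each level performs three sub-multiplications of exactly half the width, power-of-two widths, and a hypothetical base-case threshold; the function name and the threshold value are not taken from ffa.

```python
def karatsuba_schedule(width_bits, base_bits):
    """Map operand width -> number of multiplications performed at that width."""
    schedule = {}
    width, count = width_bits, 1
    while True:
        schedule[width] = count
        if width <= base_bits:
            return schedule      # base case: ordinary (e.g. comba) multiplication
        width //= 2              # each level halves the operand width
        count *= 3               # and triples the number of sub-multiplications
```

For an 8192-bit primary width and a 512-bit base case, the whole tree is fixed before any input arrives: 1, 3, 9, 27 and 81 multiplications at widths 8192, 4096, 2048, 1024 and 512 respectively. That is what makes it possible to lay the work out flat instead of recursing.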
asciilifeform: the skipping itself is expensive enuff on iron with cache/branchpredictor, that he loses rather than wins from it.
asciilifeform: koch's turd, despite being implemented in c, with no bounds checks, actually loses to ch14 ffa , for inputs of same ~width~ -- despite fact that he doesn't constanttime and thereby gets to skip massive work
asciilifeform: the penalty from having any branches, of whatever kind, anywhere at all, on pc iron -- is substantial
asciilifeform: i found last yr, for instance, that unrolled comba ( still in ada ) gives 20-25% speedup.
asciilifeform: bvt: i expect one would trivially get a 10-20x speedup over the ordinary ffa, esp. if the item still fits in l1
asciilifeform: ideally one would unroll ~all~ of the loops ( e.g. instead of looping through the words of a bignum, would e.g. add-with-carry on immediately consecutive words with stream of add instrs, etc )
asciilifeform: ( this is not a contradiction in terms, it is possible to implement whole thing, with same constant-time algos, by hand asm )
asciilifeform: before considering to bake irons, it is worth to see what a 100%-asmic ffa would give.
asciilifeform: anyway pretty sure we had this thread, is in the logs somewhere.
asciilifeform: 1G/s link in principle delivers 262144 4096bit packets /sec ( in practice, many fewer, on acct of overhead )
asciilifeform: nao, it isn't as if the current ffa, with 2.7sec 4096-bit modexp, is immediately usable to eat packets at line rate. but that part at least theoretically parallelizes ( i.e. a rack fulla multicore boxen running ffa, can theoretically eat packets at line rate... )
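The line-rate arithmetic above, worked through as a sketch. The 262144 figure corresponds to reading '1G' as 2^30 bits; the decimal reading (10^9 bits) is included for comparison. The function names are hypothetical, overhead is ignored as in the original remark, and the core count assumes one modexp per packet at the quoted 2.7 sec each with perfect parallelism.

```python
PACKET_BITS = 4096
MODEXP_MILLISECONDS = 2700   # the 2.7 sec 4096-bit modexp quoted above

def packets_per_second(link_bits_per_second):
    """Whole 4096-bit packets a link can deliver per second, ignoring all overhead."""
    return link_bits_per_second // PACKET_BITS

def cores_needed(link_bits_per_second):
    """Cores required to keep up if each packet costs one modexp (rounded up)."""
    busy_ms = packets_per_second(link_bits_per_second) * MODEXP_MILLISECONDS
    return -(-busy_ms // 1000)   # integer ceiling division

binary_rate = packets_per_second(2**30)    # 262144
decimal_rate = packets_per_second(10**9)   # 244140
```

At 262144 packets/sec and 2.7 sec per modexp, `cores_needed(2**30)` comes to 707789, which gives a sense of the gap that a faster arithmetron would have to close.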
asciilifeform: ( is why all previously published rsatrons , entirely unsuitable -- if there's any leakage at all via timing, enemy trivially derives yer key )
asciilifeform: helps to recall that the problem which originally prompted asciilifeform to write ffa, is a (currently hypothetical) application where rsa sigs are carried in ~individual packets~
asciilifeform: ftr i suspect that entirely ordinary algos, such as are seen in the current ffa, would already give ~line-rate~ (i.e. , 4096 modexp faster than 1G/s nic can give you new inputs to modexp on ) if implemented in iron properly.
asciilifeform: more or less entirely opposite approach from what's wanted for crypto.
asciilifeform: in simple o(n) bignum operations like addition, the cost of instruction decoding for each consecutive 'add' , is substantial % of the cost
asciilifeform: bvt: possibly the bolix machines also ( they did it in vertical microcode, iirc, tho, and in nonconstant time unsurprisingly )
asciilifeform: bvt: notion was, you gotta stop the pipe if something ~were~ to read it
asciilifeform: could simply have optimized 'take these-here N words and those-there M words, and put bignum addition in memory starting at O, and overflow flag in P ' or similarly
asciilifeform: and no it dun have to have gigantic bus width, necessarily
asciilifeform: i find it interesting that -- afaik -- nobody's ever built iron that was specifically optimized for bignum
asciilifeform: misguided folx ~continue~ to build these, with the excuse given being 'pipeline'
asciilifeform: bvt: 1 of the reasons why ada doesn't offer e.g. addition-with-overflow , is that there is an abundance of sad iron where there isn't even physically a carry flag.
asciilifeform: ( ada standard btw trivially allows for types where this holds true automatically , i.e. throws exception for overflow. but this is not only massively unconstanttime but the overhead is gigantic )
asciilifeform: gcc's knob seems to be geared for scenarios where the overflow is an error condition, rather than expected.
asciilifeform: a correct asmism would simply read the carry flag and put the value where it belongs (e.g. in fz_add, into the next addition, in comba -- into the accumulator; etc)
asciilifeform: therefore that approach is completely verboten.
asciilifeform: ( '1st commandment' of ffa : thou shalt not branch on seekrit bits. '2nd commandment' -- thou shalt not index memory by seekrit bits ... )
asciilifeform: chances are that it wouldn't, tho, given how the table still has to be indexed via fz_mux in order to prevent variant (i.e. nonconstanttime) memory indexing
asciilifeform: ( in all fairness, a large -- e.g. 8bit -- window, ~could~ win, but massively multiplies the memory requirement for the thing )
asciilifeform: possibly i'ma do a writeup on the subj, once errything else is fielded.
asciilifeform: i prolly oughta add to the http://btcbase.org/log/2019-01-20#1888508 thing : 1 of the items which seemed like a speedup, but in actual practice sucked, was the use of (constant-time) 2 (ditto 4) -bit windows for modexp ( iirc apeloyee suggested )
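A sketch of why a windowed table lookup stays expensive under the two commandments above: the table cannot be indexed directly by secret bits, so every entry has to be passed through a mux. The names here are hypothetical and only in the spirit of fz_mux; entries are single 64-bit words for brevity (a real bignum would be muxed word by word), and Python itself offers no timing guarantees.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1

def mux(a, b, sel):
    """Return a if sel == 0, b if sel == 1, using masking instead of a branch."""
    m = (-sel) & MASK                 # sel=0 -> all zeros, sel=1 -> all ones
    return a ^ ((a ^ b) & m)

def is_equal(x, y):
    """Return 1 if the two words are equal, else 0, without comparing-and-branching."""
    d = x ^ y
    nonzero = ((d | (-d & MASK)) >> (WORD_BITS - 1)) & 1   # top bit set iff d != 0
    return nonzero ^ 1

def table_lookup(table, index):
    """Select table[index] while touching every entry in the same order each time."""
    result = 0
    for i, entry in enumerate(table):
        result = mux(result, entry, is_equal(i, index))
    return result
```

Every lookup costs a full pass over the table regardless of the index. For a k-bit window that is 2^k muxes per window, and the table itself grows as 2^k entries, which fits the memory concern raised above about an 8-bit window.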
asciilifeform: 1 annoying aspect of 'iron ffa'-gedankenexperiment, is that none of the available fpga ( either 'ice40' series, or the evil ones ) are anywhere near big enuff to prototype with. it'd have to be simulated a la http://www.loper-os.org/?p=2593 , slowly, and then straight to silicon.
asciilifeform has quite thick binder of curated material on subj, for the hypothetical day that we start baking irons
asciilifeform: note that the 'cube' observation only applies if you're going for a single-clock-cycle iron multer. otherwise it grows as square of bitness.
asciilifeform: mircea_popescu: possibly, it'll have to be tested when asciilifeform or somebody else can be arsed