50900+ entries in 0.03s

bvt: aha, i see.
this would also involve lots function call inlining as well
bvt: i guess i could do some experiments here.
the immediate question is
that ffa does plenty of FZ_Adds with different FZ'Length, so full unrolling would not really work (unless i miss something).
bvt: would
this actually be efficient on current arches?
this would stress instruction decoder a lot
bvt: are such machines even build
these days? i.e. for dsp, or other use-cases?
☟︎ bvt: actually,
that fft-like multiplication algorithm was developed in APL
bvt: i wonder how
this is a valid argument even
there. if nothing reads
the flag,
there are no pipeline dependecies.
bvt: agreed. on x86_64 is compiled
to ADD; SETC sequence, but who knows what happens on other arch/gcc version.
a111: Logged on 2019-01-20 16:23 asciilifeform: as i noted previously -- i do not expect
to find any moar ~asymptotic~ speedups for ffa algos , such
that are relevant
to
the sizes of numbers
typically used in public key crypto
mircea_popescu: apparently 200 reqs ~per 100ms~ (that's how long it
takes
to complete one) is not enough
to bring it down.
☟︎☟︎ mircea_popescu: asciilifeform ie, splits
the ints in bus width chunks and only does all 4 quadrants depending on results from
the others.
bvt: i'll be afk an hour or
two
bvt: i.e. W_Borrow/W_Carry still cause
the overhead?
mircea_popescu: i suspect double-wide mul might be implemented by lookup
tables.
bvt: my understanding is
that asmism would go only for lower-level ffa code, i.e. barret/modexp will remain as-is.
bvt: myeah, complexity does not go
too well with both constant-time and fits-in-head.
bvt: asciilifeform: a better attempt at
this algorithm can involve using comba at
the lower level.
bvt: mircea_popescu:
thanks
mircea_popescu: there's
that old joke re "i read good books
twice but bad books i don't read at all" which very much applies : if only one knew before looking whether an algo is fast or slow!
bvt: no, i mean
that particular paper. i initially wanted
to implement a fft-like multiplication algorithm, but got interested in
this karatsuba/toom when reading
the paper.
a111: Logged on 2019-01-20 15:49 mircea_popescu: bvt if you
think about it, it has
to be slower, because operations are fast and allocations slow.
bvt: nope, but
there were a few mentions of other algorithms in
the logs, so i decided
to have a look. don't remember how i arrived at
this particular one
bvt: re fft: i would wait until a use-case appears,
to at least understand what are
the requirements. do you have something in mind?
bvt: yes, actually in
this karatsuba/toom algo one can embed comba's
trick, but
the code would become even gnarlier
bvt: also,
the linked pdf contains one FFT-like algorithm i considered
to implement (using walsh
transform for convolution instead of fft). but it'd be even more complex