asciilifeform: ( i suppose technically this is 4th expedition -- BingoBoingo orig pioneer, and then ben_vulpes . but asciilifeform's 2nd. )
asciilifeform: there's a list of itches that needs scratching and it got longer today.
asciilifeform: mircea_popescu: acceptable for nao , but i'm inclined to make 2nd expedition sooner rather than later.
asciilifeform: ( why needs gear ? because can flash even nao, but you want to get orig rom contents ~out~ 1st, for a good looking at. and then there is 'unbrick' event also. )
mircea_popescu: "we dunno wtf happened, i guess to a certain mind events centered on jan 13th suspicious as fuck, we have a new kernel and we redid the bios just in case, dunno what more can be possibly done" is, or at least i guess will have to be, acceptable.
asciilifeform: it is on the cargo manifest for 2nd.
asciilifeform: mircea_popescu: i do. i actually bit my elbows after 1st expedition, that i did not include the necessary gear for this in the 1st crate.
mircea_popescu: do you suppose bios could benefit from a reflash ? if nothing else, to have thermal / crash logging as expected ?
a111: Logged on 2019-01-14 16:48 asciilifeform: after we get to the bottom of UY1 issue, i'ma make sure that all iron owned by pizarro has asciilifeform-baked kernel in place.
mircea_popescu: i recall folks asking, and you saying ok, rather than "you know what... i don't even recall who made this kernel, maybe i remake it when i have time before oking this"
asciilifeform: i'd much rather folx did those on rk's..
mircea_popescu: quite well snapped to the peculiar idiocy of a certain band of peculiar idiots.
asciilifeform: insert typos ?
asciilifeform: there's the 'what used for' q also. i'm still at a loss, i'll admit, re what is to be gained even from root on a box hosting blogs.
mircea_popescu: but yes, i run systems which had 0 unexpected reboots, and i've thrown out components / replaced / redesigned systems over unexpected reboots.
asciilifeform: i recall there was a piece where mircea_popescu threw out a raid card
mircea_popescu: depends what i use them for, yes.
asciilifeform: mircea_popescu: what do you typically do with yours ? throw'em out after 1 peculiar reset ? 2 ?
asciilifeform: cuz that's a perfectly valid q.
asciilifeform: i'd like not to lose the orig thread, re whether box was interfered with.
mircea_popescu: not proposing this is foolproof or anything ; not strictly speaking ~impossible~ thermal trip went unnoticed.
asciilifeform: i thought we were discussing the outrageous howler that 'dc would warn if 1 box in a 42u has 1 chip 40c over temp'
mircea_popescu: you'd just notice box went from spewing 50s to spewing 55s suddenly. or w/e.
mircea_popescu: there's a discrete flow of air from each rack that they measure.
asciilifeform: and the halon gets pumped.
asciilifeform: mircea_popescu: that'd be for ambient air. if yer ambient atmosphere is 70c, this is called 'room on fire'
BingoBoingo: asciilifeform: I asked. If you have some targeted questions I will be happy to ask them.
asciilifeform: mircea_popescu: 'is 3 tonne auto here or not' is very diff problem from 'what temperature is the red hot nail inside this 30kg crate'
mircea_popescu: thermal trip is usually >70s or somesuch
mircea_popescu: same fucking thing is the case in ~every dc i ever saw, there's a line of sensors above the racks, and can tell whether box is working 30s, 40s or 70s
asciilifeform: BingoBoingo: do they log these somewhere ? how didja learn of it
mircea_popescu: you ever been in one of those parking lots where they have devices telling you how many free spots per aisle/level, and red/green light above the individual spots ? without, magically, having a rod up your driver's ass.
BingoBoingo: <mircea_popescu> i ~also~ find it peculiar your dc wouldn't have alerted you in case of thermal trip. because in general they have sensors. << There was a ground fault alarm tripped in the datacenter's fire suppression system over the weekend, but the time doesn't line up with the beginning of this reset crisis.
asciilifeform: ( and if this wasn't part of the deal you had with'em, oughta have a stern talk )
asciilifeform: mircea_popescu: if a dc were able to warn you re overheating cpu, they had root on yer box.
asciilifeform: aimed at what ? the outside ? you'd need a bomb calorimeter and whole thing in transformer oil, to stand any chance of distinguishing a frying cpu from a working one through closed chassis
mircea_popescu: w/e, the spot things.
mircea_popescu: i suppose this is an academic discussion. in any case, i've had warnings re hot boxes from dcs before, it's not a wholly unheard of item. laser sensor costs ~nothing, and hvac management is 60% of what they do for a living.
asciilifeform: ( and a cpu is a ~1g thermal mass. )
BingoBoingo: I took off a light layer of particulate. When I opened the chassis I found it matched the photos from when the FG were installed.
asciilifeform: what's the temp 1m from a hot iron ? ~room.
mircea_popescu: the aisle cooler tends to notice if rack x is spewing out 200C
asciilifeform: dc has nfi what temp is inside our box
mircea_popescu: i ~also~ find it peculiar your dc wouldn't have alerted you in case of thermal trip. because in general they have sensors.
mircea_popescu: photo or no photo. did you take your own weight in gunk out of the fans over there ?!
asciilifeform: BingoBoingo: didja happen to photo the internals when you opened ?
mircea_popescu: the case here seems to strike out on both of these.
mircea_popescu: and then, when i went to clean it, i found it dirty ; as opposed to clean.
mircea_popescu: asciilifeform i did. let me tell you how it behaved : box went down. upon reboot it went down again, in the following manner : every time it was rebooted, within a finite time interval (bout an hour). no exceptions.
asciilifeform: box runs till it doesn't, then behaves rather like this.
mircea_popescu: asciilifeform this observation merely begs the question of "why".
mircea_popescu: indulge me. so the theory goes that an event with a probability inferior to 1e-4 / day occurred three times in two days ?
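( editor's note: a rough check of the arithmetic behind the question above. It assumes, purely for illustration, that spontaneous resets are independent and arrive at a constant rate, i.e. a Poisson process; the 1e-4 / day rate and the "three times in two days" are the figures quoted in the log, while the function name and the model are this note's own assumptions. )

```python
import math

def prob_at_least(k, rate_per_day, days, terms=60):
    """P(N >= k) for a Poisson count with mean rate_per_day * days.

    Assumes independent events at a constant rate (an illustrative
    model, not something established in the log). The tail is summed
    directly, rather than as 1 - CDF, to avoid cancellation when the
    result is tiny.
    """
    mu = rate_per_day * days
    return sum(
        math.exp(-mu) * mu ** n / math.factorial(n)
        for n in range(k, k + terms)
    )

# Figures from the log: rate below 1e-4 per day, 3 resets in 2 days.
p = prob_at_least(3, 1e-4, 2)
print(f"P(>= 3 resets in 2 days) = {p:.3e}")
```

( under those assumptions the result is on the order of 1e-12, which is the point of the question: three independent random faults in two days is not a plausible reading, so the resets presumably share a cause, whatever it turns out to be. )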
asciilifeform: mircea_popescu: keep in mind this was a ben_vulpes-baked box, i never saw inside of it ( dulap-III, dulap-spare, s-mg, s-mg-spare -- i cleaned with own hands )
mircea_popescu: ie, a fan stopped by itself, and then started working again, by itself ?
asciilifeform: imho mle is still thermals.
mircea_popescu: (i think the third reboot was actually you guise, or not ?)
mircea_popescu: so to get this straight, your "most likely explanation" points to... ram failure resulting in kernel panic... twice ?
asciilifeform: they aint always visibly burst, but i'm inclined to think this wasn't the caps.
asciilifeform: http://www.loper-os.org/?p=1871 << typical example of when caps.
mircea_popescu: i never saw a mobo bust a cap and then boot by itself again tho. besides, he'd see a busted cap i imagine.
mircea_popescu: but i mean... so you have ten year old gb ram ?
asciilifeform: ssds are new, no one buys used ssd, it'd be like buying used toiletpaper
mircea_popescu: i guess i must've mixed things.
asciilifeform: ( can't speak for colo subscribers such as trinque , referring to pizarro irons )
mircea_popescu: weren't you shipping a bunch of new rams to make it ?!
BingoBoingo: From what I understand the ram came with the chassis
asciilifeform: mircea_popescu: 100% of the x86 iron in the cage is 2009-11 vintage.
mircea_popescu: i'm confused, you bought used ram ? i seem to recall a discussion...
asciilifeform: the only new iron in the cage is the rk's.
mircea_popescu: but the ram in that box is as new as a kitten.
asciilifeform: ( why -- i do not know. but ram appears to age, possibly ion migration )
asciilifeform: ram, typically
mircea_popescu: kudos to you, but nevertheless.
mircea_popescu: outside of hard drives, and capacitors on OVER FIFTEEN YEAR OLD motherboards, i have not witnessed this wonder myself, of failing hardware.
mircea_popescu: i confess i have nfi what makes you think commodity hardware failed in this case.
asciilifeform: but indeed i'd much like to move to a 'near-errybody on rockchips' , ~these~ can approximate the ideal of 'treat irons as toilet paper, discard on 1st sign of rot'
mircea_popescu: so then as a factual matter, if asciilifeform threw out erry box that rebooted "by itself" for no apparent reason, pizarro would be missing uy1
asciilifeform: i haven't succeeded in crashing a rockchip yet ( outside from the rotting usb ssd's affair )
mircea_popescu: is this a fact, eg the rockchips ?
asciilifeform: the sad part tho is that if asciilifeform threw out erry retardix box that ever kernel panicked, would have none left in service
mircea_popescu: and the coincidence is there, and glaringly. we know for a fapt inept http://trilema.com/2012/law-enforcement-never-fails-to-unintentionally-entertain/ efforts ~signature move~ is rebootage, much like olde smersh signature move was the clicking on phone line ; and as he points out, buncha people moved their zncs there.
asciilifeform: mircea_popescu: i can't disagree, and am inclined to move it to cold spare when we get another crate in.
mircea_popescu: this stance is consistent with, inter alia, republican practice -- we moved variously boxes off providers who kept rebooting them "mysteriously"
trinque: BingoBoingo: signaling my willingness to pay, not my condoning of the outage.
asciilifeform: mircea_popescu: fwiw crashism is more typical result of failed diddling than working.
BingoBoingo: <mircea_popescu> the one concerning bit is whether indeed pizarro still owns that box or not. << This very much concerns me
asciilifeform: mircea_popescu: this q can be asked re any box.
BingoBoingo: trinque: My thought on the month is that the money being paid for shared hosting is very real to our customers, and we lack a firm hour count on how many customer uptime hours have been lost.
mircea_popescu: the one concerning bit is whether indeed pizarro still owns that box or not.
a111: 5 results for "how many bugs tolerate", http://btcbase.org/log-search?q=how%20many%20bugs%20tolerate
asciilifeform: !#s how many bugs tolerate
trinque: ftr I don't need a month's comp for a few hours of outage, though a few hours of outage does suck.
asciilifeform: BingoBoingo: it's in the log
BingoBoingo: During palm touch tests before cleaning fans the warmest part of the chassis was near the RAID card, by a margin that though small registered on my skin. Do we have a way to instrument the RAID card?
trinque: conspicuous bit is various folks having moved their comms aboard uy1
asciilifeform: hanbot & other subscribers to uy1 : plox to inform asciilifeform asap if you notice ~any~ unusual behaviour on this box ( not only reset , reset will be obvious from here )
asciilifeform: hanbot: today i set up realtime stream of system log + voltages + temperatures + fan rpm to the torture room, was expecting to find thermal problem, so far 0
asciilifeform: hanbot: we dun know yet wtf reset the box ( and it happened no fewer than 3 times, in 2 day span )
asciilifeform: been up continuously since i set up sensors earlier today, and still alive, with 0 anomalous readings
BingoBoingo: hanbot: The investigation is ongoing. Other than the chassis interior being marginally cleaner than before asciilifeform instrumented the machine, answers remain elusive. Per http://pizarroisp.net/2019/01/14/pizarro-isp-update-january-14-2019/#selection-13.0-17.315 I am inclined to not charge any Pizarro shared hosting customers for the month of January though I am open to hearing other suggested remedies.