
Emergency maintenance - 2017/06/09


  • This topic is locked
12 replies to this topic

#1
Grumpy

    RawR

  • Administrators
  • 4,078 posts
  • Location: Here of course!


So I've been posting bits and pieces on Twitter, and Twitter is just so darn short...
 
Batoto runs on a number of servers, and there's one main server at the helm of all the others. It controls what processing goes where, balances the load, and handles certain key features, including uploads.
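To give a rough idea of what "at the helm" means, here's a toy sketch of the deciding-what-goes-where part. This is purely illustrative, not our actual code; all the names and numbers are made up:

```python
# Toy master: send each unit of work to whichever worker currently has
# the least load. In a real setup the load estimates would come from
# health checks rather than being hardcoded.
workers = {"web1": 0.42, "web2": 0.13, "img1": 0.77}  # name -> load estimate

def pick_worker(pool):
    """Pick the least-loaded worker name."""
    return min(pool, key=pool.get)

def dispatch(task_id, pool):
    target = pick_worker(pool)
    pool[target] += 0.01  # crude bookkeeping: each task adds a bit of load
    print(f"task {task_id} -> {target}")

for i in range(5):
    dispatch(i, workers)
```

The real thing also watches worker health and handles uploads, but the load balancing boils down to bookkeeping like this.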
 
I may be over-personifying, but every computer/server has a personality, and you tend to see it more when you push it to the max. This main server has been finicky for quite some time. Sometimes it doesn't boot up right... Sometimes, when it's pushed to extremes, it just starts being an ass. It has always made my maintenance run longer than expected, and it's the reason for a huge chunk of our downtime, percentage-wise.
 
The last tweet on May 17th (part1 part2) was about this server too. Whether it was an intentional DDoS or a swarm of haywire bots (they happen surprisingly often... *cough* baidu *cough*), I didn't bother to investigate, but it ignited jerkwad mode and pushed the server to the precipice of downtime. About 70%+ of my time sink and downtime was due to that. Historically, that wasn't even the worst event, but I don't intend to start another story time. Generally, after much effort, it goes back to normal.
 
Today though... it just went poof, with no visible cause. In fact, it was it going poof that made me think it was seeing abnormal traffic: the outflow was off because it wasn't juggling the load properly.
 
It's still running. And throwing no errors. No error logs relating to hardware. Temperatures fine, etc. Everything says it's fine. But it's running at like 1/10th of its normal speed, and no matter how much I try to console it, nothing seems to change.
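For the curious, the "everything says it's fine" checks amount to roughly this kind of snapshot. A minimal sketch using the third-party psutil library, assuming a Linux box with sensor drivers loaded:

```python
import os
import psutil  # third-party: pip install psutil

# One-shot health snapshot: load averages, CPU busy, memory, temperatures.
load1, load5, load15 = os.getloadavg()
print(f"load avg: {load1:.2f} {load5:.2f} {load15:.2f}")
print(f"cpu busy: {psutil.cpu_percent(interval=1.0):.1f}%")
print(f"mem used: {psutil.virtual_memory().percent:.1f}%")

# Temperatures are only populated if the kernel exposes hardware sensors.
for chip, entries in psutil.sensors_temperatures().items():
    for t in entries:
        print(f"{chip} {t.label or 'temp'}: {t.current} C")
```

All of that comes back normal on this box; the slowness just doesn't show up in any of the obvious numbers.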
 
So, as a last resort, I signed up for an entirely new server and hooked everything up. That's how performance has been restored. (Would appreciate some donations. Actually, I'm not out much, since I'll be cancelling the previous one later and I stick with monthly contracts.)
 
The old main server is doing very little right now, and adding even a tiny bit more would make it cry again.
 
The new server still needs to be handed more of the less demanding but still numerous tasks, including handling of uploads (which is next in priority; upload now complete). Once all of that is done, it will be the new main server. Do expect some minor downtime as I progress with this, but we should generally stay online without the super lag we've been seeing all day.
 
I'll keep y'all updated with any significant progress.



#2
Nozomi

    Potato Spud

  • Members
  • 12 posts

Thanks for keeping up on things, Grumpy. :)


"It is a joyful thing indeed to hold intimate converse with a man after one’s own heart, chatting without reserve about things of interest or the fleeting topics of the world; but such, alas, are few and far between."

- Yoshida Kenko (1283-1350), Tsurezure-Gusa (1340)

#3
Grumpy

    RawR

  • Administrators
  • 4,078 posts
  • Location: Here of course!

Okay. I think the upload is working.

 

That probably takes care of all the critical areas. Still a bunch to go, but I don't have to do it now. :P

 

Waiting for a slew of bug reports...



#4
Grumpy

    RawR

  • Administrators
  • 4,078 posts
  • Location: Here of course!

I think everything is migrated now. Everything that matters anyway...

 

If something is broken but was working before, please report it here. Thank you.

 

p.s. I know the flash uploader sometimes works and sometimes doesn't. Just refresh a few times or try a different browser. It's weird... I know. Or use the remote upload.

p.p.s. Batoto Twitter: @BatotoStatus. Bookmark or follow it for this kind of stuff. I can't use the forums to post status updates if the forums are down. I don't use it unless I have something to say about the status, though. So... don't ask me questions there.



#5
kant

    Potato Sprout

  • Members
  • 2 posts

Just curious (I work as a sysadmin), what kind of hardware does this master server have?

If it's running a reasonably new Linux distro, you could try tracing tools like SystemTap to figure out where the bottleneck is (though something easier, like a default netdata install, may suffice if the root cause is not too esoteric).
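If you want something even more minimal than netdata, a crude first pass is to watch how CPU time splits between user, system, and iowait. A sketch (Linux-only, reading /proc directly, purely illustrative):

```python
import time

def cpu_jiffies():
    # First line of /proc/stat: cumulative jiffies per CPU state.
    with open("/proc/stat") as f:
        fields = f.readline().split()
    return [int(x) for x in fields[1:8]]  # user nice system idle iowait irq softirq

before = cpu_jiffies()
time.sleep(5)
after = cpu_jiffies()

deltas = [b - a for a, b in zip(before, after)]
total = sum(deltas) or 1
names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]
for name, d in zip(names, deltas):
    print(f"{name:8s} {100.0 * d / total:5.1f}%")
# High iowait points at storage; high system time points at kernel/network work.
```

That won't find anything esoteric, but it tells you which subsystem to aim the real tracing tools at.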



#6
Grumpy

    RawR

  • Administrators
  • 4,078 posts
  • Location: Here of course!

Just curious (I work as a sysadmin), what kind of hardware does this master server have?

If it's running a reasonably new Linux distro, you could try tracing tools like SystemTap to figure out where the bottleneck is (though something easier, like a default netdata install, may suffice if the root cause is not too esoteric).

Nothing too fancy. The web software is fairly easy to expand horizontally without any specialized hardware (except the DB server, which needs SSDs, and the image servers), so I have a bunch of smaller servers. The newest one I deployed was an E3-1230v6. All servers run CentOS 7. The beefiest I ever got was an E5-1650, and that was because at the time it had a good price-to-performance ratio at the host. Bigger servers just tend to be less value-efficient, so I tend to stay away from them.

 

Obviously the bottleneck/issue exists somewhere, since something slowed down, but I haven't been able to really pinpoint it. I have another server running the exact same hardware, same OS, same services (except being the master, but that's a tiny amount of work), yet it's often able to pull 2~3x more workload than this one when this one starts throwing a fit (10x for yesterday). Then sometimes, before I even get there to see the problem, the problem just vanishes. If this were super expensive hardware, I think it'd warrant a much more in-depth investigation, but it's low end and on a monthly contract. I think it's easier for me to just ditch it, since time is a much more expensive resource for me.



#7
kant

    Potato Sprout

  • Members
  • 2 posts

Nothing too fancy. The web software is fairly easy to expand horizontally without any specialized hardware (except the DB server, which needs SSDs, and the image servers), so I have a bunch of smaller servers. The newest one I deployed was an E3-1230v6. All servers run CentOS 7. The beefiest I ever got was an E5-1650, and that was because at the time it had a good price-to-performance ratio at the host. Bigger servers just tend to be less value-efficient, so I tend to stay away from them.

 

Obviously the bottleneck/issue exists somewhere, since something slowed down, but I haven't been able to really pinpoint it. I have another server running the exact same hardware, same OS, same services (except being the master, but that's a tiny amount of work), yet it's often able to pull 2~3x more workload than this one when this one starts throwing a fit (10x for yesterday). Then sometimes, before I even get there to see the problem, the problem just vanishes. If this were super expensive hardware, I think it'd warrant a much more in-depth investigation, but it's low end and on a monthly contract. I think it's easier for me to just ditch it, since time is a much more expensive resource for me.

Ah, I misinterpreted your previous post and thought that the new server had the same issue (which would mean that something in the software was hitting a bottleneck).



#8
Alacia

    Russet Potato

  • The Company
  • 230 posts
  • Location: USA

Thanks for your work, Grumpy.



#9
phoenixalia

    Russet Potato

  • Contributor
  • 229 posts

Thank you so much, Grumpy.



#10
spdeey

    Potato Sprout

  • Members
  • 3 posts

thank you grumpyyyyyyy



#11
alfadeboc

    Potato Sprout

  • Members
  • 1 post

From your description it seems that you use an SSD for the DB. That might explain why the server is becoming unreliable and slower.

 

If the DB is updated often, the related SSD cells will wear out and be swapped with cells that are still fresh.

Over time, the sectors of the SSD that are the least written to, and would be expected to be reliable, will surprisingly become slower to read. Worn-out cells cause bit errors, and sector reads will be retried until the errors can be corrected.

Unfortunately, SSD sectors that are seldom updated are often those that contain the most critical parts of the system, like the kernel, binaries, and libraries. Expect strange behaviour at first, and later a general slowdown with no apparent cause.

 

I recommend that you copy the SSD content to a reliable disk and replace the SSD before it is entirely worn out.

 

A worn-out SSD will kill itself (by design) at the first reset or power-off. Just before that, it should turn into a read-only disk, but don't rely on it. Check the "SSD endurance experiment".
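You can check how worn your SSDs are from SMART data. A sketch that shells out to smartctl (from the smartmontools package; needs root, /dev/sda is just an example, and attribute names differ between vendors, so treat this as illustrative):

```python
import subprocess

def smart_report(device):
    """Dump the SMART attribute table for a device via smartctl."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Many SSDs report remaining life as attribute 177 (Wear_Leveling_Count)
# or 233 (Media_Wearout_Indicator); which one exists depends on the vendor.
for line in smart_report("/dev/sda").splitlines():
    if "Wear" in line:
        print(line)
```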



#12
Grumpy

    RawR

  • Administrators
  • 4,078 posts
  • Location: Here of course!

Oh, just some closure.

 

After running hardware tests on the ex-main server for 3 straight days now, I have found zero faults. Guess I have to wait until some random day of the month for it to act up again. It's like it's perfect again. It's so... weird...

 

I put it back into the gulag, but as a slave, not the master this time (I have time remaining on the contract, might as well use it up). So if it fails, it won't be quite as catastrophic.

 

The site as a whole is running very well, and overall site performance right now is probably the best it's been all year, since I have an extra +1 server. The site is running at a 1.00 Apdex score (rounded to the 2nd decimal) at peak traffic (right now is peak). That's out of a 10898 sample size: 10836 pages served in under 0.5s, 49 under 2s, and 13 over 2s. Kinda tempted to just keep it since I'm getting such nice stats. Maybe I'll stick another ad somewhere to balance it out later. Got enough donations to keep that one server up for ~2 months. Thanks guys :)
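For anyone wondering how that works out to 1.00: the standard Apdex formula is (satisfied + tolerating/2) / total, and with the 0.5s target implied above, responses under 0.5s count as satisfied and those under 2s (4x the target) as tolerating. The arithmetic, using the figures from this post:

```python
# Apdex = (satisfied + tolerating/2) / total, with target T = 0.5s
# (tolerable up to 4T = 2s; anything slower counts as frustrated).
satisfied, tolerating, frustrated = 10836, 49, 13
total = satisfied + tolerating + frustrated  # 10898, the sample size above

apdex = (satisfied + tolerating / 2) / total
print(f"apdex = {apdex:.4f}")  # 0.9966, which rounds to 1.00 at 2 decimals
```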


From your description it seems that you use an SSD for the DB. That might explain why the server is becoming unreliable and slower. [...]

The SSDs I'm running aren't that old. The wear-out indicator is at 93% across all of them. Also, it's not the database server that's acting up.



#13
yellowFish

    Potato Sprout

  • Members
  • 1 post
  • Location: Under my blanket.

Hello Grumpy,

 

I've been using Batoto for quite some time now, and while I was reading this post I noticed that you may be the only one here doing technical stuff (maybe I'm wrong), and that you definitely have a lot on your plate. Since I didn't see anything related to this, I was wondering: is there a git repo for Batoto where we could help you maintain the website, or are you working in the shadows with a few trusted people?

 

Thanks for the hard work