Linux - Server
This forum is for the discussion of Linux software used in a server-related context.
Say I have a farm with a few hundred web servers. How does one manage that? What practices are common for starting and stopping that many servers, validating that all of them are intact and up to date, checking which are up and responding, aggregating logs, and so on?
Is there anything off the shelf to manage and monitor a large number of instances of either or both? What about standalone JBoss instances? If you want to boot up a hundred instances across fifty hosts, what are the options?
It depends on the infrastructure you're running on. For instance, if you were managing your hosts on infrastructure-as-a-service (like Apache Mesos, OpenStack, or any of the many hosted cloud providers), then I would recommend following immutable-infrastructure practices. That is, for each application stack you should have the following concepts:
Building frozen images of all required parts.
Provisioning your infrastructure using the images you've built.
At run time, linking up dependent services, including things like backends, logging, and monitoring.
Example technology stack:
Configuration management: use Ansible to configure services by dropping in or modifying configuration files and installing required software, typically via the package manager, in an idempotent way. This covers roughly 95% of the operating-system configuration you need customized.
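A minimal sketch of what driving Ansible from the shell looks like. The inventory file name, group name, package, and paths here are illustrative assumptions, not anything from the post:

Code:
# hosts.ini is a hypothetical inventory with a [webservers] group.
ansible webservers -i hosts.ini -m ping                    # reachability check
ansible webservers -i hosts.ini -b -m yum -a "name=nginx state=present"
ansible webservers -i hosts.ini -b -m copy \
    -a "src=./nginx.conf dest=/etc/nginx/nginx.conf mode=0644"
ansible webservers -i hosts.ini -b -m service -a "name=nginx state=reloaded"

Running the same commands a second time changes nothing; that is the idempotency in action.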
Use HashiCorp Packer to "bake" your operating system images. That is, it makes API calls to a cloud service to start an operating system instance, runs Ansible to configure it, then takes a snapshot, which is used in the next step as one of your building blocks.
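Roughly, a bake looks like this (a hypothetical template with a placeholder source AMI; the amazon-ebs builder and ansible provisioner are standard Packer plugins):

Code:
cat > webserver.json <<'EOF'
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-0123456789abcdef0",
    "instance_type": "t3.micro",
    "ssh_username": "ec2-user",
    "ami_name": "webserver-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "ansible",
    "playbook_file": "site.yml"
  }]
}
EOF
packer validate webserver.json
packer build webserver.json    # boots an instance, runs Ansible, snapshots it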
Use HashiCorp Terraform to describe the infrastructure layout of your service. E.g., say you used Packer to bake a proxy image and an application-server image, and your application requires persistent storage for data that must survive a system restart. You would use Terraform to declare that you have a proxy server in front of an application server, with a slice of disk mounted on the application server for persistent storage.
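A hedged sketch of that layout in Terraform, assuming AWS (the AMI ID, size, and device name are placeholders; a real setup would also declare the proxy instance and networking):

Code:
cat > main.tf <<'EOF'
provider "aws" {
  region = "us-east-1"
}

# Application server booted from the Packer-baked image.
resource "aws_instance" "app" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.micro"
}

# A slice of disk that survives instance replacement.
resource "aws_ebs_volume" "appdata" {
  availability_zone = aws_instance.app.availability_zone
  size              = 100
}

resource "aws_volume_attachment" "appdata" {
  device_name = "/dev/xvdf"
  volume_id   = aws_ebs_volume.appdata.id
  instance_id = aws_instance.app.id
}
EOF
terraform init
terraform apply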
Post-startup initialization can be handled by cloud-init. This is where the final configuration steps are performed: writing out configuration files with host names, formatting and mounting your persistent storage if it is a raw disk, and enabling and starting (or restarting) services once cloud-init has configured them post-boot. This is the remaining 5% of operating-system configuration you need customized but can only do here, because runtime information like IP addresses wasn't available during the image bake.
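Cloud-init will execute any user-data that begins with a shebang, so that final 5% can be a plain shell script. A sketch, with placeholder device, paths, and service name:

Code:
#!/bin/bash
# Format the raw persistent disk on first boot only, then mount it.
if ! blkid /dev/xvdf; then
    mkfs.ext4 /dev/xvdf
fi
mkdir -p /data
mount /dev/xvdf /data

# Write out configuration that needs runtime information (our IP),
# then restart the already-baked-in service to pick it up.
echo "advertise_addr = $(hostname -I | awk '{print $1}')" > /etc/myapp/runtime.conf
systemctl restart myapp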
Other considerations:
Log aggregation and searching: Elasticsearch, Logstash, and Kibana are an oft-recommended stack for centralized logging (commonly referred to as ELK).
Metrics and monitoring: I have personally enjoyed using Telegraf (shipping metrics), InfluxDB (time-series metric storage), and Grafana (a UI frontend providing dashboards and alerting based on those metrics). HashiCorp Consul also lets you define services that report their health.
Service discovery: HashiCorp Consul can be used for service discovery, simple DNS names for discovered services, and as a key-value store where you can keep things like the UUIDs of Packer-baked images, adding process around your provisioned infrastructure (e.g. separate environments like dev, qa, stage, prod). A minimal service definition is sketched after this list.
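To give the Consul items above concrete shape (a minimal sketch; the service name, port, and check URL are assumptions): registering a health-checked service is just dropping a JSON definition into Consul's configuration directory and reloading.

Code:
cat > /etc/consul.d/nginx.json <<'EOF'
{
  "service": {
    "name": "nginx",
    "port": 80,
    "check": {
      "http": "http://localhost:80/",
      "interval": "10s"
    }
  }
}
EOF
consul reload
# Only healthy instances are returned via Consul's DNS interface:
dig @127.0.0.1 -p 8600 nginx.service.consul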
There are a lot of tools mentioned here, but the cool thing is that you can scale services to hundreds of servers with little effort if you do it right. Mesos with Marathon is also pretty cool because it allows neat things like elastic scaling (i.e., automatically provisioning more application servers and proxies when your web service is under heavy load, and removing some when load is light).
In my particular setup, we're provisioned Red Hat machines that we don't get root on, and essentially have to do everything via ssh. Ansible is doable, but I don't know anything about it and my Python sucks. Working on that last bit, though.
Essentially, my problem is that I just don't know what's out there or how to look it up.
Say you walked into a shop with a hundred nginx servers all defined and set up, and were asked to implement a mechanism that would let someone start or stop an arbitrary slice of them from a central location with a limited number of commands. Assume you don't have to manage the configuration file, though being able to would be a bonus. How would you approach it?
You need to be root to start and stop most services (binding to ports below 1024 requires root, though a service listening only on higher ports can run unprivileged). There are plenty of solutions for configuration management. "Python sucks" is an opinion; Python has strengths and weaknesses, like all programming languages. I tend to define my goals for what I want to accomplish and look for the right tool for the job regardless of the technology it's built on, taking into account the strengths of the team I'm working with as well (I am not an island; I work with other people).
If you're on a Linux workstation, check out clusterssh. Or you can maintain a flat file of hostnames and shell scripts that execute across them. If I had to write such a script, I would make it run in parallel across all hosts and use flock to serialize the output to a file. Bash can handle that.
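A bare-bones sketch of that script (hosts.txt, the log path, and the example command are placeholders):

Code:
#!/bin/bash
# Run one command on every host in hosts.txt, in parallel; flock
# keeps each host's output block intact in the shared log file.
CMD=${1:?usage: $0 '<remote command>'}   # e.g. 'sudo systemctl stop nginx'
LOG=/tmp/fleet.log
: > "$LOG"

while read -r host; do
    (
        out=$(ssh -n -o ConnectTimeout=5 "$host" "$CMD" 2>&1)
        {
            flock -x 9                   # exclusive lock on fd 9
            printf '== %s ==\n%s\n' "$host" "$out" >&9
        } 9>>"$LOG"
    ) &
done < hosts.txt
wait
cat "$LOG"

Point it at a different flat file and you have your "arbitrary slice": run ./fleet.sh 'sudo systemctl stop nginx' against whichever hostnames are listed.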