💡 Background processing is a critical part of any web backend, and Sidekiq is one of the most popular background job processing systems for Ruby, for good reason; it can also be integrated as the backend to the ActiveJob interface. Many projects inside Raksul utilize Sidekiq, and although Sidekiq provides several queue processing strategies out of the box (random, strictly ordered, and weighted job priorities), you can still encounter queue and job priority management problems that those strategies cannot solve easily. This small article tackles some examples of such problems, together with some principles I've learned during my work about how to prioritize background jobs appropriately.
Basic queue processing strategies
- Strictly ordered job priorities: declare queues, without weight options, in the order you want them processed; a later queue is only checked when all earlier queues are empty.
```yaml
# config/sidekiq.yml
...
:queues:
  - critical
  - default
  - low
```
- Weighted job priorities: the weight you put on each queue determines how frequently it is checked relative to the other queues.
```yaml
# config/sidekiq.yml
...
:queues:
  - [critical, 5]
  - [default, 2]
  - [low, 1]
```
- Random job priorities: just queues that are all set to the same weight. Not really practical.
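To get a feel for what those weights mean in practice, here is a toy pure-Ruby simulation of weighted mode (my own sketch, not Sidekiq's actual internals): the queue list is expanded by weight and shuffled before each fetch, so with weights 5/2/1 the critical queue ends up being polled first roughly 5 times out of 8.

```ruby
# A toy model of weighted queue selection (a sketch, not real Sidekiq code):
# each queue name is repeated by its weight, and the list is shuffled before
# every fetch, so higher-weight queues tend to be polled first.

WEIGHTS = { "critical" => 5, "default" => 2, "low" => 1 }.freeze

# "critical" appears 5 times, "default" twice, "low" once.
POOL = WEIGHTS.flat_map { |name, weight| [name] * weight }.freeze

def first_queue_polled(rng)
  POOL.shuffle(random: rng).uniq.first
end

rng    = Random.new(2024)
counts = Hash.new(0)
100_000.times { counts[first_queue_polled(rng)] += 1 }

counts.each do |queue, count|
  puts format("%-8s polled first in %.1f%% of fetches", queue, 100.0 * count / 100_000)
end
```

Over many fetches the ratios converge to 5:2:1, which is exactly the "relative frequency" the weights promise: higher-weight queues are serviced more often, but low-weight queues are never fully starved.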
sidekiq-priority is also an interesting gem that helps you prioritize specific jobs with specific arguments. It can push certain jobs ahead of other jobs (even of the same job class) that are waiting in the same queue. It's pretty handy in some niche scenarios.
Principles & tips about prioritizing jobs
- Jobs with different criticality levels must be separated into different queues. Don't be afraid to use as many queues as you want, as long as all the jobs in the same queue have similar priority and criticality levels. Mailing jobs can be put in the same queue as Elasticsearch indexing jobs or ActionCable broadcasting jobs, for example. Don't stop at critical_as_f*ck; you may have midnight_schedules, etc. as well. Separate queues with clear intentions will help you a lot later on.
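For illustration, assigning jobs to intention-revealing queues can be sketched like this (class names and queue names are invented examples; note that every queue must also be listed in your sidekiq.yml, or no process will ever pick it up):

```ruby
# app/jobs/ — a sketch of per-queue job declarations (hypothetical classes).
class OrderConfirmationMailerJob
  include Sidekiq::Job          # Sidekiq::Worker on Sidekiq < 6.3
  sidekiq_options queue: "mailers"
end

class ProductIndexingJob
  include Sidekiq::Job
  sidekiq_options queue: "indexing"
end

class NightlyReportJob
  include Sidekiq::Job
  sidekiq_options queue: "midnight_schedules"
end
```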
- Balancing jobs based on criticality alone is not enough, as obviously not all jobs are born equal. Some may flood the workers in a short period, some may consume a big chunk of execution time, and some only run at a specific hour of the day. That's why there are a few other metrics you should consider when assigning jobs to queues: blocking potential, average execution time, frequency of arrival, and last but not least, latency tolerance.
- Blocking potential is the potential of a job to entirely block all other jobs in the same queue from being processed, causing noticeable latency.
- Jobs that combine a high average execution time (on the order of minutes or hours) with a high arrival frequency will exhibit high blocking potential as well.
- Latency tolerance is a metric by which you evaluate the impact of latency on the nature of the job itself. For example, indexing jobs that affect search results usually have high latency tolerance, since most of the time, users don't know exactly what should be in the search results they're looking at.
- Sometimes high latency tolerance translates to low criticality, and vice versa, but that's not always the case. Some jobs may have high criticality but can still tolerate some latency (Elasticsearch indexing jobs, for example). Those can be put together with other jobs that are also highly critical but have high blocking potential (in other words, jobs with the potential to block the entire queue for a brief period, causing latency for the other jobs in it).
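To make the interplay of these metrics concrete, here is a toy heuristic of my own (not anything Sidekiq provides) that maps criticality, blocking potential, and latency tolerance onto queue names; all thresholds and queue names are invented for illustration:

```ruby
# A toy queue-assignment heuristic (my own illustration, not a Sidekiq API).
# Critical, latency-sensitive work gets its own fast lane; critical work that
# tolerates latency or blocks a lot gets a separate "bulk" lane so it cannot
# starve the fast lane.
def pick_queue(criticality:, blocking_potential:, latency_tolerance:)
  return "critical"      if criticality == :high && latency_tolerance == :low
  return "critical_bulk" if criticality == :high
  return "low"           if blocking_potential == :high || latency_tolerance == :high
  "default"
end

puts pick_queue(criticality: :high, blocking_potential: :low,  latency_tolerance: :low)  # e.g. a payment job
puts pick_queue(criticality: :high, blocking_potential: :high, latency_tolerance: :high) # e.g. ES indexing
puts pick_queue(criticality: :low,  blocking_potential: :high, latency_tolerance: :high) # e.g. a nightly report
```

The exact rules matter less than the shape of the decision: criticality picks the lane, while blocking potential and latency tolerance decide who can safely share that lane.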
Advanced queuing with multiple processes (swarm as in Sidekiq Enterprise)
- What if two different job classes have similar criticality, but cannot tolerate each other, because one has high blocking potential and the other has low latency tolerance? And you want them both processed in the same timeframe without stepping on each other's toes?
- What if you want your high-priority queues to be processed at the highest speed regardless of any events in other queues? With ordinary weighted queues, unfortunate timing can cause high-priority queues to be contaminated with, or even blocked by, a mass of low-priority jobs that simply arrived at a bad time.
- What if you provision a fancy compute node with lots of CPU cores hoping to speed up job processing, only to find out later that no matter how many threads your Sidekiq process spawns, they all run on the same core? (All threads of a Sidekiq process run concurrently rather than in parallel, due to the global interpreter lock in CRuby; adding threads eventually hits diminishing returns as the allocated core becomes saturated.)
→ Here comes multi-process Sidekiq to the rescue (or better, pay $1900/year for Sidekiq Enterprise's swarm, for a minimum config of 10x10 threads):
```yaml
# config/sidekiq_urgent.yml
...
:queues:
  - critical
  - default
```
```yaml
# config/sidekiq_relax.yml
...
:queues:
  - default
  - low
```
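One way to run the two processes side by side (a deployment sketch: the filenames match the configs above, and `sidekiq -C` is the real flag for pointing a process at a specific config file) is a Procfile for foreman or your process manager of choice:

```
# Procfile
sidekiq_urgent: bundle exec sidekiq -C config/sidekiq_urgent.yml
sidekiq_relax:  bundle exec sidekiq -C config/sidekiq_relax.yml
```

With this layout the urgent process keeps draining critical work at full speed even while the relaxed process chews through a backlog of low-priority jobs, and each process gets its own address space and (potentially) its own CPU core.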
Balancing queues between processes serves meaningful purposes: ensuring maximum throughput for VIP queues, offering some fault tolerance when one of your processes encounters problems, and better utilizing computing resources (since multiple processes can spread across multiple cores and have address spaces of their own).
The same principles that apply when balancing jobs between queues can be applied to balancing queues between processes too. And in the broader picture, to balancing processes between compute nodes / containers, or balancing containers across availability zones. They will allow you to provide the utmost availability and stability for your Sidekiq fleet.