Hiep Doan

How to monitor your Redis-based job queue system?

May 04, 2019

A distributed job queue is a very popular pattern for processing heavy batch work, which would otherwise require much ‘bigger’ (read: more expensive) hardware to complete.

Let’s illustrate with an example. Your system needs to send a newsletter to all of your users (e.g. 100k users) on a weekly basis. Since it is more important that all users receive the email than that they all receive it at the same time, or at an exact moment, it’s probably useful to split the work into, for example, 100 tasks of sending email (each covering 1,000 users).

Why is this approach better? Sending to all users can take a very long time, and if the system crashes, it affects all of them; even after a restart you might not know which users have already been emailed (I do hope you store a flag in the database when an email is sent to a user). By contrast, if you split the task, a crash affects only a small subset of users and you only need to resume the unprocessed tasks, instead of re-running the whole big task from the beginning.

Implementation

One of the most popular ways to implement such a queue is to use three Redis lists: a waiting queue, a processing queue and a dead-letter queue.

Specifically, a task is inserted into the waiting queue. A worker then polls it for processing and, at the same time, moves it to the processing queue. A separate process checks whether a task has been sitting in the processing queue for too long (i.e. it has probably not been executed successfully) and then either re-queues it (moves it back to the waiting queue) or moves it to the dead-letter queue for re-examination.
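
As a rough sketch of these operations (assuming the Jedis client; the queue key names are hypothetical and only for illustration), the core moves could look like this:

import redis.clients.jedis.Jedis;

class SimpleJobQueue {

  private final Jedis jedis = new Jedis("localhost", 6379);

  // Producer: push a new task (e.g. a job id) onto the waiting queue.
  void enqueue(String jobId) {
    jedis.lpush("queue:waiting", jobId);
  }

  // Worker: atomically move a task from waiting to processing and return it.
  String poll() {
    return jedis.rpoplpush("queue:waiting", "queue:processing");
  }

  // Worker: remove the task from the processing queue once it succeeds.
  void markDone(String jobId) {
    jedis.lrem("queue:processing", 1, jobId);
  }

  // Watchdog: give up on a task that failed repeatedly and park it in the dead-letter queue.
  void moveToDeadLetter(String jobId) {
    jedis.lrem("queue:processing", 1, jobId);
    jedis.lpush("queue:dead-letter", jobId);
  }
}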

The detailed implementation is out of the scope of this note, but if you are interested, feel free to check out my GitHub repo.

Monitoring

Monitoring is a big part of any software system. In the next part, I want to discuss a way to monitor the operation of the distributed queue.

In particular, we can use the following structure to monitor a job’s execution:

import java.time.LocalDateTime;
import java.util.UUID;

class JobMetadata {

  private UUID id;
  private Integer count;
  private LocalDateTime insertedAt;
  private LocalDateTime executedAt;
  private LocalDateTime finishedAt;
}

where:

  • insertedAt: the time when the job is inserted into the waiting queue.
  • executedAt: the time when the job is moved to the processing queue.
  • finishedAt: the time when its execution is complete (and thus, it is deleted from the processing queue).
  • count: how many times a worker has attempted to execute the job.

When a task is inserted into the waiting queue, it is assigned a UUID, and we can use that id to construct a Redis key under which its JobMetadata is stored. This guarantees O(1) time complexity when finding and updating the job metadata.
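
As a sketch (again assuming Jedis; the job:metadata: key prefix is just an illustrative choice), the metadata could be kept in a Redis hash keyed by the job’s UUID:

import java.time.LocalDateTime;
import java.util.UUID;
import redis.clients.jedis.Jedis;

class JobMetadataStore {

  private final Jedis jedis = new Jedis("localhost", 6379);

  // Hypothetical key scheme: one hash per job, addressable directly by UUID.
  private String key(UUID jobId) {
    return "job:metadata:" + jobId;
  }

  // Called when the task is pushed onto the waiting queue.
  void recordInserted(UUID jobId) {
    jedis.hset(key(jobId), "insertedAt", LocalDateTime.now().toString());
    jedis.hset(key(jobId), "count", "0");
  }

  // Called when a worker moves the task to the processing queue.
  void recordExecuted(UUID jobId) {
    jedis.hset(key(jobId), "executedAt", LocalDateTime.now().toString());
    jedis.hincrBy(key(jobId), "count", 1);
  }

  // Called when the task finishes and is removed from the processing queue.
  void recordFinished(UUID jobId) {
    jedis.hset(key(jobId), "finishedAt", LocalDateTime.now().toString());
  }
}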

Making sense of the metadata

Ideally, we want the durations between insertedAt, executedAt and finishedAt to be as small as possible.

  • If the difference between executedAt and insertedAt is too long, it means that jobs are stuck in the waiting queue and you should scale up the number of workers (horizontal scaling).
  • Similarly, if finishedAt is too long after executedAt, it implies that either your job is too big (i.e. you should consider splitting it) or your worker machine is not powerful enough (thus, it probably needs vertical scaling).

Also, it’s worth noting that ‘too long’ here depends entirely on the business needs of your system. To continue the above example, let’s say that all users should receive the newsletter between 3 and 4pm, so you need to make sure that all the tasks are completed within that one-hour window.

The last metric is count, which of course you want to be 1 all the time. If that is not the case, check whether something went wrong during execution (for example, an exception was thrown).
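
As a rough illustration of these checks (a sketch only: the thresholds are arbitrary, the getters on JobMetadata are assumed to exist, and alert is just a placeholder), a periodic checker could look like this:

import java.time.Duration;

class JobMetadataChecker {

  // Purely illustrative thresholds; pick values that match your business needs.
  private static final Duration MAX_WAIT = Duration.ofMinutes(10);
  private static final Duration MAX_PROCESSING = Duration.ofMinutes(5);

  void check(JobMetadata job) {
    Duration waited = Duration.between(job.getInsertedAt(), job.getExecutedAt());
    Duration processed = Duration.between(job.getExecutedAt(), job.getFinishedAt());

    if (waited.compareTo(MAX_WAIT) > 0) {
      alert("Jobs are piling up in the waiting queue: consider adding workers");
    }
    if (processed.compareTo(MAX_PROCESSING) > 0) {
      alert("Job took too long to process: consider splitting it or scaling up the worker");
    }
    if (job.getCount() > 1) {
      alert("Job was retried " + job.getCount() + " times: check for failures during execution");
    }
  }

  private void alert(String message) {
    // Placeholder: hook this up to your alerting channel of choice.
    System.out.println(message);
  }
}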

Finally, we have the dead-letter queue, which you can inspect to see whether there are items in it and investigate why those tasks could not be executed. In our system, for example, when a task is moved to the dead-letter queue, we send a Slack notification so the engineering team can investigate and react as soon as possible.
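
A sketch of such a check, assuming Jedis, the hypothetical dead-letter key from above and a placeholder Slack webhook URL, could look like this:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import redis.clients.jedis.Jedis;

class DeadLetterWatcher {

  private final Jedis jedis = new Jedis("localhost", 6379);
  private final HttpClient http = HttpClient.newHttpClient();

  // Placeholder: replace with your own Slack incoming-webhook URL.
  private static final String SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ";

  // Run periodically (e.g. from a scheduler) to warn about dead-lettered tasks.
  void checkDeadLetterQueue() throws Exception {
    long size = jedis.llen("queue:dead-letter");
    if (size > 0) {
      String payload = "{\"text\": \"" + size + " task(s) in the dead-letter queue, please investigate\"}";
      HttpRequest request = HttpRequest.newBuilder()
          .uri(URI.create(SLACK_WEBHOOK))
          .header("Content-Type", "application/json")
          .POST(HttpRequest.BodyPublishers.ofString(payload))
          .build();
      http.send(request, HttpResponse.BodyHandlers.ofString());
    }
  }
}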

Storing the job metadata

During the execution of a job, we can use Redis to store its metadata to benefit from fast read-write latency. However, we should flush the data to an analytics database (e.g. Cassandra or BigQuery) afterwards, as we do not want to fill up Redis memory.

For example, you can also add another field, jobType, to JobMetadata and then run a query to see the average time taken by each type of job.
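
As an illustration (written in Java here for consistency, though in practice this would be a query against the analytics database; the jobType field and the getters are assumed additions to JobMetadata), computing the average processing time per job type could look like this:

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class JobMetadataAnalytics {

  // Average processing time (executedAt -> finishedAt) in milliseconds, grouped by job type.
  Map<String, Double> averageDurationByType(List<JobMetadata> jobs) {
    return jobs.stream().collect(Collectors.groupingBy(
        JobMetadata::getJobType,
        Collectors.averagingLong(job ->
            Duration.between(job.getExecutedAt(), job.getFinishedAt()).toMillis())));
  }
}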

Recap

In this post, I discussed how we can implement a monitoring architecture for a Redis-based job queue. For a detailed implementation, visit my GitHub. Pull requests or comments are highly welcome.

This article is also posted on Medium.


Hiep Doan

Written by Hiep Doan, a software engineer living in beautiful Switzerland. I am passionate about building things that matter and sharing my experience along the journey. I also selectively post my articles on Medium. Check it out!