Advanced subsearches and transactions in Splunk: Tracing qmail deliveries
The email in question was part of a larger mail processing job, and we're using qmail to process these mails. Yes, qmail – it works great when it comes to doing high-volume, outbound-only deliveries in short time.
The problem challenge is that qmail has an interesting way of logging in the
current log, which looks like this:
@400000005aa66b052a527324 new msg 33778541@400000005aa66b052a527edc info msg 33778541: bytes 7703 from <email@example.com> qp 21534 uid 64011@400000005aa66b052a840e5c starting delivery 7512293: msg 33778541 to remote firstname.lastname@example.org@400000005aa66b060a418aac delivery 7512293: success: 184.108.40.206_accepted_message./Remote_host_said:_250_OK_id=1evM4J-0005W8-QC/@400000005aa66b060a419a4c end msg 33778541
I am not talking about the funny-looking tai64 timestamps, but rather the message and delivery ids. There are several problems with this:
- The message id is based on the Linux filesystem inode id for the mail file sitting in the queue. While being unique at a given time, multiple different mails will use the same message id over time.
- The delivery id is just a counter that increments with every message processed. It will start from scratch if you restart qmail, and so again, this id is not unique over a longer time.
- While you get the information that a particular delivery has been started for a given message id, all further information regarding the progress of this delivery is logged only with the delivery id, but does not show the message id again. That's probably due to the way the qmail architecture uses different processes for isolated tasks.
In order to get a comprehensive Splunk report for a given email address and to make it run in acceptable time, I had to learn about Splunk subsearches and transaction grouping.
Use a subsearch to narrow down relevant events
First, lets start with a simple Splunk search for the recipient address.
index=mail sourcetype=qmail_current email@example.com
In particular, this will find the
starting delivery events for this address, like the third log line shown above. Having done our homework, Splunk extractions are set up in a way that we get the
qmail_delivery fields for this event.
Now, in order to get a complete report including delivery progress, we need to consider all log events that include either the appropriate message id or delivery id.
With a default Splunk subsearch, the outer search will get all events where every field returned from the subsearch matches.
This works because Splunk applies the
format command implicitly on subsearches. Try this:
index=mail sourcetype=qmail_current firstname.lastname@example.org | fields qmail_msg qmail_delivery | format
This will return a single event with a field named
search and a value like
( ( qmail_delivery="8227046" AND qmail_msg="33565415" ) OR ( qmail_delivery="7947353" AND qmail_msg="33719121" ) OR ...)
Splunk will first execute the subsearch. Then, the value from this
search field is taken as a replacement for the subsearch part of the query. Finally, the resulting query is executed.
You can, in fact, put the format command in your subsearch yourself and use parameters to modify the resulting string. Let's do this and directly combine it with a subsearch:
index=mail sourcetype=qmail_current [ search index=mail sourcetype=qmail_current email@example.com | fields qmail_msg qmail_delivery | format "" "" "OR" "" "OR" ""]
This search fetches all log events that either have a message id or a delivery id for any message or delivery ids that appears in context with the recipient address.
Group events as transactions
You can then use the transaction command to group events. As you can see from the log excerpt above, qmail transactions start with
new msg and end with
end msg. We need to pick the message id from the start event and also include all events that have this message id and occur before the end event.
Additionally, the delivery id makes up the transaction. The transaction command is smart enough to pick up any delivery id that appears together with the message id we're following, and it can use this delivery id to include further events, even if they lack the message id. The transaction documentation has an example for this.
Adding a maximum duration between single transaction events for added performance, this gives us
| transaction qmail_msg qmail_delivery startswith="new msg" endswith="end msg" maxpause=1h
Now the last step is to once again filter out transactions that do not contain our intended recipient. Those transactions may show up because, as I stated in the beginning, the ids used by qmail are not unique. So while the subsearch finds the right ids for our particular recipient, the outer search may produce too many intermediate results.
We remove those transactions by applying a final
| search firstname.lastname@example.org
Putting it all together, we get
index=mail sourcetype=qmail_current [ search index=mail sourcetype=qmail_current email@example.com | fields qmail_msg qmail_delivery | format "" "" "OR" "" "OR" "" ] | transaction qmail_msg qmail_delivery startswith="new msg" endswith="end msg" maxpause=1h | search firstname.lastname@example.org
Most Splunk queries like this seem to do magic. And once they work, you tend to forget how they work or why they probably produce correct results. In part, this is why I am writing this up as a note to my future self.
Also, because a query like this is hard to remember or to re-construct, I saved this search as a Splunk dashboard. That allows you to add a handy input box for the address et voilà, we have a simple-to-use interface for our next support call.
Der strategische Kopf hinter unseren Softwaresystemen behält bei der Entwicklung und Betreuung komplexer Architekturen den Überblick, kümmert sich um die technische Infrastruktur oder berichtet über seine Erfahrungen und Erkenntnisse auf der Symfony User Group.