Hacker Read

chasil | karma 5927 | avg karma 2.19 · 2023-07-31 13:14:39

> "I have a backup job that is triggered by a timer. I want to know when that job fails so I can investigate and fix it."

This is really more in the realm of a shell script.

You could do this verbosely:

  #!/bin/sh

  /path/to/my/backup_job

  if [ $? -ne 0 ]
  then /path/to/my/failure_alert
  fi

...or, you could do this tersely:

  #!/bin/sh

  /path/to/my/backup_job || /path/to/my/failure_alert

The wrapper script would go into the service unit that your timer triggers (as its ExecStart= command). I like dash.
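
One caveat with either form above: the wrapper exits with the status of failure_alert, so a successful alert makes the whole run look successful to whatever launched it. A minimal sketch that keeps the job's own exit status, assuming a hypothetical helper named run_with_alert (not from the post) and an alert command that accepts the failing status as its first argument:

```shell
#!/bin/sh
# Sketch: run a job, alert on failure, and return the job's own exit status.
# run_with_alert is a hypothetical helper name used only for illustration.
run_with_alert() {
    job=$1
    alert=$2

    "$job"
    status=$?

    if [ "$status" -ne 0 ]; then
        # Assumption: the alert command takes the failing status as $1.
        "$alert" "$status"
    fi

    # Report the job's status, not the alert's.
    return "$status"
}

# Example call with the placeholder paths from the post:
# run_with_alert /path/to/my/backup_job /path/to/my/failure_alert
```

With this shape the caller still sees a non-zero exit when the backup fails, even if the alert itself went out fine.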

javajosh | karma 18162 | avg karma 3.39 · 2023-07-31 14:40:06

That's great but isn't the real question about what goes in /path/to/my/failure_alert?

chasil | karma 5927 | avg karma 2.19 · 2023-07-31 15:10:47

The original poster hinted that "notifications" and email were options.

For SMS text message notifications, I use an AWK script to send SMTP to an email-SMS gateway. I keep these under the 160-character limit and send them only in extraordinary situations (high server room temperature, a decoy port on the firewall being triggered, which hints at an intrusion, etc.). I don't want this blowing up my phone.
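
For the 160-character limit, a rough sketch of trimming an alert before it is handed to whatever sends it. The function name sms_trim is hypothetical, and this assumes plain ASCII text; how cut -c counts multibyte characters varies by implementation.

```shell
#!/bin/sh
# Sketch: flatten an alert to one line and cap it at 160 characters.
# sms_trim is a hypothetical name; ASCII input is assumed.
sms_trim() {
    # Newlines become spaces so the message stays on a single line,
    # then cut keeps only the first 160 characters of that line.
    printf '%s' "$1" | tr '\n' ' ' | cut -c1-160
}

# Example:
# sms_trim "backup_job failed on $(hostname)"
```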

For email, I have a MIME pack script that allows me to send a message with an arbitrary number of base64-encoded attachments.

Does that cover what might be in a failure alert script?

justin_oaks | karma 3123 | avg karma 3.98 · 2023-07-31 15:16:22

That might be a good first step, but certainly isn't sufficiently robust.

What happens when the /path/to/my/failure_alert script fails?

What happens when your backup job returns success but didn't generate any output?

What happens when you turn off the systemd timer for a while and forget to turn it back on?

What happens when the server stops running, has a full disk, or has a networking issue?

Ultimately, some of the other answers are better. You should have a separate system monitoring this. And that separate system should track every time a backup happens, either by checking the backup exists at the target location (good), or checking that the backup system sent a "Yes, I did a backup" message (ok, but not as good).
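
As a sketch of the "check that the backup exists at the target location" option, something like the following could run on the monitoring side. The function name backup_is_fresh and its arguments are hypothetical, and find's -mmin test is assumed to be available (GNU and BSD find both have it, but it is not in POSIX).

```shell
#!/bin/sh
# Sketch: succeed only if a backup file exists, is non-empty, and is recent.
# backup_is_fresh is a hypothetical helper; -mmin is a GNU/BSD find extension.
backup_is_fresh() {
    file=$1
    max_age_min=$2

    # -s is true only when the file exists and has a size greater than zero,
    # which also catches a job that "succeeded" but wrote nothing.
    [ -s "$file" ] || return 1

    # find prints the path only if it was modified less than
    # max_age_min minutes ago; empty output means the backup is too old.
    [ -n "$(find "$file" -mmin "-$max_age_min" 2>/dev/null)" ]
}

# Example: alert if the latest backup is missing, empty, or over a day old.
# backup_is_fresh /path/to/backup/target 1440 || /path/to/my/failure_alert
```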

I use Telegraf for data collection, InfluxDB (v1) as a time series database, and Grafana (v7) for graphing and alerts. I'm using an older version of InfluxDB and Grafana because they just work and keep on working. Many other tools will work just as well as these do. I'm just giving them as an example.

Such a system may seem like overkill to just keep track of a few things, but you need something that'll tell you when you get no data. So at a minimum you'll want something on a separate server and you'll want it to send alerts when an expected event doesn't happen.
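
The "alert when an expected event doesn't happen" idea can be sketched without the full stack, assuming the backup job writes a Unix timestamp to a heartbeat file on success and the separate server can read that file. The name heartbeat_is_stale and the file layout are hypothetical, and date +%s is assumed to be supported (it is common but not required by POSIX).

```shell
#!/bin/sh
# Sketch: report "stale" when the last recorded success is too old or absent.
# heartbeat_is_stale is a hypothetical helper; the heartbeat file is assumed
# to hold a single Unix timestamp written by the backup job on success.
heartbeat_is_stale() {
    file=$1
    max_age=$2                    # allowed age in seconds
    now=${3:-$(date +%s)}         # optional override, handy for testing

    # No readable heartbeat at all counts as stale: no data is itself a signal.
    [ -r "$file" ] || return 0

    last=$(cat "$file")
    # Anything that is not a plain non-negative integer is treated as stale.
    case $last in
        ''|*[!0-9]*) return 0 ;;
    esac

    [ $((now - last)) -gt "$max_age" ]
}

# Example: alert if no successful backup was recorded in the last 26 hours.
# heartbeat_is_stale /path/to/heartbeat 93600 && /path/to/my/failure_alert
```

The point of the design is that silence triggers the alert, so a disabled timer or a dead server gets noticed as well as a failed job.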

chasil | karma 5927 | avg karma 2.19 · 2023-08-01 04:41:50

The original poster asked for simple detection of non-zero exit status.

What you speak of is far, far beyond the original question.

I am quite pleased with the reaction to my post, and I do not feel the need to compare technical merit.

Perhaps you would be happier with JCL?

In any case, enjoy your tooling.
