Run crontab on a Python Script

Truly Automate the Boring Stuff with Python

When I was a beginner Python user, no book helped me progress more in my Python standard library skills than Automate the Boring Stuff with Python. It’s also one of our favorite books for data scientists. The book takes a short, concise approach to teaching Python so that you can automate common tasks. However, does it truly help you do these things day after day without having to lift a finger? No.

The book leaves out the critical task of job scheduling. Job scheduling lets you run scripts, or defined “jobs”, at given times so that you don’t have to lift a finger when you want a script to run. For instance, let’s say you created a script to download some data and put it into an Excel file. While you could go to the command line or your Jupyter Notebook each day/week/month to download this data, why not set up a job to pull this data down at a specific time, without intervention on your part? This is the power of job scheduling, and it is exactly what crontab, the most basic type of job scheduler, was created for.

Why use crontab for automation?

crontab is the simplest form of job scheduling there is. It is primarily used on Unix-based systems to run scripts at a specified time, or at recurring times, chosen by a developer.

Examples of schedules you can set up in crontab:

  • Running a script at a specific time of day
  • Running a script every minute of the day
  • Running a script once a month

We’ll walk through exactly how to set up each of these configurations (and more) in the tutorial below.

General crontab Command Structures

crontab is a general-purpose system that can be used for many types of scripts (shell, Java, Python, etc.); however, in this review we’ll focus only on the points important to Python.

The general structure of a crontab command is detailed below:

* * * * * cd /WORKING DIRECTORY/ && /LOCAL PYTHON PATH/ /WORKING DIRECTORY/py_script.py

The first component here is the * * * * * portion of the command, which specifies when your job runs. Each * represents a specific unit of time (minute, hour, day of month, month, and day of week) that can be modified. Note that there is no field for the year.

Any * left as is means “every value” of that unit. For instance, if we don’t change any of the * values, the script will run once a minute. So let’s dive into each field to make sure we understand what it does:

  • The first * is the minute, ranging from 0 to 59. If you set this to 5 and leave all the other * values alone (5 * * * *), the job will run at minute 5 of every hour. To run every 5 minutes, use the step form */5 instead (*/5 * * * *).
  • The second * is for hour, ranging from 0 to 23
  • The third * is for day of month, ranging from 1 to 31
  • The fourth * is for the month, ranging from 1 to 12
  • The fifth * is for day of the week, ranging from 0 to 6, with 0 being Sunday
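To make the field rules concrete, here is a minimal Python sketch that expands a single field into the values it selects. The function name expand_field is our own invention, not part of cron or any library, and it only handles a plain *, a single number, and the */N step form; real crontab also accepts lists and ranges, which are left out here.

```python
def expand_field(field, low, high):
    """Return the sorted values one cron field selects within [low, high].

    Handles only "*", a single number, and the "*/N" step form.
    """
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        if step < 1:
            raise ValueError("step must be at least 1")
        return list(range(low, high + 1, step))
    value = int(field)
    if not low <= value <= high:
        raise ValueError("{} is outside {}-{}".format(value, low, high))
    return [value]


print(expand_field("5", 0, 59))    # a single minute: one run per hour
print(expand_field("*/5", 0, 59))  # every fifth minute: twelve runs per hour
```

Comparing the two printed lists shows why a bare 5 in the minute field is not the same thing as “every 5 minutes”.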

If this was confusing, here are some examples below to help you through the logic:

A script that runs every minute:

* * * * * cd /Users/user.name/dataanalysis/test && /usr/local/bin/python /Users/user.name/dataanalysis/test/py_script.py

A script that runs every day at 8:05AM local time:

5 8 * * * cd /Users/user.name/dataanalysis/test && /usr/local/bin/python /Users/user.name/dataanalysis/test/py_script.py

A script that runs at 8:05AM local time on February 29th, which means only in leap years:

5 8 29 2 * cd /Users/user.name/dataanalysis/test && /usr/local/bin/python /Users/user.name/dataanalysis/test/py_script.py
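If you want to sanity-check a schedule before handing it to cron, the sketch below tests whether an expression would fire at a given moment. The helpers field_matches and cron_matches are hypothetical names of our own, and the code assumes every field is either * or a single number. Because February 29 only exists in leap years, the last example can only ever match in one.

```python
import calendar
import datetime


def field_matches(field, value):
    """True if a field ("*" or a single number) accepts the given value."""
    return field == "*" or int(field) == value


def cron_matches(expression, moment):
    """Check whether a five-field cron expression fires at the given minute."""
    minute, hour, day, month, weekday = expression.split()
    # cron counts Sunday as 0, while isoweekday() counts Sunday as 7.
    cron_weekday = moment.isoweekday() % 7
    day_ok = field_matches(day, moment.day)
    weekday_ok = field_matches(weekday, cron_weekday)
    if day != "*" and weekday != "*":
        # When both day fields are restricted, cron runs if either matches.
        days_ok = day_ok or weekday_ok
    else:
        days_ok = day_ok and weekday_ok
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(month, moment.month)
        and days_ok
    )


print(cron_matches("5 8 * * *", datetime.datetime(2023, 3, 14, 8, 5)))
print(cron_matches("5 8 29 2 *", datetime.datetime(2024, 2, 29, 8, 5)))
# Years in which the February 29th job can run at all.
print([year for year in range(2023, 2033) if calendar.isleap(year)])
```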

Please note that these commands are based on a local user environment on macOS, so you may need slightly different paths if running the script on a Linux system.
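Since the paths are the part that changes between machines, it can help to assemble the entry from its pieces. The sketch below does that in Python; build_cron_line is a hypothetical helper of our own, and the paths are simply the ones used in the examples above.

```python
import shlex


def build_cron_line(schedule, working_dir, python_path, script_path):
    """Assemble one crontab entry in the 'cd DIR && PYTHON SCRIPT' form."""
    if len(schedule.split()) != 5:
        raise ValueError("schedule must have exactly five fields")
    # shlex.quote protects paths that contain spaces or shell metacharacters.
    command = "cd {} && {} {}".format(
        shlex.quote(working_dir),
        shlex.quote(python_path),
        shlex.quote(script_path),
    )
    return schedule + " " + command


example_line = build_cron_line(
    "5 8 * * *",
    "/Users/user.name/dataanalysis/test",
    "/usr/local/bin/python",
    "/Users/user.name/dataanalysis/test/py_script.py",
)
print(example_line)
```

One caveat: a percent sign has a special meaning inside a crontab command, and this sketch does not escape it, so avoid paths that contain one.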

Set Up Your Python Environment

Run which python in your command line to get your local Python path. If you don’t have one, we recommend setting up Python 3 or Anaconda3 for scientific computing, which contains a ton of
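You can also ask Python itself for the path to put in your crontab entry. This is a small sketch using only the standard library; the variable names are ours. Keep in mind that the interpreter you run this with may not be the same one that which python reports.

```python
import shutil
import sys

# Absolute path of the interpreter running this code.
current_interpreter = sys.executable

# Roughly what `which python` reports: the first "python" found on PATH,
# or None when no such command exists.
python_on_path = shutil.which("python")

print("Running interpreter:", current_interpreter)
print("python on PATH:", python_on_path)
```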

Set Up Your Python Script

Our Python script in this tutorial is short and sweet. It downloads a small open source dataset and writes it to a timestamped .csv file in our local directory. To do this, we use a Pandas DataFrame and its to_csv() method, along with the datetime module from the Python standard library.

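import datetime

import pandas as pd

# Download the source dataset straight into a DataFrame.
data_url = "https://people.sc.fsu.edu/~jburkardt/data/csv/homes.csv"
df = pd.read_csv(data_url)

# Write the data to a timestamped .csv file in the current working directory.
timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
df.to_csv("test_output-" + timestamp + ".csv")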
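The timestamp format above puts a space and colons in the file name, which works on macOS and Linux but can be awkward in a shell and is not allowed on some file systems. As an optional variation, and not what the script above does, a hypothetical helper like timestamped_csv_path could build a friendlier name inside an explicit directory:

```python
import datetime
from pathlib import Path


def timestamped_csv_path(directory, moment=None):
    """Build a path like <directory>/test_output-2024-01-31_08-05-00.csv."""
    moment = moment or datetime.datetime.now()
    # Underscores and hyphens avoid the space and colons of the original name.
    stamp = moment.strftime("%Y-%m-%d_%H-%M-%S")
    return Path(directory) / ("test_output-" + stamp + ".csv")


print(timestamped_csv_path("/Users/user.name/dataanalysis/test"))
```

The returned path can be passed straight to to_csv(), and using an absolute directory means the output location no longer depends on which directory cron starts the job in.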

Some best practices in crontab & Python

One thing we want to do when we have cron running in the background is set up some kind of logging system so that we can see when Python produces output that may need to be reviewed. For instance, if you don’t name your file correctly, you can get an email telling you so.

This is achieved by setting the “MAILTO” variable in crontab to your email address. Once this is done, any output from the jobs in that crontab (there may be none at all if a job runs successfully) will be sent to your email address. Note that this relies on the machine being able to send mail.

MAILTO=your.email@email.com
* * * * * cd /Users/user.name/dataanalysis/test && /usr/local/bin/python /Users/user.name/dataanalysis/test/py_script.py

There are many other features of crontab that can be used with Python for other types of logging, some of which are detailed below.
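One alternative to email is to have the script write its own log file with Python’s logging module. The sketch below is one possible way to do that, not a crontab feature; make_job_logger, run_job, and the logger name cron_job are names we made up for illustration.

```python
import logging


def make_job_logger(log_path):
    """Return a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger("cron_job")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def run_job(job, logger):
    """Run job(), logging success or the full traceback on failure."""
    try:
        job()
    except Exception:
        logger.exception("job failed")
        return False
    logger.info("job finished")
    return True
```

In the script you would wrap the download-and-save steps in a function and call run_job(that_function, make_job_logger("/path/to/job.log")), where the log path is whatever location suits your setup.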

Set Up a Test Run

Now that we understand crontab, have our Python path at hand, have our script ready to use, and know some crontab best practices, we’re off to the races, so let’s set up a test run.

For simplicity’s sake we’ll use the script above with our MAILTO variable. As we can see from the * * * * *, this script will run each minute until the entry is removed (or, if it has a bug, it may not run successfully at all, but we’d find out from our email logs).

Now you need to run crontab -e in your command line and paste in your entry (changing the underlying details to match your setup).

MAILTO=your.email@email.com
* * * * * cd /Users/user.name/dataanalysis/test && /usr/local/bin/python /Users/user.name/dataanalysis/test/py_script.py

Once you save your crontab file, you’ll need to wait a minute to see the first run. Maybe go get a coffee and relax for a second. When you get back, check your email for alerts. If there are none, the job most likely succeeded, but you can verify by looking in the directory where the script is stored to see the output files it wrote. In this case, the script ran 4 times for me while I was making tea.
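If you would rather check from Python than from a file browser, a short sketch like this counts the output files. list_outputs is a hypothetical helper, and it assumes the test_output- prefix used by the script above.

```python
from pathlib import Path


def list_outputs(directory):
    """Return the test_output-*.csv files in directory, sorted by name."""
    return sorted(Path(directory).glob("test_output-*.csv"))


outputs = list_outputs(".")
print(len(outputs), "output file(s) found")
for path in outputs:
    print(path.name)
```

Because the timestamp in each name starts with the year and ends with the seconds, sorting by name also sorts the files by the time they were written.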

Summary

crontab and Python combined are probably the most powerful stepping stone toward truly automating the boring stuff with Python. Job scheduling itself is a huge discipline, well beyond the scope of this article; however, crontab is a good first foray into the topic and an important initial step on a longer journey to come. But who knows, plenty of people are highly productive with just a crontab script running regularly.

Now, while this example was kind of dry, think of the possibilities when you’ve got a machine learning pipeline or model you’d like to run daily, or a report that you want to generate once a week without worrying about it.

Andrew W. Owens

Analytics and sciences contributor and professional. Specializing in Python and GCP.
