Towards AI blog

Self-Hosting Airflow at Home: Automating Stock Price Data Collection

Thursday, June 18, 2026 | FS Stance
Author(s): FS Stance

Originally published on Towards AI.

One of the main goals of creating my home lab is to gain a deeper understanding of Machine Learning Operations (MLOps) and how to productionalize AI workflows. Generally speaking, MLOps and productionalization deal with moving AI models from research into a real-life environment, with automation and the ability to handle errors gracefully.

In my previous articles, I set up a PostgreSQL server and an Airflow server. These serve as the data foundation that supplies the datasets used to train AI models. Now we need to start filling our PostgreSQL databases with data. We can use Airflow to orchestrate data pipelines so that up-to-date data is loaded into our PostgreSQL database. Setting up this data foundation is typically the first step in the machine learning (ML) process after planning, as you need data to train your models. A major reason behind my home lab setup is that I want to show you can self-host the whole ML process with a couple of VMs and containers.

Since I have been doing a lot of personal investing lately, let’s work with finance data. With finance data, you can analyze trends, correlate prices, and even try forecasting, which makes it useful across many scenarios.

Configuring Airflow

Running your Airflow Server

Before we start, here are some things I learnt and implemented that make life easier with Airflow. When I set up the Airflow server previously, I used the following commands:

    nohup airflow scheduler > scheduler.log 2>&1 &
    nohup airflow dag-processor > dag-processor.log 2>&1 &
    nohup airflow triggerer > triggerer.log 2>&1 &
    nohup airflow api-server --port 8080 > api-server.log 2>&1 &

The nohup command makes a process ignore the hangup signal, so together with the trailing & it keeps running in the background even after you log out or close the shell. The redirection sends both standard output and standard error to the named log file.
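For readers who prefer to launch components from Python, the same pattern (run in the background, send both output streams to one log file, survive the terminal closing) can be approximated with the standard library. This is a minimal sketch, not what nohup does internally: it detaches the child by starting a new session rather than by ignoring the hangup signal, and start_new_session only has that effect on POSIX systems. The helper name start_detached and the demo command are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def start_detached(command: list[str], log_path: Path) -> subprocess.Popen:
    """Start `command` in the background, writing stdout and stderr to `log_path`."""
    # "wb" truncates the log, mirroring the shell's "> file" redirection.
    with open(log_path, "wb") as log_file:
        process = subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=log_file,           # like "> scheduler.log"
            stderr=subprocess.STDOUT,  # like "2>&1"
            start_new_session=True,    # detach from the terminal's session (POSIX)
        )
    # The child keeps its own copy of the file descriptor after we close ours.
    return process


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "demo.log"
        # Stand-in for a long-running command such as ["airflow", "scheduler"].
        proc = start_detached([sys.executable, "-c", "print('component started')"], log)
        proc.wait()
        print(log.read_text().strip())
```

In real use you would pass something like ["airflow", "scheduler"] and not wait on the process; the returned object's pid attribute plays the role of $! in the shell.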
The issue I ran into was that the Airflow components would crash, and I would have to log in to the Airflow server and re-run the commands. Another issue was that starting all the components required 4 commands, which was annoying to run every time. To solve the latter issue, you can write a script to start or restart Airflow.

    nano airflow_restart.sh

In the airflow_restart.sh file, copy the below code and save it.

    #!/bin/bash
    # Stop any running Airflow processes; --ignore-ancestors keeps pkill
    # from matching this script and the shell that launched it.
    pkill -f "airflow" --ignore-ancestors
    sleep 2

    # Start each component in the background and record its PID.
    echo "Starting scheduler..."
    nohup airflow scheduler > scheduler.log 2>&1 & echo $! | tee scheduler.pid
    echo "Starting dag-processor..."
    nohup airflow dag-processor > dag-processor.log 2>&1 & echo $! | tee dag-processor.pid
    echo "Starting triggerer..."
    nohup airflow triggerer > triggerer.log 2>&1 & echo $! | tee triggerer.pid
    echo "Starting api-server..."
    nohup airflow api-server --port 8080 > api-server.log 2>&1 & echo $! | tee api-server.pid
    echo "Airflow restarted"

This script kills any process whose command line contains “airflow” and restarts all 4 components. So instead of running all 4 commands, you can now just run the script.

To go a step further and solve the issue of having to restart the Airflow components every time they crash, you can set up Airflow as a system daemon process managed by systemd. A daemon is a background process that typically starts when you boot up your system, restarts when there are failures, and is detached from the shell so you can keep it running at all times.

First, create a file using:

    sudo nano /etc/systemd/system/airflow-scheduler.service

Next, copy the below text, put it in the file, and save it.
    [Unit]
    Description=Airflow Scheduler
    After=network.target postgresql.service
    Wants=postgresql.service

    [Service]
    User=<USER>
    Group=<GROUP>
    Environment=PATH=<AIRFLOW_PATH>:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
    Environment=AIRFLOW_HOME=<AIRFLOW_FOLDER>
    ExecStart=<AIRFLOW_PATH> scheduler
    Restart=on-failure
    RestartSec=5s
    StandardOutput=journal
    StandardError=journal

    [Install]
    WantedBy=multi-user.target

Replace <USER>, <GROUP>, <AIRFLOW_PATH>, and <AIRFLOW_FOLDER> depending on how you set up your Airflow server. Note that in ExecStart, <AIRFLOW_PATH> must resolve to the airflow executable itself, whereas the entries in PATH are directories. Repeat this for all the other components of Airflow (dag-processor, triggerer, api-server), creating one unit file per component and changing the Description and the subcommand in ExecStart to match.

Run the below so that systemd picks up the new unit files.

    sudo systemctl daemon-reload

Then, enable and start all the services.

    # Enable services on boot
    sudo systemctl enable airflow-scheduler
    sudo systemctl enable airflow-dag-processor
    sudo systemctl enable airflow-triggerer
    sudo systemctl enable airflow-api-server

    # Start services
    sudo systemctl start airflow-scheduler
    sudo systemctl start airflow-dag-processor
    sudo systemctl start airflow-triggerer
    sudo systemctl start airflow-api-server

Voila! You now have Airflow running on your server in a more robust way. You can use systemctl status to check whether each component is running and journalctl to see its logs.

Adding your PostgreSQL Connection

Another thing to do is to set up the PostgreSQL connection in your Airflow server so that it can talk to your PostgreSQL database. Airflow has a handy PostgreSQL connection type that lets you specify the connection parameters once and re-use them any time you want to connect to PostgreSQL.

Edit the airflow.cfg file.

    nano airflow.cfg

Find the test_connection setting (in the [core] section) and set it to Enabled. Restart Airflow (either with the script or by restarting the daemon processes). This allows you to test your connections through the Airflow UI.

Now, let’s go to the Airflow UI and add the PostgreSQL connection. On the left panel, select Admin > Connections.
On the top right, select Add Connection. Under Connection Type, select Postgres. Note that you may need to run pip install apache-airflow-providers-postgres on your Airflow server if you do not see Postgres as an option. Fill in your PostgreSQL Host (IP address), Login, Password, Port, and Database. You should see your new PostgreSQL connection.

To test the connection, click on the graph-like icon. After clicking, it should change to a Wi-Fi-like icon. A green pop-up message should also appear on the bottom right, indicating that your test connection was successful.

Creating the Airflow Pipeline Code

Airflow has been set up and configured properly, so now we can get to writing the code for the Airflow pipeline. First, a little bit about Airflow. Airflow works through Directed Acyclic Graphs (DAGs), which describe structured workflows that can contain multiple data processing tasks: each task is a node, each dependency is a directed edge, and no task can depend on itself through a cycle. Within a DAG, the tasks can be run in specified orders and with dependencies, so you can set up tasks to run after other tasks complete or in parallel with other tasks.

To get finance data, I used yfinance, a Python package that leverages Yahoo! Finance’s APIs to get market data. We can use this to extract stock ticker price data and write it to our PostgreSQL database. Here is the final […]
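Separately from the article's own pipeline code, the DAG idea described above can be illustrated without Airflow. The following is a minimal sketch using only Python's standard library (graphlib), not Airflow's API. The task names are hypothetical and stand in for the kind of extract, transform, and load steps a stock price pipeline might have. It shows how dependencies determine which tasks must wait and which can run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract_prices": set(),
    "extract_dividends": set(),
    "transform": {"extract_prices", "extract_dividends"},
    "load_to_postgres": {"transform"},
}


def execution_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; tasks within one batch have no pending
    dependencies on each other and could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock dependents
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(execution_batches(dependencies), start=1):
        print(number, batch)
    # 1 ['extract_dividends', 'extract_prices']
    # 2 ['transform']
    # 3 ['load_to_postgres']
```

The two extract tasks land in the same batch because neither depends on the other, while the load step waits for the transform step. A scheduler such as Airflow applies the same ordering rule, with retries, scheduling, and logging layered on top.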