We talk about AI a lot here. We talk about data less often, but data is one of the most important parts of the AI ecosystem. Without data, there would be no AI. Whenever you use AI, there's always a data pipeline feeding whatever work you're doing, so let's take some time to discuss data pipelines: what they are, how they serve AI, and then a tutorial on how to build a small custom data pipeline, including model training.

## What is a data pipeline?

A data pipeline is how data moves from raw input to usable output. It's a set of steps that do the following:

- **Collect** data from the source, like apps, sensors, logs, etc.
- **Move** data to storage like a database, warehouse, or service.
- **Transform** data with processes that clean, aggregate, or reshape it.
- **Deliver** data to dashboards, models, and APIs.

It won't matter which algorithm, library, or model you use. If your data isn't accurate, your results won't be accurate either.

## How data serves AI

We know data is important, but what does it actually do? Here are the three roles data plays in AI systems.

### Data trains the model

Data teaches an AI system how to behave. Machine learning models learn patterns from structured datasets. LLMs learn language, context, and relationships from text data. No data, no learning. You'd just have these fancy models with no understanding of anything.

### Data shapes a model's output

Models need data even after they're trained because they rely on data inputs to produce their outputs. Data triggers the model to act. For example:

- Prediction models need new data points to evaluate.
- Recommendation systems need user behavior to make recommendations.
- A language model needs a prompt.

### Models improve through data

AI systems aren't static. Their evolution and continued success rely on the data they continue to receive.
Data's role after deployment is pretty similar to the role it plays in the earlier stages:

- Improving future outputs based on user interaction data.
- Identifying errors and drift through performance data.
- Retraining or fine-tuning models using new data.

All this can be summed up in a simple statement: there is no AI without data, and there is no good AI without good data.

## Build and train a model with simulated inputs

No matter how large or small the AI system is, data pipelines still follow the same workflow listed earlier in this article (collect, move, transform, deliver). The majority of these details are abstracted away when working with SaaS AI because companies want to make it as easy as possible for you to use. I still think it's helpful to understand what's going on under the hood. Having this understanding helps you make better decisions about the quality, timeliness, and reliability of the data your AI relies on.

The remainder of this article will focus on creating a data simulation, training a small model with scikit-learn's linear regression, and making predictions that you can see in your terminal. Before getting started, make sure you have an IDE and Python installed on your machine.

We'll need to install pandas and scikit-learn. You can do this using the code below:

View the code on Gist.

Once your installs are successful, let's set up our file structure. It should look like this:

View the code on Gist.

Now we're ready to get started!

### Simulate data and make predictions

For this project, we're going to build a data simulation rather than connect to an API or an existing dataset. This shifts the focus away from gathering or sending data to/from an external source and toward building data to train a model. This would be a small piece in a larger data pipeline (the collect, transform, and deliver steps).

We're going to simulate temperature data over a 24-hour period using a script that mimics daily patterns and adds in a little randomness.
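To make the flow concrete before walking through each piece, here's a minimal, self-contained sketch that condenses the tutorial's steps into one script: simulate hourly temperatures with a sine-shaped daily cycle plus noise, derive a previous-hour feature, fit a linear regression, and predict. The temperature baseline, noise level, and column names are illustrative assumptions, not the article's exact gist code.

```python
# Hypothetical condensed sketch of the tutorial's flow (the article splits
# this across separate simulation/prediction and training scripts).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)  # seeded so runs are reproducible

# Simulate a daily cycle with sine: coolest around 3 a.m., warmest around 3 p.m.
hours = np.arange(24)
temps = 15 + 10 * np.sin((hours - 9) * (2 * np.pi / 24)) + rng.normal(0, 1.5, 24)

df = pd.DataFrame({"hour": hours, "temperature": temps})
df["prev_temperature"] = df["temperature"].shift(1)  # previous-hour feature
df = df.dropna().reset_index(drop=True)  # first row has no previous hour

# Train: learn the relationship between (hour, previous temperature) and temperature
X = df[["hour", "prev_temperature"]]
y = df["temperature"]
model = LinearRegression().fit(X, y)

# Predict and print actual vs. predicted side by side
df["predicted"] = model.predict(X)
print(df[["hour", "temperature", "predicted"]].round(2))
```

The article keeps training and prediction in separate files and persists the fitted model to disk between them; this sketch skips the save/load step to stay in one script.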
This script builds a data set with natural variation and features you can model against (like average temperature at a given hour, how much it fluctuates, and the temperature from the previous hour).

Our prediction code, at a high level, uses the sine function to simulate daily temperature patterns, adds random noise to make the data less perfect and more realistic, and loads and runs our model (`model.pkl`).

`direct_predict.py`:

View the code on Gist.

### Training a model

Next, we're going to train a model using simple linear regression. Linear regression is a method that predicts a numeric value by finding the best straight-line relationship between input features and the output. By using linear regression, we can estimate a number (like tomorrow's temperature) based on other known values (like today's temperature and the time of day) by fitting a straight line to past data.

The model below will learn the relationship between time and temperature and save it to a `model.pkl` file so we can reuse it.

`train_model.py`:

View the code on Gist.

### Running the code

The first thing we're going to do is train the model. We can do that with the following terminal command:

View the code on Gist.

This will create your `model.pkl` file.

The last step is creating data and making the predictions. You can do this by running the following terminal command:

View the code on Gist.

After you run this command, you'll see a chart in your terminal that includes the actual temperature and the predicted temperature.

Now you have a basic understanding of how data works hand in hand with AI. Understanding the basics of how data flows and gets processed gives you a clearer picture of what's really happening behind the scenes. The more you understand, the better you can leverage an AI system to work for your benefit.

The post Build it yourself: A data pipeline that trains a real model appeared first on The New Stack.