Coding Your First Learning Algorithm - Part Two - The Code without ML Libraries

Steps to Coding Our Linear Regression Algorithm
All good code starts on paper:

  1. Gather our data & determine our starting conditions
  2. Compute our cost function (or how much error our hypothesis function has)
  3. Perform gradient descent to minimize the cost function
  4. Return the proper a values for the most accurate predictive function
Gather our data & determine our starting conditions
I'll be using the same data points from the last post, so the function to load the data will remain the same. I want to set up my hypothesis function with some guess as to what my a values are. If I simply set them to zero, the computer will do the heavy lifting and determine the proper values with gradient descent.

I will also need to set a learning rate and the number of iterations. The number of iterations determines how many times the program repeats the gradient descent step, or how much the model will be "trained." We have only 97 data points, so we don't want to train our model too heavily, but we also want to make sure it gets a good amount of training. I chose 1000, mostly because that's what we use in class, but also because it is a good middle ground for a small dataset. Remember, our learning rate is the size of the steps we take towards a minimum. Similarly, the number of iterations is how many steps we take, and gradient descent gives us the direction. I wrapped all of these attributes into a "main" function and the code looks like this:
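As a minimal sketch of those starting conditions (not the original code): the names `starting_conditions`, `initial_a0`, and `initial_a1` are hypothetical, and the learning rate of 0.01 is an assumed placeholder, since the post does not state the value used. Only the zero starting guesses and the 1000 iterations come from the text.

```python
def starting_conditions():
    """Collect the starting values described above in one place."""
    return {
        "initial_a0": 0.0,       # starting guess for the intercept
        "initial_a1": 0.0,       # starting guess for the slope
        "learning_rate": 0.01,   # ASSUMPTION: illustrative value only
        "num_iterations": 1000,  # number of gradient descent steps
    }


settings = starting_conditions()
for name, value in settings.items():
    print(f"{name} = {value}")
```

Keeping these values together makes it easy to experiment with a different learning rate or iteration count later.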



Compute our cost function
If you recall, we defined our cost function as:
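Written out as a sketch, and assuming the plain mean squared error named in this section with a straight-line hypothesis, it has the form below. The symbols a_0, a_1, h_a, and J are notation chosen here, and the earlier post may have used a different scaling such as 1/(2m).

```latex
J(a_0, a_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_a(x_i) - y_i \right)^2,
\qquad h_a(x) = a_0 + a_1 x
```

Here m is the number of data points (97 for this dataset).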


This function returns the mean squared error for the supplied values of a. That code has been wrapped up into a function called cost_function. The computation has to visit every data point, so we iterate through all the data with a for loop.
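A minimal sketch of such a function follows. It assumes the hypothesis is a straight line, y = a0 + a1 * x, and that the data is a sequence of (x, y) pairs. The parameter names and the sample points are hypothetical.

```python
def cost_function(a0, a1, points):
    """Mean squared error of the line y = a0 + a1 * x over (x, y) pairs."""
    total_error = 0.0
    for x, y in points:
        prediction = a0 + a1 * x
        total_error += (y - prediction) ** 2
    return total_error / len(points)


# Made-up points that lie exactly on y = 1 + 2x
sample_points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

print(cost_function(0.0, 0.0, sample_points))  # error of the all-zero guess
print(cost_function(1.0, 2.0, sample_points))  # error of the exact line
```

The all-zero guess gives a large error, while the line that passes through every point gives an error of zero.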



Perform Gradient Descent to minimize the cost function
The code for gradient descent is broken down into two parts: the gradient descent runner, which repeats the update for the chosen number of iterations, and the step gradient function, which calculates the step for each a value. Both a values need to be updated simultaneously. That holds as long as both gradients are computed from the current values before either value is reassigned. The code for both functions is below:
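One way to sketch the two functions is shown here. This is not the original code: it assumes the mean squared error cost and a straight-line hypothesis y = a0 + a1 * x, and the learning rate, iteration count, and sample points in the usage lines are made up for illustration.

```python
def step_gradient(a0, a1, points, learning_rate):
    """Take one gradient descent step and return the updated (a0, a1)."""
    m = float(len(points))
    grad_a0 = 0.0
    grad_a1 = 0.0
    for x, y in points:
        error = y - (a0 + a1 * x)
        # Partial derivatives of the mean squared error
        grad_a0 += -(2.0 / m) * error
        grad_a1 += -(2.0 / m) * x * error
    # Both gradients were computed from the old values, so assigning
    # the new values here amounts to a simultaneous update.
    new_a0 = a0 - learning_rate * grad_a0
    new_a1 = a1 - learning_rate * grad_a1
    return new_a0, new_a1


def gradient_descent_runner(points, starting_a0, starting_a1,
                            learning_rate, num_iterations):
    """Repeat step_gradient num_iterations times."""
    a0, a1 = starting_a0, starting_a1
    for _ in range(num_iterations):
        a0, a1 = step_gradient(a0, a1, points, learning_rate)
    return a0, a1


# Made-up points that lie exactly on y = 1 + 2x
sample_points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
a0, a1 = gradient_descent_runner(sample_points, 0.0, 0.0, 0.01, 5000)
print(a0, a1)  # should end up close to 1 and 2
```

Returning both new values as a tuple keeps the update simultaneous without any temporary variables in the runner.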


Return the proper a values for the most accurate predictive function
Now we have a set of functions that should return the values of a that minimize the cost and give us the most accurate predictions a straight line can offer. How do we measure whether we have improved? We compare our original error value with the error value after gradient descent. I added some code that prints both to the command line.


When I run my program, it is extremely quick (less than 1 second) and we have good values for a! Here are the results:


When we started, our average error was approximately 64, but after we ran our program it was close to 11. It seems that our program has improved by a significant amount!
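To put a number on that improvement, here is a small sketch that computes the relative reduction in error from the two rounded values quoted above. The function name `relative_improvement` is hypothetical.

```python
def relative_improvement(initial_error, final_error):
    """Fraction by which the error dropped, relative to where it started."""
    if initial_error <= 0:
        raise ValueError("initial_error must be positive")
    return (initial_error - final_error) / initial_error


# Rounded error values reported above: about 64 before, about 11 after
reduction = relative_improvement(64, 11)
print(f"Error reduced by {reduction:.0%}")
```

With those rounded figures, (64 - 11) / 64 works out to a reduction of roughly 83%.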

Soon, I hope to write this same program with the scikit-learn library, which could reduce the code to fewer than 15 lines.

Coding Your First Supervised Learning Program-Part One-Describing Our Dataset

Special Thanks:

  • Siraj Raval
  • Dr. Andrew Ng
  • Dr. Jason Brownlee
  • Harrison from Sentdex
--------------------------------------------------------------------------------------------------------------------------
For my first demo of a supervised linear regression learning program, I will be using the dataset from Andrew Ng's Machine Learning course and doing some descriptive work. This dataset explores the relationship between a city's population and the profitability of a food truck. In Dr. Ng's first programming assignment, he provides this data in a .txt format with our x and y values separated by a comma. Dr. Ng's data file is labeled 'ex1data.txt.' To get started, I created a directory labeled SimpleLinearRegession and I placed the data file inside this directory.

Let's Describe Our Data
I want to get an understanding of what's going on with our data. We're going to write our program in Python (it is the easiest programming language you'll ever learn). To read the data in the .txt file, we'll use a library called numpy, which provides a command called loadtxt. To install packages like this one, I'd recommend Python's package installer, pip. We will be using the numpy library for a lot of other cool stuff later. I will import this functionality with the following script:



What's cool about Python is how human readable it is (much like Ruby). As described before, we have a dataset with x and y values separated by commas which looks like this: 



We can make this data usable in our Python program by creating a variable and loading the data from the .txt file, like so:
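A minimal sketch of that loading step is below. So that it runs without the data file, it reads from an in-memory stand-in holding a few made-up rows in the same comma-separated format. With the real file, the filename would be passed to `np.loadtxt` in place of `sample_file`.

```python
import io

import numpy as np

# Stand-in for the .txt file: made-up "x,y" rows, one pair per line
sample_file = io.StringIO("1.0,2.0\n2.0,4.1\n3.0,5.9\n")

# delimiter="," tells loadtxt that the values on each line are comma-separated
data = np.loadtxt(sample_file, delimiter=",")

print(data)        # one row per data point: [x, y]
print(len(data))   # number of data points, m
```

The result is a two-dimensional array with one row per line of the file, so `len(data)` gives the number of data points.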




What does our data look like now? In Python, we can see what our data variable looks like by typing 'print data' into our Python file (or 'print(data)' in Python 3) and then executing the program (python your_file.py). You should get something like this:




At this point, our program returns an array of arrays. Each array contains an x and a y value. From this, we can determine our m value, or the amount of data we have in our dataset. If you type 'print len(data)', your program should return 97, so m = 97. What I would like to do now is plot my data. We can do this in Python with a library called matplotlib, using pyplot with the code below:



In this case, plt acts as an alias for the package so that we don't have to type it out over and over again. We can create a scatter plot for our data with the command 'plt.scatter()'. This command requires two arguments, an x and a y variable. However, we have an array of arrays, with each array containing an x and a y value. Each array would only pass as one argument, so we need to split these values up into a group of x coordinates and a group of y coordinates to pass into our scatter method. We can do so by initializing two empty arrays, iterating through our data array, and then appending the first element of each inner array to the x array and the second to the y array. That code looks like this:
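The loop described above can be sketched as follows. The function name `split_columns` and the sample rows are hypothetical; any sequence of [x, y] pairs works, including the array returned by loadtxt.

```python
def split_columns(data):
    """Split a sequence of [x, y] pairs into a list of x's and a list of y's."""
    x_values = []
    y_values = []
    for row in data:
        x_values.append(row[0])  # first element of each pair is x
        y_values.append(row[1])  # second element of each pair is y
    return x_values, y_values


# Made-up rows in the same [x, y] layout as the loaded data
sample_data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9]]
x, y = split_columns(sample_data)
print(x)
print(y)
```

When the data is a NumPy array, the slices `data[:, 0]` and `data[:, 1]` pick out the same two columns without a loop.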





We can now plot our data with the scatter method. We can add labels to the x and y axes with the '.xlabel' and '.ylabel' methods. Keep in mind that the scatter plot needs the '.show' method before the user can see it. The code to create and show our scatter plot looks like this:
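Here is a minimal sketch of that plotting code. The axis label text, the function name `build_scatter`, and the sample values are assumptions for illustration, since the original labels are not shown. The call to `plt.show()` is left commented out so the sketch also runs in environments without a display.

```python
import matplotlib.pyplot as plt


def build_scatter(x_values, y_values):
    """Draw a labeled scatter plot and return the axes it was drawn on."""
    plt.scatter(x_values, y_values)
    plt.xlabel("City population")    # ASSUMPTION: illustrative label text
    plt.ylabel("Food truck profit")  # ASSUMPTION: illustrative label text
    return plt.gca()


# Made-up x and y lists of equal length
x = [1.0, 2.0, 3.0]
y = [2.0, 4.1, 5.9]
axes = build_scatter(x, y)

# plt.show()  # uncomment to open the plot window
```

Nothing appears on screen until `plt.show()` is called, which is why that line matters when running the script yourself.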



This code should produce a scatter plot, like the one below, whenever the program is executed.





Now we have an idea of what our data looks like. In the next post, we'll code our linear regression algorithm and determine how city population (x) affects our profits (y).

The code for what I have so far is here.