- Siraj Raval
- Dr. Andrew Ng
- Dr. Jason Brownlee
- Harrison from Sentdex
--------------------------------------------------------------------------------------------------------------------------
For my first demo of a supervised linear regression program, I will be using the dataset from Andrew Ng's Machine Learning course and doing some descriptive work. This dataset explores the relationship between a city's population and the profitability of a food truck. In Dr. Ng's first programming assignment, he provides this data as a .txt file with each pair of x and y values separated by a comma. Dr. Ng's data file is named 'ex1data.txt'. To get started, I created a directory named SimpleLinearRegession and placed the data file inside it.
Let's Describe Our Data
I want to get an understanding of what's going on with our data. We're going to write our program in Python (I find it the easiest programming language to learn). To read the data in the .txt file, we will use the loadtxt function from a library called numpy. If numpy is not installed yet, I'd recommend installing it with pip, Python's package installer. We will be using the numpy library for a lot of other cool stuff later, so I import it at the top of the program.
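A minimal sketch of that import, assuming numpy has already been installed (for example with `pip install numpy`). The `np` alias is a common convention and my assumption here, not necessarily the exact line from the original script:

```python
import numpy as np  # "np" is a conventional alias; the original import line may differ

# loadtxt is the numpy function we will use to read the data file.
print(callable(np.loadtxt))  # True
```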
What's cool about Python is how human readable it is (much like Ruby). As described before, our dataset is a plain text file in which each line holds one x value and one y value separated by a comma.
We can make this data usable in our Python program by loading the .txt file with loadtxt, using a comma as the delimiter, and storing the result in a variable.
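Here is a rough sketch of that loading step. The helper name `load_data` is hypothetical, and the two rows below are made-up placeholder values in the same "x,y" layout, used so the snippet runs without the data file:

```python
import io

import numpy as np


def load_data(source):
    """Read comma-separated x,y rows into a 2-D numpy array."""
    return np.loadtxt(source, delimiter=",")


# Stand-in for the real file: two made-up rows, not values from the dataset.
sample = io.StringIO("1.0,2.0\n3.0,4.0\n")
data = load_data(sample)

# With the data file in the same directory, this would become:
# data = load_data("ex1data.txt")
```

loadtxt accepts either a file name or a file-like object, which is why the same helper works for both the placeholder and the real file.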
What does our data look like now? We can see what our data variable holds by adding 'print(data)' to our Python file and then executing the program (python your_file.py). You should see the values printed as a two-column array, with one row for each line of the data file.
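To give a feel for the shape of that output, here is a small sketch that prints a two-row array. The values are made-up placeholders, not rows from the dataset; the real output has the same layout with many more rows:

```python
import numpy as np

# Made-up placeholder rows standing in for the loaded data.
data = np.array([[1.0, 2.0], [3.0, 4.0]])

print(data)
# [[1. 2.]
#  [3. 4.]]
```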
At this point, our program returns an array of arrays. Each inner array contains an x and a y value. From this, we can determine m, the number of examples in our dataset. If you add 'print(len(data))', your program should print 97, so m = 97. What I would like to do now is plot my data. We can do this in Python with a library called matplotlib, using its pyplot module, which I import as plt.
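Both steps from this paragraph can be sketched as follows. The array is a made-up placeholder, so m comes out as 2 here rather than the 97 you get with the full dataset, and the import assumes matplotlib is installed:

```python
import matplotlib.pyplot as plt  # plt is the usual alias for pyplot
import numpy as np

# Made-up placeholder rows standing in for the loaded data.
data = np.array([[1.0, 2.0], [3.0, 4.0]])

# len() on a 2-D array counts its rows, i.e. the number of examples.
m = len(data)
print(m)  # 2 for this placeholder
```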
In this case, plt acts as an alias for matplotlib.pyplot so that we don't have to type it out over and over again. We can create a scatter plot for our data with 'plt.scatter()'. This function takes two arguments, the x values and the y values. However, we have an array of arrays, with each inner array containing one x and one y value, so passing our data variable directly would only supply one argument. We need to split these values up into a group of x coordinates and a group of y coordinates to pass into scatter. We can do so by initializing two empty lists, iterating through our data array, and appending the first element of each pair to the x list and the second element to the y list.
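A minimal sketch of that splitting loop, again using made-up placeholder rows in place of the loaded data:

```python
import numpy as np

# Made-up placeholder rows standing in for the loaded data.
data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

x = []
y = []
for row in data:
    x.append(row[0])  # first element of each pair is the x value
    y.append(row[1])  # second element is the y value

# A shorter alternative (not the approach described above) is numpy slicing:
# x, y = data[:, 0], data[:, 1]
```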
We can now plot our data with the scatter function. We can add labels to the x and y axes with 'plt.xlabel()' and 'plt.ylabel()'. Keep in mind that the scatter plot is only displayed once 'plt.show()' is called.
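The plotting step might look roughly like this. The function name `build_scatter` and the label text are my own choices for illustration, not taken from the original script:

```python
import matplotlib.pyplot as plt


def build_scatter(x, y):
    """Draw a labeled scatter plot and return the axes it was drawn on."""
    plt.scatter(x, y)
    plt.xlabel("City population")  # label text is an assumption
    plt.ylabel("Profit")           # label text is an assumption
    return plt.gca()
```

After calling `build_scatter(x, y)` with the x and y lists, a call to `plt.show()` displays the plot.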
Whenever the program is executed, it should now display a scatter plot of our data.
Now we have an idea of what our data looks like. In the next post, we'll code our linear regression algorithm and determine how city population (x) affects our profits (y).
The code for what I have so far is here.