Supervised Learning- Linear Regression-Part Two

Recap:
In my last post, I explained how, given a set of data, we can create a line that tries to predict outputs. This prediction can be represented as a function, F(x) = y = ax + b. We also discussed how this line could be inaccurate without some analysis of errors.


A Couple Notes:
If anyone is taking Professor Ng’s Coursera course, there may be confusion between the notation I use and the notation he uses. Professor Ng’s notation is more standard in the field of machine learning, but I decided to start with y = ax + b because it is a more common starting point and less confusing for beginners.

Also, the notation we use for our data points will come in handy as we progress. Let’s look at our data points and get an understanding of our notation:


X Notation | X Value (Living Area) | Y Notation | Y Value (Price)
-----------|-----------------------|------------|----------------
x1         | 2104                  | y1         | 400
x2         | 1600                  | y2         | 330
x3         | 2400                  | y3         | 369
x4         | 1416                  | y4         | 232
x5         | 3000                  | y5         | 540

Also keep in mind, the term for “any given x” or “any given y” is written x(i) and y(i) respectively, where i indexes the training examples (so x(2) = 1600 and y(2) = 330).
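To make the notation concrete, here is a small Python sketch of the training set above (the helper function is mine, purely for illustration):

```python
# The training set from the table above, as plain Python lists.
# Python lists are 0-based, so x[i - 1] corresponds to x(i) in the post's notation.
x = [2104, 1600, 2400, 1416, 3000]  # living area
y = [400, 330, 369, 232, 540]       # price

def training_example(i):
    """Return the i-th training example (x(i), y(i)), 1-indexed as in the post."""
    return x[i - 1], y[i - 1]

print(training_example(2))  # (1600, 330)
```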

What is an error?
An error is rather intuitive. Given a predictive function, F(x), measure how far off the predicted outputs are from the known outputs: simply predicted minus actual. Imagine I took F(2104) and the function predicted y = 500. My error in this case would be 100 (500 - 400). If I took F(1600) and the prediction was y = 300, then my error would be -30 (300 - 330). You may ask: how does one get a negative amount of error? Are you saying you’re so wrong that you’re right? This happens because the raw difference measures direction as well as distance. We handle this by squaring our errors. In case you’re curious why we don’t use the absolute value instead, read about Jensen's inequality (and note that the squared error, unlike the absolute value, is differentiable everywhere).

This can be represented as: error(i) = F(x(i)) - y(i)
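To make the error calculation concrete, here is a small Python sketch; the parameter values a = 0.2 and b = 50 are made up purely for illustration:

```python
# Errors of a hypothetical predictive function F(x) = a*x + b on the training set.
a, b = 0.2, 50  # made-up parameter values, for illustration only

def F(x):
    return a * x + b

data = [(2104, 400), (1600, 330), (2400, 369), (1416, 232), (3000, 540)]

for x_i, y_i in data:
    error = F(x_i) - y_i        # predicted - actual (can be negative)
    squared_error = error ** 2  # squaring removes the sign
    print(x_i, round(error, 2), round(squared_error, 2))
```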

Mean Squared Error and the Cost Function
We can take our function and check the error for each x and y value, but what do we do after that? The error data is not really meaningful to us yet, as it is simply a collection of error values. What we’re going to do is very similar to taking the average of the collected errors. The sum of all our squared errors can be represented as:

Σ (from i = 1 to m) (F(x(i)) - y(i))²

where the preceding term, sigma (Σ), is just a fancy way of representing a sum. Remember that m is the number of examples in our data set. For the purpose of our goal, instead of dividing the sum by m, we will divide it by 2m; the extra factor of 2 does not change where the minimum is and will cancel out nicely when we take derivatives later. Our end equation looks like this, and we set our error to a variable J:

J = (1 / 2m) · Σ (from i = 1 to m) (F(x(i)) - y(i))²

In machine learning, J is called the cost function. The one I have provided differs in notation, but it is fundamentally the same as the one you might find in a machine learning textbook. Now you might notice a problem: our original equation is y = ax + b. What do we do with the a and b in this case? We will calculate the mean squared error for different values of those parameters. Think about it this way: our x and y values are fixed (they are our data), but we need to choose the values of a and b that make our function fit that data. The cost for any given a and b is noted as follows: J(a, b).
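Here is a minimal Python sketch of this cost function over our five-house training set (the parameter values passed in are made up):

```python
# The cost function J(a, b) = 1/(2m) * sum((a*x + b - y)^2) over the training set.
data = [(2104, 400), (1600, 330), (2400, 369), (1416, 232), (3000, 540)]
m = len(data)  # number of training examples

def cost(a, b):
    """Mean squared error cost for the line F(x) = a*x + b."""
    return sum((a * x + b - y) ** 2 for x, y in data) / (2 * m)

print(cost(0.2, 50))  # cost of one made-up parameter choice
print(cost(0, 0))     # cost of predicting 0 for everything (much worse)
```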


A Change in Notation
It is at this point I’d like to change the notation in order to align more closely with Professor Ng’s class. We will recognize that our b term is a parameter just like a, and notate each parameter as a subscripted a. So,

F(x) = a0 + a1·x

where a0 plays the role of b and a1 plays the role of a.


This will help us with the calculus that will happen later on. Similarly, the cost function will look like this:

J(a0, a1) = (1 / 2m) · Σ (from i = 1 to m) (F(x(i)) - y(i))²

Finding the right parameters
Our goal now is to find the values of a0 and a1 that produce the lowest cost, J. By obtaining the minimal value of J, we will have the most accurate function possible. How do we find the lowest value of J? To do so, we need to find the relationship of each parameter to the cost function. More specifically, we need to see how J changes, and at what rate it changes, with respect to our parameters. In calculus, the rate of change is found with the derivative. To understand this a little better, let’s say we had the function:

f(x) = 3x² + 3

The way we find the derivative is by eliminating the constant, 3, multiplying the coefficient 3 by the exponent 2, and reducing the exponent by one. With this we get 6x, and we know that the above function changes at a rate of 6x. So the derivative of the above function (read as “f prime of x”) is:

f'(x) = 6x
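If you ever want to sanity-check a derivative like this numerically, a finite-difference approximation works well. The sketch below assumes the example function was f(x) = 3x² + 3, whose derivative is 6x:

```python
# Numerically check that f(x) = 3x^2 + 3 changes at a rate of 6x.
def f(x):
    return 3 * x ** 2 + 3

def derivative(func, x, h=1e-6):
    """Approximate the derivative of func at x with a central difference."""
    return (func(x + h) - func(x - h)) / (2 * h)

for x in [0.0, 1.0, 2.5]:
    print(x, derivative(f, x), 6 * x)  # the last two values should match closely
```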
However, the cost function takes two parameters, and as such we must evaluate the rate of change with respect to one variable while holding the others fixed. We do this with a partial derivative. For example, take the following function:

f(x, y) = x²y

We can then take the derivative with respect to each variable, x and y, and find out how the function changes as that variable changes. The partial derivative with respect to y is denoted as:

∂f/∂y = x²

Essentially what we’re doing is seeing how f changes with respect to y by holding x constant. In the above example, the derivative of y evaluates to 1 and x² is treated as a constant, so f changes at a rate of x² with respect to y.
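The same numerical trick works for partial derivatives. The sketch below assumes the example function f(x, y) = x²y and nudges y while holding x fixed:

```python
# Numerically check the partial derivative of f(x, y) = x^2 * y with respect to y.
def f(x, y):
    return x ** 2 * y

def partial_wrt_y(x, y, h=1e-6):
    """Approximate df/dy at (x, y): hold x fixed, nudge only y."""
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

print(partial_wrt_y(3.0, 5.0))  # should be close to 3.0 ** 2 == 9.0
```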

For our cost function, we want to see how J changes as a parameter changes, with our inputs and outputs held fixed (they are constants; they are our data). The derivative term with respect to each parameter looks like this:

∂J/∂a0 and ∂J/∂a1

We then multiply the derivative term by some value, which we will call α (the Greek letter alpha), and subtract the product of α and the derivative term from our current parameter value. So the algorithm to find the minimum looks like this:

a0 := a0 - α · ∂J/∂a0
a1 := a1 - α · ∂J/∂a1

with both updates applied simultaneously.


We would then assign these results to the parameters, plug them back into the equation, and repeat the process until we converge to the minimum value of our function. This process is called Gradient Descent.
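Putting the pieces together, here is a minimal sketch of gradient descent on our housing data. The learning rate, the iteration count, and the scaling of the living areas by 1,000 (so a single learning rate works for both parameters) are my own choices for illustration:

```python
# Batch gradient descent for F(x) = a0 + a1*x (sketch; hyperparameters are made up).
data = [(2104, 400), (1600, 330), (2400, 369), (1416, 232), (3000, 540)]
xs = [x / 1000 for x, _ in data]  # scale living area down by 1000 for stability
ys = [y for _, y in data]
m = len(data)

a0, a1 = 0.0, 0.0  # initial parameter guesses
alpha = 0.1        # learning rate
for _ in range(10000):
    errors = [a0 + a1 * x - y for x, y in zip(xs, ys)]
    grad_a0 = sum(errors) / m                             # dJ/da0
    grad_a1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/da1
    a0, a1 = a0 - alpha * grad_a0, a1 - alpha * grad_a1   # simultaneous update

print(a0, a1)  # a1 is per 1,000 sq ft of living area because of the scaling
```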

Gradient Descent Visualized
While the algorithm itself might be difficult to understand, it is easy to visualize what gradient descent is trying to accomplish. Imagine we expanded our plot to a third dimension: we represent the cost function value, J, on the z axis and plot it against the various values of our parameters. A plot might look like this:
You can imagine the function as someone navigating a landscape, standing on a high point and seeking the lowest point in the valley. Each dot above represents a step towards the lowest point in the valley, or the minimum. The size of this step is governed by our α term, the learning rate. The value we choose for the learning rate determines the size of the step we take towards the minimum, so it is important to select the right one. If the learning rate is too large, each step can overshoot the minimum and we may never reach it. If it is too small, we may never reach the minimum on a practical timescale.
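Here is a toy illustration of that trade-off, minimizing the simple one-parameter cost J(a) = a² (whose gradient is 2a); the learning rates are made up:

```python
# Gradient descent on J(a) = a**2, whose gradient is 2*a, starting from a = 10.
def descend(alpha, steps=50, a=10.0):
    for _ in range(steps):
        a = a - alpha * 2 * a  # one gradient descent step
    return a

print(descend(0.1))     # converges: a shrinks toward the minimum at 0
print(descend(1.1))     # too large: each step overshoots and a blows up
print(descend(0.0001))  # too small: a has barely moved after 50 steps
```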

Conclusion
We have taken a function that we believe can predict the price of a house given its size. We tested the predictive power of the function by plugging in the size of each house and seeing how the predicted value differed from the actual value we knew. We then summed those squared errors and examined how the cost changes with respect to our parameters to find the optimal (cost-minimizing) parameter values.

If you really think about it, through this long process we were able to learn how to predict the price of a house. Were we to place this process into code and run it, we would see that our code would take in past data and perform a predicting task. It would then learn to improve the predicting task with time. This is the very definition of a learning machine!

In the next post, I will be using Python to make a machine learn.





Supervised Learning-Regression-Part One


My goal for this post is to explain the mathematical concepts behind supervised learning. For this post, I will be borrowing heavily from Andrew Ng’s (very well written and clear) notes as well as his Coursera course.
Supervised learning is a family of algorithms that find patterns in a well-defined data set, and it can be split into two types: regression and classification. Supervised learning seeks some function that receives input(s) X in order to predict, with reasonable accuracy, some output y (think y = f(x)). With classification, y is discrete. Discrete values are countable within a finite amount of time (e.g. the change in your pocket). Regression, on the other hand, seeks a continuous output. One way to think of continuous values is decimals: the temperature could be 50 degrees, or 50.00 degrees, or 50.01 degrees. These values could not all be counted within a finite time.
As stated before, our goal is to take a set of past data and create a function that can accurately predict the Y output values of a new data set. This set of past data is called a “training set.” We build our model of the data from it, and if we have done our math correctly, we will be able to predict Y for any new incoming set of data, called a “test set.” Let’s look at a simple data set that I have borrowed from Dr. Ng: house size and house price.

With this data, the living area is the input X and the price is the output Y. We use the variable m to denote how many training examples there are; in this case m = 5. I have plotted the above data with the matplotlib and numpy Python libraries.
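Here is a sketch of how such a plot might be produced (the output file name is arbitrary):

```python
# Scatter-plot the five training examples with numpy and matplotlib.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw without needing a display window
import matplotlib.pyplot as plt

living_area = np.array([2104, 1600, 2400, 1416, 3000])
price = np.array([400, 330, 369, 232, 540])
m = len(living_area)  # number of training examples

plt.scatter(living_area, price)
plt.xlabel("Living area")
plt.ylabel("Price")
plt.title("Housing training set (m = %d)" % m)
plt.savefig("housing_plot.png")
```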
With this data we could make a hypothesis about the Y output. A simple approach is a function of the form y = ax + b. In this case, a is the weight, or parameter, that controls how much x affects the output. We need to find the best fit for this data, i.e. the a and b values that give us the most accurate predictions. We can draw a line to represent this fit. Such a line would look like the one below:





Does this line give us an accurate prediction of the data? Not really; it is only a first guess. We can move towards a more accurate estimate by adjusting our function towards the least amount of error. More explicitly, what values of a and b should we aim for? We can find these with something called the cost function. Given the length of this post, I will explain this concept in the next post.

What is Machine Learning?

In the beginning of January, I decided to take some coursework via MOOCs on Machine Learning and Artificial Intelligence. One story we constantly hear in the media is how much AI is going to change the world, for better and for worse. Many of the staple jobs in the US are projected to be done by intelligent machines in the coming decade, causing massive structural unemployment. Some even worry AI will take over like we have seen in Hollywood (unlikely, I think). Regardless, we must acknowledge and push forward in this area because the benefits (as well as the risks of not pursuing AI) are too great to ignore.

I've been taking Andrew Ng's Intro to Machine Learning via Coursera, Udacity's Intro to AI and Linear Algebra refresher, and watching a ton of Siraj Raval videos on YouTube. Machine learning is one of the hardest things I've tried to learn in my life. It involves a ton of math, probability, and programming, but this is precisely why I like it. I love the challenge. I am, however, a lazy student. I am trying to get into GA Tech's Online Masters in Computer Science so I can do machine learning full time, but I also have to make up for mediocre undergraduate performance in a non-CS field. I have learned, since teaching myself to program, that this blog and the Feynman method (teaching what you've only just learned in order to grasp it fully) will keep me accountable and push me to a deep understanding. Regardless of what my academic career looks like, I will learn machine learning because I like it.

So, the fun part: what is machine learning? Tom Mitchell, a computer science professor at Carnegie Mellon, defines machine learning as this:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Makes perfect sense, right? The definition is not as intimidating as it looks at first glance. All Dr. Mitchell is saying is: if you give a computer data about the past regarding some task and it is able to improve how that task is done, then you can say that the program has learned.

Why is this important? Is Machine Learning AI?
This is important because machine learning is our first big step toward the ultimate goal of artificial intelligence: making a machine as intelligent as a human being. It is important to remember that machine learning is a subset of artificial intelligence, not the field as a whole. Learning is only a small part of what we want artificial minds to do. Some researchers, like Monica Anderson, say machine learning is the only kind of artificial intelligence we have today, though the goal is to incorporate things like abstract thinking into our machines.

So how does machine learning differ from conventional programming?

Conventional programming forces a programmer to define all the parameters and data needed to perform a task. Machine learning takes a set of rules about a task, data about the past, and a desired output, and reaches that output by repeatedly attempting the task and improving. For example, imagine trying to make a winning chess program. You could program all the moves and countermoves, but chess strategy has been studied for centuries; the programming task would be monumental. However, if you supply the rules for how pieces move and what a winning scenario is, then with machine learning a program can learn to play chess and win.

How does a computer learn to play chess? Or any task for that matter?
Well, this is where the math comes in. Machine learning has also been called "the extraction of knowledge from data." It uses advanced concepts from statistics and probability (which I will be covering in future posts) to determine patterns in datasets. Just hang in there; I will teach you this in detail. For now, I want to explain conceptually the different ways machines learn from data.

Supervised Learning
This is probably the most intuitive learning method. A program is supplied a well-labeled dataset with correct outcomes; for instance, supplying a program with images and labels for those images (e.g. "this is a cat"). With this method, the more data the program is supplied, the more it is able to learn, and when it encounters unlabeled data, it should be able to classify that data correctly. My next couple of posts will revolve around this, walking through a "hello world" style machine learning program. This is the most common type of machine learning.

Unsupervised Learning
This kind of learning is when a computer is given an unlabelled set of data and we tell the computer: "here is an unstructured dataset; can you give it structure?" This is a powerful approach, and one way to think of it is how we learn something completely new. If you watch one or two YouTube videos on a complex subject, you aren't likely to learn much. However, if you watch thousands of videos, read a hundred books, and network with others learning the same thing, then you will learn much more.

Reinforcement Learning
With reinforcement learning, a program learns through trial and error: it takes actions in an environment, receives rewards or penalties based on the outcomes, and adjusts its behavior to maximize its reward over time. A classic example is a program that learns to play a game by being rewarded for wins and penalized for losses. (A different technique, "Nearest Neighbors," classifies an unknown item by looking at the classifications of the known items that neighbor it; it is actually a supervised method rather than reinforcement learning.)

Neural Nets
This is a really cool buzzword and a really cool technique! This method uses a model loosely inspired by the human brain, which truly deserves its own post to explain. The technique uses matrix multiplication and activation functions to allow a machine to interpret inputs and propagate them through a network. Neural networks lost popularity in the past but have regained it in recent years. This is my personal favorite type of learning!
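To give a taste of the "matrix multiplication plus activation" idea, here is a sketch of a single neuron with made-up weights; real networks stack many of these into layers:

```python
# One neuron: a weighted sum of inputs passed through an activation function.
import math

def sigmoid(z):
    """A common activation function that squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, weights, bias):
    """Weighted sum of inputs plus a bias, then the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Made-up inputs and weights, purely for illustration.
print(forward([1.0, 2.0], [0.5, -0.25], 0.0))  # sigmoid(0.0) == 0.5
```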

Conclusion
My hope for my future blog posts is to go into detail on how math can help a machine learn. As always, if you are kind enough to read this post, let me know what you think. I welcome constructive criticism. I want to learn how to design learning machines and be the best at it; I don't want to just seem like I know what I'm talking about. If there's a concept I didn't explain well, or if you want to understand something more deeply, let me know!


Thorn: Digital Defenders of Children - What They Do and How You Can Help

I was born with a strong desire to do good for others. Over the last three years, the career choices I have made have reflected that. In 2014, I joined the Franklin County Sheriff's Office with the hopes of becoming an investigator in one of two task forces: I.C.A.C. (Internet Crimes Against Children) or the Homeland Security/Ohio Bureau of Criminal Investigation human trafficking task force. I started off working in the Franklin County jails, but after two years of working I came to the realization that I couldn't live off the salary I earned. Luckily I found something that I loved which could be used for the benefit of others: software development.

Every day after work, I would come home and study for at least three hours instead of watching TV or playing video games. When I left the Sheriff's Office, it was bittersweet: I knew I would miss the work and my coworkers, but I've been pretty happy in my software development job at Nationwide Insurance. Despite the fact that I am earning more money than I ever have before, I find that I am missing one thing. I am intrinsically motivated, more motivated by the effect of the work I do than the money I make. Yesterday, I had no intention of applying for a job until I watched Ashton Kutcher's powerful testimony and learned about Thorn.

What is Thorn? "The thorn protects the rose." Thorn is an organization that collaborates with government and business to use technology to fight child exploitation. The simple fact of the matter is this: the criminals are technologically superior to those wishing to do good. Criminals have created vast and elaborate enterprises for selling drugs, illegal arms, and human beings. Human beings of all ages and genders are sold to entertain the perverse sexual desires of a very large marketplace. I had no idea how well designed these systems were until I spoke with pedophiles while working at Franklin County. What is worse, however, is how widespread and profitable this industry is.

Thorn's work leverages powerful technologies to help victims report their abusers, to provide software that helps governments identify victims, and to deter offenders. They maintain a central database of abuse images so that large organizations can collaborate effectively with one another. One thing that struck me most on Thorn's website is that they are taking a very intelligent and holistic approach to fighting this problem. Thorn's Deterrence project (read more here) tracks those spreading exploitative material and encourages the perpetrator to find help. While this sounds soft on someone who has committed an extremely heinous crime, I can say definitively that most people who commit this type of crime cannot help themselves. Remember, the goal is to protect children and prevent these acts from happening.


Thorn's work is important. I have applied for a job there, but you don't have to be employed there to help. Thorn is a partnership of large companies as well as individuals; anyone can contribute. If you're a developer like me, consider contributing to one of their projects! If you're not a developer, you can still help. Thorn is not just a couple of actors' pet project to make the public feel like they care about the world; Thorn has gotten results, and the work comes from the heart. I'm happy to say that even though I am a new programmer, I have applied to help with their open source projects as well. I hope to see my friends and coworkers join me, but most of all: I hope I can help.

So it's been a while: what I've been doing

I'm sorry to say that I haven't been blogging like I should be. Between traveling (now that I have the funds) and the holidays, I've honestly let this slip. However, I have been learning like I should be, and I've actually learned quite a bit since I blogged last. I've been exploring multiple areas of interest and learning to program in multiple languages. It has been a blast, but one thing is for certain: learn one programming language and get good at it. The other languages will follow.

What I've been learning and how I've been learning it.

  • Java
    • How? I've been programming through Graham Mitchell's Learn Java the Hard Way. I really enjoy the learn-by-doing style of the series. I read Zed Shaw's Learn Ruby the Hard Way, and I feel it is the most practical way to learn a language. I've also learned what a great language Java is. I personally like how organized it is, though it is quite verbose. Don't let that scare you: it is an invaluable skill and practically the lingua franca of software development.
  • Machine Learning
    • How? I've been taking Andrew Ng's class on machine learning out of curiosity, as well as gathering credits for the possibility of getting my master's in computer science. The more I learn about machine learning, the more I enjoy it. I may one day make this my niche. However, machine learning is extremely difficult and math heavy. I'll be blogging about machine learning in the future and explaining what I learn using the Feynman method. For this class I've been making extensive use of Octave.
  • Python & R for data science
    • How? Through Datacamp! It is a low-cost online program for training people in data science and machine learning. This is a great resource, but it definitely needs work. I've enjoyed learning all the methods of pulling in massive amounts of data and using computers to condense down what data we need. You also learn to make useful predictions with that data. I will also be blogging about this in the future. Machine learning and data science are used heavily together.
  • JavaScript
    • How? Mostly by doing projects from FreeCodeCamp. I've made a Weather App that detects your location and tells you the local weather and displays icons. I've also made a WikiPedia Clone! Both of these sites make heavy use of API's (Weather Underground and Wikimedia respectively). 
How do I have time to learn all these things? Truth is, once you get your first coding job, you'll have plenty of time to learn while doing, or to do self-paced learning. If you're looking to get into this field, just get good enough to land your first job and learn from there. Learning is my hobby, and I'm looking forward to restarting my blog and showing you what I'm learning!