Natural Language Processing with the NLTK Library
What is Natural Language Processing?
According to Wikipedia, it is about
"enabling computers to derive meaning from human or natural language input."
Let's go through an example… what do people think of Asim?
Humans have more than one way to express things.
"Asim is awesome"
"Asim is good"
"Asim is great"
All the above statements express a positive sentiment.
Opposite world.
"Asim is bad"
"Asim is rubbish"
"Asim is pants"
All the above express a negative sentiment.
A simple approach would be to define some hand-written rules.
good = {"awesome", "good", "great"}
bad = {"bad", "rubbish", "pants"}
statement = {"Asim", "is", "awesome"}

if good & statement:
    print("good")
elif bad & statement:
    print("bad")
It kinda works, but what if the statement was:
"Asim is bad ass"
Rather than do things by hand we prefer to get the machine to do it for us.
Natural Language Processing using Machine Learning covers quite a lot of areas; for the purposes of this presentation we are going to cover just the Naive Bayes method.
These are basically just statistical models. By that I mean equations that calculate the probabilities of things based on certain inputs.
They learn from data: the more data you give them, the more accurate they become.
There are quite a few different types of statistical model you can leverage, and the power of NLTK is that it has a load of them already available as part of the library.
By far the simplest (and I've had such good results with it that I've never touched the others) is the Naive Bayes classifier.
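To give a feel for where we are heading, training NLTK's Naive Bayes classifier on our toy statements looks roughly like this (a sketch: the word-presence feature extractor and the training sentences are my assumptions, not a fixed recipe):

```python
import nltk

def features(statement):
    # Bag-of-words: each word present in the statement becomes a feature.
    return {word: True for word in statement.split()}

train = [
    (features("Asim is awesome"), "good"),
    (features("Asim is good"), "good"),
    (features("Asim is great"), "good"),
    (features("Asim is bad"), "bad"),
    (features("Asim is rubbish"), "bad"),
    (features("Asim is pants"), "bad"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features("Asim is rubbish")))  # bad
```

The classifier works out the word/sentiment probabilities from the training data itself, which is exactly the maths we walk through next.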
Before we delve into the code we are going to do something that, for some strange reason, a lot of geeks hate: MAFS!
Bayesian statistics is pretty simple, so don't worry.
P(E|C) = P(C|E)P(E) / (P(C|E)P(E) + P(C|E')P(E'))
Perhaps to make it more sensible I'll change the names a bit:
P(Good|Asim) = P(Asim|Good)P(Good) / (P(Asim|Good)P(Good) + P(Asim|Bad)P(Bad))
P(Err | What)?
This is the probability of Err happening given that What has already happened.
So P(Good|Asim) translates to the probability that the statement is Good given it has the word Asim in it.
Now, with just a small number of statements, we can work it out visually.
Imagine we only had two statements:

"Asim is good"
"Asim is bad"

Word counts per class:

        Good  Bad
Asim      1    1
is        1    1
good      1    0
bad       0    1
P(Good|Asim)

        Good  Bad
Asim      1    1

= 1 / (1 + 1) = 1/2
Following the full equation:

P(Asim|Good)P(Good) / (P(Asim|Good)P(Good) + P(Asim|Bad)P(Bad))

P(Asim|Good) = 1/3 because "Asim" is one of the three words in the Good statement, and P(Good) = 3/6 because three of the six words in the corpus belong to the Good statement; the Bad side works out the same.

= (1/3 × 3/6) / (1/3 × 3/6 + 1/3 × 3/6)
= (1/6) / (1/6 + 1/6)
= 1/2
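You can check this arithmetic directly with Python's fractions module (the numbers are just the counts from the table above):

```python
from fractions import Fraction as F

p_asim_given_good = F(1, 3)  # "Asim" is 1 of the 3 words in the Good statement
p_good = F(3, 6)             # 3 of the 6 words in the corpus are Good words
p_asim_given_bad = F(1, 3)
p_bad = F(3, 6)

# Bayes' theorem, exactly as in the equation above.
p_good_given_asim = (p_asim_given_good * p_good) / (
    p_asim_given_good * p_good + p_asim_given_bad * p_bad
)
print(p_good_given_asim)  # 1/2
```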
This makes sense because we have two statements with "Asim" in both, so the probability should be 1/2.
What about the statement "Asim is good"?
We calculate the probability for each word:

P(Good|Asim) = 1/2
P(Good|is) = 1/2
P(Good|good) = 1

and assign each result a weighting of 1/3, giving (1/2 + 1/2 + 1) × 1/3 = 2/3. The "naive" bit is the assumption that each word is independent of the others. We all know they are not, but it just kinda works.
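The weighted combination described above works out like this (the equal 1/3 weighting per word is the scheme from the text, not necessarily the exact formula NLTK uses internally):

```python
from fractions import Fraction as F

# Per-word probabilities that the statement "Asim is good" is Good.
p_good_given = {"Asim": F(1, 2), "is": F(1, 2), "good": F(1, 1)}

# Naively treat the words as independent and weight each one equally (1/3).
score = sum(p_good_given.values()) * F(1, 3)
print(score)  # 2/3
```

A score above 1/2 means the statement leans Good, which matches intuition: two neutral words and one clearly positive one.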