Introduction to Natural Language Processing in Python


What is Natural Language Processing?

  • Field of study focused on making sense of language
    • Using statistics and computers
  • You will learn the basics of NLP
    • Topic identification
    • Text classification
  • NLP applications include:
    • Chatbots
    • Translation
    • Sentiment analysis
    • ... and many more!

What exactly are regular expressions?

  • Strings with a special syntax
  • Allow us to match patterns in other strings
  • Applications of regular expressions:
    • Find all web links in a document
    • Parse email addresses, remove/replace unwanted characters
In [1]:
import re

# match a pattern with a string 
re.match('abc', 'abcdef')
Out[1]:
<_sre.SRE_Match object; span=(0, 3), match='abc'>
In [2]:
# match a word with a string 
word_regex = r'\w+'
re.match(word_regex, 'hi there!')
Out[2]:
<_sre.SRE_Match object; span=(0, 2), match='hi'>

Common regex patterns

Pattern   Matches           Example
\w+       word              'Magic'
\d        digit             9
\s        space             ' '
.*        wildcard          'username74'
+ or *    greedy match      'aaaaa'
\S        not space         'no_spaces'
[a-z]     lowercase group   'abcdefg'
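
A quick way to check these patterns is to try them interactively; the snippet below is a small sketch (the test string is made up) showing what a few of them return:

import re

test = "Magic 8 ball, username74!"
print(re.findall(r"\w+", test))     # words: ['Magic', '8', 'ball', 'username74']
print(re.findall(r"\d", test))      # single digits: ['8', '7', '4']
print(re.findall(r"\S+", test))     # runs of non-space characters
print(re.findall(r"[a-z]+", test))  # runs of lowercase letters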

Python's re Module

  • split: split a string on regex
  • findall: find all patterns in a string
  • search: search for a pattern
  • match: match a pattern at the start of a string
  • Pattern goes first, the string second
  • May return a list of strings, an iterator, or a match object, depending on the function
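
The difference between match and search trips many people up: match only succeeds if the pattern matches at the start of the string, while search scans the whole string. A minimal sketch:

import re

print(re.match('cd', 'abcde'))     # None - 'cd' does not occur at the start
print(re.search('cd', 'abcde'))    # a match object spanning characters 2-4
print(re.split(r'\s', 'a b c'))    # ['a', 'b', 'c']
print(re.findall(r'\d', 'a1b2'))   # ['1', '2']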

Practicing regular expressions: re.split() and re.findall()

Here, we'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters.

Note: It's important to prefix your regex patterns with r to ensure that they are interpreted the way you intend. Otherwise, you may run into problems with escape sequences in strings. For example, "\n" in Python is used to indicate a new line, but if you use the r prefix, it will be interpreted as the raw string "\n" - that is, the character "\" followed by the character "n" - and not as a new line.
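
For instance, without the r prefix Python collapses the backslash and the n into a single newline character before re ever sees the pattern:

print(len('\n'))    # 1 - a single newline character
print(len(r'\n'))   # 2 - a backslash followed by the letter n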

In [3]:
my_string = 'Let\'s write RegEx!  Won\'t that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?'
print(my_string)
Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?
In [4]:
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))
["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']
In [5]:
# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))
['Let', 'RegEx', 'Won', 'Can', 'Or']
In [6]:
# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
In [7]:
# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))
['4', '19']

What is tokenization?

  • Turning a string or document into tokens (smaller chunks), for example converting sentences into individual words


  • One step in preparing a text for NLP
  • Many different theories and rules
  • You can create your own rules using regular expressions
  • Some examples:
    • Breaking out words or sentences
    • Separating punctuation
    • Separating all hashtags in a tweet

Why tokenize?

  • Easier to map part of speech
  • Matching common words
  • Removing unwanted tokens
  • "I don't like Sam's shoes."
  • "I", "do", "n't", "like", "Sam", "'s", "shoes", "."

Other nltk tokenizers

  • sent_tokenize: tokenize a document into sentences
  • regexp_tokenize: tokenize a string or document based on a regular expression pattern
  • TweetTokenizer: special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points!!!

Word tokenization with NLTK

Here, we'll be using the first scene of Monty Python's Holy Grail.

Our job in this exercise is to utilize word_tokenize and sent_tokenize from nltk.tokenize to tokenize both words and sentences from Python strings - in this case, the first scene of Monty Python's Holy Grail.

In [8]:
scene_one = list()
with open('grail.txt', 'r') as f:
    text = f.readlines()
    for line in text:
        if 'SCENE 2' not in str(line): 
            line = line.rstrip('\n')
            scene_one.append(str(line))
        else:    
            break
In [9]:
print(scene_one[:5])
['SCENE 1: [wind] [clop clop clop] ', 'KING ARTHUR: Whoa there!  [clop clop clop] ', 'SOLDIER #1: Halt!  Who goes there?', 'ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!', 'SOLDIER #1: Pull the other one!']
In [10]:
# convert a list to a string 
scene_one = ' '.join(scene_one)
scene_one
Out[10]:
"SCENE 1: [wind] [clop clop clop]  KING ARTHUR: Whoa there!  [clop clop clop]  SOLDIER #1: Halt!  Who goes there? ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England! SOLDIER #1: Pull the other one! ARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master. SOLDIER #1: What?  Ridden on a horse? ARTHUR: Yes! SOLDIER #1: You're using coconuts! ARTHUR: What? SOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together. ARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through-- SOLDIER #1: Where'd you get the coconuts? ARTHUR: We found them. SOLDIER #1: Found them?  In Mercea?  The coconut's tropical! ARTHUR: What do you mean? SOLDIER #1: Well, this is a temperate zone. ARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land? SOLDIER #1: Are you suggesting coconuts migrate? ARTHUR: Not at all.  They could be carried. SOLDIER #1: What?  A swallow carrying a coconut? ARTHUR: It could grip it by the husk! SOLDIER #1: It's not a question of where he grips it!  It's a simple question of weight ratios!  A five ounce bird could not carry a one pound coconut. ARTHUR: Well, it doesn't matter.  Will you go and tell your master that Arthur from the Court of Camelot is here. SOLDIER #1: Listen.  In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right? ARTHUR: Please! SOLDIER #1: Am I right? ARTHUR: I'm not interested! SOLDIER #2: It could be carried by an African swallow! SOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.  That's my point. SOLDIER #2: Oh, yeah, I agree with that. ARTHUR: Will you ask your master if he wants to join my court at Camelot?! SOLDIER #1: But then of course a-- African swallows are non-migratory. SOLDIER #2: Oh, yeah... SOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop]  SOLDIER #2: Wait a minute!  Supposing two swallows carried it together? SOLDIER #1: No, they'd have to have it on a line. SOLDIER #2: Well, simple!  They'd just use a strand of creeper! SOLDIER #1: What, held under the dorsal guiding feathers? SOLDIER #2: Well, why not?"
In [11]:
# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)
{'coconut', 'do', 'together', 'since', 'Are', 'must', 'matter', 'Pendragon', 'martin', 'them', "'d", 'Where', 'will', 'with', 'knights', 'are', 'but', 'wind', 'under', 'this', 'who', 'these', '2', 'suggesting', '#', 'of', 'Am', 'wants', 'maybe', 'pound', 'ridden', 'have', 'your', 'fly', 'sovereign', 'castle', 'minute', 'dorsal', 'ARTHUR', 'go', 'carried', 'court', 'goes', 'swallows', 'land', 'grips', 'kingdom', 'our', 'simple', 'a', 'feathers', 'Listen', 'seek', '?', 'son', 'covered', 'plover', 'does', 'in', 'velocity', 'SCENE', 'tropical', 'all', 'by', 'they', 'needs', 'here', 'trusty', 'air-speed', 'Supposing', 'We', 'not', 'lord', 'yeah', 'That', 'carrying', 'breadth', 'tell', 'interested', 'Whoa', 'on', "n't", '--', 'ounce', 'temperate', 'point', 'No', 'why', 'using', 'me', 'In', 'my', 'Wait', 'snows', 'winter', 'order', 'got', 'anyway', 'coconuts', 'migrate', 'it', 'or', 'if', 'creeper', ',', 'zone', 'may', 'climes', 'mean', 'right', 'They', 'that', 'Found', 'beat', 'European', 'horse', 'husk', 'strangers', 'England', 'two', 'question', 'Who', 'Court', 'guiding', 'you', 'Well', 'But', "'m", 'he', 'KING', "'s", 'just', 'ratios', 'and', 'The', 'Will', 'swallow', 'south', 'through', 'there', '[', 'Ridden', 'is', 'SOLDIER', 'use', 'Oh', 'You', 'one', 'from', 'other', 'carry', 'Yes', "'", 'its', ']', 'Britons', 'found', 'every', 'bangin', 'grip', 'search', 'Arthur', 'halves', 'Mercea', 'A', 'then', 'I', 'yet', 'warmer', 'be', 'Camelot', 'second', 'agree', 'master', 'join', 'non-migratory', '1', 'at', "'re", 'wings', 'course', 'get', 'What', 'So', 'sun', 'back', 'where', 'could', 'clop', 'speak', 'bring', '!', 'house', "'ve", 'five', 'Uther', 'defeator', 'Not', 'am', 'African', 'Halt', 'servant', 'empty', 'Patsy', 'ask', 'length', 'bird', 'Pull', '...', 'line', 'held', 'Please', 'It', "'em", 'Saxons', 'King', 'to', 'maintain', 'times', '.', 'weight', 'forty-three', 'an', 'strand', ':', 'the'}

More regex with re.search()

Here, we'll utilize re.search() and re.match() to find specific tokens. Both search and match expect regex patterns, similar to those we defined previously. We'll apply these regex library methods to the same Monty Python text from the nltk corpora.

In [12]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())
580 588
In [13]:
# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))
<_sre.SRE_Match object; span=(9, 2240), match="[wind] [clop clop clop]  KING ARTHUR: Whoa there!>
In [14]:
# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w]+:"
print(re.match(pattern2, sentences[3]))
<_sre.SRE_Match object; span=(0, 7), match='ARTHUR:'>

Regex with NLTK tokenization

Twitter is a frequently used source for NLP text and tasks. In this exercise, we'll build a more complex tokenizer for tweets with hashtags and mentions using nltk and regex. The nltk.tokenize.TweetTokenizer class gives you some extra methods and attributes for parsing tweets.

Here, we're given some example tweets to parse using both TweetTokenizer and regexp_tokenize from the nltk.tokenize module.

In [15]:
tweets = ['This is the best #nlp exercise ive found online! #python', 
          '#NLP is super fun! <3 #learning', 
          'Thanks @datacamp :) #nlp #python']
In [16]:
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)
['#nlp', '#python']
In [17]:
# Write a pattern that matches both mentions (@) and hashtags
pattern2 = r"([@|#]\w+)"

# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[-1], pattern2)
print(mentions_hashtags)
['@datacamp', '#nlp', '#python']
In [18]:
# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)
[['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@datacamp', ':)', '#nlp', '#python']]

Non-ASCII tokenization

In this exercise, we'll practice advanced tokenization by tokenizing some non-ASCII text. We'll be using German with emoji!

In [19]:
german_text = 'Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕'
print(german_text)
Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕
In [20]:
# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)
['Wann', 'gehen', 'wir', 'Pizza', 'essen', '?', '🍕', 'Und', 'fährst', 'du', 'mit', 'Über', '?', '🚕']
In [21]:
# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))
['Wann', 'Pizza', 'Und', 'Über']
In [22]:
# Tokenize and print only emoji
emoji = "['\U0001F300-\U0001F5FF'|'\U0001F600-\U0001F64F'|'\U0001F680-\U0001F6FF'|'\u2600-\u26FF\u2700-\u27BF']"
print(regexp_tokenize(german_text, emoji))
print(german_text)
['🍕', '🚕']
Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕

Charting practice

We will find and chart the number of words per line in the script using matplotlib.

In [23]:
holy_grail = list()
with open('grail.txt', 'r') as f:
    text = f.readlines()
    for line in text:
        line = line.rstrip('\n')
        holy_grail.append(str(line))
In [24]:
import matplotlib.pyplot as plt 

# Remove the speaker name (e.g. 'ARTHUR:', 'SOLDIER #1:') from each script line
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in holy_grail]

# Tokenize each line: tokenized_lines
tokenized_lines = [regexp_tokenize(s, r"\w+") for s in lines]

# Make a frequency list of lengths: line_num_words
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Plot a histogram of the line lengths
plt.hist(line_num_words)

# Show the plot
plt.show()

Word counts with bag-of-words

Bag-of-words

Bag-of-words treats every unique word across the available text documents as a feature. The feature vector for a document is then an array of counts: each feature value is simply how many times that word occurs in the document, and zero if the word does not appear. Word order is therefore ignored; only the number of occurrences matters.

  • Basic method for finding topics in a text
  • Need to first create tokens using tokenization
  • ... and then count up all the tokens
  • The more frequent a word, the more important it might be
  • Can be a great way to determine the significant words in a text
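
As a toy illustration (the sentence is made up), counting tokens with collections.Counter is all a basic bag-of-words needs:

from collections import Counter
from nltk.tokenize import word_tokenize

toy = "The cat box is on the box"
toy_bow = Counter(word_tokenize(toy.lower()))
print(toy_bow.most_common(2))   # [('the', 2), ('box', 2)] - both occur twice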


Building a Counter with bag-of-words

In this exercise, we'll build our first bag-of-words counter using a Wikipedia article, which we'll load into the variable article.

In [25]:
article = list()
with open('./Wikipedia articles/wiki_text_debugging.txt', 'r') as f:
    text = f.readlines()
    for line in text:
        line = line.rstrip('\n')
        article.append(str(line))
In [26]:
article = ' '.join(article)
bs = r"['\\]"
article = re.split(bs, article)
article = ''.join(article)
article
Out[26]:
'Debugging is the process of finding and resolving of defects that prevent correct operation of computer software or a system.    Numerous books have been written about debugging (see below: #Further reading|Further reading), as it involves numerous aspects, including interactive debugging, control flow, integration testing, Logfile|log files, monitoring (Application monitoring|application, System Monitoring|system), memory dumps, Profiling (computer programming)|profiling, Statistical Process Control, and special design tactics to improve detection while simplifying changes.  Origin A computer log entry from the Mark&nbsp;II, with a moth taped to the page  The terms "bug" and "debugging" are popularly attributed to Admiral Grace Hopper in the 1940s.[http://foldoc.org/Grace+Hopper Grace Hopper]  from FOLDOC While she was working on a Harvard Mark II|Mark II Computer at Harvard University, her associates discovered a moth stuck in a relay and thereby impeding operation, whereupon she remarked that they were "debugging" the system. However the term "bug" in the meaning of technical error dates back at least to 1878 and Thomas Edison (see software bug for a full discussion), and "debugging" seems to have been used as a term in aeronautics before entering the world of computers. Indeed, in an interview Grace Hopper remarked that she was not coining the term{{Citation needed|date=July 2015}}. The moth fit the already existing terminology, so it was saved.  A letter from J. Robert Oppenheimer (director of the WWII atomic bomb "Manhattan" project at Los Alamos, NM) used the term in a letter to Dr. Ernest Lawrence at UC Berkeley, dated October 27, 1944,http://bancroft.berkeley.edu/Exhibits/physics/images/bigscience25.jpg regarding the recruitment of additional technical staff.  The Oxford English Dictionary entry for "debug" quotes the term "debugging" used in reference to airplane engine testing in a 1945 article in the Journal of the Royal Aeronautical Society. An article in "Airforce" (June 1945 p.&nbsp;50) also refers to debugging, this time of aircraft cameras.  Hoppers computer bug|bug was found on September 9, 1947. The term was not adopted by computer programmers until the early 1950s. The seminal article by GillS. Gill, [http://www.jstor.org/stable/98663 The Diagnosis of Mistakes in Programmes on the EDSAC], Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, Vol. 206, No. 1087 (May 22, 1951), pp. 538-554 in 1951 is the earliest in-depth discussion of programming errors, but it does not use the term "bug" or "debugging". In the Association for Computing Machinery|ACMs digital library, the term "debugging" is first used in three papers from 1952 ACM National Meetings.Robert V. D. Campbell, [http://portal.acm.org/citation.cfm?id=609784.609786 Evolution of automatic computation], Proceedings of the 1952 ACM national meeting (Pittsburgh), p 29-32, 1952.Alex Orden, [http://portal.acm.org/citation.cfm?id=609784.609793 Solution of systems of linear inequalities on a digital computer], Proceedings of the 1952 ACM national meeting (Pittsburgh), p. 91-95, 1952.Howard B. Demuth, John B. Jackson, Edmund Klein, N. Metropolis, Walter Orvedahl, James H. Richardson, [http://portal.acm.org/citation.cfm?id=800259.808982 MANIAC], Proceedings of the 1952 ACM national meeting (Toronto), p. 13-16 Two of the three use the term in quotation marks. 
By 1963 "debugging" was a common enough term to be mentioned in passing without explanation on page 1 of the Compatible Time-Sharing System|CTSS manual.[http://www.bitsavers.org/pdf/mit/ctss/CTSS_ProgrammersGuide.pdf The Compatible Time-Sharing System], M.I.T. Press, 1963  Kidwells article Stalking the Elusive Computer BugPeggy Aldrich Kidwell, [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&arnumber=728224&isnumber=15706 Stalking the Elusive Computer Bug], IEEE Annals of the History of Computing, 1998. discusses the etymology of "bug" and "debug" in greater detail.  Scope As software and electronic systems have become generally more complex, the various common debugging techniques have expanded with more methods to detect anomalies, assess impact, and schedule software patches or full updates to a system. The words "anomaly" and "discrepancy" can be used, as being more neutral terms, to avoid the words "error" and "defect" or "bug" where there might be an implication that all so-called errors, defects or bugs must be fixed (at all costs). Instead, an impact assessment can be made to determine if changes to remove an anomaly (or discrepancy) would be cost-effective for the system, or perhaps a scheduled new release might render the change(s) unnecessary. Not all issues are life-critical or mission-critical in a system. Also, it is important to avoid the situation where a change might be more upsetting to users, long-term, than living with the known problem(s) (where the "cure would be worse than the disease"). Basing decisions of the acceptability of some anomalies can avoid a culture of a "zero-defects" mandate, where people might be tempted to deny the existence of problems so that the result would appear as zero defects. Considering the collateral issues, such as the cost-versus-benefit impact assessment, then broader debugging techniques will expand to determine the frequency of anomalies (how often the same "bugs" occur) to help assess their impact to the overall system.  Tools Debugging on video game consoles is usually done with special hardware such as this Xbox (console)|Xbox debug unit intended for developers.  Debugging ranges in complexity from fixing simple errors to performing lengthy and tiresome tasks of data collection, analysis, and scheduling updates.  The debugging skill of the programmer can be a major factor in the ability to debug a problem, but the difficulty of software debugging varies greatly with the complexity of the system, and also depends, to some extent, on the programming language(s) used and the available tools, such as debuggers. Debuggers are software tools which enable the programmer to monitor the execution (computers)|execution of a program, stop it, restart it, set breakpoints, and change values in memory. The term debugger can also refer to the person who is doing the debugging.  Generally, high-level programming languages, such as Java (programming language)|Java, make debugging easier, because they have features such as exception handling that make real sources of erratic behaviour easier to spot. In programming languages such as C (programming language)|C or assembly language|assembly, bugs may cause silent problems such as memory corruption, and it is often difficult to see where the initial problem happened. In those cases, memory debugging|memory debugger tools may be needed.  In certain situations, general purpose software tools that are language specific in nature can be very useful.  
These take the form of List of tools for static code analysis|static code analysis tools.  These tools look for a very specific set of known problems, some common and some rare, within the source code.  All such issues detected by these tools would rarely be picked up by a compiler or interpreter, thus they are not syntax checkers, but more semantic checkers.  Some tools claim to be able to detect 300+ unique problems. Both commercial and free tools exist in various languages.  These tools can be extremely useful when checking very large source trees, where it is impractical to do code walkthroughs.  A typical example of a problem detected would be a variable dereference that occurs before the variable is assigned a value.  Another example would be to perform strong type checking when the language does not require such.  Thus, they are better at locating likely errors, versus actual errors.  As a result, these tools have a reputation of false positives.  The old Unix Lint programming tool|lint program is an early example.  For debugging electronic hardware (e.g., computer hardware) as well as low-level software (e.g., BIOSes, device drivers) and firmware, instruments such as oscilloscopes, logic analyzers or in-circuit emulator|in-circuit emulators (ICEs) are often used, alone or in combination.  An ICE may perform many of the typical software debuggers tasks on low-level software and firmware.  Debugging process  Normally the first step in debugging is to attempt to reproduce the problem. This can be a non-trivial task, for example as with Parallel computing|parallel processes or some unusual software bugs. Also, specific user environment and usage history can make it difficult to reproduce the problem.  After the bug is reproduced, the input of the program may need to be simplified to make it easier to debug. For example, a bug in a compiler can make it Crash (computing)|crash when parsing some large source file. However, after simplification of the test case, only few lines from the original source file can be sufficient to reproduce the same crash. Such simplification can be made manually, using a Divide and conquer algorithm|divide-and-conquer approach. The programmer will try to remove some parts of original test case and check if the problem still exists. When debugging the problem in a Graphical user interface|GUI, the programmer can try to skip some user interaction from the original problem description and check if remaining actions are sufficient for bugs to appear.  After the test case is sufficiently simplified, a programmer can use a debugger tool to examine program states (values of variables, plus the call stack) and track down the origin of the problem(s). Alternatively, Tracing (software)|tracing can be used. In simple cases, tracing is just a few print statements, which output the values of variables at certain points of program execution.{{citation needed|date=February 2016}}   Techniques   Interactive debugging  {{visible anchor|Print debugging}} (or tracing) is the act of watching (live or recorded) trace statements, or print statements, that indicate the flow of execution of a process. This is sometimes called {{visible anchor|printf debugging}}, due to the use of the printf function in C. This kind of debugging was turned on by the command TRON in the original versions of the novice-oriented BASIC programming language. TRON stood for, "Trace On." TRON caused the line numbers of each BASIC command line to print as the program ran.  
Remote debugging is the process of debugging a program running on a system different from the debugger. To start remote debugging, a debugger connects to a remote system over a network. The debugger can then control the execution of the program on the remote system and retrieve information about its state.  Post-mortem debugging is debugging of the program after it has already Crash (computing)|crashed. Related techniques often include various tracing techniques (for example,[http://www.drdobbs.com/tools/185300443 Postmortem Debugging, Stephen Wormuller, Dr. Dobbs Journal, 2006]) and/or analysis of memory dump (or core dump) of the crashed process. The dump of the process could be obtained automatically by the system (for example, when process has terminated due to an unhandled exception), or by a programmer-inserted instruction, or manually by the interactive user.  "Wolf fence" algorithm: Edward Gauss described this simple but very useful and now famous algorithm in a 1982 article for communications of the ACM as follows: "Theres one wolf in Alaska; how do you find it? First build a fence down the middle of the state, wait for the wolf to howl, determine which side of the fence it is on. Repeat process on that side only, until you get to the point where you can see the wolf."<ref name="communications of the ACM">{{cite journal | title="Pracniques: The "Wolf Fence" Algorithm for Debugging", | author=E. J. Gauss | year=1982}} This is implemented e.g. in the Git (software)|Git version control system as the command git bisect, which uses the above algorithm to determine which Commit (data management)|commit introduced a particular bug.  Delta Debugging{{snd}} a technique of automating test case simplification.Andreas Zeller: <cite>Why Programs Fail: A Guide to Systematic Debugging</cite>, Morgan Kaufmann, 2005. ISBN 1-55860-866-4{{rp|p.123}}<!-- for redirect from Saff Squeeze -->  Saff Squeeze{{snd}} a technique of isolating failure within the test using progressive inlining of parts of the failing test.[http://www.threeriversinstitute.org/HitEmHighHitEmLow.html Kent Beck, Hit em High, Hit em Low: Regression Testing and the Saff Squeeze]  Debugging for embedded systems In contrast to the general purpose computer software design environment, a primary characteristic of embedded environments is the sheer number of different platforms available to the developers (CPU architectures, vendors, operating systems and their variants). Embedded systems are, by definition, not general-purpose designs: they are typically developed for a single task (or small range of tasks), and the platform is chosen specifically to optimize that application. Not only does this fact make life tough for embedded system developers, it also makes debugging and testing of these systems harder as well, since different debugging tools are needed in different platforms.  to identify and fix bugs in the system (e.g. logical or synchronization problems in the code, or a design error in the hardware); to collect information about the operating states of the system that may then be used to analyze the system: to find ways to boost its performance or to optimize other important characteristics (e.g. energy consumption, reliability, real-time response etc.).  
Anti-debugging Anti-debugging is "the implementation of one or more techniques within computer code that hinders attempts at reverse engineering or debugging a target process".<ref name="veracode-antidebugging">{{cite web |url=http://www.veracode.com/blog/2008/12/anti-debugging-series-part-i/ |title=Anti-Debugging Series - Part I |last=Shields |first=Tyler |date=2008-12-02 |work=Veracode |accessdate=2009-03-17}} It is actively used by recognized publishers in copy protection|copy-protection schemas, but is also used by malware to complicate its detection and elimination.<ref name="soft-prot">[http://people.seas.harvard.edu/~mgagnon/software_protection_through_anti_debugging.pdf Software Protection through Anti-Debugging Michael N Gagnon, Stephen Taylor, Anup Ghosh] Techniques used in anti-debugging include: API-based: check for the existence of a debugger using system information Exception-based: check to see if exceptions are interfered with Process and thread blocks: check whether process and thread blocks have been manipulated Modified code: check for code modifications made by a debugger handling software breakpoints Hardware- and register-based: check for hardware breakpoints and CPU registers Timing and latency: check the time taken for the execution of instructions Detecting and penalizing debugger<ref name="soft-prot" /><!-- reference does not exist -->  An early example of anti-debugging existed in early versions of Microsoft Word which, if a debugger was detected, produced a message that said: "The tree of evil bears bitter fruit. Now trashing program disk.", after which it caused the floppy disk drive to emit alarming noises with the intent of scaring the user away from attempting it again.<ref name="SecurityEngineeringRA">{{cite book | url=http://www.cl.cam.ac.uk/~rja14/book.html | author=Ross J. Anderson | title=Security Engineering | isbn = 0-471-38922-6 | page=684 }}<ref name="toastytech">{{cite web | url=http://toastytech.com/guis/word1153.html | title=Microsoft Word for DOS 1.15}}'
In [27]:
# Import Counter
from collections import Counter

# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))
[(',', 151), ('the', 150), ('.', 89), ('of', 81), ('to', 63), ('a', 60), ('in', 44), ('and', 41), ("''", 41), ('debugging', 40)]

Simple text preprocessing


Why preprocess?

  • Helps make for better input data
  • When performing machine learning or other statistical methods


  • Stemming is the process of reducing inflected words to their stem, i.e. root. It doesn't have to be morphologically correct; you can simply chop off word endings. For example, "solv" is the stem of the words "solve" and "solved".
  • Lemmatization is another approach to removing inflection: it determines the part of speech and uses a detailed database of the language (such as WordNet) to return the dictionary form of each word.


  • Stop Words: A stop word is a commonly used word in any natural language (such as “the”, “a”, “an”, “in”).

Examples:

  • Tokenization to create a bag of words
  • Lowercasing words
  • Lemmatization/Stemming
    • Shorten words to their root stems
  • Removing stop words, punctuation, or unwanted tokens
  • Good to experiment with different approaches
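
A minimal sketch of stemming, lemmatization and stop word removal, assuming the NLTK stopwords and wordnet data have already been downloaded:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('solved'))                    # 'solv' - crude suffix stripping
print(lemmatizer.lemmatize('feet'))              # 'foot' - dictionary-based lookup
print('the' in set(stopwords.words('english')))  # True - 'the' is a stop word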

Text preprocessing practice

Now, we will apply the techniques we've learned to help clean up text for better NLP results. We'll need to remove stop words and non-alphabetic characters, lemmatize, and perform a new bag-of-words on our cleaned text.

In [28]:
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))

# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only (removes punctuation)
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))
[('debugging', 40), ('system', 25), ('software', 16), ('bug', 16), ('problem', 15), ('tool', 15), ('computer', 14), ('process', 13), ('term', 13), ('debugger', 13)]

Introduction to gensim

What is gensim?

  • Popular open-source NLP library
  • Uses top academic models to perform complex tasks
    • Building document or word vectors
    • Performing topic identification and document comparison

What is a word vector?


The problem with BoW and tf-idf

  • Consider the three sentences:
    • 'I am happy'
    • 'I am joyous'
    • 'I am sad'
  • If we were to compute the similarities, 'I am happy' and 'I am joyous' would score exactly the same as 'I am happy' and 'I am sad', regardless of how we vectorize the sentences.
  • This is because 'happy', 'joyous' and 'sad' are treated as completely different words, even though we know 'happy' and 'joyous' are more similar to each other than to 'sad'. This is something BoW and tf-idf techniques simply cannot capture (see the sketch below).
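
Here is a self-contained check of that claim, building count vectors with the standard library and comparing them with cosine similarity:

from collections import Counter
from math import sqrt

# Cosine similarity between two bag-of-words Counters
def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den

happy, joyous, sad = (Counter(s.split()) for s in
                      ('I am happy', 'I am joyous', 'I am sad'))

print(cosine(happy, joyous))  # 0.666...
print(cosine(happy, sad))     # 0.666... - identical: BoW cannot tell them apart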

Word vectors (also called word embeddings) represent each word numerically in such a way that the vector corresponds to how that word is used or what it means in a coordinate system. Vector encodings are learned by considering the context in which the words appear. Words that appear in similar contexts will have similar vectors and are placed closer together in the vector space.

  • For example, vectors for "leopard", "lion", and "tiger" will be close together, while they'll be far away from "planet" and "castle". This is great for capturing meaning.
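
gensim can learn such vectors with Word2Vec. The sketch below trains on a tiny made-up corpus purely to show the API; meaningful embeddings need far more text (or a pre-trained model) before 'happy' actually lands near 'joyous':

from gensim.models import Word2Vec

# Each document is a list of tokens; this toy corpus is purely illustrative
sentences = [['i', 'am', 'happy'], ['i', 'am', 'joyous'], ['i', 'am', 'sad']]

model = Word2Vec(sentences, min_count=1)
print(model.wv['happy'][:5])                   # first 5 dimensions of the learned vector
print(model.wv.similarity('happy', 'joyous'))  # near-random on a corpus this small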

Combining and preprocessing all Wikipedia articles into a list

In [29]:
from os import listdir
from os.path import isfile, join

mypath = './Wikipedia articles'
filenames = [f for f in listdir(mypath) if isfile(join(mypath, f))]
filenames
Out[29]:
['wiki_text_bug.txt',
 'wiki_text_crash.txt',
 'wiki_text_reversing.txt',
 'wiki_text_hopper.txt',
 'wiki_text_language.txt',
 'wiki_text_program.txt',
 'wiki_text_debugging.txt',
 'wiki_text_software.txt',
 'wiki_text_computer.txt',
 'wiki_text_debugger.txt',
 'wiki_text_malware.txt',
 'wiki_text_exception.txt']
In [30]:
articles = []

for fname in filenames:
    article = list()
    with open(f'{mypath}/{fname}', 'r') as f:
        text = f.readlines()
        for line in text:
            line = line.rstrip('\n')
            article.append(str(line))
        
    article = ' '.join(article)
    bs = r"['\\]"
    article = re.split(bs, article)
    article = ''.join(article)

    # Tokenize the article: tokens
    tokens = word_tokenize(article)

    # Convert the tokens into lowercase: lower_tokens
    lower_tokens = [t.lower() for t in tokens]

    # Retain alphabetic words: alpha_only (removes punctuation)
    alpha_only = [t for t in lower_tokens if t.isalpha()]

    # Remove all stop words: no_stops
    article = [t for t in alpha_only if t not in english_stops]
    articles.append(article)

Creating and querying a corpus with gensim

It's time to create our first gensim dictionary and corpus!

  • The dictionary is a mapping of words to integer ids
  • The corpus simply counts the number of occurrences of each distinct word, converts the word to its integer word id, and returns the result as a sparse vector.

We'll use these data structures to investigate word trends and potentially interesting topics in our document set. To get started, we'll import additional messy articles from Wikipedia, preprocess them by lowercasing all words, tokenizing them, and removing stop words and punctuation, and store the resulting token lists in a list called articles. Then we'll generate the gensim dictionary and corpus.

In [31]:
# Import Dictionary
from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary with an id for each token
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create a gensim corpus: a bag-of-words (doc2bow) representation of each article
corpus = [dictionary.doc2bow(article) for article in articles]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])
computer
[(2, 2), (4, 1), (16, 1), (20, 1), (25, 2), (44, 1), (48, 1), (49, 6), (51, 1), (54, 1)]

Gensim bag-of-words

Now, we'll use our new gensim corpus and dictionary to see the most common terms per document and across all documents. We can use our dictionary to look up the terms.

We will use the Python defaultdict and itertools to help with the creation of intermediate data structures for analysis.

defaultdict allows us to initialize a dictionary that will assign a default value to non-existent keys. By supplying the argument int, we are able to ensure that any non-existent keys are automatically assigned a default value of $0$. This makes it ideal for storing the counts of words in this exercise.

itertools.chain.from_iterable() allows us to iterate through a set of sequences as if they were one continuous sequence. Using this function, we can easily iterate through our corpus object (which is a list of lists).

In [32]:
from collections import defaultdict
import itertools

# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
    
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count
    
# Create a sorted list from the defaultdict: sorted_word_count 
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

print()
# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)
language 54
programming 39
languages 30
code 22
computer 15

computer 589
software 452
cite 322
ref 259
code 235

Finding the Optimum Number of Topics

Latent Dirichlet allocation (LDA)

LDA is used to classify text in a document to a particular topic. It builds a topic per document model and words per topic model. LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. It posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics.

  • Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
  • LDA assumes that every chunk of text we feed into it contains words that are somehow related. Therefore choosing the right corpus of data is important.
  • It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.


Now we can run a batch LDA (because of the small size of the dataset that we are working with) to discover the main topics in our articles.

In [33]:
import warnings
warnings.filterwarnings('ignore')

from gensim.models.ldamodel import LdaModel
import pyLDAvis.gensim as gensimvis
import pyLDAvis
In [34]:
# fit LDA model
topics = LdaModel(corpus=corpus,
                  id2word=dictionary,
                  num_topics=5,
                  passes=10)
In [35]:
# print out first 5 topics
for i, topic in enumerate(topics.print_topics(5)):
    print('\n{} --- {}'.format(i, topic))
0 --- (0, '0.022*"malware" + 0.019*"ref" + 0.016*"cite" + 0.015*"computer" + 0.012*"software" + 0.008*"system" + 0.008*"security" + 0.006*"used" + 0.005*"may" + 0.005*"programs"')

1 --- (1, '0.028*"software" + 0.011*"engineering" + 0.011*"computer" + 0.009*"reverse" + 0.008*"system" + 0.007*"debugging" + 0.007*"program" + 0.006*"used" + 0.005*"cite" + 0.005*"code"')

2 --- (2, '0.025*"computer" + 0.011*"cite" + 0.011*"computers" + 0.009*"hopper" + 0.007*"first" + 0.006*"computing" + 0.006*"grace" + 0.006*"ref" + 0.005*"machine" + 0.004*"program"')

3 --- (3, '0.020*"exception" + 0.015*"programming" + 0.014*"language" + 0.013*"code" + 0.012*"exceptions" + 0.012*"computer" + 0.010*"handling" + 0.009*"program" + 0.008*"languages" + 0.008*"ref"')

4 --- (4, '0.015*"bug" + 0.014*"bugs" + 0.013*"software" + 0.011*"may" + 0.009*"computer" + 0.008*"code" + 0.006*"program" + 0.005*"cite" + 0.004*"ref" + 0.004*"errors"')

Visualizing topics with pyLDAvis

The display of inferred topics shown above is not very interpretable. The LDAvis R library, developed by Kenny Shirley and Carson Sievert, is an interactive visualization designed to help interpret the topics in a topic model fit to a corpus of text using LDA.

Here, we use pyLDAvis, the Python port of the LDAvis R library. Two great features of pyLDAvis are that it helps interpret the topics extracted from a fitted LDA model and that it can be easily embedded in a Jupyter notebook.

In [36]:
vis_data = gensimvis.prepare(topics, corpus, dictionary)
pyLDAvis.display(vis_data)
Out[36]:

Term frequency - inverse document frequency (Tf-idf) with gensim

What is tf-idf?

Tf–idf is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

  • Allows you to determine the most important words in each document
  • Each corpus may have shared words beyond just stopwords
  • These words should be down-weighted in importance
  • Ensures most common words don't show up as key words
  • Keeps document-specific frequent words weighted high, and words common across the entire corpus weighted low

Tf-idf formula:

$$ w_{i,j} = \mathrm{tf}_{i,j} \ast \log\left(\frac{N}{\mathrm{df}_{i}}\right)$$
  • $w_{i,j}$ = tf-idf weight for token $i$ in document $j$
  • $\mathrm{tf}_{i,j}$ = number of occurrences of token $i$ in document $j$
  • $\mathrm{df}_{i}$ = number of documents that contain token $i$
  • $N$ = total number of documents

For example, if I am an astronomer, "sky" might be used often but is not distinctive, so I want to down-weight that word. Tf-idf weights can help determine good topics and keywords for a corpus with a shared vocabulary.
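
As a worked example with made-up numbers: if a token occurs 5 times in a document, the corpus has 100 documents, and 10 of them contain the token, the weight is 5 * log(100/10):

from math import log

tf = 5     # occurrences of the token in this document (hypothetical)
N = 100    # total number of documents in the corpus (hypothetical)
df = 10    # number of documents containing the token (hypothetical)

print(tf * log(N / df))   # 11.51...

Note that gensim's TfidfModel uses a base-2 logarithm and normalizes each document vector by default, so its exact numbers will differ.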

Tf-idf with Wikipedia

Now we will determine new significant terms for our corpus by applying gensim's tf-idf.

In [35]:
# Import TfidfModel
from gensim.models.tfidfmodel import TfidfModel

# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[:5])

print()
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)
[(2, 0.05480063960024175), (4, 0.0036036134191403646), (16, 0.0173037733192008), (20, 0.013700159900060438), (25, 0.05480063960024175)]

compiled 0.21714239695479498
abstraction 0.21714239695479498
compilation 0.2124863975732396
eiffel 0.17707199797769965
intermediate 0.16440191880072524

Named Entity Recognition (NER)


Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. NER is used in many fields in Natural Language Processing (NLP), and it can help answer many real-world questions, such as:

  • Which companies were mentioned in the news article?
  • Were specified products mentioned in complaints or reviews?
  • Does the tweet contain the name of a person? Does the tweet contain this person’s location?

NER with NLTK

We're now going to have some fun with named-entity recognition! We will look at a scraped news article and use nltk to find the named entities in the article.

In [36]:
import pandas as pd

df = pd.read_table(f'./News articles/uber_apple.txt', header=None)
text = []
for i in range(df.shape[0]):
    article = df.values[i][0]
    text.append(str(article))
article = ' '.join(text)
In [37]:
article
Out[37]:
'The taxi-hailing company Uber brings into very sharp focus the question of whether corporations can be said to have a moral character. If any human being were to behave with the single-minded and ruthless greed of the company, we would consider them sociopathic. Uber wanted to know as much as possible about the people who use its service, and those who don’t. It has an arrangement with unroll.me, a company which offered a free service for unsubscribing from junk mail, to buy the contacts unroll.me customers had had with rival taxi companies. Even if their email was notionally anonymised, this use of it was not something the users had bargained for. Beyond that, it keeps track of the phones that have been used to summon its services even after the original owner has sold them, attempting this with Apple’s phones even thought it is forbidden by the company. Uber has also tweaked its software so that regulatory agencies that the company regarded as hostile would, when they tried to hire a driver, be given false reports about the location of its cars. Uber management booked and then cancelled rides with a rival taxi-hailing company which took their vehicles out of circulation. Uber deny this was the intention. The punishment for this behaviour was negligible. Uber promised not to use this “greyball” software against law enforcement – one wonders what would happen to someone carrying a knife who promised never to stab a policeman with it. Travis Kalanick of Uber got a personal dressing down from Tim Cook, who runs Apple, but the company did not prohibit the use of the app. Too much money was at stake for that. Millions of people around the world value the cheapness and convenience of Uber’s rides too much to care about the lack of drivers’ rights or pay. Many of the users themselves are not much richer than the drivers. The “sharing economy” encourages the insecure and exploited to exploit others equally insecure to the profit of a tiny clique of billionaires. Silicon Valley’s culture seems hostile to humane and democratic values. The outgoing CEO of Yahoo, Marissa Mayer, who is widely judged to have been a failure, is likely to get a $186m payout. This may not be a cause for panic, any more than the previous hero worship should have been a cause for euphoria. Yet there’s an urgent political task to tame these companies, to ensure they are punished when they break the law, that they pay their taxes fairly and that they behave responsibly.'
In [38]:
import nltk

# Tokenize the article into sentences: sentences
sentences = nltk.sent_tokenize(article)

# Tokenize each sentence into words: token_sentences
token_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences] 

# Create the named entity chunks: chunked_sentences
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)

# Print each chunk (subtree) that carries an 'NE' label
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)
(NE Uber/NNP)
(NE Beyond/NN)
(NE Apple/NNP)
(NE Uber/NNP)
(NE Uber/NNP)
(NE Travis/NNP Kalanick/NNP)
(NE Tim/NNP Cook/NNP)
(NE Apple/NNP)
(NE Silicon/NNP Valley/NNP)
(NE CEO/NNP)
(NE Yahoo/NNP)
(NE Marissa/NNP Mayer/NNP)

Charting practice

In this exercise, we'll use some extracted named entities and their groupings from a series of newspaper articles to chart the diversity of named entity types in the articles.

We'll use a defaultdict called ner_categories, with keys representing every named entity group type and values counting the number of each different named entity type. We have a chunked sentence list called chunked_sentences, similar to the last exercise, but this time with non-binary category names.

We will use hasattr() to determine if each chunk has a 'label' and then simply use the chunk's .label() method as the dictionary key.

In [39]:
df = pd.read_table(f'./News articles/articles.txt', header=None)
text = []
for i in range(df.shape[0]):
    article = df.values[i][0]
    text.append(str(article))
articles = ' '.join(text)
In [40]:
# Tokenize the article into sentences: sentences
sentences = nltk.sent_tokenize(articles)

# Tokenize each sentence into words: token_sentences
token_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences] 

# Create the named entity chunks: chunked_sentences
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=False)
In [41]:
# Create the defaultdict: ner_categories
ner_categories = defaultdict(int)

# Create the nested for loop
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1
            
# Create a list from the dictionary keys for the chart labels: labels
labels = list(ner_categories.keys())

# Create a list of the values: values
values = [ner_categories.get(v) for v in labels]
In [42]:
# Create the pie chart
fig, ax = plt.subplots(figsize=(10, 5), subplot_kw=dict(aspect="equal"))

explode = (0.015, 0.05, 0.025, 0.1, 0.2)

wedges, texts, autotexts = ax.pie(values, labels=labels, autopct='%1.1f%%', startangle=140,
                                  textprops=dict(color="w"), explode=explode, shadow=True)

ax.legend(wedges, labels,
          title="NER Labels",
          loc="center left",
          bbox_to_anchor=(1, 0.2, 0.1, 1))

plt.setp(autotexts, size=12, weight="bold")

ax.set_title("Named Entity Recognition")

plt.show()

Introduction to SpaCy

What is SpaCy?

  • NLP library similar to gensim, with different implementations
  • Focus on creating NLP pipelines to generate models and corpora
  • Open-source, with extra libraries and tools

Why use SpaCy for NER?

  • Easy pipeline creation
  • Different entity types compared to nltk
  • Informal language corpora
    • Easily find entities in Tweets and chat messages

SpaCy entity types:

TYPE DESCRIPTION
PERSON People, including fictional.
NORP Nationalities or religious or political groups.
FAC Buildings, airports, highways, bridges, etc.
ORG Companies, agencies, institutions, etc.
GPE Countries, cities, states.
LOC Non-GPE locations, mountain ranges, bodies of water.
PRODUCT Objects, vehicles, foods, etc. (Not services.)
EVENT Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART Titles of books, songs, etc.
LAW Named documents made into laws.
LANGUAGE Any named language.
DATE Absolute or relative dates or periods.
TIME Times smaller than a day.
PERCENT Percentage, including "%".
MONEY Monetary values, including unit.
QUANTITY Measurements, as of weight or distance.
ORDINAL “first”, “second”, etc.
CARDINAL Numerals that do not fall under another type.

Comparing NLTK with spaCy NER

We'll now see the results using spaCy's NER annotator. To minimize execution times, we'll specify the keyword arguments tagger=False, parser=False, matcher=False when loading the spaCy model, because we only care about the entities in this exercise.

In [43]:
df = pd.read_table(f'./News articles/uber_apple.txt', header=None)
text = []
for i in range(df.shape[0]):
    article = df.values[i][0]
    text.append(str(article))
article = ' '.join(text)
In [44]:
# Import spacy
import spacy

# Instantiate the English model: nlp
nlp = spacy.load('en_core_web_sm', tagger=False, parser=False, matcher=False)

# Create a new document: doc
doc = nlp(article)

# Print all of the found entities and their labels
for ent in doc.ents:
    print(f'{ent.label_}:', ent.text)
ORG: Apple
PERSON: Travis Kalanick
PERSON: Tim Cook
ORG: Apple
CARDINAL: Millions
PERSON: Uber
LOC: Silicon Valley
ORG: Yahoo
PERSON: Marissa Mayer
MONEY: 186

Visualize NER in raw text

Let’s use displacy.render to generate the raw markup.

In [45]:
from spacy import displacy

displacy.render(doc, jupyter=True, style='ent')
The taxi-hailing company Uber brings into very sharp focus the question of whether corporations can be said to have a moral character. If any human being were to behave with the single-minded and ruthless greed of the company, we would consider them sociopathic. Uber wanted to know as much as possible about the people who use its service, and those who don’t. It has an arrangement with unroll.me, a company which offered a free service for unsubscribing from junk mail, to buy the contacts unroll.me customers had had with rival taxi companies. Even if their email was notionally anonymised, this use of it was not something the users had bargained for. Beyond that, it keeps track of the phones that have been used to summon its services even after the original owner has sold them, attempting this with Apple ORG ’s phones even thought it is forbidden by the company. Uber has also tweaked its software so that regulatory agencies that the company regarded as hostile would, when they tried to hire a driver, be given false reports about the location of its cars. Uber management booked and then cancelled rides with a rival taxi-hailing company which took their vehicles out of circulation. Uber deny this was the intention. The punishment for this behaviour was negligible. Uber promised not to use this “greyball” software against law enforcement – one wonders what would happen to someone carrying a knife who promised never to stab a policeman with it. Travis Kalanick PERSON of Uber got a personal dressing down from Tim Cook PERSON , who runs Apple ORG , but the company did not prohibit the use of the app. Too much money was at stake for that. Millions CARDINAL of people around the world value the cheapness and convenience of Uber PERSON ’s rides too much to care about the lack of drivers’ rights or pay. Many of the users themselves are not much richer than the drivers. The “sharing economy” encourages the insecure and exploited to exploit others equally insecure to the profit of a tiny clique of billionaires. Silicon Valley LOC ’s culture seems hostile to humane and democratic values. The outgoing CEO of Yahoo ORG , Marissa Mayer PERSON , who is widely judged to have been a failure, is likely to get a $ 186 MONEY m payout. This may not be a cause for panic, any more than the previous hero worship should have been a cause for euphoria. Yet there’s an urgent political task to tame these companies, to ensure they are punished when they break the law, that they pay their taxes fairly and that they behave responsibly.

Multilingual NER with polyglot

What is polyglot?

  • NLP library which uses word vectors
  • Why polyglot?
    • Vectors for many different languages
    • More than 130!

French NER with polyglot I

In this exercise and the next, we'll use the polyglot library to identify French entities. The library works a little differently from spaCy, so we'll use a few new steps to display the named entity text and category.
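
Note that polyglot needs its language-specific resources downloaded once before the code below will run. A minimal sketch using polyglot's downloader, assuming the French embeddings and NER models are the ones required:

from polyglot.downloader import downloader

# One-time setup: download French word embeddings and the French NER model
downloader.download('embeddings2.fr')
downloader.download('ner2.fr')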

In [46]:
# Read the French article and join its rows into a single string
df = pd.read_table('./News articles/french.txt', header=None)
text = []
for i in range(df.shape[0]):
    article = df.values[i][0]
    text.append(str(article))
article = ' '.join(text)
In [47]:
from polyglot.text import Text

# Create a new text object using Polyglot's Text class: txt
txt = Text(article)
    
# Display the entities found in the text
txt.entities
Out[47]:
[I-PER(['Charles', 'Cuvelliez']),
 I-PER(['Charles', 'Cuvelliez']),
 I-ORG(['Bruxelles']),
 I-PER(['l’IA']),
 I-PER(['Julien', 'Maldonato']),
 I-ORG(['Deloitte']),
 I-PER(['Ethiquement']),
 I-LOC(['l’IA']),
 I-PER(['.'])]

French NER with polyglot II

Here, we'll complete the work we began in the previous exercise.

Our task is to use a list comprehension to create a list of tuples, in which the first element is the entity tag, and the second element is the full string of the entity text.

In [48]:
# Create the list of tuples: entities
entities = [(ent.tag, ' '.join(ent)) for ent in txt.entities]

# Print the entities
print(entities)
[('I-PER', 'Charles Cuvelliez'), ('I-PER', 'Charles Cuvelliez'), ('I-ORG', 'Bruxelles'), ('I-PER', 'l’IA'), ('I-PER', 'Julien Maldonato'), ('I-ORG', 'Deloitte'), ('I-PER', 'Ethiquement'), ('I-LOC', 'l’IA'), ('I-PER', '.')]
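
As a quick follow-up (not part of the original exercise), collections.Counter gives the distribution of entity tags in the same list:

from collections import Counter

# Count how many entities carry each tag
tag_counts = Counter(tag for tag, _ in entities)
print(tag_counts)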

French NER with polyglot III

We'll continue our exploration of polyglot now with some French annotation.

Our specific task is to determine how many of the entities contain the words "Charles" or "Cuvelliez" - these refer to the same person in different ways!

In [49]:
# Initialize the count variable: count
count = 0

# Iterate over all the entities
for ent in txt.entities:
    # Check whether the entity contains 'Charles' or 'Cuvelliez'
    if "Charles" in ent or "Cuvelliez" in ent:
        # Increment count
        count += 1

# Print count
print(count)

# Calculate the percentage of entities that refer to Charles Cuvelliez: percentage
percentage = count * 1.0 / len(txt.entities)
print(f'{round(percentage,2)*100}%')
2
22.0%

Classifying fake news using supervised learning with NLP

Supervised learning with NLP

  • Need to use language features instead of geometric features
  • scikit-learn: a powerful open-source machine learning library
  • How to create supervised learning data from text?
    • Use bag-of-words models or tf-idf as features
In [51]:
df = pd.read_csv('fake_or_real_news.csv').drop('Unnamed: 0', axis=1)
df.head()
Out[51]:
title text label
0 You Can Smell Hillary’s Fear Daniel Greenfield, a Shillman Journalism Fello... FAKE
1 Watch The Exact Moment Paul Ryan Committed Pol... Google Pinterest Digg Linkedin Reddit Stumbleu... FAKE
2 Kerry to go to Paris in gesture of sympathy U.S. Secretary of State John F. Kerry said Mon... REAL
3 Bernie supporters on Twitter erupt in anger ag... — Kaydee King (@KaydeeKing) November 9, 2016 T... FAKE
4 The Battle of New York: Why This Primary Matters It's primary day in New York and front-runners... REAL

CountVectorizer for text classification

CountVectorizer converts a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

Alt text that describes the graphic
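
To make this concrete, here is a minimal sketch on a made-up three-document corpus (the documents below are illustrative assumptions, not part of the news data): each row of the resulting matrix is a document and each column counts one vocabulary term.

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
toy_docs = ['the spaceship landed',
            'the alien saw the spaceship',
            'the senate passed the bill']

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Vocabulary terms (columns) and the dense count matrix (rows = documents)
# Note: newer scikit-learn releases rename this method to get_feature_names_out()
print(toy_vectorizer.get_feature_names())
print(toy_counts.toarray())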

It's time to begin building our text classifier!

In this exercise, we'll use pandas alongside scikit-learn to create a sparse text vectorizer we can use to train and test a simple supervised model. To begin, we'll set up a CountVectorizer and investigate some of its features.

In [56]:
# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.33, random_state=53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Transform the training data using only the 'text' column values: count_train 
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test data using only the 'text' column values: count_test 
count_test = count_vectorizer.transform(X_test)

# Print the first 10 features of the count_vectorizer
print(count_vectorizer.get_feature_names()[:10])
['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']

TfidfVectorizer for text classification

Similar to the sparse CountVectorizer created in the previous exercise, we'll work on creating tf-idf vectors for our documents. We'll set up a TfidfVectorizer and investigate some of its features.
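
tf-idf re-weights the raw counts: a term's weight grows with its frequency inside a document but shrinks with the number of documents that contain it, so ubiquitous words are down-weighted. The max_df=0.7 argument in the next cell additionally drops terms that appear in more than 70% of the training documents. A minimal sketch of the idf weights on a made-up corpus (the documents are illustrative assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only
toy_docs = ['the spaceship landed',
            'the alien saw the spaceship',
            'the senate passed the bill']

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_docs)

# 'the' occurs in every document, so it gets the smallest idf factor;
# terms seen in only one document, such as 'alien' or 'senate', get the largest
print(dict(zip(toy_tfidf.get_feature_names(), toy_tfidf.idf_.round(2))))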

In [57]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names()[:10])

# Print the first 10 rows of the tfidf training data as a dense array
print('\n', tfidf_train.A[:10])
['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']

 [[0.        0.        0.        ... 0.        0.        0.       ]
 [0.        0.        0.        ... 0.        0.        0.       ]
 [0.        0.        0.        ... 0.        0.        0.       ]
 ...
 [0.        0.        0.        ... 0.        0.        0.       ]
 [0.        0.0121467 0.        ... 0.        0.        0.       ]
 [0.        0.0165804 0.        ... 0.        0.        0.       ]]

Training and testing an NLP classification model with scikit-learn

Naive Bayes classifier

  • Commonly used as a baseline for NLP classification problems
  • Basis in probability
  • Given a particular piece of data, how likely is a particular outcome?

Examples:

  • If the plot has a spaceship, how likely is it to be sci-fi?
  • Given a spaceship and an alien, how likely now is it sci-fi?
  • Each word from CountVectorizer acts as a feature
  • Naive Bayes: Simple and effective (see the toy calculation below)
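
As a back-of-the-envelope illustration of the reasoning above, Bayes' rule gives P(sci-fi | spaceship) = P(spaceship | sci-fi) x P(sci-fi) / P(spaceship). The numbers below are invented purely to show the arithmetic:

# Hypothetical corpus statistics, invented for illustration only
p_scifi = 0.2                   # P(sci-fi): 20% of plots are sci-fi
p_spaceship_given_scifi = 0.6   # P('spaceship' | sci-fi)
p_spaceship_given_other = 0.02  # P('spaceship' | not sci-fi)

# Total probability of seeing 'spaceship' in a plot
p_spaceship = (p_spaceship_given_scifi * p_scifi
               + p_spaceship_given_other * (1 - p_scifi))

# Bayes' rule: how likely is sci-fi once we have seen 'spaceship'?
p_scifi_given_spaceship = p_spaceship_given_scifi * p_scifi / p_spaceship
print(round(p_scifi_given_spaceship, 2))  # ≈ 0.88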

Training and testing the "fake news" model with CountVectorizer

Now it's time to train the "fake news" model using the features we identified and extracted. In this first exercise we'll train and test a Naive Bayes model using the CountVectorizer data.

In [58]:
# Import the necessary modules
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import plot_confusion_matrix, confusion_matrix

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(f'Accuracy = {score}')

# Calculate and plot the confusion matrix: cm
plot_confusion_matrix(nb_classifier, count_test, y_test, normalize='true', labels=['FAKE', 'REAL'])
plt.show()
Accuracy = 0.893352462936394
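
Accuracy alone can hide per-class behaviour. As an optional follow-up (not part of the original exercise), scikit-learn's classification report shows precision, recall and F1 for each class on the same predictions:

from sklearn import metrics

# Per-class precision, recall and F1 for the count-based model
print(metrics.classification_report(y_test, pred))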

Training and testing the "fake news" model with TfidfVectorizer

Now that we have evaluated the model using the CountVectorizer, we'll do the same using the TfidfVectorizer with a Naive Bayes model.

In [60]:
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(f'Accuracy = {score}')

# Calculate and plot the confusion matrix: cm
plot_confusion_matrix(nb_classifier, tfidf_test, y_test, normalize='true', labels=['FAKE', 'REAL'])
plt.show()
Accuracy = 0.8565279770444764

Improving our model I

Our job in this exercise is to test a few different alpha values with the tf-idf vectors to determine whether there is a better-performing combination.
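
In MultinomialNB, alpha is the additive (Laplace/Lidstone) smoothing parameter: every word count is inflated by alpha before the class-conditional probabilities are estimated, so a word that never appeared with a class during training doesn't zero out the whole class. A minimal sketch of the smoothed estimate, with made-up counts purely for illustration:

# Hypothetical counts, for illustration only
count_word_in_class = 0       # word never seen in this class during training
total_words_in_class = 10000
vocabulary_size = 50000

def smoothed_prob(alpha):
    # P(word | class) with additive smoothing
    return (count_word_in_class + alpha) / (total_words_in_class + alpha * vocabulary_size)

print(smoothed_prob(0.0))   # 0.0 -> the whole document's probability collapses to zero
print(smoothed_prob(0.1))   # small but non-zero
print(smoothed_prob(1.0))   # classic Laplace smoothing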

In [61]:
import numpy as np 

# Create the list of alphas: alphas
alphas = np.arange(0, 1, .1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()
Alpha:  0.0
Score:  0.8813964610234337

Alpha:  0.1
Score:  0.8976566236250598

Alpha:  0.2
Score:  0.8938307030129125

Alpha:  0.30000000000000004
Score:  0.8900047824007652

Alpha:  0.4
Score:  0.8857006217120995

Alpha:  0.5
Score:  0.8842659014825442

Alpha:  0.6000000000000001
Score:  0.874701099952176

Alpha:  0.7000000000000001
Score:  0.8703969392635102

Alpha:  0.8
Score:  0.8660927785748446

Alpha:  0.9
Score:  0.8589191774270684

In [62]:
# Create a Multinomial Naive Bayes classifier with the best alpha
nb_classifier = MultinomialNB(alpha=0.1)

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(f'Accuracy = {score}')

# Calculate and plot the confusion matrix: cm
plot_confusion_matrix(nb_classifier, tfidf_test, y_test, normalize='true', labels=['FAKE', 'REAL'])
plt.show()
Accuracy = 0.8976566236250598

Improving our model II

Our job in this exercise is to test a few different settings of the TfidfVectorizer. Specifically, we'll vary the analyzer ('word' or 'char') and the ngram_range; a short sketch after the figure below shows what each analyzer produces.

  • Characters are the basic symbols used to write or print a language. For example, English uses the letters of the alphabet, numerals, punctuation marks and a variety of other symbols (e.g., the dollar sign and the arithmetic symbols).
  • An n-gram is a contiguous sequence of $n$ items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams are typically collected from a text or speech corpus.

Alt text that describes the graphic
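
To see what each analyzer actually produces before committing to the full grid, we can build the analyzer function from a vectorizer and apply it to a short string. A minimal sketch (the example string is arbitrary):

from sklearn.feature_extraction.text import TfidfVectorizer

sample = 'fake news'

# Word unigrams and bigrams
word_analyzer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2)).build_analyzer()
print(word_analyzer(sample))

# Character trigrams that respect word boundaries
char_analyzer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3)).build_analyzer()
print(char_analyzer(sample))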

In [63]:
import itertools

analyzers = ['word', 'char', 'char_wb']
ngrams = [(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)]

params = [analyzers, ngrams]
print('Parameters:')
print(params)

p_list = list(itertools.product(*params))
print('\nSearch space size = ', len(p_list))
print('\nParameters Space:', p_list)
Parameters:
[['word', 'char', 'char_wb'], [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]]

Search space size =  18

Parameters Space: [('word', (1, 1)), ('word', (1, 2)), ('word', (1, 3)), ('word', (2, 2)), ('word', (2, 3)), ('word', (3, 3)), ('char', (1, 1)), ('char', (1, 2)), ('char', (1, 3)), ('char', (2, 2)), ('char', (2, 3)), ('char', (3, 3)), ('char_wb', (1, 1)), ('char_wb', (1, 2)), ('char_wb', (1, 3)), ('char_wb', (2, 2)), ('char_wb', (2, 3)), ('char_wb', (3, 3))]
In [65]:
for i in p_list:
    
    analyzer = i[0]
    ngram = i[1]
    
    # Initialize a TfidfVectorizer object: tfidf_vectorizer
    tfidf_vectorizer = TfidfVectorizer(analyzer=analyzer, stop_words='english', max_df=0.7, ngram_range=ngram)

    # Transform the training data: tfidf_train 
    tfidf_train = tfidf_vectorizer.fit_transform(X_train)

    # Transform the test data: tfidf_test 
    tfidf_test = tfidf_vectorizer.transform(X_test)

    # Create a Multinomial Naive Bayes classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=0.1)

    # Fit the classifier to the training data
    nb_classifier.fit(tfidf_train, y_train)

    # Create the predicted tags: pred
    pred = nb_classifier.predict(tfidf_test)

    # Calculate the accuracy score: score
    score = metrics.accuracy_score(y_test, pred)
    print(f'Analyzer: {analyzer} \nngram: {ngram} \nAccuracy = {round(score,2)}')

    # Calculate and plot the confusion matrix: cm
    plot_confusion_matrix(nb_classifier, tfidf_test, y_test, normalize='true', labels=['FAKE', 'REAL'])  # doctest: +SKIP
    plt.show()
Analyzer: word 
ngram: (1, 1) 
Accuracy = 0.9
Analyzer: word 
ngram: (1, 2) 
Accuracy = 0.91
Analyzer: word 
ngram: (1, 3) 
Accuracy = 0.91
Analyzer: word 
ngram: (2, 2) 
Accuracy = 0.92
Analyzer: word 
ngram: (2, 3) 
Accuracy = 0.92
Analyzer: word 
ngram: (3, 3) 
Accuracy = 0.92
Analyzer: char 
ngram: (1, 1) 
Accuracy = 0.82
Analyzer: char 
ngram: (1, 2) 
Accuracy = 0.89
Analyzer: char 
ngram: (1, 3) 
Accuracy = 0.94
Analyzer: char 
ngram: (2, 2) 
Accuracy = 0.89
Analyzer: char 
ngram: (2, 3) 
Accuracy = 0.94
Analyzer: char 
ngram: (3, 3) 
Accuracy = 0.94
Analyzer: char_wb 
ngram: (1, 1) 
Accuracy = 0.73
Analyzer: char_wb 
ngram: (1, 2) 
Accuracy = 0.83
Analyzer: char_wb 
ngram: (1, 3) 
Accuracy = 0.92
Analyzer: char_wb 
ngram: (2, 2) 
Accuracy = 0.85
Analyzer: char_wb 
ngram: (2, 3) 
Accuracy = 0.92
Analyzer: char_wb 
ngram: (3, 3) 
Accuracy = 0.92
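
The manual loop above works, but the same search is more commonly expressed with scikit-learn's Pipeline and GridSearchCV, which cross-validates on the training data instead of repeatedly scoring the held-out test set. A hedged sketch, not part of the original exercise:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Vectorizer and classifier chained into one estimator
# (stop_words='english' could be added for the word analyzer; it is ignored for char analyzers)
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_df=0.7)),
    ('nb', MultinomialNB()),
])

# Same 18 analyzer/ngram combinations as the loop above, with alpha fixed at 0.1
param_grid = {
    'tfidf__analyzer': ['word', 'char', 'char_wb'],
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)],
    'nb__alpha': [0.1],
}

# 5-fold cross-validated search on the training split only
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 2))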