Regular expressions

In [2]:
import re
In [3]:
import requests
verne = requests.get('http://www.math.buffalo.edu/~badzioch/MTH337/_downloads/around_world.txt')
In [4]:
verne = verne.text
In [5]:
print(verne[:400])
AROUND THE WORLD IN EIGHTY DAYS



CONTENTS


CHAPTER

      I  IN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE
         ONE AS MASTER, THE OTHER AS MAN

     II  IN WHICH PASSEPARTOUT IS CONVINCED THAT HE HAS AT LAST FOUND
         HIS IDEAL

    III  IN WHICH A CONVERSATION TAKES PLACE WHICH SEEMS LIKELY TO COST
         PHILEAS FOGG DEAR

     IV  IN WHICH PHILEAS FOGG ASTOUNDS PA

Example Find all 5-letter words in the text:

\b = word boundary (matches the empty position between a word character and a non-word character)

\w = any word character (any letter a-z, A-Z, any digit, and the underscore)

In [6]:
re.findall(r'\b\w\w\w\w\w\b', verne)[:10]
Out[6]:
['WORLD',
 'WHICH',
 'OTHER',
 'OTHER',
 'WHICH',
 'FOUND',
 'IDEAL',
 'WHICH',
 'TAKES',
 'PLACE']
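
Because \w also matches digits and the underscore, the pattern above finds runs of five word characters rather than strictly five letters. The following is a minimal sketch of that behavior on a made-up sample string (the string is an assumption for illustration, not a passage from the novel):

```python
import re

# Made-up sample text, not taken from the novel.
sample = 'Fogg left at 8:45 with 20000 pounds, a time_table and a valet.'

# Five consecutive word characters between two word boundaries.
# The number 20000 qualifies because digits are word characters.
five_chars = re.findall(r'\b\w\w\w\w\w\b', sample)
print(five_chars)   # ['20000', 'valet']

# The underscore is a word character too, so time_table has no
# word boundary inside it and 'table' is not reported on its own.
print('table' in five_chars)   # False
```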

Example Find all words which start with "i" and end with "t":

The character * matches whatever precedes it 0 or more times, as many times as possible:

In [7]:
re.findall( r'\bi\w*t\b' ,verne)[:10]
Out[7]:
['it', 'it', 'it', 'it', 'it', 'itinerant', 'it', 'it', 'it', 'it']

The character + indicates that whatever immediately precedes it should be matched at least once, and as many times as possible:

In [8]:
re.findall(r'\bi\w+t\b', verne)[:10]
Out[8]:
['itinerant',
 'inhabit',
 'instant',
 'ingot',
 'ingot',
 'incident',
 'intelligent',
 'impatient',
 'important',
 'interrupt']

Braces after an item specify how many times it should be matched:

  • {x} = match exactly x times
  • {x,y} = match at least x and at most y times
  • {x,} = match x or more times
  • {,y} = match at most y times
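
The brace quantifiers can be tried out on a short string. This is a small sketch using a made-up sample (an assumption for illustration); it also shows that, in Python's re module, a space inside the braces stops them from acting as a quantifier:

```python
import re

# Made-up sample: the letter b followed by 0 to 5 copies of the letter a.
sample = 'b ba baa baaa baaaa baaaaa'

exactly_3 = re.findall(r'\bba{3}\b', sample)
print(exactly_3)     # ['baaa']

two_to_four = re.findall(r'\bba{2,4}\b', sample)
print(two_to_four)   # ['baa', 'baaa', 'baaaa']

four_or_more = re.findall(r'\bba{4,}\b', sample)
print(four_or_more)  # ['baaaa', 'baaaaa']

at_most_2 = re.findall(r'\bba{,2}\b', sample)
print(at_most_2)     # ['b', 'ba', 'baa']

# With a space after the comma the braces are read as literal text,
# so nothing in the sample matches.
with_space = re.findall(r'\bba{2, 4}\b', sample)
print(with_space)    # []
```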

Example Find all words with either 14 or 15 letters:

In [11]:
re.findall(r'\b\w{14,15}\b', verne)[:10]
Out[11]:
['Ecclesiastical',
 'mathematically',
 'physiognomists',
 'mathematically',
 'mathematically',
 'conscientiously',
 'conscientiously',
 'fortifications',
 'transportation',
 'reconnaissance']

Square brackets [] enclose a set of characters; the set matches any single character it contains:

Example Find all words that start with either a, b, or c and end with x:

In [22]:
re.findall(r'\b[abcABC][a-zA-Z]*[xX]\b', verne)
Out[22]:
['box',
 'apex',
 'box',
 'box',
 'Colfax',
 'Bordeaux',
 'Bordeaux',
 'Bordeaux',
 'Bordeaux',
 'Bordeaux',
 'Bordeaux',
 'box']

Example Find all words of at least 3 letters that consist only of the letters b, c, d, e, f, g:

In [26]:
re.findall(r'\b[b-g]{3,}\b', verne)
Out[26]:
['bed',
 'beef',
 'fee',
 'fed',
 'fed',
 'begged',
 'feed',
 'edge',
 'begged',
 'bed',
 'bed',
 'bed',
 'bed',
 'edge',
 'bed',
 'beg',
 'beef',
 'begged',
 'bed',
 'bed',
 'bed',
 'begged',
 'edge',
 'bed',
 'beg']

[^...] indicates a set of characters that should not be matched:

Example Find all words that start with p, end with y, and do not contain a, b, or c (note: \W matches anything except letters, digits, and the underscore; putting it in the excluded set keeps each match inside a single word):

In [28]:
re.findall(r'\bp[^abc\W]*y\b', verne)
Out[28]:
['porphyry',
 'portly',
 'positively',
 'politely',
 'politely',
 'plenty',
 'politely',
 'properly',
 'ply',
 'promptly',
 'prodigy',
 'purity',
 'positively',
 'pretty',
 'prosperously',
 'purposely',
 'positively',
 'profoundly',
 'persistently',
 'penny',
 'promontory',
 'prosperity',
 'pretty',
 'pretty',
 'purely',
 'ply',
 'plenty',
 'politely',
 'plentifully',
 'prodigiously',
 'politely',
 'promptly',
 'politely',
 'persistently',
 'plentifully',
 'proudly',
 'profoundly',
 'pity']
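
To see why \W belongs in the excluded set, it may help to compare the pattern with and without it. This is a minimal sketch on a made-up sample string (an assumption for illustration, not text from the novel):

```python
import re

# Made-up sample text, not taken from the novel.
sample = 'pretty penny, pay by proxy'

# Without \W in the excluded set, [^abc] also matches spaces and commas,
# so a single match can run across several words.
without_W = re.findall(r'\bp[^abc]*y\b', sample)
print(without_W)   # ['pretty penny', 'proxy']

# Adding \W to the set excludes every non-word character as well,
# so each match stays inside one word.
with_W = re.findall(r'\bp[^abc\W]*y\b', sample)
print(with_W)      # ['pretty', 'penny', 'proxy']
```

In both cases 'pay' is skipped because it contains the excluded letter a.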

\d matches any digit:

Example Find all numbers:

In [12]:
re.findall(r'\d+', verne)[:10]
Out[12]:
['1872', '7', '1814', '2', '2', '7', '13', '3', '13', '6']

The period character . matches any character except the newline, so the pattern below returns every line that contains at least one digit:

In [13]:
re.findall(r'.*\d+.*', verne)[:10]
Out[13]:
['Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, Burlington',
 'Gardens, the house in which Sheridan died in 1814.  He was one of the',
 'almost superhumanly prompt and regular.  On this very 2nd of October he',
 'this Wednesday, 2nd October, you are in my service."',
 '    Brindisi, by rail and steamboats .................  7 days',
 '  From Suez to Bombay, by steamer .................... 13  "',
 '  From Bombay to Calcutta, by rail ...................  3  "',
 '  From Calcutta to Hong Kong, by steamer ............. 13  "',
 '  From Hong Kong to Yokohama (Japan), by steamer .....  6  "',
 '  From Yokohama to San Francisco, by steamer ......... 22  "']

Note: + and * are greedy: they will match as many characters as possible:

In [31]:
s = 'breakfast will be served at 7:00-8:00 AM and coffee at 10:00-11:00 AM'
In [33]:
re.findall(r'\d.+AM', s)
Out[33]:
['7:00-8:00 AM and coffee at 10:00-11:00 AM']

*? and +? are the lazy versions of * and +: they will match as few characters as possible:

In [34]:
re.findall(r'\d.+?AM', s)
Out[34]:
['7:00-8:00 AM', '10:00-11:00 AM']

Regular expressions and the web

In [15]:
import requests
In [16]:
grads = requests.get('http://www.buffalo.edu/cas/math/people/grad-directory.html').text
In [20]:
print(grads[:500])
<!DOCTYPE HTML><html lang="en"><!-- cmspub01 1113-115118 -->
<head>
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
    <meta id="meta-viewport" name="viewport" content="width=device-width,initial-scale=1">
        <script>if (screen.width > 720 && screen.width < 960) document.getElementById('meta-viewport').setAttribute('content','width=960');</script>
    <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'g
In [21]:
re.findall(r'<p><b>[\w ,]+<.*\n.*\d{3}-\d{4}', grads)[:10]
Out[21]:
['<p><b>Alegria, Linda</b><br />\nOffice: 138 Phone: 645-8823',
 '<p><b>Aswani, Amy</b><br />\nOffice: 139 Phone: 645-8824',
 '<p><b>Bittner, Alyson</b><br />\nOffice: 130&nbsp;&nbsp;Phone: 645-8818',
 '<p><b>Cain Charles</b><br />\nOffice: 126 Phone: 645-8816',
 '<p><b>Casper, Michael<br />\n</b> Office: 222&nbsp; Phone: 645-8779',
 '<p><b>Chang, Hong<br />\n</b> Office: 136 Phone: 645-8821',
 '<p><b>Cheuk, Ka Yue<br />\n</b> Office: 140 Phone: 645-8825',
 '<p><b>Deutsch, Dustin<br />\n</b> Office: 140&nbsp; Phone: 645-8825',
 '<p><b>Dey, Subhankar</b><br />\nOffice: 140&nbsp; Phone: 645-8825',
 '<p><b>Doga, Hakan</b><br />\nOffice: 131&nbsp; Phone: 645-8819']

Parentheses can be used to select subpatterns that should be returned by re.findall(). When a pattern contains more than one group, each match is returned as a tuple:

In [22]:
re.findall(r'<p><b>([\w ,]+)<.*\n.*(\d{3}-\d{4})', grads)[:10]
Out[22]:
[('Alegria, Linda', '645-8823'),
 ('Aswani, Amy', '645-8824'),
 ('Bittner, Alyson', '645-8818'),
 ('Cain Charles', '645-8816'),
 ('Casper, Michael', '645-8779'),
 ('Chang, Hong', '645-8821'),
 ('Cheuk, Ka Yue', '645-8825'),
 ('Deutsch, Dustin', '645-8825'),
 ('Dey, Subhankar', '645-8825'),
 ('Doga, Hakan', '645-8819')]
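
The effect of groups on the result of re.findall() can be checked without downloading anything. This sketch uses a hypothetical HTML snippet that imitates the layout of the directory page (the names and phone numbers are invented), and it also shows one possible way to turn the resulting tuples into a dictionary:

```python
import re

# Hypothetical HTML imitating the directory layout; names and numbers are invented.
html = ('<p><b>Doe, Jane</b><br />\nOffice: 101 Phone: 555-0101</p>\n'
        '<p><b>Roe, Richard</b><br />\nOffice: 102&nbsp; Phone: 555-0102</p>\n')

# One group: findall returns a list of strings.
names = re.findall(r'<p><b>([\w ,]+)<', html)
print(names)       # ['Doe, Jane', 'Roe, Richard']

# Two groups: findall returns a list of tuples, one item per group.
entries = re.findall(r'<p><b>([\w ,]+)<.*\n.*(\d{3}-\d{4})', html)
print(entries)     # [('Doe, Jane', '555-0101'), ('Roe, Richard', '555-0102')]

# A list of 2-tuples converts directly into a dictionary.
phone_book = dict(entries)
print(phone_book['Roe, Richard'])   # 555-0102
```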

Retrieving PDF files using requests:

In [43]:
f = requests.get('http://curca.buffalo.edu/students/pdfs/2017_posters/072Slominski_Emily.pdf')
In [45]:
poster = f.content   # get the content of the PDF file as bytes
with open('poster.pdf', 'wb') as myfile:   # open file for writing binary data
    myfile.write(poster)   # write to the binary file; it is closed automatically

Project 8

In [24]:
import requests
t = requests.get('http://pharmacy.buffalo.edu/faculty-staff.html?CFC__target=6d2YOUp9DuKIjb191swYECKo4Ig-http%3A%2F%2Fwww.pharm.buffalo.edu%2FFaculty_Directory%2Fpages%2Fubcms_profile.php%3FID%3D26').text
In [25]:
print(t[:500])
<!DOCTYPE HTML><html lang="en" class="oldbrand"><!-- cmspub04 1113-115341 -->
<head>
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
    <meta id="meta-viewport" name="viewport" content="width=device-width,initial-scale=1">
        <script>if (screen.width > 720 && screen.width < 960) document.getElementById('meta-viewport').setAttribute('content','width=960');</script>
    <script>(function(w,d,s,l,i){w[l]=w[l]|

Saving data with json

In [5]:
import json
In [6]:
jfile = open('json_test.json', 'w')
In [7]:
mylist = [1, 2, 3, -7.5, [1,2], 'hello']
In [8]:
json.dump(mylist, jfile)
jfile.close()
In [9]:
jfile = open('json_test.json', 'r')
t = jfile.read()
In [10]:
t
Out[10]:
'[1, 2, 3, -7.5, [1, 2], "hello"]'
In [11]:
t[3]
Out[11]:
' '
In [12]:
jfile.seek(0)
Out[12]:
0
In [13]:
n = json.load(jfile)
In [14]:
n
Out[14]:
[1, 2, 3, -7.5, [1, 2], 'hello']
In [15]:
n[3]
Out[15]:
-7.5
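
The same round trip can be written with `with` blocks, which close the file automatically, and with json.dumps() / json.loads(), which work on strings instead of files. This is a minimal sketch; the file name json_sketch.json and its location in the temporary directory are assumptions made for illustration:

```python
import json
import os
import tempfile

data = [1, 2, 3, -7.5, [1, 2], 'hello']

# Hypothetical file name, placed in the system's temporary directory.
path = os.path.join(tempfile.gettempdir(), 'json_sketch.json')

# Write the list to a file; the file is closed when the block ends.
with open(path, 'w') as jfile:
    json.dump(data, jfile)

# Read it back as Python objects.
with open(path, 'r') as jfile:
    restored = json.load(jfile)
print(restored == data)   # True

# dumps/loads do the same conversion to and from a string.
text = json.dumps(data)
print(text)               # [1, 2, 3, -7.5, [1, 2], "hello"]
print(json.loads(text)[3])   # -7.5
```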