import re
import requests
verne = requests.get('http://www.math.buffalo.edu/~badzioch/MTH337/_downloads/around_world.txt')
verne = verne.text
print(verne[:400])
Example: Find all 5-letter words in the text:
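\b = word boundary (matches the position between a word character and a non-word character, or the start or end of the text next to a word character)
\w = any word character (any letter, any digit, and the underscore; for str patterns in Python 3 this includes non-ASCII letters)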
re.findall(r'\b\w\w\w\w\w\b', verne)[:10]
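The same pattern can be tried on a short string where the result is easy to check by hand. This is a minimal sketch; the sentence in `sample` is made up for illustration and is not taken from the novel. Note that `20000` is returned too, because digits count as word characters.

```python
import re

# Hypothetical sample sentence (not from the downloaded text)
sample = "Fogg lived alone at No. 7; he paid 20000 pounds."

# \b on both sides forces the five \w characters to be a whole word
five = re.findall(r'\b\w\w\w\w\w\b', sample)
print(five)  # ['lived', 'alone', '20000']
```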
Example: Find all words that start with "i" and end with "t":
The character * matches whatever precedes it 0 or more times, as many times as possible:
re.findall(r'\bi\w*t\b', verne)[:10]
The character + matches whatever immediately precedes it at least once, and as many times as possible:
re.findall(r'\bi\w+t\b', verne)[:10]
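The difference between * and + shows up on the two-letter word "it": with * the middle part may be empty, with + it may not. A small sketch on a made-up sentence (the string `sample` is an assumption, not part of the novel):

```python
import re

# Hypothetical sample sentence
sample = "it is in fact an important fit, I insist"

star = re.findall(r'\bi\w*t\b', sample)  # zero or more characters between i and t
plus = re.findall(r'\bi\w+t\b', sample)  # at least one character between i and t

print(star)  # ['it', 'important', 'insist']
print(plus)  # ['important', 'insist']
```

The word "fit" is skipped because \b requires the "i" to be at the start of a word, and "I" is skipped because matching is case-sensitive by default.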
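{x} = match exactly x times
{x,y} = match at least x and at most y times
{x,} = match x or more times
{,y} = match at most y times
(Write the braces without spaces: Python treats {x, y} as literal text rather than as a repetition count.)
Example: Find all words with either 14 or 15 letters: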
re.findall(r'\b\w{14,15}\b', verne)[:10]
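The brace forms can be compared side by side on a string whose words have known lengths. This is an illustrative sketch; `sample` is a made-up string, and `re.fullmatch` is used only to check the `{,y}` form on a whole string.

```python
import re

# Hypothetical sample: words of length 1 to 5
sample = "a bb ccc dddd eeeee"

exactly_three = re.findall(r'\b\w{3}\b', sample)   # exactly 3 characters
two_to_four = re.findall(r'\b\w{2,4}\b', sample)   # between 2 and 4 characters
four_or_more = re.findall(r'\b\w{4,}\b', sample)   # 4 or more characters

print(exactly_three)  # ['ccc']
print(two_to_four)    # ['bb', 'ccc', 'dddd']
print(four_or_more)   # ['dddd', 'eeeee']

# {,2} is the same as {0,2}: at most two repetitions
at_most_two_ok = re.fullmatch(r'a{,2}', 'aa') is not None
at_most_two_too_long = re.fullmatch(r'a{,2}', 'aaa') is not None
print(at_most_two_ok, at_most_two_too_long)  # True False
```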
Square brackets [] indicate a set of characters, any one of which may be matched at that position:
Example: Find all words that start with either a, b, or c and end with x (in either case):
re.findall(r'\b[abcABC][a-zA-Z]*[xX]\b', verne)
Example: Find all words of at least 3 letters that consist only of the lowercase letters b, c, d, e, f, g:
re.findall(r'\b[b-g]{3,}\b', verne)
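Both character-class patterns above can be checked on a short string. The sentence below is an assumption made up for this sketch; the patterns are the ones used on the novel.

```python
import re

# Hypothetical sample sentence
sample = "A box and a complex Bordeaux cafe; bed, fee, egg"

# First character from an explicit set, last character x or X
abc_to_x = re.findall(r'\b[abcABC][a-zA-Z]*[xX]\b', sample)

# A range inside brackets: b-g stands for b, c, d, e, f, g
only_b_to_g = re.findall(r'\b[b-g]{3,}\b', sample)

print(abc_to_x)     # ['box', 'complex', 'Bordeaux']
print(only_b_to_g)  # ['bed', 'fee', 'egg']
```

"cafe" is rejected by the second pattern because "a" lies outside the range b-g.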
[^]
indicates a set of characters that should not be matched:
Example: Find all words that start with p, end with y, and do not contain a, b, or c:
(Note: \W matches anything except letters, digits, and the underscore; including it in the excluded set keeps each match inside a single word.)
re.findall(r'\bp[^abc\W]*y\b', verne)
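To see why \W belongs in the excluded set, compare the pattern with and without it. This is a sketch on made-up strings (`sample` and `two_words` are assumptions for illustration): without \W, the negated set also accepts spaces, so a match can run from one word into the next.

```python
import re

# Hypothetical sample sentence
sample = "party pretty public policy penny; pay day"
strict = re.findall(r'\bp[^abc\W]*y\b', sample)
print(strict)  # ['pretty', 'penny']

# Without \W the set [^abc] also matches the space between words
two_words = "pet toy"
loose = re.findall(r'\bp[^abc]*y\b', two_words)
tight = re.findall(r'\bp[^abc\W]*y\b', two_words)
print(loose)  # ['pet toy']
print(tight)  # []
```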
\d matches any digit:
Example: Find all numbers (sequences of one or more digits):
re.findall(r'\d+', verne)[:10]
The period character . matches any character except a newline:
Example: Find all lines that contain at least one digit:
re.findall(r'.*\d+.*', verne)[:10]
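Because . stops at a newline, the pattern above returns whole lines that contain a digit. The sketch below checks this on a made-up three-line string. It also uses the `re.DOTALL` flag, which is not used elsewhere in these notes, to show what changes when . is allowed to match newlines as well.

```python
import re

# Hypothetical three-line sample; only the first and last lines contain digits
sample = "Chapter 1\nno digits here\nleft at 8:45 pm"

lines_with_digits = re.findall(r'.*\d+.*', sample)
print(lines_with_digits)  # ['Chapter 1', 'left at 8:45 pm']

# With re.DOTALL the period also matches '\n', so one match spans all lines
spanning = re.findall(r'.*\d+.*', sample, flags=re.DOTALL)
print(spanning)  # ['Chapter 1\nno digits here\nleft at 8:45 pm']
```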
Note: + and * are greedy: they match as many characters as possible:
s = 'breakfast will be served at 7:00-8:00 AM and coffee at 10:00-11:00 AM'
re.findall(r'\d.+AM', s)
*? and +? are the lazy versions of * and +: they match as few characters as possible:
re.findall(r'\d.+?AM', s)
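The same greedy/lazy distinction matters when picking tags out of HTML, which is what the next cells do. A minimal sketch, assuming a made-up fragment `text` with two bold elements:

```python
import re

# Hypothetical HTML fragment
text = '<b>one</b> and <b>two</b>'

greedy = re.findall(r'<b>.*</b>', text)   # .* runs to the last </b>
lazy = re.findall(r'<b>.*?</b>', text)    # .*? stops at the first </b>

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```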
import requests
grads = requests.get('http://www.buffalo.edu/cas/math/people/grad-directory.html').text
print(grads[:500])
re.findall(r'<p><b>[\w ,]+<.*\n.*\d{3}-\d{4}', grads)[:10]
Parentheses can be used to select subpatterns (groups) that should be returned by re.findall(); with more than one group, each match is returned as a tuple:
re.findall(r'<p><b>([\w ,]+)<.*\n.*(\d{3}-\d{4})', grads)[:10]
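The effect of the parentheses can be checked offline on a small string. The markup in `html` is invented for this sketch (the names and phone numbers are placeholders) and only imitates the shape the pattern expects; the real directory page may be structured differently.

```python
import re

# Hypothetical markup: a bold name on one line, a phone number on the next
html = ('<p><b>Doe, Jane</b><br>\n'
        'Office 12, Phone: 645-1234</p>\n'
        '<p><b>Smith, John</b><br>\n'
        'Phone: 645-5678</p>')

# No groups: findall returns the full text of each match
whole = re.findall(r'<p><b>[\w ,]+<.*\n.*\d{3}-\d{4}', html)

# Two groups: findall returns one (name, phone) tuple per match
pairs = re.findall(r'<p><b>([\w ,]+)<.*\n.*(\d{3}-\d{4})', html)

print(len(whole))  # 2
print(pairs)       # [('Doe, Jane', '645-1234'), ('Smith, John', '645-5678')]
```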
Retrieving PDF files using requests:
f = requests.get('http://curca.buffalo.edu/students/pdfs/2017_posters/072Slominski_Emily.pdf')
poster = f.content  # get the content of the PDF file as bytes
with open('poster.pdf', 'wb') as myfile:  # open file for writing binary data
    myfile.write(poster)  # write to the binary file; it is closed when the block ends
import requests
t = requests.get('http://pharmacy.buffalo.edu/faculty-staff.html?CFC__target=6d2YOUp9DuKIjb191swYECKo4Ig-http%3A%2F%2Fwww.pharm.buffalo.edu%2FFaculty_Directory%2Fpages%2Fubcms_profile.php%3FID%3D26').text
print(t[:500])
import json
mylist = [1, 2, 3, -7.5, [1, 2], 'hello']
with open('json_test.json', 'w') as jfile:
    json.dump(mylist, jfile)  # serialize the list to the file as JSON text
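jfile = open('json_test.json', 'r')
t = jfile.read()  # the raw file content: a single string of JSON text
t
t[3]  # indexing the string gives one character, not a list element
jfile.seek(0)  # rewind to the start of the file before reading it again
n = json.load(jfile)  # parse the JSON text back into a Python list
n
n[3]  # indexing the list gives the element -7.5
jfile.close()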
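The difference between the JSON text and the parsed object can also be seen without a file, using json.dumps and json.loads, the string counterparts of json.dump and json.load. A minimal sketch with the same list; the variable names `text` and `restored` are chosen here for illustration.

```python
import json

mylist = [1, 2, 3, -7.5, [1, 2], 'hello']

text = json.dumps(mylist)    # serialize to a JSON string
restored = json.loads(text)  # parse the string back into Python objects

print(text)           # [1, 2, 3, -7.5, [1, 2], "hello"]
print(repr(text[3]))  # ' '  (a character of the string)
print(restored[3])    # -7.5 (an element of the list)
print(restored == mylist)  # True
```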