This project takes data about meth lab busts from the DEA and creates a .kml file so that the data can be visualized in Google Earth.
I first had to make a list of URLs to tell my script where to get the data from. I did that using this script, which lists all the URLs on a page. The script is from Dive Into Python.
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen("http://www.usdoj.gov/dea/seizures/")
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url
I then selected only the URLs that were state names and saved them to a file called link.txt, then used a little awk script to generate the full URLs:
awk '{print "http://www.usdoj.gov/dea/seizures/"$1}' link.txt > links.txt
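For example, if link.txt had a state page name on each line, the output would be (these file names are made up for illustration):

link.txt:
alabama.html
alaska.html

links.txt:
http://www.usdoj.gov/dea/seizures/alabama.html
http://www.usdoj.gov/dea/seizures/alaska.html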
The script 'parseall.sh' is used to generate the kml file.
#!/bin/bash
# start over with a blank statedata.xml
rm statedata.xml
echo "<states></states>" > statedata.xml
# scrape each state page and add its data to statedata.xml
for link in `cat links.txt`; do
    ./htmlout.py $link
done
rm final.kml
echo "Generating KML"
./geocode.py
It runs htmlout.py, which grabs the data from the URL given as its first argument and builds up an XML file named statedata.xml from these sites.
#! /usr/bin/env python
# Grabs the data from the web and creates an XML doc of that data.
# statedata.xml must start out as a blank doc: "<states></states>"
import urllib2, re
from BeautifulSoup import BeautifulSoup
import amara
import sys

# get the url from an arg
url = sys.argv[1]

# parse the xml file that holds the data so far
# (the first time, this file needs to be "<states></states>")
doc = amara.parse('statedata.xml')
f = open('statedata.xml', 'w')
f.write(doc.xml())

print "parsing: %s" % url
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

# dict that will be used to count the meth busts per county
d = {}

# get the state name (it follows " -- " in the centered paragraph)
state = soup.find('p', align="center")
state = state.find(text=True).strip(" .")
state = re.split(" -- ", state)
state = state[1]

# create a new state element under states
doc.states.xml_append(doc.xml_create_element(u'state', attributes={u'sname': state}))

# set the location for soup to start reading in data;
# there is only one node with the text "COUNTY"
th_row = soup.find(text="COUNTY").findParent("tr")

# each subsequent row is an entry in the list
for td_row in th_row.findNextSiblings("tr"):
    county, city, address, s_date = td_row.findAll("td")
    # we only want the text from the cells, not the containing markup
    county = county.find(text=True).strip(" .")
    city = city.find(text=True).strip(" .")
    address = address.find(text=True).strip(" .")
    s_date = s_date.find(text=True).strip(" .")
    # count the number of incidents for each county using the dict
    if county in d:
        # already exists
        d[county] += 1
    else:
        # define new pair
        d[county] = 1

# add a county element for each entry in the dict
for key, value in d.items():
    # the int has to be converted to a unicode string
    newvalue = str(value)
    newvalue = newvalue.decode('utf-8')
    # add the county info to the last state element that exists
    e = doc.xml_create_element(u'county', attributes={u'cname': key.capitalize(), u'points': newvalue})
    doc.states.state[-1].xml_append(e)

f = open('statedata.xml', 'w')
f.write(doc.xml())
The statedata.xml file looks like this:

<states>
    <state sname="State Name">
        <county cname="County Name" points="datapoints" />
    </state>
</states>
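A quick way to sanity check the file is to walk it with amara, the same way geocode.py does below (a minimal sketch using only the element and attribute names shown above):

#! /usr/bin/env python
# minimal sketch: print every state/county/points entry in statedata.xml
import amara

doc = amara.parse('statedata.xml')
for state in doc.states.state:
    for county in state.county:
        print "%s, %s: %s" % (state.sname, county.cname, county.points)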
Then geocode.py is run. It loads usdata.xml, an XML file I found on the Keyhole BBS that has all the counties for each state. The script goes through each county of each state and looks in statedata.xml to see if there is any data for that county, then color codes the county based on the number of meth busts found there. Finally, it generates a file called final.kml, which can be loaded in Google Earth.
#! /usr/bin/env python
# US Counties v06.kml manipulation file:
# goes through the entire kml and adds in the data from the meth busts
import amara

# location of the kml with county data
doc = amara.parse('usdata.xml')
# location of xml with data points of meth busts
sd = amara.parse('statedata.xml')

# strip unnecessary string data for better string matching
def struni(name):
    name = str(name)
    name = name.lower()
    name = name.replace(' ', '')
    name = name.replace('-', '')
    return name

# find the points for a county in a given state
def pointfind(c, s):
    c = struni(c)
    s = struni(s)
    for state in sd.states.state:
        if struni(state.sname) == s:
            for county in state.county:
                if c == struni(county.cname):
                    return county.points
    return 0

# for each state go through each county and find the data for it
for states in doc.kml.Document.Folder.Folder.Folder:
    for counties in states.Placemark:
        # find the data point for this county in this state
        dp = int(pointfind(counties.name, states.name))
        # set the color style based on the number returned
        if dp == 1 or dp == 0:
            counties.styleUrl = u'1'
        if dp == 2:
            counties.styleUrl = u'2'
        if dp == 3 or dp == 4:
            counties.styleUrl = u'3'
        if 4 < dp < 20:
            counties.styleUrl = u'4'
        if dp > 20:
            counties.styleUrl = u'5'
        # add a description with the meth bust info
        desc = "<description>%s meth busts</description>" % dp
        counties.xml_append_fragment(desc)

# write the file out
f = open('final.kml', 'w')
f.write(doc.xml())
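After geocode.py runs, each county Placemark in final.kml should end up looking roughly like this (hypothetical county name and count; the polygon geometry carried over from usdata.xml is omitted, and the styleUrl values presumably refer to the five color styles defined in usdata.xml):

<Placemark>
    <name>Example County</name>
    <styleUrl>4</styleUrl>
    <!-- polygon geometry from usdata.xml -->
    <description>7 meth busts</description>
</Placemark>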
cparse.py – This is a script I used for debugging the output of htmlout.py
htmlout.py – Grabs data from websites and makes xml of data
statedata.xml – data about states generated by htmlout.py
geocode.py – Generates the kml and color codes counties based on the xml made by htmlout.py
usdata.xml – xml data that defines every state and county in the US
parseall.sh – this script automates the process of creating the kml
links.txt – a list of links to get data from
methmap.jpg – a screenshot, if you don't have Google Earth.