====== About ======
{{methmap.jpg |methmap}}
This project takes data about meth lab busts from the [[http://www.usdoj.gov/dea/seizures/|DEA]], then creates a .kml file so that the data can be visualized in Google Earth.
[[http://josh.gourneau.com/media/final.kmz|KMZ of methmap]]
[[http://www.ece.utk.edu/~jgournea/methmap.zip|The source code is available]]
====== Getting the links ======
I first had to make a list of URLs to tell my script where to get the data from. I did that using this script, which lists all the URLs on a page. The script is from [[http://www.diveintopython.org/html_processing/index.html|Dive Into Python]].
  from sgmllib import SGMLParser
  class URLLister(SGMLParser):
      def reset(self):
          SGMLParser.reset(self)
          self.urls = []
      def start_a(self, attrs):
          href = [v for k, v in attrs if k=='href']
          self.urls.extend(href)
  if __name__ == "__main__":
      import urllib
      usock = urllib.urlopen("http://www.usdoj.gov/dea/seizures/")
      parser = URLLister()
      parser.feed(usock.read())
      parser.close()
      usock.close()
      for url in parser.urls: print url
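
The script above depends on sgmllib, which was removed in Python 3. As a rough sketch of the same idea on a current interpreter, the standard library's html.parser can collect the link targets instead. The sample markup and the helper name list_urls are made up for illustration and are not part of the original project.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


def list_urls(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = URLLister()
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    # Hypothetical markup standing in for the downloaded index page.
    sample = '<p><a href="alabama.html">Alabama</a> <a href="index.html">Home</a></p>'
    for url in list_urls(sample):
        print(url)
```

To run it against a live page, the HTML string would come from urllib.request.urlopen(...).read().decode() rather than from the sample.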
I then selected only the URLs that were state names and saved them to a file called link.txt. A little awk script generates the full URL for each one:
  awk '{print "http://www.usdoj.gov/dea/seizures/"$1}' link.txt > links.txt
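
For anyone without awk at hand, the same prefixing step might look like this in Python. This is only a sketch: the function name build_links is an assumption, and unlike the awk one-liner it skips blank lines instead of emitting the bare base URL for them.

```python
BASE_URL = "http://www.usdoj.gov/dea/seizures/"


def build_links(lines, base=BASE_URL):
    """Prefix the first whitespace-separated field of each line with base."""
    links = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines (awk would print the bare base URL here)
            links.append(base + fields[0])
    return links
```

Reading link.txt with open("link.txt") and writing "\n".join(build_links(f)) to links.txt would reproduce the file that parseall.sh expects.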
====== Generating the kml ======
The script 'parseall.sh' is used to generate the kml file.
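  #!/bin/bash
  # start from a fresh statedata.xml; htmlout.py parses this file and appends to it
  rm -f statedata.xml
  echo "" > statedata.xml
  # fetch and parse every state page listed in links.txt
  for link in $(cat links.txt); do ./htmlout.py "$link"; done
  rm -f final.kml
  echo "Generating KML"
  ./geocode.py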
===== It does two things =====
==== htmlout.py ====
First, it runs htmlout.py once per URL. The script grabs the data from the URL given as its first argument and adds it to an XML file named statedata.xml.
  #! /usr/bin/env python
  #Grabs the data from the web and creates an XML doc of that data
  #must start with a blank xml doc ""
  import urllib2, re
  from BeautifulSoup import BeautifulSoup
  import amara
  import sys
  #get the url from an arg
  url = sys.argv[1]
  #load the xml file that holds the data gathered so far
  #note: the first time, this file needs to be
  # ""
  doc = amara.parse('statedata.xml')
  print "parsing: %s" % url
  page = urllib2.urlopen(url)
  soup = BeautifulSoup(page)
  #dict mapping each county name to its meth bust count
  d = {}
  #get the state name
  state = soup.find('p', align="center")
  state = state.find(text=True).strip(" .")
  state = re.split(" -- ", state)
  state = state[1]
  #create a new state element under states
  doc.states.xml_append(doc.xml_create_element(u'state',
      attributes={u'sname' : state}))
  # set the location for soup to start reading in data
  # there is only one node with the text "COUNTY"
  th_row = soup.find(text="COUNTY").findParent("tr")
  # each subsequent row is an entry in the list.
  for td_row in th_row.findNextSiblings("tr"):
      county, city, address, s_date = td_row.findAll("td")
      # we only want the text from the cells, not the containing markup.
      county = county.find(text=True).strip(" .")
      city = city.find(text=True).strip(" .")
      address = address.find(text=True).strip(" .")
      s_date = s_date.find(text=True).strip(" .")
      #count the number of incidents for each county
      d[county] = d.get(county, 0) + 1
  #add each county and its count to the xml
  for key, value in d.items():
      #the int has to be converted to a unicode string
      newvalue = unicode(value)
      #add the county info to the last state element that exists
      e = doc.xml_create_element(u'county',
          attributes={u'cname': key.capitalize(),
                      u'points' : newvalue})
      doc.states.state[-1].xml_append(e)
  #write the updated document back once the whole page has parsed
  f = open('statedata.xml','w')
  f.write(doc.xml())
  f.close()
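The statedata.xml file is laid out like this: a states root element holds one state element per state (with an sname attribute), and each state holds county elements with cname and points attributes.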
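
To make that layout concrete, here is a small sketch that counts busts per county and builds the same element and attribute names (states, state, sname, county, cname, points) that htmlout.py writes. It uses the standard library's ElementTree and Counter rather than amara, sorts the counties for stable output, and the state and county names are invented placeholders, not real data.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def add_state(states, state_name, counties):
    """Append a <state> element with one <county> child per distinct county."""
    state = ET.SubElement(states, "state", {"sname": state_name})
    counts = Counter(counties)  # county name -> number of rows seen
    for name in sorted(counts):
        ET.SubElement(state, "county",
                      {"cname": name.capitalize(), "points": str(counts[name])})
    return state


if __name__ == "__main__":
    states = ET.Element("states")
    # Hypothetical rows: one county name per seizure listed on a state page.
    add_state(states, "Example State", ["ADAMS", "BROWN", "ADAMS"])
    print(ET.tostring(states, encoding="unicode"))
```

Running it prints a single states element containing one state with two county children, which is the shape geocode.py reads back in.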
==== geocode.py ====
Then geocode.py is run. It loads usdata.xml, an XML file I found on the
[[http://bbs.keyhole.com/ubb/showthreaded.php/Cat/0/Number/661126/page/0/vc/1|Keyhole BBS]]
that lists all the counties for each state.
The script checks each county in each state and looks in statedata.xml for data on that county. It then color codes the county based on the number of meth busts found there.
Finally, it generates a file called final.kml, which can be loaded in Google Earth.
  #! /usr/bin/env python
  #US Counties v06.kml manip file
  #goes through the entire kml
  #adds in the data from the meth busts
  import amara
  #location of the kml with county data
  doc = amara.parse('usdata.xml')
  #location of xml with data points of meth busts
  sd = amara.parse('statedata.xml')
  #strip unnecessary characters for better string matching
  def struni(name):
      name = str(name)
      name = name.lower()
      name = name.replace(' ','')
      name = name.replace('-','')
      return name
  #find the points for a county for a given state
  def pointfind(c,s):
      c = struni(c)
      s = struni(s)
      for state in sd.states.state:
          if struni(state.sname) == s:
              for county in state.county:
                  if c == struni(county.cname):
                      return county.points
      #no data for this county
      return 0
  #for each state go through each county and find the data for it
  for states in doc.kml.Document.Folder.Folder.Folder:
      #print states.name
      for counties in states.Placemark:
          #find the data point for this county in this state
          dp = int(pointfind(counties.name,states.name))
          #set the color style based on the number returned
          if dp == 1 or dp == 0:
              counties.styleUrl = u'1'
          elif dp == 2:
              counties.styleUrl = u'2'
          elif dp == 3 or dp == 4:
              counties.styleUrl = u'3'
          elif dp < 20:
              counties.styleUrl = u'4'
          else:
              #20 or more; the old "dp > 20" test left exactly 20 unstyled
              counties.styleUrl = u'5'
          #add description with the meth bust info
          desc = "%s meth busts" % dp
          counties.xml_append_fragment(desc)
  #write the file out
  f = open('final.kml','w')
  f.write(doc.xml())
  f.close()
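
The matching and bucketing logic in geocode.py can be tried out without amara or the KML file. The sketch below is an illustration in Python 3, not the project's code: it swaps the nested loops of pointfind for a dictionary keyed on normalized names, the function names are assumptions, and a count of exactly 20 is placed in the top bucket, which is also an assumption.

```python
def normalize(name):
    """Lower-case a name and drop spaces and hyphens, as struni does."""
    return str(name).lower().replace(" ", "").replace("-", "")


def style_for_count(count):
    """Map a bust count to one of the five style ids used in the KML."""
    if count <= 1:
        return "1"
    if count == 2:
        return "2"
    if count <= 4:
        return "3"
    if count < 20:
        return "4"
    return "5"  # assumption: 20 or more falls in the top bucket


def build_lookup(records):
    """Index (state, county, points) triples by their normalized names."""
    return {(normalize(s), normalize(c)): int(p) for s, c, p in records}


def points_for(lookup, state, county):
    """Return the bust count for a county, or 0 when there is no data."""
    return lookup.get((normalize(state), normalize(county)), 0)
```

With a lookup built once from statedata.xml, each Placemark needs a single dictionary access instead of a scan over every state and county.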
====== File summary ======
cparse.py -- This is a script I used for debugging the output of htmlout.py
htmlout.py -- Grabs data from websites and makes xml of data
statedata.xml -- data about states generated by htmlout.py
geocode.py -- Generates kml and color codes counties based on the xml made by htmlout.py
usdata.xml -- xml data that defines every state and county in the US
parseall.sh -- this script automates the process of creating the kml
links.txt -- a list of links to get data from
methmap.jpg -- a screenshot if you don't have Google Earth.
[[http://www.ece.utk.edu/~jgournea/methmap.zip|zip of source]]
~~NOTOC~~