Chris Agocs learns something new

Wednesday, January 23, 2013

How to parse a KML file and find the centroid of all the Placemarks in Python.

A local forum has a tag game going on. In a tag game, you go and take a picture of an interesting place, then post clues leading to that place. The next person finds the place, takes a picture of him or herself there, then chooses another place and posts pictures and clues of it. It's altogether pretty fun.

A member of the forum is keeping up with the game on a Google map, and I became curious about the centroid of all the tags.

KML file

I downloaded the KML file and opened it up in Notepad++. It turns out the Placemarks are really easy to read:

XML Parsing

I fired up Python. Here's a quick and easy import for an XML parsing package.

from xml.dom.minidom import parseString

Basic steps

Since this kml file is pretty simple, the steps are easy:

Read KML file as a string
Parse that string into a DOM
Iterate through a collection of coordinates elements from the DOM
Read the data out of the coordinates elements
Break them up into latitude and longitude
Find the centroid

So, let's get to it.

Read KML file as a string

#Read KML file as a string
file = open(location)
data = file.read()
file.close()

Parse that string into a DOM

#Parse that string into a DOM
dom = parseString(data)

Iterate through a collection of coordinates elements from the DOM

for d in dom.getElementsByTagName('coordinates'):

Read data out of the coordinates elements, break them up into latitude and longitude

coords = d.firstChild.data.split(',')

longitudes.append(float(coords[0]))

latitudes.append(float(coords[1]))

Find the centroid

centerLatitude = sum(latitudes)/len(latitudes)
centerLongitude = sum(longitudes)/len(longitudes)
return ([centerLongitude,centerLatitude])

I will note that this is technically wrong. The Earth is a sphere, and this really only works for a Cartesian plane. However, over a relatively small area, the error isn't great enough to make a difference.

Finished product

Here's the whole thing.

It turns out that the centroid is [-84.3412607740385, 33.79998225], which is unnervingly close to my house.

Tuesday, June 12, 2012

I was wrong: Text qualifiers do not count toward column width in SSIS 2008

I know, what?

The Error

I'm importing from a text qualified, pipe delimited flat file with a the following columns (Name (Datatype, maxlength)):

CustomerIdentifier (WSTR, 120)
GroupName (WSTR, 120)
UserName (WSTR, 100)
UserStatus (WSTR, 15)
UserType (WSTR, 20)

My package was chugging along, until it failed.

Error: 2012-06-12 09:38:17.73

Code: 0xC02020A1

Source: InsertUserRecords _ Pipe Flat File Source [870]

Description: Data conversion failed. The data conversion for column "UserName" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".

End Error

Error: 2012-06-12 09:38:17.73

Code: 0xC020902A

Source: InsertUserRecords _ Pipe Flat File Source [870]

Description: The "output column "UserName" (888)" failed because truncation occurred, and the truncation row disposition on "output column "UserName" (888)" specifies failure on truncation. A truncation error occurred on the specified object of the specified component.

End Error

Error: 2012-06-12 09:38:17.74

Code: 0xC0202092

Source: InsertUserRecords _ Pipe Flat File Source [870]

Description: An error occurred while processing file "C:\Scratch\DataLoad\User_USERDATA.txt" on data row 25519.

End Error

I highlighted the important bits.

I opened up the offending file and read the offending line (which I sanitized before posting here).

I highlighted the offending column. The text (from the first 'S' to the last '0') is 99 characters. The column width is 100, so it should have fit. I loaded it up in a hex editor and went through every character to see if there were any spurious nonprinting characters. Nope.

The Problem

~~Out of curiosity, I deleted a character and tried to import it into a test database. That worked.~~

~~Let me reiterate: It worked with 98 characters, with a column width of 100. If you count the text qualifiers (the " character on either side of the string), that's 100 total characters.~~

I put the missing character back and deleted the " characters. I then changed the Flat File Source such that the UserName column was NOT text qualified, and tried again. 99 characters, no quotes, and it worked. I put the " characters back, turned Text Qualified back on, and it failed.

The Conclusion

The only conclusion I can draw from this is that SSIS initializes a column 100 characters wide, attempts to insert the whole string into it (text qualifiers included), and then strips the text qualifiers out.

The Solution

Here's how I wound up handling the issue gracefully. I set the UserName column width to 102 characters, to allow for 100 characters and the text qualifiers. In the Data Flow task where I handle that Flat File Source, I have a Derived Column transform adding a new column, ShortenedUserName (WSTR, 100) = Subtstring(Username, 1, 100). In theory there should be no instances where UserName.Length > 100 characters once the text qualifiers have been stripped, but I'm not going to take that chance in production. I'm also going to have it write to an error file whenever it encounters this scenario, so we can double-check the data integrity in the database.

Update

It appears this is only the case if the text qualifier is not set on the file level, but Text Qualified is set to True on the column level.

Second Update

I was wrong. Text qualifier was not set in the General tab. Observe:

The value I circled and put arrows toward needs to be set to the actual text qualifier. If it isn't, well, then all hell breaks loose as I described above.

Monday, June 11, 2012

Unzip many .zip files with Cygwin

Someone gave me a directory of compressed text files that had been compressed using the built in Windows zip program. That's a terrible idea for two reasons:

A bunch of separate .zip files will not compress as efficiently as one .zip file with all the text files inside of it.
Windows has no built in faculty for unzipping a bunch of files.

Bandwidth is cheap, so I'm ignoring the first point for now. How can I quickly and easily unzip a bunch of .zip files? Using Cygwin. Cygwin is a *nix shell that sits on top of Windows, and it works spectacularly. You also have to install the unzip program for Cygwin, gunzip won't work.

Here's a tricky gotcha:

$ unzip *.zip

Archive: blah1.zip

caution: filename not matched: blah2.zip

caution: filename not matched: blah3.zip

caution: filename not matched: blah4.zip

caution: filename not matched: blah5.zip

Doesn't work. I Googled around, and here's the trick:

$ unzip \*.zip

You have to use the \ in front of the * in order for unzip to figure out the wildcard. Now I have a bunch of text files to process.

Dads and Grads eliminator

I hate this time of year (late May, early June). So many references to gifts for "Dads and Grads". As a college graduate who took way too long to actually graduate, that kinda sticks in my craw. To combat this, I wrote the Dads & Grads Eliminator Greasemonkey script. It's not too special, but install it if you like.

Link to it: http://userscripts.org/scripts/show/135763

How it works:

var els = document.getElementsByTagName("*");
for(var i = 0, l = els.length; i < l; i++) {
var el = els[i];
el.innerHTML = el.innerHTML.replace(/Dads and Grads/gi, 'bar');
el.innerHTML = el.innerHTML.replace(/Dads & Grads/gi, 'bar');
el.innerHTML = el.innerHTML.replace(/dads and grads/gi, 'bar');
el.innerHTML = el.innerHTML.replace(/dads & grads/gi, 'bar');
el.innerHTML = el.innerHTML.replace(/dads And grads/gi, 'bar');
el.innerHTML = el.innerHTML.replace(/Dads & Grads/gi, 'bar');
el.innerHTML = el.innerHTML.replace(/dads & grads/gi, 'bar');
}

Very simply. It gets all the elements in the page, iterates through them, and replaces all the permutations of "Dads and Grads" I could think of with "bar". It works surprisingly well. I actually had to disable it to write this post, and to upload it to UserScripts.org.

Update: Not even I'm using this script any more. (un)Surprsingly, it breaks some things (like Blogger), and takes a really long time to go through every element of complex pages. Still, a fun introduction to Greasemonkey.

Thursday, June 7, 2012

Hadoop: JAVA_HOME is not set... Yes it is!

This post is mostly expounding on my findings here: http://stackoverflow.com/questions/10824462/hadoop-java-home-is-not-set

Background

After installing Java, I wanted to install Hadoop on this Ubuntu 12.04 Server. There are a million tutorials, but I went with this one: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/. I wanted to be able to run Hadoop as all users, so I set JAVA_HOME inside of /etc/profile. That way, all profiles get that environmental variable.

I got down to the part where you format the HDFS filesystem through the following command:

/usr/local/hadoop/bin/hadoop namenode -format

And everything went just fine... NOT.

user@linux01:~$ sudo $HADOOP_INSTALL/bin/hadoop namenode -format
Error: JAVA_HOME is not set.

I was indignant! JAVA_HOME is totally set!

user@linux01:~$ tail -n 4 /etc/profile
export JAVA_HOME=/usr/local/jdk1.6.0_32/bin

export JDK_HOME=$JAVA_HOME
export PATH=$PATH:/usr/local/jdk1.6.0_32/bin

export HADOOP_INSTALL=/usr/local/hadoop/hadoop-1.0.3

user@linux01:~$ echo $JAVA_HOME
/usr/local/jdk1.6.0_32/bin

user@linux01:~$ ls $JAVA_HOME
appletviewer extcheck jar javac and so forth...

The Symptoms

Some geeks on StackOverflow pointed me in the right direction. Environmental variables sometimes don't persist when you sudo. I su'd as my hadoop user and ran the command again. Success!

The Cure

Of course, that's not the end of it. I went to start Hadoop using:

/usr/local/hadoop/bin/start-all.sh

I can't remember the exact error I got. Namenode and Jobtracker started right up, but Datanode, SecondaryNamenode and Tasktracker didn't. I did some digging, and the ones that worked are part of the Namenode, started by hadoop-daemon.sh. The ones that didn't are part of the Hadoop Datanode, and are started by hadoop-daemons.sh. The processes that were not starting all had error logs complaining about, guess what, JAVA_HOME not being set. Finally, I bit the bullet and hard-coded JAVA_HOME in conf/hadoop-env.sh.

Too long, didn't read

The moral of the story is, hard code JAVA_HOME in conf/hadoop-env.sh.

Tuesday, June 5, 2012

Installing Sun JDK 1.6.0 on Ubuntu 12.04 Server

Background
As I understand it, Oracle retired the Operating System Distributor's License for Java, meaning that Canonical could no longer include the JDK or JRE in their APT repositories. This means no more

sudo apt-get install sun-java-whatever

Recently, I ran into this trying to install Hadoop 1.03 on top of Ubuntu 12.04 Server. Canonical recommends installing OpenJDK instead, but according to Apache, bugs in how OpenJDK handles Generics might cause bugs in my MapReduce task. Unacceptable. What's more, Apache recommends Java 1.6 over 1.7.

I ran into a bunch of Google search results detailing how to install Java using PPA repositories, but I couldn't make any work. Finally, got fed up and decided to install Java the hard way.

Download Java
The first obvious step is to download the Java installer. I went here, clicked "Accept", and copied the hyperlink for Linux x64 (64-bit) (jdk-6u32-linux-x64.bin). On my linux machine, I ran

wget http://download.oracle.com/otn-pub/java/jdk/6u32-b05/jdk-6u32-linux-x64.bin

A file, jdk-6u32-linux-x64.bin appeared in my home directory. Per these instructions, I set the file executable and attempted to execute it.

chris@linux01:~/bar$ chmod +x jdk-6u32-linux-x64.bin
chris@linux01:~/bar$ ./jdk-6u32-linux-x64.bin
./jdk-6u32-linux-x64.bin: line 1: html: No such file or directory
./jdk-6u32-linux-x64.bin: line 2: head: No such file or directory
./jdk-6u32-linux-x64.bin: line 3: title: No such file or directory
./jdk-6u32-linux-x64.bin: line 4: META: No such file or directory
./jdk-6u32-linux-x64.bin: line 5: link: No such file or directory

That's odd.

chris@linux01:~/bar$ head jdk-6u32-linux-x64.bin
<html>
<head>
<title>Unauthorized Request</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<link rel="stylesheet" type="text/css" href="/errors/us/assets/hp-styles.css" />
<link rel="stylesheet" type="text/css" href="/errors/us/assets/master-master.css" />

<body style="margin: 0px" bgcolor=#ffffff>
<div id="banner">
<table width="100%" border="0" cellspacing="0" cellpadding="0">

Oh, that makes sense. Oracle.com either hates wget or is storing a cookie saying "yes, this guy accepted the T's and C's."

Actually download Java

Easiest thing to do was just to download the .bin file onto my desktop and use SCP (WinSCP if you need) to move it to my server.

Install Java

I decided to install Java in /usr/local/java.

sudo mkdir /usr/local/java

sudo cp jdk-6u32-linux-x64.bin /usr/local/java

cd /usr/local/java

sudo ./jdk-6u32-linux-x64.bin

Unpacking...

Checksumming...

Extracting...

UnZipSFX 5.50 of 17 February 2002, by Info-ZIP (Zip-Bugs@lists.wku.edu).

   creating: jdk1.6.0_32/

and so forth.

Unpacking Java like this is effectively "installing" it, if you hate yourself and want to write out the path to Java every time.

Set Environmental Variables

I don't hate myself, so I set a couple of environmental variables, notably PATH, JAVA_HOME, and JDK_HOME. The first one allows you to just type

javac foo.java

java foo

as opposed to writing it all out, and the latter two are requested by Hadoop. I set these in my /etc/profile file. Add the following to the end of your /etc/profile:

export JAVA_HOME=/usr/local/java/jdk1.6.0_32

export JDK_HOME=$JAVA_HOME

export PATH=$PATH:/usr/local/java/jdk1.6.0_32/bin

Finally, run

source /etc/profile

There. Now Java is installed for all intents and purposes. Way to go. As an aside, I ran into some problems with JAVA_HOME as I was trying to install Hadoop, but I'll cover those later.