Monthly Archives: June 2016

I love PETL

When I started at my current job I noticed we we had lots of room for improvement about how we imported and exported data.  Folks had been using the MicroSoft SSIS platform as a way to Extract, Transform and Load data in and out of our database to various files.

SSIS is great for lots of things and has a lot of upsides. It is very drag and drop, folks don’t have to know a lot of programming to get it to do things, and it has lots of functions built in. If you need more programming power, you can execute C# or VB scripts to do the fiddly bits.

But I hate it. ( Don’t worry, we’ll get to the love soon.)

My biggest problems with SSIS:

  • It is unversionable. Try reading a git diff of an SSIS change. The xml is designed for a machine to read, not a human. If you want to know what has changed over time in your world, it’s a problem.
  • You can only use Visual Studio to edit it. Many of our SSIS packages include VB or C# scripts. That sounds fine – but apparently these compile to an undiffable, uneditable blob in the xml that is only recompiled if you save using visual studio. So if you want to change something across many SSIS package scripts, you have to open and resave each one.
  • It hides options under rocks. Finding out how something works requires lots of delving into lotsa windows and dialogues.
  • It changes things unexpectedly. Click in the wrong dialogue and it helpfully re-infers datatypes from a file for you. You don’t know until you go to execute.
  • It slapped my momma. Etc.

I wanted to move my team to something that was better for people.

We need something:

  • That we can diff
  • That we can do code reviews and pull requests on
  • That is simple, expressive and clear.
  • That is powerful.

To me that sounds like a programming language.  I encouraged folks on the team to try accomplishing a couple of tasks that might use an SSIS package instead to use Python. Immediately, things got better. Our code reviews made sense. Code quality improved with every single pull request.

We used pymssql to connect to SqlServer and inserted records as needed after processing them. Navigating and transforming XML docs was easy, CSV files were eaten up by the native DictReader.

And then Derrick found PETL. It’s beautiful. You point it at data and make simple moves to completely transform it. I’m smitten.

I had dozens of files to read from, each a quarterly file for a year – only noted in the file name. Each had a crappy heading line that preceded column headers. I needed to put them into 1 file for loading into SalesForce Wave. Whacking together a solution with PETL was effortless. Line 36 is where the PETL starts, and it’s so small and good that it is nice to see how much it encapsulates.

My legs hurt

Group 3
I did a 5K last night in Central Park – which sounds like nothing, but I am not a runner.

The last time I ran was 15 years ago in the JP Morgan Corporate Challenge, with Chris Acton. It was terrible. I ran the 5K race in Rochester in a terrible 45 minutes. I was so slow and awful that my elf-lord boss ran backwards in bare feet encouraging me to keep going. It seems kind, but it also seems kind of like krumping to show someone that it isn’t hard to dance.

Anyway, this time was better! I don’t dig the 30 minutes of being corralled while we wait to start or the crowding, but once the run started I was able to get going and stay going. I’m still slow – finished in 32 minutes – but it’s better by a lot than last time!