Going from data to action is a recurring challenge in a startup, and the process has never been easier thanks to the wealth of amazing open source tools, including Python (pandas, NumPy, matplotlib), IPython Notebook, and D3.js.
I recently worked on a project in the container shipping industry where we had a large database of information about repairs to shipping containers. The challenge was to find actionable opportunities based on insights gleaned from the data. Here’s how I went about the data analysis.
Munging and Probing
I started the project by flexing the data this way and that using pandas and the IPython Notebook (both amazing tools you should get to know). This took a few passes. First I got it loaded into a DataFrame. Then I altered the structure to make it easier to understand, for example by replacing coded names with full text. With that out of the way, it was time to explore. The most helpful chart I made was this Pareto chart, which reveals the relative significance of various drivers in the data. Below is the code to generate the chart for any data series.
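Here is a minimal sketch of a Pareto chart helper along those lines. It is not the project's actual code: the function names pareto_table and plot_pareto and the sample numbers are made up for illustration, and it assumes a pandas Series of totals indexed by category (for example, repair cost per damage type).

```python
import matplotlib.pyplot as plt
import pandas as pd


def pareto_table(series):
    """Sort values in descending order and add a cumulative-percentage column."""
    ordered = series.sort_values(ascending=False)
    cumulative_pct = ordered.cumsum() / ordered.sum() * 100
    return pd.DataFrame({"value": ordered, "cumulative_pct": cumulative_pct})


def plot_pareto(series, title="Pareto chart"):
    """Draw one bar per category plus a cumulative-percentage line."""
    table = pareto_table(series)
    positions = list(range(len(table)))

    fig, ax = plt.subplots()
    ax.bar(positions, table["value"])
    ax.set_xticks(positions)
    ax.set_xticklabels([str(label) for label in table.index],
                       rotation=45, ha="right")
    ax.set_ylabel("Value")
    ax.set_title(title)

    # A second y-axis shares the x-axis and carries the cumulative share.
    ax_pct = ax.twinx()
    ax_pct.plot(positions, table["cumulative_pct"], color="tab:red", marker="o")
    ax_pct.set_ylim(0, 105)
    ax_pct.set_ylabel("Cumulative %")

    fig.tight_layout()
    return ax


# Made-up repair costs by damage type, for illustration only.
costs = pd.Series({"dent": 1200.0, "rust": 300.0, "hole": 500.0})
table = pareto_table(costs)
```

In a notebook, calling plot_pareto(costs) draws the chart. Keeping the sorting and cumulative math in pareto_table makes it easy to inspect the numbers behind the picture.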
Using these pareto charts, plus a variety of histograms and scatter plots, I was able to provide the team with an initial window into the data which we used to identify an avenue that was worthy of further investigation.
With a clearer destination in mind, my goal became creating a visualization that would reveal the opportunity within the data. The tool for this is D3.js. D3 is a little confusing to get one's head around at first, but it is well worth figuring out because the things you can do with it are amazing.
In our case, I wanted to let our team explore the impact of various interventions to curtail types of damage or to protect various parts of the containers. While the Pareto chart (above) provides insight into the cost of various damage types or container parts, it falls short when the two dimensions need to be considered together.
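To make the two-dimensional view concrete, here is a rough pandas sketch (not the D3 visualization itself). The column names damage_type, part, and cost, the sample rows, and the helper cost_addressed are all assumptions for illustration. It cross-tabulates cost by damage type and container part, then totals the cost touched by a hypothetical intervention.

```python
import pandas as pd

# Made-up repair records; the column names are assumptions.
repairs = pd.DataFrame({
    "damage_type": ["dent", "dent", "rust", "hole", "rust"],
    "part": ["door", "floor", "door", "roof", "floor"],
    "cost": [100.0, 250.0, 80.0, 400.0, 120.0],
})

# Rows are damage types, columns are container parts, cells are total cost.
cost_matrix = repairs.pivot_table(
    index="damage_type", columns="part", values="cost",
    aggfunc="sum", fill_value=0,
)


def cost_addressed(matrix, damage_types=(), parts=()):
    """Total cost in cells whose damage type or part is targeted."""
    row_hit = matrix.index.isin(damage_types)
    col_hit = matrix.columns.isin(parts)
    # A cell counts once if either its row or its column is targeted.
    mask = row_hit[:, None] | col_hit[None, :]
    return float(matrix.to_numpy()[mask].sum())


# Cost touched by tackling rust everywhere and protecting doors.
rust_and_doors = cost_addressed(cost_matrix, damage_types=["rust"], parts=["door"])
```

A single Pareto chart of either dimension would count the rust-on-door repairs in only one place; the matrix makes the overlap explicit, so it is counted once.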
My solution is this interactive visualization (view full size). With it, our team has been able to explore the data set without having to write more code. They are no longer dependent on me to “run the numbers.” And it didn’t even take long to make.
I highly recommend adding data analysis and visualization tools to your toolkit. They aren’t hard to learn and they are amazingly powerful.