Predicting Functional Water Pumps in Tanzania using Random Forests and Logistic Regression in Python
Tanzania’s water pumps dataset presented a unique set of interesting problems related to data cleaning and prediction. Working on this data required some thinking about the end goal, reading the data carefully, paying attention to the details, and deciding what hidden information in the data was important. It was a classification problem. The target variable consisted of three classes of water pumps: functional, non-functional, and those that required repair work. The challenge was to accurately predict which pumps were functional.
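To make the three-class setup concrete, here is a minimal sketch of fitting both models named in the title with scikit-learn. It is not the analysis from the post: the data is synthetic (generated with `make_classification` as a stand-in for the pump records), and the feature counts, split size, and model settings are all assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pump data (assumption): three classes,
# playing the role of functional / non-functional / needs repair.
X, y = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=5,
    n_classes=3,
    random_state=0,
)

# Hold out a quarter of the rows, keeping class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The two model families from the title, with illustrative settings.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Fit each model and record its accuracy on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, round(scores[name], 3))
```

On real data the categorical columns would need encoding and the missing values would need handling before either model could be fitted, which is where most of the cleaning effort goes.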
The home improvement spending in 2015 and the projections for 2017 overlap almost perfectly for all 25 metropolitan areas. That is both good and bad news for businesses. Good news because the market looks stable, without any shocks or speculation; on the other hand, it also means less chance of extra growth or a spurt in the market.
Excel is more flexible with empty cells and columns, but Python's pandas will have difficulty reading such files. These files need some serious cleaning. Data wrangling is part of the process. This post discusses how I handled data that looked extremely neat in Excel but gave me a hard time when I prepared it with pandas for further analysis in Python.
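As a rough illustration of the kind of cleanup involved, the sketch below drops fully empty rows and columns from a DataFrame. The frame is built in memory with made-up placeholder values to mimic what a sheet with spacer rows and columns tends to look like once loaded; the column names and the `drop_empty` helper are hypothetical, and the real files may well need more than this.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking a sheet with one spacer row and one spacer
# column; the values are placeholders, not real figures.
raw = pd.DataFrame(
    {
        "metro": ["Metro A", np.nan, "Metro B"],
        "Unnamed: 1": [np.nan, np.nan, np.nan],
        "spending_2015": [1.0, np.nan, 2.5],
    }
)


def drop_empty(df):
    """Remove columns and rows that contain no values at all."""
    cleaned = df.dropna(axis="columns", how="all").dropna(axis="index", how="all")
    # Renumber the rows so the index has no gaps after the drop.
    return cleaned.reset_index(drop=True)


clean = drop_empty(raw)
print(clean)
```

Using `how="all"` keeps rows that are only partly empty, so a missing value in one cell does not cost the whole record.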