Introduction: This is the final project for the IBM Data Science Professional Certificate on Coursera. Its aim is to demonstrate how data analytics, machine learning and data visualization techniques can be used to solve a real-life problem.
Business Problem: A fellow alumnus was recently offered a job in Kuala Lumpur. Having decided to move, he called and asked me to suggest suitable places for him to live with his family. To understand what he meant by “suitable”, I did a short requirement analysis by asking a few more questions. The outcome was a condominium location that is:
i) Within 5 KM of his new office address, as he loves to drive.
ii) Near an international school (within 3 KM of the condominium) for his 12-year-old child.
iii) Near a supermarket or reputable grocery shop (within 2 KM of the condominium).
I asked for one week and decided to apply data science and geocoding techniques to analyze the condominiums available in Kuala Lumpur and find the best options for my friend’s requirements.
Audience: Besides my friend, anyone who needs to move to a new location and has similar requirements can benefit from the results of this project.
Data:
Source of Data: I decided to use Foursquare location data for Kuala Lumpur, as Foursquare is a widely used, independent location data platform.
Preparation: Venue data retrieved from Foursquare comes with many attributes, not all of which are necessary for this project. We need to check the keys carefully and filter them based on our requirements to save processing time and resources.
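As a minimal sketch of this filtering step, the snippet below trims a venue record down to the few fields the distance analysis needs. The record layout, the field names and the helper slim_venue are assumptions made for illustration; the actual Foursquare response may be shaped differently.

```python
# Hypothetical venue record: the nested layout is an assumed shape for
# illustration, not a guaranteed Foursquare response format.
raw_venues = [
    {
        "id": "v1",
        "name": "Example Condominium",
        "location": {
            "lat": 3.1600,
            "lng": 101.7200,
            "address": "1 Example Road",
            "cc": "MY",
        },
        "categories": [{"name": "Residential Building", "pluralName": "unused"}],
        "referralId": "unused",
    },
]


def slim_venue(venue):
    """Keep only the attributes needed for the distance analysis."""
    location = venue.get("location", {})
    # Fall back to an empty category when the list is missing or empty.
    categories = venue.get("categories") or [{}]
    return {
        "name": venue.get("name"),
        "category": categories[0].get("name"),
        "address": location.get("address"),
        "lat": location.get("lat"),
        "lng": location.get("lng"),
    }


venues = [slim_venue(v) for v in raw_venues]
print(venues[0])
```

Dropping unused keys early keeps the later distance comparisons small and makes the resulting table easier to inspect.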
Uses of data: With the location data downloaded from Foursquare, I plan to perform the following analysis steps to reach the target result.
a) Generate the list of condominiums in Kuala Lumpur.
b) Narrow down the list to condominiums within a 5000-meter radius of the address provided.
c) Group each condominium with the schools within a 3000-meter radius of it.
d) Generate a list of schools from ‘c’ and sort them by distance from the condominiums.
e) Repeat ‘c’ and ‘d’ for supermarkets.
f) Combine the result of ‘d’ and ‘e’ to get the list of condominiums that meet the requirements.
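Steps ‘b’ to ‘f’ can be sketched as below. This is an illustration under stated assumptions, not the notebook's actual code: it implements the haversine formula directly with the standard library instead of calling the haversine package, and the helper names (haversine_m, within, shortlist) and all coordinates are made up for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lng) points."""
    lat1, lng1, lat2, lng2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def within(origin, places, radius_m):
    """Return (distance, place) pairs inside radius_m, nearest first."""
    hits = ((haversine_m(origin, p["coords"]), p) for p in places)
    return sorted((t for t in hits if t[0] <= radius_m), key=lambda t: t[0])


def shortlist(office, condos, schools, markets):
    """Condos near the office that also have a school and a supermarket nearby."""
    result = []
    for _, condo in within(office, condos, 5000):               # step b
        near_schools = within(condo["coords"], schools, 3000)   # steps c, d
        near_markets = within(condo["coords"], markets, 2000)   # step e
        if near_schools and near_markets:                       # step f
            result.append({
                "condo": condo["name"],
                "schools": [s["name"] for _, s in near_schools],
                "markets": [m["name"] for _, m in near_markets],
            })
    return result


# Hypothetical coordinates, chosen only to exercise each branch.
office = (3.1600, 101.7200)
condos = [
    {"name": "Condo A", "coords": (3.1700, 101.7200)},  # about 1.1 KM from office
    {"name": "Condo B", "coords": (3.2300, 101.7200)},  # about 7.8 KM, too far
    {"name": "Condo C", "coords": (3.1300, 101.7200)},  # near office, no school in range
]
schools = [
    {"name": "School 2", "coords": (3.1900, 101.7200)},
    {"name": "School 1", "coords": (3.1800, 101.7200)},
]
markets = [{"name": "Market 1", "coords": (3.1750, 101.7200)}]

result = shortlist(office, condos, schools, markets)
print(result)
```

Sorting each (distance, place) list by distance gives the ordering asked for in step ‘d’, and a condominium is kept only when both the school list and the supermarket list are non-empty.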
Methodology: We use Python 3 throughout the analysis, as it has all the libraries this work requires.
Define Algorithm: At this stage, we plan the step-by-step procedure to reach the expected result with the available data.
i) Determine the necessary libraries: We need common Python libraries such as pandas, numpy and random, as we will be working with data frames, arrays and numeric values. We will use Nominatim to convert addresses into latitude and longitude values. The json_normalize function will be required to convert JSON data into data frames. We will also use the Folium library to plot maps. Lastly, we will use the haversine library to find the great-circle distance between two geographic coordinates.
ii) With Nominatim, we will find the geo coordinates (latitude and longitude) of the given address.
iii) Query the Foursquare API for the list of schools within 8 (5+3) KM of the geo coordinates of the given address.
iv) Query the Foursquare API for the list of condominiums within 5 KM of the geo coordinates of the given address.
v) Visualize the condominiums on a map with markers, and draw a circle showing the 5 KM range centered on the given address.
vi) Create a data table of condominiums and international schools from the data sets obtained in steps iii) and iv) that match the first 2 requirements.
vii) Finally, check whether the condominiums listed in the table from step vi) meet the 3rd requirement.
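The 8 (5+3) KM search radius for schools in step iii) can be justified with the triangle inequality. The notation here is introduced only for illustration: d is the great-circle distance, o the office, c a condominium within 5 KM of the office, and s a school within 3 KM of that condominium.

```latex
d(o, s) \;\le\; d(o, c) + d(c, s) \;\le\; 5\,\mathrm{km} + 3\,\mathrm{km} \;=\; 8\,\mathrm{km}
```

So every school that can satisfy the second requirement for some qualifying condominium lies within 8 KM of the office, and a single query of that radius is enough to collect all candidates. The converse does not hold, which is why step vi) still has to check each condominium and school pair against the 3 KM limit.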
Execution: With the strategy ready, we use a Jupyter notebook for coding and testing the result.
We found 30 international schools within an 8 KM radius of the given office address (348 Jalan Tun Razak, Kuala Lumpur).
In the next phase, we found 50 condominiums within a 5 KM radius of the given office address.
The figure above shows the office location with its 5 KM radius; each condominium is marked by a circle with a blue center and an orange border.
The data analysis gave us 668 condominium and school combinations that meet the first 2 requirements.
Next, we checked whether the first 5 condominiums in our list have at least 1 supermarket within a 2 KM radius. All 5 passed. The figure above shows the result of the first query.
Result: The analysis produced 50 condominium names and addresses that meet my friend’s requirements. He also has 25 international schools to choose from within the required 3 KM of those 50 condominiums.
Discussion: The first thing that comes out of the analysis is that my friend does not need to worry about his child’s school or grocery shopping, as there are plenty of options. Next, he can look at other amenities such as a gym, swimming pool, playground, park and medical facilities, as well as the cost of living, to narrow down the options and find the best one.
One observation is that rating information for international schools is not available in the Foursquare dataset. I would suggest that Foursquare include it, as many parents look at ratings when selecting a suitable school for their children.
Conclusion: The analysis has achieved its goal. It has also opened opportunities for further analysis, especially in the real estate sector. This is just one small example of the data science use cases one can realize with technology that is freely available today.
Acknowledgement and Reference: Thanks to IBM and Coursera for this learning opportunity in the field of data science and machine learning, and for the capstone project, which provided the chance to work on a real-life problem and apply the learned techniques to solve it. I am grateful to a number of online resources that helped me in many ways to complete this project.
Here is the GitHub link, where the Python notebook is publicly available for anyone who needs a coding reference.