Data Types and Their Features
Hi everyone, welcome to yet another ML post. Recently I was learning about ML strategies and I thought this could be worth sharing.
As we all know and agree, ML problems are driven by data in various forms. The generic types of data that can come in tabular form are "Numeric", "Categorical", "Date and Time", and "Geo/Location".
One of the interesting and challenging components of an ML pipeline is that, based on the problem type, we have to extract features from the data provided to us.
In this post we’ll explore ways to extract useful features from the Date and Time and Geo/Location data types.
I hope the context of the post is clear now, so let’s get started.
Date and Time
A date and time feature would look like this: 2016-01-01 00:00:00. From this we can extract the following date-related attributes.
- Date, Month, Year, Hour
- Day Number
- Week Number
- Is Weekend
- Is Weekday
- Is leap year
- Is daytime
- Is night time
For all the above elements we can rely on two libraries offered by Python: the datetime library and the calendar library.
I have included code snippets to extract these attributes using the above libraries.
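One possible way to pull these attributes out with datetime and calendar is sketched here. The function name `extract_date_features` is hypothetical, "Day Number" is read as day of the year, the week number follows the ISO calendar, and treating 06:00 to 18:00 as daytime is purely an assumption for illustration.

```python
import calendar
from datetime import datetime


def extract_date_features(timestamp: str) -> dict:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string and return date-related attributes."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    # weekday() returns 0 for Monday, so 5 and 6 are Saturday and Sunday.
    is_weekend = dt.weekday() >= 5
    # Assumption: daytime runs from 06:00 up to (not including) 18:00.
    is_daytime = 6 <= dt.hour < 18
    return {
        "day": dt.day,
        "month": dt.month,
        "year": dt.year,
        "hour": dt.hour,
        "day_of_year": dt.timetuple().tm_yday,
        "week_of_year": dt.isocalendar()[1],  # ISO 8601 week number
        "is_weekend": is_weekend,
        "is_weekday": not is_weekend,
        "is_leap_year": calendar.isleap(dt.year),
        "is_daytime": is_daytime,
        "is_night_time": not is_daytime,
    }


features = extract_date_features("2016-01-01 00:00:00")
print(features)
```

Note that the ISO week number for 2016-01-01 is 53, because that Friday still belongs to the last ISO week of 2015. If you need weeks that restart on January 1st, a different definition is required.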
Now that we know how to collect the static elements of date features, we’ll look briefly at ways to collect the dynamic elements, since date and time fall under the cyclic categorical data type.
For this, we will rely entirely on an open source library called tsfresh.
This code has been taken from the book Approaching (Almost) Any Machine Learning Problem by Abhishek Thakur.
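As a rough illustration of the kind of aggregate features one can compute over a series of values ordered in time, here is a standard-library-only sketch. It is not the book's code and it does not call tsfresh; the dictionary keys loosely mirror calculator names from `tsfresh.feature_extraction.feature_calculators`, and the function name `aggregate_series_features` and the sample values are hypothetical.

```python
from statistics import mean, pvariance
from typing import Sequence


def aggregate_series_features(values: Sequence[float]) -> dict:
    """Summarise a time-ordered series with a few simple statistics."""
    avg = mean(values)
    # Differences between consecutive observations.
    diffs = [b - a for a, b in zip(values, values[1:])]
    return {
        "mean": avg,
        "maximum": max(values),
        "minimum": min(values),
        "variance": pvariance(values),  # population variance
        "abs_energy": sum(v * v for v in values),  # sum of squared values
        "count_above_mean": sum(v > avg for v in values),
        "count_below_mean": sum(v < avg for v in values),
        "mean_abs_change": mean(abs(d) for d in diffs),
        "mean_change": mean(diffs),
    }


# Hypothetical values observed at four consecutive timestamps.
series = [1.0, 3.0, 2.0, 6.0]
series_features = aggregate_series_features(series)
print(series_features)
```

In practice you would compute such aggregates per entity (for example per customer) over a time window and join them back as features.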
Geo/Location
This is one of the less commonly used data types. It typically appears in problems involving transportation.
A location can be given as a geo code, a name, or geo co-ordinates.
It is better to convert all other types to geo co-ordinates because we can extract interesting features out of them. The elements that can be extracted using geographical co-ordinates are as follows:
- Distance between two geographical co-ordinates
- Distance between the given geo co-ordinates and a popular landmark of that area
- Define a hotspot circle and check whether the given co-ordinates fall into it or not.
If you can collect external data associated with these co-ordinates, like population, area type, number of buildings and so on, then many more related features can be extracted. But you need to be careful in this case, as the number of features is not directly proportional to model performance, and we might end up overfitting.
As far as finding distances between two geo points goes, we can rely on one of three distances: Manhattan distance, Euclidean distance, and Haversine distance. More information regarding the pros and cons of each distance can be found here.
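For the first two, a minimal sketch is shown here. It treats (latitude, longitude) pairs as points on a flat plane, which is an assumption that only holds roughly over small areas, and the results are in degrees rather than kilometres. The function names and sample points are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def manhattan_distance(p: Point, q: Point) -> float:
    """Sum of absolute differences along each axis."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def euclidean_distance(p: Point, q: Point) -> float:
    """Straight-line distance on a flat plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Hypothetical points chosen so the arithmetic is easy to follow.
start = (0.0, 0.0)
end = (3.0, 4.0)
print(manhattan_distance(start, end))
print(euclidean_distance(start, end))
```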
I’ll go through one of these distance metrics for geo points, namely the Haversine distance, which measures the great-circle distance between two points on a sphere.
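A minimal sketch of the Haversine distance, along with the hotspot-circle check mentioned in the feature list, could look like this. It assumes a spherical Earth with a mean radius of 6371 km; the names `haversine_km` and `in_hotspot` and the sample co-ordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_hotspot(lat: float, lon: float,
               center_lat: float, center_lon: float,
               radius_km: float) -> bool:
    """True if the point lies within radius_km of the hotspot centre."""
    return haversine_km(lat, lon, center_lat, center_lon) <= radius_km


# Hypothetical pickup point and landmark, both on the equator.
pickup = (0.0, 0.01)
landmark = (0.0, 0.0)
print(haversine_km(*pickup, *landmark))
print(in_hotspot(*pickup, *landmark, radius_km=2.0))
```

The same `haversine_km` helper covers the distance between two co-ordinates and the distance to a landmark; the hotspot check simply compares that distance with a chosen radius.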