pipes - 10 sep 2018

For my latest General Assembly Data Science Immersive project (http://gobbledygoon.com/2018/09/project-3-reddit-predictions/), I wanted to test a whole bunch of different classification models on my data very quickly.

I admit that I have a few organizational “rabbit holes” (I like to think these are GOOD things when it comes to coding):

  • Never type anything twice. Boring.
  • If I've run something once, never run it again. A waste of time.
  • And a few others that show my age(?), like memory concerns. (Not MY memory, computer memory!) I hate not freeing up memory as soon as possible.

The data scraping was easy and the feature engineering quite simple (I know I could have gone deeper on this), so for me the big challenge was how best to manage different models, tune hyperparameters, and so on. My initial idea?

Build the world’s most amazing pipeline building and grid searching function!

First, I created a dictionary that lists all the pipeline elements. For this project, here’s what I started with; I know that it will grow as I learn more about various features and models:
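The original gist isn't shown here; as a sketch of what such a dictionary might look like (the variable name and the particular transformers and classifiers are illustrative, not the post's exact code):

```python
# A sketch of a pipeline-element dictionary: each short string key maps
# to a scikit-learn class that could serve as a step in a Pipeline.
# (Names and choices here are illustrative, not the post's exact code.)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

pipe_elements = {
    "cvec": CountVectorizer,       # bag-of-words features
    "tvec": TfidfVectorizer,       # tf-idf features
    "lr":   LogisticRegression,    # classifiers
    "nb":   MultinomialNB,
    "rf":   RandomForestClassifier,
}
```

Note that the values are the classes themselves, not instances; instantiation happens later, when a particular pipeline is assembled.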

Each entry has a key (a short string) and a value that is a class (a potential pipeline step).

For each class, I have started to include some of the hyperparameters that I might consider tuning:
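A sketch of what that hyperparameter dictionary might look like, keyed by the same short strings (the specific parameters and values are illustrative):

```python
# Candidate hyperparameter grids, one entry per pipeline element.
# GridSearchCV wants parameter names prefixed with the step name
# ("cvec__max_features"), so the prefixes are added later, when the
# pipeline is assembled. (Parameter choices are illustrative.)
pipe_params = {
    "cvec": {"max_features": [2000, 5000], "ngram_range": [(1, 1), (1, 2)]},
    "tvec": {"max_features": [2000, 5000]},
    "lr":   {"C": [0.1, 1.0, 10.0]},
    "nb":   {"alpha": [0.1, 1.0]},
    "rf":   {"n_estimators": [100, 300], "max_depth": [None, 10]},
}
```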

Of course, as I run the pipeline on a given dataset, I will consider which hyperparameters to adjust.

To pull it all together, we create a pipeline and a parameter dictionary from the chosen items, then run a grid search over them:
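A minimal sketch of that assembly step, assuming the two dictionaries above (the helper's name and signature are my own, not the post's code):

```python
# A sketch: build a Pipeline from a list of short keys, derive a
# GridSearchCV-ready parameter grid with step-name prefixes, and wrap
# both in a grid search. (Names here are illustrative.)
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

pipe_elements = {"cvec": CountVectorizer, "lr": LogisticRegression}
pipe_params = {
    "cvec": {"max_features": [500, 1000]},
    "lr":   {"C": [0.1, 1.0]},
}

def build_grid_search(keys, cv=3, **gs_kwargs):
    """Assemble a Pipeline from `keys` and wrap it in a GridSearchCV."""
    steps = [(k, pipe_elements[k]()) for k in keys]
    # GridSearchCV expects parameters named "<step>__<param>".
    param_grid = {
        f"{k}__{p}": vals
        for k in keys
        for p, vals in pipe_params.get(k, {}).items()
    }
    return GridSearchCV(Pipeline(steps), param_grid, cv=cv, **gs_kwargs)

gs = build_grid_search(["cvec", "lr"])
```

After `gs.fit(X_train, y_train)`, the winning combination is available through `gs.best_params_` and `gs.best_score_`.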

There is much left to do to improve this, and I will keep adding to and cleaning it up as I move forward.

The full work-in-progress can be found on my GitHub: Pipeline generalization. I have more work to do to truly generalize it, but I have found it quite useful for my own purposes so far.


Thanks to: How To Embed Code Snippet Using Github Gist Within Blog Posts
