From 4 weeks to 3 days - Data science pipeline transformation


Goals of this article

It has been a long time since we last met. In this article I want to share a transformation we made at my last job, where we changed the way the company creates prediction models; it included both technical and cultural change.

Life before the change

In a nutshell, adjusting our model that predicts conversion (CVR) to paying was a cumbersome process that involved a data scientist and a software engineer for 4-6 weeks! (calendar time, not net work).

The process included:



  1. Our data scientist (DS) asks a developer (DEV) to extract the required pieces of information
  2. DS builds the features
  3. DS checks each feature and creates the model and its weights (using RStudio) - everything is saved as files, without any source control
  4. DEV codes the features & model in Rails, with the weights kept in configuration files
  5. DEV runs it and sends the predicted data back to the DS
  6. DS validates and ALWAYS finds errors in the features / model created by the DEV
  7. DEV fixes and deploys to production
  8. DEV backfills the predictions

The main problems of the process were, as always:

  1. Communication
  2. Ownership
  3. Accountability

The change

The goal

Make the process run FAST!

The vision

Allow a single person to own the entire process and reduce friction, while keeping the whole analysis in a shared, publicly available location, using a more state-of-the-art stack.

The transformation included

At a high level (more detail below):
  1. The data scientist joins the team
  2. Move from RStudio to Jupyter notebooks (based on a predefined work template)
  3. Deploy to production with a click of a button
  4. Introduce DataRobot as our AutoML infrastructure

Move the data scientist into our team

Our data scientist joined the team and sat with us - this had an amazing impact, since we reduced bottlenecks and miscommunication.
No more latency, simply ask immediately!
Progress and insights are shared in the dailies, and everyone is part of the same goal.

Move from R to Jupyter notebooks

Using a Jupyter notebook backed by a Python engine allowed our DS to use a wider range of modules and a more up-to-date ecosystem.
Our developers were able to quickly help with the coding where needed (it is simply Python).

The notebook template included predefined methods that were meant to be overridden (a minimal sketch follows the list):
  1. Extract data from the DB - ideally the real features are created already in that step
  2. Post-process the data - allows manipulating the data when direct SQL is not suitable (99% of the time it was)
  3. Call an external tool to do the actual predictions (given the transformed data above)
Using a template forced a process on us, which later also allowed us to convert the notebook into plain Python code, which was then packaged as a Lambda that ran on AWS.
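Here is a minimal sketch of what such a template could look like. The function names, the feature columns, and the DB parameters are illustrative assumptions, not the actual template we used:

```python
import pandas as pd
import sqlalchemy

def extract_data(db_url: str, account_id: int) -> pd.DataFrame:
    """Step 1: extract data from the DB, ideally producing the features directly in SQL."""
    engine = sqlalchemy.create_engine(db_url)
    query = sqlalchemy.text("""
        SELECT user_id,
               days_since_signup,
               sessions_last_7d,
               plan_page_views
        FROM user_activity
        WHERE account_id = :account_id
    """)
    return pd.read_sql(query, engine, params={"account_id": account_id})

def post_process(df: pd.DataFrame) -> pd.DataFrame:
    """Step 2: extra feature manipulation when plain SQL is not enough (rarely needed)."""
    df["sessions_per_day"] = df["sessions_last_7d"] / 7.0
    return df

def predict(df: pd.DataFrame) -> pd.DataFrame:
    """Step 3: call the external prediction tool (DataRobot in our case) with the transformed data."""
    raise NotImplementedError("overridden per model / deployment")
```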


The wrapping code did all the plumbing: reading the arguments, running the code, calling the prediction and transforming the response, leaving the DS to deal only with what actually matters.
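A rough sketch of that plumbing as a Lambda handler, again with hypothetical names (the module name and event fields are assumptions for illustration):

```python
import json

# The module generated from the notebook template (name is illustrative).
from model_template import extract_data, post_process, predict

def lambda_handler(event, context):
    # 1. Pull the arguments the DS code needs out of the Lambda event.
    db_url = event["db_url"]
    account_id = event["account_id"]

    # 2. Run the methods the DS overrode in the notebook.
    raw = extract_data(db_url, account_id)
    features = post_process(raw)

    # 3. Call the external predictor and shape the response.
    predictions = predict(features)
    return {
        "statusCode": 200,
        "body": json.dumps(predictions.to_dict(orient="records")),
    }
```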

Using DataRobot as our AutoML

Specifically in our case, given that we had already created the features, there was no real need to build a sophisticated new model ourselves.
DataRobot allowed us to use its wide range of models, and ensembles of more than one, to choose the best model, which was almost as good as what we had created on our own.
The model creation process also includes a wide range of graphs that better describe the importance of each feature.
And the best part: with a click of a button you get a deployed endpoint in production!
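For illustration, calling such a deployed endpoint from the predict() step could look roughly like the sketch below. The host, deployment id and keys are placeholders, and the exact URL and headers depend on your DataRobot deployment (check its integration snippets), so treat this as an assumption, not a recipe:

```python
import pandas as pd
import requests

def predict(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder URL and credentials - replace with your deployment's values.
    url = "https://<prediction-host>/predApi/v1.0/deployments/<deployment-id>/predictions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer <api-token>",
        "DataRobot-Key": "<datarobot-key>",
    }
    # Send the transformed feature rows as JSON records.
    response = requests.post(url, headers=headers, data=df.to_json(orient="records"))
    response.raise_for_status()
    rows = response.json()["data"]
    return pd.DataFrame({"prediction": [row["prediction"] for row in rows]})
```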


