Once you have been through EDA and built a satisfactory model with tuned hyperparameters, you are ready to predict using new values. However, you do not want to train your model every time you want to predict on new values. You just want to use the previously trained model and run new values through it. This is where pickle, a Python's module for serializing and de-serializing comes in.
What is pickling
As said before, pickle can be used to serialize and de-serialize the object. So what do I mean by serialization and de-serialization of the objects? Serialization is a process of converting an object to a byte stream, storing the byte stream in memory for later use. De-serialization then takes the byte stream from the memory and retrieves the original object. Therefore, pickling an object converts it to a byte stream(serialization), making sure that the character stream contains all the information to retrieve the original object(de-serialization). You can pickle data stored in list, tuples, dictionaries. You can even pickle your classes, functions and tuned models.
In this post, I will pickle a dataset that can be used in other jupyter notebook. However using the same steps you can pickle classes, functions and even models with their hyper tuned parameters.
Serialization
De - serialization
Pickling you Data?
But why? Once you are done with cleaning, you might want to use the cleaned dataset in a different model or a different jupyter notebook. Therefore having a pickled dataset, would allow you to dump you data in a file and then be open in a different dataset. Let's spin up some data and then store pickle it.
import numpy as np
import pandas as pd
random_dict = {
'volts': np.random.random(10),
'current': np.random.random(10)
}
random_data = pd.DataFrame(random_dict)
random_data.sample(3)
volts | current | |
---|---|---|
4 | 0.355205 | 0.055690 |
1 | 0.697468 | 0.668440 |
6 | 0.102049 | 0.284304 |
import pickle
file_name = 'random_data_pick.pkl'
output = open(file_name, 'wb')
pickle.dump(random_data, output)
output.close()
*Few things to remember*: Open the file and don't forget wb(write byte). Then use pickle.dump(), that uses two arguments: object you want to pickle and the file to which the object has to be saved. Also, don't forget to close the file. Lastly, this would create a file on the same file level as your notebook.
Now that you have pickled the data, you can use the data in a without going over the EDA steps in a different notebook. Let us load in the data.
input_ = open('random_data_pick.pkl', 'rb')
pickled_dataframe = pickle.load(input_)
input_.close
<function BufferedReader.close>
Now let us just make sure that we unpickled the file correct.
pickled = pd.DataFrame(pickled_dataframe)
pickled.sample(3)
volts | current | |
---|---|---|
6 | 0.115438 | 0.134317 |
7 | 0.413739 | 0.321276 |
9 | 0.145731 | 0.728365 |
Personally, I pickle data and model that can be used in a web application. See my GitHub page for my hackathon project - Pokémon and world happiness repo. to learn more about linking your model to a web application.