# general
t
Hello! I have a question regarding model experiments. I am trying to log statsforecast models in the Databricks MLflow environment. Is there any reference for how I can do it? I have the code below, but I always get the error "KeyError: 'artifact_uri'".
k
Is this the full code? I can’t see what would cause this immediately
Are you using `log_artifact`?
t
Not really. What I am trying to do is: since I have a large number of models to train and apply, I want to break model training and forecasting into two processes and use two Pandas UDFs to make it happen. But I am not sure how I can log the statsforecast models in MLflow or how I can load them back later.
k
I’ll make an example for you later today
t
thank you so much!!
Here is the code I have right now, just in case. The structure was based on a reference I had for sklearn, but it seems the `log_model` function is not applicable here.
k
Ok, so I have an end-to-end script that is working: https://github.com/kvnkho/demos/blob/main/fugue/fugue-nixtla-mlflow.py I don’t know MLflow well enough to get the nested runs working yet. I can probably revisit it more on Wednesday, but I think this should give a clear picture of where this is going. Happy to hop on a call and discuss; I’m just a bit busy today and tomorrow. Also, I am not with the Nixtla team (in case they ever correct me lol), I am just a friend. I am part of the Fugue team, where we work on bringing Python and Pandas code to Spark easily (as you’ll see at the end of the script). If you don’t want to use Fugue, that’s fine, you can still use something like pandas_udf. It will just be some extra code to rewrite (especially around the schema). I am presenting at PyData NYC next week and this is a good use case, so I’ll use it to understand this more and get it working by then. If you do get the nested runs working, I’d love to see it!
❤️ 1
t
Thank you so much for sharing, Kevin. I will look into the code and fit it to my dataset to test. I appreciate your efforts with the script!!
@Kevin Kho Hi Kevin, I spent some time looking into the code and have some follow-up questions, if you don't mind. I am trying to break model training and prediction into two processes, since when they are combined as one, the computation time increases drastically; I learned this from a different project. With this situation, (1) how can I use the saved model.pkl file and apply it for forecasting? I need exogenous variables in this model, so I am planning to have model.fit() in the first process and model.predict(X_future) in the second process. (2) If the pickle file cannot be used this way, is there any other method that can help me achieve my goal here? I know that sklearn has load_model(uri), but I am not sure if there is anything similar in statsforecast.
k
Hey! Responding in the general chat so the Nixtla team can chime in. My understanding with `statsforecast` is that you can either use `.forecast()`, which is a straight forecasting call, or `.fit()` and `.predict()`. If you want to decouple training and prediction time, just call the `.fit()` method, and then you can save that fitted object as a pickle to be used later. The pickle should be usable with exogenous regressors because it saves the model weights and info. That is how it was done with sklearn before also. I don’t believe `statsforecast` has it built in yet, but you can make your own saving and loading functions pretty easily.
👍 2
The exogenous docs say that the class can handle the future exogenous regressors.
On local, `mlflow.sklearn.load_model` just uses pickle under the hood as well 🙂
❤️ 1
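For illustration, a minimal sketch of the fit-then-pickle flow described above, assuming the StatsForecast API where `.fit()` takes a long-format `df` and `.predict()` accepts future exogenous values via `X_df`; `train_df`, `future_exog_df`, and the `AutoARIMA` settings are placeholders, not taken from the thread:

```python
import pickle

import mlflow
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

# --- Process 1: fit and log the pickled model as an MLflow artifact ---
with mlflow.start_run() as run:
    sf = StatsForecast(models=[AutoARIMA(season_length=7)], freq="D")
    sf.fit(df=train_df)  # train_df: unique_id, ds, y, plus exogenous columns
    with open("model.pkl", "wb") as f:
        pickle.dump(sf, f)
    mlflow.log_artifact("model.pkl")  # stored under this run's artifact URI
    run_id = run.info.run_id          # keep this to locate the artifact later

# --- Process 2: load the pickle back and forecast with future exogenous values ---
with open("model.pkl", "rb") as f:
    sf_loaded = pickle.load(f)
preds = sf_loaded.predict(h=14, X_df=future_exog_df)  # future_exog_df: unique_id, ds, exogenous columns
```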
t
This might be another dumb question... but how can I get the location of the pickle file logged in the experiment? Like, when I do `pickle.load()`, what would be the location of `model_filename`?
k
I don’t know MLflow well enough, but I think you get the MLflow client, supply the experiment name, get the artifact location, build the URI, and then load the model.
Ehh it doesn’t look too good. Seems like you need to know a bit about where the path exists: https://stackoverflow.com/a/72625756/11163214
❤️ 1
Also see the method signature here
t
Yep, I saw the same post and got more confused 😄 I will try this `load_dict` method! Thanks a ton!
k
No, not `load_dict`. The one above it, `download_artifacts`.
👍 1
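For the path question, a hedged sketch of the `download_artifacts` route pointed to above; the experiment name and the `model.pkl` artifact filename are assumptions for illustration, not from the thread:

```python
import pickle

from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name("statsforecast_experiment")  # hypothetical experiment name
runs = client.search_runs(
    [experiment.experiment_id],
    order_by=["attributes.start_time DESC"],
    max_results=1,
)
run_id = runs[0].info.run_id  # most recent run; you could also store the run_id at training time

# download_artifacts copies the logged file to a local path and returns that path
local_path = client.download_artifacts(run_id, "model.pkl")
with open(local_path, "rb") as f:
    sf_loaded = pickle.load(f)
```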
I am so confused about whether to use the Client or not. The last time I used MLflow heavily was 3 years ago, but we didn’t rely on it to manage the paths. We just used it to track experiments, and for keeping track of the paths of saved models we always had a “latest_model” path that was constant and we just always pulled from there.
👍 1
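A tiny sketch of that constant-path pattern, with a hypothetical DBFS location; the directory and filename here are made up for illustration:

```python
import os
import pickle
import shutil

LATEST_MODEL_DIR = "/dbfs/models/latest_model"  # hypothetical constant location on DBFS
LATEST_MODEL_PATH = os.path.join(LATEST_MODEL_DIR, "model.pkl")

# Training job: after fitting and pickling, overwrite the constant path
os.makedirs(LATEST_MODEL_DIR, exist_ok=True)
shutil.copy("model.pkl", LATEST_MODEL_PATH)

# Forecasting job: always load from the same well-known path, no run lookup needed
with open(LATEST_MODEL_PATH, "rb") as f:
    model = pickle.load(f)
```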
Will test things out tomorrow