# general
a
Hey @fede (nixtla) (they/them), how do you manage input `DataFrames` to fit the model? From what I see it is transformed into ndarray that is later processed, however, couldn't find the function that transforms from DataFrame into np array. Wanted to add polars support
πŸ‘ 1
πŸ™Œ 1
f
Hey @Akmal Soliev! That would be an amazing enhancement. The function that transforms the dataframe into numpy is this: https://github.com/Nixtla/statsforecast/blob/main/statsforecast/core.py#L407.
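For readers following along, here is a minimal sketch of the kind of DataFrame-to-array step discussed above. It is not the code behind the linked line in `core.py`: the helper name `to_grouped_arrays`, its parameters, and the `unique_id`/`ds`/`y` column names are assumptions made for illustration. It takes named NumPy arrays (which either a pandas or a polars frame could supply via `df[col].to_numpy()`), sorts them by series and timestamp, and returns one flat values array plus the offsets that delimit each series.

```python
import numpy as np


def to_grouped_arrays(columns, id_col="unique_id", time_col="ds", target_col="y"):
    """Hypothetical helper: flatten a long-format panel into values + group offsets."""
    ids = np.asarray(columns[id_col])
    times = np.asarray(columns[time_col])
    values = np.asarray(columns[target_col], dtype=np.float64)

    # Encode each series id as an integer code and count rows per series.
    uids, codes, counts = np.unique(ids, return_inverse=True, return_counts=True)

    # Sort by series code first, then by timestamp (lexsort's last key is primary).
    order = np.lexsort((times, codes))
    values = values[order]

    # indptr[i]:indptr[i + 1] is the slice of `values` that belongs to uids[i].
    indptr = np.concatenate(([0], np.cumsum(counts)))
    return uids, values, indptr


columns = {
    "unique_id": np.array(["b", "a", "a", "b", "a"]),
    "ds": np.array([2, 1, 2, 1, 3]),
    "y": np.array([20.0, 1.0, 2.0, 10.0, 3.0]),
}
uids, values, indptr = to_grouped_arrays(columns)
first_series = values[indptr[0]:indptr[1]]  # observations of series "a", in time order
```

Because the helper only needs a mapping of column names to arrays, it does not care which DataFrame library produced them.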
a
Thx, will take a look!
Hey, I had a follow up question, I'm currently trying to use your synthetic data generator function (https://github.com/Nixtla/statsforecast/blob/main/statsforecast/utils.py). In order to use it as polars DataFrame I would need to recreate the steps from scratch due to the categorical dtypes (I have no clue why that's such a pain in the @$$ to deal with). Prior to doing that I wanted to ask you what's the point of having those values as categorical?
f
hey @Akmal Soliev! It is just for efficiency purposes when dealing with experiments that have thousands of time series, it's more memory efficient to deal with categorical instead of object with such a large amount of data
maybe we could add a new argument to `generate_series` controlling if the conversion is required, something like `uid_to_categorical=True`
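A rough illustration of the memory point, assuming pandas is installed. The function `make_ids` is a hypothetical stand-in, not the library's `generate_series`, and its `uid_to_categorical` flag only mirrors the argument name suggested above. It builds the same id column either as `object` or as `category` so the two footprints can be compared.

```python
import numpy as np
import pandas as pd


def make_ids(n_series, n_obs, uid_to_categorical=True):
    """Hypothetical stand-in: build a repeated id column, optionally as categorical."""
    labels = np.repeat([f"id_{i}" for i in range(n_series)], n_obs)
    ids = pd.Series(labels, dtype=object, name="unique_id")
    if uid_to_categorical:
        # Categorical stores each distinct label once, plus small integer codes per row.
        ids = ids.astype("category")
    return ids


obj_ids = make_ids(100, 1_000, uid_to_categorical=False)
cat_ids = make_ids(100, 1_000, uid_to_categorical=True)

obj_bytes = obj_ids.memory_usage(deep=True)
cat_bytes = cat_ids.memory_usage(deep=True)
print(f"object: {obj_bytes:,} bytes, category: {cat_bytes:,} bytes")
```

The gap grows with the number of rows per series, since the object column pays for a full string on every row while the categorical column pays only for an integer code.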
a
I'll see if I can just rebuild it
okay rebuilt it, also got some time shaved off the generation:
before:
________________________________________________________
Executed in    3.93 secs    fish           external
   usr time    4.68 secs    0.06 millis    4.68 secs
   sys time    1.44 secs    1.03 millis    1.43 secs

after:
________________________________________________________
Executed in  833.23 millis    fish           external
   usr time    2.23 secs      0.07 millis    2.23 secs
   sys time    0.50 secs      1.20 millis    0.50 secs
This is the most useless improvement 🤣
m
@Akmal Soliev we think it's great. Thanks.
🙏 1
a
[PR Update] After two months of persistent effort, I have successfully developed a versatile DataFrame conversion solution for both Polars and Pandas. This solution is compatible with any structured two-dimensional data, provided that the output can be consolidated into named NumPy arrays.
✅ All local tests have passed. GitHub Actions required `util.py` to be modified, as I have added a Polars test.
NOTE:
• Have not yet implemented engine into the `_StatsForecast` and/or `StatsForecast` class for I/O to match.
• `core.ipynb` has `util.py`'s `generate_series` function so that the code can actually run. Will be removed in future.
More info: https://github.com/Nixtla/statsforecast/pull/448#issuecomment-1537431035
🙌 3
f
Awesome @Akmal Soliev!
m
You are the best @Akmal Soliev, thanks.
We are going to write a brief post to communicate the new feature. Which accounts should we mention to thank you?
a
@Max (Nixtla) Thank you, I'm glad to help. I just implemented a change where if input dataframe is polars or pandas `StatsForecast` should run without any issues.
TODO:
• Modify `plot` staticmethod to work both with polars and pandas
• Implement I/O matching, at current moment there is variable for that `self.engine`
  ◦ At current moment it is:
  ◦ polars in and pandas out
  ◦ pandas in and pandas out
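One way the I/O matching in the TODO could look, as a sketch only. `detect_engine` and `match_output` are hypothetical names, not part of the PR or of `StatsForecast`. The idea is to record which library the input came from and convert the pandas result back when the input was polars. The polars branch assumes polars is installed and uses its `pl.from_pandas` converter.

```python
def detect_engine(df):
    """Hypothetical helper: name the library a frame comes from via its class's module."""
    root = type(df).__module__.partition(".")[0]
    if root not in ("pandas", "polars"):
        raise TypeError(f"unsupported DataFrame type: {type(df)!r}")
    return root


def match_output(result, engine):
    """Hypothetical helper: hand a pandas result back in the caller's own library."""
    if engine == "polars":
        import polars as pl  # imported lazily so pandas-only users do not need polars

        return pl.from_pandas(result)
    return result  # pandas in, pandas out
```

With something like this, `engine = detect_engine(df)` at the start of a call and `match_output(forecasts, engine)` at the end would give polars in, polars out.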
m
Great :)
a
@Max (Nixtla) All the local tests have been passed on Polars and Pandas using modified `generate_series`. More information: https://github.com/Nixtla/statsforecast/pull/448#issuecomment-1546411593 @fede (nixtla) (they/them) could I ask you to please check the PR and let me know if I've missed anything. From my end everything worked smoothly.
@fede (nixtla) (they/them) here's the file with all the tests done with Polars. NOTE: `groupby` doesn't have sort param in polars, hence, have to chain `.sort('unique_id')`
P.S. Latest PR is up with bug fixes
Hey, there is a `_parse_ds_type` bug on `main` wherein `int` datestamps are converted into datetime in certain cases because a check fails; specifically `issubclass(df["ds"].dtype.type, np.integer)`, which can be checked much better with kind, `np.array().dtype.kind in ["i", "f"]`, where `i` stands for int and `f` stands for float. Implemented this change in my PR: https://github.com/Nixtla/statsforecast/pull/448
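A small sketch of the kind-based check described above, using NumPy only. The helper name `is_numeric_ds` is made up for illustration and is not the function in the PR. Note that NumPy reports unsigned integers with kind `u` and datetimes with kind `M`, so neither matches a check against `i` and `f` alone.

```python
import numpy as np


def is_numeric_ds(values):
    """Hypothetical helper: True when the timestamps are signed ints or floats."""
    return np.asarray(values).dtype.kind in ("i", "f")


int_ds = np.array([1, 2, 3])
float_ds = np.array([1.0, 2.0, 3.0])
date_ds = np.array(["2023-01-01", "2023-01-02"], dtype="datetime64[D]")
uint_ds = np.array([1, 2, 3], dtype=np.uint8)

for name, arr in [("int", int_ds), ("float", float_ds), ("datetime", date_ds), ("uint", uint_ds)]:
    print(name, arr.dtype.kind, is_numeric_ds(arr))
```

If unsigned integer timestamps should also be left untouched, `"u"` would need to be added to the accepted kinds.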
🙌 1