# support
t
We have a question from Steffen Albrecht, a researcher at the University of Auckland, about a specific dataset and whether it was used to train TimeGPT. Probably a question for @azul (she/her) (nixtla). Thank you!
Thanks again for providing free credits for our academic project! Our results look interesting, showing that, in some cases, TimeGPT outperforms the competing state-of-the-art algorithms, especially when the time series is backed by many cases, which provides some robustness. When the counts underlying the time series are small, the forecasts are not very reliable, from either TimeGPT or the other algorithms.
We use a very small dataset for this, with only 52 time points per year, so I think transfer learning is a very good fit here.
We would like to double-check one question with you. As the ILI dataset we use for this study is publicly available, we wanted to make sure that this data has not been used to pre-train the TimeGPT model.
We couldn’t find this information on the webpage, so I am writing to you. It is the FLUVIEW dataset from the US CDC: National, Regional, and State Level Outpatient Illness and Viral Surveillance (cdc.gov)
There is also a publication using this dataset: https://www.science.org/doi/10.1126/sciadv.abb1237
Can you please tell me whether this particular dataset has been used to pre-train TimeGPT?
Is there an overview of this that we just didn’t find?
This would help a lot. If it were included, our results would have to be interpreted differently.
Here's information about his research and what he's doing with TimeGPT.
I am a researcher at the School of Computer Science at the University of Auckland.
My main project is around forecasting hospitalization rates related to respiratory diseases and influenza-like illness cases in general.
Using a foundation model for this purpose is quite interesting.
Furthermore, I supervise a student project in which two students compare TimeGPT's forecasts with those of other comprehensive frameworks such as AutoGluon-TS.
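For reference, a minimal sketch of how such a comparison could be set up, assuming the `nixtla` and `autogluon.timeseries` Python packages; the file name, column names, and forecast horizon are illustrative, not taken from the project itself:

```python
import pandas as pd
from nixtla import NixtlaClient
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

# Weekly ILI-style series in long format; file and column names are illustrative.
df = pd.read_csv("ili_weekly.csv")  # columns: unique_id, ds, y

# TimeGPT forecast via the hosted API (this sends the series to Nixtla's servers).
client = NixtlaClient(api_key="YOUR_API_KEY")
timegpt_fcst = client.forecast(df=df, h=4, freq="W", time_col="ds", target_col="y")

# AutoGluon-TS baseline, trained locally on the same data for comparison.
train = TimeSeriesDataFrame.from_data_frame(
    df, id_column="unique_id", timestamp_column="ds"
)
predictor = TimeSeriesPredictor(prediction_length=4, target="y").fit(train)
ag_fcst = predictor.predict(train)
```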
a
love the question and the research approach, but i'm not sure if we want to disclose that level of detail, wdyt @Cristian Challu @Max? imo, for this kind of research, i think we should let them know whether we are using the dataset, since the conclusions could inform sensitive decisions
m
Hi Steffen, Thanks for being a user of TimeGPT. Your research sounds amazing. Please keep us posted. Regarding your specific question: for strategic reasons, we haven't released the data we used yet. That being said, let's just say we would not recommend testing the accuracy of TimeGPT on the ILI dataset. You can safely assume that the model saw at least some part of that dataset. Best, X
👍 2
t
Steffen appreciated our response and replied with a question about data being uploaded to servers. I think the data is uploaded to our servers even if they opt out of having it used to improve our Services? @azul (she/her) (nixtla) @Max @Cristian Challu
thank you so much, this is super important information!
We could use another dataset that is very similar to the US-ILI surveillance data. It is also about respiratory diseases, derived from hospital surveillance in Auckland. It has not been published, so we can be sure that it is not part of the TimeGPT training data.
However, there are concerns about using this data with large models that also require internet access. Collaborators are afraid that if we use TimeGPT on this data, it will be uploaded to a server overseas, which would not be allowed under our data ethics approval.
When we use the GitHub repository for TimeGPT: https://github.com/Nixtla/nixtla
we need internet access to verify the token, right?
But are there procedures that upload the data to some server? If the data just stays on the local machine, then we should be able to use it to test TimeGPT on this data, I think.
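For context, this is roughly what a standard call looks like with the `nixtla` Python SDK; the toy data is made up, and the comments reflect the hosted setup, where both the key check and the forecast call go over the network:

```python
import pandas as pd
from nixtla import NixtlaClient

# Tiny illustrative weekly series (not real surveillance data).
df = pd.DataFrame({
    "ds": pd.date_range("2024-01-07", periods=8, freq="W"),
    "y": [12.0, 15.0, 14.0, 18.0, 21.0, 19.0, 23.0, 25.0],
})

client = NixtlaClient(api_key="YOUR_API_KEY")

# Requires internet access: the key is verified against Nixtla's API.
client.validate_api_key()

# The forecast call also goes over the network: the DataFrame is serialized
# and sent to the hosted endpoint, so the series does leave the local machine.
fcst = client.forecast(df=df, h=4, freq="W", time_col="ds", target_col="y")
```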
My draft response: Hi Steffen, Thanks, yes, that other dataset seems like it would be a good solution. How the data is handled is outlined in our Terms and Conditions. You can opt out of having your data used to improve services. However, the data is still uploaded to servers. If you need the data to remain local, you may be interested in our Docker self-hosted version. If that is of interest, I can let you know more. Best, -Tracy
👍 1
m
Hi Steffen, Thanks, yes, that other dataset seems like it would be a good solution. How the data is handled is outlined in our Terms and Conditions. You can opt out of having your data used to improve services; however, the data is still uploaded to servers. There are different options: if you use the Azure version of TimeGPT (called TimeGEN), the data will remain in your own VPC and Nixtla will never have access to it. Alternatively, we can sign some documents and make sure your data is erased and never used for any purpose. Finally, there is another option: you could use a Docker self-hosted version. However, that is normally for enterprise customers. If that is of interest, I can let you know more. Best, -Tracy
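If useful, a sketch of how the client could be pointed at a TimeGEN deployment; the endpoint URL is a placeholder, and passing an Azure endpoint as `base_url` follows Nixtla's documented pattern:

```python
import pandas as pd
from nixtla import NixtlaClient

# Illustrative weekly series, as above (not real surveillance data).
df = pd.DataFrame({
    "ds": pd.date_range("2024-01-07", periods=8, freq="W"),
    "y": [12.0, 15.0, 14.0, 18.0, 21.0, 19.0, 23.0, 25.0],
})

# Hypothetical Azure AI endpoint; with TimeGEN, requests stay inside your own
# Azure resource instead of going to Nixtla's hosted API.
client = NixtlaClient(
    base_url="https://your-timegen-endpoint.inference.ai.azure.com",
    api_key="YOUR_AZURE_API_KEY",
)
fcst = client.forecast(df=df, h=4, freq="W", time_col="ds", target_col="y")
```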
🙏 1