# neural-forecast
d
Sorry hello again lol - I'm trying to run NHITS in a custom container on Runpod (an RTX A4500 serverless endpoint) and it really feels like it's not leveraging the GPU? Whenever it's running, the CPU is pegged and the GPU is silent. The log is threaded below, which seems to support that. Perhaps I need to install packages differently?
Copy code
2024-05-03T17:20:05.994952347Z GPU available: True (cuda), used: True
2024-05-03T17:20:06.013560446Z TPU available: False, using: 0 TPU cores
2024-05-03T17:20:06.013613468Z IPU available: False, using: 0 IPUs
2024-05-03T17:20:06.013695960Z HPU available: False, using: 0 HPUs
2024-05-03T17:20:06.790184151Z You are using a CUDA device ('NVIDIA RTX A4500') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read <https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision>
2024-05-03T17:20:06.790884471Z Missing logger folder: /lightning_logs
2024-05-03T17:20:06.811951650Z LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
2024-05-03T17:20:06.900000345Z 
2024-05-03T17:20:06.900010435Z   | Name         | Type          | Params
2024-05-03T17:20:06.900012605Z -----------------------------------------------
2024-05-03T17:20:06.900014206Z 0 | loss         | MQLoss        | 7     
2024-05-03T17:20:06.900015776Z 1 | padder_train | ConstantPad1d | 0     
2024-05-03T17:20:06.900017056Z 2 | scaler       | TemporalNorm  | 0     
2024-05-03T17:20:06.900018576Z 3 | blocks       | ModuleList    | 5.6 M 
2024-05-03T17:20:06.900020016Z -----------------------------------------------
2024-05-03T17:20:06.900021346Z 5.6 M     Trainable params
2024-05-03T17:20:06.900023426Z 7         Non-trainable params
2024-05-03T17:20:06.900024736Z 5.6 M     Total params
2024-05-03T17:20:06.900026506Z 22.370    Total estimated model params size (MB)
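(side note: as far as I can tell, the Tensor Cores line in that log is just asking for a one-liner like this before training, so I don't think it's the actual problem:)
Copy code
import torch

# what the Tensor Cores warning above suggests: trade a little float32 matmul
# precision for faster Tensor Core kernels ('medium' is the more aggressive option)
torch.set_float32_matmul_precision('high')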
During training it shows really slow iterations per second, which also seems off?
Copy code
Epoch 48: 100%|██████████| 1/1 [00:00<00:00,  2.66it/s, v_num=2, train_loss_step=116.0, train_loss_epoch=116.0]
If I run this I clearly get GPUs available, correct?
Copy code
import torch

def check_gpu_status():
    # Check if CUDA is available
    cuda_available = torch.cuda.is_available()
    print("CUDA Available:", cuda_available)
    if cuda_available:
        # Print the number of GPUs available
        num_gpus = torch.cuda.device_count()
        print("Number of GPUs Available:", num_gpus)
        # Print the current device PyTorch is using
        current_device = torch.cuda.current_device()
        print("Current GPU Device:", torch.cuda.get_device_name(current_device))
    else:
        print("No GPU available.")

# Run the function
check_gpu_status()
m
Can you try running:
Copy code
CUDA_VISIBLE_DEVICES=0 python yourfile.py
d
yessir
actually a bit of a pain, but I assume I can just export CUDA_VISIBLE_DEVICES=0 and then re-run in my Python REPL?
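i.e. something roughly like this from inside the REPL, assuming it runs before anything touches CUDA (the print is just me sanity-checking):
Copy code
import os

# needs to happen before torch initializes CUDA (safest: before `import torch`),
# otherwise the already-created CUDA context ignores the new value
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.is_available(), torch.cuda.device_count())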
m
Sure, should do the same!
d
same results
Copy code
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=47` in the `DataLoader` to improve performance.
Predicting DataLoader 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 134.66it/s]
/usr/local/lib/python3.10/site-packages/neuralforecast/core.py:184: FutureWarning: In a future version the predictions will have the id as a column. You can set the `NIXTLA_ID_AS_COL` environment variable to adopt the new behavior and to suppress this warning.
  warnings.warn(
>>> import os
>>> os.env
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'os' has no attribute 'env'
>>> os.getenv("CUDA_VISIBLE_DEVICES")
'0'
increase num workers?
lol, increasing num_workers slows down the it/s
m
hmm. First time seeing this type of error, so I'm searching/debugging at the same time as you
d
so @Marco earlier I saw that warning and was like "gosh I should set that value to what they say"
so I bumped from default of 1 to 8
then now, I bumped again from 8 to 47
at 8 it was 2.5 it/s, at 47 it was ≈1 it/s
I said fuck it and dropped back to the default value of 1 and it's 6.35 it/s
🤷
current model def:
Copy code
NeuralForecast(
    models=[NHITS(loss=MQLoss(level=[75, 95, 99]), batch_size=100, input_size=365, h=365, max_steps=50, num_workers_loader=1)],
    freq="D"
)
m
I think the MQLoss is what is slowing it down, since computing quantiles requires sorting the values. At the same time, you presumably need it to compute some confidence interval. Still, are you able to at least use the GPU? I've never used Runpod, but can you run
nvidia-smi
to confirm the GPU is being used?
d
yeah lemme check it out
yeah we're getting some play from GPU now:
Copy code
root@691a50a8e634:/# nvidia-smi
Fri May  3 17:47:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4500               On  | 00000000:81:00.0 Off |                  Off |
| 30%   32C    P2              58W / 200W |    468MiB / 20470MiB |     12%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
root@691a50a8e634:/#
wattage is up, memory is off the 0 peg
Is there an alternative to MQLoss for generating confidence intervals?
Is MQLoss doing its work on the CPU while the model trains on the GPU? Apologies for what may be stupid questions, I'm still new to the codebase
m
MQLoss is the only option for now for generating prediction intervals. Still, I don't know why the GPU is not being used more.
Honestly, I am out of ideas. Maybe my colleague @José Morales can help you more. Really sorry I can't help you more on that 😕
d
no worries at all!
I'd be super interested to hear more
I'm planning on doing ≈25k-50k forecasts a day so every performance squeeze helps, lol
j
How many series do you have? I get ~80% GPU usage running 1,000 steps for 1,000 series
d
I am doing a single series in this case
if you want the data I can DM you a sample of it
≈1,500 daily values from 2020 to today
j
Also please remove the num_workers_loader or set it to its default (0), otherwise the warnings slow down the training
d
single column, no covariates
will do!
j
If you have a single series, then
max_steps=50
means 50 epochs which can be done in ~2s, so you may not even see the GPU usage go up in that time
Also you'll get a nice speedup in the training if you add
precision='16-mixed'
to your init arguments, that'll use mixed precision training
d
@José Morales where does that get set? Param of NeuralForecast or NHITS or MQLoss?
Copy code
NeuralForecast(
    models=[NHITS(loss=MQLoss(level=REPORTED_CONFIDENCE_INTERVALS), batch_size=100, input_size=input_size, h=365, max_steps=25)],
    freq="D"
)
j
in the NHITS
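i.e. roughly the same init you posted, just with the extra argument (as I understand it, it gets forwarded to the underlying Lightning Trainer; REPORTED_CONFIDENCE_INTERVALS and input_size are your variables from above):
Copy code
NeuralForecast(
    models=[
        NHITS(
            loss=MQLoss(level=REPORTED_CONFIDENCE_INTERVALS),
            batch_size=100,
            input_size=input_size,
            h=365,
            max_steps=25,
            precision="16-mixed",  # forwarded to the Lightning Trainer -> mixed precision training
        )
    ],
    freq="D",
)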
d
roger
and yeah to see the spikes I was just repeatedly training in a
while True
not great but good enough to confirm it was using GPU
lol
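for reference the loop was roughly this, with df standing in for my single daily series as a pandas DataFrame in the usual unique_id/ds/y long format:
Copy code
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS
from neuralforecast.losses.pytorch import MQLoss

# throwaway loop: re-fit the same model over and over so nvidia-smi has something
# to show; `df` is assumed to already hold the single daily series
while True:
    nf = NeuralForecast(
        models=[NHITS(loss=MQLoss(level=[75, 95, 99]), batch_size=100, input_size=365, h=365, max_steps=50)],
        freq="D",
    )
    nf.fit(df=df)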
@José Morales and @Marco removing the num_workers message bumped it back up to ≈33it/s
🙌 1
entirely explained by removing
num_workers_loader
j
Yeah, those warnings are misleading; we've already removed them in the main branch and they'll be gone in the next release
d
phew
lol I got pulled deeply astray by that
"ah what a helpful dialog I will dutifully follow directions"
thank you to both! Very very impressed by the work.
❤️ 1
Update here - everything running smoothly at 50it/s or so, running 10 concurrent per GPU, smooth sailing!
🙌 1