# neural-forecast
d
Sorry hello again lol - I'm trying to run NHITS in a custom container on Runpod (an RTX A4500 serverless endpoint) and it really feels like it's not leveraging the GPU? Whenever it's running, the CPU is pegged and the GPU is silent. The log is threaded below, which seems to support that. Perhaps I need to install packages differently?
Copy code
2024-05-03T17:20:05.994952347Z GPU available: True (cuda), used: True
2024-05-03T17:20:06.013560446Z TPU available: False, using: 0 TPU cores
2024-05-03T17:20:06.013613468Z IPU available: False, using: 0 IPUs
2024-05-03T17:20:06.013695960Z HPU available: False, using: 0 HPUs
2024-05-03T17:20:06.790184151Z You are using a CUDA device ('NVIDIA RTX A4500') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read <https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision>
2024-05-03T17:20:06.790884471Z Missing logger folder: /lightning_logs
2024-05-03T17:20:06.811951650Z LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
2024-05-03T17:20:06.900000345Z 
2024-05-03T17:20:06.900010435Z   | Name         | Type          | Params
2024-05-03T17:20:06.900012605Z -----------------------------------------------
2024-05-03T17:20:06.900014206Z 0 | loss         | MQLoss        | 7     
2024-05-03T17:20:06.900015776Z 1 | padder_train | ConstantPad1d | 0     
2024-05-03T17:20:06.900017056Z 2 | scaler       | TemporalNorm  | 0     
2024-05-03T17:20:06.900018576Z 3 | blocks       | ModuleList    | 5.6 M 
2024-05-03T17:20:06.900020016Z -----------------------------------------------
2024-05-03T17:20:06.900021346Z 5.6 M     Trainable params
2024-05-03T17:20:06.900023426Z 7         Non-trainable params
2024-05-03T17:20:06.900024736Z 5.6 M     Total params
2024-05-03T17:20:06.900026506Z 22.370    Total estimated model params size (MB)
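(side note: as far as I can tell, the Tensor Cores line in that log is just asking for a one-liner like this before training, so I don't think it's the actual problem:)
Copy code
import torch

# what the Tensor Cores warning above suggests: trade a little float32 matmul
# precision for faster Tensor Core kernels ('medium' is the more aggressive option)
torch.set_float32_matmul_precision('high')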
During training it shows really slow iterations per second, which also seems off?
Copy code
Epoch 48: 100%|██████████| 1/1 [00:00<00:00,  2.66it/s, v_num=2, train_loss_step=116.0, train_loss_epoch=116.0]
If I run this I clearly get GPUs available, correct?
Copy code
import torch

def check_gpu_status():
    # Check if CUDA is available
    cuda_available = torch.cuda.is_available()
    print("CUDA Available:", cuda_available)
    if cuda_available:
        # Print the number of GPUs available
        num_gpus = torch.cuda.device_count()
        print("Number of GPUs Available:", num_gpus)
        # Print the current device PyTorch is using
        current_device = torch.cuda.current_device()
        print("Current GPU Device:", torch.cuda.get_device_name(current_device))
    else:
        print("No GPU available.")

# Run the function
check_gpu_status()
m
Can you try running:
Copy code
CUDA_VISIBLE_DEVICES=0 python yourfile.py
d
yessir
actually a bit of a pain, but I assume I can just export CUDA_VISIBLE_DEVICES=0 and then re-run in my Python REPL?
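i.e. something roughly like this from inside the REPL, assuming it runs before anything touches CUDA (the print is just me sanity-checking):
Copy code
import os

# needs to happen before torch initializes CUDA (safest: before `import torch`),
# otherwise the already-created CUDA context ignores the new value
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.is_available(), torch.cuda.device_count())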
m
Sure, should do the same!
d
same results
Copy code
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=47` in the `DataLoader` to improve performance.
Predicting DataLoader 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 134.66it/s]
/usr/local/lib/python3.10/site-packages/neuralforecast/core.py:184: FutureWarning: In a future version the predictions will have the id as a column. You can set the `NIXTLA_ID_AS_COL` environment variable to adopt the new behavior and to suppress this warning.
  warnings.warn(
>>> import os
>>> os.env
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'os' has no attribute 'env'
>>> os.getenv("CUDA_VISIBLE_DEVICES")
'0'
increase num workers?
lol, increasing num_workers slows down the it/s
m
hmm. First time seeing this type of error, so I'm searching/debugging at the same time as you
d
so @Marco earlier I saw that warning and was like "gosh I should set that value to what they say"
so I bumped from default of 1 to 8
then now, I bumped again from 8 to 47
at 8 it was 2.5 it/s, at 47 it was ≈1 it/s
I said fuck it and dropped back to the default value of 1 and it's 6.35 it/s
🤷
current model def:
Copy code
NeuralForecast(
    models=[NHITS(loss=MQLoss(level=[75, 95, 99]), batch_size=100, input_size=365, h=365, max_steps=50, num_workers_loader=1)],
    freq="D"
)
m
I think the MQLoss is what is slowing it down, since computing quantiles requires sorting the values. At the same time, you presumably need it to compute some confidence interval. Still, are you able to at least use the GPU? I've never used Runpod, but can you run
nvidia-smi
to confirm the GPU is being used?
d
yeah lemme check it out
yeah we're getting some play from GPU now:
Copy code
root@691a50a8e634:/# nvidia-smi
Fri May  3 17:47:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4500               On  | 00000000:81:00.0 Off |                  Off |
| 30%   32C    P2              58W / 200W |    468MiB / 20470MiB |     12%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
root@691a50a8e634:/#
wattage is up, memory is off the 0 peg
Is there an alternative to MQLoss for generating confidence intervals?
Is MQLoss doing its work on the CPU while the model trains on the GPU? Apologies for what may be stupid questions, I'm still new to the codebase
m
MQLoss is the only option for now for generating prediction intervals. Still, I don't know why the GPU is not being used more.
Honestly, I am out of ideas. Maybe my colleague @José Morales can help you more. Really sorry I can't help you more on that 😕
d
no worries at all!
I'd be super interested to hear more
I'm planning on doing ≈25k-50k forecasts a day so every performance squeeze helps, lol
j
How many series do you have? I get ~80% GPU usage running 1,000 steps for 1,000 series
d
I am doing a single series in this case
if you want the data I can DM you a sample of it
≈1,500 daily values from 2020 to today
j
Also please remove the num_workers_loader or set it to its default (0), otherwise the warnings slow down the training
d
single column, no covariates
will do!
j
If you have a single series, then
max_steps=50
means 50 epochs which can be done in ~2s, so you may not even see the GPU usage go up in that time
Also you'll get a nice speedup in the training if you add
precision='16-mixed'
to your init arguments, that'll use mixed precision training
d
@José Morales where does that get set? Param of NeuralForecast or NHITS or MQLoss?
Copy code
NeuralForecast(
    models=[NHITS(loss=MQLoss(level=REPORTED_CONFIDENCE_INTERVALS), batch_size=100, input_size=input_size, h=365, max_steps=25)],
    freq="D"
)
j
in the NHITS
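i.e. roughly the same init you posted, just with the extra argument (as I understand it, it gets forwarded to the underlying Lightning Trainer; REPORTED_CONFIDENCE_INTERVALS and input_size are your variables from above):
Copy code
NeuralForecast(
    models=[
        NHITS(
            loss=MQLoss(level=REPORTED_CONFIDENCE_INTERVALS),
            batch_size=100,
            input_size=input_size,
            h=365,
            max_steps=25,
            precision="16-mixed",  # forwarded to the Lightning Trainer -> mixed precision training
        )
    ],
    freq="D",
)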
d
roger
and yeah to see the spikes I was just repeatedly training in a
while True
not great but good enough to confirm it was using GPU
lol
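for reference the loop was roughly this, with df standing in for my single daily series as a pandas DataFrame in the usual unique_id/ds/y long format:
Copy code
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS
from neuralforecast.losses.pytorch import MQLoss

# throwaway loop: re-fit the same model over and over so nvidia-smi has something
# to show; `df` is assumed to already hold the single daily series
while True:
    nf = NeuralForecast(
        models=[NHITS(loss=MQLoss(level=[75, 95, 99]), batch_size=100, input_size=365, h=365, max_steps=50)],
        freq="D",
    )
    nf.fit(df=df)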
@José Morales and @Marco removing the num_workers message bumped it back up to ≈33it/s
🙌 1
entirely explained by removing
num_workers_loader
j
Yeah, those warnings are misleading; we've already removed them in the main branch and they'll be gone in the next release
d
phew
lol I got pulled deeply astray by that
"ah what a helpful dialog I will dutifully follow directions"
thank you to both! Very very impressed by the work.
❤️ 1
Update here - everything running smoothly at 50it/s or so, running 10 concurrent per GPU, smooth sailing!
🙌 1