# neural-forecast
n
```
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: `ray.tune.integration.pytorch_lightning.TuneReportCallback` is deprecated. Use `ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback` instead.
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/parsing.py:199: Attribute 'loss' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['loss'])`.
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/parsing.py:199: Attribute 'valid_loss' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['valid_loss'])`.
(_train_tune pid=6410) Seed set to 6
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/torch/nn/init.py:452: UserWarning: Initializing zero-element tensors is a no-op
(_train_tune pid=6410)   warnings.warn("Initializing zero-element tensors is a no-op")
(_train_tune pid=6410) GPU available: True (cuda), used: True
(_train_tune pid=6410) TPU available: False, using: 0 TPU cores
(_train_tune pid=6410) IPU available: False, using: 0 IPUs
(_train_tune pid=6410) HPU available: False, using: 0 HPUs
(_train_tune pid=6410) You are using a CUDA device ('NVIDIA A100-SXM4-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read <https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision>
(_train_tune pid=6410) Missing logger folder: /tmp/ray/session_2024-04-30_11-37-44_977220_4931/artifacts/2024-04-30_11-37-47/_train_tune_2024-04-30_11-37-44/working_dirs/_train_tune_0c7c6ab7_2_attn_dropout=0.0000,batch_size=16,dropout=0.1000,early_stop_patience_steps=2,enable_progress_bar=False,futr_2024-04-30_11-37-54/lightning_logs
(_train_tune pid=6410) 2024-04-30 11:39:20.675778: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=6410) 2024-04-30 11:39:20.675836: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=6410) 2024-04-30 11:39:20.677586: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=6410) 2024-04-30 11:39:21.828232: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=6410) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=6410) 
(_train_tune pid=6410)   | Name                    | Type                     | Params
(_train_tune pid=6410) ---------------------------------------------------------------------
(_train_tune pid=6410) 0 | loss                    | HuberLoss                | 0     
(_train_tune pid=6410) 1 | valid_loss              | HuberLoss                | 0     
(_train_tune pid=6410) 2 | padder_train            | ConstantPad1d            | 0     
(_train_tune pid=6410) 3 | scaler                  | TemporalNorm             | 0     
(_train_tune pid=6410) 4 | embedding               | TFTEmbedding             | 44.0 K
(_train_tune pid=6410) 5 | static_encoder          | StaticCovariateEncoder   | 20.0 M
(_train_tune pid=6410) 6 | temporal_encoder        | TemporalCovariateEncoder | 137 M 
(_train_tune pid=6410) 7 | temporal_fusion_decoder | TemporalFusionDecoder    | 15.5 M
(_train_tune pid=6410) 8 | output_adapter          | Linear                   | 1.0 K 
(_train_tune pid=6410) ---------------------------------------------------------------------
(_train_tune pid=6410) 173 M     Trainable params
(_train_tune pid=6410) 0         Non-trainable params
(_train_tune pid=6410) 173 M     Total params
(_train_tune pid=6410) 693.152   Total estimated model params size (MB)
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/data.py:77: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py:298: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
```
I keep getting this log when I run `.fit(...)`... has anyone seen this before?
j
We recently removed the following lines:
```
import logging
import warnings
logging.getLogger("pytorch_lightning").setLevel(logging.ERROR)
warnings.filterwarnings("ignore")
```
You can add those to your script/notebook to get the previous behavior
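For illustration, a minimal sketch of the placement, assuming an Auto model setup like the one implied by the log above; `AutoTFT`, `AirPassengersDF`, and the `h`/`num_samples` values are example choices, not taken from the thread:
```
# Sketch only: put the suppression lines at the very top,
# before neuralforecast is imported and .fit(...) is called.
import logging
import warnings

logging.getLogger("pytorch_lightning").setLevel(logging.ERROR)
warnings.filterwarnings("ignore")

from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoTFT
from neuralforecast.utils import AirPassengersDF

# AutoTFT mirrors the TFT layers shown in the log; h=12 and
# num_samples=2 are placeholder values for illustration only.
nf = NeuralForecast(models=[AutoTFT(h=12, num_samples=2)], freq="M")
nf.fit(df=AirPassengersDF)
```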
n
The issue here is that training never starts; this logging block just keeps repeating... @José Morales
j
It'll probably show for every iteration
n
What is the problem here, even? Sometimes I can get training to run, but most of the time these logs just keep repeating for every iteration and the weights aren't being updated.
j
Have you tried adding the lines above?
n
Yeah, it's the same thing
it just keeps repeating the logs I shared originally
the GPU RAM keeps filling all the way up (40 GB) and then dumping each log cycle
j
Did you set them at the top of your script?