# neural-forecast
n
```
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: `ray.tune.integration.pytorch_lightning.TuneReportCallback` is deprecated. Use `ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback` instead.
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/parsing.py:199: Attribute 'loss' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['loss'])`.
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/parsing.py:199: Attribute 'valid_loss' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['valid_loss'])`.
(_train_tune pid=6410) Seed set to 6
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/torch/nn/init.py:452: UserWarning: Initializing zero-element tensors is a no-op
(_train_tune pid=6410)   warnings.warn("Initializing zero-element tensors is a no-op")
(_train_tune pid=6410) GPU available: True (cuda), used: True
(_train_tune pid=6410) TPU available: False, using: 0 TPU cores
(_train_tune pid=6410) IPU available: False, using: 0 IPUs
(_train_tune pid=6410) HPU available: False, using: 0 HPUs
(_train_tune pid=6410) You are using a CUDA device ('NVIDIA A100-SXM4-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read <https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision>
(_train_tune pid=6410) Missing logger folder: /tmp/ray/session_2024-04-30_11-37-44_977220_4931/artifacts/2024-04-30_11-37-47/_train_tune_2024-04-30_11-37-44/working_dirs/_train_tune_0c7c6ab7_2_attn_dropout=0.0000,batch_size=16,dropout=0.1000,early_stop_patience_steps=2,enable_progress_bar=False,futr_2024-04-30_11-37-54/lightning_logs
(_train_tune pid=6410) 2024-04-30 11:39:20.675778: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=6410) 2024-04-30 11:39:20.675836: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=6410) 2024-04-30 11:39:20.677586: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=6410) 2024-04-30 11:39:21.828232: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=6410) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=6410) 
(_train_tune pid=6410)   | Name                    | Type                     | Params
(_train_tune pid=6410) ---------------------------------------------------------------------
(_train_tune pid=6410) 0 | loss                    | HuberLoss                | 0     
(_train_tune pid=6410) 1 | valid_loss              | HuberLoss                | 0     
(_train_tune pid=6410) 2 | padder_train            | ConstantPad1d            | 0     
(_train_tune pid=6410) 3 | scaler                  | TemporalNorm             | 0     
(_train_tune pid=6410) 4 | embedding               | TFTEmbedding             | 44.0 K
(_train_tune pid=6410) 5 | static_encoder          | StaticCovariateEncoder   | 20.0 M
(_train_tune pid=6410) 6 | temporal_encoder        | TemporalCovariateEncoder | 137 M 
(_train_tune pid=6410) 7 | temporal_fusion_decoder | TemporalFusionDecoder    | 15.5 M
(_train_tune pid=6410) 8 | output_adapter          | Linear                   | 1.0 K 
(_train_tune pid=6410) ---------------------------------------------------------------------
(_train_tune pid=6410) 173 M     Trainable params
(_train_tune pid=6410) 0         Non-trainable params
(_train_tune pid=6410) 173 M     Total params
(_train_tune pid=6410) 693.152   Total estimated model params size (MB)
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/data.py:77: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
(_train_tune pid=6410) /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py:298: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
```
I keep getting this log when I run `.fit(...)`... has anyone seen this before?
j
We recently removed the following lines:
```
import logging
import warnings
logging.getLogger("pytorch_lightning").setLevel(logging.ERROR)
warnings.filterwarnings("ignore")
```
You can add those to your script/notebook to get the previous behavior
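For illustration, a minimal sketch of the placement, assuming an Auto model setup like the one implied by the log above; `AutoTFT`, `AirPassengersDF`, and the `h`/`num_samples` values are example choices, not taken from the thread:
```
# Sketch only: put the suppression lines at the very top,
# before neuralforecast is imported and .fit(...) is called.
import logging
import warnings

logging.getLogger("pytorch_lightning").setLevel(logging.ERROR)
warnings.filterwarnings("ignore")

from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoTFT
from neuralforecast.utils import AirPassengersDF

# AutoTFT mirrors the TFT layers shown in the log; h=12 and
# num_samples=2 are placeholder values for illustration only.
nf = NeuralForecast(models=[AutoTFT(h=12, num_samples=2)], freq="M")
nf.fit(df=AirPassengersDF)
```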
n
The issue here is that training never starts; this logging block just keeps repeating... @José Morales
j
It'll probably show for every iteration
n
What is the problem here, even? Sometimes I can get training to run, but most of the time these logs just keep repeating for every iteration and the weights aren't being updated.
j
Have you tried adding the lines above?
n
Yeah, it's the same thing
it just keeps repeating the logs I shared originally
the GPU RAM keeps filling all the way up (40 GB) and then dumping each log cycle
j
Did you set them at the top of your script?