Training¶
This guide covers the training process for SimCLR models in Phenoscape, including configuration, monitoring, and optimization strategies.
Basic Training Process¶
1. Configuration Setup¶
Create a configuration file (config.yaml) with your training parameters:
```yaml
# Data configuration
data_dir: "path/to/your/data"
out_dir: "outputs/experiment_name"

# Model configuration
backbone: "resnet50"   # resnet18, resnet50, vit_base_patch16_224
weights: "DEFAULT"     # Use pretrained weights or null for random init

# Training parameters
lr: 0.001
batch_size: 32
max_epochs: 100
temperature: 0.1       # Contrastive loss temperature
seed: 42

# Hardware configuration
accelerator: "gpu"     # cpu, gpu, auto
devices: 1
num_workers: 4
deterministic: true

# Logging
log_every_n_steps: 50
```
2. Start Training¶
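The launch command for this step is not reproduced here. As a minimal sketch, assuming the training script takes the same `--config` and `--resume` flags that appear in the Resume Training example later in this guide, a small Python launcher could assemble the command as follows. The helper name `build_train_command` is hypothetical and not part of Phenoscape.

```python
import shlex
import sys
from typing import List, Optional


def build_train_command(config_path: str, resume_from: Optional[str] = None) -> List[str]:
    """Assemble the argv for the training script (hypothetical helper)."""
    # Script path and flag names are copied from the Resume Training example.
    cmd = [sys.executable, "train/simclr_birdcolour.py", "--config", config_path]
    if resume_from is not None:
        cmd += ["--resume", resume_from]
    return cmd


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(build_train_command("config.yaml")))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would start the run from the repository root.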
3. Monitor Progress¶
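Training progress is logged to TensorBoard. The event files are written to the logs/tensorboard/ folder inside the run's output directory (see Output Files).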
Configuration Parameters¶
Model Architecture¶
| Parameter | Options | Description |
|---|---|---|
| backbone | resnet18, resnet50, resnet101, vit_base_patch16_224 | Encoder architecture |
| weights | DEFAULT, null | Use pretrained weights or random initialization |
| input_size | 224, 256, 384 | Input image resolution |
Training Hyperparameters¶
| Parameter | Default | Range | Description |
|---|---|---|---|
| lr | 0.001 | 1e-5 to 1e-1 | Learning rate |
| batch_size | 32 | 8 to 512 | Batch size (adjust for GPU memory) |
| max_epochs | 100 | 50 to 500 | Maximum training epochs |
| temperature | 0.1 | 0.05 to 0.5 | Contrastive loss temperature |
| T_max | max_epochs | - | Cosine annealing scheduler period |
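The ranges in the table can be checked before a run starts. The following is a minimal sketch, not part of Phenoscape: the function name `check_hyperparameters` is hypothetical, and it assumes the configuration has already been loaded into a plain dictionary.

```python
from typing import Dict, List, Tuple

# Suggested ranges copied from the table: name -> (low, high), both inclusive.
RANGES: Dict[str, Tuple[float, float]] = {
    "lr": (1e-5, 1e-1),
    "batch_size": (8, 512),
    "max_epochs": (50, 500),
    "temperature": (0.05, 0.5),
}


def check_hyperparameters(cfg: Dict[str, float]) -> List[str]:
    """Return one warning per value that falls outside its suggested range."""
    warnings = []
    for name, (low, high) in RANGES.items():
        if name not in cfg:
            continue  # keys that are not set are skipped in this sketch
        value = cfg[name]
        if not low <= value <= high:
            warnings.append(f"{name}={value} is outside the suggested range [{low}, {high}]")
    return warnings


defaults = {"lr": 0.001, "batch_size": 32, "max_epochs": 100, "temperature": 0.1}
print(check_hyperparameters(defaults))                 # the defaults raise no warnings
print(check_hyperparameters({**defaults, "lr": 0.5}))  # lr is flagged
```

Returning warnings rather than raising keeps the check advisory, since values outside the suggested ranges may still be intentional.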
Data Augmentation¶
```yaml
# Augmentation parameters (passed to SimCLRTransform)
transforms:
  input_size: 224
  cj_prob: 0.8             # Color jitter probability
  cj_strength: 0.5         # Color jitter strength
  min_scale: 0.08          # Minimum crop scale
  random_gray_scale: 0.2   # Grayscale probability
  gaussian_blur: 0.5       # Gaussian blur probability
  kernel_size: 0.1         # Blur kernel size
  sigmas: [0.1, 2.0]       # Blur sigma range
```
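Because these values are forwarded as keyword arguments, it can help to validate them first. The sketch below assumes that `SimCLRTransform` refers to the class of that name in the `lightly` library, which this guide does not state explicitly. The helper `to_transform_kwargs` is hypothetical.

```python
from typing import Any, Dict

# Keys whose values are probabilities and therefore must lie in [0, 1].
PROBABILITY_KEYS = ("cj_prob", "random_gray_scale", "gaussian_blur")


def to_transform_kwargs(transforms_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Validate the `transforms` section and return constructor kwargs."""
    kwargs = dict(transforms_cfg)
    for key in PROBABILITY_KEYS:
        if key in kwargs and not 0.0 <= kwargs[key] <= 1.0:
            raise ValueError(f"{key} must be a probability in [0, 1], got {kwargs[key]}")
    if "min_scale" in kwargs and not 0.0 < kwargs["min_scale"] <= 1.0:
        raise ValueError(f"min_scale must be in (0, 1], got {kwargs['min_scale']}")
    if "sigmas" in kwargs:
        low, high = kwargs["sigmas"]  # YAML sequences arrive as Python lists
        if low > high:
            raise ValueError("sigmas must be given as [min, max]")
        kwargs["sigmas"] = (low, high)  # convert to a tuple for the constructor
    return kwargs


cfg = {
    "input_size": 224, "cj_prob": 0.8, "cj_strength": 0.5, "min_scale": 0.08,
    "random_gray_scale": 0.2, "gaussian_blur": 0.5, "kernel_size": 0.1,
    "sigmas": [0.1, 2.0],
}
kwargs = to_transform_kwargs(cfg)
print(kwargs)
# Assumption: with lightly installed, SimCLRTransform(**kwargs) would follow here.
```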
Training Strategies¶
Learning Rate Scheduling¶
The framework uses cosine annealing by default:
```yaml
# Cosine annealing scheduler
lr_scheduler: "cosine"
T_max: 100        # Period of cosine annealing
eta_min: 0.0001   # Minimum learning rate
```
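To see what these settings imply, the schedule can be evaluated in closed form. The sketch below uses the standard cosine annealing formula without restarts, lr(t) = eta_min + 0.5 * (lr - eta_min) * (1 + cos(pi * t / T_max)), and assumes the scheduler steps once per epoch. The function name is illustrative only.

```python
import math


def cosine_annealing_lr(step: int, base_lr: float, eta_min: float, t_max: int) -> float:
    """Closed-form cosine annealing: base_lr at step 0, eta_min at step t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))


# Values from this guide: lr=0.001, eta_min=0.0001, T_max=100.
for epoch in (0, 25, 50, 75, 100):
    print(epoch, f"{cosine_annealing_lr(epoch, 0.001, 0.0001, 100):.6f}")
```

With `T_max` equal to `max_epochs`, the learning rate reaches `eta_min` exactly at the last epoch. A smaller `T_max` would make the rate rise again after the minimum.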
Batch Size Optimization¶
Choose batch size based on GPU memory:
| GPU Memory | Recommended Batch Size |
|---|---|
| 8GB | 16-32 |
| 16GB | 32-64 |
| 24GB+ | 64-128 |
Early Stopping¶
Monitor validation loss to prevent overfitting:
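The stopping rule itself is simple. The following is a minimal, framework-free sketch of patience-based early stopping. The class name `EarlyStopper`, the patience value, and the loss values are illustrative assumptions. If training runs on PyTorch Lightning, its `EarlyStopping` callback (with `monitor`, `patience`, and `mode` arguments) implements the same idea.

```python
class EarlyStopper:
    """Signal a stop when the loss has not improved for `patience` checks in a row."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Made-up validation losses, one per epoch.
losses = [4.0, 3.5, 3.2, 3.3, 3.25, 3.21, 3.0]
stopper = EarlyStopper(patience=3)
for epoch, loss in enumerate(losses):
    if stopper.should_stop(loss):
        print(f"stopping at epoch {epoch}, best loss {stopper.best}")
        break
```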
Advanced Training Options¶
Mixed Precision Training¶
Enable automatic mixed precision for faster training:
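The exact configuration key is not shown here. As a hedged sketch, assuming the training script forwards its settings to a PyTorch Lightning 2.x `Trainer`, mixed precision is selected through the `precision` argument. The helper `with_mixed_precision` is hypothetical.

```python
from typing import Any, Dict


def with_mixed_precision(trainer_kwargs: Dict[str, Any], use_bf16: bool = False) -> Dict[str, Any]:
    """Return a copy of the Trainer kwargs with mixed precision switched on."""
    updated = dict(trainer_kwargs)
    # "16-mixed" and "bf16-mixed" are the Lightning 2.x spellings of this option.
    updated["precision"] = "bf16-mixed" if use_bf16 else "16-mixed"
    return updated


base = {"accelerator": "gpu", "devices": 1, "max_epochs": 100}
print(with_mixed_precision(base))
# Assumption: with Lightning installed, this dictionary would be passed as
# lightning.pytorch.Trainer(**with_mixed_precision(base)).
```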
Gradient Accumulation¶
Simulate larger batch sizes:
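The arithmetic behind this option can be sketched as follows. The helper name is hypothetical, and the sketch assumes a PyTorch Lightning `Trainer`, where the resulting number would be passed as `accumulate_grad_batches`.

```python
import math


def accumulation_steps(target_batch_size: int, batch_size: int) -> int:
    """Number of batches to accumulate so the effective batch reaches the target."""
    if batch_size <= 0 or target_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return max(1, math.ceil(target_batch_size / batch_size))


batch_size = 32
steps = accumulation_steps(target_batch_size=256, batch_size=batch_size)
print(f"accumulate {steps} batches -> effective batch size {steps * batch_size}")
```

For a contrastive loss, the negatives still come only from each individual batch, so accumulation smooths the gradient estimate but does not add negatives per anchor.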
Multi-GPU Training¶
Train on multiple GPUs:
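As a sketch only, assuming a PyTorch Lightning `Trainer` with its distributed data parallel (`"ddp"`) strategy: each process loads its own batch of `batch_size` samples, so the global batch grows with the number of devices. The helper name is hypothetical.

```python
from typing import Any, Dict, Tuple


def multi_gpu_settings(devices: int, batch_size: int) -> Tuple[Dict[str, Any], int]:
    """Return Trainer kwargs for `devices` GPUs and the resulting global batch size."""
    if devices < 1:
        raise ValueError("devices must be at least 1")
    kwargs: Dict[str, Any] = {"accelerator": "gpu", "devices": devices}
    if devices > 1:
        kwargs["strategy"] = "ddp"  # one process per GPU
    # Under DDP every process draws its own batch of `batch_size` samples.
    return kwargs, batch_size * devices


kwargs, global_batch = multi_gpu_settings(devices=4, batch_size=32)
print(kwargs, global_batch)
```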
Monitoring Training¶
Key Metrics to Watch¶
- Contrastive Loss: Should decrease steadily
- Learning Rate: Should follow the scheduler curve
- GPU Utilization: Should be >80% for efficiency
- Memory Usage: Monitor for out-of-memory errors
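A reference value helps when reading the loss curve. The sketch below is a plain-Python version of the NT-Xent loss used by SimCLR, written for illustration rather than taken from Phenoscape. When all embeddings are identical, every similarity is equal and the loss is log(2N - 1) for N images per batch, about 4.14 for a batch size of 32. A loss that stays near that value suggests the encoder is not separating positives from negatives.

```python
import math
from typing import List, Sequence


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent_loss(view_a: List[Sequence[float]], view_b: List[Sequence[float]],
                 temperature: float = 0.1) -> float:
    """NT-Xent over N positive pairs; view_a[i] and view_b[i] embed the same image."""
    z = list(view_a) + list(view_b)
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same image
        logits = {k: _cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[positive]
    return total / (2 * n)


# Collapsed embeddings: all similarities are equal, so the loss is log(2N - 1).
collapsed = [[1.0, 0.0]] * 4
print(nt_xent_loss(collapsed, collapsed), math.log(2 * 4 - 1))

# Well-separated embeddings give a loss close to zero.
separated = [[1.0, 0.0], [0.0, 1.0]]
print(nt_xent_loss(separated, separated))
```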
TensorBoard Visualization¶
Key plots to monitor:
- Training loss over time
- Learning rate schedule
- Gradient norms
- Sample augmented images
Training Logs¶
Check training logs for issues:
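One way to do this is to scan the log for typical failure markers. The sketch below is illustrative: the keyword list and the sample log lines are assumptions, since the log format is not documented here. Only the path `logs/training.log` comes from the Output Files section.

```python
import re
from typing import Iterable, List, Tuple

# Whole-word matches, so that "INFO" does not trigger the "inf" pattern.
SUSPICIOUS = re.compile(r"\b(nan|inf|error|traceback)\b|out of memory", re.IGNORECASE)


def find_suspicious_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, text) for every line that matches a failure marker."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(lines, start=1)
        if SUSPICIOUS.search(line)
    ]


# Hypothetical log excerpt; a real file could be read with
# open("outputs/experiment_name/logs/training.log").
sample = [
    "INFO epoch 1 loss 4.10",
    "INFO epoch 2 loss nan",
    "RuntimeError: CUDA out of memory",
]
for number, text in find_suspicious_lines(sample):
    print(number, text)
```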
Troubleshooting¶
Common Issues¶
Out of Memory Errors
Solutions:
- Reduce batch size
- Use gradient accumulation
- Enable mixed precision training
- Reduce input image size
Slow Convergence
- Increase learning rate
- Adjust temperature parameter
- Check data augmentation strength
- Verify data loading efficiency
Loss Not Decreasing
- Check data preprocessing
- Verify positive/negative pair generation
- Adjust temperature parameter
- Increase batch size for better negative sampling
Performance Optimization¶
Data Loading
```yaml
num_workers: 8             # Increase for faster data loading
pin_memory: true           # Speed up GPU transfer
persistent_workers: true   # Keep workers alive
```
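These three keys share their names with arguments of PyTorch's `DataLoader`. As a sketch, assuming the settings are forwarded to it unchanged, the hypothetical helper below guards against one known pitfall: PyTorch raises an error if `persistent_workers=True` is combined with `num_workers=0`.

```python
from typing import Any, Dict


def dataloader_kwargs(num_workers: int, pin_memory: bool = True,
                      persistent_workers: bool = True,
                      accelerator: str = "gpu") -> Dict[str, Any]:
    """Build DataLoader kwargs that stay valid for CPU runs and zero workers."""
    return {
        "num_workers": num_workers,
        # Pinned host memory only helps when batches are copied to a GPU.
        "pin_memory": pin_memory and accelerator == "gpu",
        # Persistent workers require at least one worker process.
        "persistent_workers": persistent_workers and num_workers > 0,
    }


print(dataloader_kwargs(num_workers=8))
print(dataloader_kwargs(num_workers=0, accelerator="cpu"))
```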
Model Optimization
Training Best Practices¶
Data Preparation¶
- Ensure balanced dataset across categories
- Remove corrupted or low-quality images
- Standardize image preprocessing
- Verify data augmentation quality
Hyperparameter Tuning¶
- Start with default parameters
- Tune learning rate first
- Adjust batch size for hardware
- Fine-tune temperature parameter
- Optimize augmentation strength
Experiment Management¶
- Use descriptive experiment names
- Version control configuration files
- Save model checkpoints regularly
- Document experiment results
Validation Strategy¶
- Hold out validation set
- Monitor overfitting
- Use early stopping
- Evaluate on downstream tasks
Example Training Commands¶
Basic RGB Training¶
High-Resolution Training¶
Multi-GPU Training¶
Resume Training¶
```bash
python train/simclr_birdcolour.py \
    --config configs/resume.yaml \
    --resume outputs/experiment/checkpoints/last.ckpt
```
Output Files¶
After training, you'll find:
```
outputs/experiment_name/
├── checkpoints/
│   ├── best.ckpt        # Best model checkpoint
│   └── last.ckpt        # Latest checkpoint
├── logs/
│   ├── training.log     # Training logs
│   └── tensorboard/     # TensorBoard logs
└── config.yaml          # Saved configuration
```
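A quick way to confirm that a run finished cleanly is to check for these files. The sketch below uses only the paths from the layout above. The function name `missing_outputs` is hypothetical.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the output layout shown above.
EXPECTED = [
    "checkpoints/best.ckpt",
    "checkpoints/last.ckpt",
    "logs/training.log",
    "logs/tensorboard",
    "config.yaml",
]


def missing_outputs(out_dir: str) -> List[str]:
    """Return the expected files or folders that are absent from `out_dir`."""
    root = Path(out_dir)
    return [relative for relative in EXPECTED if not (root / relative).exists()]


missing = missing_outputs("outputs/experiment_name")
print("all outputs present" if not missing else f"missing: {missing}")
```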
Next Steps¶
- Configuration: Detailed parameter explanations
- Multispectral Training: Training with UV data
- Hyperspectral Training: 408-band training
- Data Augmentation: Customizing augmentations