feat(app): 1st converging cloud microphysics model
This commit exhibits nearly monotonic convergence as measured by the
cost function decreasing three orders of magnitude in the first 120
iterations.  To reproduce this behavior, execute

./build/run-fpm.sh run -- --base training --epochs 120 --stride 720

with the present working directory containing the 29.7 GB
training_input.nc and training_output.nc produced for the
"Colorado benchmark simulation" using commit d7aa958 on the
neural-net branch of https://github.com/berkeleylab/icar,
which uses the simplest of ICAR's cloud microphysics models.
The Inference-Engine run uses

* A single time instant (as determined by the above stride),
* A 30% retention rate of grid points where time derivatives vanish,
* Zero initial weights and biases,
* A batch size equal to the entire time instant,
* Gradient descent with no optimizer, and
* A single mini-batch (sketched in code below).
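
The two most consequential of these settings appear in
app/train-cloud-microphysics.f90.  The fragment below is reconstructed
from the diff further down, with module imports and surrounding logic
omitted, so it is a sketch rather than a complete program; the comments
annotate how the settings map onto the code:

! zero initial weights and biases: random=.false. presumably suppresses
! the random initialization noted above
trainable_engine = new_engine(num_hidden_layers=12, nodes_per_hidden_layer=16, &
  num_inputs=8, num_outputs=6, random=.false.)

! a single mini-batch: one bin spans every input/output pair, so the
! batch size equals the entire (strided) time instant
associate(num_pairs => size(input_output_pairs), n_bins => 1)
  bins = [(bin_t(num_items=num_pairs, num_bins=n_bins, bin_number=b), b = 1, n_bins)]
end associate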

The program shuffles the data set in order to facilitate
stochastic gradient descent.  However, because a single mini-batch
is used, the cost function is computed across the entire data set
on every iteration; reordering the pairs within one all-encompassing
bin leaves the computed gradient unchanged, so shuffling has no
effect and this presumably amounts to plain (full-batch) gradient
descent.

Because a single time instant is used, this case reflects the
behavior that might be expected if Inference-Engine is integrated
into ICAR and training happens during an ICAR run.  In such a
scenario, it might be desirable to iterate on each time instant
as soon as the time step completes.  Doing so might help to
pretrain the network, promoting faster convergence in any
additional training performed after the ICAR run if the data is
saved.  Alternatively, training at ICAR runtime might obviate the
need for saving large training data sets.
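
A minimal sketch of that per-time-step workflow follows.  Every name
in it is a hypothetical placeholder; nothing below appears in this
commit, and it only illustrates iterating on each time instant as
its time step completes:

program online_training_sketch
  !! Hypothetical sketch: train on each time instant as soon as the
  !! corresponding ICAR time step completes.  All names are placeholders.
  implicit none
  integer :: step
  integer, parameter :: num_steps = 100  ! placeholder step count

  do step = 1, num_steps
    ! placeholder for one ICAR time step
    call advance_icar_one_step()
    ! placeholder for harvesting the step's input/output tensor pairs
    ! and running a few gradient-descent iterations on them
    call train_on_latest_instant()
  end do

contains

  subroutine advance_icar_one_step()
    ! stand-in for ICAR's time integration
  end subroutine

  subroutine train_on_latest_instant()
    ! stand-in for updating the network weights from the new instant
  end subroutine

end program online_training_sketch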
rouson committed Sep 11, 2023
1 parent cb5119f commit b09e77c
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions app/train-cloud-microphysics.f90
@@ -194,7 +194,7 @@ subroutine read_train_write
 else
 close(network_unit)
 print *,"Initializing a new network"
-trainable_engine = new_engine(num_hidden_layers=12, nodes_per_hidden_layer=16, num_inputs=8, num_outputs=6, random=.true.)
+trainable_engine = new_engine(num_hidden_layers=12, nodes_per_hidden_layer=16, num_inputs=8, num_outputs=6, random=.false.)
 end if

 print *,"Defining tensors from time steps 1 through", t_end, "with strides of", stride
@@ -229,7 +229,7 @@ subroutine read_train_write
 end associate


-associate(num_pairs => size(input_output_pairs), n_bins => size(input_output_pairs)/10000)
+associate(num_pairs => size(input_output_pairs), n_bins => 1) ! also tried n_bins => size(input_output_pairs)/10000
 bins = [(bin_t(num_items=num_pairs, num_bins=n_bins, bin_number=b), b = 1, n_bins)]

 print *,"Training network"
