<!DOCTYPE html>
<html>
<!--
TODO:
https://codepen.io/ismaelexperiments/pen/gxxjZQ
-->
<head>
<title>asanAI Handbook</title>
<script src='libs/jquery.js'></script>
<script src='libs/jquery-ui.js'></script>
<script src='libs/plotly-latest.min.js'></script>
<script src="libs/mathjax/es5/tex-chtml-full.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [['$','$']]
},
jax: ["input/TeX","output/CommonHTML"],
processEscapes: true,
"showMathMenu": true
});
</script>
<script src="tf/tf.js"></script>
<script src='base_wrappers.js'></script>
<script src='variables.js'></script>
<script src='safety.js'></script>
<script src='explain.js'></script>
<script src='model.js'></script>
<script src='gui.js'></script>
<script src='debug.js'></script>
<script src="libs/md5.umd.min.js"></script>
<style>
body {
font-family: sans-serif;
background-color: #e5e5e5;
}
.container{
margin: 20px auto;
width:100px;
height:100px;
display:grid;
grid-template-columns: 50px 50px;
grid-row: auto auto;
.box{
color:#fff;
display:flex;
align-items:center;
justify-content:center;
font-size:40px;
font-family:sans-serif;
}
}
#contents {
/*white-space: pre-line;*/
}
.center_vertically {
display: flex;
align-items: center;
}
.grid3x2 {
display: grid;
grid-template-columns: repeat(2, 40px);
grid-template-rows: repeat(3, 40px);
grid-gap: 1px;
}
.grid2x2 {
display: grid;
grid-template-columns: repeat(2, 40px);
grid-template-rows: repeat(2, 40px);
grid-gap: 1px;
}
.grid {
display: grid;
grid-template-columns: repeat(3, 40px);
grid-template-rows: repeat(3, 40px);
grid-gap: 1px;
}
.cell {
justify-content: center;
align-items: center;
display: flex;
font-family: Arial;
font-size: 0.5em;
font-weight: bold;
background: white;
outline: 1px solid black;
}
.flip3dtensor {
max-width: 100px;
transform: skew(-180deg, 21deg);
}
.kernel_images {
margin: 10px;
}
.out_images {
margin: 10px;
}
img {
max-width: 90%;
}
.error_msg {
background-color: red;
color: white;
font-size: 2em;
}
</style>
</head>
<body>
<script>
var default_config_separableConv2d = {
filters: 3,
kernelSize: [3, 3],
depthMultiplier: 1,
depthwiseInitializer: "glorotNormal",
depthwiseConstraint: undefined,
depthwiseRegularizer: undefined,
strides: [1, 1],
padding: "same",
dilationRate: [1, 1],
activation: "relu",
useBias: true,
biasInitializer: "ones",
kernelInitializer: "glorotUniform",
kernelConstraint: undefined,
biasConstraint: undefined,
activityRegularizer: tf.regularizers.l1l2( { "l1": 0.01, "l2": 0.01 } ),
kernelRegularizer: tf.regularizers.l1l2( { "l1": 0.01, "l2": 0.01 } ),
biasRegularizer: tf.regularizers.l2( { "l2": 0.01 } )
};
var default_config_depthwiseConv2d = {
kernelSize: [3, 3],
depthMultiplier: 1,
depthwiseInitializer: "glorotNormal",
depthwiseConstraint: undefined,
depthwiseRegularizer: undefined,
strides: [1, 1],
padding: "same",
dilationRate: [1, 1],
activation: "relu",
useBias: true,
biasInitializer: "ones",
kernelInitializer: "glorotUniform",
kernelConstraint: undefined,
biasConstraint: undefined,
activityRegularizer: tf.regularizers.l1l2( { "l1": 0.01, "l2": 0.01 } ),
kernelRegularizer: tf.regularizers.l1l2( { "l1": 0.01, "l2": 0.01 } ),
biasRegularizer: tf.regularizers.l2( { "l2": 0.01 } )
};
var default_config_conv2dTranspose = {
filters: 3,
kernelSize: [3, 3],
strides: [1, 1],
padding: "same",
dilationRate: [1, 1],
activation: "relu",
useBias: true,
biasInitializer: "ones",
kernelInitializer: "glorotUniform",
kernelRegularizer: tf.regularizers.l1l2( { "l1": 0.01, "l2": 0.01 } ),
biasRegularizer: tf.regularizers.l2( { "l2": 0.01 } )
};
var default_config_gaussianNoise = {
stddev: 0.1
};
var default_config_gaussianDropout = {
rate: 0.1
};
var default_config_dropout = {
rate: 0.1
};
var default_config_alphaDropout = {
rate: 0.1
};
var default_config_upSampling2d = {
size: [2, 2],
interpolation: "nearest"
};
var default_config_averagePooling2d = {
poolSize: [2, 2],
strides: [2, 2],
padding: "valid"
};
var default_config_maxPooling2d = {
poolSize: [2, 2],
strides: [2, 2],
padding: "valid"
};
var default_config_conv2d = {
filters: 3,
kernelSize: [3, 3],
activation: "sigmoid",
strides: [1, 1],
dilationRate: [1, 1],
useBias: true,
padding: "valid",
biasInitializer: "ones",
kernelInitializer: "glorotUniform",
kernelRegularizer: tf.regularizers.l1l2( { "l1": 0.01, "l2": 0.01 } ),
biasRegularizer: tf.regularizers.l2( { "l2": 0.01 } )
};
</script>
<div id="training_data"></div>
<img src="_gui/logo_small.png" />
<div id="toc"></div>
<div id="contents">
<h2>General outline</h2>
asanAI offers a simple way of creating sequential neural networks and training them on your own data, right in the browser. It lets you visualize many of the intermediate steps. When training is done, you can also <a href="#Export">export</a> the trained model to Python and NodeJS.
<h2>Quickstart</h2>
<h3>GUI Basics</h3>
<img src="manual/ribbon.png" />
<br>
The bar at the top is called the ribbon. It contains general options that apply to all layers, the data itself, and the controls to start training.
<br>
<img src="manual/layers.png" />
<br>
The left side is the layers panel. It shows the layers of the current neural network in their current state. On its right side, it also shows a description of what groups of layers do.
<br>
<h3>Train on images from webcam</h3>
The quickest and easiest way to create a neural network is to simply use images from the webcam.
<br>
Click on the camera icon 📸 in the top-left of the ribbon.
<br>
<img src="manual/camera_icon.png" />
<br>
You then get a screen where you can set how many images you want to take via webcam (default: 100) and how much time should pass between them (default: 0.9 seconds). If you want to take multiple pictures, adjust these settings and click "Take 100 images from webcam (0.9 seconds apart)" on the first category.
<br>
If you just want to use a single image, press "Take image from webcam" instead.
<br>
While taking these images, please move the object around, so that the neural network can see it from different angles and sides.
<br>
Each of these buttons is assigned to a category, which you can name. By default, there are 2 categories, but you can add as many as you like. If you want to remove a category, press "Delete this category". If you want to add a category, press "Add new category".
<br>
<img src="manual/camera_data.png" />
<br>
When you have as many categories and images as you wish, go to the ribbon and click <img src="manual/start_training_button.png">.
<br>
Remember that the image will, by default, be converted to 10x10 pixels, so make sure that the objects you hold in front of the camera are clearly visible and easily distinguishable.
<br>
You will then see graphs like these:
<br>
<img src="manual/train_graph_ok.png" />
<br>
For now only the topmost graph is important. The two lines are the <a href="#Loss">Loss</a> and <a href="#Validation_Loss">Validation-Loss</a>.
<br>
Both show how well the network is performing. A simple (but technically inaccurate) way of thinking of them is as the number of errors the network makes while predicting: the lower, the better.
<br>
The Validation Loss is based on the <a href="#Validation_Split">Validation Split</a>. This takes a certain percentage out of the training loop and tests the network after each <a href="#Epochs">Epoch</a> on data the network has not yet seen.
<br>
Both graphs should look similar. See <a href="#Interpretating_the_training_graph">how to interpret these graphs here</a>.
<br>
When the training is done (or you cancelled it prematurely by clicking <img src="manual/stop_training_button.png"> in the ribbon), you are automatically redirected to the Predict tab, where your webcam is already enabled and the network's prediction is shown live.
<br>
<img src="manual/predict_webcam.png">
<br>
The category which has the highest probability is automatically highlighted in green.
<br>
Congratulations! 🎉 You have now trained a neural network. You can now <a href="#Export_to_Python">export it to Python</a> and build any kind of logic with it.
<br>
<h3>Train on images from files</h3>
Training on images from files is very similar to training on webcam images, but instead of the camera icon, click the photo icon (or choose "Own data?" → "Yes, own images/webcam").
<br>
To the right of the category name, you see an Upload button. Drag files onto it or click it to open the file picker.
<br>
The rest is the same as training on webcam images: when you are done adding images, you may choose to <a href="#Augmentation">augment</a> them, or go straight to <img src="manual/start_training_button.png" /> to start training.
<br>
<h3>Train on CSV</h3>
<br>
As neural networks are just function approximators (see <a href="#Basic_idea_of_neural_networks">Basic idea of neural networks</a>), you can also approximate custom numerical functions. The easiest way to import custom data for functions of the form \( f(x_1, x_2, x_3, \dots, x_n) = [y_1, y_2, y_3, \dots, y_n] \) is the CSV importer.
<br>
For this, choose "Own data?" → "Yes, own CSV".
<br>
A new tab will appear.
<br>
<img src="manual/own_csv_1.png" />
<br>
In the large text field, you can enter data in the CSV format:
<pre>
x_1, x_2, x_3, y_1, y_2, y_3
1, 0, -1, 5, 1, 1
3, 3, 1, 0, 1, 3
...
</pre>
<img src="manual/own_csv_2.png" />
<br>
In the header-to-training-data section you can specify which columns (defined by their title in the very first line) belong to the input and which ones belong to the output.
<br>
<img src="manual/own_csv_3.png" />
<br>
After specifying this, a preview of how the tensors will look is shown on the right.
<br>
The input and output shapes are changed automatically according to the data type and the constructed network.
<br>
There are also some options at the top:
<ul>
<li><b>Auto-adjust last layer's number of neurons?</b> This sets the number of output neurons depending on the number of values set to "Y".</li>
<li><b>Auto-set last layer's activation to linear when any Y-values are smaller than 0 or greater than 1?</b> This sets the activation function of the last layer to Linear automatically when at least one of the output values is smaller than 0 or greater than 1.</li>
<li><b>Shuffle data before doing validation split (recommended)?</b> TensorFlow usually takes the last n% of the input data as Validation Split (if enabled). This way, if the data is ordered, you may miss some categories of data in the training data, because it got shifted to the Validation dataset automatically. This can be counteracted by shuffling the data randomly before the validation split is taken out of the dataset. The correlation between an input and an output stays the same after shuffling.</li>
<li><b>Auto One-Hot-encode Y (disables "divide by")?</b> Automatically One-Hot-Encodes if the Y-data has only one column and is a string. Then, it autogenerates labels too.</li>
<li><b>Auto loss/metric?</b> Set loss and metric automatically depending on how the data is structured (for classification problems it takes <a href="#categoricalCrossentropy">Categorical Crossentropy</a>, and <a href="#meanSquaredError">Mean Squared Error</a> for anything else).</li>
<li><b>Separator</b> Which symbol should be used as the CSV separator (usually ",").</li>
<li><b>divide by</b> Which number every value in the tensor should be divided by (by default 1; not available if auto-one-hot-encoding is enabled).</li>
</ul>
<h2>Basic idea of neural networks</h2>
<h3>Data</h3>
For neural networks, everything is a tensor. Even if you don't know it, you have certainly used tensors. Every number, every vector and every matrix is a tensor.
<br>
Tensors are a generalization of matrices. Where matrices have 2 dimensions,
<br>
$$
\textrm{Second dimension}
\stackrel{\mbox{First dimension}}{%
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1M} \\
a_{21} & a_{22} & \cdots & a_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
a_{N1} & a_{N2} & \cdots & a_{NM}
\end{pmatrix}%
}.
$$
<br>
Tensors have an arbitrary number of dimensions.
<br>
An image, for example, consists of 3 channels, one for red, green and blue, each one being a matrix (or submatrix of the image tensor). If the image is 3x3 pixels, the image as a tensor would look like this:
<br>
$$
\text{Image} = \begin{pmatrix}
\text{Red:} \begin{pmatrix}
255 & 0 & 0 \\
0 & 128 & 0 \\
0 & 0 & 64
\end{pmatrix},
\text{Green:} \begin{pmatrix}
255 & 0 & 0 \\
0 & 128 & 0 \\
0 & 0 & 64
\end{pmatrix},
\text{Blue:} \begin{pmatrix}
255 & 0 & 0 \\
0 & 128 & 0 \\
0 & 0 & 64
\end{pmatrix}
\end{pmatrix}
$$
<br>
The three channels together give us this total image:
<br>
<div class="center_vertically" style="height: 300px; position: relative">
<div style="position: absolute; left: 130px; width: 300px">
<span id="training_data_matrix" style="display: inline-flex; width: 150px">
<span class="flip3dtensor">
<span class="container">
<span class="grid">
<div class="cell" style='background-color: #0000ff'>255</div>
<div class="cell">0</div>
<div class="cell">0</div>
<div class="cell">0</div>
<div class="cell" style='background-color: #000080'>128</div>
<div class="cell">0</div>
<div class="cell">0</div>
<div class="cell">0</div>
<div class="cell" style='background-color: #000040'>64</div>
</span>
</span>
</span>
<span style="position: relative; top: 30px; left: -150px;" class="flip3dtensor">
<span class="container">
<span class="grid">
<div class="cell" style='background-color: #00ff00'>255</div>
<div class="cell">0</div>
<div class="cell">0</div>
<div class="cell">0</div>
<div class="cell" style='background-color: #008000'>128</div>
<div class="cell">0</div>
<div class="cell">0</div>
<div class="cell">0</div>
<div class="cell" style='background-color: #004000'>64</div>
</span>
</span>
</span>
<span style="position: relative; top: 60px; left: -300px;" class="flip3dtensor">
<span class="container">
<span class="grid">
<span class="cell" style='background-color: #ff0000'>255</span>
<span class="cell">0</span>
<span class="cell">0</span>
<span class="cell">0</span>
<span class="cell" style='background-color: #800000'>128</span>
<span class="cell">0</span>
<span class="cell">0</span>
<span class="cell">0</span>
<span class="cell" style='background-color: #400000'>64</span>
</span>
</span>
</span>
</span>
</div>
<div style="position: absolute; left: 260px;">=</div>
<div style="position: absolute; left: 280px;">
<div class="grid">
<div class="cell" style='background-color: #ffffff'></div>
<div class="cell" style='background-color: black'></div>
<div class="cell" style='background-color: black'></div>
<div class="cell" style='background-color: black'></div>
<div class="cell" style='background-color: #808080'></div>
<div class="cell" style='background-color: black'></div>
<div class="cell" style='background-color: black'></div>
<div class="cell" style='background-color: black'></div>
<div class="cell" style='background-color: #404040'></div>
</div>
</div>
</div>
<br>
Any data a computer can handle can be expressed as <i>some</i> tensor. They may have more or larger dimensions, but they are nonetheless tensors.
<br>
The description of the size of a tensor is called its shape. A three-by-three matrix would have the shape \( [3, 3] \). The image above would have the shape \( [3, 3, 3] \), because it is 3x3 pixels and has three channels. An image with 64x64 pixels and 3 channels would be \( [64, 64, 3] \).
<br>
<h3>One-Hot-Encoding</h3>
One-Hot-Encoding is used to represent categories and the probabilities assigned to them. For example, if you want to differentiate between cat and dog, the output vector
could be \( [\text{Percentage Cat}, \text{Percentage Dog}] \), which, in total, sums up to 1 (100%).
<br>
For more categories, you'd add another entry to that column vector, like \( [\text{Percentage Cat}, \text{Percentage Dog}, \text{Percentage Human} ] \), all of which, again, sum up to 1. This can be achieved with the <a href="#SoftMax">SoftMax</a>-activation-function.
<br>
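Here is a rough sketch of how such an encoding can be produced in TensorFlow.js (the library asanAI builds on); the category names and indices are made up for illustration:
<pre>
// Hypothetical categories: 0 = cat, 1 = dog, 2 = human
const labels = tf.tensor1d([0, 2, 1], "int32");

// One-hot-encode the three labels into vectors of length 3
const oneHot = tf.oneHot(labels, 3);
oneHot.print();
// [[1, 0, 0],
//  [0, 0, 1],
//  [0, 1, 0]]
</pre>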
<h3>Layers</h3>
Layers act as nested functions. Each layer is a function by itself, and by stacking layers you combine them into one larger function.
<br>
You can imagine them as such:
<br>
$$ \text{Result} = \text{Layer 3}\left(\text{Layer 2}\left(\text{Layer 1}\left(\text{Layer 0}\left(\text{input data}\right)\right)\right)\right) $$
<br>
This is called a sequential model, since the data flows through it sequentially. There are other types of models, but they cannot be designed with asanAI.
<br>
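A minimal sketch of such a sequential model in TensorFlow.js; the layer sizes and activations are chosen arbitrarily for illustration:
<pre>
// Build a sequential model: the data flows through the layers in order
const model = tf.sequential();
model.add(tf.layers.dense({ units: 16, activation: "relu", inputShape: [4] })); // "Layer 0"
model.add(tf.layers.dense({ units: 8, activation: "relu" }));                   // "Layer 1"
model.add(tf.layers.dense({ units: 2, activation: "softmax" }));                // "Layer 2"
model.summary();
</pre>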
<h3>What do functions have to do with neural networks?</h3>
A mathematical function assigns values from one set to values from another.
<br>
Since parameters for functions can also be matrices, or even tensors, you can define a computer program as a function that gets some input and produces a specific output, depending solely on the inputs.
<br>
Imagine a set of images of cats and dogs. These are, as already discussed, tensors. If you want to classify these images, you are actually searching for a function such that
<br>
$$
f\left(\text{Input Image Tensor}\right) = \begin{pmatrix}
\text{Probability cat in percent}\\
\text{Probability dog in percent}
\end{pmatrix}
$$
<br>
Writing this function manually is practically impossible. Every picture of every cat or dog is different. Even if it's the same cat, it is different if the picture is taken half a second later. So you cannot simply say "if this pixel has this color and this pixel has this color, and ..., then it is a cat".
<br>
This is where neural networks come in. Via the layers, we can approximate a function that does this, by connecting different, very general functions (called layer types) that perform specific kinds of tasks.
<br>
We will cover these layers <a href="#Layer_Types">here</a>.
<br>
In neural networks, instead of writing the internals of the function yourself, you give the network a lot of data and what should come out. Mathematically, you tell the network that the function \( f \) should transform the input set \( X \) into the output set \( Y \). It will then try to find values for the internal parameters of the functions, whose general outline you give by specifying the layer types and their options, so that the difference between the ground truth you specify and the values the network outputs is minimized as much as possible.
<br>
<h3>Dimensionality Reduction</h3>
A common goal of neural networks is dimensionality reduction.
<br>
Imagine a 64x64 image of either a cat or a dog. The image has 3 channels, so in total it consists of \( 64 \cdot 64 \cdot 3 = 12288 \) values. If we only have the categories "Dog" or "Cat", we need to reduce the information from 12288 values to only 2 values.
<br>
This is a dimensionality reduction, from a tensor of the shape \( [64, 64, 3] \) to a tensor of the shape \( [2] \).
<br>
This can be done by several ways. For example, <a href="#Convolutional_Layers">convolutions</a> or <a href="#Pooling_layers">pooling layers</a> "extract" information from images and reduce the number of values and therefore reduce the dimensionality of the inputted images.
<br>
<h3>Training</h3>
The <a href="#Loss">loss function</a> creates a single value from the training data \(X\) and \(Y\) such that the lower the number is, the better the results are.
This creates a so-called "loss-landscape", that is a function that represents, for each data point, how well the network currently recognizes it.
<br>
For each point, this is just a single float. The overall loss is the average loss of all points.
<br>
For each point, while training, a loss is determined and the weights are then updated to reduce it. How exactly this update is done depends on the optimizer chosen (see <a href="#Optimizers">Optimizers</a> for more details).
<br>
After each <a href="#Batch-Size">Batch</a>, the (trainable) weights and biases are adjusted to better fit the training data and to minimize the loss. The network structure is not altered while training.
<br>
<h4>Batch-Size</h4>
While training, you (most probably) cannot hold all the data in memory at once. So the data is split into so-called batches. A batch is a subset of the \( X \)-input-tensor and the \( Y \)-output-tensor, such that the inputs are still correctly assigned to their outputs. Imagine you have 1000 input values that correspond to 1000 output values and a batch size of 3; then the first batch may be:
<br>
$$
f\left(
\begin{pmatrix}
x_0 \\
x_1 \\
x_2
\end{pmatrix}
\right) = \begin{pmatrix}
y_0 \\
y_1 \\
y_2
\end{pmatrix}
$$
<br>
The next batch may then be:
<br>
$$
f\left(
\begin{pmatrix}
x_3 \\
x_4 \\
x_5
\end{pmatrix}
\right) = \begin{pmatrix}
y_3 \\
y_4 \\
y_5
\end{pmatrix}
$$
<br>
and so on, until all values have been seen by the network once. This is then called an <a href="#Epochs">epoch</a>.
<br>
<h4>Epochs</h4>
When the network, while training, has seen all training data once, this is called an epoch. You usually need many epochs after each other to train a neural network.
<br>
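As a sketch of how batch size and epochs are set in TensorFlow.js, assuming a compiled model and the training tensors xs and ys already exist:
<pre>
// Assumption: `model` is a compiled tf.Sequential, `xs`/`ys` are the training tensors
// (inside an async function)
await model.fit(xs, ys, {
	batchSize: 32,        // how many samples are processed before the weights are updated
	epochs: 10,           // how often the whole dataset is seen during training
	shuffle: true,        // shuffle the data before each epoch (see Shuffling below)
	validationSplit: 0.2  // keep 20% of the data aside for the validation loss
});
</pre>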
<h4>Shuffling</h4>
Because usually the whole data does not fit into memory and has to be split into smaller chunks (<a href="#Batch-Size">Batches</a>), it is usually recommended to shuffle the data. Imagine you didn't do this in the example network that should learn to classify cats and dogs, and in the first batch there are only cats and in the second one only dogs.
<br>
Then, the network would learn "cat" in the first batch and be punished for what it has learnt previously in the next batch, where there are only dogs.
<br>
It's recommended that each batch, if possible, contains data from many different categories, so the network doesn't <a href="#Overfitting">overfit</a> within each batch. Therefore, the data is shuffled by default, so the likelihood of one batch containing only one type of image is drastically reduced.
<br>
<h4>How the computer calculates derivatives of very complex functions</h4>
<br>
One possible definition of derivatives is this equation:
<br>
$$ f'(x) = \lim\limits_{h \to 0} \frac{f(x + h) - f(x)}{h} $$
<br>
The way a computer can approximate derivatives of any arbitrary function, no matter how complex, is to set \(h\) to some very small value.
<br>
Let's say,
<br>
$$ f(x) = 2x^2 $$
<br>
Of course, the derivative is \(4x\) then. But what does the computer say when you, for example, set \(h\) to 0.0001 and evaluate at the specific point \(x = 10\)?
<br>
$$ f'(x) \approx \frac{f(x + h) - f(x)}{h}, \qquad h = 0.0001 $$
<br>
$$ \frac{f(x + 0.0001) - f(x)}{0.0001} = $$
<br>
$$ \frac{f(10 + 0.0001) - f(10)}{0.0001} = $$
<br>
$$ \frac{2(10+0.0001)^2 - 2\cdot 10^2}{0.0001} = $$
<br>
$$ \frac{2(10+0.0001)^2 - 200}{0.0001} = $$
<br>
$$ \frac{0.00400002}{0.0001} = 40.0002 $$
<br>
The real answer is 40, the approximated answer is 40.0002; this is, for our example case, good enough, but it could of course be improved by choosing a smaller \(h\).
<br>
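This finite-difference approximation is easy to reproduce in plain JavaScript; the function name below is made up for illustration:
<pre>
// Approximate f'(x) with the finite difference (f(x + h) - f(x)) / h
function approximateDerivative(f, x, h = 0.0001) {
	return (f(x + h) - f(x)) / h;
}

const f = x => 2 * x * x;                  // f(x) = 2x^2, so f'(x) = 4x
console.log(approximateDerivative(f, 10)); // ~40.0002 instead of the exact 40
</pre>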
<h3>Predicting</h3>
Predicting with a simple sequential neural network works by passing input data through the layers of the network in order, starting from the input layer and ending at the output layer.
<br>
The input data is first passed through the first layer, the output of the first layer is then passed to the first hidden layer, where it is transformed using a set of weights and biases that are learned during the training process.
<br>
This output is then passed through the next hidden layer, where it is again transformed using a set of weights and biases, and so on, until it reaches the output layer.
<br>
The output layer produces the final predictions of the network, which can be compared to the expected output during training to adjust the weights and biases.
<br>
During the prediction phase, the input data is passed through the trained network, and the output of the output layer is the prediction of the model. The prediction can be in the form of a probability distribution over the possible classes, or a single class.
<br>
It's important to note that when the model is trained, it's trained with a specific set of weights and biases and these are the ones that are used during the prediction phase. Also, it's important to note that the prediction phase should be done on unseen data, as the model's aim is to generalize to new examples, not memorize the training data.
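<br>
A rough sketch of a prediction in TensorFlow.js, assuming a trained classification model that expects images of shape [10, 10, 3]:
<pre>
// Assumption: `model` is a trained model expecting inputs of shape [10, 10, 3]
const input = tf.randomUniform([1, 10, 10, 3]);          // one example (batch size 1)
const prediction = model.predict(input);                 // probability distribution over the classes
prediction.print();                                      // e.g. [[0.1, 0.8, 0.1]]
const classIndex = prediction.argMax(-1).dataSync()[0];  // index of the most likely class
console.log("Predicted category:", classIndex);
</pre>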
<h3>Shapes</h3>
<br>
Tensors have shapes. The shape describes how many data points are in the tensor, and in what way they are arranged.
<br>
For example, an image may have 3 channels (one each for red, green and blue), and each of those channels may be 10px wide and 20px high. So the total tensor shape may be \([10, 20, 3]\). If you have multiple of these images, let's say five, you add a new first dimension of size 5, like this: \([5, 10, 20, 3]\). Then you'd have 5 images, each 10x20px, with 3 channels each. In total, this makes \(5 \cdot 10 \cdot 20 \cdot 3 = 3000\) data points.
<br>
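In TensorFlow.js, the shape of a tensor is available as a property; a minimal sketch with made-up values:
<pre>
// Five 10x20 "images" with 3 channels each, filled with zeros for illustration
const images = tf.zeros([5, 10, 20, 3]);
console.log(images.shape); // [5, 10, 20, 3]
console.log(images.size);  // 3000 data points in total
</pre>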
<h4>Input Shape</h4>
The input shape of a network is the shape of tensors that the network can process. If the incoming tensor has another shape, it may fail, because the network doesn't know what to do with it.
<br>
<h4>Output Shape</h4>
The output shape of a network is the shape of the tensor that comes out of the network. Usually, in asanAI, the last layer is a Dense Layer whose output is One-Hot-Encoded (though this may be different for special kinds of networks not covered in this documentation).
<br>
For example, if you have 5 categories as a one-hot-encoded, softmax'ed vector like this: \([0, 0.8, 0.1, 0.05, 0.05]\), the output shape is \([5]\).
<br>
<br>
<h3>Overfitting</h3>
Overfitting is a common problem in machine learning. It occurs when a model is trained too well on the training data and, as a result, performs poorly on new, unseen data. It means that the model has learned the noise in the training data rather than the underlying pattern.
<br>
In simple terms, overfitting occurs when a model is too complex and it captures the noise in the training data. This happens when the model has too many parameters compared to the amount of training data available.
<br>
Here are some possible ways to avoid overfitting:
<ul>
<li><b>Use more data:</b> The more data you have, the less likely the model is to overfit.</li>
<li><b>Use regularization techniques:</b> Regularization techniques such as L1, L2 or dropout can help to reduce overfitting by adding a penalty term to the loss function.</li>
<li><b>Use cross-validation:</b> Cross-validation is a technique used to evaluate the performance of a model by dividing the data into multiple subsets.</li>
<li><b>Early stopping:</b> Monitor the performance of the model on a validation set during training and stop training when the performance on the validation set starts to decrease (see the sketch after this list).</li>
<li><b>Use simpler models:</b> A simpler model is less likely to overfit than a complex one, so it is recommended to use simpler models when dealing with small datasets.</li>
<li><b>Bagging and Boosting:</b> Bagging and Boosting are ensemble methods that can help to reduce overfitting by combining multiple models.</li>
<li><b>Transfer learning:</b> Use pre-trained models to initialize the weights of the network; this can help the network to converge faster and achieve better performance.</li>
</ul>
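Early stopping, for example, can be sketched in TensorFlow.js like this, assuming a compiled model and the training tensors xs and ys already exist:
<pre>
// Assumption: `model`, `xs` and `ys` already exist (inside an async function)
await model.fit(xs, ys, {
	epochs: 100,
	validationSplit: 0.2,
	callbacks: tf.callbacks.earlyStopping({
		monitor: "val_loss", // watch the validation loss
		patience: 5          // stop when it has not improved for 5 epochs
	})
});
</pre>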
<h2>Layer Types</h2>
<h3>Basic Layer Types</h3>
<h4>Dense</h4>
<br>
Dense Layers are used as general-purpose function approximators. The basic mathematical structure of a Dense Layer is as follows:
<br>
$$
\text{Dense:} \qquad \underbrace{\begin{pmatrix}
y_{0}
\end{pmatrix}}_{\mathrm{Output}}
= \underbrace{\begin{pmatrix}
x_{0}
\end{pmatrix}}_{\mathrm{Input}}
\times \underbrace{\begin{pmatrix}
-1.4404407739639282
\end{pmatrix}}_{\mathrm{Kernel^{1 \times 1}}}
+ \underbrace{\begin{pmatrix}
0
\end{pmatrix}}_{\mathrm{Bias}}
$$
<br>
Depending on the <a href="#Input_Shape">Input Shape</a>, the number of elements in both the Kernel and the Bias may change.
<br>
This, for example, is a Dense Layer with the input shape \( [2] \):
<br>
$$
\text{Dense:} \qquad \underbrace{\begin{pmatrix}
y_{0}
\end{pmatrix}}_{\mathrm{Output}}
= \underbrace{\begin{pmatrix}
x_{0}\\
x_{1}
\end{pmatrix}}_{\mathrm{Input}}
\times \underbrace{\begin{pmatrix}
0.785955011844635\\
-0.015428715385496616
\end{pmatrix}}_{\mathrm{Kernel^{2 \times 1}}}
+ \underbrace{\begin{pmatrix}
0.123153419419419
\end{pmatrix}}_{\mathrm{Bias}}
$$
<br>
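A sketch of a Dense layer with input shape [2] and one output unit in TensorFlow.js (the weights are initialized randomly, not to the example values above):
<pre>
const dense = tf.layers.dense({
	units: 1,         // one output neuron y_0
	inputShape: [2],  // two input values x_0 and x_1
	useBias: true     // adds the bias term
});
</pre>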
<h4>Flatten</h4>
Flatten has no options. It turns any matrix (or higher-dimensional tensor) into a simple vector.
<br>
Example:
<br>
$$
\textrm{Flatten}\left( \begin{pmatrix}
0 & 1 & 2 \\
3 & 4 & 5 \\
6 & 7 & 8
\end{pmatrix}\right) = \left[0 \quad 1 \quad 2 \quad 3 \quad 4 \quad 5 \quad 6 \quad 7 \quad 8 \right]
$$
<br>
This is used for <a href="#Dimensionality_Reduction">Dimensionality Reduction</a>, in asanAI especially for the transfer of image tensors to vectors for Dense Layers (see <a href="#Network_Structures">Network Structures</a>).
<br>
<h4>Dropout</h4>
<div id="dropout_example"></div>
<br>
<br>
The dropout layer sets random values to 0 with the probability given in the Dropout-Rate option.
<br>
$$
\underbrace{\textrm{Dropout}}_{\text{Dropout-Rate: 50\%}}\left(
\begin{pmatrix}
1 & 2 & 3 & 4 \\
5 & 6 & 7 & 8 \\
9 & 10 & 11 & 12 \\
13 & 14 & 15 & 16 \\
17 & 18 & 19 & 20 \\
21 & 22 & 23 & 24 \\
\end{pmatrix}
\right)
\xrightarrow{\text{Set values randomly to 0 with a 50\% chance}}
\begin{pmatrix}
0 & 0 & 3 & 0 \\
5 & 6 & 7 & 8 \\
9 & 10 & 0 & 0 \\
0 & 0 & 15 & 0 \\
0 & 18 & 19 & 20 \\
21 & 0 & 0 & 0 \\
\end{pmatrix}
$$
<br>
This is only active while training.
<br>
This is used for avoiding <a href="#Overfitting">overfitting</a>.
<br>
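A minimal sketch of a Dropout layer in TensorFlow.js:
<pre>
// Sets 50% of the incoming values to 0, but only while training
const dropout = tf.layers.dropout({ rate: 0.5 });
</pre>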
<h4>Reshape</h4>
<br>
This allows incoming data tensors to be reshaped into another tensor. The number of elements does not change, only their arrangement.
<br>
$$
\begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6
\end{pmatrix}
\xrightarrow{\text{Reshape to [3, 2]}}
\begin{pmatrix}
1 & 2 \\
3 & 4 \\
5 & 6
\end{pmatrix}
$$
<br>
The product of all input shape elements must be the same as the product of the desired output shape's elements.
<br>
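The same reshaping can be done directly on a tensor in TensorFlow.js, or as a layer inside a model:
<pre>
const t = tf.tensor2d([[1, 2, 3], [4, 5, 6]]); // shape [2, 3]
const reshaped = t.reshape([3, 2]);            // same 6 elements, new arrangement
reshaped.print();
// [[1, 2],
//  [3, 4],
//  [5, 6]]

// As a layer (the batch dimension is omitted from targetShape)
const reshapeLayer = tf.layers.reshape({ targetShape: [3, 2] });
</pre>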
<h3>Activation Layer Types</h3>
See <a href="#Activation_Functions">Activation Functions</a>. The Activation Layer Types do the same as the activation functions, but in a separate layer.
<h3>Convolutional Layers</h3>
<h4>convNd (conv1d, conv2d)</h4>
<div id="conv2d_example"></div>
Convolutions slide a matrix, called a kernel or filter, with width \(x\) and height \(y\) over the data (in <a href="#Strides">stride steps</a>) and multiply each submatrix of size \(x\) by \(y\) with that kernel. This <a href="#Dimensionality_Reduction">reduces dimensionality</a> and preserves the general activation strength at certain submatrices.
<br>
Example:
<br>
Kernel: \( K = \begin{pmatrix}
1 & -1 \\
0 & 2
\end{pmatrix}\).
<br>
Data: \( D = \begin{pmatrix}
10 & 8 & 1 & 4 \\
4 & 2 & 14 & 5 \\
12 & 20 & -5 & 19 \\
32 & 128 & 3 & 30
\end{pmatrix}
\).
<br>
The first submatrix (without <a href="#Padding">Padding</a>, which is not needed here because the kernel fits perfectly when <a href="#Strides">strides</a> = 1) is \( S_1 = \begin{pmatrix}
10 & 8 \\
4 & 2
\end{pmatrix} \). \( S_1 \cdot K = \begin{pmatrix}
10 & 8 \\
4 & 2
\end{pmatrix} \cdot \begin{pmatrix}
1 & -1 \\
0 & 2
\end{pmatrix} = \begin{pmatrix}
10 & 6 \\
4 & 0
\end{pmatrix}
\).
<br>
The second submatrix is then \( S_2 = \begin{pmatrix}
1 & 4 \\
14 & 5
\end{pmatrix} \), which, multiplied by \(K\), is \( \begin{pmatrix}
1 & 7 \\
14 & -4
\end{pmatrix}
\).
<br>
When slid over the whole matrix, the result is \(
\begin{pmatrix}
10 & 6 & 1 & 7 \\
4 & 0 & 14 & -4 \\
12 & 28 & -5 & 43 \\
32 & 224 & 3 & 57
\end{pmatrix}
\). The kernel is being trained to recognize whatever it needs to recognize.
<br>
What the kernel has learnt can be seen by <a href="#Visualize_Layer">Visualize Layer</a> for images and image-like tensors.
<br>
The same principle of a sliding window with matrix multiplications is used in all Convolutional Layers, no matter if 1d or 2d. For 2d, the input tensor must have the shape \( [\text{int}, \text{int}, \text{int}] \) (disregarding the batch size, which would be at first position).
<br>
A bias (if enabled) is then added to each output value.
<br>
For 1d convolutions, the kernel can be written as a 2d matrix. For 2d convolutions, the kernel is actually a 3d cube (the extra dimension being the channels).
<br>
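A sketch of a 2D convolution layer in TensorFlow.js; the parameter values are only examples:
<pre>
const conv = tf.layers.conv2d({
	filters: 3,            // number of kernels to learn
	kernelSize: [2, 2],    // width and height of each kernel
	strides: [1, 1],       // step size of the sliding window
	padding: "valid",      // no padding around the input
	activation: "relu",
	inputShape: [4, 4, 1]  // a 4x4 input with one channel, as in the example above
});
</pre>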
<h4>conv2dTranspose</h4>
<div id="conv2dTranspose_example"></div>
A 2D transposed convolutional layer, also known as a deconvolutional layer, is a type of layer used in deep neural networks for upsampling the feature maps. It is the inverse operation of a 2D convolutional layer, and it is used to increase the spatial resolution of the feature maps.
<br>
A 2D transposed convolutional layer applies a set of filters to the input feature maps, with the goal of increasing their spatial resolution. The filters are applied in a way that is similar to a 2D convolutional layer, but the operation is done in reverse. The input feature maps are upsampled by the transposed convolutional layer by inserting zeros between the elements of the input feature maps, and then applying the filters.
<br>
The use of 2D transposed convolutional layers makes it possible to increase the spatial resolution of the feature maps, which is useful for tasks such as image segmentation or image generation. They are used in the decoder part of architectures such as U-Net or other encoder-decoder architectures, where the goal is to increase the spatial resolution of the feature maps in order to make the predictions more detailed.
<br>
It is a good choice when the goal is to increase the spatial resolution of the feature maps, for example, in image segmentation or image generation tasks. It is also a good choice when working with encoder-decoder architectures, where the goal is to make predictions more detailed.
<br>
<h4>depthwiseConv2d</h4>
<div id="depthwiseConv2d_example"></div>
A depthwise convolutional layer is a type of 2D convolutional layer used in deep neural networks. It applies a single filter to each input channel independently, rather than applying a set of filters to the entire input feature map.
<br>
The depthwise convolutional layer applies a single filter to each input channel independently, which means that the number of filters used is equal to the number of input channels. The filters are applied to the input feature map in a way that is similar to a 2D convolutional layer, but each filter is applied only to a single input channel.
<br>
The use of depthwise convolutional layers allows for a reduction in the number of parameters and computation in the network, while maintaining or even improving the performance. They also allow for a better representation of the spatial correlations within each channel.
<br>
It is a good choice when there is a need to reduce the number of parameters and computation in the network while maintaining or improving the performance. It's also a good choice when working with images and there's a need to preserve the spatial correlations within each channel. They are commonly used in conjunction with pointwise convolutional layers to form a separable convolutional layer, which is a more efficient way of applying convolutional filters to the input feature maps.
<br>
<h4>separableConv2d</h4>
<div id="separableConv2d_example"></div>
Separable convolutional layers are a type of 2D convolutional layers used in deep neural networks. They are designed to reduce the number of parameters and computation in the network, while maintaining or even improving the performance.
<br>
A separable convolutional layer consists of two parts: a depthwise convolutional layer and a pointwise convolutional layer. The depthwise convolutional layer applies a single filter to each input channel independently, while the pointwise convolutional layer applies a 1x1 convolution to combine the output of the depthwise convolutional layer.
<br>
The use of the depthwise convolutional layer and the pointwise convolutional layer in a separable convolutional layer makes it possible to reduce the number of parameters and computation in the network, while maintaining or even improving the performance. The depthwise convolutional layer reduces the number of parameters by applying a single filter to each input channel independently, while the pointwise convolutional layer combines the output of the depthwise convolutional layer with a 1x1 convolution.
<br>
Separable convolutional layers are often used in the early layers of neural network architectures, especially in the mobile versions of the architectures, where the number of parameters and computation is limited. They can also be used as a replacement for standard 2D convolutional layers in cases where the number of parameters and computation is a concern.
<br>
It is a good choice when there's a need to reduce the number of parameters and computation in the network while maintaining or improving the performance. It's also a good choice when the model is going to run on mobile devices or other resource-constrained environments.
<br>
<h4>upsampling2d</h4>
<div id="upSampling2d_example"></div>
<br>
Makes images and image-like tensors larger by duplicating rows and columns as specified by the size factors \( [w, h] \).
<br>
For example, \(
\underbrace{\text{upsampling2d}}_{h = 2,\ w = 4}\left(
\begin{pmatrix}
1 & 2 \\
3 & 4
\end{pmatrix}
\right) = \begin{pmatrix}
1 & 1 & 1 & 1 & 2 & 2 & 2 & 2 \\
1 & 1 & 1 & 1 & 2 & 2 & 2 & 2 \\
3 & 3 & 3 & 3 & 4 & 4 & 4 & 4 \\
3 & 3 & 3 & 3 & 4 & 4 & 4 & 4
\end{pmatrix} \).
<br>
This can be used to upscale images after they have been compressed, e.g. for image segmentation.
<br>
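A sketch of the corresponding TensorFlow.js layer; note that the size factors are given as [rows, columns]:
<pre>
// Repeats every row 2 times and every column 4 times, as in the example above
const upsample = tf.layers.upSampling2d({ size: [2, 4] });
</pre>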
<h3>Pooling layers</h3>
<h4>averagePooling (averagePooling1d, averagePooling2d)</h4>
<div id="averagePooling2d_example"></div>
averagePooling slides a window with pool size \(x\) and \(y\) as width/height over the data (in <a href="#Strides">stride steps</a>) and, for each submatrix of size \(x\) by \(y\), calculates the average of all elements in that submatrix. This <a href="#Dimensionality_Reduction">reduces dimensionality</a> and preserves the general activation strength at certain submatrices.
<br>
Example:
<br>
$$
\underbrace{\text{averagePooling}}_{\text{Strides: 2x2, Pool-Size: 2x2}} \left(\begin{pmatrix}
\color{red}{10} & \color{red}{8} & \color{blue}{1} & \color{blue}{4} \\
\color{red}{4} & \color{red}{2} & \color{blue}{14} & \color{blue}{5} \\
\color{orange}{12} & \color{orange}{20} & \color{green}{-5} & \color{green}{19} \\
\color{orange}{32} & \color{orange}{128} & \color{green}{3} & \color{green}{30}
\end{pmatrix}\right) = \begin{pmatrix}
\color{red}{\frac{10 + 8 + 4 + 2}{4}} & \color{blue}{\frac{1 + 4 + 14 + 5}{4}} \\
\color{orange}{\frac{12+20+32+128}{4}} & \color{green}{\frac{-5+19+3+30}{4}} \\
\end{pmatrix} = \begin{pmatrix}
\color{red}{6} & \color{blue}{6} \\
\color{orange}{48} & \color{green}{11.75} \\
\end{pmatrix}
$$
<br>
<br>
<h4>maxPooling (maxPooling1d, maxPooling2d)</h4>
<div id="maxPooling2d_example"></div>
<br>
maxPooling slides a window with pool size \(x\) and \(y\) as width/height over the data (in <a href="#Strides">stride steps</a>) and, for each submatrix of size \(x\) by \(y\), extracts the largest value. This <a href="#Dimensionality_Reduction">reduces dimensionality</a> and preserves the most activated values in certain regions.
<br>
Example:
<br>
$$
\underbrace{\text{maxPooling}}_{\text{Strides: 2x2, Pool-Size: 2x2}} \left(\begin{pmatrix}
\color{red}{10} & \color{red}{8} & \color{blue}{1} & \color{blue}{4} \\
\color{red}{4} & \color{red}{2} & \color{blue}{14} & \color{blue}{5} \\
\color{orange}{12} & \color{orange}{20} & \color{green}{-5} & \color{green}{19} \\
\color{orange}{32} & \color{orange}{128} & \color{green}{3} & \color{green}{30}
\end{pmatrix}\right) = \begin{pmatrix}
\color{red}{10} & \color{blue}{14} \\
\color{orange}{128} & \color{green}{30} \\
\end{pmatrix}
$$
<br>
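A sketch of the two pooling layers in TensorFlow.js, matching the examples above:
<pre>
// Non-overlapping 2x2 windows: every window is reduced to a single value
const avgPool = tf.layers.averagePooling2d({ poolSize: [2, 2], strides: [2, 2] });
const maxPool = tf.layers.maxPooling2d({ poolSize: [2, 2], strides: [2, 2] });
</pre>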
<h3>Dropout and noise layers</h3>
<h4>alphaDropout</h4>
<div id="alphaDropout_example"></div>
<br>
AlphaDropout is a variation of dropout, a regularization technique for reducing overfitting in neural networks.
<br>
In standard dropout, a random subset of neurons is "dropped out" during each training step by setting their activations to zero. AlphaDropout, on the other hand, sets the dropped activations to the negative saturation value of the SELU activation and rescales the result, so that the mean and variance of the activations are preserved.
<br>
This way, AlphaDropout allows the model to still learn the mean of the activations while also preventing overfitting and forcing the model to be less sensitive to the specific weights of the neurons. In other words, AlphaDropout regularizes the network by adding noise to the activations, which promotes the network to learn more robust feature representations.
<br>
AlphaDropout can be useful for certain types of datasets, such as time-series data, where the activations have temporal dependencies.
<br>
This layer is only active during training.
<br>
<h4>gaussianDropout</h4>
<div id="gaussianDropout_example"></div>
Instead of setting values to zero, gaussianDropout multiplies the activations with Gaussian noise centered around 1; the strength of the noise is controlled by the specified dropout rate (in the example, 0.2).
<br>
This is used to simulate real-world data, which is usually noisy (for example, when coming in over a webcam).
<br>
This layer is only active during training.
<br>
<h4>gaussianNoise</h4>
<div id="gaussianNoise_example"></div>
<br>
Adds gaussian noise to images. You can specify the standard deviation (in the case shown above = 1) of how noisy the image should be.
<br>
This is used to simulate real-world data, which is usually noisy (for example, when coming in over a webcam).
<br>
This layer is only active during training.
<br>
<h3>Debug Layers</h3>
<h4>Debug Layer</h4>
This layer does not change the data in any way. It just prints it to <tt>console.log</tt>.
<br>
<h2>Layer Options</h2>
<br>
<h3>Trainable</h3>
If enabled, the layer's weights and biases (if enabled, see <a href="#Use_Bias">Use Bias</a>) are changed while training. If not, they stay the same.
<br>
<h3>Use Bias</h3>
If enabled, the layer has a bias. A Dense Layer with Use Bias enabled has this mathematical representation:
<br>
$$
\underbrace{\begin{pmatrix}
y_{0}
\end{pmatrix}}_{\mathrm{Output}}
= \mathrm{\underbrace{LeakyReLU}_{\mathrm{Activation}}}\left(\underbrace{\begin{pmatrix}
x_{0}\\
x_{1}
\end{pmatrix}}_{\mathrm{Input}}
\times \underbrace{\begin{pmatrix}
-1.124836802482605\\
0.01841479167342186
\end{pmatrix}}_{\mathrm{Kernel^{2 \times 1}}}
+ \underbrace{\begin{pmatrix}
0.123153419419419
\end{pmatrix}}_{\mathrm{Bias}}
\right)
$$
<br>
A Layer without Use Bias enabled would look like this:
<br>
$$
\underbrace{\begin{pmatrix}
y_{0}
\end{pmatrix}}_{\mathrm{Output}}
= \mathrm{\underbrace{LeakyReLU}_{\mathrm{Activation}}}\left(\underbrace{\begin{pmatrix}
x_{0}\\
x_{1}
\end{pmatrix}}_{\mathrm{Input}}
\times \underbrace{\begin{pmatrix}
0.24012170732021332\\
1.188180685043335
\end{pmatrix}}_{\mathrm{Kernel^{2 \times 1}}}
\right)
$$
<br>
The bias allows the function's output to be shifted in any axis.
<br>
<h3>Units</h3>
In a dense layer, the "units" option refers to the number of neurons or nodes in that layer. A dense layer is a type of layer in a neural network that is fully connected, meaning that each neuron in the layer is connected to every neuron in the previous and next layers.
<br>
The units option determines the number of neurons in the dense layer, and therefore also determines the number of outputs the layer will produce. For example, if you specify units=32, the dense layer will have 32 neurons, and it will produce 32 output values.
<br>
A higher number of units in a dense layer can increase the capacity of the model to learn complex patterns in the data, but it also increases the risk of overfitting. On the other hand, a lower number of units can reduce the risk of overfitting but decrease the model's ability to learn complex patterns.
<br>
When choosing the number of units for a dense layer, it's important to consider the complexity of the problem, the size of the dataset, and the capacity of the model. In general, you should start with a small number of units and increase it gradually until you find a good balance between the model's performance and the risk of overfitting.
<br>
It's also worth noting that the number of units in the output layer should match the number of classes or output variables in the problem.
<h3>Strides</h3>
The "strides" option in a neural network layer refers to the step size that the layer takes when moving across the input tensor. In other words, it defines the number of pixels or units that the layer's filter moves when scanning the input tensor.
<br>
This option is typically used in convolutional layers, which are layers that are used to extract features from images or other grid-like input data. The strides option in a convolutional layer determines how the convolution filter is moved across the input tensor, and it affects the size of the output tensor.
<br>
A stride value of 1 means that the filter is moved one pixel at a time, while a stride value of 2 means that the filter is moved two pixels at a time. A larger stride value results in a smaller output tensor and fewer computations, but it also reduces the amount of information the layer can learn from the input.
<br>
A stride of 1 is the default stride value for most convolutional layers, and it is often used when the goal is to maintain the spatial resolution of the input tensor. A stride of 2 or more is often used to reduce the size of the input tensor and reduce the number of computations.
<br>
It's worth noting that for pooling layers, strides are also used to define the step size of the pooling operation. Pooling layers are used to down-sample the input tensor, and the strides option determines how much the pooling filter moves when scanning the input tensor.
<br>
In general, choosing the right stride value depends on the problem, the size of the dataset, and the capacity of the model. It's a good practice to experiment with different stride values to find the best one for a specific task.
<h3>Regularizer</h3>
<br>
Regularization is a technique used to prevent <a href="#Overfitting">overfitting</a> in machine learning models by adding a penalty term to the loss function. The goal is to prevent the model from fitting the noise in the data, and to encourage the model to have small weights.
<br>
The choice of regularization method depends on the specific problem and the dataset. L1 regularization can be preferred in case of sparse data, where only a few features are informative. L2 regularization can be preferred when the problem is not sparse and when you want to keep all the features. L1-L2 regularization can be used when you have both sparse and non-sparse data.
<br>
In the context of machine learning, "sparse" generally refers to a dataset or input where a large proportion of the values are zero or near zero. For example, a sparse matrix is a matrix in which most of the elements are zero. A sparse dataset is a dataset where most of the features have little or no variation and thus, little or no informative value.
<br>
In the context of regularization, L1 regularization is often preferred when working with sparse data because it tends to push the weights of less important features towards zero, effectively setting them to zero. This results in feature selection, where only the most important features are used for the final model, making it simpler and more interpretable.
<br>
In contrast, L2 regularization tends to shrink the weights of all features towards zero, but it doesn't push the weights to exactly zero. This means that it doesn't perform feature selection, and all features are used in the final model.
<br>
In general, L1 regularization is more suitable for sparse data, where only a few features are informative, and L2 regularization is more suitable when the problem is not sparse.
<br>
<h4>l1</h4>
Also known as Lasso regularization, it adds a penalty term to the loss function proportional to the absolute value of the weights. The L1 regularization term is defined as \(\lambda\sum_{i=1}^{n}|w_i|\), where \(\lambda\) is the regularization strength and \(w_i\) are the weights of the model. The L1 regularization tends to push the weights towards zero and thus it can also be used to perform feature selection.
<br>
<h4>l2</h4>
Also known as Ridge regularization, it adds a penalty term to the loss function proportional to the square of the weights. The L2 regularization term is defined as \(\lambda\sum_{i=1}^{n}w_i^2\), where \(\lambda\) is the regularization strength and \(w_i\) are the weights of the model. The L2 regularization tends to shrink the weights towards zero, but unlike L1 regularization, it doesn't push the weights to exactly zero.
<br>
<h4>l1l2</h4>
The combination of L1 and L2 regularization is also known as Elastic-Net regularization. The regularization term is defined as \(\lambda(\alpha\sum_{i=1}^{n}|w_i| + (1-\alpha)\sum_{i=1}^{n}w_i^2)\), where \(\lambda\) is the regularization strength, \(w_i\) are the weights of the model, and \(\alpha\) is a parameter that controls the balance between L1 and L2 regularization. This regularization method combines the feature selection property of L1 regularization with the shrinkage property of L2 regularization.
<br>
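In TensorFlow.js, regularizers are attached to a layer via its options; the strengths below are only example values:
<pre>
const regularizedDense = tf.layers.dense({
	units: 8,
	kernelRegularizer: tf.regularizers.l1({ l1: 0.01 }),               // Lasso penalty on the kernel
	biasRegularizer: tf.regularizers.l2({ l2: 0.01 }),                 // Ridge penalty on the bias
	activityRegularizer: tf.regularizers.l1l2({ l1: 0.01, l2: 0.01 }) // Elastic-Net penalty on the outputs
});
</pre>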
<h3>Initializers</h3>
Initializers set the way values, mostly the <a href="#Bias">Bias</a> and the <a href="#Kernel">Kernel</a>, are initialized.
<br>
An initializer is a function used to initialize the weights of a neural network. It is called when the model is first created and its purpose is to set the initial values of the weights. The choice of initializer can have a significant impact on the performance and convergence of the model.
<br>
In general, He et al. and LeCun normal/uniform initializers tend to work well for deep networks, while Glorot/Xavier normal/uniform initializers tend to work well for shallow networks. It also depends on the activation function used in the layer.
<br>
Choosing the appropriate initialization method for a neural network depends on various characteristics of the data, including the scale of the input and output features, the number of input and output neurons, and the activation function used in the network. Here are some general guidelines for selecting an initialization method based on the characteristics of the data (see the sketch after this list):
<ul>
<li>If the input and output features have similar scales, a good option is the He uniform or Glorot uniform initialization method, as they are designed for datasets with similar scales of inputs and outputs.</li>
<li>If the input and output features have very different scales, a good option is the LeCun uniform initialization method, which is designed for datasets with different scales of inputs and outputs.</li>
<li>If the activation function used in the network is ReLU, the He uniform initialization method is a good option, as it is designed for this activation function.</li>
<li>If the activation function used in the network is sigmoid or tanh, the Glorot uniform initialization method is a good option, as it is designed for this activation function.</li>
<li>If the number of input neurons is much larger than the number of output neurons, it can be a good idea to use the LeCun uniform initialization method.</li>
</ul>
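A sketch of how initializers are chosen per layer in TensorFlow.js; the string names are the identifiers TensorFlow.js uses for the methods mentioned above:
<pre>
const layerForRelu = tf.layers.dense({
	units: 32,
	activation: "relu",
	kernelInitializer: "heUniform"      // suited to ReLU activations
});

const layerForTanh = tf.layers.dense({
	units: 32,
	activation: "tanh",
	kernelInitializer: "glorotUniform", // suited to sigmoid/tanh activations
	biasInitializer: "zeros"
});
</pre>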