Update method docstrings #607

Merged
merged 8 commits on Sep 30, 2024
7 changes: 4 additions & 3 deletions docs/src/reference.md
@@ -78,10 +78,11 @@ pad_zeros

`NNlib.conv` supports complex datatypes on CPU and CUDA devices.

-!!! AMDGPU MIOpen supports only cross-correlation (flipkernel=true).
-Therefore for every regular convolution (flipkernel=false)
+!!! note "AMDGPU MIOpen supports only cross-correlation (`flipkernel=true`)."
+
+Therefore for every regular convolution (`flipkernel=false`)
kernel is flipped before calculation.
-For better performance, use cross-correlation (flipkernel=true)
+For better performance, use cross-correlation (`flipkernel=true`)
and manually flip the kernel before `NNlib.conv` call.
`Flux` handles this automatically, this is only required for direct calls.
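
A minimal sketch of the advice above, an illustration only, assuming the `flipped` keyword of `NNlib.conv` corresponds to `flipkernel` and a `WHCN` array layout:

```julia
using NNlib

x = rand(Float32, 32, 32, 3, 1)    # W×H×C×N input
w = rand(Float32, 5, 5, 3, 8)      # W×H×C_in×C_out kernel

# Regular convolution: on AMDGPU/MIOpen the kernel gets flipped internally.
y1 = conv(x, w)

# Cross-correlation path: flip the spatial dims by hand, then request flipped=true.
w_flipped = reverse(w; dims=(1, 2))
y2 = conv(x, w_flipped; flipped=true)

y1 ≈ y2    # same result; the second form avoids the extra flip on MIOpen
```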

50 changes: 25 additions & 25 deletions src/activations.jl
@@ -31,7 +31,7 @@ The ascii name `sigmoid` is also exported.

See also [`sigmoid_fast`](@ref).

-```
+```julia-repl
julia> using UnicodePlots

julia> lineplot(sigmoid, -5, 5, height=7)
@@ -63,7 +63,7 @@ const sigmoid = σ

Piecewise linear approximation of [`sigmoid`](@ref).

-```
+```julia-repl
julia> lineplot(hardsigmoid, -5, 5, height=7)
┌────────────────────────────────────────┐
1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋⠉⠉⠉⠉⠉⠉⠉⠉│ hardσ(x)
@@ -102,7 +102,7 @@ const hardsigmoid = hardσ

Return `log(σ(x))` which is computed in a numerically stable way.

-```
+```julia-repl
julia> lineplot(logsigmoid, -5, 5, height=7)
┌────────────────────────────────────────┐
0 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡧⠤⠔⠒⠒⠒⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ logσ(x)
@@ -128,7 +128,7 @@ Segment-wise linear approximation of `tanh`, much cheaper to compute.
See ["Large Scale Machine Learning"](https://ronan.collobert.com/pub/matos/2004_phdthesis_lip6.pdf).

See also [`tanh_fast`](@ref).
-```
+```julia-repl
julia> lineplot(hardtanh, -2, 2, height=7)
┌────────────────────────────────────────┐
1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⣀⠔⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉│ hardtanh(x)
@@ -164,7 +164,7 @@ hardtanh(x) = clamp(x, oftype(x, -1), oftype(x, 1)) # clamp(x, -1, 1) is type-s
[Rectified Linear Unit](https://en.wikipedia.org/wiki/Rectifier_(neural_networks))
activation function.

-```
+```julia-repl
julia> lineplot(relu, -2, 2, height=7)
┌────────────────────────────────────────┐
2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔⠋│ relu(x)
@@ -188,7 +188,7 @@ Leaky [Rectified Linear Unit](https://en.wikipedia.org/wiki/Rectifier_(neural_ne
activation function.
You can also specify the coefficient explicitly, e.g. `leakyrelu(x, 0.01)`.

-```julia
+```julia-repl
julia> lineplot(x -> leakyrelu(x, 0.5), -2, 2, height=7)
┌────────────────────────────────────────┐
2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ #42(x)
@@ -220,7 +220,7 @@ const leakyrelu_a = 0.01 # also used in gradient below
activation function capped at 6.
See ["Convolutional Deep Belief Networks"](https://www.cs.toronto.edu/~kriz/conv-cifar10-aug2010.pdf) from CIFAR-10.

-```
+```julia-repl
julia> lineplot(relu6, -10, 10, height=7)
┌────────────────────────────────────────┐
6 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠎⠉⠉⠉⠉⠉⠉⠉⠉│ relu6(x)
@@ -245,7 +245,7 @@ Randomized Leaky Rectified Linear Unit activation function.
See ["Empirical Evaluation of Rectified Activations"](https://arxiv.org/abs/1505.00853)
You can also specify the bound explicitly, e.g. `rrelu(x, 0.0, 1.0)`.

-```julia
+```julia-repl
julia> lineplot(rrelu, -20, 10, height=7)
┌────────────────────────────────────────┐
10 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ rrelu(x)
@@ -275,7 +275,7 @@ Exponential Linear Unit activation function.
See ["Fast and Accurate Deep Network Learning by Exponential Linear Units"](https://arxiv.org/abs/1511.07289).
You can also specify the coefficient explicitly, e.g. `elu(x, 1)`.

-```
+```julia-repl
julia> lineplot(elu, -2, 2, height=7)
┌────────────────────────────────────────┐
2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ elu(x)
@@ -305,7 +305,7 @@ deriv_elu(Ω, α=1) = ifelse(Ω ≥ 0, one(Ω), Ω + oftype(Ω, α))

Activation function from ["Gaussian Error Linear Units"](https://arxiv.org/abs/1606.08415).

-```
+```julia-repl
julia> lineplot(gelu, -2, 2, height=7)
┌────────────────────────────────────────┐
2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠊│ gelu(x)
@@ -363,7 +363,7 @@ end
Self-gated activation function.
See ["Swish: a Self-Gated Activation Function"](https://arxiv.org/abs/1710.05941).

-```
+```julia-repl
julia> lineplot(swish, -2, 2, height=7)
┌────────────────────────────────────────┐
2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤│ swish(x)
@@ -386,7 +386,7 @@ julia> lineplot(swish, -2, 2, height=7)
Hard-Swish activation function.
See ["Searching for MobileNetV3"](https://arxiv.org/abs/1905.02244).

-```
+```julia-repl
julia> lineplot(hardswish, -2, 5, height = 7)
┌────────────────────────────────────────┐
5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠔⠒⠉│ hardswish(x)
@@ -430,7 +430,7 @@ deriv_hardswish(x) = ifelse(x < -3, oftf(x,0), ifelse(x > 3, oftf(x,1), x/3 + of
Activation function from
["LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent ..."](https://arxiv.org/abs/1901.05894)

-```
+```julia-repl
julia> lineplot(lisht, -2, 2, height=7)
┌────────────────────────────────────────┐
2 │⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠔│ lisht(x)
@@ -469,7 +469,7 @@ lisht(x) = x * tanh_fast(x)
Scaled exponential linear units.
See ["Self-Normalizing Neural Networks"](https://arxiv.org/abs/1706.02515).

-```
+```julia-repl
julia> lineplot(selu, -3, 2, height=7)
┌────────────────────────────────────────┐
3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ selu(x)
@@ -507,7 +507,7 @@ end

Activation function from ["Continuously Differentiable Exponential Linear Units"](https://arxiv.org/abs/1704.07483).

-```
+```julia-repl
julia> lineplot(celu, -2, 2, height=7)
┌────────────────────────────────────────┐
2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉│ celu(x)
@@ -535,7 +535,7 @@ deriv_celu(Ω, α=1) = ifelse(Ω > 0, oftf(Ω, 1), Ω / oftf(Ω, α) + 1)
Threshold gated rectified linear activation function.
See ["Zero-bias autoencoders and the benefits of co-adapting features"](https://arxiv.org/abs/1402.3337)

-```
+```julia-repl
julia> lineplot(trelu, -2, 4, height=7)
┌────────────────────────────────────────┐
4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠋│ trelu(x)
@@ -559,7 +559,7 @@ const thresholdrelu = trelu

See ["Quadratic Polynomials Learn Better Image Features"](http://www.iro.umontreal.ca/~lisa/publications2/index.php/attachments/single/205) (2009).

-```
+```julia-repl
julia> lineplot(softsign, -5, 5, height=7)
┌────────────────────────────────────────┐
1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⣀⣀⠤⠤⠤⠤⠤│ softsign(x)
@@ -602,7 +602,7 @@ deriv_softsign(x) = 1 / (1 + abs(x))^2

See ["Deep Sparse Rectifier Neural Networks"](http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf), JMLR 2011.

-```
+```julia-repl
julia> lineplot(softplus, -3, 3, height=7)
┌────────────────────────────────────────┐
4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ softplus(x)
@@ -640,7 +640,7 @@ softplus(x) = log1p(exp(-abs(x))) + relu(x)

Return `log(cosh(x))` which is computed in a numerically stable way.

-```
+```julia-repl
julia> lineplot(logcosh, -5, 5, height=7)
┌────────────────────────────────────────┐
5 │⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ logcosh(x)
@@ -664,7 +664,7 @@ const log2 = log(2)

Activation function from ["Mish: A Self Regularized Non-Monotonic Neural Activation Function"](https://arxiv.org/abs/1908.08681).

-```
+```julia-repl
julia> lineplot(mish, -5, 5, height=7)
┌────────────────────────────────────────┐
5 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠖⠋│ mish(x)
@@ -686,7 +686,7 @@ mish(x) = x * tanh(softplus(x))

See ["Tanhshrink Activation Function"](https://www.gabormelli.com/RKB/Tanhshrink_Activation_Function).

-```
+```julia-repl
julia> lineplot(tanhshrink, -3, 3, height=7)
┌────────────────────────────────────────┐
3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ tanhshrink(x)
@@ -712,7 +712,7 @@ tanhshrink(x) = x - tanh_fast(x)

See ["Softshrink Activation Function"](https://www.gabormelli.com/RKB/Softshrink_Activation_Function).

-```
+```julia-repl
julia> lineplot(softshrink, -2, 2, height=7)
┌────────────────────────────────────────┐
2 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀│ softshrink(x)
@@ -770,7 +770,7 @@ For any other number types, it just calls `tanh`.

See also [`sigmoid_fast`](@ref).

-```
+```julia-repl
julia> tanh(0.5f0)
0.46211717f0

@@ -808,11 +808,11 @@ tanh_fast(x::Number) = Base.tanh(x)
sigmoid_fast(x)

This is a faster, and very slightly less accurate, version of `sigmoid`.
-For `x::Float32, perhaps 3 times faster, and maximum errors 2 eps instead of 1.
+For `x::Float32`, perhaps 3 times faster, and maximum errors 2 eps instead of 1.

See also [`tanh_fast`](@ref).

-```
+```julia-repl
julia> sigmoid(0.2f0)
0.54983395f0

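
A hedged micro-check of the accuracy claim above (an illustration under the `Float32` assumption, not a rigorous benchmark):

```julia
using NNlib

xs  = collect(range(-10f0, 10f0; length=100_001))
ref = Float32.(sigmoid.(Float64.(xs)))                             # well-rounded reference
worst_ulps = maximum(abs.(sigmoid_fast.(xs) .- ref) ./ eps.(ref))  # max error in units of eps
```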
2 changes: 1 addition & 1 deletion src/audio/mel.jl
@@ -4,7 +4,7 @@
fmin::Float32 = 0f0, fmax::Float32 = Float32(sample_rate ÷ 2))
Create triangular Mel scale filter banks
-(ref: https://en.wikipedia.org/wiki/Mel_scale).
+(ref: [Mel scale - Wikipedia](https://en.wikipedia.org/wiki/Mel_scale)).
Each column is a filterbank that highlights its own frequency.
# Arguments:
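
The full signature is truncated above, so here is a hedged, self-contained sketch of the triangular construction being described; it does not use the NNlib API, and every name in it is made up for illustration:

```julia
hz_to_mel(f) = 2595f0 * log10(1f0 + f / 700f0)
mel_to_hz(m) = 700f0 * (10f0^(m / 2595f0) - 1f0)

function triangular_filterbanks(n_freqs, n_mels, sample_rate;
                                fmin=0f0, fmax=Float32(sample_rate ÷ 2))
    mel_pts = range(hz_to_mel(fmin), hz_to_mel(fmax); length=n_mels + 2)
    hz_pts  = mel_to_hz.(mel_pts)
    freqs   = range(0f0, Float32(sample_rate ÷ 2); length=n_freqs)
    fb = zeros(Float32, n_freqs, n_mels)
    for m in 1:n_mels
        lo, center, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        for (i, f) in enumerate(freqs)
            rising  = (f - lo) / (center - lo)
            falling = (hi - f) / (hi - center)
            fb[i, m] = max(0f0, min(rising, falling))   # triangle peaking at `center`
        end
    end
    return fb   # each column highlights its own frequency band
end
```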
14 changes: 7 additions & 7 deletions src/audio/stft.jl
@@ -5,14 +5,14 @@
) where T <: Real
Hamming window function
-(ref: https://en.wikipedia.org/wiki/Window_function#Hann_and_Hamming_windows).
+(ref: [Window function § Hann and Hamming windows - Wikipedia](https://en.wikipedia.org/wiki/Window_function#Hann_and_Hamming_windows)).
Generalized version of `hann_window`.
-``w[n] = \\alpha - \\beta cos(\\frac{2 \\pi n}{N - 1})``
+``w[n] = \\alpha - \\beta \\cos(\\frac{2 \\pi n}{N - 1})``
Where ``N`` is the window length.
-```julia
+```julia-repl
julia> lineplot(hamming_window(100); width=30, height=10)
┌──────────────────────────────┐
1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠚⠉⠉⠉⠢⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
@@ -72,13 +72,13 @@ end
) where T <: Real
Hann window function
-(ref: https://en.wikipedia.org/wiki/Window_function#Hann_and_Hamming_windows).
+(ref: [Window function § Hann and Hamming windows - Wikipedia](https://en.wikipedia.org/wiki/Window_function#Hann_and_Hamming_windows)).
-``w[n] = \\frac{1}{2}[1 - cos(\\frac{2 \\pi n}{N - 1})]``
+``w[n] = \\frac{1}{2}[1 - \\cos(\\frac{2 \\pi n}{N - 1})]``
Where ``N`` is the window length.
-```julia
+```julia-repl
julia> lineplot(hann_window(100); width=30, height=10)
┌──────────────────────────────┐
1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⠚⠉⠉⠉⠢⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
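
A small sketch of the two window formulas above (Hamming with α = 0.54, β = 0.46; Hann with α = β = 0.5); the package functions may handle further options such as periodic windows and element types, which this ignores, and the function name is hypothetical:

```julia
# w[n] = α - β cos(2πn / (N - 1)),  n = 0, …, N - 1
generalized_window(n; α=0.54, β=0.46) = [α - β * cos(2π * k / (n - 1)) for k in 0:n - 1]

hamming = generalized_window(100)                  # Hamming coefficients
hann    = generalized_window(100; α=0.5, β=0.5)    # Hann coefficients
```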
@@ -138,7 +138,7 @@ Short-time Fourier transform (STFT).
The STFT computes the Fourier transform of short overlapping windows of the input,
giving frequency components of the signal as they change over time.
-``Y[\\omega, m] = \\sum_{k = 0}^{N - 1} \\text{window}[k] \\text{input}[m \\times \\text{hop length} + k] exp(-j \\frac{2 \\pi \\omega k}{\\text{n fft}})``
+``Y[\\omega, m] = \\sum_{k = 0}^{N - 1} \\text{window}[k] \\text{input}[m \\times \\text{hop length} + k] \\exp(-j \\frac{2 \\pi \\omega k}{\\text{n fft}})``
where ``N`` is the window length,
``\\omega`` is the frequency ``0 \\le \\omega < \\text{n fft}``
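
A purely illustrative sketch of the sum above: a naive loop over a single 1-D signal, with a made-up name; the actual implementation is batched and FFT-based:

```julia
function naive_stft(x::AbstractVector, window::AbstractVector; hop_length::Int)
    n_fft = length(window)
    n_frames = (length(x) - n_fft) ÷ hop_length + 1
    Y = zeros(ComplexF64, n_fft, n_frames)
    for m in 0:n_frames - 1, ω in 0:n_fft - 1, k in 0:n_fft - 1
        Y[ω + 1, m + 1] += window[k + 1] * x[m * hop_length + k + 1] *
                           exp(-im * 2π * ω * k / n_fft)
    end
    return Y   # frequencies along dim 1, frames along dim 2
end
```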
3 changes: 2 additions & 1 deletion src/ctc.jl
@@ -23,7 +23,8 @@ function logaddexp(a, b)
end

"""
-add_blanks(z)
+    add_blanks(z)
+
Adds blanks to the start and end of `z`, and between items in `z`
"""
function add_blanks(z, blank)
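
A hedged sketch of the behaviour described in the docstring (not the package implementation; the name is hypothetical):

```julia
add_blanks_sketch(z, blank) = vcat(blank, reduce(vcat, [[zi, blank] for zi in z]))

add_blanks_sketch([1, 2, 3], 0)    # => [0, 1, 0, 2, 0, 3, 0]
```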
2 changes: 1 addition & 1 deletion src/dim_helpers/DepthwiseConvDims.jl
@@ -2,7 +2,7 @@
DepthwiseConvDims
Concrete subclass of `ConvDims` for a depthwise convolution. Differs primarily due to
-characterization by C_in, C_mult, rather than C_in, C_out. Useful to be separate from
+characterization by `C_in`, `C_mult`, rather than `C_in`, `C_out`. Useful to be separate from
DenseConvDims primarily for channel calculation differences.
"""
struct DepthwiseConvDims{N, K, S, P, D} <: ConvDims{N}
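
A hedged illustration of the `C_in`/`C_mult` characterization via the public `depthwiseconv`, assuming a `(spatial..., C_mult, C_in)` kernel layout:

```julia
using NNlib

x = rand(Float32, 16, 16, 3, 1)    # C_in = 3
w = rand(Float32, 3, 3, 4, 3)      # C_mult = 4; last dim must equal C_in
y = depthwiseconv(x, w)

size(y, 3)                         # 12 == C_in * C_mult output channels
```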
2 changes: 1 addition & 1 deletion src/dim_helpers/PoolDims.jl
@@ -1,6 +1,6 @@
"""
PoolDims(x_size::NTuple{M}, k::Union{NTuple{L, Int}, Int};
-stride=k, padding=0, dilation=1) where {M, L}
+             stride=k, padding=0, dilation=1) where {M, L}
Dimensions for a "pooling" operation that can have an arbitrary input size, kernel size,
stride, dilation, and channel count. Used to dispatch onto efficient implementations at
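
A hedged usage sketch of the constructor signature shown above, assuming `maxpool` accepts a precomputed `PoolDims`:

```julia
using NNlib

x = rand(Float32, 28, 28, 3, 16)
pdims = PoolDims(size(x), (2, 2); stride=(2, 2), padding=0, dilation=1)
y = maxpool(x, pdims)    # dispatches on the precomputed dimensions
size(y)                  # (14, 14, 3, 16)
```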
2 changes: 1 addition & 1 deletion src/dropout.jl
@@ -12,7 +12,7 @@ i.e. each row of a matrix is either zero or not.
Optional first argument is the random number generator used.
# Examples
-```
+```julia-repl
julia> dropout(ones(2, 10), 0.2)
2×10 Matrix{Float64}:
1.25 1.25 0.0 1.25 1.25 1.25 1.25 1.25 1.25 1.25
4 changes: 2 additions & 2 deletions src/pooling.jl
@@ -162,7 +162,7 @@ Perform mean pool operation with window size `k` on input tensor `x`.
Arguments:
-* `x` and `k`: Expects `ndim(x) ∈ 3:5``, and always `length(k) == ndim(x) - 2`
+* `x` and `k`: Expects `ndim(x) ∈ 3:5`, and always `length(k) == ndim(x) - 2`
* `pad`: See [`pad_zeros`](@ref) for details.
* `stride`: Either a tuple with the same length as `k`, or one integer for all directions. Default is `k`.
"""
@@ -182,7 +182,7 @@ This pooling operator from [Learned-Norm Pooling for Deep Feedforward and Recurr
Arguments:
-* `x` and `k`: Expects `ndim(x) ∈ 3:5``, and always `length(k) == ndim(x) - 2`
+* `x` and `k`: Expects `ndim(x) ∈ 3:5`, and always `length(k) == ndim(x) - 2`
* `p` is restricted to `0 < p < Inf`.
* `pad`: See [`pad_zeros`](@ref) for details.
* `stride`: Either a tuple with the same length as `k`, or one integer for all directions. Default is `k`.
2 changes: 1 addition & 1 deletion src/softmax.jl
@@ -39,7 +39,7 @@ Note that, when used with Flux.jl, `softmax` must not be passed to layers like `D
which accept an activation function. The activation is broadcasted over the result,
thus applies to individual numbers. But `softmax` always needs to see the whole column.
-```julia
+```julia-repl
julia> using Flux
julia> x = randn(Float32, 4, 4, 3, 13);
2 changes: 1 addition & 1 deletion src/utils.jl
@@ -10,7 +10,7 @@ pass it an array whose gradient is of interest.
There is also an overload for ForwardDiff.jl's `Dual` types (and arrays of them).
# Examples
-```
+```julia-repl
julia> using ForwardDiff, Zygote, NNlib
julia> f_good(x) = if NNlib.within_gradient(x)