Tag: Deep Learning
Published on: 21 Sep 2021
Recently I came across Darknet, a lightweight neural network framework elegantly written in C and CUDA. As someone interested in hardware, I often feel spoiled by PyTorch and wonder how things actually work under the hood, so I dug deeper into the implementation of Darknet and summarized my findings in this blog post.
Image Source: pjreddie/darknet
As an example use case, we run Darknet in detector mode and use the YOLOv3-tiny network definition.
Usage: ./darknet detector train DATASET_DEF.data NETWORK_DEF.cfg CHECKPOINT_FILE
The first argument 'detector' triggers the run_detector(argc, argv) function call:
.....
} else if (0 == strcmp(argv[1], "lsd")){
    run_lsd(argc, argv);
} else if (0 == strcmp(argv[1], "detector")){
    run_detector(argc, argv); // <----- We are here
} else if (0 == strcmp(argv[1], "detect")){
    float thresh = find_float_arg(argc, argv, "-thresh", .5);
    char *filename = (argc > 4) ? argv[4]: 0;
    char *outfile = find_char_arg(argc, argv, "-out", 0);
    int fullscreen = find_arg(argc, argv, "-fullscreen");
    test_detector("cfg/coco.data", argv[2], argv[3], filename, thresh, .5, outfile, fullscreen);
} else if (0 == strcmp(argv[1], "cifar")){
    run_cifar(argc, argv);
} else if (0 == strcmp(argv[1], "go")){
.....
The run_detector(argc, argv) function call additionally requires the following arguments:
argv[2]: [train/test/valid]
argv[3]: [cfg]
argv[4]: [weights (optional)]
followed by optional CLI arguments given as key-value pairs. If optional arguments are present, they are parsed with the following lines:
char *prefix = find_char_arg(argc, argv, "-prefix", 0);
float thresh = find_float_arg(argc, argv, "-thresh", .5);
float hier_thresh = find_float_arg(argc, argv, "-hier", .5);
...
// In each parser call a default value is given as the last argument
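For reference, here is a simplified sketch of how such a helper works; the real find_float_arg in utils.c additionally removes the consumed tokens from argv, but the idea is just to scan argv for the flag and parse the token that follows, falling back to the supplied default:

float find_float_arg(int argc, char **argv, char *arg, float def)
{
    int i;
    for(i = 0; i < argc-1; ++i){
        if(argv[i] && 0 == strcmp(argv[i], arg)){
            def = atof(argv[i+1]);   // take the value right after the flag
            break;
        }
    }
    return def;                      // fall back to the default value
}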
Next, we can find a call to train_detector with the parsed arguments:
...
else if(0==strcmp(argv[2], "train")) train_detector(datacfg, cfg, weights, gpus, ngpus, clear);
...
Things get interesting now. Inside train_detector we can find the two central parts of a DL framework, namely building the network from its cfg definition and the training loop itself with its forward and backward passes.
Before diving directly into the code, let’s first take a look at the model definition files, which live in the cfg directory. We use yolov3-tiny.cfg as an example.
The cfg file always begins with a [net] section that defines parameters shared by the whole network:
[net]
# Testing
batch=1
subdivisions=1
# Training
# batch=64
# subdivisions=2
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1
Next come the definitions of the individual NN layers:
[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky
[maxpool]
size=2
stride=2
...etc
Programmatically, the cfg file is first parsed into a list of lists: the first-level list has one entry per section of the network definition file, and each second-level list holds that section’s key-value pairs.
Since C does not give us a List type, and we don’t know the length of these lists prior to runtime, Darknet implements them manually as linked lists:
typedef struct node{
    void *val;          // payload: e.g. a section or a key-value pair
    struct node *next;
    struct node *prev;
} node;
typedef struct list{
    int size;           // number of nodes
    node *front;
    node *back;
} list;
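Each node of the first-level list then carries a section (the bracketed block name plus its options), and each node of a section's options list carries one key-value pair. Roughly, the element types look like this (the actual definitions live in parser.c and option_list.h):

typedef struct{
    char *type;        // section name, e.g. "[convolutional]"
    list *options;     // list of the key-value pairs inside this section
} section;

typedef struct{
    char *key;         // e.g. "filters"
    char *val;         // e.g. "16"
    int used;          // marks whether this option has been read
} kvp;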
Next, the nested lists are converted into a network struct, where each layer is parsed and stored as an element of the layer *layers member variable. Here you can see how each layer is filled:
while(n){
    params.index = count;
    fprintf(stderr, "%5d ", count);
    s = (section *)n->val;
    options = s->options;
    layer l = {0};
    LAYER_TYPE lt = string_to_layer_type(s->type);
    if(lt == CONVOLUTIONAL){
        l = parse_convolutional(options, params); // <---- Here
    }else if(lt == DECONVOLUTIONAL){
        l = parse_deconvolutional(options, params);
    }else if(lt == LOCAL){
        l = parse_local(options, params);
    }else if(lt == ACTIVE){
    ...
    net->layers[count] = l;
    ...
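The network struct that receives these layers is, stripped down, little more than an array of layer plus a few shared buffers; a trimmed sketch (see darknet.h for the full definition):

typedef struct network{
    int n;             // number of layers
    int batch;
    layer *layers;     // one entry per parsed section
    float *input;      // points to the current layer's input during the forward pass
    float *workspace;  // shared scratch buffer, e.g. for im2col
    /* ...many more members omitted... */
} network;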
Here params defines the input parameters for each network layer, e.g. height, width, channels and number of inputs (h, w, c); after each layer, params is updated according to that layer’s output. options defines the parameters of the layer itself; in the case of a convolutional_layer it contains keys such as filters, size, stride, etc.
When we look even deeper, we can find how each layer is produced, e.g. inside parse_convolutional(list *options, size_params params) and make_convolutional_layer.
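An abbreviated sketch of what parse_convolutional does (simplified; see parser.c for the full version): it reads the layer's own options with the option_find_* helpers, combines them with the incoming size_params, and hands everything to make_convolutional_layer.

convolutional_layer parse_convolutional(list *options, size_params params)
{
    // options of the [convolutional] section, each with a default value
    int n      = option_find_int(options, "filters", 1);   // output channels
    int size   = option_find_int(options, "size", 1);      // kernel size
    int stride = option_find_int(options, "stride", 1);
    int pad    = option_find_int_quiet(options, "pad", 0);
    int padding = pad ? size/2 : 0;                         // pad=1 means "same" padding
    ...
    // input geometry handed over from the previous layer via params
    int batch = params.batch, h = params.h, w = params.w, c = params.c;
    ...
    convolutional_layer l = make_convolutional_layer(batch, h, w, c, n, ...);
    return l;
}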
Here we can observe the following: all layers share the same type layer; as a result, the “base” type layer contains member variables for every possible layer, which is an ugly consequence of the language limitation. Each layer also registers its own forward and backward pass through the function pointers l.forward and l.backward.
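A trimmed excerpt of this struct (see darknet.h for the full, very long definition) shows the shared fields alongside those function pointers:

struct layer{
    LAYER_TYPE type;
    ACTIVATION activation;
    void (*forward)  (struct layer, struct network);
    void (*backward) (struct layer, struct network);
    void (*update)   (struct layer, update_args);
    int batch;
    int h, w, c;                       // input geometry
    int out_h, out_w, out_c;           // output geometry
    int n, size, stride, pad, groups;  // convolution parameters
    float *weights;
    float *output;
    float *delta;
    /* ...hundreds of further members used by the other layer types... */
};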
At runtime, the forward and backward functions take two arguments, layer and net. Again, consider the convolutional case, forward_convolutional_layer:
int m = l.n/l.groups;
// number of filters per group
// l.n: number of output channels
// l.groups: for group conv
int k = l.size*l.size*l.c/l.groups;
// number of corresponding input elements for each output convolution (per group)
// l.size: kernel size
// l.c: number of input channels
int n = l.out_w*l.out_h;
// number of output pixels
// l.out_w is calculated by the following formula (same applies to h):
// (l.w + 2*l.pad - l.size) / l.stride + 1
for(i = 0; i < l.batch; ++i){
    for(j = 0; j < l.groups; ++j){
        float *a = l.weights + j*l.nweights/l.groups;
        // l.nweights = c/groups*n*size*size;
        // address of the current group's weights
        float *b = net.workspace;
        // scratch buffer for the im2col result (the "column" matrix)
        float *c = l.output + (i*l.groups + j)*n*m;
        // here: n*m = filters * out_w * out_h / groups
        // is the number of output elements per group
        float *im = net.input + (i*l.groups + j)*l.c/l.groups*l.h*l.w;
        // address of the input corresponding to this batch element and group
        if (l.size == 1) {
            b = im;
        } else {
            im2col_cpu(im, l.c/l.groups, l.h, l.w, l.size, l.stride, l.pad, b);
        }
        gemm(0,0,m,n,k,1,a,k,b,n,1,c,n);
        // matrix multiplication, c = a*b
    }
}
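To put concrete numbers on this, take the first convolutional layer of yolov3-tiny from the cfg above (416x416x3 input, filters=16, size=3, stride=1, pad=1, groups=1): out_w = out_h = (416 + 2*1 - 3)/1 + 1 = 416, so m = 16, k = 3*3*3 = 27 and n = 416*416 = 173056. The gemm call above therefore multiplies a 16x27 weight matrix with a 27x173056 im2col matrix to produce the 16x173056 output feature map.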
There are two function calls worth noting here. First, im2col_cpu rearranges and replicates the input image im according to the convolution kernel parameters (size, stride, padding) to produce a column matrix. This matrix is then fed into gemm for the matrix multiplication that produces the convolution output.
im2col_cpu:
void im2col_cpu(float* data_im,
        int channels, int height, int width,
        int ksize, int stride, int pad, float* data_col)
{
    int c,h,w;
    // spatial size of the output (i.e. of the column buffer), given padding & stride
    int height_col = (height + 2*pad - ksize) / stride + 1;
    int width_col = (width + 2*pad - ksize) / stride + 1;
    int channels_col = channels * ksize * ksize;
    for (c = 0; c < channels_col; ++c) {
        int w_offset = c % ksize;
        int h_offset = (c / ksize) % ksize;
        int c_im = c / ksize / ksize;
        for (h = 0; h < height_col; ++h) {
            for (w = 0; w < width_col; ++w) {
                int im_row = h_offset + h * stride;
                int im_col = w_offset + w * stride;
                int col_index = (c * height_col + h) * width_col + w;
                // im2col_get_pixel subtracts pad and returns 0 for coordinates
                // that fall into the padded border
                data_col[col_index] = im2col_get_pixel(data_im, height, width, channels, im_row, im_col, c_im, pad);
            }
        }
    }
}
Demo
Assume we have a 2x2 RGB input image; after padding, it has the following memory footprint (stored in row-major order):
And a CxHxW = 3x3x3 kernel with padding=1 and stride=1 for convolution:
Now it can be calculated:
height_col = (2 + 2*1 - 3) / 1 + 1 = 2
width_col = (2 + 2*1 - 3) / 1 + 1 = 2
channels_col = 3 * 3 * 3 = 27
As a result, we get a data_col with the following memory footprint: a row-major array of 27x4 elements; note how the data is copied (and replicated) from the original image.
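To reproduce this, a minimal driver along the following lines (assuming Darknet's im2col.c is compiled and linked in; the input values 0..11 are chosen purely for readability) prints the 27x4 data_col matrix:

#include <stdio.h>

// forward declaration matching the im2col_cpu definition shown above
void im2col_cpu(float* data_im, int channels, int height, int width,
                int ksize, int stride, int pad, float* data_col);

int main(void)
{
    float im[3*2*2];                         // CHW layout, row-major
    for (int i = 0; i < 12; ++i) im[i] = i;  // fill with 0..11 so copies are visible

    int height_col   = (2 + 2*1 - 3) / 1 + 1;  // = 2
    int width_col    = (2 + 2*1 - 3) / 1 + 1;  // = 2
    int channels_col = 3 * 3 * 3;              // = 27

    float col[27*4];
    im2col_cpu(im, 3, 2, 2, 3, 1, 1, col);

    for (int r = 0; r < channels_col; ++r) {            // one row per (channel, kernel offset)
        for (int c = 0; c < height_col*width_col; ++c)  // one column per output pixel
            printf("%4.0f", col[r*height_col*width_col + c]);
        printf("\n");
    }
    return 0;
}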
Finally, the matrix multiplication gemm(0,0,m,n,k,1,a,k,b,n,1,c,n) is called, which gives us the result \(C = A \cdot B\).
Depending on the first two arguments, different gemm variants are called. Here we analyze the case used above, gemm_nn (neither A nor B transposed):
void gemm_nn(int M, int N, int K, float ALPHA,
        float *A, int lda,
        float *B, int ldb,
        float *C, int ldc)
// M: rows of A and C; here the no. of filters (output channels) per group
// N: columns of B and C; here the no. of output pixels (out_w * out_h)
// K: inner dimension; here the no. of weights per filter (size * size * c / groups)
// ALPHA: constant multiplier for matrix A; set to 1.0 here
// A: weights (M x K), row-major
// lda: no. of elements in each row of A, lda = K
// B: input image processed with the im2col function (K x N), row-major
// ldb = N
// C: output matrix (M x N), accumulated into
// ldc = N
{
    int i,j,k;
    #pragma omp parallel for
    for(i = 0; i < M; ++i){
        for(k = 0; k < K; ++k){
            register float A_PART = ALPHA*A[i*lda+k];
            for(j = 0; j < N; ++j){
                C[i*ldc+j] += A_PART*B[k*ldb+j];
            }
        }
    }
}
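As a quick sanity check (assuming the gemm_nn definition above sits in the same file), the following toy driver multiplies a 2x3 matrix with a 3x2 matrix; note that C must be zero-initialized, since gemm_nn accumulates into it:

#include <stdio.h>

int main(void)
{
    float A[2*3] = { 1,  2,  3,
                     4,  5,  6};   // 2x3, row-major
    float B[3*2] = { 7,  8,
                     9, 10,
                    11, 12};       // 3x2, row-major
    float C[2*2] = {0};            // 2x2 result, must start at zero

    // M=2, N=2, K=3, ALPHA=1, lda=K, ldb=N, ldc=N
    gemm_nn(2, 2, 3, 1, A, 3, B, 2, C, 2);

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    // prints:
    // 58 64
    // 139 154
    return 0;
}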
After this step, we are mostly done with the forward propagation call. The remaining operations are comparatively simple: an optional batch normalization (or a plain bias addition) followed by the non-linear activation.
This post gives a quick explanation of the Darknet framework by analyzing its data structures and forward propagation call. Written completely in C, Darknet can easily be adapted to embedded devices or other platforms with tight computing resources.