– All right welcome to lecture nine. So today we will be talking

about CNN Architectures. And just a few administrative points before we get started,

assignment two is due Thursday. The midterm will be in

class on Tuesday May ninth, so next week and it will

cover material through this coming Thursday, May fourth. So everything up to

recurrent neural networks are going to be fair game. The poster session

we’ve decided on a time, it’s going to be Tuesday June sixth from twelve to three p.m. So this is the last week of classes. So we have our poster

session a little bit early during the last week so that after that, once you guys get feedback

you still have some time to work for your final report

which will be due finals week. Okay, so just a quick review of last time. Last time we talked

about different kinds of deep learning frameworks. We talked about you know

PyTorch, TensorFlow, Caffe2 and we saw that using

these kinds of frameworks we were able to easily build

big computational graphs, for example very large neural

networks and ConvNets, and be able to really

easily compute gradients in these graphs. So to compute all of the

gradients for all the intermediate variables, weights, and inputs, and

use that to train our models and to run all this efficiently on GPUs. And we saw that for a

lot of these frameworks the way this works is by

working with these modularized layers that you guys have

been writing in your homeworks as well,

where we have a forward pass, we have a backward pass, and then in our final model architecture, all we need to do then is to just define this sequence of layers together. So using that we’re able

to very easily be able to build up very complex

network architectures. So today we’re going to talk

about some specific kinds of CNN Architectures that are

used today in cutting edge applications and research. And so we’ll go into depth

in some of the most commonly used architectures for

these that are winners of ImageNet classification benchmarks. So in chronological

order: AlexNet, VGGNet, GoogLeNet, and ResNet. And so these we’ll go into in a lot of depth. And then I’ll also, after

that, briefly go through some other architectures that are not as prominently used these

days, but are interesting either from a historical perspective, or as recent areas of research. Okay, so just a quick review. We talked a long time ago about LeNet, which was one of the first

instantiations of a ConvNet that was successfully used in practice. And so this was the ConvNet

that took an input image, used conv filters, five by five filters applied at stride one, and

had a couple of conv layers, a few pooling layers and then

some fully connected layers at the end. And this fairly simple ConvNet

was very successfully applied to digit recognition. So AlexNet from 2012 which

you guys have also heard already before in previous classes, was the first large scale

convolutional neural network that was able to do well on

the ImageNet classification task so in 2012 AlexNet was

entered in the competition, and was able to outperform

all previous non deep learning based models

by a significant margin, and so this was the ConvNet

that started the spree of ConvNet research and usage afterwards. And so the basic

AlexNet architecture is a conv layer followed by a pooling layer, normalization, then conv, pool, norm, and then a few more conv

layers, a pooling layer, and then several fully

connected layers afterwards. So this actually looks very

similar to the LeNet network that we just saw. There’s just more layers in total. There are five of these conv layers, and two fully connected layers before the final fully connected

layer going to the output classes. So let’s first get a sense

of the sizes involved in the AlexNet. So if we look at the input to the AlexNet this was trained on ImageNet, with inputs at a size 227 by 227 by 3 images. And if we look at this first

layer which is a conv layer for the AlexNet, it’s 11 by 11 filters, 96 of these applied at stride 4. So let’s just think

about this for a moment. What’s the output volume

size of this first layer? And there’s a hint. So remember we have our input size, we have our convolutional filters, right. And we have this formula,

which is the hint over here that gives you the size

of the output dimensions after applying conv, right? So remember it was the full

image, minus the filter size, divided by the stride, plus one. So given that that’s

written up here for you, does anyone have a guess at

what’s the final output size after this conv layer? [student speaks off mic] – So I heard 55 by 55 by 96, yep. That’s correct. Right so our spatial

dimensions at the output are going to be 55 in each

dimension and then we have 96 total filters so the

depth after our conv layer is going to be 96. So that’s the output volume. And what’s the total number

of parameters in this layer? So remember we have 96 11 by 11 filters. [student speaks off mic] – [Lecturer] 96 by 11 by 11, almost. So yes, so I heard another by three, yes that’s correct. So each of the filters is going to see through a local region

of 11 by 11 by three, right because the input depth was three. And so, that’s each filter

size, times we have 96 of these total. And so there’s 35K parameters

in this first layer. Okay, so now if we look
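As a quick check of the arithmetic above, here is a minimal Python sketch of the output-size formula and the weight count for this first layer. The function names `conv_output_size` and `conv_weight_count` are made up for illustration; the sketch assumes no padding and counts weights only, without biases, which is how the 35K figure comes out.

```python
def conv_output_size(input_size, filter_size, stride):
    """Spatial output size: (input - filter) / stride + 1, assuming no padding."""
    span = input_size - filter_size
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly at this stride")
    return span // stride + 1


def conv_weight_count(filter_size, in_depth, num_filters):
    """Weights only (no biases): each filter spans the full input depth."""
    return filter_size * filter_size * in_depth * num_filters


# AlexNet conv1: 227x227x3 input, 96 filters of 11x11 applied at stride 4.
side = conv_output_size(227, 11, 4)
print(side, side, 96)                # output volume: 55 55 96
print(conv_weight_count(11, 3, 96))  # 34848, roughly 35K
```

Trying the same call with an input size of 224 raises the error, which is one way to see why 227 is the size for which the numbers work out.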

at the second layer this is a pooling layer

right, and in this case we have three by three

filters applied at stride two. So what’s the output volume

of this layer after pooling? And again we have a hint, very

similar to the last question. Okay, 27 by 27 by 96. Yes that’s correct. Right so the pooling layer

is basically going to use this formula that we had here. Again because these are pooling

applied at a stride of two so we’re going to use the

same formula to determine the spatial dimensions and

so the spatial dimensions are going to be 27 by

27, and pooling preserves the depth. So we had 96 as depth as input, and it’s still going to be 96 depth at output. And next question. What’s the number of

parameters in this layer? I hear some muttering. [student answers off mic] – Nothing. Okay. Yes, so pooling layer

has no parameters, so, kind of a trick question. Okay, so we can basically, yes, question? [student speaks off mic] – The question is, why are

there no parameters in the pooling layer? The parameters are the weights right, that we’re trying to learn. And so convolutional layers

have weights that we learn but pooling all we do is have a rule, we look at the pooling region, and we take the max. So there’s no parameters that are learned. So we can keep on doing
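To make the point about pooling concrete, here is a small pure-Python sketch. The helper names `pool_output_size` and `max_pool_2d` are hypothetical, and the 5 by 5 grid is toy data chosen only so the result is easy to check by hand. The pooling function takes no weights at all, only a window size and a stride.

```python
def pool_output_size(input_size, window, stride):
    """Same formula as for conv: (input - window) / stride + 1."""
    return (input_size - window) // stride + 1


def max_pool_2d(grid, window, stride):
    """Max over each window of a square grid; a fixed rule, so nothing is learned."""
    out_size = pool_output_size(len(grid), window, stride)
    return [
        [
            max(
                grid[r * stride + i][c * stride + j]
                for i in range(window)
                for j in range(window)
            )
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


# AlexNet pool1: 55x55x96 input, 3x3 window at stride 2; depth is unchanged.
side = pool_output_size(55, 3, 2)
print(side, side, 96)  # 27 27 96

# Toy 5x5 feature map with values 0..24, pooled the same way.
toy = [[5 * r + c for c in range(5)] for r in range(5)]
print(max_pool_2d(toy, 3, 2))  # [[12, 14], [22, 24]]
```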

this and you can just repeat the process and it’s kind of

a good exercise to go through this and figure out the

sizes, the parameters, at every layer. And so if you do this all the way, you can look at this is

the final architecture that you can work with. There’s 11 by 11 filters at the beginning, then five by five and some

three by three filters. And so these are generally

pretty familiar looking sizes that you’ve seen before

and then at the end we have a couple of fully connected layers of size 4096 and finally the last layer, is FC8 going to the soft max, which is going to the

1000 ImageNet classes. And just a couple of details about this, it was the first use of

the ReLU non-linearity that we’ve talked about

that’s the most commonly used non-linearity. They used local response

normalization layers basically trying to

normalize the response across neighboring channels but this

is something that’s not really used anymore. It turned out not to, other people showed that it didn’t have so much of an effect. There’s a lot of heavy data augmentation, and so you can look in the

paper for more details, but things like flipping,

jittering, cropping, color normalization all of these things which you’ll probably

find useful for you when you’re working on your

projects for example, so a lot of data augmentation here. They also used dropout, a batch size of 128, and learned with SGD with

momentum which we talked about in an earlier lecture,

and basically just started with a base learning

rate of 1e negative 2. Every time it plateaus,

reduce by a factor of 10 and then just keep going. Until they finish training and a little bit of weight

decay and in the end, in order to get the best

numbers they also did an ensembling of models and

so training multiple of these, averaging them together and

this also gives an improvement in performance. And so one other thing I want to point out is that if you look at this

AlexNet diagram up here, it looks kind of like the

normal ConvNet diagrams that we’ve been seeing,

except for one difference, which is that it’s, you

can see it’s kind of split in these two different rows

or columns going across. And so the reason for this

is mostly historical note, so AlexNet was trained

on GTX 580 GPUs, older GPUs that only had three gigs of memory. So it couldn’t actually fit

this entire network on here, and so what they ended up doing, was they spread the

network across two GPUs. So on each GPU you would

have half of the neurons, or half of the feature maps. And so for example if you

look at this first conv layer, we have 55 by 55 by 96 output, but if you look at this diagram carefully, you can zoom in later in the actual paper, you can see that, it’s actually only 48 depth-wise, on each GPU, and so they just spread

it, the feature maps, directly in half. And so what happens is that

for most of these layers, for example conv one, two, four and five, the connections are only with feature maps on the same GPU, so you

would take as input, half of the feature maps

that were on the same GPU as before and you don’t

look at the full 96 feature maps for example. You just take as input the

48 in that first layer. And then there’s a few

layers, so conv three, as well as FC six, seven and eight, where the GPUs

do talk to each other and so there’s connections

with all feature maps in the preceding layer. So there’s communication across the GPUs, and each of these neurons

are then connected to the full depth of the

previous input layer. Question. – [Student] It says the

full simplified AlexNet architecture. [mumbles] – Oh okay, so the question

is why does it say full simplified AlexNet architecture here? It just says that because I

didn’t put all the details on here, so for example this

is the full set of layers in the architecture, and

the strides and so on, but for example the normalization

layer, there’s other, these details are not written on here. And then just one little note, if you look at the paper

and try and write out the math and architectures and so on, there’s a little bit of

an issue on the very first layer they’ll say if

you’ll look in the figure they’ll say 224 by 224 , but there’s actually some

kind of funny pattern going on and so the

numbers actually work out if you look at it as 227. AlexNet was the winner of

the ImageNet classification benchmark in 2012, you can see that it cut the error rate

by quite a large margin. It was the first CNN-based

winner, and it was widely used as a base architecture almost ubiquitously from then

until a couple years ago. It’s still used quite a bit. It’s used in transfer learning

for lots of different tasks and so it was used for

basically a long time, and it was very famous and

now though there’s been some more recent architectures

that have generally just had better performance

and so we’ll talk about these next and these are going to be

the more common architectures that you’ll be wanting to use in practice. So just quickly first in

2013 the ImageNet challenge was won by something called a ZFNet. Yes, question. [student speaks off mic] – So the question is intuition why AlexNet was so much better than

the ones that came before. Deep learning ConvNets [mumbles], this is just a very different kind of

approach in architecture. So this was the first deep

learning based approach, the first ConvNet that was used. So in 2013 the challenge

was won by something called a ZFNet [Zeiler Fergus Net]

named after the creators. And so this mostly was

improving hyper parameters over the AlexNet. It had the same number of layers, the same general structure

and they made a few changes things like

changing the stride size, different numbers of filters

and after playing around with these hyper parameters more, they were able to improve the error rate. But it’s still basically the same idea. So in 2014 there are a

couple of architectures that were now more significantly different and made another jump in performance, and the main difference with

these networks first of all was much deeper networks. So from the eight layer

network that was in 2012 and 2013, now in 2014 we

had two very close winners that were around 19 layers and 22 layers. So significantly deeper. And the winner of this

was GoogLeNet, from Google but very close behind was

something called VGGNet from Oxford, and on actually

the localization challenge VGG got first place in

some of the other tracks. So these were both very,

very strong networks. So let’s first look at VGG

in a little bit more detail. And so the VGG network is the

idea of much deeper networks and with much smaller filters. So they increased the number of layers from eight layers in AlexNet

right to now they had models with 16 to 19 layers in VGGNet. And one key thing that they

did was they kept very small filters, so only three by

three conv all the way, which is basically the

smallest conv filter size that is looking at a little

bit of the neighboring pixels. And they just kept this

very simple structure of three by three convs

with the periodic pooling all the way through the network. And it’s very simple elegant

network architecture, was able to get 7.3% top five error on the ImageNet challenge. So first the question of

why use smaller filters. So when we take these

small filters now we have fewer parameters and we

try and stack more of them instead of having larger filters, have smaller filters

with more depth instead, have more of these filters instead, what happens is that you end

up having the same effective receptive field as if you

only have one seven by seven convolutional layer. So here’s a question, what is

the effective receptive field of three of these three

by three conv layers with stride one? So if you were to stack three

three by three conv layers with Stride one what’s the

effective receptive field, the total area of the input,

spatial area of the input that a neuron at the top

layer of the three layers is looking at. So I heard fifteen pixels,

why fifteen pixels? [student speaks off mic] – Okay, so the reason given

was because they overlap. So it’s on the right track. What actually is happening

though is you have to see, at the first layer, the

receptive field is going to be three by three right? And then at the second layer, each of these neurons in the second layer is going to look at three

by three other first layer filters, but the corners

of these three by three have an additional pixel on each side, that is looking at in

the original input layer. So the second layer is actually

looking at five by five receptive field and then

if you do this again, the third layer is

looking at three by three in the second layer but this is going to, if you just draw out this

pyramid is looking at seven by seven in the input layer. So the effective receptive field here is going to be seven by seven. Which is the same as one
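The pyramid argument can be written as a short loop. This is only a sketch of the reasoning above, assuming stride one at every layer; the name `effective_receptive_field` is made up for illustration.

```python
def effective_receptive_field(num_layers, filter_size=3):
    """Side length, in input pixels, seen by one neuron after stacked stride-1 convs."""
    field = 1  # start from a single position
    for _ in range(num_layers):
        # Each extra layer widens the view by (filter_size - 1) pixels in total.
        field += filter_size - 1
    return field


for n in (1, 2, 3):
    print(n, effective_receptive_field(n))  # 1 3, then 2 5, then 3 7
```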

seven by seven conv layer. So what happens is that

this has the same effective receptive field as a

seven by seven conv layer but it’s deeper. It’s able to have more

non-linearities in there, and it’s also fewer parameters. So if you look at the

total number of parameters, each of these conv filters

for the three by threes is going to have nine parameters

in each conv [mumbles] three times three, and

then times the input depth, so three times three times

C, times this total number of output feature maps, which is again C, since we’re going to preserve the total number of channels. So you get three times three, times C times C for each of these layers, and we have three layers

so it’s going to be three times this number, compared to if you had a

single seven by seven layer then you get, by the same reasoning, seven squared times C squared. So you’re going to have

fewer parameters total, which is nice. So now if we look at
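For a concrete feel of the two counts, here is a minimal sketch comparing three stacked three by three layers against one seven by seven layer. The channel count of 256 is a hypothetical value picked only for illustration, biases are ignored, and the function names are not from any library.

```python
def stacked_3x3_weights(channels, num_layers=3):
    """num_layers conv layers, each with 3*3*C weights per filter and C filters."""
    return num_layers * (3 * 3 * channels * channels)


def single_7x7_weights(channels):
    """One conv layer with 7*7*C weights per filter and C filters."""
    return 7 * 7 * channels * channels


C = 256  # hypothetical channel count
print(stacked_3x3_weights(C))  # 27 * C^2 = 1769472
print(single_7x7_weights(C))   # 49 * C^2 = 3211264
```

Whatever C is, the ratio stays 27 to 49, so the stacked version uses a bit more than half the weights.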

this full network here there’s a lot of numbers up

here that you can go back and look at more carefully

but if we look at all of the sizes and number

of parameters the same way that we calculated the

example for AlexNet, this is a good exercise to go through, we can see that you

know going the same way we have a couple of these conv

layers and a pooling layer a couple more conv layers,

pooling layer, several more conv layers and so on. And so this just keeps going up. And if you counted the total

number of convolutional and fully connected layers,

we’re going to have 16 in this case for VGG 16, and then VGG 19, it’s just a very similar architecture, but with a few

more conv layers in there. And so the total memory

usage of this network, so just making a forward

pass through counting up all of these numbers so

in the memory numbers here written in terms of the total numbers, like we calculated earlier, and if you look at four bytes per number, this is going to be

about 100 megs per image, and so this is the scale

of the memory usage that’s happening and this is

only for a forward pass right, when you do a backward pass

you’re going to have to store more and so this is

pretty heavy memory wise. 100 megs per image, if

you have only five gigs of total memory, then

you’re only going to be able to store about 50 of these. And so also the total number

of parameters here we have is 138 million parameters in this network, and this compares with

60 million for AlexNet. Question? [student speaks off mic] – So the question is what
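The 138 million figure can be reproduced with a short count. The sketch below assumes the standard VGG-16 layout (a 224 by 224 by 3 input, three by three convs with the listed output depths, five poolings that each halve the spatial size, then fully connected layers of 4096, 4096 and 1000); that layer list is an assumption written out here, not something read off the transcript, and biases are left out.

```python
# Assumed standard VGG-16 layout: numbers are conv output depths, "pool" halves the size.
CONV_PLAN = [64, 64, "pool", 128, 128, "pool", 256, 256, 256, "pool",
             512, 512, 512, "pool", 512, 512, 512, "pool"]
FC_SIZES = [4096, 4096, 1000]


def vgg16_weight_counts(plan=CONV_PLAN, fc_sizes=FC_SIZES, side=224, depth=3):
    """Return (conv_weights, fc_weights), counting weights only (no biases)."""
    conv_weights = 0
    for item in plan:
        if item == "pool":
            side //= 2
        else:
            conv_weights += 3 * 3 * depth * item
            depth = item
    fc_weights = 0
    fan_in = side * side * depth  # flattened volume entering the first FC layer
    for size in fc_sizes:
        fc_weights += fan_in * size
        fan_in = size
    return conv_weights, fc_weights


conv_w, fc_w = vgg16_weight_counts()
print(conv_w + fc_w)                     # 138344128, about 138 million
print(round(fc_w / (conv_w + fc_w), 2))  # 0.89
```

Under these assumptions roughly nine tenths of the weights sit in the three fully connected layers, even though thirteen of the sixteen weight layers are convolutional.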

do we mean by deeper, is it the number of

filters, number of layers? So deeper in this case is

always referring to layers. So there are two usages of the word depth which is confusing one is

the depth in terms of channels, width by height by depth, you can use the word depth here, but in general when we talk about

the depth of a network, this is going to be the

total number of layers in the network, and usually in particular we’re counting the total

number of weight layers. So the total number of

layers with trainable weights, so convolutional layers

and fully connected layers. [student mumbles off mic] – Okay, so the question

is, within each layer what do the different filters mean? And so we talked about this

back in the comNet lecture, so you can also go back and refer to that, but each filter is a set of

let’s say three by three convs, so each filter is a set of weights looking at

a three by three by input depth volume, and this

produces one feature map, one activation map of

all the responses of the different spatial locations. And then we have we can have

as many filters as we want right so for example 96 and each of these is going to produce a feature map. And so it’s just like

each filter corresponds to a different pattern

that we’re looking for in the input that we

convolve around and we see the responses everywhere in the input, we create a map of these

and then another filter we’ll convolve over the

image and create another map. Question. [student speaks off mic] – So question is, is

there intuition behind, as you go deeper into the network

we have more channel depth so more number of filters

right and so you can have any design that you want so

you don’t have to do this. In practice you will see this

happen a lot of the times and one of the reasons is

people try and maintain kind of a relatively

constant level of compute, so as you go higher up or

deeper into your network, you’re usually also using

basically down sampling and having smaller total

spatial area and then so then they also increase now you

increase by depth a little bit, it’s not as expensive

now to increase by depth because it’s spatially smaller and so, yeah that’s just a reason. Question. [student speaks off mic] – So performance-wise is

there any reason to use SVM [mumbles] instead

of softmax [mumbles], so no, for a classifier

you can use either one, and you did that earlier

in the class as well, but in general softmax losses have generally worked

well and been standard use for classification here. Okay yeah one more question. [student mumbles off mic] – Yes, so the question

is, we don’t have to store all of the memory like we

can throw away the parts that we don’t need and so on? And yes this is true. Some of this you don’t need to keep, but you’re also going to

be doing a backwards pass through, where for the most part, when you were doing the chain rule and so on you needed

a lot of these activations as part of it and so in

large part a lot of this does need to be kept. So if we look at the distribution

of where memory is used and where parameters are,

you can see that a lot of the memory is in these early

layers right where you still have spatial dimensions you’re

going to have more memory usage and then a lot of the

parameters are actually in the last layers, the

fully connected layers have a huge number of parameters right, because we have all of

these dense connections. And so that’s something

just to know and then keep in mind so later on we’ll

see some networks actually get rid of these fully

connected layers and be able to save a lot on the number of parameters. And then just one last thing to point out, you’ll also see different ways of calling all of these layers right. So here I’ve written out

exactly what the layers are. conv3-64 means three by three convs with 64 total filters. But for VGGNet on this

diagram on the right here there’s also common ways

that people will look at each group of filters, so each orange block here, as in conv1 part one, so conv1-1, conv1-2, and so on. So just something to keep in mind. So VGGNet ended up getting

second place in the ImageNet 2014 classification challenge, first in localization. They followed a very

similar training procedure as Alex Krizhevsky for the AlexNet. They didn’t use local

response normalization, so as I mentioned earlier, they found out this

didn’t really help them, and so they took it out. You’ll see VGG 16 and VGG

19 are common variants of the cycle here, and this is just the number of layers, 19

is slightly deeper than 16. In practice VGG 19 works

a very little bit better, and there’s a little

bit more memory usage, so you can use either but

16 is very commonly used. For best results, like

AlexNet, they did ensembling in order to average several models, and you get better results. And they also showed in their work that the FC7 features, from the last

fully connected layer before going to the 1000 ImageNet classes, the 4096 size layer just before that, are a good feature representation that can even just be used as is, to extract these features from other data, and generalize to these other tasks as well. And so FC7 is a good

feature representation. Yeah question. [student speaks off mic] – Sorry what was the question? Okay, so the question is

what is localization here? And so this is a task,

and we’ll talk about it a little bit more in a later lecture on detection and localization

so I don’t want to go into detail here but

it’s basically, given an image, not just classifying what’s

the class of the image, but also drawing a bounding

box around where that object is in the image. And the difference with detection, which is a very related

task is that detection there can be multiple instances

of this object in the image; in localization we’re

assuming there’s just one, so it’s classification but we just have this additional bounding box. So we looked at VGG which

was one of the deep networks from 2014 and then now

we’ll talk about GoogLeNet which was the other one that won the classification challenge. So GoogLeNet again was

a much deeper network with 22 layers but one

of the main insights and special things about

GoogLeNet is that it really looked at this problem of

computational efficiency and it tried to design a

network architecture that was very efficient in the amount of compute. And so they did this using

this inception module which we’ll go into more

detail and basically stacking a lot of these inception

modules on top of each other. There’s also no fully connected

layers in this network, so they got rid of that

were able to save a lot of parameters and so in total

there’s only five million parameters which is twelve

times less than AlexNet, which had 60 million even

though it’s much deeper now. It got 6.7% top five error. So what’s the inception module? So the idea behind the inception module is that they wanted to design

a good local network topology and it has this idea

of this local topology that’s you know you can

think of it as a network within a network and

then stack a lot of these local typologies one on top of each other. And so in this local

network that they’re calling an inception module what they’re

doing is they’re basically applying several different

kinds of filter operations in parallel on top of the

same input coming into this same layer. So we have our input coming

in from the previous layer and then we’re going to do

different kinds of convolutions. So a one by one conv, right

a three by three conv, five by five conv, and then they also have a pooling operation

in this case three by three pooling, and so you get

all of these different outputs from these different layers, and then what they do is

they concatenate all these filter outputs together depth wise, and so then this creates one

tensor output at the end that is going to pass

on to the next layer. So if we look at just a

naive way of doing this we just do exactly that we

have all of these different operations we get the outputs

we concatenate them together. So what’s the problem with this? And it turns out that

computational complexity is going to be a problem here. So if we look more

carefully at an example, so here just as an example

I’ve put one by one conv, 128 filters, three by

three conv, 192 filters, five by five conv, 96 filters. Assume everything has basically the stride that’s going to maintain

the spatial dimensions, and that we have this input coming in. So what is the output size

of the one by one filter with 128 , one by one

conv with 128 filters? Who has a guess? OK so I heard 28 by 28,

by 128 which is correct. So right, with a one by one conv

we’re going to maintain spatial dimensions and

then on top of that, each conv filter is going to look through the entire 256 depth of the input, but then the output is going to be, we have a 28 by 28 feature map for each of the 128 filters that we have in this conv layer. So we get 28 by 28 by 128. OK and then now if we do the same thing and we look at the filter

sizes of the output sizes sorry of all of the different

filters here, after the three by three conv we’re

going to have this volume of 28 by 28 by 192 right

after five by five conv we have 96 filters here. So 28 by 28 by 96, and then out pooling layer is just going to keep the same spatial

dimension here, so pooling layer will preserve it in depth, and here because of our stride, we’re also going to preserve

our spatial dimensions. And so now if we look at

the output size after filter concatenation what we’re

going to get is 28 by 28, these are all 28 by 28, and

we’re concatenating depth wise. So we get 28 by 28 times

all of these added together, and the total output size is going to be 28 by 28 by 672. So the input to our

inception module was 28 by 28 by 256, then the output

from this module is 28 by 28 by 672. So we kept the same spatial dimensions, and we blew up the depth. Question. [student speaks off mic] OK So in this case, yeah, the question is, how are we getting 28
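The depth bookkeeping for the naive module is just a sum. A minimal sketch, assuming every branch is padded so that it keeps the 28 by 28 spatial size; `naive_inception_output` is a hypothetical helper name.

```python
def naive_inception_output(height, width, in_depth, branch_filters):
    """Output shape when the conv branches and a depth-preserving pool are concatenated."""
    out_depth = sum(branch_filters) + in_depth  # the pooling branch keeps the input depth
    return (height, width, out_depth)


# 1x1 conv with 128 filters, 3x3 conv with 192, 5x5 conv with 96, plus 3x3 pooling.
print(naive_inception_output(28, 28, 256, [128, 192, 96]))  # (28, 28, 672)
```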

by 28 for everything? So here we’re doing all the zero padding in order to maintain

the spatial dimensions, and that way we can do this filter concatenation depth-wise. Question in the back. [student speaks off mic] – OK The question is what’s

the 256 deep at the input, and so this is not the

input to the network, this is the input just

to this local module that I’m looking at. So in this case 256 is

the depth of the previous inception module that

came just before this. And so now coming out

we have 28 by 28 by 672, and that’s going to be

the input to the next inception module. Question. [student speaks off mic] – Okay the question is, how

did we get 28 by 28 by 128 for the first one, the first conv, and this is basically it’s a

one by one convolution right, so we’re going to take

this one by one convolution slide it across our 28 by

28 by 256 input spatially where it’s at each location,

it’s going to multiply, it’s going to do a [mumbles] through the entire 256

depth, and so we do this one by one conv slide it over spatially and we get a feature map

out that’s 28 by 28 by one. There’s one number at each

spatial location coming out, and each filter produces

one of these 28 by 28 by one maps, and we have

here a total 128 filters, and that’s going to

produce 28 by 28, by 128. OK so if you look at
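Here is a tiny pure-Python sketch of what a one by one convolution does. The 2 by 2 by 3 volume and the two filters are toy values standing in for the 28 by 28 by 256 input and the 128 filters, and `conv_1x1` is a made-up name. Each filter is a weight vector as long as the input depth, and it produces one number per spatial location.

```python
def conv_1x1(volume, filters):
    """volume[y][x] is a depth vector; each filter is a weight vector of that depth."""
    return [
        [
            [sum(w * v for w, v in zip(f, pixel)) for f in filters]
            for pixel in row
        ]
        for row in volume
    ]


# Toy stand-in for the real input: 2x2 spatial size, depth 3.
volume = [[[1, 2, 3], [4, 5, 6]],
          [[7, 8, 9], [0, 1, 0]]]
filters = [[1, 0, 0],   # picks out the first channel
           [1, 1, 1]]   # sums all channels

out = conv_1x1(volume, filters)
print(out)  # [[[1, 6], [4, 15]], [[7, 24], [0, 1]]]
```

The spatial size stays 2 by 2, and the depth goes from 3 to 2, one output channel per filter.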

the number of operations that are happening in

the convolutional layer, let’s look at the first one for

example this one by one conv, as I was just saying, at each

location we’re doing a one by one by 256 dot product. So there’s 256 multiply

operations happening here and then for each filter

map we have 28 by 28 spatial locations, so

that’s the first 28 times 28 first two numbers that

are multiplied here. These are the spatial

locations for each filter map, and so we have to do these

256 multiplications for each one of these, and then

we have 128 total filters at this layer, or we’re

producing 128 total feature maps. And so the total number

of these operations here is going to be 28 times 28 times 128 times 256. And so this is going to be the same for, you can think about this

for the three by three conv, and the five by five conv,

that’s exactly the same principle. And in total we’re going to

get 854 million operations that are happening here. – [Student] And the 128,
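The 854 million total can be checked branch by branch. This sketch counts one multiply per weight per output location, as in the lecture, for the three conv branches of the naive module; `conv_multiplies` is a hypothetical helper name.

```python
def conv_multiplies(height, width, num_filters, filter_size, in_depth):
    """Multiplies for one conv layer with a height x width x num_filters output."""
    return height * width * num_filters * filter_size * filter_size * in_depth


# (num_filters, filter_size) for the 1x1, 3x3 and 5x5 branches, all on a 28x28x256 input.
branches = [(128, 1), (192, 3), (96, 5)]
per_branch = [conv_multiplies(28, 28, n, k, 256) for n, k in branches]

print(per_branch)       # [25690112, 346816512, 481689600]
print(sum(per_branch))  # 854196224, about 854 million
```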

192, and 96 are just values [mumbles] – The question is whether the 128, 192 and

96 are values that I picked. Yes, these are not values

that I just came up with. They are similar to the

ones that you will see in like a particular

layer of inception net, so in GoogLeNet basically,

each module has a different set of these kinds of

parameters, and I picked one that was similar to one of these. And so this is very expensive

computationally right, these operations. And then the other thing

that I also want to note is that the pooling layer also

adds to this problem because it preserves the whole feature depth. So at every layer your total

depth can only grow right, you’re going to take

the full featured depth from your pooling layer, as

well as all the additional feature maps from the conv

layers and add these up together. So here our input was 256

depth and our output is 672 depth and you’re just

going to keep increasing this as you go up. So how do we deal with this

and how do we keep this more manageable? And so one of the key

insights that GoogLeNet used was that, well, we

can address this by using bottleneck layers and try and

project these feature maps to a lower dimension before our

convolutional operations, so before our expensive layers. And so what exactly does that mean? So reminder one by one

convolution, I guess we were just going through

this but it’s taking your input volume, it’s performing a

dot product at each spatial location and what it does is

it preserves spatial dimension but it reduces the depth and

it reduces that by projecting your input depth to a lower dimension. It’s basically

like a linear combination of your input feature maps. And so this main idea is

that it’s projecting your depth down and so the inception module takes these one by one convs

and adds these at a bunch of places in these modules

in order to alleviate

this expensive compute. So before the three by three

and five by five conv layers, it puts in one of these

one by one convolutions. And then after the

pooling layer it also puts an additional one by one convolution. Right so these are the one

by one bottleneck layers that are added in. And so how does this change the math that we were looking at earlier? So now basically what’s

happening is that we still have the same input here 28 by 28 by 256, but these one by one convs

are going to reduce the depth dimension and so you can see

before the three by three convs, if I put a one by

one conv with 64 filters, my output from that is going to be, 28 by 28 by 64. So instead of now going into

the three by three convs afterwards instead of 28

by 28 by 256 coming in, we only have a 28 by 28,

by 64 block coming in. And so this is now

a smaller input going into these conv

layers, the same thing for the five by five conv, and

then for the pooling layer, after the pooling comes

out, we’re going to reduce the depth after this. And so, if you work out

the math the same way for all of the convolutional ops here, adding in now all these one by one convs on top of the three by

threes and five by fives, the total number of operations

is 358 million operations, so it’s much less than the

854 million that we had in the naive version, and

so you can see how you can use this one by one

conv, and the filter size for that to control your computation. Yes, question in the back. [student speaks off mic] – Yes, so the question

is, have you looked into what information might be

lost by doing this one by one conv at the beginning. And so there might be

some information loss, but at the same time if

you’re doing these projections you’re taking a linear

combination of these input feature maps which has redundancy in them, you’re taking combinations of them, and you’re also introducing

an additional non-linearity after the one by one

conv, so it also actually helps in that way with

adding a little bit more depth and so, I don’t think

there’s a rigorous analysis of this, but basically in

general this works better and there’s reasons why it helps as well. OK so here we have, we’re

basically using these one by one convs to help manage our

computational complexity, and then what GoogleNet

does is it takes these inception modules and it’s going to stack all these together. So this is a full inception architecture. And if we look at this a

little bit more detail, so here I’ve flipped it, because it’s so big, it’s not going to fit vertically any more on the slide. So what we start with is

we first have this stem network, so this is more

the kind of vanilla plain conv net that we’ve seen earlier [mumbles] six sequence of layers. So conv pool a couple

of convs in another pool just to get started and then after that we have all of our different

our multiple inception modules all stacked on top of each other, and then on top we have

our classifier output. And notice here that

they’ve really removed the expensive fully connected layers. It turns out that the model

works great even without them, and you save a lot of parameters. And then what they also have here is, you can see these couple

of extra stems coming out and these are auxiliary

classification outputs and so these are also you know

just a little mini networks with an average pooling,

a one by one conv, a couple of fully connected

layers here going to the SoftMax, which is also a 1000 way SoftMax with the ImageNet classes. And so you’re actually

using your ImageNet training classification loss in

three separate places here. The standard end of the

network, as well as in these two places earlier on in

the network, and the reason they do that is just

this is a deep network and they found that having

these additional auxiliary classification outputs,

you get more gradient signal injected at the earlier layers during training, and so just more helpful signal flowing in, because these intermediate

layers should also be helpful. You should be able to do classification based off some of these as well. And so this is the full architecture, there’s 22 total layers

with weights and so within each of these modules

each of those one by one, three by three, five by

five is a weight layer, just including all of

these parallel layers, and in general it’s a relatively

more carefully designed architecture and part of this

is based on some of these intuitions that we’re talking

about and part of them also is just you know

Google the authors they had huge clusters and they’re

cross validating across all kinds of design

choices and this is what ended up working well. Question? [student speaks off mic] – Yeah so the question is,

are the auxiliary outputs actually useful for the

final classification, to use these as well? I think when they’re training them they do average all these

for the losses coming out. I think they are helpful. I can’t remember if in

the final architecture, whether they average all

of these or just take one, it seems very possible that

they would use all of them, but you’ll need to check on that. [student speaks off mic] – So the question is for

the bottleneck layers, is it possible to use some

other types of dimensionality reduction and yes you can use

other kinds of dimensionality reduction. The benefits here of

this one by one conv is, you’re getting this effect,

but it’s all, you know it’s a conv layer just like any other. You have the soul network of these, you just train it this full network back [mumbles] through everything, and it’s learning how to combine the previous feature maps. Okay yeah, question in the back. [student speaks off mic] – Yes so, question is

are any weights shared or all they all separate and yeah, all of these layers have separate weights. Question. [student speaks off mic] – Yes so the question is why do we have to inject gradients at earlier layers? So our classification

output at the very end, where we get a gradient on this, it’s passed all the way back

through the chain rule, but the problem is when

you have very deep networks and you’re going all the

way back through these, some of this gradient

signal can become minimized and lost closer to the beginning,

and so that’s why having these additional ones in earlier parts can help provide some additional signal. [student mumbles off mic] – So the question is are you

doing back prop all the times for each output. No it’s just one back

prop all the way through, and you can think of these three, you can think of there being kind of like an addition at the end

of these if you were to draw up your computational

graph, and so you get your final signal and you can

just take all of these gradients and just backprop

them all the way through. So it’s as if they were

added together at the end in a computational graph. OK so in the interest of

time because we still have a lot to get through, can

take other questions offline. Okay so GoogleNet basically 22 layers. It has an efficient inception module, there’s no fully connected layers. 12 times fewer parameters than AlexNet, and it’s the ILSVRC 2014

classification winner. And so now let’s look at the 2015 winner, which is the ResNet network and so here this idea is really, this

revolution of depth, right. We were starting to increase

depth in 2014, and here we’ve just had this hugely

deeper model at 152 layers was the ResNet architecture. And so now let’s look at that

in a little bit more detail. So the ResNet architecture,

is getting extremely deep networks, much deeper

than any other networks before and it’s doing this using this idea of residual connections

which we’ll talk about. And so, they had 152

layer model for ImageNet. They were able to get 3.57%

top 5 error with this and the really special

thing is that they swept all classification and

detection contests in the ImageNet benchmark

and this other benchmark called COCO. It just basically won everything. So it was just clearly

better than everything else. And so now let’s go into a

little bit of the motivation behind ResNet and residual connections that we’ll talk about. And the question that they

started off by trying to answer is what happens when we try

and stack deeper and deeper layers on a plain

convolutional neural network? So if we take something like VGG or some normal network that’s

just stacks of conv and pool layers on top of each

other can we just continuously extend these, get deeper

layers and just do better? And the answer is no. So if you look at what happens when you get deeper, so here

I’m comparing a 20 layer network and a 56 layer network

and so this is just a plain kind of network you’ll see

that in the test error here on the right the 56 layer

network is doing worse than the 20 layer network. So the deeper network was

not able to do better. But then the really weird thing is now if you look at the training error right we here have again the 20 layer network and a 56 layer network. The 56 layer network, one of

the obvious problems you think, I have a really deep network,

I have tons of parameters maybe it’s probably starting

to over fit at some point. But what actually happens is

that when you’re over fitting you would expect to have very good, very low training error rate,

and just bad test error, but what’s happening here is

that in the training error the 56 layer network is

also doing worse than the 20 layer network. And so even though the

deeper model performs worse, this is not caused by over-fitting. And so the hypothesis

of the ResNet creators is that the problem is actually

an optimization problem. Deeper models are just harder to optimize, than more shallow networks. And the reasoning was that well, a deeper model should be

able to perform at least as well as a shallower model. You can have actually a

solution by construction where you just take the learned layers from your shallower model, you just copy these over and then

for the remaining additional deeper layers you just

add identity mappings. So by construction this

should be working just as well as the shallower model. And your deeper model, even if it wasn’t

able to learn properly, it should be able to learn at least this. And so motivated by

this their solution was well how can we make it

easier for our architecture, our model to learn these

kinds of solutions, or at least something like this? And so their idea is well

instead of just stacking all these layers on top

of each other and having every layer try and learn

some underlying mapping of a desired function, lets

instead have these blocks, where we try and fit a residual mapping, instead of a direct mapping. And so what this looks

like is here on this right where the input to these block

is just the input coming in and here we are going to

use our, here on the side, we’re going to use our

layers to try and fit some residual of our desired H of X, that is H of X minus X, instead of the desired

function H of X directly. And so basically at the

end of this block we take the skip connection on

this right here, this loop, where we just take our input,

we just pass it through as an identity, and so if

we had no weight layers in between it was just

going to be the identity it would be the same thing

as the output, but now we use our additional weight

layers to learn some delta, for some residual from our X. And so now the output

of this is going to be just our original X plus some residual, which is what we’re going to call it. It’s basically a delta

and so the idea is that now the output it should

be easy for example, in the case where identity is ideal, to just squash all of

these weights of F of X from our weight layers

just set it to all zero for example, then we’re

just going to get identity as the output, and we can get something, for example, close to this

solution by construction that we had earlier. Right, so this is just

a network architecture that says okay, let’s have our weight layers try and fit this

residual, so that the output will more

likely be something close to X, it’s just modifying X,

rather than having to learn exactly this full mapping of what it should be. Okay, any questions about this? [student speaks off mic] – The question is, are these the same dimension? So yes these two paths

are the same dimension. In general either it’s the same dimension, or what they actually

do is they have these projections and shortcuts

and they have different ways of padding to make things work

out to be the same dimension. Depth wise. Yes – [Student] When you use the word residual you were talking about [mumbles off mic] – So the question is what

exactly do we mean by residual this output

of this transformation is a residual? So we can think of our output

here right as this F of X plus X, where F of X is the

output of our transformation and then X is our input,

just passed through by the identity. So using a plain layer, what we’re trying to do is learn something like H of X, but what we saw

earlier is that it’s hard to learn H of X. It’s hard to learn a good H of X as we

get very deep networks. And so here the idea is

let’s try and break it down instead, as H of X is

equal to F of X plus X, and let’s just try and learn F of X. And so instead of learning

directly this H of X we just want to learn what

is it that we need to add or subtract to our input as

we move on to the next layer. So you can think of it as

kind of modifying this input, in place in a sense. We have– [interrupted by student mumbling off mic] – The question is, when we’re

saying the word residual are we talking about F of X? Yeah. So F of X is what we’re

calling the residual. And it just has that meaning. Yes another question. [student mumbles off mic] – So the question is in

practice do we just sum F of X and X together, or

do we learn some weighted combination and you just do a direct sum. Because when you do a direct sum, this is the idea of let

me just learn what is it I have to add or subtract onto X. Is this clear to everybody,

the main intuition? Question. [student speaks off mic] – Yeah, so the question

is not clear why is it that learning the

residual should be easier than learning the direct mapping? And so this is just their hypothesis, and the hypothesis is that if

we’re learning the residual you just have to learn

what’s the delta to X, right? And our hypothesis is that generally, even something like our

solution by construction, where we had some number

of these shallow layers that were learned and we had

all these identity mappings at the top this was a

solution that should have been good, and so that implies that

maybe a lot of these layers, actually something just close to identity, would be a good layer And so because of that,

now we formulate this as being able to learn the identity plus just a little delta. And if really the identity

is best we just make F of X squashes transformation

to just be zero, which is something that’s relatively, might seem easier to learn, also we’re able to get

things that are close to identity mappings. And so again this is not

something that’s necessarily proven or anything it’s just

the intuition and hypothesis, and then we’ll also see

later some works where people are actually trying to

challenge this and say oh maybe it’s not actually the residuals

that are so necessary, but at least this is the

hypothesis for this paper, and in practice using this model, it was able to do very well. Question. [student speaks off mic] – Yes so the question is

have people tried other ways of combining the inputs

from previous layers and yes so this is basically a very

active area of research on and how we formulate

all these connections, and what’s connected to what

in all of these structures. So we’ll see a few more

examples of different network architectures briefly later

but this is an active area of research. OK so we basically have all

of these residual blocks that are stacked on top of each other. We can see the full resident architecture. Each of these residual blocks

has two three by three conv layers as part of this block

and there’s also been work just saying that this happens

to be a good configuration that works well. We stack all these blocks

together very deeply. Another thing like with

this very deep architecture it’s basically also

enabling up to 150 layers deep of this, and then

what we do is we stack all these and periodically we also double the number of filters

and down sample spatially using stride two when we do that. And then we have this additional [mumbles] at the very beginning of our network and at the end we also, here, don’t have any fully connected layers and we just have a global

average pooling layer that’s going to average

over everything spatially, and then be input into the

last 1000 way classification. So this is the full ResNet architecture and it’s very simple and

elegant just stacking up all of these ResNet blocks

on top of each other, and they have total depths

of 34, 50, 101, and they tried up to 152 for ImageNet. OK so one additional

thing just to know is that for very deep networks,

so the ones that are 50 layers deep or more, they

also use bottleneck layers similar to what GoogleNet did

in order to improve efficiency and so within each block

now you’re going to, what they did is, have this

one by one conv filter, that first projects it

down to a smaller depth. So again if we are looking

at let’s say a 28 by 28 by 256 input, we do

this one by one conv, it’s

projecting the depth down. We get 28 by 28 by 64. Now

your three by three conv, and here they only have

one, is operating over this reduced depth so it’s going

to be less expensive, and then afterwards they have another one by one conv that

projects the depth back up to 256, and so, this is

the actual block that you’ll see in deeper networks. So in practice the ResNet

also uses batch normalization after every conv layer, they

use Xavier initialization with an extra scaling factor

that they helped introduce to improve the initialization. It’s

trained with SGD + momentum. For their learning rate they

use a similar type of learning rate schedule where you

decay your learning rate when your validation error plateaus. Mini batch size 256, a

little bit of weight decay and no drop out. And so experimentally they

were able to show that they were able to train these

very deep networks, without degrading. They were able to have

basically good gradient flow coming all the way back

down through the network. They tried up to 152 layers on ImageNet, 1200 on CIFAR, which is a,

you have played with it, but a smaller data set

and they also saw that now your deeper networks are

able to achieve lower training errors as expected. So you don’t have the same strange plots that we saw earlier where the behavior was in the wrong direction. And so from here they were

able to sweep first place at all of the ILSVRC competitions, and all of the COCO competitions in 2015 by significant margins. Their total top five error

was 3.6% for classification and this is actually better

than human performance in the ImageNet paper. There was also a human

metric that came from actually [mumbles] our

lab. Andrej Karpathy spent like a week training

himself and then basically did this task himself and was I think somewhere around 5-ish %, and so this was basically able to do better than that human at least. Okay, so these are kind

of the main networks that have been used recently. We had AlexNet starting off with first, VGG and GoogleNet are still very popular, but ResNet is the most

recent best performing model that if you’re looking for

something for training a new network, ResNet is available, and you should try working with it. So just quickly looking at

some of this getting a better sense of the complexity involved. So here we have some

plots that are sorted by performance so this is

top one accuracy here, and higher is better. And so you’ll see a lot

of these models that we talked about, as well as

some different versions of them so, this

GoogleNet inception thing, I think there’s like V2,

V3 and the best one here is V4, which is actually

a ResNet plus inception combination, so these are just kind of more incremental, smaller

changes that they’ve built on top of them,

and so that’s the best performing model here. And if we look on the

right, these plots of their computational complexity here it’s sorted. The Y axis is your top one accuracy so higher is better. The X axis is your operations

and so the more to the right, the more ops you’re doing,

the more computationally expensive and then the bigger the circle, your circle is your memory usage, so the gray circles are referenced here, but the bigger the circle

the more memory usage and so here we can see

that VGG these green ones are kind of the least efficient. They have the biggest memory, the most operations, but they they do pretty well. GoogleNet is the most efficient here. It’s way down on the operation side, as well as a small little

circle for memory usage. AlexNet, our earlier

model, has lowest accuracy. It’s relatively smaller compute, because it’s a smaller network, but

it’s also not particularly memory efficient. And then ResNet here, we

have moderate efficiency. It’s kind of in the middle,

both in terms of memory and operations, and it

has the highest accuracy. And so here also are

some additional plots. You can look at these

more on your own time, but this plot on the left is

showing the forward pass time and so this is in milliseconds

and you can see up at the top the VGG forward pass is about 200

milliseconds you can get about five frames per second with this, and this is sorted in order. There’s also this plot on

the right looking at power consumption and if you look

more at this paper here, there’s further analysis of

these kinds of computational comparisons. So these were the main

architectures that you should really know in-depth and be familiar with, and be thinking about actively using. But now I’m going just

to go briefly through some other architectures

that are just good to know either historical inspirations or more recent areas of research. So the first one Network in Network, this is from 2014, and

the idea behind this is that we have these

vanilla convolutional layers but we also have these,

this introduces the idea of MLP conv layers they call

it, which are micro networks or basically network within networth, the name of the paper. Where within each conv

layer trying to stack an MLP with a couple of fully

connected layers on top of just the standard conv

and be able to compute more abstract features for these local patches right. So instead of sliding

just a conv filter around, it’s sliding a slightly

more complex hierarchical set of filters around

and using that to get the activation maps. And so, it uses these fully connected, or basically one by one

conv kind of layers. It’s going to stack them all up like the bottom diagram here where

we just have these networks within networks stacked

in each of the layers. And the main reason to know this is just it was kind of a precursor

to GoogleNet and ResNet in 2014 with this idea

of bottleneck layers that you saw used very heavily in there. And it also had a little bit

of philosophical inspiration for GoogleNet for this idea

of a local network topology, network in network, that they also used, with a different kind of structure. Now I’m going to talk

about a series of works since ResNet

that are mostly geared towards improving ResNet

and so this is more recent research that has been done since then. I’m going to go over these pretty fast, and so just at a very high level. If you’re interested in

any of these you should look at the papers, to have more details. So the authors of ResNet

a little bit later on in 2016 also had this paper

where they improved the ResNet block design. And so they basically

adjusted what were the layers that were in the ResNet block path, and showed this new

structure was able to have a more direct path in order

for propagating information throughout the network,

and you want to have a good path to propagate

information all the way up, and then back all the way down again. And so they showed that this

new block was better for that and was able to give better performance. There’s also a Wide Residual

networks which this paper argued that while ResNets

made networks much deeper as well as added these

residual connections, their argument was

that residuals are really the important factor. Having this residual construction, and not necessarily having

extremely deep networks. And so what they did was they

used wider residual blocks, and so what this means is

just more filters in every conv layer. So before we might have

F filters per layer and they use these factors

of K and said well, every layer it’s going to be

F times K filters instead. And so, using these

wider layers they showed that their 50 layer wide

ResNet was able to out-perform the 152 layer original ResNet, and it also had the

additional advantage that increasing width instead of depth,

even with the same amount of parameters, is more

computationally efficient because you can parallelize

these width operations more easily. Right, just convolutions with more neurons just spread across more kernels, as opposed to depth

that’s more sequential, so it’s more computationally

efficient to increase your width. So here you can see

this work is starting to try to understand the

contributions of width and depth and residual connections, and making some arguments

for one way versus the other. And this other paper around the same time, I think maybe a little

bit later, is ResNeXt, and so this is again,

the creators of ResNet continuing to work on

pushing the architecture. And here they also had

this idea of okay, let’s indeed tackle this width

thing more but instead of just increasing the width

of this residual block through more filters they have structure. And so within each residual

block, multiple parallel pathways and they’re going to call the total number of these

pathways the cardinality. And so it’s basically

taking the one ResNet block with the bottlenecks and having

it be relatively thinner, but having multiple of

these done in parallel. And so here you can also

see that this both has some relation to this idea of wide networks, as well as some connection

to the inception module as well right where we

have these parallel, these layers operating in parallel. And so now this ResNeXt has

some flavor of that as well. So another approach

towards improving ResNets was this idea called Stochastic

Depth and in this work the motivation is well let’s look more at this depth problem. Once you get deeper and

deeper the typical problem that you’re going to have is

vanishing gradients, right. Your

gradients will get smaller and eventually vanish as

you’re trying to back propagate them over very long layers,

or a large number of layers. And so what their motivation

is well let’s try to have short networks during training

and they use this idea of dropping out a subset of

the layers during training. And so for a subset of the

layers they just drop out the weights and they just set

it to identity connection, and now what you get is you

have these shorter networks during training, you can pass back your gradients better. It’s also a little more

efficient, and then it’s kind of like the drop out right. It has this sort of flavor

that you’ve seen before. And then at test time you want

to use the full deep network that you’ve trained. So these are some of the

works that are looking at the ResNet architecture, trying

to understand different aspects of it and trying

to improve ResNet training. And so there’s also some

works now that are going beyond ResNet that are

saying well what are some non ResNet architectures that

maybe can also work comparably to or better than ResNets. And so one idea is

FractalNet, which came out pretty recently, and the

argument in FractalNet is that, well, residual

representations maybe are not actually necessary,

so this goes back to what we were talking about earlier. What’s the motivation of

residual networks and it seems to make sense and there’s, you know, good reasons for why this

should help but in this paper they’re saying that well here

is a different architecture that we’re introducing, there’s

no residual representations. We think that the key is

more about transitioning effectively from shallow to deep networks, and so they have this fractal architecture which has if you look on the right here, these layers where they compose

it in this fractal fashion. And so there’s both

shallow and deep pathways to your output. And so they have these

different length pathways, they train them with

dropping out sub paths, and so again it has this

dropout kind of flavor, and then at test time they’ll

use the entire fractal network and they show that this was able to get very good performance. There’s another idea

called Densely Connected convolutional Networks,

DenseNet, and this idea is now we have these

blocks that are called dense blocks. And within each block

each layer is going to be connected to every other layer after it, in this feed forward fashion. So within this block,

your input to the block is also the input to

every other conv layer, and as you compute each conv output, those outputs are now connected to every layer after and then,

these are all concatenated as input to the conv

layer, and they do some they have some other

processes for reducing the dimensions and keeping efficient. And so their main takeaway from this, is that they argue that

this is alleviating a vanishing gradient problem

because you have all of these very dense connections. It strengthens feature propagation

and then also encourages feature reuse, right, because

there are so many of these connections each feature

map that you’re learning is input in multiple

later layers and being used multiple times. So these are just a

couple of ideas that are you know alternatives or

what can we do that’s not ResNets and yet is still performing either comparably or better to

ResNets and so this is another very active area

of current research. You can see that a lot of this is looking at the way different layers

are connected to each other and how depth is managed

in these networks. And so one last thing

that I wanted to mention quickly, is just efficient networks. So this idea of efficiency

and you saw that GoogleNet was a work that was

looking into this direction of how can we have efficient

networks which are important for you know a lot of

practical usage both training as well as especially

deployment and so this is another recent network

that’s called SqueezeNet which is looking at

very efficient networks. They have these things

called fire modules, which consist of a

squeeze layer with a lot of one by one filters and

then this feeds then into an expand layer with one by

one and three by three filters, and they’re showing that with

this kind of architecture they’re able to get AlexNet

level accuracy on ImageNet, but with 50 times fewer parameters, and then you can further do

network compression on this to get up to 500 times

smaller than AlexNet and just have the whole

network just be 0.5 megs. And so this is a direction

of how do we have efficient networks model compression that we’ll cover more in a lecture later, but just giving you a hint of that. OK so today in summary we’ve

talked about different kinds of CNN Architectures. We looked in-depth at four

of the main architectures that you’ll see in wide usage. AlexNet, one of the early,

very popular networks. VGG and GoogleNet which

are still widely used. But ResNet is kind of

taking over as the thing that you should be

looking at most when you can. We also looked at these other networks at a brief overview level. And so the takeaway is that these

models are available, they’re in a lot of

[mumbles] so you can use them when you need them. There’s a trend toward

extremely deep networks, but there’s also significant

research now around the design of how do we connect layers, skip connections, what

is connected to what, and also using these to

design your architecture to improve gradient flow. There’s an even more recent

trend towards examining what’s the necessity

of depth versus width, residual connections. Trade offs, what’s

actually helping matters, and so there’s a lot of these recent works in this direction that you can look into, including some of the ones I pointed

out if you are interested. And next time we’ll talk about

Recurrent neural networks. Thanks.
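
As a rough check on the operation counts from the inception module discussion earlier in the lecture, here is a minimal Python sketch. The helper `conv_ops` and all variable names are made up for illustration. It counts only the multiplies in the conv layers, at stride one, with padding that keeps the 28 by 28 spatial size. It also assumes a 64-filter one by one reduction on every bottleneck branch; the lecture only states 64 for the branch before the three by three conv. Under those assumptions the naive module comes out to about 854 million multiplies with an output depth of 672, matching the lecture. The bottleneck version comes out to about 271 million, which is lower than the 358 million quoted, so the quoted figure presumably reflects different filter counts or a different counting convention.

```python
def conv_ops(height, width, in_depth, num_filters, kernel):
    """Multiplies for one conv layer: one dot product of size
    kernel*kernel*in_depth per output location, per filter."""
    return height * width * num_filters * kernel * kernel * in_depth


H, W, D_IN = 28, 28, 256  # input volume used in the lecture example

# Naive module: (num_filters, kernel_size) for each parallel conv branch.
naive_branches = [(128, 1), (192, 3), (96, 5)]
naive_ops = sum(conv_ops(H, W, D_IN, f, k) for f, k in naive_branches)
# The pooling branch keeps the full input depth, so depth can only grow.
naive_depth = sum(f for f, _ in naive_branches) + D_IN

# Bottleneck module. REDUCE = 64 on every branch is an assumption.
REDUCE = 64
bottleneck_ops = (
    conv_ops(H, W, D_IN, 128, 1)        # plain 1x1 branch
    + conv_ops(H, W, D_IN, REDUCE, 1)   # 1x1 reduction before the 3x3
    + conv_ops(H, W, REDUCE, 192, 3)    # 3x3 on the reduced depth
    + conv_ops(H, W, D_IN, REDUCE, 1)   # 1x1 reduction before the 5x5
    + conv_ops(H, W, REDUCE, 96, 5)     # 5x5 on the reduced depth
    + conv_ops(H, W, D_IN, REDUCE, 1)   # 1x1 after the pooling branch
)
bottleneck_depth = 128 + 192 + 96 + REDUCE

print(f"naive:      {naive_ops:,} multiplies, output depth {naive_depth}")
print(f"bottleneck: {bottleneck_ops:,} multiplies, output depth {bottleneck_depth}")
```

Changing `REDUCE` or the per-branch filter counts shows how the one by one filter count controls the compute, which is the point made in the lecture.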