DavidHolmes
Thanks Charlie.
Just to be double-sure, the second process was choosing the weight in a ball (so total L2 norm of weights was ≤ 1), rather than on a sphere (total norm == 1), right?
Yes, exactly (though for some constant , which may not be , but turns out not to matter).
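To make the ball/sphere distinction concrete, here is a minimal sketch of the two sampling processes. It assumes the ball samples are uniform in volume and takes the radius to be 1; the function names `sample_sphere` and `sample_ball` are made up for illustration and are not taken from the post.

```python
import numpy as np


def sample_sphere(n_samples, dim, radius=1.0, rng=None):
    """Uniform samples on the sphere: L2 norm exactly `radius`."""
    rng = np.random.default_rng() if rng is None else rng
    # A normalised Gaussian vector is uniformly distributed on the sphere.
    g = rng.standard_normal((n_samples, dim))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)


def sample_ball(n_samples, dim, radius=1.0, rng=None):
    """Uniform samples in the ball: L2 norm at most `radius`.

    Assumption: 'in a ball' means uniform with respect to volume.
    """
    rng = np.random.default_rng() if rng is None else rng
    directions = sample_sphere(n_samples, dim, 1.0, rng)
    # Scaling by u**(1/dim), u uniform on [0, 1), gives uniform volume density.
    r = radius * rng.random((n_samples, 1)) ** (1.0 / dim)
    return r * directions


rng = np.random.default_rng(0)
sphere_weights = sample_sphere(1000, 50, rng=rng)
ball_weights = sample_ball(1000, 50, rng=rng)
```

In high dimension almost all of the ball's volume sits near its boundary, so the two processes give similar norms unless `dim` is small.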
Is initializing weights that way actually a thing people do?
Not sure (I would like to know). But what I had in mind was initialising a network with small weights, then doing a random walk (‘undirected SGD’), and then looking at the resulting distribution. Of course this will be more complicated than the distributions I use above, but I think the shape may depend quite a bit on the details of the SGD. For example, I suspect that the result of something like adaptive gradient descent may tend towards more spherical distributions, but I haven’t thought about this carefully.
If training large neural networks only moves the parameters a small distance (citation needed), do you still think there’s something interesting to say about the effect of training in this lens of looking at the density of nonlinearities?
I hope so! I would want to understand what norm the movements are ‘small’ in (L2, L, …).
LayerNorm looks interesting, I’ll take a look.
Neural networks biased towards geometrically simple functions?
Maths at my Dutch university also has homework for quite a few of the courses, which often counts for something like 10-20% of final grade. It can usually be submitted online, so you only need to be physically present for exams. However, there are a small number of courses that are exceptions to this, and actually require attendance to some extent (e.g. a course on how to give a scientific presentation, where a large part of the course consists of students giving and commenting on each other’s presentations—not so easy to replace the learning experience with a single exam at the end).
But this differs between Dutch universities.
I suspect the arXiv might not be keen on an account that posts papers by a range of people (not including the account-owner as coauthor). This might lead to heavier moderation/whatever. But I could be very wrong!
Some advice for getting papers accepted on arxiv
As some other comments have pointed out, there is a certain amount of moderation on arXiv. This is a little opaque, so below is an attempt to summarise some things that are likely to make it easier to get your paper accepted. I’m sure the list is very incomplete!
In writing this I don’t want to give the impression that posting things to arXiv is hard; I currently have 28 papers there, have never had a single problem or delay with moderation, and the submission process generally takes me <15 mins these days.
- Endorsement. When you first attempt to submit a paper you may need to be endorsed. JanBrauner kindly offered below to help people with endorsements; I might also be able to do the same, but I’ve never posted in the CS part of arXiv, so I’m not sure how effective this will be. However, it is even better to avoid needing an endorsement at all. To this end, use an academic email address if you have one; this is quite likely to already be enough. Also, see below on subject classes (endorsement requirements depend on which subject class(es) you want to post in).
- Choosing subject classes. Each paper gets one or more subject classes, like cs.AI; see [https://arxiv.org/category_taxonomy] for a list. Some subject classes attract more junk than others, and the ones that attract more junk are more heavily moderated. In mathematics, it is math.GM (General Mathematics) that attracts most junk, hence is most heavily moderated. I guess most people here are looking at cs.AI; I don’t know what this is like. But one easy thing is to minimise cross-listing (adding additional subject classes for your paper), as then you are moderated by all of them.
- Write in (la)tex, submit the tex file. You don’t have to do this, but it is standard and preferred by the arXiv, and I suspect makes it less likely your paper gets flagged for moderation. It is also an easy way to make sure your paper looks like a serious academic paper.
- It is possible to submit papers on behalf of third parties. I’ve never done this, and I suspect such papers will be more heavily moderated.
- If you have multiple authors, it doesn’t really matter who submits. After the submission is posted you are sent a ‘paper password’ allowing coauthors to ‘claim’ the paper; it is then associated to their arXiv account, ORCID etc (ORCID is optional, but a really good idea, and free).
Finally, a request: please be nice to the moderators! They are generally unpaid volunteers doing a valuable service to the community (e.g. making sure I don’t have to read nonsense proofs of the Riemann hypothesis every morning). Of course it doesn’t feel good if your paper gets held up, but please try not to take it personally.
The arXiv really prefers that you upload in tex. For the author this makes it less likely that your paper will be flagged for moderation etc (I guess). So if it were possible to export to TeX I think that for the purposes of uploading to arXiv this would be substantially better. Of course, I don’t know how much more/less work it is…
Hi Charlie, If you can give a short (precise) description for an agent that does the task, then you have written a short programme that solves the task. I think then if you need more space to ‘explain what the agent would do’ then you are saying there also exists a less efficient/compact way to specify the solution. From this perspective I think the latter is then not so relevant. David
I think that
provable guarantees on the safety of an FHE scheme that do not rely on open questions in complexity theory such as the difficulty of lattice problems.
is far out of reach at present (in particular to the extent that there does not exist a bounty which would affect people’s likelihood of working on it). It is hard to do much in crypto without assuming some kind of problem to be computationally difficult. And there are very few results proving that a given problem is computationally difficult in an absolute sense (rather than just ‘at least as hard as some other problem we believe to be hard’). Cf. P vs NP. Or perhaps I misunderstand your meaning; are you ok with assuming e.g. integer factorisation to be computationally hard?
Personally I also don’t think this is so important; if we could solve alignment modulo assuming e.g. integer factorisation (or some suitable lattice problem) is hard, then I think we should be very happy…
- More generally, I’m a bit sceptical of the effectiveness of a bounty here because the commercial applications of FHE are already so great.
- About 10 years ago, when I last talked to people in the area about this, I got somewhat the impression that FHE schemes were generally expected to be less secure than non-homomorphic schemes, just because the extra structure gives an attacker so much more to work with. But I have no idea if people still believe this.
P.s. the main thing I have taken so far from the link you posted is that the important part is not exactly about the biases of SGD. Rather, it is about the structure of the DNN itself; the algorithm used to find a (local) optimum plays less of a role than the overall structure. But probably I’m reading too much into your precise phrasing.
Hi Thomas, I agree the proof of the bound is not so interesting. What I found more interesting were the examples and discussion suggesting that, in practice, the upper bound seems often to be somewhat tight.
Concerning differential advancement, I agree this can advance capabilities, but I suspect that advancing alignment is somewhat hopeless unless we can understand better what is going on inside DNNs. On that basis I think it does differentially advance alignment, but of course other people may disagree.
Thanks very much for the link!
Bias towards simple functions; application to alignment?
If you get the daily arXiv email feeds for multiple areas it automatically removes duplicates (i.e. each paper appears exactly once, regardless of cross-listing). The email is not to everyone’s taste of course, but this is a nice aspect of it.
I was about to write approximately this, so thank you! To add one point in this direction, I am sceptical about the value of reducing the expectation for researchers to explain what they are doing. My research is in two fields (arithmetic geometry and enumerative geometry). In the first we put a lot of burden on the writer to explain themselves, and in the latter poor and incomplete explanations are standard. This sometimes allows people in the latter field to move faster, but
- it leaves critical foundational gaps, which we can ignore for a while but which eventually cause a lot of pain;
- sometimes really critical points are hidden in the details, and we just miss these if we don’t write the details down properly.
Disclaimers:
- while I think a lot of people working in these fields would agree with me that this distinction exists, not so many will agree that it is generally a bad thing.
- I’m generally criticising lack of rigour rather than lack of explanation. I am not claiming these necessarily have to go together, but in my experience they very often do.
p.s.
For the more substantive results in section 4, I do believe the direction is always flat --> sharp.
I agree with this (with ‘sharp’ replaced by ‘generalise’, as I think you intend). It seems to me potentially interesting to ask whether this is necessarily the case.
Vacuous sure, but still true, and seems relevant to me. You initially wrote:
Regarding the ‘sharp minima can generalize’ paper, they show that there exist sharp minima with good generalization, not flat minima with poor generalization, so they don’t rule out flatness as an explanation for the success of SGD.
But, allowing reparametrisation, this seems false? I don’t understand the step in your argument where you ‘rule out reparametrisation’, nor do I really understand what this would mean.
Your comment relating description length to flatness seems nice. To talk about flatness at all (in the definitions I have seen) requires a choice of parametrisation. And I guess your coordinate description is also using a fixed parametrisation, so this seems reasonable. A change of parametrisation will then change both flatness and description length (i.e. ‘required coordinate precision’).
Thank you for the quick reply! I’m thinking about section 5.1 on reparametrising the model, where they write:
every minimum is observationally equivalent to an infinitely sharp minimum and to an infinitely flat minimum when considering nonzero eigenvalues of the Hessian;
If we stick to section 4 (and so don’t allow reparametrisation) I agree there seems to be something more tricky going on. I initially assumed that I could e.g. modify the proof of Theorem 4 to make a sharp minimum flat by taking alpha to be big, but it doesn’t work like that (basically we’re looking at alpha + 1/alpha, which can easily be made big, but not very small). So maybe you are right that we can only make flat minima sharp and not conversely. I’d like to understand this better!
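To spell out the ‘alpha + 1/alpha’ remark, here is a toy calculation. It is not the construction from the paper: the two-parameter loss below is an assumption chosen purely for illustration, and it only shows the same one-sided behaviour (rescaling can increase curvature without bound, but cannot push it below a fixed floor).

```latex
% Elementary bound (AM-GM): for every \alpha > 0,
\[
  \alpha + \frac{1}{\alpha} \;\ge\; 2,
  \qquad \text{with equality iff } \alpha = 1,
  \qquad \alpha + \frac{1}{\alpha} \to \infty
  \ \text{ as } \alpha \to 0^{+} \text{ or } \alpha \to \infty .
\]

% Toy loss (illustrative assumption): L(w_1, w_2) = (w_1 w_2 - 1)^2.
% Its minima are the points with w_1 w_2 = 1, and at such a point the Hessian is
\[
  H = 2 \begin{pmatrix} w_2^{2} & 1 \\ 1 & w_1^{2} \end{pmatrix},
  \qquad \det H = 4\,(w_1^{2} w_2^{2} - 1) = 0,
  \qquad \operatorname{tr} H = 2\,(w_1^{2} + w_2^{2}),
\]
% so the eigenvalues are 0 and 2 (w_1^2 + w_2^2).
% Along the rescaling (w_1, w_2) = (\alpha, 1/\alpha), which leaves w_1 w_2 unchanged:
\[
  \lambda_{\max}(\alpha) = 2\left(\alpha^{2} + \frac{1}{\alpha^{2}}\right) \;\ge\; 4 .
\]
```

In this toy case the rescaling can make the minimum arbitrarily sharp but never flatter than the balanced point alpha = 1.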
I’m not sure I agree with interstice’s reading of the ‘sharp minima’ paper. As I understand it, they show that a given function can be made into a sharp or flat minimum by finding a suitable point in the parameter space mapping to the function. So if one has a sharp minimum that does not generalise (which I think we will agree exists) then one can make the same function into a flat minimum, which will still not generalise as it is the same function! Sorry I’m 2 years late to the party...
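As a small illustration of ‘different points in parameter space mapping to the same function’, here is a sketch using the standard rescaling symmetry of ReLU networks. The two-layer architecture, the shapes, and the name `two_layer_relu` are assumptions made for the example, not details from the paper or the thread.

```python
import numpy as np


def two_layer_relu(x, w1, w2):
    """A bias-free two-layer ReLU network: x -> relu(x @ w1) @ w2."""
    return np.maximum(x @ w1, 0.0) @ w2


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))   # 5 inputs of dimension 3
w1 = rng.standard_normal((3, 4))  # first-layer weights
w2 = rng.standard_normal((4, 1))  # second-layer weights

# For alpha > 0, relu(alpha * z) = alpha * relu(z), so scaling the first
# layer by alpha and the second by 1/alpha leaves the function unchanged.
alpha = 10.0
out_original = two_layer_relu(x, w1, w2)
out_rescaled = two_layer_relu(x, alpha * w1, w2 / alpha)
```

The two parameter settings are far apart in weight space and have different local curvature, yet they compute the same outputs, so they generalise identically.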
if we gave research grants to smart and personable university graduates and gave them carte blanche to do with the money what they wished that would work just as well as the current system
This thought is not unique to you; see e.g. the French CNRS system. My impression is that it works kind of as you would expect; a lot of them go on to do solid work, some do great work, and a few stop working after a couple of years. Of course we cannot really know how things would have turned out if the same people had been given more conventional positions.
I think this is a key point. Even the best possible curriculum, if it has to work for all students at the same rate, is not going to work well. What I really want (both for my past-self as a student, and my present self as a teacher of university mathematics) is to be able to tailor the learning rate to individual students and individual topics (for student me, this would have meant ‘go very fast for geometry and rather slowly for combinatorics’). And while we’re at it, can we also customise the learning styles (some students like to read, some like to sit in class, some to work in groups, etc)?
This is technologically more feasible than it was a decade ago, but seems far from common.