| Course (Veranstaltung) | Weekday | Time | Location |
| --- | --- | --- | --- |
| 042531 (exercise: 042532) | Monday (exercise: Wednesday) | 10:15 - 12:00 | Otto-Hahn-Str. 12, R 1.056, Campus Nord |

- Module description: pdf
- Language: English
- Lecturer: Dr. Sangkyun Lee
- Attendance at the practices (Übung) is STRONGLY RECOMMENDED.
- Office hour (lecturer): by appointment only, LS8, OH 12, Raum 4.023

This course focuses on optimization techniques for solving the large-scale problems that typically arise in statistical learning and data analysis tasks with big data. Topics include methods widely adopted in the modern research literature, such as projected gradient methods, accelerated first-order algorithms, conjugate gradient methods, quasi-Newton methods, block coordinate descent, proximal point methods, stochastic sub-gradient algorithms, the alternating direction method of multipliers (ADMM), and semi-definite programming. The efficiency of these methods, in terms of convergence rate and overall time complexity, will be discussed so that one can choose a suitable method for one's own research problem. We will also discuss separable (approximate) reformulations of optimization problems that are better suited for parallel computation than their original formulations. Discussions in the course will be problem-oriented, so methods will be presented in the context of specific problem instances whenever possible. Homework assignments will be given to provide background knowledge for the course and to check understanding of the techniques.
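As a small taste of the first-order methods named above, here is a minimal gradient descent sketch in Python (the practices themselves use Julia). The least-squares objective, the step size 1/L, and the iteration count are illustrative assumptions, not course material.

```python
import numpy as np

# Minimize f(x) = 0.5 * ||A x - b||^2 by gradient descent.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad(x):
    # gradient of 0.5 * ||A x - b||^2 is A^T (A x - b)
    return A.T @ (A @ x - b)

# step size 1/L, where L = ||A||_2^2 is the Lipschitz constant of the gradient
step = 1.0 / np.linalg.norm(A, 2) ** 2

x = np.zeros(5)
for _ in range(500):
    x = x - step * grad(x)

# compare against the closed-form least-squares solution
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_star, atol=1e-6))
```

With a well-conditioned problem like this, the 1/L step converges linearly, which is the kind of convergence-rate statement the lectures make precise.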

The aim of this course is to provide students with an understanding of modern optimization techniques suitable for large-scale/big-data problems, so that they can choose, apply, and/or modify appropriate methods for their own research problems.

- Lectures: understanding of optimization algorithms and their convergence properties.
- Practices: application of optimization to machine learning / data analysis problems.

This lecture will be based on recent publications, which will be assigned as reading homework. No textbook is required. Another lecture of mine may be helpful if you need background knowledge:

Some of my lecture materials are adapted from:

- EE236C (UCLA) by L. Vandenberghe, link
- 10-725 (CMU) by G. Gordon and R. Tibshirani, link
- EE364b (Stanford) by S. Boyd, link

- Introduction (11.04)
- Background for convex optimization (18.04)
- Gradient descent (25.04)
- Stochastic gradient descent 1
- Stochastic gradient descent 2
- Proximal gradient descent
- Accelerated gradient descent
- ADMM 1/2
- ADMM 2/2
- Block coordinate descent 1
- Block coordinate descent 2
- Duality 1
- Duality 2
- Duality 3
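To connect the stochastic gradient lectures to code, here is a minimal SGD sketch in Python for a least-squares objective. The data, the constant step size, and the iteration count are my own illustrative choices; a small constant step works here only because the noiseless data let the model interpolate exactly.

```python
import numpy as np

# SGD on f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(1)
n, d = 200, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true                    # noiseless labels: the model interpolates

x = np.zeros(d)
step = 0.05                       # small constant step, safe for these row norms
for _ in range(20000):
    i = rng.integers(n)           # sample one data point uniformly
    g = (A[i] @ x - b[i]) * A[i]  # stochastic gradient of the i-th term
    x -= step * g

print(np.linalg.norm(x - x_true))
```

In the noisy case treated in the lectures, a decaying step size (e.g. proportional to 1/sqrt(t)) or iterate averaging would be needed instead.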

Due to the nature of the practices, only some parts of the Übung materials will be available here. We'll use the Julia language.

- Intro to Julia
- Intro to JuMP, L1-SVM
- SVM
- Parallel processing in Julia (by Raoul Lefmann)
- Proximal GD and compressed sensing
- PGD, AGD and ADMM (Fused Lasso)
- Performance Tips for Julia
- LADMM on logistic regression, and BCGD
- BCGD, Acceleration, BCD
- Gaussian MRF
- Plotting in Julia (by Mirko Bunse)
- Workflow considerations (by Mirko Bunse)
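The proximal gradient practice boils down to a few lines; here is a minimal ISTA sketch in Python for the Lasso (the problem sizes, the sparse ground truth, and the regularization weight `lam` are illustrative assumptions).

```python
import numpy as np

def soft_threshold(v, tau):
    # prox of tau * ||.||_1: componentwise shrinkage toward zero
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# ISTA for the Lasso: min_x 0.5 * ||A x - b||^2 + lam * ||x||_1
rng = np.random.default_rng(2)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]        # sparse ground-truth signal
b = A @ x_true
lam = 0.1

L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the smooth part
x = np.zeros(20)
for _ in range(2000):
    g = A.T @ (A @ x - b)            # gradient step on the smooth part
    x = soft_threshold(x - g / L, lam / L)  # prox step on the L1 part

print(np.count_nonzero(np.abs(x) > 1e-3))
```

Accelerating this loop with a momentum term gives FISTA, covered in the accelerated gradient lecture.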

*** !!! To register for the final exam, bring this form, filled in, by the last lecture date (21.07) to get my signature !!! ***

Final exam registration form

The basic idea is that you pick a paper, either from a suggested list or of your own choosing, and implement the algorithm in it. You give two presentations.

- Presentation 1 (30-45 min including Q&A; qualification for the final): show your understanding of the ideas and algorithms in the paper (tentative dates: **July 6, 13, 20**).
  - Include a list of tasks you plan to do in the month between the end of lectures and the final presentation.

- Presentation 2 (45 min including Q&A; the final exam): briefly review the idea, then show that your implementation is correct and works well (tentative dates: **August 17 and 18, 10:00 - 13:00**).

**Grading criteria:**
- 1. Understanding of the method in your chosen paper.
- 2. Correct implementation of the method. Correctness can be shown directly (e.g., by comparing against a JuMP solution) or indirectly (e.g., by comparing your prediction performance to that of another implementation on benchmark datasets).
- 3. (bonus) Making your code readable, modular, and open source with some documentation.
- (note) If you choose a paper already discussed in the lectures (e.g., FISTA), extra work is expected: e.g., an interesting application, or an in-depth discussion of the method beyond what was covered in the lectures.
- (note) If you go application-centric (using already-discussed methods), your discussion can focus on the application itself (motivation, challenges, etc.) while fulfilling criterion 1 above.

The list will be provided soon and will keep growing. Discussing your choice with the lecturer is highly recommended (at minimum, send me an email about it). More than one person can work on the same paper, and you can focus on a part of a paper if you have a good reason. Finding more recent papers on similar topics is welcome. Choose one you can implement in about **2 -- 3 weeks**. Implementation in the Julia language is strongly recommended. Some papers may be accessible only from the university network.

>> Methodology-focused (this is the preferred way for a mini-project):

- [ SGD and variants ]
- Basis paper: Nemirovski et al., Robust Stochastic Approximation Approach to Stochastic Programming, SIOPT 2009
- ADAGRAD, JMLR 2011
- SAG, NIPS 2012
- SDCA, JMLR 2013
- SVRG, NIPS 2013 [Chosen by Mirko]
- ADAM, ICLR 2015

- [Stochastic second-order]
- Hessian-free second order, ICML 2011 [Chosen by Michael]
- K-FAC, ICML 2015
- Stochastic quasi-Newton, AISTATS 2007
- SFO, ICML 2014

- [Proximal gradient and acceleration]
- Basis paper: FISTA: A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems
- A geometric alternative to Nesterov's accelerated gradient descent
- A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights
- Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent [Chosen by Kevin]

- [ADMM]
- Basis paper: Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, FnT ML, 2010
- Online ADM, ICML 2012
- Fast Stochastic ADMM, ICML 2014
- Accelerated ADMM, KDD 2015
- Sentence Regularization by ADMM, ICML 2014
- Bregman ADMM
- Accelerated Linearized ADMM, SIAM J. Imaging Sci, 2015

- [BCD]
- Basis paper: On the Convergence of Block Coordinate Descent Type Methods, SIOPT, 2013
- SPDC, ICML 2015
- more to be added

>> Application-centric (an interesting application with a basic method is possible; finding newer/better papers is recommended)

- [SVM]
- Shalev-Shwartz, Singer, and Srebro, Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, ICML 2007
- SVM

- [Collaborative filtering]
- ALS for collaborative filtering (not a paper, but you can get some ideas for your project)
- NMF: non-negative matrix factorization (methods and papers are referenced in the package description)
- Learning the parts of objects by non-negative matrix factorization, Nature, 1999

- [Sparse coding]
- An overview of problems: L1-L2 Optimization in Signal and Image Processing
- Lasso, J. Royal Stat. Soc. 2003
- Generalized Lasso, Ann. Stat. 2011
- Fused Lasso, AAAI 2014
- Blind source separation
- Blind source separation-based dictionary learning
- Inpainting, Computer Journal, 2007
- Inpainting, IASC 2008
- Sparse component analysis
- more to be added

- [Sparse coding in probabilistic graphical models]
- GMRF, JMLR 2008
- Graphical Lasso, Biostatistics 2008

Data sets that might be useful for your project:

- Collections of "standard" data sets:
- Computer vision and image datasets:
- Movie rating data sets:
- Bioinformatics:
- ECML/PKDD 2015 Discovery Challenge: link