April 05, 2018 (5 min read)

From medicine to data science

Massive amounts of healthcare data are generated every day in public hospitals, most of which sit untouched in storage. This is especially true here in France, where data access policies are pretty strict, electronic health records (EHR) pretty new, and hospital-employed data scientists pretty much nonexistent (well, let’s see what changes in the future with the recent announcements from our president and the Rapport Villani). Among other reasons, this realization led me to where I am today: I recently (yesterday, in fact) finished an MSc in Data Science at Ecole Polytechnique and Université Paris-Saclay, and I am planning to put what I’ve learned to use unraveling the mysteries hidden in all this dormant data.

As this is my first post on this blog, it will also serve as a brief introduction to my background. After medical school, tired of mind-numbingly learning books by heart, I chose to pursue a residency in Public Health and Social Medicine to do a little more maths than other specialties allowed. My first year was divided into two research internships, one in biostatistics and one in epidemiology. In both, on top of being inspired by great mentors, I discovered how properly handled data could give powerful insights and have a huge impact on people’s health and well-being. I also realized how rarely “new” statistical approaches (i.e. machine learning, deep learning) appeared in actual studies and real-life applications. By that time, I’d already heard of a solid MSc in Data Science offered by one of our top schools, so I sent my application, and voilà, here I am today, eager to start my new internship next week doing research on adverse drug reaction detection in large databases. More on that later.

I learned a tremendous amount this year, but it was hard. Very hard. The fact that I was actually accepted into an applied maths curriculum still surprises me to this day; coming from medicine (which you start right after high school in France), I met neither the mathematical nor the programming prerequisites. Sure, I did spend a year working on statistical analyses, and at that time I thought that I was ready to go deeper, but I soon realized how much I lacked in terms of mathematical foundations to truly get the most out of my Data Science MSc.

Having spent the summer digging out good resources to get me to a decent level in a limited amount of time, I figured I would start this blog by sharing what helped me prepare for this year and get through it alive (spoiler: I even enjoyed it). I will list, in no specific order, some of the courses I took, with the relevant resources listed underneath each.

Convex Optimization

Convex optimization was one of the hardest subjects for me this year, drawing heavily from both the fields of analysis and linear algebra. It was also my very first lecture this year, which set the pace for what was to come.

  • Grant Sanderson’s video channel 3Blue1Brown — probably the most helpful resource of this whole post. I first discovered his videos on Khan Academy, when looking for a course on linear algebra. He is a passionate mathematician who has a talent for giving visual intuitions about complicated mathematical notions. I keep coming back to his videos whenever I need an explanation of anything he’s covered.

  • Convex Optimization by Stephen Boyd and Lieven Vandenberghe — this book was recommended by our teachers. It’s a good book, but I admit that the most helpful part was actually the Appendix: it’s a great summary of the mathematical background needed to follow an optimization course.

  • The Matrix Cookbook by Kaare Brandt Petersen and Michael Syskind Pedersen — this book was my cheatsheet for any matrix-related manipulation.

  • Fabian Pedregosa’s blog — lots of useful posts that were key for me to understand some of my lectures. Also, a useful inequalities cheatsheet for exercises and exams.

Probabilistic Graphical Models / Bayesian Statistics

These two subjects were heavily based on a very famous book in the data science field.

  • Pattern Recognition and Machine Learning by Christopher Bishop — do not be scared by the size of this book. It’s a well-written one that starts from the basics. Fortunately, you won’t need to read it all in one go (I didn’t); the most helpful chapters were the ones about Gaussian mixtures, variational Bayesian inference and Markov Chain Monte Carlo (MCMC) methods.

Machine Learning

  • Data Science from Scratch by Joel Grus — the author dives into each of the widely used machine learning models, coding everything in base Python and NumPy from scratch. It’s a very helpful way, if not the best, to understand how the models work. It is also a good way to learn Python best practices. All the code is available on the author’s associated GitHub repository.

Speech, Text and Natural Language Processing

  • Speech and Language Processing by Dan Jurafsky and James H. Martin — it was the book recommended by the teachers and the chapters on parsing (from 11 to 14) were really useful when we tackled the subject.

  • NLTK book by Steven Bird, Ewan Klein, and Edward Loper — NLTK is one of the most widely used packages for natural language processing (NLP) in Python. The authors have written a book, commonly referred to as “The NLTK book”, which is a great introduction to NLP with NLTK (obviously).

Deep Learning

  • The course’s GitHub repo by Olivier Grisel and Charles Ollion — the teachers did impressive work, updating slides every week to integrate fresh publications and building very didactic and detailed Jupyter notebooks; I can’t imagine a better introduction to deep learning. They made everything freely available in that repo.

  • Deep Learning with Python by François Chollet — the easiest way to start programming deep architectures is with a package called Keras, and its creator wrote a really practical book about it, which goes from the basics to implementing state-of-the-art models in some fields. A solid complement to the class’s great materials.

  • Distill — fundamental papers that explore new ways of publishing in the web era (getting away from static PDFs). All the published articles are very accessible and well-written, but their most notable features are their stunning interactive visualizations. This paper on momentum is amazing.

  • Twitter — it might come as a surprise for some, but as advised by someone, I created an account this year to follow the activity of researchers of interest. I didn’t expect much from it, but boy was I wrong; there are so many interesting discussions going on, that are both high-level and concise (because of the character limit). Moreover, it is an effortless (more or less) way to keep track of the daily evolving field of deep learning.

  • The Matrix Calculus You Need for Deep Learning by Terence Parr and Jeremy Howard — this is a perfect example of what I’ve found thanks to Twitter. I actually read it after the course finished, but this paper is definitely something I wish I’d found at the beginning.

Learning with Aggregation

  • Prediction, Learning and Games by Nicolò Cesa-Bianchi and Gabor Lugosi — the book of reference in the field, with progressive build-up and easy-to-follow proofs. Aiming to predict as well as the best out of a bunch of predictors for one or multiple tasks is really fun.
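To make "predicting as well as the best out of a bunch of predictors" a bit more concrete, here is a minimal sketch of an exponentially weighted average forecaster, one of the classic strategies in this field. The function name, the squared loss and the learning rate `eta` are illustrative assumptions of mine, not something taken from the course or the book's notation.

```python
import math


def exponentially_weighted_forecast(expert_predictions, outcomes, eta=0.5):
    """Combine expert predictions round by round (illustrative sketch).

    expert_predictions: list of rounds, each a list with one prediction per expert.
    outcomes: list of true values, one per round.
    eta: learning rate (an assumed default, not a tuned value).
    Returns the forecasts made and each expert's cumulative squared loss.
    """
    n_experts = len(expert_predictions[0])
    cumulative_losses = [0.0] * n_experts
    forecasts = []

    for predictions, outcome in zip(expert_predictions, outcomes):
        # Weight each expert by exp(-eta * past loss); subtracting the
        # smallest loss first avoids numerical underflow without changing
        # the normalized weights.
        best = min(cumulative_losses)
        weights = [math.exp(-eta * (loss - best)) for loss in cumulative_losses]
        total = sum(weights)

        # The forecast is the weighted average of the experts' predictions.
        forecast = sum(w * p for w, p in zip(weights, predictions)) / total
        forecasts.append(forecast)

        # Only after forecasting do we see the outcome and update the losses.
        for i, prediction in enumerate(predictions):
            cumulative_losses[i] += (prediction - outcome) ** 2

    return forecasts, cumulative_losses


if __name__ == "__main__":
    # Two toy experts: one always says 1.0, the other always says 0.0.
    rounds = [[1.0, 0.0]] * 20
    truth = [1.0] * 20
    forecasts, losses = exponentially_weighted_forecast(rounds, truth)
    print(forecasts[0], forecasts[-1], losses)
```

The forecaster starts with equal trust in every expert and shifts its weight towards whichever ones have accumulated the smallest loss so far, which is why it ends up tracking the best expert without knowing in advance which one that is.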

Compressed Sensing

One of the most interesting subjects this year, grounded in solid maths and with a lot of cool applications such as medical magnetic resonance imaging (MRI) or matrix completion (the so-called Netflix problem).

  • The course’s lecture notes by Guillaume Lecué — these notes were of great quality, although written in French. Still, if you can understand them, by all means have a look; they’re crystal clear.

  • Michael Lustig’s website — he is an associate professor at UC Berkeley, and has worked a lot on the applications of compressed sensing in MRI. His website has a lot of useful resources, links to seminal papers and exercises to play with.

Kernel Methods for Machine Learning

  • The course’s slides by Jean-Philippe Vert and Julien Mairal — these slides are really well-made and are self-contained; I usually hate reading slides when I study for exams, but those were the exception.

And that’s it. Those were the resources that helped me make sense of what I was taught during lectures, and I’m sure I’ll come back to them on a regular basis. Do not hesitate to add other links in the comment section; I’m always on the lookout for great teaching content to learn from.


© 2022, dinh-phong nguyen