Tutorial Three: An Introduction to scikit-learn (I) – Gaël Varoquaux, Jake Vanderplas, Olivier Grisel
For a long time I’ve been very curious about machine learning. Up to this point it’s appeared to me much like a mystical unicorn. They seem really cool but you never really know much about them. This tutorial provided me with an excellent chance to break that mysticism down.
After a brief introduction to scikit-learn and a refresher on numpy/matplotlib we used IPython notebooks to walk through basic examples of what the suite is capable of. We then moved into a quick overview of what machine learning is and some common tactics for tackling data analysis. Now that we were a bit more familiar with the suite itself and machine learning principles, we moved onto more complex examples.
Again, using IPython notebooks we walked through examples of supervised learning (classification and regression), unsupervised learning (clustering and dimensionality reduction), and using PCA for data visualization. We ended the morning session with a couple of more advanced supervised learning examples (determining numbers of hand written digits and Boston house prices based on various factors) and an advanced unsupervised learning example in which we analyzed over 20,000 text articles to determine from which of four categories they likely originated.
One note for further research: How much data should be used for train vs test data? What factors play a role in this and are there any common standards or practices which researchers follow?
Tutorial Four: Statistical Data Analysis in Python – Christopher Fonnesbeck
Statistics is an area of for me. Combine that interest with Python Pandas and you’ve got an instant winner, right? Not exactly. While the talk was tagged for beginners it proved to be otherwise.
The speaker clearly had a very strong background in statistics. However, those that background didn’t transition into an easy to follow talk. The statistics language was very far above me and most of the room — if my observations were correct. Additionally, the version of pandas he used wasn’t the same as the version in the required packages noted in the talks description. This resulted in the majority of us not being able to follow along in IPython notebooks and being forced to watch him on the projector.
Please, don’t mistake this for a whine session. Chris knew his stuff and he was able to answer everyones’ questions and smashed some ‘stump the chump’ attempts without batting an eye. But, the talk should have been refined and rehearsed and versions of required packages should have been vetted earlier.
You can’t win them all, right?