A Beginner’s Guide
I have always been a big fan of teaching and mentoring and this is my attempt at making a short guide on learning data analysis and programming. This page is not meant for everyone, but for those who have a slight interest in learning more about data analysis. If you are new to learning about data analysis and programming, I will provide some pointers on how to get started and where to focus your efforts. If you have some experience this may not be as helpful, but feel free to skim it.
Why Should YOU Learn Programming/Data Analysis?
1. Employment Outlook
Today’s economy is relying more heavily on data than ever before. Advertisers use your browsing data to find the right ad for you (at the right price), on demand ride companies use real-time supply and demand data to find the right driver for you, retailers use your shopping data to determine what you might buy next, and the list goes on. Given this shift to a data economy, knowing how to wrangle and find insights in data will be a skill that is valued by employers for the next few decades. I have no doubts that data analysis and data programming will be a valuable skill in the near future. The Economist recently wrote an article along the same lines (here).
2. The Challenge
With the amount of information we have today, the challenge of a modern data analyst is not finding the right data, but extracting valuable insights from it. Aside from the difficulties in getting your data into a usable format (which will take ~70–80% of your time), the ability to take the same information available to others and tell a convincing story is something that is not easy to do. A great data analyst is essentially a journalist with better programming skills.
3. Day to Day work
My biggest reason for recommending programming for your future is the day to day work. Every day I get paid to solve programming puzzles. That’s it. I have to take data that looks like X, shape it to look like Y, and then find something interesting about it. The problems get more advanced as you gain experience, which keeps the work fun day in and day out.
So Now that You Are Convinced…
Let’s move on to the guide. This guide will be split into 3 parts: Programming, Workflow/Organization, and Resources.
Programming will discuss where to get started with your learning and some of my recommended best practices as you get better.
Workflow/Organization will address my thoughts on how you work through problems and organize your data/programs to be both efficient and easy to follow.
Resources will include any external resources that will help you learn a lot more than what this guide can offer.
1. Choosing a language
Before you can start diving into the data you need to choose a programming language. As you can probably guess, I am a big fan of R. I didn’t use to be this way, but the language has grown on me for a variety of reasons:
- R is open source (free) and has thousands of individuals continually developing packages. If you have a particular data problem I can almost guarantee someone has already solved something similar and they wrote a package to do it in fewer lines.
- R has great documentation both for its base language and for the R packages that users rely on. Don’t know what the `filter` function does in the `dplyr` package? Go to the documentation on CRAN, R’s central package repository.
- Network of users. Related to package development, R has enough users that there is a sufficient community to help when you run into problems. I will cover this in more detail when I discuss StackOverflow under Resources.
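As a small taste of what those packages buy you, here is a minimal `dplyr` sketch (assuming the package is installed via `install.packages("dplyr")`; the `mtcars` data set ships with base R):

```r
# Load dplyr (install first with install.packages("dplyr") if needed)
library(dplyr)

# filter() keeps only the rows that match a condition.
# mtcars is a sample data set built into base R.
efficient_cars <- filter(mtcars, mpg > 25)
head(efficient_cars)

# Stuck? Pull up the official documentation right from the console:
# ?dplyr::filter
```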
Overall, for a beginner R is the way to go. SAS/SQL are easier to learn, but R is better as a true scripting language. Python seems to be slightly more popular than R, but I think it is also harder to learn given that it is an object-oriented programming language.
2. Getting Started
Now that I have (hopefully!) convinced you of why you should use R, it is time to actually start learning to program. The best course I have found is the introductory course offered by DataCamp. I discuss DataCamp in more detail under the “Resources” section below.
After you have done the introductory course above, download RStudio for your desktop. This is the best integrated development environment (IDE) out there for R.
3. Best Practices (a few of my favorites as you get better)
- Never repeat code. If you do something more than twice find a way to write a function or loop to run it multiple times. This will save you much more time in the future if you need to make edits to what you are doing.
- Progress > Perfection. Just keep coding. I cannot tell you enough how important this is. You are going to learn more from trying and failing over and over again than you will from trying to get the perfect piece of code to do something. There are times when perfection matters, but know that it is far better to keep moving forward than to find a perfect solution. I guarantee you will be more successful if you focus on progress over perfection.
- Simplicity > Efficiency. This is probably where I disagree with most programmers. A lot of programmers tend to focus on efficiency and the speed of a program at the expense of how easy it is for other programmers to understand. While efficiency can save computer run time it sometimes can cost more in human time to understand it. If you are working on something like Amazon.com’s homepage load rate or Google’s search results, efficiency is huge. If you are working on a personal project it probably doesn’t matter all that much. Know when to care about efficiency. A lot of times, the simple solution is also the efficient solution, so they don’t have to be mutually exclusive.
- Make everything dynamic. You should never hard-code values into your programs. Instead, assign these values to variables and refer to those variables going forward. For example, if you want to run a simulation 500 times you should define `n_simulations` as 500 (`n_simulations <- 500`) and then refer to `n_simulations` throughout your program. If you later decide to run 10,000 simulations you only need to change the single definition of `n_simulations` rather than replace the number 500 everywhere you referenced it. Thinking this way while programming will make your future workflow much easier and less error prone.
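To make the last two practices concrete, here is a short sketch using a made-up dice simulation: the repeated logic lives in one function, and the simulation count is defined exactly once.

```r
# Define parameters once instead of hard-coding them throughout.
n_simulations <- 500  # later, change this one line to run 10,000 instead

# Wrap repeated logic in a function rather than copy-pasting it.
roll_two_dice <- function() {
  sum(sample(1:6, size = 2, replace = TRUE))
}

# replicate() runs the function n_simulations times.
results <- replicate(n_simulations, roll_two_dice())
mean(results)  # should land near 7, the expected sum of two dice
```

If you decide to roll three dice instead of two, you edit one function; if you want more precision, you edit one variable.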
4. Recommended packages for R
Rather than go in-depth on the packages I recommend for every particular data analysis task, I am going to put a blanket statement out for the `tidyverse`. The tidyverse consists of a handful of packages created by Hadley Wickham, arguably the most prolific developer in R’s history. His packages make data analysis simple, and I highly recommend them for beginners as well as those with more experience. I know many experienced R programmers who use and recommend the tidyverse because of its ease of use. If you can master these packages you will be well prepared as a data analyst.
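To show the style these packages encourage, here is a short pipeline sketch (assuming the tidyverse is installed via `install.packages("tidyverse")`, again using the built-in `mtcars` data):

```r
library(dplyr)

mtcars %>%
  filter(cyl %in% c(4, 6)) %>%     # keep only 4- and 6-cylinder cars
  group_by(cyl) %>%                # one group per cylinder count
  summarise(avg_mpg = mean(mpg))   # average fuel economy per group
```

Each step reads top to bottom like a sentence, which is a big part of why these packages are so beginner friendly.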
1. Version Control (Git)
The first thing you need to know about programming is that you have to track your changes to your programs. Modern day version control (most people use Git) is just as important as the language you use to do your analysis. It has become an industry standard and I personally never want to work without it again. There is a learning curve involved, but using Git is something you just need to do. It has saved me countless hours in finding older versions of programs and storing code snippets that I can re-use later.
Add-Commit-Pull-Push (once you know some Git)
- This is my version of “Gym, Tan, Laundry”, and I use these 4 Git commands every day that I am programming. Remember these 4 commands and they will serve you well 95% of the time when using Git. For everything else, Google it.
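In practice, the loop looks like this (run from inside your repository; the commit message is a placeholder you replace each time):

```shell
git add .                             # stage the files you changed
git commit -m "Describe your change"  # record a snapshot locally
git pull                              # bring in teammates' commits first
git push                              # publish your commit to the remote
```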
2. Data/Program Organization
- Build an organizational structure that mirrors your workflow. Everything you do as a data analyst will always follow the same basic procedure:
- Bring data into the system ->
- Manipulate the data to be in the format you want ->
- Analyze the data.
This leads to the import/build/analysis structure I have set up under my GitHub account, and I recommend you follow something similar. Programs that import bring in data, programs that build manipulate your data, and programs that analyze summarize or visualize your data in some way. I have worked on hundreds of data cases at a consulting firm and I have never seen this fail. There are times when it might make sense to alter this framework (e.g., you don’t need a build step); however, for most projects this will work.
- Use numbering to link related programs/data. Words are ambiguous (is this a geographic analysis or a mapping analysis?), while numbers are not. If you number things correctly you should always be able to trace where something came from. There are times when you use multiple data sources for one analysis and this may not be easy to link with one number, so try your best to make it clear. Remember, Progress > Perfection.
- Always stay organized. I consider disorganization something of a programming disease. You start off with a little ambiguity (Where does this go? Will this name work for now?), and it only gets worse with time. In team-based settings this can be a nightmare, so spend the time to stay organized and have a standard you use. Staying organized takes more time upfront, but saves you lots of time on the back end when you need to find something later. Organization is an investment in your future productivity.
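Putting these ideas together, a project might be laid out like this (the file names are hypothetical, just to illustrate the import/build/analysis split and the numbering):

```
project/
├── import/
│   └── 01_import_sales_data.R      # bring raw data into the system
├── build/
│   └── 02_build_sales_panel.R      # manipulate data into the format you want
├── analysis/
│   └── 03_analyze_sales_trends.R   # summarize and visualize
└── data/
    ├── raw/
    └── processed/
```

The numbers make the pipeline order unambiguous: anyone opening the project can see that `02_build_sales_panel.R` depends on the output of `01_import_sales_data.R`.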
1. DataCamp (for learning R)
This is the best website for learning R. Hands down, no contest. These guys are dedicated to making great content with some of the best educators in the R community. The training is interactive and engaging, and they have courses on everything from getting started in R onward.
2. StackOverflow (for fixing your broken programs)
Without this website there is a good chance I would be unemployed. StackOverflow is a thread based website discussing programming bugs and how to fix them. You ask a question and answers are upvoted by users.
3. GitHub (for sharing your programs)
GitHub is a site for hosting your Git repositories, which makes it easy to share and collaborate with others on programming projects. I use GitHub to host all of my work related to this site. Once you have some experience under your belt I recommend you venture over to GitHub and get started. GitHub provides a great graphical user interface (GUI) and allows you to more easily track your progress on particular programming projects.
4. R for Data Science (for taking the next step)
This book is available online for free and provides lots of theoretical and practical applications for learning R with a focus on the `tidyverse`. I recommend it alongside DataCamp’s interactive tutorials above for learning more about data analysis/data science in R.