Why load everything into a spreadsheet when the terminal can be faster, more powerful, and more easily scriptable?
So you’ve landed on some data you want to analyze. Where do you begin?
Many people used to working in a graphical environment might default to using a spreadsheet tool, but there’s another way that might prove to be faster and more efficient, with just a little more effort. And you don’t need to become an expert in a statistical modeling language or a big data toolset to take advantage of these tools.
I’m talking about the Linux command line. Using just a few tools that you’ve probably already got installed on your computer, you can learn a lot about a dataset without ever leaving your terminal. Long-time Linux users will of course laugh; they’ve been using many of these tools for years to parse logs and understand configuration files. But for the Linux newcomer, the revelation that you’ve got a whole data analysis toolkit already at your fingertips can be a welcome surprise.
Strictly speaking, most of these tools aren’t limited to Linux, either. Many hearken back to the days of Unix, and users of other Unix-like operating systems likely have them installed already or can install them with ease. Many are part of the GNU Coreutils package, while a few are individually maintained, and with some work, you can even use them on Windows.
So let’s try out a few of the many simple open source tools for data analysis and see how they work! If you’d like to follow along with these examples, go ahead and download this sample data file from GitHub. It is a CSV (comma-separated values) list of articles we published to Opensource.com in January.
head and tail
First, let’s get a handle on the file. What’s in it? What does its format look like? You can use the cat command to display a file in the terminal, but that’s not going to do you much good if you’re working with files longer than a few dozen lines.
Enter head and tail. Both are utilities for showing a specified number of lines from the top or bottom of the file. If you don’t specify the number of lines you want to see, you’ll get 10. Let’s try it with our file.
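Here is a minimal sketch of what that looks like. So that the commands run anywhere, it first builds a small stand-in file; the file name sample.csv and its two columns are placeholders invented for this illustration, not the actual download, so swap in the name of the file you saved.

```shell
# Build a small stand-in CSV (hypothetical name and columns) so the
# commands below have something to read: 1 header line plus 25 rows.
file=sample.csv
{
  echo "id,title"
  for i in $(seq 1 25); do
    echo "$i,Placeholder article $i"
  done
} > "$file"

# With no options, head prints the first 10 lines of the file.
head "$file"

# -n sets the number of lines: here, the header plus the first two rows.
head -n 3 "$file"

# tail works the same way from the bottom: the last three rows.
tail -n 3 "$file"
```

Looking at the first few lines is usually enough to see whether the file has a header row and which fields it contains, while tail is a quick check that the file ends where you expect.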