This week my colleague introduced me to the Unix cut command. It is one of those tools that, once you start using it, you wonder how you ever lived without it. cut is a great tool for extracting data from flat files.
The aim of this tutorial is to give you basic fluency and enough knowledge "to be dangerous".
Say you have a text file that looks like so:
# foods.txt
Name|Type|Rating
Apple Pie|Dessert|5
Cheeseburger|Dinner|5
Green Salad|Side|5
Ravioli|Dinner|5
Pancake|Breakfast|5
Apricot||4
Almonds|Snack|
Suppose you want to extract the values from one of the columns, e.g. Type.
cut works very simply: for each line of input, it "cuts" out the portion you select (certain fields or certain characters) and prints it to standard output. cut is analogous to clipping an article out of a newspaper: you have a big piece of paper (a flat file) and you want to grab a smaller snippet of it.
To solve the problem of gathering types, we will use cut's ability to split a line into delimited data fields and then select one in particular. Here is how cut handles it:
cat foods.txt | cut -d \| -f 2
There are two things happening in this command. The first parameter -d is telling cut that we want to use the pipe character as a delimiter. We need to escape it so that our shell does not interpret it as a regular pipe.
The second parameter, -f, tells cut that we want to print the second field of each input line. Doing this will yield the output below.
Type
Dessert
Dinner
Side
Dinner
Breakfast

Snack
We now have a list of all the food types. Notice that cut also printed an empty line for Apricot, since that field was empty. Also notice that the header row of the file was included: cut has no notion of a header and treats every line the same way. (This is usually desired behavior, but it is good to remember when you are in a hurry.)
We just saw how to retrieve the list of fields. Suppose now you get a new requirement: encode the food types based on the first two characters in the word; e.g. the code for "Dessert" is "De". You need to take the list you produced and strip it down so that only the first two characters of each word are shown.
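If the header row is not wanted, it can be dropped before cut ever sees it. The following is a minimal sketch, assuming a POSIX shell; it recreates the sample file with a here-document so it can be run on its own, and the variable names types and types_no_header are made up for illustration. It also shows that quoting the delimiter works as well as escaping it, and that cut can read the file directly.

```shell
# Recreate the sample file so the sketch is self-contained.
cat > foods.txt <<'EOF'
Name|Type|Rating
Apple Pie|Dessert|5
Cheeseburger|Dinner|5
Green Salad|Side|5
Ravioli|Dinner|5
Pancake|Breakfast|5
Apricot||4
Almonds|Snack|
EOF

# Quoting the pipe character has the same effect as escaping it with a
# backslash, and passing the file name to cut avoids the extra cat.
types=$(cut -d '|' -f 2 foods.txt)

# tail -n +2 starts printing at line 2, which skips the header row.
types_no_header=$(tail -n +2 foods.txt | cut -d '|' -f 2)

printf '%s\n' "$types_no_header"
```

The empty line for Apricot is still printed here; only the header is gone.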
To accomplish our task we will use the -c option.
The -c option takes an input line, cuts out the specified list of characters, joins them together, and writes the result to standard out. The list can be one or more character positions or ranges, comma-delimited. Since we want to grab just the first two characters of each line, all we have to do is modify our previous command to look like the following.
cat foods.txt | cut -d \| -f 2 | cut -c 1-2
By piping our previous command to cut again, we keep only the first two characters of each line. Notice that in our second cut command, we didn't need to specify a delimiter, since -c works on character positions in the raw line rather than on fields. Running the command produces the output below.
Ty
De
Di
Si
Di
Br

Sn
Notice that we still receive the empty line for Apricot. There are no characters on that line to cut, so cut simply prints it as an empty line.
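To get a feel for character lists, and for one way to drop the empty line, here is a small sketch, assuming a POSIX shell. The sample word, the variable names, and the use of grep to filter blank lines are my own choices for illustration, not part of the commands above.

```shell
# A single sample word keeps the character positions easy to check.
word="Breakfast"

# Characters 1 through 2.
first_two=$(printf '%s\n' "$word" | cut -c 1-2)

# Characters 1 and 3, given as a comma-delimited list.
first_and_third=$(printf '%s\n' "$word" | cut -c 1,3)

# Character 6 through the end of the line.
from_sixth=$(printf '%s\n' "$word" | cut -c 6-)

printf '%s %s %s\n' "$first_two" "$first_and_third" "$from_sixth"

# grep -v '^$' removes lines that are completely empty, so the blank
# entry between the two words does not reach the final list of codes.
codes=$(printf 'Dessert\n\nSnack\n' | cut -c 1-2 | grep -v '^$')

printf '%s\n' "$codes"
```

Whether to filter the empty line depends on the task: keeping it preserves a one-to-one match between input rows and codes, while dropping it gives a cleaner list.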
I hope you found this tutorial helpful. If you like my writing, feel free to subscribe to my feed or say hi on Twitter. Thanks for reading!