How to Parse CSV Data in Bash

Configurare noua (How To)

Situatie

A Comma Separated Values file is a text file that holds tabulated data. CSV is a type of delimited data. As the name suggests, a comma “,” is used to separate each field of data—or value—from its neighbors.

CSV is everywhere. If an application has import and export functions, it’ll almost always support CSV. CSV files are human-readable. You can look inside them with less, open them in any text editor, and move them from program to program. For example, you can export the data from an SQLite database and open it in LibreOffice Calc.

However, even CSV can become complicated. Want to have a comma in a data field? That field needs to have quotation marks “"” wrapped around it. To include quotation marks in a field each quotation mark needs to be entered twice.

Of course, if you are working with CSV generated by a program or script that you have written, the CSV format is likely to be simple and straightforward. If you’re forced to work with more complex CSV formats, with Linux being Linux, there are solutions we can use for that too.

Solutie

You can easily generate some sample CSV data, using sites like Online Data Generator. You can define the fields you want and choose how many rows of data you want. Your data is generated using realistic dummy values and downloaded to your computer.

We created a file containing 50 rows of dummy employee information:

  • id: A simple unique integer value.
  • firstname: The first name of the person.
  • lastname: The last name of the person.
  • job-title: The person’s job title.
  • email-address: The person’s email address.
  • branch: The company branch they work in.
  • state: The state the branch is located in.

Some CSV files have a header line that lists the field names. Our sample file has one. Here’s the top of our file:

The sample CSV file

The first line holds the field names as comma-separated values.

Parsing Data Form the CSV file

Let’s write a script that will read the CSV file and extract the fields from each record. Copy this script into an editor, and save it to a file called “field.sh.”

#! /bin/bash

while IFS="," read -r id firstname lastname jobtitle email branch state
do
  echo "Record ID: $id"
  echo "Firstname: $firstname"
  echo " Lastname: $lastname"
  echo "Job Title: $jobtitle"
  echo "Email add: $email"
  echo " Branch: $branch"
  echo " State: $state"
  echo ""
done < <(tail -n +2 sample.csv)

There’s quite a bit packed into our little script. Let’s break it down.

We’re using a while loop. As long as the while loop condition resolves to true, the body of the while loop will be executed. The body of the loop is quite simple. A collection of echo statements are used to print the values of some variables to the terminal window.

The while loop condition is more interesting than the body of the loop. We specify that a comma should be used as the internal field separator, with the IFS="," statement. The IFS is an environment variable. The read command refers to its value when parsing sequences of text.

We’re using the read command’s -r (retain backslashes) option to ignore any backslashes that may be in the data. They’ll be treated as regular characters.

The text that the read command parses is stored in a set of variables named after the CSV fields. They could just as easily have been named field1, field2, ... field7 , but meaningful names make life easier.

The data is obtained as the output from the tail command. We’re using tail because it gives us a simple way to skip over the header line of the CSV file. The -n +2 (line number) option tells tail to start reading at line number two.

The <(...) construct is called process substitution. It causes Bash to accept the output of a process as though it were coming from a file descriptor. This is then redirected into the while loop, providing the text that the read command will parse.

Make the script executable using the chmod command. You’ll need to do this each time you copy a script from this article. Substitute the name of the appropriate script in each case.

chmod +x field.sh

Making a script executable with chmod

When we run the script, the records are correctly split into their constituent fields, with each field stored in a different variable.

./field.sh

The CSV file parsed by the field.sh script.

Each record is printed as a set of fields.

Selecting Fields

Perhaps we don’t want or need to retrieve every field. We can obtain a selection of fields by incorporating the cut command.

This script is called “select.sh.”

#!/bin/bash

while IFS="," read -r id jobtitle branch state
do
  echo "Record ID: $id"
  echo "Job Title: $jobtitle"
  echo " Branch: $branch"
  echo " State: $state"
  echo ""
done < <(cut -d "," -f1,4,6,7 sample.csv | tail -n +2)

We’ve added the cut command into the process substitution clause. We’re using the -d (delimiter) option to tell cut to use commas “,” as the delimiter. The -f (field) option tells cut we want fields one, four, six, and seven. Those four fields are read into four variables, which get printed in the body of the while loop.

This is what we get when we run the script.

./select.sh

Parsing the CSV file with field.sh to extract a specific selection of fields

By adding the cut command, we’re able to select the fields we want and ignore the ones we don’t.

If the CSV you deal with is uncomplicated without commas or quotation marks in field data, what we’ve covered will probably meet your CSV parsing needs. To show the problems we can encounter, we modified a small sample of the data to look like this.

id,firstname,lastname,job-title,email-address,branch,state
1,Rosalyn,Brennan,"Steward, Senior",Rosalyn_Brennan4351@mafthy.com,Minneapolis,Maryland
2,Danny,Redden,"Analyst ""Budget""",Danny_Redden1443@brety.org,Venice,North Carolina
3,Lexi,Roscoe,Pharmacist,,Irlington,Vermont
  • Record one has a comma in the job-title field, so the field needs to be wrapped in quotation marks.
  • Record two has a word wrapped in two sets of quotation marks in the jobs-title field.
  • Record three has no data in the email-address field.

This data was saved as “sample2.csv.” Modify your “field.sh” script to call the “sample2.csv”, and save it as “field2.sh.”

#! /bin/bash

while IFS="," read -r id firstname lastname jobtitle email branch state
do
  echo "Record ID: $id"
  echo "Firstname: $firstname"
  echo " Lastname: $lastname"
  echo "Job Title: $jobtitle"
  echo "Email add: $email"
  echo " Branch: $branch"
  echo " State: $state"
  echo ""
done < <(tail -n +2 sample2.csv)

When we run this script, we can see cracks appearing in our simple CSV parsers.

./field2.sh

Running the field2.sh

The first record splits the job-title field into two fields, treating the second part as the email address. Every field after this is shifted one place to the right. The last field contains both the branch and the state values.

A record with a field split into two fields

The second record retains all quotation marks. It should only have a single pair of quotation marks around the word “Budget.”

A record with mishandled quotation marks

The third record actually handles the missing field as it should. The email address is missing, but everything else is as it should be.

A record with a missing field, which is handled correctly

Counterintuitively, for a simple data format, it is very difficult to write a robust general-case CSV parser. Tools like awk will let you get close, but there are always edge cases and exceptions that slip through.

Trying to write an infallible CSV parser is probably not the best way forward. An alternative approach—especially if you’re working to a deadline of some sort—employs two different strategies.

One is to use a purpose-designed tool to manipulate and extract your data. The second is to sanitize your data and replace problem scenarios such as embedded commas and quotation marks. Your simple Bash parsers can then cope with the Bash-friendly CSV.

The CSV toolkit csvkit is a collection of utilities expressly created to help work with CSV files. You’ll need to install it on your computer.

To install it on Ubuntu, use this command:

sudo apt install csvkit

Installing csvkit on Ubuntu

To install it on Fedora, you need to type:

sudo dnf install python3-csvkit

Installing csvkit on Fedora

On Manjaro the command is:

sudo pacman -S csvkit

Installing csvkit on Manjaro

If we pass the name of a CSV file to it, the csvlook utility displays a table showing the contents of each field. The field content is displayed to show what the field contents represent, not as they’re stored in the CSV file.

Let’s try csvlook with our problematic “sample2.csv” file.

csvlook sample2.csv

troublesome CSV correctly parsed by csvlook

All of the fields are correctly displayed. This proves the problem isn’t the CSV. The problem is our scripts are too simplistic to interpret the CSV correctly.

To select specific columns, use the csvcut command. The -c (column) option can be used with field names or column numbers, or a mix of both.

Suppose we need to extract the first and last names, job titles, and email addresses from each record, but we want to have the name order as “last name, first name.” All we need to do is put the field names or numbers in the order we want them.

These three commands are all equivalent.

csvcut -c lastname,firstname,job-title,email-address sample2.csv
csvcut -c lastname,firstname,4,5 sample2.csv
csvcut -c 3,2,4,5 sample2.csv

Picking fields in a preferred order with csvcut

We can add the csvsort command to sort the output by a field. We’re using the -c (column) option to specify the column to sort by, and the -r (reverse) option to sort in descending order.

csvcut -c 3,2,4,5 sample2.csv | csvsort -c 1 -r

Picking fields and sorting them by a single column

To make the output prettier we can feed it through csvlook .

csvcut -c 3,2,4,5 sample2.csv | csvsort -c 1 -r | csvlook

Using csvlook to pretty print the sorted selection of fields

A neat touch is that, even though the records are sorted, the header line with the field names is kept as the first line. Once we’re happy we have the data the way we want it we can remove the csvlook from the command chain, and create a new CSV file by redirecting the output into a file.

We added more data to the “sample2.file”, removed the csvsort command, and created a new file called “sample3.csv.”

csvcut -c 3,2,4,5 sample2.csv > sample3.csv

If you open a CSV file in LibreOffice Calc, each field will be placed in a cell. You can use the find and replace function to search for commas. You could replace them with “nothing” so that they vanish, or with a character that won’t affect the CSV parsing, like a semi-colon “;” for example.

You won’t see the quotation marks around quoted fields. The only quotation marks you’ll see are the embedded quotation marks inside field data. These are shown as single quotation marks. Finding and replacing these with a single apostrophe “'” will replace the double quotation marks in the CSV file.

Using LibreOffice Calc's find and replace to replace quotation marks with apostrophes

Doing the find and replace in an application like LibreOffice Calc means you can’t accidentally delete any of the field separator commas, nor delete the quotation marks around quoted fields. You’ll only change the data values of fields.

We changed all commas in fields with semicolons and all embedded quotation marks with apostrophes and saved our changes.

The modified CSV file

We then created a script called “field3.sh” to parse “sample3.csv.”

#! /bin/bash

while IFS="," read -r lastname firstname jobtitle email
do
  echo " Lastname: $lastname"
  echo "Firstname: $firstname"
  echo "Job Title: $jobtitle"
  echo "Email add: $email"
  echo ""
done < <(tail -n +2 sample3.csv)

Let’s see what we get when we run it.

./field3.sh

A section of correctly parsed CSV

Tip solutie

Permanent

Voteaza

(9 din 23 persoane apreciaza acest articol)

Despre Autor

Leave A Comment?