Situatie
A Comma Separated Values file is a text file that holds tabulated data. CSV is a type of delimited data. As the name suggests, a comma “,
” is used to separate each field of data—or value—from its neighbors.
CSV is everywhere. If an application has import and export functions, it’ll almost always support CSV. CSV files are human-readable. You can look inside them with less, open them in any text editor, and move them from program to program. For example, you can export the data from an SQLite database and open it in LibreOffice Calc.
However, even CSV can become complicated. Want to have a comma in a data field? That field needs to have quotation marks “"
” wrapped around it. To include quotation marks in a field each quotation mark needs to be entered twice.
Of course, if you are working with CSV generated by a program or script that you have written, the CSV format is likely to be simple and straightforward. If you’re forced to work with more complex CSV formats, with Linux being Linux, there are solutions we can use for that too.
Solutie
You can easily generate some sample CSV data, using sites like Online Data Generator. You can define the fields you want and choose how many rows of data you want. Your data is generated using realistic dummy values and downloaded to your computer.
We created a file containing 50 rows of dummy employee information:
- id: A simple unique integer value.
- firstname: The first name of the person.
- lastname: The last name of the person.
- job-title: The person’s job title.
- email-address: The person’s email address.
- branch: The company branch they work in.
- state: The state the branch is located in.
Some CSV files have a header line that lists the field names. Our sample file has one. Here’s the top of our file:
The first line holds the field names as comma-separated values.
Parsing Data Form the CSV file
Let’s write a script that will read the CSV file and extract the fields from each record. Copy this script into an editor, and save it to a file called “field.sh.”
#! /bin/bash while IFS="," read -r id firstname lastname jobtitle email branch state do echo "Record ID: $id" echo "Firstname: $firstname" echo " Lastname: $lastname" echo "Job Title: $jobtitle" echo "Email add: $email" echo " Branch: $branch" echo " State: $state" echo "" done < <(tail -n +2 sample.csv)
There’s quite a bit packed into our little script. Let’s break it down.
We’re using a while
loop. As long as the while
loop condition resolves to true, the body of the while
loop will be executed. The body of the loop is quite simple. A collection of echo
statements are used to print the values of some variables to the terminal window.
The while
loop condition is more interesting than the body of the loop. We specify that a comma should be used as the internal field separator, with the IFS=","
statement. The IFS is an environment variable. The read
command refers to its value when parsing sequences of text.
We’re using the read
command’s -r
(retain backslashes) option to ignore any backslashes that may be in the data. They’ll be treated as regular characters.
The text that the read
command parses is stored in a set of variables named after the CSV fields. They could just as easily have been named field1, field2, ... field7
, but meaningful names make life easier.
The data is obtained as the output from the tail
command. We’re using tail
because it gives us a simple way to skip over the header line of the CSV file. The -n +2
(line number) option tells tail
to start reading at line number two.
The <(...)
construct is called process substitution. It causes Bash to accept the output of a process as though it were coming from a file descriptor. This is then redirected into the while
loop, providing the text that the read
command will parse.
Make the script executable using the chmod
command. You’ll need to do this each time you copy a script from this article. Substitute the name of the appropriate script in each case.
chmod +x field.sh
When we run the script, the records are correctly split into their constituent fields, with each field stored in a different variable.
./field.sh
Each record is printed as a set of fields.
Selecting Fields
Perhaps we don’t want or need to retrieve every field. We can obtain a selection of fields by incorporating the cut
command.
This script is called “select.sh.”
#!/bin/bash while IFS="," read -r id jobtitle branch state do echo "Record ID: $id" echo "Job Title: $jobtitle" echo " Branch: $branch" echo " State: $state" echo "" done < <(cut -d "," -f1,4,6,7 sample.csv | tail -n +2)
We’ve added the cut
command into the process substitution clause. We’re using the -d
(delimiter) option to tell cut
to use commas “,
” as the delimiter. The -f
(field) option tells cut
we want fields one, four, six, and seven. Those four fields are read into four variables, which get printed in the body of the while
loop.
This is what we get when we run the script.
./select.sh
By adding the cut
command, we’re able to select the fields we want and ignore the ones we don’t.
If the CSV you deal with is uncomplicated without commas or quotation marks in field data, what we’ve covered will probably meet your CSV parsing needs. To show the problems we can encounter, we modified a small sample of the data to look like this.
id,firstname,lastname,job-title,email-address,branch,state 1,Rosalyn,Brennan,"Steward, Senior",Rosalyn_Brennan4351@mafthy.com,Minneapolis,Maryland 2,Danny,Redden,"Analyst ""Budget""",Danny_Redden1443@brety.org,Venice,North Carolina 3,Lexi,Roscoe,Pharmacist,,Irlington,Vermont
- Record one has a comma in the
job-title
field, so the field needs to be wrapped in quotation marks. - Record two has a word wrapped in two sets of quotation marks in the
jobs-title
field. - Record three has no data in the
email-address
field.
This data was saved as “sample2.csv.” Modify your “field.sh” script to call the “sample2.csv”, and save it as “field2.sh.”
#! /bin/bash while IFS="," read -r id firstname lastname jobtitle email branch state do echo "Record ID: $id" echo "Firstname: $firstname" echo " Lastname: $lastname" echo "Job Title: $jobtitle" echo "Email add: $email" echo " Branch: $branch" echo " State: $state" echo "" done < <(tail -n +2 sample2.csv)
When we run this script, we can see cracks appearing in our simple CSV parsers.
./field2.sh
The first record splits the job-title field into two fields, treating the second part as the email address. Every field after this is shifted one place to the right. The last field contains both the branch
and the state
values.
The second record retains all quotation marks. It should only have a single pair of quotation marks around the word “Budget.”
The third record actually handles the missing field as it should. The email address is missing, but everything else is as it should be.
Counterintuitively, for a simple data format, it is very difficult to write a robust general-case CSV parser. Tools like awk
will let you get close, but there are always edge cases and exceptions that slip through.
Trying to write an infallible CSV parser is probably not the best way forward. An alternative approach—especially if you’re working to a deadline of some sort—employs two different strategies.
One is to use a purpose-designed tool to manipulate and extract your data. The second is to sanitize your data and replace problem scenarios such as embedded commas and quotation marks. Your simple Bash parsers can then cope with the Bash-friendly CSV.
The CSV toolkit csvkit
is a collection of utilities expressly created to help work with CSV files. You’ll need to install it on your computer.
To install it on Ubuntu, use this command:
sudo apt install csvkit
To install it on Fedora, you need to type:
sudo dnf install python3-csvkit
On Manjaro the command is:
sudo pacman -S csvkit
If we pass the name of a CSV file to it, the csvlook
utility displays a table showing the contents of each field. The field content is displayed to show what the field contents represent, not as they’re stored in the CSV file.
Let’s try csvlook
with our problematic “sample2.csv” file.
csvlook sample2.csv
All of the fields are correctly displayed. This proves the problem isn’t the CSV. The problem is our scripts are too simplistic to interpret the CSV correctly.
To select specific columns, use the csvcut
command. The -c
(column) option can be used with field names or column numbers, or a mix of both.
Suppose we need to extract the first and last names, job titles, and email addresses from each record, but we want to have the name order as “last name, first name.” All we need to do is put the field names or numbers in the order we want them.
These three commands are all equivalent.
csvcut -c lastname,firstname,job-title,email-address sample2.csv
csvcut -c lastname,firstname,4,5 sample2.csv
csvcut -c 3,2,4,5 sample2.csv
We can add the csvsort
command to sort the output by a field. We’re using the -c
(column) option to specify the column to sort by, and the -r
(reverse) option to sort in descending order.
csvcut -c 3,2,4,5 sample2.csv | csvsort -c 1 -r
To make the output prettier we can feed it through csvlook
.
csvcut -c 3,2,4,5 sample2.csv | csvsort -c 1 -r | csvlook
A neat touch is that, even though the records are sorted, the header line with the field names is kept as the first line. Once we’re happy we have the data the way we want it we can remove the csvlook
from the command chain, and create a new CSV file by redirecting the output into a file.
We added more data to the “sample2.file”, removed the csvsort
command, and created a new file called “sample3.csv.”
csvcut -c 3,2,4,5 sample2.csv > sample3.csv
If you open a CSV file in LibreOffice Calc, each field will be placed in a cell. You can use the find and replace function to search for commas. You could replace them with “nothing” so that they vanish, or with a character that won’t affect the CSV parsing, like a semi-colon “;
” for example.
You won’t see the quotation marks around quoted fields. The only quotation marks you’ll see are the embedded quotation marks inside field data. These are shown as single quotation marks. Finding and replacing these with a single apostrophe “'
” will replace the double quotation marks in the CSV file.
Doing the find and replace in an application like LibreOffice Calc means you can’t accidentally delete any of the field separator commas, nor delete the quotation marks around quoted fields. You’ll only change the data values of fields.
We changed all commas in fields with semicolons and all embedded quotation marks with apostrophes and saved our changes.
We then created a script called “field3.sh” to parse “sample3.csv.”
#! /bin/bash while IFS="," read -r lastname firstname jobtitle email do echo " Lastname: $lastname" echo "Firstname: $firstname" echo "Job Title: $jobtitle" echo "Email add: $email" echo "" done < <(tail -n +2 sample3.csv)
Let’s see what we get when we run it.
./field3.sh
Leave A Comment?