9 Sorting, First and Last Lines
Continuing with the
fasta_stats examples from chapter 8, “The Standard Streams,” the seventh column of the output contains the length of the longest perfect repeat, in characters.
Which sequence contains the longest perfect repeat? To answer this question, we could consider sorting the lines according to this seventh column. (First, we’ll have to remove the header lines themselves, which we can accomplish by filtering out lines matching
# using the
-v flag of grep, or by grepping for
unit:, as in chapter 8.) Enter
sorts lines from a text file (or from standard input) by specified columns:
sort <file> or
... | sort.
sort sorts by all columns, by comparing the entire lines in “dictionary,” or lexicographic order. To sort by specific columns, we need to use a rather sophisticated syntax. We’ll illustrate with a figure.
sort utility takes many potential parameters, though the most important are the
-k parameters that specify the columns by which to sort and how that sorting should be done, and occasionally the
-u flag. The
-k (key) parameters are considered in order; the above specifies that the sorting should be done on columns 2 through 4 (conglomerated into a single “column”), considering them in dictionary order, and sorting them in reverse. In the case of ties, only the first column is considered in normal dictionary order, and in the case of further ties, the fifth column is considered in numeric order. (The difference between
g ordering is that
g can handle entries in scientific notation like
1e-6, but generally
n is preferred because it is faster and not subject to small rounding errors.)
-u flag (which may be specified before or after the keys, or even mixed in) specifies that after all the keys are considered, if there are still any ties between rows, then only the first row should be output. It outputs only “unique” lines according to the overall sorting order.
sort uses whitespace as the column separator, though it can be changed (run
man sort for more information). To view information about the longest perfect repeat, we will use
sort -k7,7nr, indicating that we wish sort on the seventh column only, in reverse numeric order.
The first few lines of output indicate that the longest perfect repeat is 94 bases long and occurs in sequence
PZ805359 (this sequence’s GC content is
0, because it’s composed entirely of a long
The results also indicate that there are a number of ties; several sequences contain perfect repeats of length 18 base pairs. If we only wanted one sequence reported per different repeat length, we could try
sort -k7,7nr -u. We could alternatively modify our sort to secondarily sort by GC content (second column),
sort -k7,7nr -k2,2g.
A useful trick is to perform two sorts: one that initially sorts the data on whatever criteria are wanted, and a second that gets only the first line of a group based on secondary criteria. We may wish report only the highest GC content sequence per different repeat length, for example. We can initially use a
sort -k7,7nr -k2,2gr to sort by repeat length and break ties by GC content, leaving the highest-GC-content sequences on top. From there, we can use a
sort -k7,7nr -u to re-sort the data (even though they are already sorted!) by the seventh column, keeping only the top line per repeat length.
There is one small concern, however: how can we be sure that our careful ordering by GC content wasn’t undone in the second sort? After all, the second sort would technically be free to reorder ties according to the seventh column, resulting in incorrect output. There is an additional flag for
-s flag, indicating that stable sorting should be used. Stable sorting means that, in the case of ties, elements are left in their original order. So, to be safe, we could use a secondary sort of
sort -k7,7nr -u -s, though a careful reading of the documentation for
sort indicates that on most systems the
-u flag implies the
First and Last Lines
Often we wish to extract from a file (or from standard input) the first or last few lines. The tools
tail are designed to do exactly this, and in combination with other tools are surprisingly handy. The
head tool extracts the first lines of a file or standard input:
head -n <number> <file> or
... | head -n <number>. The
tail tool extracts the last lines of a file or standard input:
tail -n <number> <file> or
... | tail -n <number>.
tail utilities, like the others covered previously, write their output to standard output, and so they can be used within pipelines. They are often employed to inspect the beginning or end of a file (to check results or formatting). They also commonly extract test data sets. For example,
head -n 40000 input.fastq > test.fastq would extract the first 10,000 sequence records from
input.fastq and produce
test.fastq (because every four lines of a FASTQ sequence file represents information for a single sequence).
The above shows the first 12 lines of a FASTQ file generated on an Illumina HiSeq 2000. The first line in each set of four represents an identifier for the sequence read, the second line contains the sequence itself, the third line is often unused (containing only a
+, though it may be followed by the identifier and other optional data), and the fourth line contains the “quality” of each base in the sequence encoded as a character. (The encoding has varied in the past, but in recent years, sequencing companies have standardized the encoding used by Sanger sequencing machines.)
With a bit of modified syntax,
tail can also be used to extract all lines of a file starting at a given line. As an example,
tail -n +5 input.fastq > test.fastq would result in
test.fastq having all but the first sequence record of
input.fastq (i.e., starting at the fifth line). This feature is especially useful for stripping off header lines of output or files before further processing, as in
./fasta_stats pz_cDNAs.fasta | tail -n +9, rather than using
grep -v '#' above.
pz_cDNAs.fasta, the seventh column represents the length of the longest perfect repeat found in each sequence. Use only
tailto determine the median value in this column. (You may need to run multiple commands or pipelines.)
pz_cDNAs.fasta, how many different perfect repeat units (column six) are found?
- The file
pz_blastx_yeast_top10.txtis the result of running
blastx -query ../fasta_stats/pz_cDNAs.fasta -db orf_trans -evalue 1e-6 -max_target_seqs 10 -max_hsps 1 -outfmt 7 -out pz_blastx_yeast_top1.txt. Aside from the “comment” lines that start with
#, the first column is the query ID, the second the target (yeast) ID, the third the percentage identity of the HSP, the eleventh the E value, and the twelfth the “bitscore.” Which query ID had the largest bitscore? How many different query sequences (entries in the first column) had one or more HSPs against the database?
- Extract from
pz_blastx_yeast_top10.txta file called
pz_blastx_yeast_top1.txtcontaining only the smallest E-valued HSP line per query ID. (You may remove comment lines starting with
- A word of caution: if a column contains an entry that cannot be interpreted as an integer or general number, it will be treated as
0in the sorting order. ↵