In this blog entry, I would like to share my experience with R and data analysis. A few months back I developed a keen interest in the relatively recent field of next-generation sequencing (NGS) data analysis. Subsequently, I decided to work on the evolution of specific features in bacterial genomes. As you may know, bacterial genomes are enormously variable in their structure and composition (e.g., codon usage, GC bias, and gene copy number).
During the course of this project I came across the issue of non-independence when correlating two traits (e.g., genome size and genomic GC content). I was using R and used its correlation function to determine whether these two traits were correlated. The correlation estimate between genome size and GC% was 0.6. However, I soon found out that because a large number of species in my dataset share common ancestry, they are not independent data points. So, if I were to correlate any two traits, I would have to take the shared ancestry into account. The most widely used method for analyzing associations between continuous traits across species is phylogenetically independent contrasts (PIC) (Felsenstein, 1985). PIC essentially removes the effect of shared ancestry from the traits. As with most things, R had a package for this, called 'geiger', which I used to perform PIC. After taking the shared ancestry of the species into account, I obtained a correlation estimate of 0.47 between the same two traits.
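For readers who want to try the comparison themselves, here is a minimal sketch of the two steps described above: a naive correlation, then a correlation on independent contrasts. It is not my actual analysis. The tree and both traits are simulated stand-ins, the variable names (`genome_size`, `gc_content`) are made up for illustration, and the contrasts are computed with `pic()` from the 'ape' package rather than through 'geiger', so the numbers will not match the 0.6 and 0.47 quoted above.

```r
# Assumption: the 'ape' package is installed; it provides rtree(),
# rTraitCont() and pic().
library(ape)

set.seed(1)

# Simulated stand-in for a real phylogeny: a random, rooted, binary
# tree with 30 tips and branch lengths.
tree <- rtree(30)

# Simulated stand-ins for the two traits, each evolved along the tree
# under Brownian motion (the rTraitCont() default). The returned
# vectors are named by tip label.
genome_size <- rTraitCont(tree)
gc_content  <- rTraitCont(tree)

# Step 1: naive correlation that treats every species as independent.
raw_cor <- cor(genome_size, gc_content)

# Step 2: phylogenetically independent contrasts for each trait.
# A tree with n tips yields n - 1 contrasts.
pic_size <- pic(genome_size, tree)
pic_gc   <- pic(gc_content, tree)

# Contrasts have an expected mean of zero, so the correlation is
# computed through the origin instead of with cor().
pic_cor <- sum(pic_size * pic_gc) /
  sqrt(sum(pic_size^2) * sum(pic_gc^2))

print(raw_cor)
print(pic_cor)
```

With real data you would replace the simulated tree with your own phylogeny (for example, one read in with `read.tree()`) and the simulated traits with named vectors whose names match the tip labels of the tree.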
Whitlock and Schluter cite a similar example with a dataset of 17 lily species, in which closely related species tend to share the same flower type more often than slightly more distantly related species do.