集団構造解析①PCA（主成分分析-FlashPCAの場合）

基本的な集団構造解析の1つ目、主成分分析（Principal Component Analysis; PCA）について説明します。使いやすい解析ツールや図示例も合わせて解説しますので、是非最後までお読みください。

Contents

1. 主成分分析（PCA）とは？
2. PCAを行うための準備
- 2.1. 遺伝子型ファイルの準備
- 2.2. FlashPCA2のダウンロード
3. 解析手順
4. References

1. 主成分分析（PCA）とは？

集団遺伝学の手法というよりは、統計学のデータ解析手法のひとつです。VCFファイルに記載されている膨大な遺伝子型情報を数値化して、通常は2次元の合成変数に縮約して表示します。例えば、100品種、10,000 SNPsのVCFファイルには、100×10,000 = 1,000,000個の遺伝子型データがあることになり、目で見て特徴を理解することはもちろん、図示することもできません。そこで、この100品種分、10,000次元のデータを、特徴をよく表す2次元のデータに要約してあげようというのがPCAです。詳しい原理は下記のようにわかりやすく説明したサイトがたくさんありますので、ここでは割愛します。

主成分分析とは｜市場調査ならインテージ

【マーケティング用語集】主成分分析とは｜複数の量的な説明変数を、より少数の主成分という指標（合成変数）に要約・集約する分析手法です。

2. PCAを行うための準備

2.1. 遺伝子型ファイルの準備

解析用の遺伝子型ファイルは、事前にフィルタリングしておくことが必要です。下記の記事を参考に、フィルタリングを行い、bedファイルとして出力してください。

集団構造解析を行うための準備

ここでは、集団構造解析を行うにあたり必要な準備について解説します。主に解析用の遺伝子型ファイル（VCFファイル）のフィルタリングの手順について述べます。1. フィルタリング用ツールのインストール1.1. VCFtoolsの...

2.2. FlashPCA2のダウンロード

GitHub - gabraham/flashpca: Fast Principal Component Analysis of Large-Scale Genome-Wide Data

Fast Principal Component Analysis of Large-Scale Genome-Wide Data - gabraham/flashpca

PCAを行う方法はいろいろありますが、管理者はこのFlashPCA2が今のところ最も使いやすいツールだと思います。解析はコマンド一行で終了です。ただ、ソースコードからインストールしようとすると少しハードルが高い（途中で引っかかるポイントがたくさんあり、管理者もできませんでした）ため、ビルド済みのバイナリファイルをダウンロードして使いましょう。

mkdir flashpca2 　#インストール用ディレクトリ作成
cd flashpca2
wget https://github.com/gabraham/flashpca/releases/download/v2.0/flashpca_x86-64.gz　#バイナリファイルのダウンロード
gunzip flashpca_x86-64.gz     #圧縮ファイルの解凍
chmod 755 flashpca_x86-64.gz    #実行権限の付与
mv flashpca_x86-64.gz flashpca   #ファイル名が長いので変更

3. 解析手順

FlashPCA2のヘルプを呼び出してみました。必須のオプションに緑色でコメントをつけています。

flashpca --help
flashpca 2.0
Options:
  --help                  produce help message
  --scca                  perform sparse canonical correlation analysis (SCCA)
                          [EXPERIMENTAL]
  --ucca                  perform per-SNP canonical correlation analysis
                          [EXPERIMENTAL]
  -p [ --project ]        project new samples onto existing principal
                          components
  --batch                 load all genotypes into RAM at once
  -m [ --memory ] arg     size of block, in MB
  -b [ --blocksize ] arg  size of block for, in number of SNPs
  -n [ --numthreads ] arg set number of OpenMP threads
  --seed arg              set random seed
  --bed arg               PLINK bed file　　　　　#PLINKで作成したbedファイルを指定
  --bim arg               PLINK bim file
  --fam arg               PLINK fam file
  --pheno arg             PLINK phenotype file
  --bfile arg             PLINK root name
  -d [ --ndim ] arg       number of PCs to output
  -s [ --standx ] arg     standardization method for genotypes [binom2 | binom]
  --standy arg            standardization method for phenotypes [sd | binom2 |
                          binom | none | center]
  --div arg               whether to divide the eigenvalues by p, n - 1, or
                          don't divide [p | n1 | none]
  --outpc arg             PC output file
  --outpcx arg            X PC output file, for CCA
  --outpcy arg            Y PC output file, for CCA
  --outvec arg            eigenvector output file
  --outload arg           SNP loadings
  --outvecx arg           X eigenvector output file, for CCA
  --outvecy arg           Y eigenvector output file, for CCA
  --outval arg            Eigenvalue output file
  --outpve arg            proportion of variance explained output file
  --outmeansd arg         mean+SD (used to standardize SNPs) output file
  --outproj arg           PCA projection output file
  --inload arg            SNP loadings input file
  --inmeansd arg          mean+SD (used to standardize SNPs) input file
  --inmaf arg             MAF input file
  -v [ --verbose ]        verbose
  --tol arg               tolerance for PCA iterations
  --lambda1 arg           1st penalty for CCA/SCCA
  --lambda2 arg           2nd penalty for CCA/SCCA
  --maxiter arg           maximum number of SCCA iterations
  --debug                 debug, dumps all intermediate data (WARNING: slow,
                          call only on small data)
  -f [ --suffix ] arg     suffix for all output files
  -c [ --check ]          check eigenvalues/eigenvectors
  --precision arg         digits of precision for output
  --notime                don't print timestamp in output
  --save-vinit            saves the initial v eigenvector for SCCA
  --version               version

いろいろとオプションがありますが、実際に使うのは–bfileオプションだけです。コマンドは次の一行で終わりです。

flashpca --bfile minDP8_mm0.2_imputed_maf0.03hwe_pruned   #フィルタリング済bedファイルを指定（.bedはつけない）

[Sun Sep  4 21:50:10 2022] Start flashpca (version 2.0)
[Sun Sep  4 21:50:10 2022] seed: 1
[Sun Sep  4 21:50:10 2022] blocksize: 5365 (4549520 bytes per block)
[Sun Sep  4 21:50:10 2022] PCA begin
[Sun Sep  4 21:50:10 2022] PCA done
[Sun Sep  4 21:50:10 2022] Writing 10 eigenvalues to file eigenvalues.txt
[Sun Sep  4 21:50:10 2022] Writing 10 eigenvectors to file eigenvectors.txt
[Sun Sep  4 21:50:10 2022] Writing 10 PCs to file pcs.txt
[Sun Sep  4 21:50:10 2022] Writing 10 proportion variance explained to file pve.txt
[Sun Sep  4 21:50:10 2022] Goodbye!

「pcs.txt」にPC1～10の主成分得点が、「pve.txt」にその寄与率がそれぞれ出力されます。

4. References

Gad Abraham, Yixuan Qiu, Michael Inouye, FlashPCA2: principal component analysis of Biobank-scale genotype datasets, Bioinformatics, Volume 33, Issue 17, 01 September 2017, Pages 2776–2778, https://doi.org/10.1093/bioinformatics/btx299

Follow me!