大脸猫爱吃鱼

主成分分析法

字数统计: 3.5k阅读时长: 18 min
2018/12/16 Share

毕业设计记录之2

下面首先介绍一下主成分分析法。

主成分分析法(Principal Component Analysis,PCA)

主成分分析法是一种揭示大样本、多变量数据中各变量或样本之间内在关系的一种方法,其主要作用是降低观测空间的维数,以获取最主要的信息。假设我们的研究对象是一个有n个天体组成的样本,每个天体有m个观测参量(在这里我们的参量是一个个波长点),则观测量可表示为矩阵$ X=(x {ij}){n \times m} PCAm线。PCA方法类似于多元回归分析,即利用m个原始观测变量的线性组合得到的 m’(m’ \ll m) mprincipalcomponentpc个既能综合反映原来m个参量的信息且彼此间又相互独立的新变量来描述原始数据。这些新变量被称作主成分(principal component,pc),与观测量x$之间的关系可以表示为:

pc=eX=e1xk1+e2xk2++eixki++emxkmpc=eX={e}_{1}x_{k1}+{e}_{2}x_{k2}+ \dots +{e}_{i}x_{ki}+ \dots +{e}_{m}x_{km}

其中X=(xij)n×mX=(x _{ij})_{n \times m}为观测矩阵,pcpc是主成分,ee为待求的m维特征向量。当ee给定后,对mm个观测量就可以求出一个出成分。主成分pcpc应尽可能多的反映原观测量具有的信息,且彼此互不相关;随机变量的信息可由其方差大小表示,而不同的特征向量ee可以有不同的方差,PCA就是寻求使pc的方差达到最大的一组特征向量ee
初始的观测矩阵x为:

X=(x11x12x1mx21x22x2mxn1xn2xnm)\mathbf{X} = \left( \begin{array}{cccc} x_{11} & x_{12} & \ldots & x_{1m} \\ x_{21} & x_{22} & \ldots & x_{2m} \\ \vdots & \vdots & \vdots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{nm} \end{array} \right)

矩阵中的行矢量对应着一条光谱,列矢量对应着波长。主成分分析法的任务是寻求是pcpc的方差达到最大的一组特征向量ee。根据最小二乘法原理,此处的ee为观测矩阵x的协方差矩阵’C=(Cjk)m×mC=(C_{jk})_{m \times m}的正交特征矢量,其中

Cjk=1n1i=1n (xijmj)(xikmk)1j,kmC_{jk}=\frac{1}{n-1}\sum_{i=1}^n\ (x_{ij}-m_{j})(x_{ik}-m_{k}) \qquad 1\leq j,k\leq m

xˉj=1ni=1n xijxˉk=1ni=1n xik\bar{x}_{j}=\frac{1}{n}\sum_{i=1}^{n}\ x_{ij} \qquad \bar{x}_{k}=\frac{1}{n}\sum_{i=1}^{n}\ x_{ik}

xi,xjx_{i},x_{j}为列矢量的平均值。通过求解行列式方程ClI=0|C-lI|=0,我们将得到满足方程pc=eXpc=eX的组合,其中ll为行列式的特征根,IIm×m(m \times m)的单位的单位矩阵。求得特征根之后,再由下式求得特征矢量ee

(ClI)ei=0(C-lI)e_{i}=0

方程$ (C-lI)e_{i}=0 m个特征根,根据它们的大小顺序,表示为l_{1} \geq l_{2} \geq l_{3} \ldots \geq l_{m} \geq 0。对应用每一个特征根l_{i},有一个特征矢量e_{i},可以得到一个主成分pc_{i},称作第i个主成分。其中最大的特征根l_{1}$所对应的是第一主成分。

  • 具体计算
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
pro get_espec
wavemin = 3700.
wavemax = 5500.

;;;;;;;;;;;;;;;; read in models ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
niw = 1941
nmod = 25000
wavein = make_array(niw,/float)
fluxin = make_array(niw,nmod,/float)

openr,lun,'wave.txt',/Get_lun
readf,lun,wavein
free_lun,lun
openr,lun,'spec_vdisp.txt',/Get_lun
readf,lun,fluxin
free_lun,lun

;;;;;;;;;;;;;;;;; set the range of wavelength for PCA analysis ;;;;;;
indespec = where(wavein ge wavemin and wavein le wavemax)
wave = wavein[indespec]
flux = fluxin[indespec,*]
flag = replicate(1.,n_elements(wave))

;;;;;;;;;;;;;;;;; set mask regions;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
maskem = -9L
;------------------------------------------------------------------------------
; Mask out emission lines
; OII OII H8 NeIII Hg Hd Hb OIII
em = [3726.03, 3728.82, 3889.05, 3869.06, 4101.73, 4340.46, 4861.33, 4959.91, $
; OIII He I OI NII Ha NII SII SII
5006.84, 5875.67, 6300.30, 6548.04, 6562.82, 6583.41, 6716.44, 6730.81]
emmaskw = 500.
dz = emmaskw / 3.e5 ; clipping interval
for iem = 0, n_elements(em) - 1 do begin
maskout = where(wave gt em[iem]*(1-dz) and wave lt em[iem]*(1+dz))
if maskout[0] ne -1 then maskem = [maskem, maskout]
endfor
maskem = maskem[1:n_elements(maskem)-1]
flag[maskem] = 0.
indok = where(flag ne 0., ctpix, compl = compl)
wave_espec = wave[indok]
flux_espec = flux[indok,*]

;;;;;;;;;;;;;;;;;;; normalize models and subtract mean;;;;;;;;;;;;;;;;;;;;;;;;;;
flux_norm = make_array(ctpix,nmod,/float)
for ii = 0L,nmod-1 do begin
norm_mod = mean(flux_espec[*,ii])
flux_norm[*,ii] = flux_espec[*,ii]/norm_mod
endfor
meanarr = TOTAL(flux_norm,2,/nan) / double(nmod)

openw,lun,'flux.txt',/Get_lun
for ii=0,nmod-1 do printf,lun,flux_norm[*,ii],format='(1601(F10.3))'
free_lun,lun

for ii = 0L,nmod-1 do flux_norm[*,ii] = flux_norm[*,ii]-meanarr

;;;;;;;;;;;;;;;;;;; do PCA ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
help,flux_norm
A=flux_norm # transpose(flux_norm) ;covariance matrix
help,A
A=A/(nmod-1)
trired, A, D, E ,/double
triql, D, E, A, /double
index=sort(D)
D=D[index]
E=E[index]
A=A[*,index]
W=100.0 * reverse(D)/total(D)
eigenval = reverse(D)
espec =reverse(transpose(A))
;espec=A
;espec =reverse(A)
print,max(D)
percentages=W
print,'max(D)',max(D),'min(D)',min(D)
help,espec,eigenval,percentages
print,espec[0,*]

;espec=vwpca_svd(flux_norm,variances=variances,pcs=pcs,evalues=evalues)

;;;;;;;;;;;;;;;;;;;;;;;; output eigenspecta ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
wave_out = wave
espec1_out = replicate(-99., n_elements(wave))
espec1_out[indok] = espec[0,*]
espec2_out = replicate(-99., n_elements(wave))
espec2_out[indok] = espec[1,*]
espec3_out = replicate(-99., n_elements(wave))
espec3_out[indok] = espec[2,*]
espec4_out = replicate(-99., n_elements(wave))
espec4_out[indok] = espec[3,*]
espec5_out = replicate(-99., n_elements(wave))
espec5_out[indok] = espec[4,*]
espec6_out = replicate(-99., n_elements(wave))
espec6_out[indok] = espec[5,*]
espec7_out = replicate(-99., n_elements(wave))
espec7_out[indok] = espec[6,*]
espec8_out = replicate(-99., n_elements(wave))
espec8_out[indok] = espec[7,*]
espec9_out = replicate(-99., n_elements(wave))
espec9_out[indok] = espec[8,*]
espec10_out = replicate(-99., n_elements(wave))
espec10_out[indok] = espec[9,*]

meanarr_out = replicate(-99., n_elements(wave))
meanarr_out[indok] = meanarr
help,meanarr_out,meanarr,espec10_out

espec_out=replicate(-99., n_elements(wave))
openw,lun,'eigenvect.txt',/Get_lun
for ii=0,1600 do begin
espec_out[indok] = espec[*,ii]
printf,lun,espec_out,format='(1721(F10.3))'
end
free_lun,lun
openw,lun,'espec.txt',/get_lun
printf,lun,wave_out,format='(1721(F10.3))'
printf,lun,meanarr_out,format='(1721(F10.3))'
printf,lun,espec1_out,format='(1721(F10.3))'
printf,lun,espec2_out,format='(1721(F10.3))'
printf,lun,espec3_out,format='(1721(F10.3))'
printf,lun,espec4_out,format='(1721(F10.3))'
printf,lun,espec5_out,format='(1721(F10.3))'
printf,lun,espec6_out,format='(1721(F10.3))'
printf,lun,espec7_out,format='(1721(F10.3))'
printf,lun,espec8_out,format='(1721(F10.3))'
printf,lun,espec9_out,format='(1721(F10.3))'
printf,lun,espec10_out,format='(1721(F10.3))'
free_lun,lun


meanarr_out[compl] = 'Nan'
espec1_out[compl] = 'Nan'
espec2_out[compl] = 'Nan'
espec3_out[compl] = 'Nan'
espec4_out[compl] = 'Nan'
espec5_out[compl] = 'Nan'
espec6_out[compl] = 'Nan'
espec7_out[compl] = 'Nan'

set_plot, 'PS'
device, FILE='pcs.eps', /ENCAPSULATED
device,xsize=21
device,ysize=30
;draw mean spec

plot,wave_out,meanarr_out,Thick=3,XRange=[3700,5500],YRange=[0.0,1.5],$
xstyle=1,ystyle=1,position=[0.1,0.8,0.9,0.9]
XYouts,0.15,0.87,'mean spec',/Normal

;draw pc1
plot,wave_out,espec1_out,Thick=3,XRange=[3700,5500],YRange=[-0.06,0.06],$
xstyle=1,ystyle=1,position=[0.1,0.7,0.9,0.8],/noerase
XYouts,0.15,0.77,'pc1',/Normal

;draw pc2
plot,wave_out,espec2_out,Thick=3,XRange=[3700,5500],YRange=[-0.10,0.15],$
xstyle=1,ystyle=1,position=[0.1,0.6,0.9,0.7],/noerase
XYouts,0.15,0.67,'pc2',/Normal

;draw pc3
plot,wave_out,espec3_out,Thick=3,XRange=[3700,5500],YRange=[-0.06,0.08],$
xstyle=1,ystyle=1,position=[0.1,0.5,0.9,0.6],/noerase
XYouts,0.15,0.57,'pc3',/Normal

;draw pc4
plot,wave_out,espec4_out,Thick=3,XRange=[3700,5500],YRange=[-0.10,0.15],$
xstyle=1,ystyle=1,position=[0.1,0.4,0.9,0.5],/noerase
XYouts,0.15,0.47,'pc4',/Normal

;draw pc5
plot,wave_out,espec5_out,Thick=3,XRange=[3700,5500],YRange=[-0.15,0.15],$
xstyle=1,ystyle=1,position=[0.1,0.3,0.9,0.4],/noerase
XYouts,0.15,0.37,'pc5',/Normal

;draw pc6
plot,wave_out,espec6_out,Thick=3,XRange=[3700,5500],YRange=[-0.20,0.10],$
xstyle=1,ystyle=1,position=[0.1,0.2,0.9,0.3],/noerase
XYouts,0.15,0.27,'pc6',/Normal

;draw pc7
plot,wave_out,espec7_out,Thick=3,XRange=[3700,5500],YRange=[-0.10,0.10],$
xstyle=1,ystyle=1,position=[0.1,0.1,0.9,0.2],/noerase
XYouts,0.15,0.17,'pc7',/Normal
!p.font=4
XYOuts, 0.03, 0.4,'Normalized Flux Density f',/Norm,Charsize=1.5,Orientation=90
XYOuts, 0.4, 0.06, /Normal,'Wavelength(A)', Charsize=1.5
device, /CLOSE_FILE
device,ENCAPSULATED=0
end

运行完本程序便会产生4个文件,“flux.txt”、“eigenvect.txt”、“espec.txt”和“pcs.eps”\
“flux.txt”:归一化后的模板谱
“eigenvect.txt”:特征谱
“espec.txt”:波长,平均谱,前10个特征谱
“pcs.eps”:平均谱,前7个特征谱图
在得到了上面的3个txt文件后,可以开始计算模板谱在特征谱上的投影了。
程序读取"espec.txt",“flux.txt"来计算投影,输出文件"projection.txt”
投影程序为:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
pro projection
;read pcs
eigenvect_full=fltarr(1721,10)
spec=fltarr(1601,25000)
projec=fltarr(7,25000)
eigenvect=fltarr(1601,10)
mspec_full=fltarr(1721)
mspec=fltarr(1601)
wave_full=fltarr(1721)
openr,lun,'espec.txt',/Get_lun
readf,lun,wave_full
readf,lun,mspec_full
readf,lun,eigenvect_full
free_lun,lun
openr,lun,'flux.txt',/get_lun
;openr,lun,'model.txt',/get_lun
readf,lun,spec
free_lun,lun
index=where(mspec_full GT -50,int)
mspec=mspec_full[index]
for ii=0,9 do eigenvect[*,ii]=eigenvect_full[index,ii]
aaa=rebin(mspec,1601,25000)
help,aaa
X=spec-aaa
help,aaa,eigenvect,spec,x
pcs=fltarr(1601)
for i=0,6 do begin
pcs=eigenvect[*,i]
for j=0,24999 do begin
spec_temp=x[*,j]
sum=0.0
for k=0,1600 do begin
sum=sum+spec_temp[k]*pcs[k]
end
projec[i,j]=sum
end
end

openw,lun,'projection.txt',/Get_lun
;openw,lun,'projection_new.txt',/Get_lun
thisformat='(7(F10.3))'
for i=0,24999 do begin
printf,lun,projec[*,i],format=thisformat
end
free_lun,lun

end

计算出投影后计算每个物理参量对应的未知数。程序读取"projection.txt","model-parameter.tbl"来解方程组以及截断,最后输出eps文件,以及对应的未知数的txt文件。
程序为:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
pro solve
project=fltarr(7,25000)
parameter=fltarr(11,25000)
zmet=fltarr(25000)
;read the projection
;openr,lun,'projection_new.txt',/Get_lun
openr,lun,'projection.txt',/Get_lun
readf,lun,project
free_lun,lun
;read the parameter
openr,lun1,'model-parameter.tbl',/Get_lun
readf,lun1,parameter
free_lun,lun1

;zmet=parameter[10,*]
zmet1=parameter[2,*]
zmet2=parameter[3,*]
for i=0,24999 do zmet[i]=zmet1[i]*zmet2[i]
;zmet=9+alog10(zmet) ;if the zmet is age
;zmet=zmet*0.02
;zmet=alog10(zmet)
help,zmet
;zmet=transpose(zmet)
print,max(zmet),min(zmet)
n=8
C=fltarr(n,n)
D=fltarr(n)
for i=0,n-2 do D[i]=zmet ## project[i,*]
D[n-1]=total(zmet)
C[n-1,n-1]=25000
for i=0,n-2 do begin
C[i,n-1]=total(project[i,*])
C[n-1,i]=total(project[i,*])
for j=0,n-2 do begin
C[i,j]=transpose(project[i,*]) ## project[j,*]
end
end
LUDC,C,index
result=LUSOL(C,index,D)
temp5=fltarr(25000)
for i=0,24999 do begin
temp5[i]=0
for j=0,n-2 do begin
temp5[i]=temp5[i]+project[j,i]*result[j]
end
temp5[i]=temp5[i]+result[n-1]
end
print,'result',result
print,'min(temp5),max(temp5)',min(temp5),max(temp5)
print,'min(zmet),max(zmet)',min(zmet),max(zmet)

;out the X
;openw,lun2,'Mu_Tauv.tbl',/Get_lun
openw,lun2,'Mu_Tauv.txt',/Get_lun
printf,lun2,result
free_lun,lun2

;plot the picture
!P.MULTI=[0,1,0,0,0]
set_plot, 'PS'
device,xsize=20
device,ysize=16
device, FILE='Mu_Tauv.eps', /ENCAPSULATED
;plot,temp5,zmet, PSYM=3,xrange=[-0.5,2.3],yrange=[0,2.2],xstyle=1,ystyle=1,position=[0.1,0.1,0.9,0.9]
plot,temp5,zmet, PSYM=3,xrange=[min(temp5),max(temp5)],yrange=[min(zmet),max(zmet)],xstyle=1,ystyle=1,position=[0.1,0.1,0.9,0.9]
XYOuts, 0.06, 0.4,'Mu_Tauv',/Norm,Charsize=1.5,Orientation=90
;XYOuts, 0.4, 0.06, /Normal,'Z+sum(X*C)', Charsize=1.5
device, /CLOSE_FILE
device,ENCAPSULATED=0
end

计算得到了相应参量的系数的值以后,可以计算特征谱在对应物理参量上的贡献度。此程序由于IDL绘图颜色没有调用好,所以出来的eps文件没有对应的颜色,但是可以根据输出的"fps.txt"文件重新绘图。
程序如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
pro Fpc
project=fltarr(7,25000)
X=fltarr(7,5)
temp=fltarr(8)
F=fltarr(7,5)
sum=fltarr(7)
name=strarr(5)
names=strarr(7)
xval=indgen(7)
names[0]='c1' & names[1]='c2' &names[2]='c3' &names[3]='c4'
names[4]='c5'& names[5]='c6'& names[6]='c7'
;read the projection
openr,lun,'projection.txt',/Get_lun
readf,lun,project
free_lun,lun
openr,lun1,'age.txt',/Get_lun
;openr,lun1,'age.txt',/Get_lun
name[0]='age'
readf,lun1,temp
X[*,0]=temp[0:6]
free_lun,lun1
;openr,lun2,'HDA.txt',/Get_lun
openr,lun2,'HDA.txt',/Get_lun
name[1]='HDA'
readf,lun1,temp
X[*,1]=temp[0:6]
free_lun,lun2
;openr,lun3,'D4000.txt',/Get_lun
openr,lun3,'D4000.txt',/Get_lun
name[2]='D4000'
readf,lun1,temp
X[*,2]=temp[0:6]
free_lun,lun3
;openr,lun4,'Vdisp.txt',/Get_lun
openr,lun4,'Vdisp.txt',/Get_lun
name[3]='Vdisp'
readf,lun1,temp
X[*,3]=temp[0:6]
free_lun,lun4
;openr,lun5,'Mu_Tauv.txt',/Get_lun
openr,lun5,'Mu_Tauv.txt',/Get_lun
name[4]='Mu_Tauv'
readf,lun1,temp
X[*,4]=temp[0:6]
free_lun,lun5
sum1=0.
openw,lun6,'Fpc.txt',/Get_lun
for k=0,4 do begin
for i=0,6 do begin
for j=0,24999 do begin
sum1=sum1+abs(X[i,k]*project[i,j])
sum[i]=sum[i]+abs(X[i,k]*project[i,j])
end
end
printf,lun6,sum1,name[k]
for l=0,6 do F[l,k]=sum[l]/sum1
for l=0,6 do printf,lun6,sum[l],sum[l]/sum1
end
free_lun,lun6
print,F
temp1=F[*,0]
help,temp1
print,xval,temp1,names
;plot,xval,temp1,/ynozero,yrange=[0.,0.8],xtickname=names, XTICKV = XVAL,XTICKS = 3,PSYM=10

!P.MULTI=[0,2,0,0,0]
set_plot, 'PS'
device,xsize=20
device,ysize=16
device, FILE='Fpc.eps', /ENCAPSULATED
Device, Decomposed=1 ;(24位模式)
plot,xval,F[*,0],PSYM=10,yrange=[0.,0.8],position=[0.1,0.1,0.9,0.45],/nodata,xtickname=names
oplot,xval,F[*,0],PSYM=10,color=0L ;黑色
oplot,xval,F[*,1],psym=10,color=255L ;红色
oplot,xval,F[*,4],psym=10,color=16711680L ;蓝色
XYOuts, 0.4, 0.06, /Normal,'Z+sum(X*C)', Charsize=1.5
plot,xval,F[*,3],PSYM=10,yrange=[0.,0.8],position=[0.1,0.55,0.9,0.9],/nodata,xtickname=names
oplot,xval,F[*,3],psym=10,color=0L
oplot,xval,F[*,2],psym=10,color=255L
;XYOuts, 0.06, 0.4,'Mu*Tauv',/Norm,Charsize=1.5,Orientation=90
;XYOuts, 0.4, 0.06, /Normal,'Z+sum(X*C)', Charsize=1.5
device, /CLOSE_FILE
device,ENCAPSULATED=0
end
CATALOG
  1. 1. 毕业设计记录之2
    1. 1.1. 主成分分析法(Principal Component Analysis,PCA)