Direct inference and control of genetic population structure from RNA sequencing data

Communications Biology volume 6, Article number: 804 (2023)

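RNAseq data can be used to infer genetic variants, yet its use for estimating genetic population structure remains underexplored. Here, we construct a freely available computational tool (RGStraP) to estimate RNAseq-based genetic principal components (RG-PCs) and assess whether RG-PCs can be used to control for population structure in gene expression analyses. Using whole blood samples from understudied Nepalese populations and the Geuvadis study, we show that RG-PCs gave results comparable to paired array-based genotypes, with high genotype concordance and high correlations of genetic principal components, and captured subpopulations within the dataset. In differential gene expression analysis, we found that including RG-PCs as covariates reduced test statistic inflation. Our paper demonstrates that genetic population structure can be directly inferred and controlled for using RNAseq data, thus facilitating improved retrospective and future analyses of transcriptomic data.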

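RNA sequencing (RNAseq) has revolutionized our understanding of the transcriptome, providing both an accurate method for quantifying gene expression and a means of identifying specific alternative splicing sites and cell type-specific transcripts1,2. Its application has extended to clinical settings, where it can further characterize complex diseases and identify promising biomarkers in both infectious and non-communicable diseases3.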

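However, studies using RNAseq rarely consider the germline genetic variation that is also contained in the set of RNAseq reads. Studies that do not use this information may be vulnerable to bias and confounding, such as population stratification, which can affect transcription across populations4,5,6,7. To overcome this problem, researchers have typically relied on matched genome-wide array or whole genome sequence (WGS) data for the same individuals profiled with RNAseq. This allows researchers to deploy approaches that control for population stratification, such as calculating genetic principal components (PCs) and using them as covariates in subsequent statistical association models8,9,10. Genetic PCs are used to represent latent genetic structure within and between populations that causes confounding, whether through differences in social environment or through heterogeneity of quantitative trait loci between groups. However, the need for genome-wide array or WGS data matched to RNAseq data is potentially unnecessary, and it may be infeasible in practice in resource-limited settings such as low- and middle-income countries (LMICs), which have highly diverse and understudied populations.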

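It has been demonstrated that genotypes can be called from RNAseq data using tools such as GATK12,13,14. Approaches that use RNAseq data to capture genetic structure have been applied for livestock and agricultural purposes15,16,17,18, for example to investigate the population structure, history, and adaptation of domesticated barley (Hordeum vulgare)17. Although the proof of concept and subsequent utility of RNAseq-based genotypes, such as for tissue-specific variation, have been demonstrated, their applicability to inferring human population structure remains relatively underexplored.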

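The aims of this study are (i) to demonstrate that RNAseq-based genotypes can capture the genetic population structure of diverse but understudied human populations, and (ii) to show that RNAseq-based genetic principal components (RG-PCs) can be used to effectively control for population structure in association analyses. Here, we recruited 376 individuals and generated whole blood RNAseq data from them in Nepal, a landlocked country in the Himalayas with more than 125 ethnic groups. We developed an RNAseq analysis pipeline (RGStraP) to calculate genetic principal components directly from RNAseq data, then validated the performance of RGStraP using genome-wide array genotype data from the same Nepalese individuals. We also tested the pipeline on samples from the Geuvadis consortium, which comprise 465 samples with paired genotype-RNAseq data from five of the 1000 Genomes populations23. Finally, we show the validity of adjusting for RG-PCs in an association analysis that identifies sex-specific gene expression. Overall, our study establishes that human population structure, particularly in understudied but diverse populations, can be effectively captured and controlled for directly using RNAseq data.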

[…] MAF > 0.05 and a pairwise LD threshold of r² < 0.05 struck the optimal balance of offering the most variants for analysis and the highest correlation between RNAseq- and array-based genetic PCs (Supplementary Fig. 2). From the total of 4,921,472 genetic variants, 152,072 SNPs passed the MAF filter (MAF > 0.05), and 36,440 SNPs further passed the LD filter (LD < 0.05). Genetic variants from paired genomic data are available for 299 out of the initial 376 individuals; a total of 552,758 SNPs were identified and passed initial quality control filters (Methods), of which 315,615 SNPs and 29,943 SNPs then passed MAF > 0.05 and further LD < 0.05 filters, respectively. Out of the 299 samples with both RNAseq and paired array genotypes, 280 passed quality control and were used for further downstream analyses.

Fig. 2 (caption, partially recovered): […] >0.90 concordances. b Canonical correlation analysis between ten RG-PCs and ten array PCs showed significant (Wilks’ Lambda, p-value < 0.05) correlations for the first 7 canonical variates (CVs) between the two sets. The first 3 CVs from 10 RG-PCs strongly captured the genetic information from array PCs (Rc1 = 0.946, Rc2 = 0.864, Rc3 = 0.853), in which the cumulative proportion of shared variance between the two sets reached up to 0.956 from just the 3 CVs.

[…] (MAF > 0.05) variants, of which 4887 passed the LD filter (LD < 0.05) and were used to calculate RG-PCs. We also calculated genetic PCs from the 29,943 paired genotype array SNPs as a measure of true genetic structure to be compared against RG-PCs. To assess the consistency of inferred population structure between the two approaches, we calculated Spearman correlation between genetic PCs from paired genotype array SNPs and the RG-PCs. PC1 of both RNAseq and array sets correlated strongly with each other (|ρ| = 0.93), followed by RG-PC3 and PC2 from array data (|ρ| = 0.61) and RG-PC2 and PC3 from array data (|ρ| = 0.6) (Supplementary Fig. 4).
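
The PC-to-PC comparison described above can be sketched in a few lines. This is only an illustration using the Python standard library, not the RGStraP implementation: the function names (`ranks`, `spearman`, `abs_spearman_matrix`) and the toy PC scores are assumptions made for the example. The absolute value is taken because the sign of a principal component is arbitrary, which is why the text reports |ρ|.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def abs_spearman_matrix(rg_pcs, array_pcs):
    """|rho| for every (RG-PC, array PC) pair.

    Each argument is a list of PCs; each PC is a list of per-sample scores,
    with samples in the same order in both sets.
    """
    return [[abs(spearman(r, a)) for a in array_pcs] for r in rg_pcs]


# Toy scores for five samples (hypothetical values, not data from the study).
rg_pcs = [[0.1, 0.4, 0.2, 0.9, 0.7]]
array_pcs = [[-1.0, -4.0, -2.0, -9.0, -7.0],  # same ordering, flipped sign
             [2.0, 1.0, 4.0, 3.0, 5.0]]
matrix = abs_spearman_matrix(rg_pcs, array_pcs)
```

In this toy input the first array PC is a sign-flipped version of the RG-PC, so its entry in `matrix` is 1 even though the raw correlation is negative.
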
As expected, the genetic PCs of one approach do not correspond exclusively to a single PC of the other approach, as shown by the significant correlations of a single array PC with several RG-PCs. To investigate this further, we performed canonical correlation analysis between the top 10 array PCs and the RG-PCs and found that the RG-PCs fully explained the variance of the top 10 array PCs (Fig. 2b).

[…] > 0.05) to account for differences in sequencing depths. Only autosomal genes were included in the analyses.

[…] > 1) in the set without considering genetic PCs, and the number decreased to 3 when including either array PCs or RG-PCs. This demonstrates that RG-PCs control for population stratification in downstream RNAseq analysis in a similar way to the genetic PCs calculated from paired array genotypes, reducing significant associations that reflect variation in population structure rather than the biology of interest.

[…] >38.5 °C temperature or history of fever for >72 h. From the total blood sample volumes (≤16 mL for patients >16 years of age, ≤7 mL for ≤16 years), aliquots were subjected to (i) bacteriological culture to identify the presence of Salmonella enterica serovar Typhi (S. Typhi); (ii) storage in PAXgene tubes for later RNA extraction; and (iii) DNA extraction and subsequent human genotyping. Blood was also collected from healthy participants in the serosurvey (≤8 mL for patients >16 years of age, ≤7 mL for ≤16 years), from which aliquots were also subjected to (i) serological analysis; (ii) PAXgene storage for RNA analysis; and (iii) DNA extraction.

[…] > 0.05 in at least 20% of the samples from the analyses. Differential gene expression (DGE) analyses were performed contrasting males and females using edgeR43,44, taking into account age, disease group, and sequencing batches; we ran the analyses with and without population structure PCs as an additional covariate to compare how genetic structure may stratify gene expression.
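
To illustrate why running the model with and without a population-structure PC matters, the sketch below fits the same expression values twice. It is a deliberately simplified stand-in: edgeR fits negative binomial models to counts, whereas this example uses ordinary least squares on made-up continuous values. The `ols` helper, the variable names, and the data are all assumptions for illustration. Expression here depends only on PC1, but sex is unevenly distributed along PC1, so the unadjusted model attributes the difference to sex.

```python
def ols(rows, y):
    """Least-squares coefficients for y ~ rows.

    Each row must already contain a leading 1.0 for the intercept.
    Solves the normal equations (X^T X) beta = X^T y by Gauss-Jordan
    elimination with partial pivoting; assumes a full-rank design.
    """
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Hypothetical data for eight samples: PC1 score and sex (0/1).
pc1 = [-1.5, -1.0, -0.5, -0.5, 0.5, 0.5, 1.0, 1.5]
sex = [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
# Expression is driven by population structure only; there is no sex effect.
expr = [1.0 + 2.0 * p for p in pc1]

# Model without the PC: expr ~ intercept + sex
unadjusted = ols([[1.0, s] for s in sex], expr)
# Model with the PC as a covariate: expr ~ intercept + sex + PC1
adjusted = ols([[1.0, s, p] for s, p in zip(sex, pc1)], expr)

sex_effect_unadjusted = unadjusted[1]  # spurious, driven by structure
sex_effect_adjusted = adjusted[1]      # close to zero once PC1 is included
```

The same logic carries over to a count-based model: a covariate that tracks population structure absorbs expression differences that would otherwise be assigned to the contrast of interest.
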
From both results, we also drew Q-Q plots and calculated the systematic inflation (m), which is the ratio of the median of the empirically observed chi-squared test statistics (in our case, results of DGE analysis with RG-PCs) to the expected median chi-squared test statistics (results of DGE analysis without RG-PCs), to quantify the stratification due to population structure in gene expression data.
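
The inflation statistic described above is a ratio of medians of chi-squared statistics, which can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: the text does not say how test statistics were obtained, so the example assumes two-sided p-values converted to 1-degree-of-freedom chi-squared values, and the function names `p_to_chi2` and `inflation` are hypothetical. When no reference set is supplied, the denominator is the theoretical median of a 1-df chi-squared distribution (about 0.455), which is the conventional genomic inflation factor.

```python
from statistics import NormalDist, median


def p_to_chi2(p):
    """1-df chi-squared statistic whose upper-tail probability is p (0 < p <= 1)."""
    # A 1-df chi-squared variable is a squared standard normal, so the
    # statistic is the squared normal quantile at p / 2.
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z


def inflation(observed_p, reference_p=None):
    """Ratio of the median observed chi-squared statistic to a reference median.

    reference_p=None uses the theoretical null median (p = 0.5);
    otherwise the median statistic of the supplied reference p-values is used.
    """
    observed = median(p_to_chi2(p) for p in observed_p)
    if reference_p is None:
        reference = p_to_chi2(0.5)
    else:
        reference = median(p_to_chi2(p) for p in reference_p)
    return observed / reference


# Hypothetical p-values: evenly spread (null-like) versus skewed toward zero.
null_like = [0.1, 0.3, 0.5, 0.7, 0.9]
inflated = [0.01, 0.05, 0.1, 0.2, 0.4]

ratio_null = inflation(null_like)                  # equals 1 by construction
ratio_inflated = inflation(inflated)               # greater than 1
ratio_between = inflation(inflated, null_like)     # one result set against another
```

A value near 1 indicates test statistics in line with the reference, while values above 1 indicate systematically larger statistics, as expected when population structure is left uncontrolled.
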