Artificial Intelligence (AI) and Machine Learning (ML) are major buzzwords these days, and given their explosion into the public sphere, large language models like ChatGPT are often the first thing that comes to mind when we hear about AI. In reality, chatbots are only a small subset of the diverse technologies that fall under the umbrella of AI, and for more than a decade, genomic researchers have been applying AI and ML to their very toughest data challenges.
In the early days of genetic research, sequencing was incredibly time- and cost-intensive. The first fully sequenced human genome took 13 years to complete and cost $3 billion. Since the advent of Next Generation Sequencing, these costs have fallen dramatically, and an entire genome can now be sequenced in a matter of hours for less than $1,000. Today, researchers routinely sequence many individuals for a single study, with each genome comprising billions of DNA base pairs. Simply maintaining the computing infrastructure required to store these datasets is a dizzying prospect, let alone trying to analyze them.
Traditional approaches to handling data are infeasible with large genomic datasets. Imagine comparing several human genomes in hopes of finding the genetic cause of a rare disease. If you took just one second to examine each of the 3.2 billion base pairs, you would need over 100 years to work through a single genome. And because most traits involve the interactions of multiple genes, it’s easy to miss important patterns. Even with the help of a computer, these interactions can be difficult to uncover. Traditional statistical approaches compare only a pre-selected set of variables at a time, so testing different combinations of variables requires manual input from the researcher. This input is not only technically demanding but also time-consuming, creating a bottleneck in data analysis that cannot be overcome without advanced computational power.
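To put that timescale in perspective, here is a quick back-of-the-envelope calculation; the one-base-pair-per-second reading rate is purely illustrative:

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365   # ~31.5 million seconds in a year
GENOME_SIZE = 3_200_000_000             # ~3.2 billion base pairs

# Reading one base pair per second, non-stop, for a single genome:
years = GENOME_SIZE / SECONDS_PER_YEAR
print(f"{years:.0f} years")             # -> roughly 101 years
```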
This is where AI and ML come in. At their core, these technologies are designed to uncover patterns in complex data. And unlike traditional statistical models, AI and ML approaches analyze all available data across any combination of variables right from the start.
- AI is a broad term that can refer to any computer process that aims to emulate thought processes, even in a limited capacity. All existing AI technologies are considered “weak” AI, meaning they function only within a specific domain of tasks.
- ML refers to a subset of AI technologies that construct algorithms based on patterns found in datasets. Crucially, the researcher does not prescribe which variables should be used in the algorithm – these are determined automatically from patterns that emerge in the dataset. Once constructed, the ML algorithm can be used to make predictions about new datasets, and it can be revised each time new data is fed into the model, increasing its predictive power (see the sketch after this list).
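As a deliberately simplified illustration of that workflow, the sketch below trains a random forest on a table of genetic variants. The genotype matrix and trait labels here are randomly generated stand-ins, not real data, and a real study would involve far more data and careful validation:

```python
# A minimal sketch of the ML workflow described above, using scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical dataset: 200 individuals x 1,000 variant sites (0/1/2 allele
# counts), with a binary trait label for each individual.
genotypes = rng.integers(0, 3, size=(200, 1000))
trait = rng.integers(0, 2, size=200)

# The researcher does not pre-select variables: the model weighs all 1,000
# variant sites and learns which combinations are informative.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(genotypes, trait)

# Feature importances reveal which variants the model found most predictive.
top_variants = np.argsort(model.feature_importances_)[-5:]
print("Most informative variant sites:", top_variants)

# When new data arrive, the model is simply retrained on the combined dataset.
new_genotypes = rng.integers(0, 3, size=(50, 1000))
new_trait = rng.integers(0, 2, size=50)
model.fit(np.vstack([genotypes, new_genotypes]),
          np.concatenate([trait, new_trait]))
```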
Applied to genomics, these approaches can analyze interactions across the entire genome simultaneously and continue to improve as new data is added. This has applications across all areas of genomics research, including untangling the complex interactions between genetic and social determinants of health.
Identifying genetic risk factors for mental health disorders is of key interest to health practitioners. Recent research has focused on genome-wide association studies (GWAS), which scan for patterns of association between mental health disorders and common genetic variants. Such studies have successfully identified variants associated with a range of disorders, including schizophrenia and major depressive disorder. However, these genetic risk factors tend to have low predictive power, partly because many GWAS datasets are limited in scope, capturing only common genetic variants. Whole-genome sequencing and whole-exome sequencing (sequencing only the protein-coding regions of the genome) could improve prediction models by capturing rare genetic variants; however, these approaches require much larger datasets. Many GWAS also suffer from a lack of racial and ethnic diversity in the populations sampled, which are largely biased towards European ancestry, making it difficult to generalize their findings to real-world clinical populations. Integrating genomic data with environmental and socio-demographic factors is yet another approach to improving predictive power, but it too requires larger, more complex datasets.
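One widely used way to turn GWAS results into a prediction is a polygenic risk score: a weighted sum of an individual’s risk alleles, with weights taken from GWAS effect sizes. A minimal sketch follows; the effect sizes and genotype values are invented for illustration, and note that such a score inherits the limits of the GWAS it came from, including its restriction to common variants:

```python
import numpy as np

# Hypothetical GWAS output: effect sizes (log odds ratios) for 5 variants.
effect_sizes = np.array([0.12, -0.05, 0.30, 0.08, -0.15])

# One individual's genotype: count of risk alleles (0, 1, or 2) per variant.
genotype = np.array([1, 0, 2, 1, 2])

# Polygenic risk score: the effect-size-weighted sum of risk allele counts.
prs = np.dot(effect_sizes, genotype)
print(f"Polygenic risk score: {prs:.2f}")
```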
With the advanced processing power made possible by AI and ML approaches, health researchers can leverage ever larger and more complex datasets for these kinds of challenges. For instance, a new Genome Alberta-funded research initiative aims to identify genetic and environmental risk factors for mental health disorders by examining the combined genomic data of 6,450 children and youth. The initiative will ensure diversity among participants through targeted sampling of under-represented demographics, with the goal of advancing precision medicine for all young Canadians.
Forestry is another area where AI and ML can greatly support genomic research. Breeding trees based on their genomes, called genomic selection, can greatly accelerate the development of improved tree seedlings. Such approaches could enable foresters to more readily plant trees with desirable traits, such as faster growth or better drought resistance. However, this requires knowing which genetic variants are associated with which traits, and trees have mind-bogglingly big genomes. White spruce, commonly used for reforestation here in Alberta, has up to 10 times as many base pairs in its genome as humans do.
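In practice, genomic selection typically means fitting a model that predicts a trait (say, growth rate) from thousands of genetic markers, then scoring candidate seedlings with that model before they are ever planted. The sketch below uses ridge regression, a common choice for this kind of many-small-effects problem (closely related to GBLUP); the marker data and trait values are simulated stand-ins:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Hypothetical training data: 300 trees genotyped at 5,000 marker sites,
# each with a measured trait (e.g., height growth). The "signal" here is a
# toy construction for demonstration only.
markers = rng.integers(0, 3, size=(300, 5000)).astype(float)
growth = markers[:, :50].sum(axis=1) + rng.normal(0, 5, size=300)

# Ridge regression shrinks the many small marker effects toward zero,
# a common approach in genomic selection.
model = Ridge(alpha=100.0)
model.fit(markers, growth)

# Score new, unmeasured seedlings from their genotypes alone, then rank them.
seedlings = rng.integers(0, 3, size=(20, 5000)).astype(float)
predicted = model.predict(seedlings)
best = np.argsort(predicted)[::-1][:5]
print("Top-ranked seedlings:", best)
```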
The Resilient Forests (RES-FOR) project, funded by Genome Alberta, aims to develop robust genomic selection models for lodgepole pine and white spruce tree improvement programs in Alberta. AI and ML approaches are a critical part of the project’s data analysis.
While AI and ML provide powerful insights, they can also amplify underlying data problems if misused. Biases in the datasets used to train the model can become codified within the algorithm without the researcher even knowing. It is critical that genomic researchers and data scientists using these technologies have the expertise and training to anticipate and correct these issues – particularly in high-risk applications like human health.
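One simple safeguard is to check a trained model’s accuracy separately for each demographic group in a held-out test set; a large gap between groups signals that biases in the training data have crept into the algorithm. A minimal sketch of such an audit, with group labels, true outcomes, and model predictions all randomly generated as placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical held-out test set: true labels, model predictions, and an
# ancestry group label for each of 500 individuals.
truth = rng.integers(0, 2, size=500)
predictions = rng.integers(0, 2, size=500)
groups = rng.choice(["group_A", "group_B", "group_C"], size=500)

# Report accuracy per group; a model trained on unrepresentative data can
# look fine overall while underperforming on under-represented groups.
for g in np.unique(groups):
    mask = groups == g
    accuracy = (truth[mask] == predictions[mask]).mean()
    print(f"{g}: accuracy = {accuracy:.2f} (n = {mask.sum()})")
```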
As AI and ML technologies become more sophisticated, researchers expect to leverage ever larger datasets, unlocking insights never before thought possible. Like Next Generation Sequencing in the early 2000s, AI and ML are poised to be a similarly transformative force in the field of genomics.
See our Project Portfolio to learn more about projects Genome Alberta has funded in Technology Platforms.