1 - About the data

The data includes more than 2000 patients and about 500 healthy controls.​

Patient data comes from different departments, and most cases are not directly related to a rare disease.

All patients have a manually curated ICD-10 diagnosis code.

The data comprises clinical lab data, unstructured medical history and doctors’ notes, genomic data, and proteomic data.

  • The clinical lab data includes blood tests and urine tests. Subjects can have between 0 and 30 terms and parameters. Abnormal findings in these tests are translated into HPO terms and added to the graph.
  • The unstructured data includes a Medical History Questionnaire (900 questions) and doctors’ letters. These are translated into HPO terms.
  • The genomic data includes the 10 to 20 most prominent genes, selected from about 6000 mutations per subject. Genetic variants with a CADD score of 6 or higher were selected, and only the highest-scoring variant per gene was kept.
  • The proteomic data includes between 1000 and 2000 proteins with quantitative values for each subject.
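
The variant selection described above can be sketched as follows. This is only an illustration under stated assumptions: the `Variant` class, the gene names, and the variant identifiers are hypothetical, and ranking the genes by their best CADD score is an assumption, since the text does not say how the 10 to 20 most prominent genes were ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str          # gene symbol (placeholder names below)
    variant_id: str    # hypothetical identifier
    cadd: float        # CADD score of the variant


def select_top_genes(variants, min_cadd=6.0, max_genes=20):
    """Keep the highest-scoring variant per gene, then the top genes overall."""
    best_per_gene = {}
    for v in variants:
        if v.cadd < min_cadd:
            continue  # below the CADD threshold of 6 named in the text
        current = best_per_gene.get(v.gene)
        if current is None or v.cadd > current.cadd:
            best_per_gene[v.gene] = v
    # Assumption: genes are ranked by the CADD score of their best variant.
    ranked = sorted(best_per_gene.values(), key=lambda v: v.cadd, reverse=True)
    return ranked[:max_genes]


variants = [
    Variant("GENE_A", "var1", 12.4),
    Variant("GENE_A", "var2", 23.1),
    Variant("GENE_B", "var3", 4.9),
    Variant("GENE_C", "var4", 8.0),
]
selected = select_top_genes(variants)
print([(v.gene, v.variant_id) for v in selected])
```

In this toy input, GENE_B is dropped because its only variant scores below 6, and GENE_A is represented by its higher-scoring variant.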

2 - Graph Data Model

The graph data model is a widely used framework for describing and analyzing relationships between entities in a dataset. In a graph data model, data is organized into nodes, which represent objects or entities, and edges, which represent the relationships between nodes. A node can have attributes or properties describing its characteristics, and an edge can be directed, undirected, or bidirectional to describe the relationship more accurately.

Graph data models are particularly well-suited for modeling complex networks, such as social networks, biological networks, transportation networks, and knowledge graphs. They enable efficient querying, traversal, and analysis of interconnected data, allowing for the discovery of patterns, insights, and dependencies within the dataset.
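
As a minimal sketch of these ideas, the following pure-Python example stores nodes with properties and edges with a type and a direction, and then traverses one hop. The class names, node labels, and relationship types are hypothetical and do not reflect the schema of the challenge database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str                              # kind of entity, e.g. "Patient"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    rel_type: str                           # kind of relationship
    directed: bool = True


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        self.edges.append(edge)

    def neighbors(self, node_id):
        """Return ids reachable over one edge, honoring edge direction."""
        result = []
        for e in self.edges:
            if e.source == node_id:
                result.append(e.target)
            elif e.target == node_id and not e.directed:
                result.append(e.source)
        return result


g = Graph()
g.add_node(Node("p1", "Patient", {"age": 7}))
g.add_node(Node("hpo1", "Phenotype", {"name": "Microcephaly"}))
g.add_node(Node("g1", "Gene", {"symbol": "GENE_A"}))
g.add_edge(Edge("p1", "hpo1", "HAS_PHENOTYPE"))
g.add_edge(Edge("p1", "g1", "HAS_VARIANT_IN"))
print(g.neighbors("p1"))
```

Because both edges are directed away from the patient node, a traversal starting at the phenotype node finds no neighbors. A graph database such as Neo4j offers the same concepts with indexing and a query language on top.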

Knowledge Graph

A Knowledge Graph is a representation of a knowledge domain using the graph data model, readable by both humans and machines. ​

Knowledge graphs are used to model complex information in a way that is easy to understand. They can use ontologies (e.g., the Human Phenotype Ontology) to define the vocabulary and the structure of classes, properties, and relationships in a knowledge domain, which ensures consistency and allows compatibility between different graphs.

To learn more about graph data modeling and gain some hands-on experience, you can take the free Graph Data Modeling Fundamentals course at Neo4j GraphAcademy.

3 - AMIGO

Advanced Medical Intelligence for Guiding Orphan Medicine

Challenge

We invite you to develop machine learning (ML) algorithms that use the knowledge graph data to accomplish the following:

A. Accurately predict whether a child is healthy or has been diagnosed with an illness.

B. Accurately predict each patient’s disease category, based on the first letter of the ICD-10 code.

C. Correctly cluster patients based on features from the omics and clinical data.

D. Support federated execution using the FeatureCloud functionality.
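
For tasks A and B, the target labels can be derived from the ICD-10 code. The sketch below assumes that healthy controls carry no code (None or an empty string); how controls are actually marked in the data is not stated here, so treat that as an assumption. The function names and example codes are illustrative.

```python
def binary_label(icd10_code):
    """Task A: 1 if the subject has a diagnosis code, 0 otherwise."""
    code = (icd10_code or "").strip()
    return 1 if code else 0


def category_label(icd10_code):
    """Task B: first letter of the ICD-10 code, or None if there is no code."""
    code = (icd10_code or "").strip()
    if not code:
        return None
    return code[0].upper()


# Example codes are used only to show the format.
subjects = {"s1": "E84.0", "s2": "d81.1", "s3": None}
labels = {sid: (binary_label(c), category_label(c)) for sid, c in subjects.items()}
print(labels)
```

Normalizing case and whitespace before taking the first letter avoids creating spurious classes from inconsistently entered codes.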

Business Outcome

  • Outline how the approach can be implemented and scaled to improve patients’ lives.​

Technical Architecture

  • You are free to use the synthetic data made available to you in the Neo4j database.
  • You will be able to connect to the FeatureCloud platform and upload your models.
  • You are free to use both supervised and unsupervised methods.

Background

  • The disease experts at the Dr. von Hauner Children’s Hospital in Munich are working toward establishing an AI platform to improve diagnosis of rare diseases in children. ​

  • Their goal is to implement federated learning and create a library of ML models to assist clinicians with patients’ diagnosis.​

  • To protect patients’ privacy, we have already preprocessed and analyzed the data and created knowledge graphs which you can use to build your algorithm. ​

  • The knowledge graph data comprise thousands of human phenotype ontology (HPO) features, covering genomic, proteomic, blood values, electronic health records and patient questionnaires.​

  • Additionally, we provide you with a well-crafted unlabeled synthetic data set that closely resembles the structure of the real data.​
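
To make the idea of federated learning mentioned above more concrete, here is a minimal sketch of federated averaging (FedAvg), one common aggregation scheme in which each site trains locally and shares only model parameters. It is not necessarily the scheme used by the hospital or by FeatureCloud apps, no FeatureCloud API is shown, and the function name and numbers are hypothetical.

```python
def federated_average(client_updates):
    """Average parameter vectors, weighting each client by its sample count.

    client_updates: list of (parameters, n_samples) tuples, where parameters
    is a list of floats of equal length for every client.
    """
    total = sum(n for _, n in client_updates)
    if total == 0:
        raise ValueError("clients reported no samples")
    dim = len(client_updates[0][0])
    averaged = [0.0] * dim
    for params, n in client_updates:
        for i, p in enumerate(params):
            averaged[i] += p * n / total
    return averaged


# Two hypothetical sites: only parameters and sample counts leave each site.
updates = [([1.0, 2.0], 100), ([3.0, 4.0], 300)]
global_params = federated_average(updates)
print(global_params)
```

The raw patient records never appear in the aggregation step, which is the property that makes this family of methods attractive for clinical data.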

4 - Create an App on FeatureCloud

Register on FeatureCloud

  1. Go to https://featurecloud.ai/ and click on the login button in the upper right corner.
  2. Click on Sign up.
  3. Make sure that you register as “App Developer”. If you want to publish, select the respective option.

Add an App

  1. Go to https://featurecloud.ai/app-store.
  2. Click on the Development menu option.
  3. Click on Add App.
  4. Fill in the details of your app and choose an image name. Note that no frontend is needed for our purposes and the URL link may direct to our GitHub repository.

Publish an App

  1. Prerequisites
    1. Install the FeatureCloud pip package: pip install featurecloud
    2. Start the controller: featurecloud controller start
  2. Implement your application
    1. Create and implement an application based on a template:
      featurecloud app new --template-name=app-blank app-blank
    2. Build your application:
      featurecloud app build ./app-blank my-app
    3. Test your application with Testbed:
      featurecloud test start --controller-host=http://localhost:8000 --app-image=my-app --query-interval=1 --client-dirs=.,.
    Note: you may have to log in first with docker login featurecloud.ai.

Find an App

To see your app in the store, you must tick the respective option.

Run an App

  1. To run an app you created, click on the Projects tab.
  2. Click on Create and name your project.
  3. Click on the blue button with the arrow next to your app. Make sure you have pushed an updated image of your app.

https://featurecloud.ai/developers

https://github.com/FeatureCloud/FeatureCloud

https://featurecloud.ai/assets/developer_documentation/getting_started.html

5 - Useful Background Knowledge

5.1 - Introduction to Genetics

Once Upon a Time … Life: The Cell

Video: Once Upon A Time...Life: The Cell

Source: Abandoned Tube, https://youtu.be/V1hAgh77v9U?si=2GTLDG30aJmjM4t8&t=374

Here you can find a brief introduction to the various topics involved in the challenge.

Image Source: http://personal.cityu.edu.hk/liangdai/post/central-dogma-translation-transcription/

The Central Dogma

In 1958, Francis Crick proposed the Central Dogma of molecular biology. This principle outlines the flow of genetic information through a biological system: information stored as DNA is transcribed into RNA, which is then translated into proteins.

DNA Replication

Before cells divide, they must replicate their DNA to ensure that each new cell receives a complete set of genetic information. DNA replication is a highly accurate process that involves unwinding the DNA molecule and synthesizing new strands complementary to the original strands.

Transcription

Transcription is the process by which the genetic information encoded in DNA is copied into a complementary RNA molecule. This process takes place in the cell nucleus and is carried out by an enzyme called RNA polymerase. The resulting RNA molecule, known as messenger RNA (mRNA), serves as a template for protein synthesis.

Translation

Translation is the process by which the genetic information carried by mRNA is decoded to produce a specific sequence of amino acids, which are the building blocks of proteins. This process takes place in the ribosomes, cellular structures composed of RNA and protein. Transfer RNA (tRNA) molecules bind to specific amino acids and deliver them to the ribosome, where they are joined together to form a polypeptide chain, or protein.
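
Transcription and translation can be illustrated with a few lines of Python. This is a simplified sketch: the DNA sequence is made up, the codon table lists only the codons needed for the example (a full table has 64 entries), and processing steps such as splicing are ignored.

```python
# Partial standard codon table: only the codons used in this example.
CODON_TABLE = {
    "AUG": "Met", "GCC": "Ala", "UGG": "Trp", "AAA": "Lys",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_strand):
    """The mRNA carries the coding strand's sequence with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" = not in table
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


dna = "ATGGCCTGGAAATAA"        # made-up coding sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

Each group of three nucleotides is read as one codon, so the 15-base sequence yields four amino acids followed by a stop signal.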

Gene

A gene is a segment of DNA that contains instructions for building and maintaining an organism. Genes are the basic units of heredity, passed down from parents to offspring, and they play a crucial role in determining an organism’s traits and characteristics.

Genes typically consist of two main sections: coding regions and non-coding regions. Coding regions, also known as exons, contain the instructions for building proteins. These regions are transcribed into messenger RNA (mRNA) which serves as a template for protein synthesis, with each set of three nucleotides (codon) coding for a specific amino acid, the building block for proteins.

Non-coding regions include introns, which are intervening sequences within genes that are spliced out during mRNA processing, and regulatory regions, which play crucial roles in controlling gene expression. Regulatory regions contain sequences that serve as binding sites for transcription factors, proteins that regulate the initiation of transcription. By binding to specific DNA sequences, transcription factors can modulate the expression of nearby genes, influencing their activity levels.

Gene Expression

Gene expression refers to the process by which the information stored in a gene is used to create a functional product, such as a protein. This process involves transcription of DNA into RNA and translation of RNA into protein. Gene expression includes a series of tightly regulated steps that control when, where, and how much of a particular gene’s product (usually a protein) is produced. This regulation is crucial for maintaining the proper functioning of cells and tissues in an organism. Factors such as environmental cues, developmental stage, and cell type can influence gene expression patterns.

Disruptions in gene expression can lead to abnormal levels or functions of proteins, which can contribute to disease development. For example, mutations in regulatory regions of genes can alter the timing or amount of gene expression, leading to overproduction or underproduction of a particular protein. Similarly, mutations within the coding regions of genes can result in defective proteins or proteins with altered functions, which can disrupt normal cellular processes and contribute to disease phenotypes.

A mutation in the coding part of a gene can have various consequences, ranging from premature termination of protein synthesis to a frameshift in the coding sequence, either of which can result in a nonfunctional or severely altered protein.
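
The two consequences named above can be demonstrated on a toy sequence. In this sketch the reference sequence is made up; a single base substitution creates a premature stop codon, and a single base deletion shifts the reading frame so that every downstream codon changes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq):
    """Split a coding sequence into complete triplets, dropping leftovers."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def first_stop(codon_list):
    """Index of the first stop codon, or None if the frame has no stop."""
    for index, codon in enumerate(codon_list):
        if codon in STOP_CODONS:
            return index
    return None


reference = "ATGGCCTGGAAATAA"                       # made-up sequence
nonsense = reference[:8] + "A" + reference[9:]      # TGG -> TGA (stop)
frameshift = reference[:3] + reference[4:]          # delete one base

for name, seq in [("reference", reference), ("nonsense", nonsense),
                  ("frameshift", frameshift)]:
    triplets = codons(seq)
    print(name, triplets, "first stop at codon", first_stop(triplets))
```

In the nonsense example the protein would be cut short after two amino acids, while in the frameshift example the original stop codon is no longer in frame.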

5.2 - Omics Data and Personalized Medicine

Omics data refers to large-scale data generated from high-throughput techniques that study various biological components on a comprehensive scale.​

The term “omics” is derived from disciplines such as genomics, transcriptomics, proteomics, metabolomics, and others, each focusing on different types of biological molecules.​

Genomics: Genomics involves the study of an organism’s entire genome, including its genes and their functions, as well as interactions between genes and other elements within the genome.​

Transcriptomics: Transcriptomics focuses on the study of all RNA molecules present in a cell or tissue at a given time, providing insights into gene expression patterns and regulation.​

Proteomics: Proteomics involves the study of all proteins present in a cell, tissue, or organism, including their structures, functions, and interactions.​

Metabolomics: Metabolomics aims to identify and quantify all small-molecule metabolites present in a biological sample, providing insights into cellular processes and metabolic pathways.​

Integrating multiple omics datasets allows for a holistic characterization of individual patients and their unique molecular profiles. By integrating omics data with clinical data, electronic health records, and other relevant information, healthcare providers can develop personalized treatment plans tailored to each patient’s specific needs, preferences, and genetic makeup.

Currently, our study incorporates genomics and proteomics data. Transcriptomics and metabolomics will also be covered in the near future.

5.3 - Human Phenotype Ontology

The Human Phenotype Ontology (HPO) is a standardized vocabulary and framework for describing phenotypic abnormalities observed in human diseases in terms of clinical features (symptoms) and other observable characteristics, associated with genetic disorders and other medical conditions.

The HPO can be used to support differential diagnostics, translational research, and applications in computational biology by providing the means to compute over the clinical phenotype. The HPO is being used for computational deep phenotyping and precision medicine as well as integration of clinical data into translational research.

HPO terms are organized hierarchically, with more specific terms nested under broader categories. Each term is assigned a unique identifier and includes synonyms, definitions, and relationships to other terms within the ontology.

Example of an HPO term

HPO Term: Microcephaly (HP:0000252)

Definition: A condition characterized by a smaller than normal head circumference.

Hierarchy: Microcephaly is a subtype of “abnormality of head or neck” and is more specific than the broader term “abnormality of head size.”
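
Because HPO terms form a hierarchy, a specific term implies all of its broader ancestors, which is useful when comparing patients annotated at different levels of detail. The sketch below walks a toy hierarchy built from the term names in the example above. It is deliberately simplified, is not copied from the actual ontology file, and uses names instead of HPO identifiers.

```python
# Toy child -> parents mapping; a term may have more than one parent.
PARENTS = {
    "Microcephaly": ["Abnormality of head size"],
    "Macrocephaly": ["Abnormality of head size"],
    "Abnormality of head size": ["Abnormality of head or neck"],
    "Abnormality of head or neck": ["Phenotypic abnormality"],
    "Phenotypic abnormality": [],
}


def ancestors(term, parents):
    """Collect every broader term reachable from the given term."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


shared = ancestors("Microcephaly", PARENTS) & ancestors("Macrocephaly", PARENTS)
print(sorted(shared))
```

Two patients with different specific terms can still be recognized as similar through the ancestors their terms share.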

More info: https://hpo.jax.org/app/

5.4 - The ICD-10 System

The International Classification of Diseases (ICD) is a standardized system used worldwide for classifying diseases, health conditions, and related factors. ICD-10, the tenth revision, provides a comprehensive framework for organizing and categorizing diseases and health conditions based on their etiology, anatomical location, severity, and other relevant factors. Its successor, ICD-11, came into effect in 2022, but ICD-10 remains in wide use.

ICD-10 codes are alphanumeric codes that represent specific diseases, conditions, and medical procedures. These codes are used in healthcare settings for billing, record keeping, reporting, analysis, and clinical decision-making.

Image source: https://images.app.goo.gl/pgSzs4eVc8zvm5yW9

For more information, please visit: https://www.who.int/classifications/classification-of-diseases.

5.5 - Combined Annotation Dependent Depletion

The Combined Annotation Dependent Depletion (CADD) score is a numerical measure used in genetics to predict the deleteriousness, or harmfulness, of genetic variants.

Using machine learning models, CADD combines genomic features derived from surrounding sequence context, gene model annotations, evolutionary constraint, epigenetic measurements and functional predictions to estimate the likelihood that a given genetic variant will have a harmful effect on protein function or lead to a disease phenotype.

Imagine you’re a scientist studying genetic mutations in a particular gene associated with a rare disease. You’ve identified a mutation located in a critical region of the gene that codes for an essential protein. Now you want to assess its potential impact on protein function and disease risk.

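After running the variant through the CADD tool, you obtain a PHRED-scaled CADD score of 25. CADD scores are scaled logarithmically: a score of 10 or higher places a variant among the top 10% most deleterious possible variants in the human genome, 20 or higher among the top 1%, and 30 or higher among the top 0.1%. A score of 25 therefore places the variant among roughly the top 0.3%, suggesting a high likelihood of a harmful effect or a contribution to disease.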
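
Scaled CADD scores are PHRED-like: the score equals -10 times the base-10 logarithm of the fraction of possible variants ranked at least as deleterious. The following small sketch converts between the two; the function names are illustrative and not part of any CADD tooling.

```python
import math


def cadd_top_fraction(phred_score):
    """Fraction of possible variants ranked at least this deleterious."""
    return 10 ** (-phred_score / 10)


def cadd_phred(top_fraction):
    """Scaled score corresponding to a given top fraction."""
    return -10 * math.log10(top_fraction)


for score in (6, 10, 20, 25, 30):
    print(score, f"top {cadd_top_fraction(score):.2%}")
```

A score of 25 corresponds to roughly the top 0.3% of variants, and a threshold of 6 corresponds to roughly the top 25%.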

For more information, please visit: https://cadd.gs.washington.edu/.

5.6 - Clinical Laboratory Data

The clinical laboratory data includes results from the blood and urine tests.

Blood Test

The most routine blood test is the complete blood count (CBC), which measures the major cellular components of the blood, including red blood cells (RBCs), white blood cells (WBCs), and platelets, as well as hemoglobin and hematocrit. It provides valuable information about the overall health and functioning of the blood and can help detect a wide range of conditions, such as anemia, infections, inflammation, and bleeding disorders.

Another very common test is the basic metabolic panel, which is a group of tests that measure different naturally occurring chemicals in the blood. This is carried out on the plasma part of the blood, and it measures the levels of glucose, electrolytes (such as sodium, potassium, and calcium) and kidney function markers (such as creatinine and blood urea nitrogen). These components provide insights into organ function, metabolic status, and risk factors for certain diseases.

The lipoprotein panel, also known as the lipid profile, measures the levels of LDL and HDL cholesterol and triglycerides, indicating the risk of cardiovascular diseases and other conditions.
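
Abnormal laboratory findings can be turned into phenotype terms by comparing each value with a reference range. The sketch below shows the principle only. The reference ranges are illustrative placeholders (real ranges depend on age, sex, and laboratory), the parameter names are hypothetical, and the mapping to term names is an assumption rather than the pipeline used for the challenge data.

```python
# Illustrative placeholder ranges, NOT clinical reference values.
REFERENCE_RANGES = {
    "hemoglobin_g_dl": (11.5, 15.5),
    "platelets_10e9_l": (150.0, 450.0),
}

# Hypothetical mapping from (parameter, direction) to a phenotype term name.
ABNORMAL_TERMS = {
    ("hemoglobin_g_dl", "low"): "Anemia",
    ("platelets_10e9_l", "low"): "Thrombocytopenia",
    ("platelets_10e9_l", "high"): "Thrombocytosis",
}


def abnormal_terms(results):
    """Return phenotype term names for values outside the reference range."""
    terms = []
    for parameter, value in results.items():
        if parameter not in REFERENCE_RANGES:
            continue  # no range known for this parameter
        low, high = REFERENCE_RANGES[parameter]
        if value < low:
            direction = "low"
        elif value > high:
            direction = "high"
        else:
            continue  # within range: no abnormal finding
        term = ABNORMAL_TERMS.get((parameter, direction))
        if term is not None:
            terms.append(term)
    return terms


patient_results = {"hemoglobin_g_dl": 9.8, "platelets_10e9_l": 510.0}
print(abnormal_terms(patient_results))
```

A subject whose values all fall within range yields an empty list, which matches the observation that some subjects have no lab-derived terms.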

Image source: https://images.app.goo.gl/PMBVKkdZVdLHGoga6

Urine Test

A routine urine test, also known as a urinalysis, examines the physical, chemical, and microscopic properties of urine. It includes visual, chemical and microscopic examinations.

The appearance of the urine (its clarity and color) can indicate the presence of blood, proteins, and certain drugs.

The chemical test involves dipping a test strip with chemical pads into the urine. The pads change color to indicate the presence or levels of components such as glucose, ketones, proteins, bilirubin, blood, nitrites, leukocytes, and erythrocytes, as well as the pH and concentration of the urine.

The microscopic examination involves viewing drops of concentrated urine under a microscope to detect the presence of crystals, casts (tube-shaped proteins), pathogens, red blood cells, white blood cells and epithelial cells. These can provide additional diagnostic information.