1. Introduction

ARCoBAleno is a tool for automated coloring of PathVisio GPML files, with a focus on evolution studies. Its main objective is to color a pathway’s gene data nodes according to ancestrality data for those genes. This ancestrality data may be the taxon along the evolutionary lineage in which the gene itself first appeared, or the origin of the gene’s most recently acquired molecular function(s). After determining the ancestrality data associated with each data node, ARCoBAleno colors the pathway one ancestrality level at a time, producing a collection of snapshots that, when viewed in succession, reveal how that system evolved.

2. Registering and submitting jobs

Jobs can be submitted to ARCoBAleno as a guest, without registration. When doing so, take note of the process identifier number (PID) displayed in the status area. This number can be entered in the PID retrieval area on the right to check the status of the submission. Please be aware that only valid PIDs from guest submissions can be checked this way, and that anybody can retrieve the results of a job submitted as a guest. When checking a PID in the retrieval area, one of four responses can appear: 1) Invalid PID, if the value does not belong to a previously guest-submitted job; 2) Queued, if the job was successfully submitted but has not started being processed yet; 3) Running, if the job is currently being processed by ARCoBAleno; and 4) Download Ready, when the job has finished running. Click the Download Ready button to download the results.

Registration in ARCoBAleno can be quickly completed on the registration page by filling in and submitting the form. When logged in, you’ll have access to your job table, which displays all your previously submitted jobs, along with their chosen parameters and completion status. When submitting a new job, the following parameters must be filled in:

a) GPML File: This is the PathVisio GPML file containing the biological pathway to be colorized.

b) NCBI Taxonomy ID: This is the unique NCBI identifier of the organism to which your pathway belongs. Currently, only human (ID 9606) is supported.

c) Use AutoLCA method: Check this box if you want ARCoBAleno to determine the ancestrality data for each gene. You can optionally provide a list of UniProt identifiers for the genes on the pathway. If you do, the UniProt identifiers must be in the first column and the gene names in the second, separated by a tab character. Make sure each gene name exactly matches its counterpart on the pathway. If the AutoLCA option is checked but no UniProt list is provided, ARCoBAleno will query the UniProt database and determine the best hit for each gene on the pathway, and a report of this procedure will be included with the results. If you already have precomputed LCA data, you can uncheck this box and submit a tabular file containing two columns: the first with the name of each gene (exactly as it appears in the pathway) and the second with an integer from 1 to 31, corresponding to the ancestrality level of that gene along the human lineage.

d) Coloring mode: If the AutoLCA box is checked, choose which mode of ancestrality ARCoBAleno should use. “Gene LCA (via Seed Server)” determines the origin of each gene itself along the human lineage. “Last Molecular Function LCA” and “Last Biological Process LCA” determine the origin of the function most recently acquired by each gene, or of the most recent process in which it is involved. “Number of Biological Processes” colorizes the genes on a heat scale, according to the number of Biological Processes in which they are involved. See below for details on each method.

e) Cumulative legend: Check this option if you want the legends of previous levels to remain colored. If unchecked, only the legend for the current level will be colorized.

f) Auto-generate legend: Check this option if you want ARCoBAleno to add legend nodes automatically. Note that a crash might occur if nodes with legend taxon names are already present on the pathway and this option is checked. Choose whether the legend should be inserted vertically, to the right of the pathway, or horizontally, below the pathway.
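Before uploading a precomputed LCA file (option c, with the AutoLCA box unchecked), it can help to check it locally. The following is a minimal sketch, not part of ARCoBAleno: the function name read_gene_lca is hypothetical, the tab delimiter is an assumption (the text only says "tabular"), and the gene names and levels in the example are placeholders.

```python
import csv
import io

MIN_LEVEL, MAX_LEVEL = 1, 31  # range of ancestrality levels stated for option c


def read_gene_lca(handle):
    """Read a two-column gene/level table into a dict, validating each row.

    Assumption: columns are separated by a tab character.
    """
    levels = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or all(not cell.strip() for cell in row):
            continue  # ignore blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        gene, raw_level = row[0].strip(), row[1].strip()
        try:
            level = int(raw_level)
        except ValueError:
            raise ValueError(
                f"line {line_no}: level {raw_level!r} is not an integer"
            ) from None
        if not MIN_LEVEL <= level <= MAX_LEVEL:
            raise ValueError(
                f"line {line_no}: level {level} is outside {MIN_LEVEL}-{MAX_LEVEL}"
            )
        levels[gene] = level
    return levels


# Placeholder content; real files list the gene names exactly as on the pathway.
example = io.StringIO("GENE_A\t19\nGENE_B\t17\n")
print(read_gene_lca(example))
```

With a real file, pass an open file handle instead of the io.StringIO example, for instance open("my_lca.tsv", newline="").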

3. Determining the ancestrality and colorizing the pathway

3.1: Getting the UniProt ID for each gene

When performing automated coloring, ARCoBAleno must first determine the UniProt ID for each node on the pathway. Only “Gene Products” or “Data Nodes” entries on the pathway are considered, to prevent false attribution to Labels, Metabolites and other types of nodes. After the GPML file is parsed, ARCoBAleno reads the text labels of these nodes and queries them against the UniProt database, using the provided NCBI Taxonomy ID as a filter. The UniProt entry with the best score for each query is selected, and a log listing which UniProt entry was selected for each gene name is generated and provided with the results. If a gene is matched to the wrong entry, consider renaming it on the pathway to better fit its corresponding UniProt symbol.
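To preview which labels would be sent to UniProt, the gene nodes of a GPML file can be listed locally. This is a rough sketch that only approximates the node filtering described above, not ARCoBAleno's own parser. It assumes the common GPML layout in which nodes are DataNode elements carrying TextLabel and Type attributes, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}DataNode'."""
    return tag.rsplit("}", 1)[-1]


def gene_labels(gpml_text, accepted_types=("GeneProduct",)):
    """Return the text labels of DataNode elements whose Type is accepted.

    Assumption: Label elements and DataNodes of other types (for example
    Metabolite) are the nodes to be skipped.
    """
    root = ET.fromstring(gpml_text)
    labels = []
    for elem in root.iter():
        if local_name(elem.tag) != "DataNode":
            continue
        if elem.get("Type") not in accepted_types:
            continue
        label = (elem.get("TextLabel") or "").strip()
        if label:
            labels.append(label)
    return labels


# Tiny hand-written document for illustration; real GPML files carry
# Graphics and Xref children that this sketch does not need.
sample = """<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="demo">
  <DataNode TextLabel="CLDN5" GraphId="a1" Type="GeneProduct"/>
  <DataNode TextLabel="glucose" GraphId="a2" Type="Metabolite"/>
  <Label TextLabel="Eukaryota" GraphId="a3"/>
</Pathway>"""
print(gene_labels(sample))
```

Comparing this list with the log returned by ARCoBAleno is a quick way to spot labels that may need renaming.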

3.2: Gene LCA (via Seed Server)

When the Gene LCA option is selected, ARCoBAleno determines the taxon of origin of each gene on the pathway. To do so, it uses Seed Server, an algorithm that receives a UniProt ID, obtains its protein sequence and uses PSI-BLAST and UniRef50 enrichment to generate a cluster of orthologues for that gene. The Taxonomy IDs of the orthologues in the cluster are retrieved and used to find their lowest common ancestor, that is, the taxon from which all of them descend.
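The final step, finding the lowest common ancestor of the taxa in a cluster, can be illustrated with a short sketch. It assumes each orthologue's taxon has already been expanded into a root-to-leaf lineage. The function name is hypothetical, the lineages are heavily abbreviated for readability, and Seed Server itself may compute this differently.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages.

    Returns None if the list is empty or the lineages share no root.
    """
    if not lineages:
        return None
    common = None
    # zip() walks the lineages rank by rank and stops at the shortest one.
    for ranks in zip(*lineages):
        if all(taxon == ranks[0] for taxon in ranks):
            common = ranks[0]
        else:
            break
    return common


# Abbreviated lineages, for illustration only.
human = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Homo sapiens"]
mouse = ["cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa", "Mus musculus"]
yeast = ["cellular organisms", "Eukaryota", "Opisthokonta", "Fungi",
         "Saccharomyces cerevisiae"]

print(lowest_common_ancestor([human, mouse]))
print(lowest_common_ancestor([human, mouse, yeast]))
```

A gene whose cluster contains only animal orthologues would therefore be dated to a more recent level than one whose cluster also includes fungal sequences.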

3.3: Last Molecular Function LCA & Last Biological Process LCA

Molecular Function is a subset of the Gene Ontology (GO), comprising terms related to biochemical-level activities performed by gene products, such as binding to other molecules or catalyzing a reaction. Biological Process is another subset of GO, whose terms describe complex biological phenomena and regulatory systems that require several genes. By counting the number of distinct proteins annotated to each GO term at each level of the taxonomic lineage of Homo sapiens, we attributed a lowest common ancestor taxon to each term, choosing the most recent level among those with the highest count. This allowed us to create a local database containing the ancestrality of every molecular activity and biological process based on the Gene Ontology annotation. A gene can have any number of GO terms annotated to it, each with its own ancestrality. The Last Molecular Function LCA and Last Biological Process LCA options colorize each node with the color of the level at which its most recently acquired function/process appeared, allowing for a visualization of how each gene evolved in terms of biochemical activity or involvement in biological systems over time.
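The two selection rules above can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the local database: levels are assumed to be numbered from oldest to most recent, so a larger number means a more recent level, and the function names and term identifiers are hypothetical.

```python
def term_lca(counts_by_level):
    """Pick the level of a GO term from its per-level protein counts.

    counts_by_level[i] is the number of distinct proteins annotated to the
    term at level i (assumption: index 0 is the oldest level). Among the
    levels tied for the highest count, the most recent one is returned.
    """
    best = max(counts_by_level)
    return max(i for i, count in enumerate(counts_by_level) if count == best)


def last_acquired_level(gene_terms, term_levels):
    """Return the most recent level among a gene's annotated terms.

    Terms missing from term_levels are ignored; None means that no term
    of the gene has a known ancestrality.
    """
    known = [term_levels[term] for term in gene_terms if term in term_levels]
    return max(known) if known else None


# Placeholder data: the counts and term names are invented for illustration.
print(term_lca([3, 10, 10, 7]))
print(last_acquired_level(["term_a", "term_b"], {"term_a": 2, "term_b": 5}))
```

If the level numbering runs the other way in your data, replace max with min in both functions.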

3.4: Number of Biological Processes

ARCoBAleno can also use the local GO database to count how many distinct and unrelated Biological Process terms are annotated to each gene on the pathway. This count is used to colorize each node on a heat scale, showing which elements are involved in more processes (and are thus more generalized players) and which are more specific to the system depicted, being involved in fewer processes. The color scheme goes from red to blue and follows these classes: "129-256", "65-128", "33-64", "17-32", "9-16", "5-8", "3-4", "1-2" or "None" (painted white).
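The mapping from a count to one of these classes can be written as a small lookup. The sketch below uses a hypothetical function name and covers only the classes listed above. The text does not say how counts above 256 are handled, so the sketch raises an error for them instead of guessing.

```python
# (lowest count, highest count, class label), from fewest to most processes
CLASSES = [
    (1, 2, "1-2"),
    (3, 4, "3-4"),
    (5, 8, "5-8"),
    (9, 16, "9-16"),
    (17, 32, "17-32"),
    (33, 64, "33-64"),
    (65, 128, "65-128"),
    (129, 256, "129-256"),
]


def process_class(count):
    """Return the heat-scale class for a number of Biological Process terms."""
    if count < 0:
        raise ValueError("count cannot be negative")
    if count == 0:
        return "None"  # genes with no annotated process are painted white
    for low, high, label in CLASSES:
        if low <= count <= high:
            return label
    # Not covered by the documented classes; behavior here is an assumption.
    raise ValueError(f"count {count} exceeds the largest documented class")


for n in (0, 1, 4, 20, 256):
    print(n, process_class(n))
```

Each class after the first two spans twice as many counts as the one before it, so the scale is roughly logarithmic.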

3.5: Attributing colors

After ARCoBAleno has generated the ancestrality information for each gene on the pathway, it iterates over each relevant level, colorizing all nodes whose ancestrality belongs to that level. At this step, ARCoBAleno also colorizes Label nodes whose text matches the name of each LCA level. Any gene data node whose ancestrality could not be determined is given a red border and red text, while its fill color remains white.
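The iteration over levels can be pictured with the sketch below. It rests on several assumptions that the text does not confirm: levels are visited in ascending order, nodes colored in an earlier snapshot stay colored in later ones, and undetermined nodes are styled from the first snapshot onward. The function name, the palette and the hex color strings are all hypothetical.

```python
# Style for genes without ancestrality: white fill, red border and text.
UNDETERMINED = {"fill": "FFFFFF", "border": "FF0000", "text": "FF0000"}


def plan_snapshots(gene_levels, palette):
    """Return a list of (level, {gene: style}) pairs, one per relevant level.

    gene_levels maps gene name -> level, or None if undetermined.
    palette maps level -> fill color (hypothetical hex strings).
    """
    relevant = sorted({lvl for lvl in gene_levels.values() if lvl is not None})
    styles = {
        gene: dict(UNDETERMINED)
        for gene, lvl in gene_levels.items()
        if lvl is None
    }
    snapshots = []
    for level in relevant:
        for gene, lvl in gene_levels.items():
            if lvl == level:
                styles[gene] = {
                    "fill": palette[level],
                    "border": "000000",
                    "text": "000000",
                }
        # Copy so that later levels do not alter earlier snapshots.
        snapshots.append((level, dict(styles)))
    return snapshots


# Placeholder genes, levels and colors.
levels = {"GENE_A": 3, "GENE_B": 7, "GENE_C": None}
colors = {3: "3366FF", 7: "33CC66"}
for level, styled in plan_snapshots(levels, colors):
    print(level, sorted(styled))
```

Only levels that actually occur among the genes produce a snapshot, which matches the idea of iterating over relevant levels rather than all 31.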

4. Samples

Blood-Brain Barrier Pathway file: Download

Blood-Brain Barrier UniProt list file: Download

Blood-Brain Barrier Gene-LCA file: Download