GSoC - 2023 : Wrapped

Final Report - Comprehensive Summary of my Contributions

Amartya Nambiar


August 25, 2023

After the results for Google Summer of Code (GSoC) 2023 were released, my mentors Martin Beracochea and Sandy Rogers remained in close contact with me during the community bonding period. The coding period followed, which I've structured into three phases to match the three projects I undertook as part of my contributions.

My contribution at EMBL-EBI for GSoC mainly involved creating visualizations to support comparative metagenomics. I used Vega-Lite, which provides a common grammar for describing visualizations. This made the chart code reusable in both React and Jupyter Notebooks, regardless of programming language.

Phase 1 30th May - 30th Jun

: : : : : : : : : : : : : : : : : :

Phase 2 1st Jul - 14th Aug

  • The second phase dealt with exploring and building visualizations for the MGnify website.

Phase 2A 1st Jul - 16th Jul

  • This phase focused on finding visualization solutions that could work seamlessly in both JavaScript and Python environments.

  • Packages such as Plotly and Highcharts were tested, with a preference for Highcharts because it was already used on the MGnify site.

  • However, Highcharts posed a problem: it could not reference data by link (URL), which hurt interoperability between React and Jupyter, especially with large datasets.

  • This led to the idea of a common visualization grammar, which would make it easy to port visualizations to Jupyter notebooks, where users could explore them further and plug in their own data.

  • Alternatives like pandas-js, Danfo.js, and Vega-Lite were considered.

  • In the end, the Vega ecosystem was selected due to its compatibility and ability to handle large datasets, ensuring seamless interoperability.
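
The interoperability argument above can be illustrated with a minimal sketch. A Vega-Lite specification is plain JSON, so the same document can be handed to a React component or to a notebook, and its `data.url` field lets the chart reference a dataset by link instead of embedding it. The URL, the field names, and the `bar_chart_spec` helper below are hypothetical placeholders, not MGnify endpoints or code from the project.

```python
import json

# Hypothetical data location, for illustration only (not a real MGnify endpoint).
DATA_URL = "https://example.org/analysis/qc-summary.csv"


def bar_chart_spec(data_url: str, category: str, value: str) -> dict:
    """Build a Vega-Lite horizontal bar chart that loads its data from a URL."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        # The chart only stores a link; the rows are fetched when it renders.
        "data": {"url": data_url},
        "mark": "bar",
        "encoding": {
            "y": {"field": category, "type": "nominal"},
            "x": {"field": value, "type": "quantitative"},
        },
    }


# "step" and "reads" are assumed column names in the hypothetical CSV.
spec = bar_chart_spec(DATA_URL, "step", "reads")

# The serialized JSON is what either environment would consume.
spec_json = json.dumps(spec, indent=2)
```

Because the specification is just a JSON document, nothing in it is tied to JavaScript or Python; only the renderer differs between the two environments.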

. . .

Phase 2B 17th Jul - 14th Aug

  • This phase focused on developing visualizations using the Vega framework.

  • Also, I had the privilege of meeting the EBI team, where I discussed my GSoC experience, shared progress updates, and conducted a live Vega-Lite demo to showcase its capabilities. (Slide Deck)

  • The visualisations successfully generated using Vega are:

| Name | Visualisation Type | Component used (File) |
|---|---|---|
| Number of Sequence Reads per QC step | Horizontal Bar Chart (uses tsx preprocessed data) | QCChart.tsx |
| Reads Length Histogram | Area Chart + Std Deviation band | QC-chart-components/VConcatTop.tsx |
| Reads GC Distribution | Area Chart + Std Deviation band | QC-chart-components/VConcatTop.tsx |
| Reads Length (Min, Avg, Max) | Horizontal Bar Chart | QC-chart-components/VConcatBottom.tsx |
| Reads GC-AT Content (%) | Horizontal Stacked Bar Chart | QC-chart-components/VConcatBottom.tsx |
| Nucleotide Position Histogram | Stacked Area Chart | NucleotidesHistogram.tsx |
| Interpro Sequence feature summary | Horizontal Bar Chart | InterproBar.tsx |
| GO Terms | Concatenated Horizontal Bar Chart | GOBar.tsx |
| Domain + Phylum Composition | Multi-Colour Bar Chart | TaxBar.tsx |
| Pfam | Vertical Bar Chart | VerticalBar.tsx |
| KO | Vertical Bar Chart | VerticalBar.tsx |
| KEGG Module categories | Vertical Bar Chart | VerticalBar.tsx |
| antiSMASH gene clusters | Vertical Bar Chart | VerticalBar.tsx |
| COG Analysis | Vertical Bar Chart | VerticalBar.tsx |
| KEGG Class Analysis | Vertical Bar Chart | VerticalBar.tsx |
| KEGG Module Analysis | Vertical Bar Chart | VerticalBar.tsx |

  • One of my favorite charts was the Nucleotide Position Histogram:

[Embedded live Vega-Lite chart: Nucleotide Position Histogram]
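
The Nucleotide Position Histogram is listed above as a stacked area chart. As a rough sketch of how a chart of that kind might be written in Vega-Lite, the spec below stacks one area per base along the read position. The sample rows and the field names (`position`, `nucleotide`, `count`) are assumptions made for this example; the actual `NucleotidesHistogram.tsx` component may be structured differently.

```python
import json

# Hypothetical sample: how often each base occurs at two read positions.
counts_per_position = [(30, 20, 25, 25), (28, 22, 24, 26)]

# Long ("tidy") format: one row per (position, base) pair.
rows = [
    {"position": position, "nucleotide": base, "count": count}
    for position, counts in enumerate(counts_per_position, start=1)
    for base, count in zip("ACGT", counts)
]

spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": rows},
    "mark": "area",
    "encoding": {
        "x": {"field": "position", "type": "quantitative"},
        # "stack": "zero" places the areas on top of each other.
        "y": {"field": "count", "type": "quantitative", "stack": "zero"},
        # One coloured area per base.
        "color": {"field": "nucleotide", "type": "nominal"},
    },
}

spec_json = json.dumps(spec)
```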

: : : : : : : : : : : : : : : : : :

Phase 3 14th Aug

  • Under the guidance of Varsha Kale, I worked on her idea for the Study Summary feature.

  • Notably, I used Altair for this, a Python package for generating Vega-Lite visualizations.

  • I generated bar charts for GO-slim data and tree heatmap visualizations for the complete GO annotation data, reading the gene ontologies from ontological web formats to enable in-depth analysis.

  • Technologies Involved : Python + JupyterLab + Packages like Altair, graphviz, Owlready2
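
A tree heatmap over an ontology usually needs a value for every node, including parent terms that carry few direct annotations. One plausible preprocessing step, which is my illustration and not necessarily what the notebooks in this phase do, is to roll each term's count up to all of its ancestors. The sketch below uses a toy ontology with placeholder IDs (not real GO terms) and plain dictionaries instead of Owlready2.

```python
from collections import defaultdict

# Toy ontology: term -> list of parent terms. IDs are placeholders, not GO terms.
PARENTS = {
    "T:root": [],
    "T:process_a": ["T:root"],
    "T:process_b": ["T:root"],
    "T:a1": ["T:process_a"],
    # A term may have several parents, as in a real ontology graph.
    "T:shared": ["T:process_b", "T:a1"],
}


def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def roll_up(direct_counts, parents):
    """Add each term's direct count to itself and to each ancestor exactly once."""
    totals = defaultdict(int)
    for term, count in direct_counts.items():
        for node in ancestors(term, parents):
            totals[node] += count
    return dict(totals)


# Hypothetical direct annotation counts per term.
direct = {"T:a1": 4, "T:shared": 3, "T:process_b": 2}
totals = roll_up(direct, PARENTS)
```

Collecting ancestors in a set means a term with two routes to the root still contributes only once to each ancestor.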

Unexplored Horizons

Schema for Vega-Lite Visualizations

  • One of the best ideas was to create a common repo that would host the schema for Vega-Lite visualizations.

  • This schema could be utilized by both the frontend (React) and Jupyter Notebooks, facilitating deeper data analysis.

  • I believe this is the most important horizon to be explored with respect to Vega-Lite.
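
As a sketch of what such a shared schema could enable, the example below keeps a data-free chart template as JSON and attaches data only at the point of use: a URL on the website, or a user's own rows in a notebook. The template content, field names, URL, and the `bind_data` helper are all hypothetical; no such repository exists yet.

```python
import copy
import json

# A shared, data-free chart template as it might be stored in a common repo.
TEMPLATE_JSON = """
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "mark": "bar",
  "encoding": {
    "x": {"field": "term", "type": "nominal"},
    "y": {"field": "count", "type": "quantitative"}
  }
}
"""


def bind_data(template, *, url=None, values=None):
    """Return a copy of the template with either a data URL or inline rows."""
    if (url is None) == (values is None):
        raise ValueError("provide exactly one of url or values")
    spec = copy.deepcopy(template)  # never mutate the shared template
    spec["data"] = {"url": url} if url is not None else {"values": values}
    return spec


template = json.loads(TEMPLATE_JSON)

# Website-style usage: point the chart at a (hypothetical) hosted dataset.
from_url = bind_data(template, url="https://example.org/go-slim.json")

# Notebook-style usage: push in custom rows.
from_rows = bind_data(template, values=[{"term": "example term", "count": 12}])
```

Keeping the template free of data is a design choice: the chart definition is maintained in one place while each consumer decides where the data comes from.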

Vega-Lite Dashboard for Comparative Metagenomics

  • A Vega-Lite dashboard within MGnify comprising multiple interactive charts and input controls.

  • A dashboard of this kind would let users delve deeply into Comparative Metagenomics and explore complex data with ease.
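
To make the dashboard idea more concrete, here is a minimal sketch of a Vega-Lite spec with one input control driving two side-by-side charts. It relies on Vega-Lite v5 parameters bound to a dropdown and a filter transform that reads the parameter. The data URL, the field names (`phylum`, `sample`, `genus`, `abundance`), and the option list are assumptions for illustration only.

```python
import json

# Hypothetical dropdown options.
PHYLA = ["Proteobacteria", "Firmicutes", "Bacteroidetes"]


def dashboard_spec(data_url, options):
    """Two concatenated bar charts filtered by a shared dropdown parameter."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"url": data_url},
        # A variable parameter rendered as a <select> input next to the charts.
        "params": [
            {
                "name": "selectedPhylum",
                "value": options[0],
                "bind": {"input": "select", "options": options},
            }
        ],
        # Both charts only see rows matching the current dropdown value.
        "transform": [{"filter": "datum.phylum == selectedPhylum"}],
        "hconcat": [
            {
                "mark": "bar",
                "encoding": {
                    "x": {"field": "sample", "type": "nominal"},
                    "y": {"aggregate": "sum", "field": "abundance", "type": "quantitative"},
                },
            },
            {
                "mark": "bar",
                "encoding": {
                    "x": {"field": "genus", "type": "nominal"},
                    "y": {"aggregate": "sum", "field": "abundance", "type": "quantitative"},
                },
            },
        ],
    }


spec = dashboard_spec("https://example.org/abundance.csv", PHYLA)
spec_json = json.dumps(spec, indent=2)
```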

Pre-generated KRONA Charts

  • Another idea was to pre-generate KRONA charts for Taxonomic Analysis under the Study Analysis Summary.

  • Once generated, these charts could be integrated into Jupyter Notebooks and, together with the GO analyses carried out in Phase 3, provide a complete Analysis Summary of a Study.