Biotech · Pharma

From scattered assay results
to a unified discovery pipeline

Bespoke cloud data pipelines for proteomics-led drug discovery — connecting internal assays, public data and best-in-class tools, fitted to the way your science team actually works.

The status quo:

  1. Experimental data scattered across instruments, formats and folders: every analysis starts with a hunt.
  2. Off-the-shelf platforms force a choice: bend the science to fit the tool, or live with manual workarounds.
  3. Scientists queue for data support, slowing cycle times when speed matters most.
  4. Critical insights buried in spreadsheets, hard to revisit between experiments.

What we believe

Discovery data infrastructure
should fit the science,
not the other way around.

The teams pushing the frontiers of drug discovery don’t run a standard workflow. They run experiments that don’t exist in commercial software, blend public and proprietary data in non-obvious ways, and refine their methods continuously.

Off-the-shelf platforms force a choice: bend the science to fit the tool, or carry the cost of manual workarounds that don’t scale. We believe the right answer is neither.

A small, well-designed pipeline — built around the team’s actual experiments and the team’s actual mental model — pays back many times over. Scientists move from result to insight without waiting in line. The platform team focuses on real engineering instead of ad-hoc requests. The science compounds.

The cost of building infrastructure has collapsed. Discovery teams have the most to gain.

Case — Cloud Data Pipeline · Covalent Chemoproteomics

A unified pipeline for an ACE discovery platform

Matchpoint Therapeutics is a Boston-based biotech building precision covalent medicines through its Advanced Covalent Exploration (ACE) platform — integrating chemoproteomics, machine learning and covalent chemistry library evolution. A high volume of non-standard experiments and tight discovery cycles meant the platform team needed data infrastructure that could keep pace with the science, not constrain it.

Together with Matchpoint’s science and platform-technology teams, we designed the pipeline in a series of workshops and then implemented it piece by piece in their Google Cloud environment. In-house assay results, public data and computational tools, including custom Fortran code, flow into a unified data lake and warehouse. Web-based assistants guide ingestion and quality control, custom dashboards reflect Matchpoint’s specific way of looking at the data, and external annotations are layered in automatically.

Twelve weeks from kick-off workshop to first delivery. The pipeline now runs in Matchpoint’s own cloud, fully owned by the internal team.

Pipeline architecture
Drug-discovery proteomics data pipeline

  • Stage 01, Ingest: internal assay results · public databases · external tools & code · any format welcomed
  • Stage 02, Curate: guided ingestion UI · quality checks built in · standards by design · expert input where it matters
  • Stage 03, Store & Compute: data lake → warehouse · reproducible workflows · your cloud, your control · plug in any analysis
  • Stage 04, Explore & Decide: custom dashboards · external annotations · self-serve for scientists · experiment → decision

Deployed in your cloud · co-designed with your science team

Four-stage pipeline. Sources fan in on the left, expert-guided curation enforces standards, the data lake and warehouse run in your own cloud, and custom dashboards land each experiment in a decision.
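
To make the Curate stage concrete, here is a minimal sketch of what a built-in quality check at ingestion could look like. It is an illustration under assumptions, not the code delivered in any engagement: the column names (`sample_id`, `protein_id`, `intensity`) and the rules are hypothetical, and in practice they are agreed with the science team.

```python
import math
from dataclasses import dataclass, field

# Hypothetical column names, chosen for illustration only.
REQUIRED_COLUMNS = ("sample_id", "protein_id", "intensity")


@dataclass
class CurationReport:
    accepted: list = field(default_factory=list)  # cleaned records
    rejected: list = field(default_factory=list)  # (row_index, reason) pairs


def _is_blank(value):
    """True for None, empty strings and whitespace-only strings."""
    return value is None or not str(value).strip()


def curate(rows):
    """Split raw assay rows into accepted records and rejected rows with a reason."""
    report = CurationReport()
    for index, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if _is_blank(row.get(c))]
        if missing:
            report.rejected.append((index, "missing: " + ", ".join(missing)))
            continue
        try:
            intensity = float(row["intensity"])
        except (TypeError, ValueError):
            report.rejected.append((index, "intensity is not numeric"))
            continue
        if not math.isfinite(intensity) or intensity < 0:
            report.rejected.append((index, "intensity is not a finite, non-negative number"))
            continue
        report.accepted.append({
            "sample_id": str(row["sample_id"]).strip(),
            "protein_id": str(row["protein_id"]).strip(),
            "intensity": intensity,
        })
    return report


if __name__ == "__main__":
    rows = [
        {"sample_id": "S1", "protein_id": "P04637", "intensity": "1520.5"},
        {"sample_id": "S2", "protein_id": "", "intensity": "880"},
        {"sample_id": "S3", "protein_id": "P38398", "intensity": "n/a"},
    ]
    report = curate(rows)
    print(len(report.accepted), "accepted;", report.rejected)
```

Rejected rows keep their index and a plain-language reason, so a guided ingestion interface can show scientists exactly what to fix instead of failing the whole upload.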

  1. Twelve weeks from kick-off to first delivery
  2. Experiment-to-insight in real time
  3. Cross-functional teams work independently of data support

It is a pleasure working with the idalab team on our data and machine learning pipeline. They are an outstanding strategic partner, collaborating seamlessly with our science team. Fast, clear communication, structured — yet always happy to adapt ad hoc, if necessary. We are looking forward to continuing the collaboration.
Suresh Singh, PhD · Senior Vice President, Computational Sciences · Matchpoint Therapeutics

Technology & Analysis Stack

Identification through to downstream integration — the workflows and tools we have built into pipelines for clients in the last few years.

Identification & quantification
  • Label-free quantification (LFQ): MaxQuant · FragPipe · Spectronaut
  • Data-independent acquisition (DIA): DIA-NN · Spectronaut · MaxDIA
  • Isobaric labelling (TMT, iTRAQ): FragPipe-TMT · IsobarQuant · MaxQuant
  • PTM site localisation: FragPipe-PTM · MSFragger · MaxQuant
  • Cross-linking MS (XL-MS): pLink · XlinkX · MeroX
Statistics & modelling
  • Differential abundance & testing: MSstats · limma · DEqMS · Perseus
  • Batch correction & normalisation: ComBat · RUV-III · vsn · median-MAD
  • Missing-value imputation: MissForest · MICE · MinDet
  • Time-course & longitudinal modelling: limma splines · lme4 · MEFISTO
  • ML for biomarker discovery: scikit-learn · XGBoost · SHAP · PyTorch
Downstream & integration
  • Pathway & gene-set enrichment: fgsea · clusterProfiler · Reactome · MSigDB
  • PPI & network analysis: STRING · IntAct · Cytoscape
  • Structure-prediction integration: AlphaFold 2/3 · ColabFold · ESMFold
  • Multi-omics integration: MOFA · mixOmics · DIABLO
  • Affinity & target deconvolution: SAINT · ProHits · mineCETSA · TPP-TR

How we work

  1. Co-designed with science. We work alongside your scientists and platform team from day one. Pipeline structure, dashboards and ingestion flows are shaped together — not handed over at the end.
  2. Built for the way you actually work. Dashboards mirror your team's specific analyses. Ingestion interfaces enforce your standards. Nothing generic, nothing forced.
  3. Deployed in your infrastructure. For maximum security and control, the pipeline lives in your cloud — Google Cloud, AWS, Azure or hybrid. Sealed off from external access where you need it.
  4. You own what we build. From data structure to interface code, everything is yours from delivery onward. We're happy to support your internal team taking over.

Clients

Roche · Arkuda Therapeutics · Bayer · Biotronik · Charité · Helios Kliniken · HotSpot Therapeutics · Kintiga · Kymera Therapeutics · Matchpoint Therapeutics · Schwind eye-tech · Sofinnova Partners

Frequently asked questions

What does an engagement look like?
We support you through the entire process. Working closely with the science team, we design the pipeline in a series of workshops. From there we implement it piece by piece in your cloud environment, keeping the science team in the loop throughout to ensure data quality and consistency.
How long does it take to build such a pipeline?
Twelve weeks from the kick-off workshop to delivery of the initial pipeline is a realistic baseline. From there, the pipeline is extended incrementally in sprints of two to four weeks.
What kind of data sources or computational tools can be integrated?
Anything goes — even custom tools written in Fortran can be brought into a 21st-century cloud pipeline.
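
As a rough illustration of how a legacy command-line tool can be brought into a Python pipeline step, the sketch below wraps an executable with the standard-library `subprocess` module. It is a simplified example under assumptions, not the integration built for any client: `run_legacy_tool` and the file name `input.txt` are hypothetical, and the demo uses the Python interpreter as a stand-in for a compiled Fortran binary.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_legacy_tool(command, input_text, timeout_s=300):
    """Write the input to a scratch file, run the tool on it, return its stdout.

    `command` is the executable plus any fixed arguments, for example
    ["/opt/tools/score_peptides"] (a hypothetical compiled Fortran binary).
    The path of the scratch input file is appended as the last argument.
    """
    with tempfile.TemporaryDirectory() as workdir:
        input_path = Path(workdir) / "input.txt"
        input_path.write_text(input_text)
        completed = subprocess.run(
            [*command, str(input_path)],
            capture_output=True,  # collect stdout and stderr instead of printing
            text=True,            # decode output as text
            timeout=timeout_s,    # fail instead of hanging the pipeline
            check=True,           # raise CalledProcessError on a non-zero exit code
        )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in for the legacy binary: a one-line script that upper-cases its input file.
    stand_in = [sys.executable, "-c",
                "import sys; print(open(sys.argv[1]).read().upper())"]
    print(run_legacy_tool(stand_in, "acdefghik"))
```

A wrapper of this shape can run inside a container or serverless function, so the original tool stays untouched while the pipeline handles inputs, outputs and failures around it.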
Can our own data engineering team operate this pipeline?
Definitely. We would love to team up with your data engineering team from day one, and actively support phasing us out when the time is right. Everything we build is yours.
What technology do you use?

Choices are made together with your team to fit existing infrastructure and skills. A common stack:

  • Programming language: Python
  • Web-app development: Streamlit
  • Data lake: cloud object storage (Google Cloud Storage, AWS S3, Azure Blob)
  • Data warehouse: cloud-native (BigQuery, Snowflake, Redshift)
  • Web-app deployment: managed app platform (Cloud Run, App Engine, ECS)
  • Securing web-app access: identity-aware proxy or SSO
  • Integration of external tools: serverless functions (Cloud Functions, Lambda)

For the Matchpoint engagement, the requirement was to implement everything in Google Cloud, but any other major cloud provider works equally well.

How do you handle confidential and regulated data?
All work is covered by a mutual NDA and, where applicable, a data processing agreement. The pipeline runs entirely in your environment, under your security and access controls. We work with clients operating under GDPR, HIPAA, GxP and equivalent regimes.

Let's talk

Benjamin Häusler

Senior Consultant

Benjamin leads idalab's drug-discovery data engineering work, partnering with biotech R&D teams to build the data infrastructure their science actually needs.

Not getting what you want out of your discovery data? Let’s talk — fully confidential, no strings attached.