Biotech · Pharma

From scattered assay results
to a unified discovery pipeline

Bespoke cloud data pipelines for proteomics-led drug discovery — connecting internal assays, public data and best-in-class tools, fitted to the way your science team actually works.

Working with discovery teams at Matchpoint Tx, HotSpot Tx and others.

The status quo:

01
Experimental data scattered across instruments, formats and folders — every analysis starts with a hunt.
02
Off-the-shelf platforms force a choice: bend the science to fit the tool, or live with manual workarounds.
03
Scientists queue for data support, slowing cycle times when speed matters most.
04
Critical insights buried in spreadsheets, hard to revisit between experiments.

What we believe

Discovery data infrastructure
should fit the science,
not the other way around.

The teams pushing the frontiers of drug discovery don’t run a standard workflow. They run experiments that don’t exist in commercial software, blend public and proprietary data in non-obvious ways, and refine their methods continuously.

Off-the-shelf platforms force a choice: bend the science to fit the tool, or carry the cost of manual workarounds that don’t scale. We believe the right answer is neither.

A small, well-designed pipeline — built around the team’s actual experiments and the team’s actual mental model — pays back many times over. Scientists move from result to insight without waiting in line. The platform team focuses on real engineering instead of ad-hoc requests. The science compounds.

The cost of building infrastructure has collapsed. Discovery teams have the most to gain.

Case — Cloud Data Pipeline · Covalent Chemoproteomics

A unified pipeline for an ACE discovery platform

Matchpoint Therapeutics is a Boston-based biotech building precision covalent medicines through its Advanced Covalent Exploration (ACE) platform — integrating chemoproteomics, machine learning and covalent chemistry library evolution. A high volume of non-standard experiments and tight discovery cycles meant the platform team needed data infrastructure that could keep pace with the science, not constrain it.

Together with Matchpoint’s science and platform-technology teams, we conceptualised the pipeline in a series of workshops and then implemented it piece by piece in their Google Cloud environment. In-house assay results, public data and computational tools, including custom Fortran code, flow into a unified data lake and warehouse. Web-based assistants guide ingestion and quality control, custom dashboards reflect Matchpoint’s specific way of looking at the data, and external annotations layer in automatically.

Twelve weeks from kick-off workshop to first delivery. The pipeline now runs in Matchpoint’s own cloud, fully owned by the internal team.

Pipeline architecture

Four-stage pipeline. Sources fan in on the left, expert-guided curation enforces standards, the data lake and warehouse run in your own cloud, and custom dashboards land each experiment in a decision.

Twelve weeks from kick-off to first delivery

Experiment-to-insight in real time

Cross-functional teams work independently of data support

It is a pleasure working with the idalab team on our data and machine learning pipeline. They are an outstanding strategic partner, collaborating seamlessly with our science team. Fast, clear communication, structured — yet always happy to adapt ad hoc, if necessary. We are looking forward to continuing the collaboration.

Suresh Singh, PhD · Senior Vice President, Computational Sciences · Matchpoint Therapeutics

Technology & Analysis Stack

Identification through to downstream integration — the workflows and tools we have built into pipelines for clients in the last few years.

Identification & quantification

Label-free quantification (LFQ): MaxQuant · FragPipe · Spectronaut
Data-independent acquisition (DIA): DIA-NN · Spectronaut · MaxDIA
Isobaric labelling (TMT, iTRAQ): FragPipe-TMT · IsobarQuant · MaxQuant
PTM site localization: FragPipe-PTM · MSFragger · MaxQuant
Cross-linking MS (XL-MS): pLink · XlinkX · MeroX

Statistics & modelling

Differential abundance & testing: MSstats · limma · DEqMS · Perseus
Batch correction & normalisation: ComBat · RUV-III · vsn · median-MAD
Missing-value imputation: MissForest · MICE · MinDet
Time-course & longitudinal modelling: limma splines · lme4 · MEFISTO
ML for biomarker discovery: scikit-learn · XGBoost · SHAP · PyTorch

Downstream & integration

Pathway & gene-set enrichment: fgsea · clusterProfiler · Reactome · MSigDB
PPI & network analysis: STRING · IntAct · Cytoscape
Structure-prediction integration: AlphaFold 2/3 · ColabFold · ESMFold
Multi-omics integration: MOFA · mixOmics · DIABLO
Affinity & target deconvolution: SAINT · ProHits · mineCETSA · TPP-TR

How we work

Co-designed with science. We work alongside your scientists and platform team from day one. Pipeline structure, dashboards and ingestion flows are shaped together — not handed over at the end.
Built for the way you actually work. Dashboards mirror your team's specific analyses. Ingestion interfaces enforce your standards. Nothing generic, nothing forced.
Deployed in your infrastructure. For maximum security and control, the pipeline lives in your cloud — Google Cloud, AWS, Azure or hybrid. Sealed off from external access where you need it.
You own what we build. From data structure to interface code, everything is yours from delivery onward. We're happy to support your internal team taking over.

Articles

Data structures for proteomics

The shape of your data decides what you can ask of it. Six hard-won lessons from building proteomics pipelines.

Are volcano plots really the best tool to understand your data?

A contrarian take on the proteomics workhorse — and what to plot instead.

Frequently asked questions

What does an engagement look like?

We support you through the entire process. Working closely with the science team, we conceptualise the pipeline in a series of workshops. From there we implement it piece by piece in your cloud environment, keeping the science team in the loop throughout to ensure data quality and consistency.

How long does it take to build such a pipeline?

Twelve weeks from the kick-off workshop to the delivery of the initial pipeline is a realistic baseline. From there, the pipeline is extended incrementally in sprints of two to four weeks.

What kind of data sources or computational tools can be integrated?

Anything goes — even custom tools written in Fortran can be brought into a 21st-century cloud pipeline.
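To give a flavour of what that can look like, here is a minimal sketch of one common pattern: treating a compiled legacy tool as a black box and calling it from Python. This is an illustration under assumptions, not the code from any client engagement. The function name `run_legacy_tool` and the `STAND_IN` command are hypothetical; in practice the command would point at the compiled executable, and the input and output formats would follow that tool's own conventions.

```python
import subprocess
import sys
from typing import Sequence


def run_legacy_tool(command: Sequence[str], input_text: str, timeout_s: float = 60.0) -> str:
    """Run an external executable, feed it text on stdin and return its stdout.

    Raises RuntimeError if the tool exits with a non-zero status.
    """
    result = subprocess.run(
        list(command),
        input=input_text,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # exchange str rather than bytes
        timeout=timeout_s,    # do not let a hung tool block the pipeline
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"tool exited with status {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


# Stand-in for a compiled binary so the sketch runs anywhere (assumption):
# a one-line Python program that upper-cases whatever it reads.
STAND_IN = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]

if __name__ == "__main__":
    print(run_legacy_tool(STAND_IN, "peptide"))
```

A wrapper of this kind can then sit behind a serverless function or a scheduled job, so the tool itself does not need to be rewritten.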

Can our own data engineering team operate this pipeline?

Definitely. We would love to team up with your data engineering team from day one, and actively support phasing us out when the time is right. Everything we build is yours.

What technology do you use?

Choices are made together with your team to fit existing infrastructure and skills. A common stack:

Programming language: Python
Web-app development: Streamlit
Data lake: cloud object storage (Google Cloud Storage, AWS S3, Azure Blob)
Data warehouse: cloud-native (BigQuery, Snowflake, Redshift)
Web-app deployment: managed app platform (Cloud Run, App Engine, ECS)
Securing web-app access: identity-aware proxy or SSO
Integration of external tools: serverless functions (Cloud Functions, Lambda)

For the Matchpoint engagement, the requirement was to implement everything in Google Cloud, but any other major cloud provider works equally well.
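As a small illustration of the Python layer in such a stack, the sketch below shows the kind of check an ingestion step might run before a file is accepted into the data lake. It is a simplified example under assumptions: the column names in `REQUIRED_COLUMNS` and the function `validate_assay_csv` are hypothetical, and real rules would be agreed with the science team.

```python
import csv
import io
from typing import List

# Hypothetical column names; a real schema would come from the science team.
REQUIRED_COLUMNS = ("sample_id", "protein_id", "intensity")


def validate_assay_csv(text: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file passes."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        # Without the expected columns, row-level checks are meaningless.
        return [f"missing column(s): {', '.join(missing)}"]

    problems: List[str] = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row["sample_id"] or "").strip():
            problems.append(f"line {line_no}: empty sample_id")
        try:
            value = float(row["intensity"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: intensity is not a number")
            continue
        if value < 0:
            problems.append(f"line {line_no}: negative intensity")
    return problems


if __name__ == "__main__":
    sample = "sample_id,protein_id,intensity\nS1,P12345,1024.5\n,P12345,abc\n"
    for problem in validate_assay_csv(sample):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a web-based upload assistant show the scientist everything that needs fixing in one pass.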

How do you handle confidential and regulated data?

All work is covered by a mutual NDA and, where applicable, a data processing agreement. The pipeline runs entirely in your environment, under your security and access controls. We work with clients operating under GDPR, HIPAA, GxP and equivalent regimes.

Let's talk

Benjamin Häusler

Senior Consultant

Benjamin leads idalab's drug-discovery data engineering work, partnering with biotech R&D teams to build the data infrastructure their science actually needs.

Not getting what you want out of your discovery data? Let’s talk — fully confidential, no strings attached.
