
Scientific Data Engineering

Biotech & Life Sciences

Scientific ETL Pipeline Design
Building reliable data pipelines for scientific and biotech data
Design an ETL pipeline for [scientific data source]:

Data source: [e.g., ChEMBL API, ClinicalTrials.gov AACT database, PubMed, internal LIMS]
Target: [where the data goes — data warehouse, research database, dashboard]
Volume: [estimated records and update frequency]
Tech stack: [Python, R, SQL, Airflow, Docker, cloud provider]

Design the pipeline:
1. Extract — API pagination strategy, rate limiting, authentication, error handling
2. Transform — schema mapping, data type coercion, unit standardization, deduplication
3. Validate — data quality checks (completeness, range validation, referential integrity, outlier detection)
4. Load — incremental vs. full load strategy, upsert logic, audit trail
5. Orchestration — DAG structure (Airflow/Prefect), scheduling, dependency management
6. Monitoring — alerting on failures, data quality dashboards, row count tracking
7. Infrastructure — containerization, deployment, scaling, cost considerations
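The orchestration step above can be sketched without a full Airflow deployment: Python's stdlib `graphlib` resolves a task dependency graph into a valid execution order. This is a minimal illustration of DAG structure and dependency management; the task names are hypothetical, and a real orchestrator adds scheduling, retries, and parallel execution on top of this ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline tasks mapped to their upstream dependencies.
# Task names are illustrative, not tied to any real deployment.
dag = {
    "extract_chembl": set(),
    "transform_records": {"extract_chembl"},
    "validate_batch": {"transform_records"},
    "load_warehouse": {"validate_batch"},
    "refresh_dashboard": {"load_warehouse"},
}

def run_pipeline(dag, runner):
    """Execute tasks in dependency order.

    `runner` is any callable taking a task name; in Airflow or
    Prefect each node would instead become an operator/task with
    its own retry and alerting policy.
    """
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        runner(task)
    return order

executed = []
order = run_pipeline(dag, executed.append)
```

Because the ordering is derived from the graph rather than hard-coded, adding a new source (say, a second extract task feeding the same transform) only requires a new dictionary entry.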

Provide:
- Python code skeleton for the core extract and transform functions
- SQL schema for the target tables
- Airflow DAG definition
- Docker Compose configuration for local development
- Data quality assertion examples
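As a starting point for the extract and transform skeleton requested above, the sketch below shows offset-based pagination with a crude rate limit and a toy unit-standardization transform. The `fetch_page` callable stands in for a real HTTP request (e.g. against a ChEMBL-style `limit`/`offset` endpoint); its response shape (`records`, `total`) and the field names in `transform` are assumptions for illustration.

```python
import time
from typing import Callable, Iterator

def extract_paginated(
    fetch_page: Callable[[int, int], dict],
    page_size: int = 100,
    min_interval: float = 0.0,
) -> Iterator[dict]:
    """Yield records from an offset-paginated source.

    `fetch_page(offset, limit)` is a stand-in for an HTTP call and
    is assumed to return {"records": [...], "total": int}.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        records = page["records"]
        yield from records
        offset += len(records)
        if not records or offset >= page["total"]:
            break
        time.sleep(min_interval)  # naive rate limiting between calls

def transform(record: dict) -> dict:
    """Toy transform: standardize nM values to uM."""
    out = dict(record)
    if out.get("units") == "nM":
        out["value"] = float(out["value"]) / 1000.0
        out["units"] = "uM"
    return out

# Local demo with an in-memory "API" instead of a live endpoint.
sample = [
    {"value": 1500, "units": "nM"},
    {"value": 2.0, "units": "uM"},
    {"value": 500, "units": "nM"},
]

def fake_fetch(offset: int, limit: int) -> dict:
    return {"records": sample[offset:offset + limit], "total": len(sample)}

rows = [transform(r) for r in extract_paginated(fake_fetch, page_size=2)]
```

Keeping extraction as a generator lets the load step stream records instead of holding the full result set in memory, which matters once volumes grow past a few million rows.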

Try this prompt in:

ChatGPT, Claude, and Perplexity will open with the prompt pre-filled. For Gemini, you'll need to paste the prompt manually.

Data Validation & QC Protocol
Ensuring data quality for scientific and clinical data systems
Create a data validation and quality control protocol for [scientific dataset]:

Dataset: [description — bioactivity data, clinical trial data, genomics, etc.]
Source: [where the data comes from]
Consumers: [who uses this data — scientists, analysts, regulatory submissions]
Regulatory context: [GxP, 21 CFR Part 11, or non-regulated]

Define validation rules for:
1. Completeness — which fields are required vs. optional, acceptable null rates
2. Format — data types, allowed values, controlled vocabularies
3. Range — numeric bounds (e.g., IC50 must be > 0, age must be 0-120)
4. Consistency — cross-field rules (e.g., end date must be after start date)
5. Uniqueness — deduplication criteria, acceptable duplicate scenarios
6. Referential integrity — foreign key relationships, lookup table matching
7. Outlier detection — statistical methods for flagging anomalous values
8. Temporal — expected update frequency, staleness thresholds

Provide:
- Validation rule catalog with ID, description, severity (error/warning/info), and SQL/Python implementation
- QC dashboard specification — what metrics to track and visualize
- Curation workflow — how flagged records are reviewed and resolved
- Audit trail requirements — what to log for traceability
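A validation rule catalog of the kind requested above can be represented as data rather than scattered `if` statements, so rules stay enumerable for dashboards and audits. The sketch below is a minimal Python version; the rule IDs, severities, and field names (`compound_id`, `ic50_nm`, `start_date`, `end_date`) are hypothetical examples mirroring the rule classes listed earlier.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    rule_id: str
    description: str
    severity: str  # "error", "warning", or "info"
    check: Callable[[dict], bool]  # returns True when the record passes

# Illustrative catalog entries; field names are assumptions.
CATALOG = [
    Rule("COMP-001", "compound_id is required", "error",
         lambda r: bool(r.get("compound_id"))),
    Rule("RANGE-001", "IC50 must be > 0", "error",
         lambda r: r.get("ic50_nm") is None or r["ic50_nm"] > 0),
    Rule("CONS-001", "end_date must be on/after start_date", "warning",
         lambda r: r.get("end_date") is None
         or r["end_date"] >= r["start_date"]),
]

def validate(record: dict) -> list[tuple[str, str]]:
    """Return (rule_id, severity) for every rule the record fails."""
    return [(rule.rule_id, rule.severity)
            for rule in CATALOG if not rule.check(record)]
```

Because each rule carries its own ID and severity, the same catalog can drive both the curation queue (errors block loading, warnings get flagged for review) and the QC dashboard's per-rule failure counts.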


Multi-Source API Integration Architecture
Integrating multiple scientific APIs into a unified data layer
Design an integration architecture that combines data from multiple scientific APIs:

APIs to integrate: [e.g., ChEMBL, UniProt, PubMed, ClinicalTrials.gov, OpenFDA]
Use case: [what questions the integrated data answers]
Access patterns: [real-time queries, batch sync, on-demand enrichment]

For each API, document:
1. Base URL, authentication method, rate limits
2. Key endpoints and parameters
3. Response schema and relevant fields
4. Caching strategy (TTL, invalidation triggers)
5. Error handling (retries, fallbacks, circuit breakers)
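The retry half of the error-handling point can be sketched with exponential backoff in a few lines. This is a simplified illustration: a real circuit breaker additionally tracks failure rates and stops calling a downed service for a cool-off window, and the exception types to retry on depend on the HTTP client in use.

```python
import time

def with_retries(call, max_attempts=4, base_delay=0.5,
                 retryable=(ConnectionError, TimeoutError),
                 sleep=time.sleep):
    """Retry a flaky call with exponential backoff.

    `sleep` is injectable so tests run instantly; delays double
    each attempt (0.5s, 1s, 2s, ...). The final failure re-raises
    so the orchestrator can mark the task failed and alert.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Demo: a call that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = with_retries(flaky, sleep=lambda s: None)
```

Adding jitter to the delay (a small random offset) is a common refinement to avoid synchronized retry storms when many workers hit the same rate-limited API.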

Integration design:
- Data model — how entities map across sources (e.g., ChEMBL target ID → UniProt accession → CT.gov intervention)
- Query orchestration — sequential vs. parallel calls, dependency graph
- Denormalization strategy — where to materialize joined views
- Real-time vs. batch — which queries are served live vs. from pre-computed stores
- Versioning — how to handle API changes without breaking downstream consumers

Provide a TypeScript/Python interface definition for the unified data access layer with method signatures, parameter types, and return types.
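One shape such a unified access layer can take in Python is a `Protocol`, which lets live-API and cached/in-memory implementations be swapped without changing callers. The method names and the ChEMBL-to-UniProt example mapping below (CHEMBL203 → P00533, i.e. human EGFR) are illustrative assumptions, not a published API.

```python
from typing import Optional, Protocol, runtime_checkable

@runtime_checkable
class UnifiedDataAccess(Protocol):
    """Hypothetical unified interface over ChEMBL / UniProt / CT.gov.

    Method names and return shapes are assumptions for illustration.
    """
    def get_target(self, chembl_target_id: str) -> dict: ...
    def resolve_uniprot(self, chembl_target_id: str) -> Optional[str]: ...
    def trials_for_intervention(self, name: str,
                                limit: int = 20) -> list[dict]: ...

class InMemoryAccess:
    """Test double backed by a dict; a production implementation
    would wrap HTTP clients, caching, and retry logic."""

    def __init__(self, target_map: dict[str, str]):
        self._map = target_map

    def get_target(self, chembl_target_id: str) -> dict:
        return {"chembl_target_id": chembl_target_id,
                "uniprot_accession": self._map.get(chembl_target_id)}

    def resolve_uniprot(self, chembl_target_id: str) -> Optional[str]:
        return self._map.get(chembl_target_id)

    def trials_for_intervention(self, name: str,
                                limit: int = 20) -> list[dict]:
        return []  # stub; would query CT.gov in production

dal = InMemoryAccess({"CHEMBL203": "P00533"})
```

The in-memory double doubles as a fixture for downstream tests, which is one practical payoff of defining the layer as an interface rather than a concrete client.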


Scientific Database Schema Design
Designing database schemas for scientific and research applications
Design a database schema for [scientific domain]:

Domain: [e.g., drug discovery DMTA data, clinical trial management, bioactivity, genomics]
Key entities: [list the main things to track]
Query patterns: [most common questions the database needs to answer]
Scale: [estimated row counts, growth rate]
Database: [PostgreSQL, MySQL, BigQuery, etc.]

Design:
1. Entity-relationship diagram — entities, attributes, relationships, cardinality
2. Table definitions — columns, types, constraints, defaults, indexes
3. Controlled vocabularies — enum tables for status, type, category fields
4. Audit fields — created_at, updated_at, created_by on every table
5. Soft delete strategy — if applicable
6. Partitioning — for large tables, what partition key and strategy
7. Views — materialized views for common complex queries
8. Security — row-level security, access control patterns

Provide:
- Complete CREATE TABLE statements
- Index strategy with rationale
- 5 representative queries that demonstrate the schema handles the key query patterns efficiently
- Migration strategy — how to evolve the schema as requirements change
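The table-definition and constraint points above can be prototyped locally with stdlib `sqlite3` before committing to production DDL. The two-table bioactivity sketch below is hypothetical (names and columns are illustrative); SQLite accepts the same `CHECK`, `REFERENCES`, and index syntax, while PostgreSQL would swap `INTEGER PRIMARY KEY` for an identity column and enforce audit columns with triggers.

```python
import sqlite3

# Minimal bioactivity schema sketch; table and column names are
# illustrative. CHECK constraints encode range/vocabulary rules,
# created_at is the audit field from the design checklist.
DDL = """
CREATE TABLE compound (
    compound_id  TEXT PRIMARY KEY,
    smiles       TEXT NOT NULL,
    created_at   TEXT DEFAULT (datetime('now'))
);
CREATE TABLE assay_result (
    result_id    INTEGER PRIMARY KEY,
    compound_id  TEXT NOT NULL REFERENCES compound(compound_id),
    assay_type   TEXT NOT NULL
                 CHECK (assay_type IN ('IC50','EC50','Ki')),
    value_nm     REAL NOT NULL CHECK (value_nm > 0),
    created_at   TEXT DEFAULT (datetime('now'))
);
CREATE INDEX idx_assay_compound
    ON assay_result(compound_id, assay_type);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript(DDL)
conn.execute("INSERT INTO compound (compound_id, smiles) VALUES (?, ?)",
             ("CPD-1", "CCO"))
conn.execute(
    "INSERT INTO assay_result (compound_id, assay_type, value_nm) "
    "VALUES (?, ?, ?)", ("CPD-1", "IC50", 150.0))
row = conn.execute(
    "SELECT assay_type, value_nm FROM assay_result "
    "WHERE compound_id = ?", ("CPD-1",)).fetchone()
```

The composite index on `(compound_id, assay_type)` matches the most likely query pattern (all results of one assay type for a compound); the rationale for each index belongs alongside it in the schema doc, as the prompt requests.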


Research Data Governance Framework
Establishing data governance for biotech and research organizations
Establish a data governance framework for [research organization]:

Organization type: [biotech, pharma, CRO, academic research lab]
Data types: [compound data, assay results, clinical data, genomics, imaging]
Regulatory context: [GxP, HIPAA, GDPR, 21 CFR Part 11, non-regulated]
Team size: [number of data producers and consumers]

Define:
1. Data ownership model — who owns, stewards, and curates each data domain
2. Data classification — sensitivity levels and handling requirements
3. Access control — role-based access, principle of least privilege, approval workflows
4. Data lifecycle — creation, active use, archival, retention, deletion policies
5. Metadata standards — what metadata accompanies every dataset
6. Master data management — controlled vocabularies, reference data, golden records
7. Data quality — ongoing monitoring, periodic audits, remediation processes
8. Change management — how schema changes, new data sources, and policy updates are governed

Deliver:
- Data governance charter (1-page executive summary)
- RACI matrix for data responsibilities
- Policy templates for data access requests and change control
- Implementation roadmap prioritized by risk and effort
