Data Pipeline

Contents

  1. Welcome
  2. Introduction
  3. Data Pipelines in COUSIN
  4. Apache Airflow
  5. Deployment

1. Welcome

This documentation provides a comprehensive guide for the COUSIN project, focusing on building, deploying, and maintaining data pipelines using Apache Airflow.

Objectives

  • Automate data workflows for biological datasets.
  • Ensure reproducibility and scalability.
  • Integrate with machine learning research, including dimensionality reduction and clustering techniques.

Tip

COUSIN WP5 focuses on infrastructure and data management, where this documentation plays a key role in delivering high-quality data resources.

2. Introduction

The COUSIN Project aims to improve the automation and management of biological data for similarity search research.
This documentation serves as a guide for:

  • Handling large biological datasets.
  • Automating preprocessing and analysis workflows.
  • Ensuring reproducibility of experiments.
  • Supporting machine learning methods such as dimensionality reduction and clustering.

Pipelines are orchestrated using Apache Airflow, which ensures that every step is executed in the correct order and at the right time.


Pipeline Architecture Overview

The pipeline used in the COUSIN project is modular: each stage runs independently but is connected to the others through explicit dependencies.

High-Level Workflow

flowchart TD
    A[Raw Biological Data Sources] -->|Data Ingestion| B[Preprocessing]
    B -->|Cleaned Data| C[Feature Engineering]
    C -->|Optimized Dataset| D[Dimensionality Reduction]
    D -->|Reduced Features| E[Clustering and Analysis]
    E -->|Grouped Data| F[Similarity Models]
    F -->|Results| G[Reports and Visualization]
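
As a rough illustration, the same high-level workflow could be expressed as an Airflow DAG in which each stage becomes a task and the arrows become task dependencies. The following is only a sketch: the DAG id, task ids, and the use of placeholder EmptyOperator tasks (available in Airflow 2.3+) are illustrative and not the project's actual pipeline code.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    'cousin_pipeline_overview',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:
    # One placeholder task per stage of the high-level workflow.
    ingestion = EmptyOperator(task_id='data_ingestion')
    preprocessing = EmptyOperator(task_id='preprocessing')
    feature_engineering = EmptyOperator(task_id='feature_engineering')
    dimensionality_reduction = EmptyOperator(task_id='dimensionality_reduction')
    clustering = EmptyOperator(task_id='clustering_and_analysis')
    similarity_models = EmptyOperator(task_id='similarity_models')
    reporting = EmptyOperator(task_id='reports_and_visualization')

    # The >> operator declares execution order, mirroring the arrows in the diagram.
    (
        ingestion
        >> preprocessing
        >> feature_engineering
        >> dimensionality_reduction
        >> clustering
        >> similarity_models
        >> reporting
    )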

3. Data Pipelines in COUSIN

Overview

The data pipeline automates processes for:

  • Data ingestion
  • Preprocessing
  • Feature engineering
  • Model generation
  • Reporting

Stages of the Pipeline

Stage                      Description
Ingestion                  Load biological datasets from multiple sources.
Preprocessing              Clean, normalize, and transform data.
Dimensionality Reduction   Apply techniques to reduce the feature space.
Clustering                 Group similar data points.
Export & Reporting         Save processed data and generate outputs.
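
For concreteness, the preprocessing stage described above could look like the following minimal sketch. It assumes tabular data readable with pandas; the function name, file paths, cleaning steps, and min-max normalization are illustrative choices rather than the project's actual implementation.

import pandas as pd

def preprocess_dataset(input_path: str, output_path: str) -> None:
    # Load one ingested dataset (CSV assumed for illustration).
    df = pd.read_csv(input_path)

    # Clean: drop rows with missing values and remove exact duplicates.
    df = df.dropna().drop_duplicates()

    # Normalize: min-max scale every numeric column to [0, 1].
    numeric_cols = df.select_dtypes(include='number').columns
    col_min = df[numeric_cols].min()
    col_range = (df[numeric_cols].max() - col_min).replace(0, 1)  # avoid division by zero
    df[numeric_cols] = (df[numeric_cols] - col_min) / col_range

    # Export the transformed dataset for the next stage.
    df.to_csv(output_path, index=False)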

4. Apache Airflow

Why Airflow?

Apache Airflow is used for orchestrating workflows, enabling us to:

  • Schedule complex tasks
  • Manage dependencies
  • Monitor pipelines in real time

Core Concepts

Concept    Description
DAG        A Directed Acyclic Graph representing a workflow.
Task       An individual operation inside a DAG.
Operator   A template for creating tasks.
Executor   Defines how tasks are executed (e.g., LocalExecutor).

Example DAG

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def preprocess():
    # Placeholder for the actual preprocessing logic.
    print("Preprocessing biological dataset...")

with DAG(
    'cousin_preprocessing',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',   # run once per day
    catchup=False                 # do not backfill runs before the start date
) as dag:
    task = PythonOperator(
        task_id='preprocess_dataset',
        python_callable=preprocess
    )
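
Saving this file under the dags/ folder is enough for Airflow to discover it. For a quick local check without waiting for the scheduler, Airflow 2.x also provides the CLI command airflow dags test cousin_preprocessing 2025-01-01, which executes a single run of the DAG for the given date (exact CLI behaviour may vary between Airflow versions).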

5. Deployment

Local Deployment with Docker

  1. Install Docker and Docker Compose.
  2. Create a folder airflow with the following subdirectories:

airflow/
├─ dags/
├─ logs/
└─ plugins/

  3. Download the official docker-compose.yaml from the Apache Airflow documentation into the airflow folder, then start Airflow:

docker-compose up

  4. Access the UI at http://localhost:8080.
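
With the official docker-compose.yaml, the web UI credentials are created during initialization and are defined in the compose file itself (commonly airflow / airflow, though this can change between Airflow releases). When you are finished, the environment can be stopped with docker-compose down.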