Data Pipeline

Contents

  1. Welcome
  2. Introduction
  3. Data Pipelines in COUSIN
  4. Apache Airflow
  5. Deployment

1. Welcome

This documentation provides a comprehensive guide for the COUSIN project, focusing on building, deploying, and maintaining data pipelines using Apache Airflow.

Objectives

  • Automate data workflows for biological datasets.
  • Ensure reproducibility and scalability.
  • Integrate with machine learning research, including dimensionality reduction and clustering techniques.

Tip

COUSIN WP5 focuses on infrastructure and data management, where this documentation plays a key role in delivering high-quality data resources.

2. Introduction

The COUSIN Project aims to improve the automation and management of biological data for similarity search research.
This documentation serves as a guide for:

  • Handling large biological datasets.
  • Automating preprocessing and analysis workflows.
  • Ensuring reproducibility of experiments.
  • Supporting machine learning methods such as dimensionality reduction and clustering.

Pipelines are orchestrated using Apache Airflow, which ensures that every step is executed in the correct order and at the right time.


Pipeline Architecture Overview

The pipeline used in the COUSIN project is modular: each stage runs independently but is connected to the others through explicit dependencies.

High-Level Workflow

flowchart TD
    A[Raw Biological Data Sources] -->|Data Ingestion| B[Preprocessing]
    B -->|Cleaned Data| C[Feature Engineering]
    C -->|Optimized Dataset| D[Dimensionality Reduction]
    D -->|Reduced Features| E[Clustering and Analysis]
    E -->|Grouped Data| F[Similarity Models]
    F -->|Results| G[Reports and Visualization]
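
As a rough illustration, the same high-level workflow could be expressed as an Airflow DAG in which each stage becomes a task and the arrows become task dependencies. The following is only a sketch: the DAG id, task ids, and the use of placeholder EmptyOperator tasks (available in Airflow 2.3+) are illustrative and not the project's actual pipeline code.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    'cousin_pipeline_overview',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:
    # One placeholder task per stage of the high-level workflow.
    ingestion = EmptyOperator(task_id='data_ingestion')
    preprocessing = EmptyOperator(task_id='preprocessing')
    feature_engineering = EmptyOperator(task_id='feature_engineering')
    dimensionality_reduction = EmptyOperator(task_id='dimensionality_reduction')
    clustering = EmptyOperator(task_id='clustering_and_analysis')
    similarity_models = EmptyOperator(task_id='similarity_models')
    reporting = EmptyOperator(task_id='reports_and_visualization')

    # The >> operator declares execution order, mirroring the arrows in the diagram.
    (
        ingestion
        >> preprocessing
        >> feature_engineering
        >> dimensionality_reduction
        >> clustering
        >> similarity_models
        >> reporting
    )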

3. Data Pipelines in COUSIN

Overview

The data pipeline automates processes for:

  • Data ingestion
  • Preprocessing
  • Feature engineering
  • Model generation
  • Reporting

Stages of the Pipeline

Stage                      Description
Ingestion                  Load biological datasets from multiple sources.
Preprocessing              Clean, normalize, and transform data.
Dimensionality Reduction   Apply techniques to reduce the feature space.
Clustering                 Group similar data points.
Export & Reporting         Save processed data and generate outputs.
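
For concreteness, the preprocessing stage described above could look like the following minimal sketch. It assumes tabular data readable with pandas; the function name, file paths, cleaning steps, and min-max normalization are illustrative choices rather than the project's actual implementation.

import pandas as pd

def preprocess_dataset(input_path: str, output_path: str) -> None:
    # Load one ingested dataset (CSV assumed for illustration).
    df = pd.read_csv(input_path)

    # Clean: drop rows with missing values and remove exact duplicates.
    df = df.dropna().drop_duplicates()

    # Normalize: min-max scale every numeric column to [0, 1].
    numeric_cols = df.select_dtypes(include='number').columns
    col_min = df[numeric_cols].min()
    col_range = (df[numeric_cols].max() - col_min).replace(0, 1)  # avoid division by zero
    df[numeric_cols] = (df[numeric_cols] - col_min) / col_range

    # Export the transformed dataset for the next stage.
    df.to_csv(output_path, index=False)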

4. Apache Airflow

Why Airflow?

Apache Airflow is used for orchestrating workflows, enabling us to:

  • Schedule complex tasks
  • Manage dependencies
  • Monitor pipelines in real time

Core Concepts

Concept    Description
DAG        A Directed Acyclic Graph representing a workflow.
Task       An individual operation inside a DAG.
Operator   A template for creating tasks.
Executor   Defines how tasks are executed (e.g., LocalExecutor).

Example DAG

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def preprocess():
    # Placeholder for the actual preprocessing logic.
    print("Preprocessing biological dataset...")

with DAG(
    'cousin_preprocessing',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',   # run once per day
    catchup=False                 # do not backfill runs before the start date
) as dag:
    task = PythonOperator(
        task_id='preprocess_dataset',
        python_callable=preprocess
    )
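
Saving this file under the dags/ folder is enough for Airflow to discover it. For a quick local check without waiting for the scheduler, Airflow 2.x also provides the CLI command airflow dags test cousin_preprocessing 2025-01-01, which executes a single run of the DAG for the given date (exact CLI behaviour may vary between Airflow versions).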

5. Deployment

Local Deployment with Docker

  1. Install Docker and Docker Compose.
  2. Create a folder airflow with the following subdirectories:

airflow/
├─ dags/
├─ logs/
└─ plugins/

  3. Download the official docker-compose.yaml from the Apache Airflow documentation into the airflow folder, then start Airflow:

docker-compose up

  4. Access the UI at http://localhost:8080.
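
With the official docker-compose.yaml, the web UI credentials are created during initialization and are defined in the compose file itself (commonly airflow / airflow, though this can change between Airflow releases). When you are finished, the environment can be stopped with docker-compose down.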