Inside the Smart Food Safety System: Architecture, Data Pipelines, and ML Models Explained

A deep technical walkthrough of the data pipelines, algorithms, and design decisions behind my food safety prototype

Architecture overview

Once the prototype moved beyond experimentation, I needed a structure that could survive real-world input.

Food labels are noisy. OCR is imperfect. Safety decisions cannot rely on a single model prediction. The architecture reflects that reality by separating concerns clearly and defensively.

At a high level, the system flows as follows:


Image / Label Input
        ↓
OCR + Text Parsing
        ↓
ETL + Validation Layer
        ↓
Feature Engineering
        ↓
Freshness ML Model
        ↓
Rule-Based Safety Engine
        ↓
Human-Readable Output

Each layer can fail safely without corrupting the next.
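
One way to picture "failing safely" is a runner that stops at the first stage that raises and reports which stage failed, so bad data never reaches the next layer. This is a minimal sketch under my own assumptions: `StageResult`, `run_pipeline`, and the two toy stages are hypothetical names for illustration, not the prototype's actual code.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class StageResult:
    ok: bool
    data: Any = None
    errors: list = field(default_factory=list)


def run_pipeline(raw: Any, stages: list[tuple[str, Callable[[Any], Any]]]) -> StageResult:
    """Run stages in order; stop at the first failure instead of passing bad data on."""
    data = raw
    for name, stage in stages:
        try:
            data = stage(data)
        except Exception as exc:
            # Record which stage failed; nothing downstream runs.
            return StageResult(ok=False, errors=[f"{name}: {exc}"])
    return StageResult(ok=True, data=data)


# Toy stages standing in for OCR parsing and normalisation (hypothetical).
def parse(text: str) -> dict:
    if not text.strip():
        raise ValueError("empty OCR text")
    return {"raw_text": text}


def normalise(record: dict) -> dict:
    return {**record, "raw_text": record["raw_text"].upper()}


stages = [("parse", parse), ("normalise", normalise)]
good = run_pipeline("milk use by 2026-02-01", stages)
bad = run_pipeline("   ", stages)
```

Here `good.ok` is true and carries the normalised record, while `bad.ok` is false and `bad.errors` names the stage that rejected the input.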

Data Engineering layer (ETL, validation, anonymisation)

This layer exists to answer one question:

Can this data be trusted enough to make a safety decision?

ETL ingestion

Raw inputs enter the system as either:

OCR-extracted text
Structured label metadata (during testing)


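from datetime import datetime, timezone

def ingest_label(raw_text: str, source: str) -> dict:
    # Wrap the raw input with provenance metadata; no parsing happens here.
    return {
        "raw_text": raw_text,
        "source": source,
        "ingested_at": datetime.now(timezone.utc),  # timezone-aware UTC timestamp
    }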

Nothing downstream assumes correctness.

Validation logic

Before feature engineering, every record is validated.


def validate_label(label: dict) -> bool:
    required_fields = ["expiry_date", "product_type"]
    for field in required_fields:
        if field not in label or label[field] is None:
            return False
    return True

Ambiguous expiry dates or missing fields are flagged and routed for manual review or conservative scoring.
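
To make "ambiguous" concrete, here is a small sketch of one possible check: a slash-separated date is treated as ambiguous when day-first and month-first readings both parse but disagree. The two formats, the status strings, and the `route` helper are assumptions for illustration; the prototype may detect ambiguity differently.

```python
from datetime import date, datetime
from typing import Optional


def parse_expiry(text: str) -> tuple[Optional[date], str]:
    """Return (date, status) where status is 'ok', 'ambiguous', or 'invalid'."""
    candidates = set()
    # Try both day-first and month-first readings of the same string.
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            candidates.add(datetime.strptime(text.strip(), fmt).date())
        except ValueError:
            pass
    if not candidates:
        return None, "invalid"
    if len(candidates) > 1:
        # Both readings are valid but disagree, e.g. 03/04/2026.
        return None, "ambiguous"
    return candidates.pop(), "ok"


def route(text: str) -> str:
    """Send anything that is not a clean parse to manual review."""
    _, status = parse_expiry(text)
    return "score" if status == "ok" else "manual_review"
```

A date such as 25/12/2026 has only one valid reading and goes on to scoring, while 03/04/2026 could be 3 April or 4 March and is held back.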

Anonymisation (intentional design choice)

The system does not require business identifiers or customer data.


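from uuid import uuid4

def anonymise_record(record: dict) -> dict:
    record.pop("restaurant_id", None)  # drop the business identifier if present
    record["record_id"] = uuid4().hex  # random ID, not derived from the source
    return record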

This makes the system:

Privacy-preserving by default
Easier to deploy across vendors
Safer for regulatory environments

This decision was architectural, not cosmetic.

Structured schema

After validation, all data conforms to a fixed schema.


from datetime import date

LabelSchema = {
    "record_id": str,
    "product_type": str,
    "expiry_date": date,
    "storage_temp": float,
    "allergens": list
}

Downstream logic never handles raw text directly.
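
A schema written as a name-to-type mapping can be enforced with a few lines. The sketch below is an assumption about how such a check might look; `LABEL_SCHEMA` repeats the fields above and `schema_problems` is a hypothetical helper name.

```python
from datetime import date

LABEL_SCHEMA = {
    "record_id": str,
    "product_type": str,
    "expiry_date": date,
    "storage_temp": float,
    "allergens": list,
}


def schema_problems(record: dict, schema: dict = LABEL_SCHEMA) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected in schema.items():
        if name not in record:
            problems.append(f"missing: {name}")
        elif not isinstance(record[name], expected):
            # Strict check: an int storage_temp such as 4 is rejected; use 4.0.
            problems.append(f"wrong type: {name}")
    return problems


record = {
    "record_id": "abc123",
    "product_type": "dairy",
    "expiry_date": date(2026, 2, 5),
    "storage_temp": 4.0,
    "allergens": ["milk"],
}
clean = schema_problems(record)
broken = schema_problems({**record, "expiry_date": "2026-02-05"})
```

Returning a list of problems rather than a bare boolean keeps the reason for rejection available to whoever reviews the record.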

Machine Learning layer (freshness estimation)

The ML layer estimates gradual risk; it does not make safety decisions.

Feature engineering

Freshness decays non-linearly and differently across food types.


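from datetime import date

# IDEAL_TEMP and SENSITIVITY_MAP are lookup tables keyed by product type, defined elsewhere.
def compute_features(label):
    days_to_expiry = (label["expiry_date"] - date.today()).days  # negative once expired

    return {
        "days_to_expiry": days_to_expiry,
        "temp_deviation": abs(label["storage_temp"] - IDEAL_TEMP[label["product_type"]]),
        "product_sensitivity": SENSITIVITY_MAP[label["product_type"]]
    }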

Features were chosen for explainability, not model cleverness.
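
For a concrete feel of these three features, here is a self-contained variant with the lookup tables filled in. The table values are made up for illustration, and passing `today` as a parameter (rather than calling `date.today()`) is my own tweak so that the example is reproducible.

```python
from datetime import date

# Hypothetical lookup tables; real values would come from food safety guidance.
IDEAL_TEMP = {"dairy": 4.0, "frozen": -18.0}      # degrees Celsius
SENSITIVITY_MAP = {"dairy": 0.8, "frozen": 0.3}   # 0 = robust, 1 = highly perishable


def compute_features_at(label: dict, today: date) -> dict:
    """Same three features as above, computed for an explicit reference date."""
    product = label["product_type"]
    return {
        "days_to_expiry": (label["expiry_date"] - today).days,
        "temp_deviation": abs(label["storage_temp"] - IDEAL_TEMP[product]),
        "product_sensitivity": SENSITIVITY_MAP[product],
    }


features = compute_features_at(
    {"product_type": "dairy", "expiry_date": date(2026, 2, 5), "storage_temp": 6.5},
    today=date(2026, 2, 1),
)
```

With these inputs the item has 4 days left and is stored 2.5 degrees above its ideal temperature, and each number can be read directly off the label.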

Model choice

I intentionally avoided deep models at this stage.


from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

Why:

Deterministic behaviour
Stable predictions near thresholds
Easier debugging when things go wrong

Evaluation metrics

Accuracy alone is meaningless here.

I focused on:

Error near expiry boundaries
Stability across similar products
Consistency under small input noise


from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
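
Two of the criteria listed above, error near expiry boundaries and consistency under small input noise, can be measured with very little code. The sketch below uses plain Python and hypothetical helper names (`boundary_mae`, `noise_sensitivity`); the window size and noise magnitude are assumptions, not values from the prototype.

```python
def boundary_mae(y_true, y_pred, days_to_expiry, window=2):
    """Mean absolute error restricted to samples within `window` days of expiry."""
    errors = [
        abs(t - p)
        for t, p, d in zip(y_true, y_pred, days_to_expiry)
        if abs(d) <= window
    ]
    return sum(errors) / len(errors) if errors else None


def noise_sensitivity(predict, x, eps=0.1):
    """Largest change in the prediction when a single feature shifts by +/- eps."""
    base = predict(x)
    worst = 0.0
    for i in range(len(x)):
        for delta in (-eps, eps):
            shifted = list(x)
            shifted[i] += delta
            worst = max(worst, abs(predict(shifted) - base))
    return worst


# Toy data: only the second and third samples are within 2 days of expiry.
near_expiry_error = boundary_mae(
    y_true=[0.9, 0.5, 0.2], y_pred=[0.8, 0.4, 0.4], days_to_expiry=[10, 1, -1]
)

# Toy linear scorer standing in for the trained model.
def toy_predict(x):
    return 0.5 + 0.1 * x[0] - 0.2 * x[1]

sensitivity = noise_sensitivity(toy_predict, [3.0, 1.0], eps=0.1)
```

A low overall MAE can hide a large `near_expiry_error`, which is the region where a wrong score matters most.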

Interpretability

Every freshness score can be decomposed:


contribution = model.coef_ * feature_vector

If I can’t explain why a score dropped, the model isn’t production-worthy.
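
As a worked example of that decomposition: for a linear model, the score is the intercept plus the sum of the per-feature contributions, so the most negative contribution names the main reason a score dropped. The coefficients and intercept below are invented for illustration and do not come from the trained model.

```python
FEATURES = ["days_to_expiry", "temp_deviation", "product_sensitivity"]

# Hypothetical fitted parameters (stand-ins for model.coef_ and model.intercept_).
coef = [0.05, -0.08, -0.10]
intercept = 0.60


def explain(feature_vector):
    """Return (score, contributions) for one item under a linear model."""
    contributions = {
        name: c * value for name, c, value in zip(FEATURES, coef, feature_vector)
    }
    score = intercept + sum(contributions.values())
    return score, contributions


# 4 days to expiry, 2 degrees off ideal temperature, sensitivity 1.0
score, parts = explain([4, 2.0, 1.0])

# The feature pulling the score down the most.
main_reason = min(parts, key=parts.get)
```

Here the score works out to roughly 0.54, and `main_reason` is `temp_deviation`, which is the kind of statement that can be surfaced to a user.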

Rule Engine layer (non-negotiable safety logic)

This layer exists because ML cannot be trusted alone in food safety.

Expiry logic


from datetime import date

def expiry_rule(label):
    if label["expiry_type"] == "use_by" and date.today() > label["expiry_date"]:
        return "UNSAFE"
    return None  # no hard expiry violation; later checks decide

No model can override this.

Threshold mapping


def freshness_bucket(score):
    if score >= 0.8:
        return "SAFE"
    elif score >= 0.5:
        return "CAUTION"
    else:
        return "UNSAFE"

Clear, configurable, explicit.
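
"Configurable" could mean, for example, pulling the cut-offs out of the function and validating them once. This is a sketch under that assumption; `Thresholds`, `PER_PRODUCT`, `bucket_for`, and the stricter `raw_fish` values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    safe: float = 0.8      # defaults mirror freshness_bucket above
    caution: float = 0.5

    def __post_init__(self):
        # Reject configurations that would make a bucket unreachable.
        if not 0.0 <= self.caution < self.safe <= 1.0:
            raise ValueError("expected 0 <= caution < safe <= 1")


DEFAULT = Thresholds()
# Hypothetical stricter cut-offs for a highly perishable product type.
PER_PRODUCT = {"raw_fish": Thresholds(safe=0.9, caution=0.7)}


def bucket_for(score: float, product_type: str) -> str:
    """Map a freshness score to a label using that product's thresholds."""
    t = PER_PRODUCT.get(product_type, DEFAULT)
    if score >= t.safe:
        return "SAFE"
    if score >= t.caution:
        return "CAUTION"
    return "UNSAFE"
```

The same score of 0.85 is then `SAFE` for bread but only `CAUTION` for raw fish, and a misconfigured pair of thresholds fails at start-up rather than at decision time.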

Allergen override


def allergen_check(allergens, user_allergies):
    return bool(set(allergens) & set(user_allergies))

If triggered, freshness is irrelevant.

Final decision engine


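# USER_PROFILE holds the current user's declared allergies and is defined elsewhere.
def safety_decision(label, score):
    # 1. Hard expiry rule: a passed use-by date ends the evaluation.
    if expiry_rule(label) == "UNSAFE":
        return "UNSAFE"

    # 2. Allergen override: takes precedence over any freshness score.
    if allergen_check(label["allergens"], USER_PROFILE):
        return "ALLERGEN RISK"

    # 3. Only now does the ML freshness score decide the bucket.
    return freshness_bucket(score)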

This hierarchy reflects real-world responsibility.

Application layer (API + UI readiness)

The system is built API-first.


POST /evaluate_food_item

Response example:


{
  "status": "CAUTION",
  "reason": "Low freshness score due to temperature deviation",
  "confidence": 0.72
}

The UI never sees raw ML outputs. Only decisions and explanations.
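
One possible way to assemble such a payload is to translate the decision and the per-feature contributions into a short reason string, leaving the raw score out. The function name, the reason wording, and the way `confidence` is supplied are all assumptions here; the post does not say how `confidence` is computed.

```python
# Human-readable phrases for each feature (hypothetical wording).
REASON_TEXT = {
    "days_to_expiry": "proximity to expiry date",
    "temp_deviation": "temperature deviation",
    "product_sensitivity": "product sensitivity",
}


def build_response(status: str, contributions: dict, confidence: float) -> dict:
    """Turn a decision plus per-feature contributions into the UI-facing payload."""
    if status == "SAFE":
        reason = "Freshness score within safe range"
    else:
        # The most negative contribution is reported as the main cause.
        worst = min(contributions, key=contributions.get)
        reason = f"Low freshness score due to {REASON_TEXT.get(worst, worst)}"
    # The raw score is deliberately not included in the payload.
    return {"status": status, "reason": reason, "confidence": round(confidence, 2)}


response = build_response(
    "CAUTION",
    {"days_to_expiry": 0.2, "temp_deviation": -0.16, "product_sensitivity": -0.1},
    confidence=0.7214,
)
```

With these inputs the payload matches the response example above.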

Scaling considerations (designed, not promised)

Even as a local prototype, the architecture supports:

Batch scoring for retail inventory
Cloud containerisation
Near-real-time re-evaluation
Federated learning without data sharing

These paths exist because of early design discipline.

Reflection

This project forced me to think beyond models.

I had to consider:

What happens when OCR fails
How unsafe data propagates
Where human override is required
How trust is built through explanation

This is no longer an “analytics project.”

It is a system designed around responsibility, uncertainty, and real-world constraints.

And it is still evolving.


Applied Data Analytics for Impact