P-02 · Phishing Detection · Email Security

Phishing Email Analysis Engine

Python · Streamlit · VirusTotal · URLScan.io · YARA · ReportLab · MITRE ATT&CK · SPF / DKIM / DMARC

Project Overview

Phishing remains the #1 attack vector, and it's not even close. By commonly cited industry estimates, over 90% of cyber attacks worldwide start with a phishing email, leading to massive data breaches and disruptions. Yet most email security tools are black-box scanners that give analysts little visibility into why an email was flagged. This project was built to address that gap: a multi-stage phishing analysis engine that parses raw .eml files and runs them through layered detection covering authentication, URL reputation, and attachment analysis, producing a weighted risk verdict and a structured PDF report.

The pipeline is exposed through a Streamlit dashboard where you can upload an email and receive a full breakdown across six analysis stages — each with its own scoring contribution, findings log, and MITRE ATT&CK mapping. The final output is a ReportLab-generated PDF with IOC tables, authentication analysis, URL and attachment verdicts, and priority-tagged remediation recommendations. The tool was tested against a real phishing email impersonating Microsoft Account Security, returning a Critical verdict.

Pipeline Architecture

Detection Logic

STAGE 02

Header Analysis

Authentication parsing reads the Authentication-Results header stamped by the receiving mail server and extracts the SPF, DKIM, and DMARC verdicts (pass / fail / softfail / none / neutral / permerror).

Sender consistency cross-checks four fields (From, Reply-To, Return-Path, and Message-ID) and requires their base domains to match, subject to defined tolerance rules: for example, ESP subdomains are accepted on Return-Path, while a null <> Return-Path is flagged as bounce suppression.

Social engineering detection scans the subject and body against keyword lists in English, Spanish, and French covering urgency phrases, credential-harvesting language, suspension threats, and prize lures.

The Received chain parser reconstructs the full delivery path. Each mail server prepends its own Received header, so the headers are read in reverse order and Hop 1 is the true origin server.
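The header checks above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names, the regular expression, the sample message, and the "last two labels" base-domain shortcut are all assumptions (a real implementation would likely need a public-suffix list, and the Message-ID and keyword checks are omitted here).

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Matches "spf=pass", "dkim=fail", "dmarc=none" inside Authentication-Results.
AUTH_RE = re.compile(r"\b(spf|dkim|dmarc)=([a-z]+)", re.IGNORECASE)


def parse_auth_results(msg):
    """Return SPF/DKIM/DMARC verdicts; the first value seen per mechanism wins."""
    verdicts = {}
    for header in msg.get_all("Authentication-Results", []):
        for mechanism, result in AUTH_RE.findall(header):
            verdicts.setdefault(mechanism.lower(), result.lower())
    return {m: verdicts.get(m, "none") for m in ("spf", "dkim", "dmarc")}


def base_domain(address):
    """Naive base domain: last two labels (assumption, ignores suffixes like co.uk)."""
    domain = parseaddr(address)[1].rpartition("@")[2].lower()
    return ".".join(domain.split(".")[-2:])


def sender_mismatches(msg):
    """List the fields whose base domain differs from the From domain."""
    from_domain = base_domain(msg.get("From", ""))
    return [
        field
        for field in ("Reply-To", "Return-Path")
        if msg.get(field) is not None and base_domain(msg[field]) != from_domain
    ]


def received_chain(msg):
    """Received headers are prepended per hop, so reverse them: index 0 is Hop 1."""
    return list(reversed(msg.get_all("Received", [])))


# Hypothetical sample message using reserved example domains.
SAMPLE = (
    "Received: from mx.receiver.example by inbox.receiver.example; Tue, 2 Jan 2024 10:00:02 +0000\n"
    "Received: from origin.sender.example by mx.receiver.example; Tue, 2 Jan 2024 10:00:01 +0000\n"
    "Authentication-Results: mx.receiver.example; spf=fail smtp.mailfrom=bounce.sender.example;"
    " dkim=none; dmarc=fail\n"
    "From: Account Security <security@accounts.sender.example>\n"
    "Reply-To: helpdesk@other.example\n"
    "Return-Path: <bounce@mail.sender.example>\n"
    "Subject: Unusual sign-in activity\n"
    "\n"
    "body\n"
)
msg = message_from_string(SAMPLE)
print(parse_auth_results(msg), sender_mismatches(msg), received_chain(msg)[0])
```

In this sample the Return-Path subdomain is tolerated because it shares the From base domain, while the Reply-To address on a different domain is reported. A null <> Return-Path yields an empty base domain and is therefore also reported.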

STAGE 03

URL Analysis

Static checks run without any API calls and contribute directly to the score:

Signal                                       Score
Raw IP address as host                       +15
Known URL shortener                          +10
Suspicious TLD (.xyz, .top, .autos, etc.)    +10
Punycode / IDN homograph domain              +15
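A rough sketch of how these static checks might be scored, using only the standard library. The score values come from the table above; the function name, the finding labels, and the shortener list are illustrative assumptions, and the TLD set contains only the three TLDs named in the table.

```python
import ipaddress
from urllib.parse import urlsplit

SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}   # illustrative subset (assumption)
SUSPICIOUS_TLDS = {"xyz", "top", "autos"}        # only the TLDs named in the table


def static_url_score(url):
    """Return (score, findings) for one URL without making any network calls."""
    host = (urlsplit(url).hostname or "").lower()
    score, findings = 0, []

    # Raw IP address used as the host (+15).
    try:
        ipaddress.ip_address(host)
        score += 15
        findings.append("raw_ip")
    except ValueError:
        pass

    # Known URL shortener (+10).
    if host in SHORTENERS:
        score += 10
        findings.append("shortener")

    # Suspicious top-level domain (+10).
    if host.rpartition(".")[2] in SUSPICIOUS_TLDS:
        score += 10
        findings.append("suspicious_tld")

    # Punycode label, a common sign of an IDN homograph domain (+15).
    if any(label.startswith("xn--") for label in host.split(".")):
        score += 15
        findings.append("punycode")

    return score, findings


print(static_url_score("http://192.0.2.7/login"))
```

Signals stack, so a punycode domain on a suspicious TLD would score 25 under these assumptions.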

URLScan.io — submits each URL to a browser sandbox that fully renders the page and returns a verdict, screenshot, and DNS metadata. Scans are kept private to protect email content from public exposure.

VirusTotal — runs each URL against 70+ security engines simultaneously, aggregating verdicts into a single malicious, suspicious, or clean result.
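The two reputation lookups can be prepared roughly as follows. This sketch only builds the request pieces and makes no network calls. The endpoints, the unpadded URL-safe base64 identifier, and the `visibility` field follow the public VirusTotal v3 and urlscan.io APIs as I understand them; the helper names and the rule for collapsing engine counts into one label are assumptions, not the project's actual logic.

```python
import base64

# VirusTotal expects the API key in an "x-apikey" header; urlscan.io in "API-Key".
VT_URL_ENDPOINT = "https://www.virustotal.com/api/v3/urls/{url_id}"
URLSCAN_SUBMIT_ENDPOINT = "https://urlscan.io/api/v1/scan/"


def vt_url_id(url):
    """VirusTotal v3 URL identifier: URL-safe base64 of the URL, padding stripped."""
    return base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")


def urlscan_payload(url):
    """JSON body for a urlscan.io submission that keeps the scan private."""
    return {"url": url, "visibility": "private"}


def vt_verdict(stats):
    """Collapse per-engine counts into one label (the cut-offs are assumptions)."""
    if stats.get("malicious", 0) > 0:
        return "malicious"
    if stats.get("suspicious", 0) > 0:
        return "suspicious"
    return "clean"


url = "http://example.com/"
print(VT_URL_ENDPOINT.format(url_id=vt_url_id(url)))
print(urlscan_payload(url))
```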

STAGE 04

Attachment Analysis

Static attachment checks:

Signal                                           Score
Dangerous extension (.exe, .ps1, .docm, etc.)    +30
Suspicious MIME type                             +20
Extension / MIME mismatch (masquerading)         +25 additional
Zero-byte attachment                             +5
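As a sketch, the static attachment signals could be combined like this. The scores are taken from the table above; the extension set holds only the three extensions named there, and the suspicious MIME types and the extension-to-MIME map are hypothetical placeholders chosen for illustration.

```python
import os

DANGEROUS_EXTENSIONS = {".exe", ".ps1", ".docm"}   # only those named in the table
SUSPICIOUS_MIME_TYPES = {                           # assumed examples
    "application/x-msdownload",
    "application/x-dosexec",
}
EXPECTED_MIME = {                                   # hypothetical lookup table
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".txt": "text/plain",
}


def static_attachment_score(filename, mime_type, size):
    """Return (score, findings) from filename, declared MIME type, and byte size."""
    ext = os.path.splitext(filename.lower())[1]
    score, findings = 0, []

    if ext in DANGEROUS_EXTENSIONS:
        score += 30
        findings.append("dangerous_extension")

    if mime_type in SUSPICIOUS_MIME_TYPES:
        score += 20
        findings.append("suspicious_mime")

    # Masquerading: the extension implies one type but the MIME type says another.
    expected = EXPECTED_MIME.get(ext)
    if expected is not None and expected != mime_type:
        score += 25
        findings.append("extension_mime_mismatch")

    if size == 0:
        score += 5
        findings.append("zero_byte")

    return score, findings


print(static_attachment_score("invoice.pdf", "application/x-msdownload", 48213))
```

The mismatch score is added on top of any other signal, which is how a .pdf carrying an executable MIME type ends up with both the MIME and the masquerading contributions.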

VirusTotal performs a SHA-256 hash lookup only. File bytes are never uploaded, which keeps potentially sensitive attachment content away from VT intelligence subscribers. A not_found result is treated as clean; YARA independently covers unknown files via content-based pattern matching.
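A minimal sketch of the hash-only lookup, assuming the VirusTotal v3 file report endpoint and its 404 response for unknown hashes. Only the digest leaves the machine. The helper names and the way engine counts are collapsed into a label are assumptions; the HTTP call itself is left out.

```python
import hashlib

# GET this URL with an "x-apikey" header; an unknown hash returns HTTP 404.
VT_FILE_ENDPOINT = "https://www.virustotal.com/api/v3/files/{sha256}"


def sha256_hex(data):
    """Hex SHA-256 digest of the attachment bytes (the bytes are never uploaded)."""
    return hashlib.sha256(data).hexdigest()


def interpret_vt_file_response(status_code, stats=None):
    """Map an HTTP status and optional engine counts to a verdict label."""
    if status_code == 404:
        return "not_found"
    stats = stats or {}
    if stats.get("malicious", 0) > 0:
        return "malicious"
    if stats.get("suspicious", 0) > 0:
        return "suspicious"
    return "clean"


def counts_as_clean(verdict):
    """not_found is treated as clean; unknown files are left to the YARA stage."""
    return verdict in {"clean", "not_found"}


digest = sha256_hex(b"attachment bytes")
print(VT_FILE_ENDPOINT.format(sha256=digest))
```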

YARA rules (yara_rules/malware_signatures.yar):

Rule                       Detects
EmbeddedExecutable         MZ header (PE file) hidden inside attachment
PowershellInMacro          PowerShell execution strings inside Office macros
VBAAutoExecute             Auto-execute VBA macros (AutoOpen, AutoExec, etc.)
SuspiciousPDFJavaScript    JavaScript execution triggers inside PDF files
Base64EncodedPayload       Large Base64 blobs with decode calls in scripts
DoubleExtensionTrap        Files named .pdf.exe, .doc.exe etc.
HTMLSmuggling              Blob + atob() + click() in-browser payload delivery pattern
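To give a feel for what three of these rules look for, here is a plain-Python stand-in. It is not YARA and not the contents of malware_signatures.yar: the exact strings, the extension lists, and the function name are assumptions chosen only to illustrate the kind of pattern each rule name suggests.

```python
import re

# Benign-looking extension followed by an executable one, e.g. "report.pdf.exe".
DOUBLE_EXT_RE = re.compile(
    r"\.(pdf|docx?|xlsx?|jpg|png|txt)\.(exe|scr|bat|cmd|js|vbs)$",
    re.IGNORECASE,
)


def approximate_rule_hits(filename, data):
    """Return the names of the approximated rules that match this attachment."""
    hits = []

    # EmbeddedExecutable: "MZ" magic plus the classic DOS stub message.
    if b"MZ" in data and b"This program cannot be run in DOS mode" in data:
        hits.append("EmbeddedExecutable")

    # DoubleExtensionTrap: checked against the filename in this sketch.
    if DOUBLE_EXT_RE.search(filename):
        hits.append("DoubleExtensionTrap")

    # HTMLSmuggling: Blob construction, atob() decoding, and a scripted click().
    lowered = data.lower()
    if all(token in lowered for token in (b"blob", b"atob(", b"click(")):
        hits.append("HTMLSmuggling")

    return hits


html = b"<script>var b = new Blob([atob('TVo=')]); a.click();</script>"
print(approximate_rule_hits("invoice.html", html))
```

Real YARA rules add conditions such as offsets, match counts, and file-size limits, which is what keeps patterns this simple from firing on benign files.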

STAGE 05

Risk Scoring

Base weights are redistributed dynamically: if a stage has no data (no URLs, no attachments), its weight is shared proportionally among the active stages. A headers-only email therefore gives headers a weight of 1.0, which prevents score deflation on targeted attacks that avoid URLs and attachments entirely.

Stage          Default Weight
Headers        0.25
URLs           0.40
Attachments    0.35
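Proportional redistribution amounts to renormalising the weights of the active stages so they sum to 1. For an email with headers and URLs but no attachments, headers get 0.25 / 0.65 ≈ 0.385 and URLs get 0.40 / 0.65 ≈ 0.615. A minimal sketch follows; the function names and the 0-100 per-stage score scale are assumptions.

```python
BASE_WEIGHTS = {"headers": 0.25, "urls": 0.40, "attachments": 0.35}


def redistribute_weights(active_stages):
    """Renormalise base weights over the stages that actually produced data."""
    active = {s: w for s, w in BASE_WEIGHTS.items() if s in active_stages}
    total = sum(active.values())
    if total == 0:
        return {}
    return {stage: weight / total for stage, weight in active.items()}


def weighted_score(stage_scores):
    """stage_scores maps stage name to a 0-100 score; absent stages are inactive."""
    weights = redistribute_weights(stage_scores.keys())
    return sum(stage_scores[stage] * weight for stage, weight in weights.items())


print(redistribute_weights({"headers", "urls"}))
print(weighted_score({"headers": 80}))
```

With headers as the only active stage, the header score passes through unchanged instead of being scaled down to a quarter of its value.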

Cross-stage bonuses are added on top of the weighted score:

Condition                                           Bonus
Malicious URL confirmed by URLScan or VirusTotal    +25
Malicious attachment confirmed by VirusTotal        +30
YARA critical rule triggered                        +20
SPF + DKIM + DMARC all failed                       +15
Reply-To mismatch + malicious URL                   +10

Verdict thresholds:

75 – 100 CRITICAL
50 – 74 HIGH
15 – 49 MEDIUM
0 – 14 LOW
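Putting the bonuses and thresholds together might look like the sketch below. The bonus values and verdict bands are those listed above; the condition keys and function names are hypothetical, and clamping the total to the 0-100 range is an assumption inferred from the threshold scale.

```python
BONUSES = {
    "malicious_url_confirmed": 25,
    "malicious_attachment_confirmed": 30,
    "yara_critical": 20,
    "all_auth_failed": 15,
    "reply_to_mismatch_and_malicious_url": 10,
}


def final_score(weighted, conditions):
    """Add every triggered bonus to the weighted score, clamped to 0-100."""
    total = weighted + sum(BONUSES[name] for name in conditions)
    return max(0, min(100, round(total)))


def verdict(score):
    """Map a 0-100 score onto the four verdict bands."""
    if score >= 75:
        return "CRITICAL"
    if score >= 50:
        return "HIGH"
    if score >= 15:
        return "MEDIUM"
    return "LOW"


score = final_score(60, ["malicious_url_confirmed", "all_auth_failed"])
print(score, verdict(score))
```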

STAGE 06

PDF Report

Generated with ReportLab Platypus. Report sections:

  • Cover — verdict, risk score, report reference (MSL-YYYYMMDD-HHMMSS)
  • IOC Table — sender domain, reply-to domain, all URLs, SHA-256 hashes
  • Authentication Analysis — SPF/DKIM/DMARC table, sender consistency findings, delivery chain reconstruction
  • URL Analysis — per-URL verdict, engine counts, URLScan screenshot
  • Attachment Analysis — file metadata, VT results, YARA match details
  • MITRE ATT&CK Mapping — techniques derived from YARA matches and header findings
  • Recommendations — priority-tagged actions (IMMEDIATE / BLOCKING / DETECTION / HYGIENE)
  • Appendix A — full numbered findings log
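The report reference format on the cover maps directly onto a strftime pattern. A small sketch, where the function name is hypothetical and the use of local time rather than UTC is an assumption:

```python
from datetime import datetime


def report_reference(now=None):
    """Build a reference such as MSL-YYYYMMDD-HHMMSS from a timestamp."""
    now = now or datetime.now()
    return now.strftime("MSL-%Y%m%d-%H%M%S")


print(report_reference())
```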

Live Test — Real Phishing Email

The engine was tested against a real phishing email, received in a live inbox, that impersonated Microsoft Account Security.

// Phishing email sample — Microsoft Account Security impersonation

// Note: URLs in this sample have been censored for your protection, in case you were curious enough to visit them :)

// Analysis engine output — Critical verdict, 100/100 risk score

Technologies Used

Python 3.11 · Streamlit · YARA · VirusTotal API v3 · URLScan.io API · ReportLab Platypus · python-dotenv · email stdlib · MITRE ATT&CK · SPF / DKIM / DMARC