Phishing Email Analysis Engine

Project Overview

Phishing remains the leading initial attack vector by a wide margin. By commonly cited industry estimates, over 90% of cyber attacks start with a phishing email, and many lead to data breaches and operational disruption. Yet most email security tools are black-box scanners that give analysts no visibility into why an email was flagged. This project addresses that gap: a multi-stage phishing analysis engine that parses raw .eml files and runs them through layered detection covering authentication, URL reputation, and attachment analysis, producing a weighted risk verdict and a structured PDF report.

The pipeline is exposed through a Streamlit dashboard where you can upload an email and receive a full breakdown across six analysis stages — each with its own scoring contribution, findings log, and MITRE ATT&CK mapping. The final output is a ReportLab-generated PDF with IOC tables, authentication analysis, URL and attachment verdicts, and priority-tagged remediation recommendations. The tool was tested against a real phishing email impersonating Microsoft Account Security, returning a Critical verdict.

Pipeline Architecture

Raw .eml File: Analyst uploads via Streamlit dashboard

▼

Stage 1 (Email Parser): Extracts headers · URLs · attachments · raw body

▼

Stage 2 (Header Analyzer): SPF · DKIM · DMARC · sender consistency · social engineering signals · received chain

▼

Stage 3 (URL Analyzer): Static checks · URLScan.io browser sandbox · VirusTotal 70+ engines

▼

Stage 4 (Attachment Analyzer): Static checks · VirusTotal SHA-256 lookup · YARA behavioral matching

▼

Stage 5 (Risk Scorer): Dynamic weighted scoring · cross-stage bonus penalties · Low / Medium / High / Critical verdict

▼

Stage 6 (PDF Report Generator): IOC table · MITRE ATT&CK mapping · remediation steps

▼

Streamlit Dashboard (Moaz SOC Lab): Single-page analysis UI · live findings · PDF download
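As a rough illustration of Stage 1, the sketch below parses raw .eml bytes with the email stdlib listed under Technologies Used. The function name parse_eml, the URL regex, and the layout of the returned dict are assumptions made for this example, not the project's actual code.

```python
import re
from email import message_from_bytes, policy

# Deliberately simple URL pattern (assumption); a real parser would also
# pull href attributes out of HTML parts.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def parse_eml(raw: bytes) -> dict:
    """Parse raw .eml bytes into headers, body text, URLs, and attachments."""
    msg = message_from_bytes(raw, policy=policy.default)

    # Prefer the plain-text part and fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""

    attachments = []
    for part in msg.iter_attachments():
        attachments.append({
            "filename": part.get_filename() or "",
            "mime": part.get_content_type(),
            "data": part.get_payload(decode=True) or b"",
        })

    return {
        # Note: repeated headers such as Received collapse to the last value
        # here; use msg.get_all("Received") when every hop is needed.
        "headers": {name: str(value) for name, value in msg.items()},
        "body": body,
        "urls": sorted(set(URL_RE.findall(body))),
        "attachments": attachments,
    }


SAMPLE = (
    b"From: Security <alerts@example.com>\r\n"
    b"To: user@example.org\r\n"
    b"Subject: Verify your account\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Sign in at http://login.example.xyz/verify now.\r\n"
)
parsed = parse_eml(SAMPLE)
```

The later stages can then work from this single dict instead of re-reading the raw message.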

Detection Logic

STAGE 02

Header Analysis

Authentication parsing reads the Authentication-Results header stamped by the receiving mail server and extracts the SPF, DKIM, and DMARC verdicts (pass / fail / softfail / none / neutral / permerror). Sender consistency cross-checks four fields (From, Reply-To, Return-Path, and Message-ID) and requires their base domains to match, subject to defined tolerance rules (e.g. ESP subdomains are accepted on Return-Path, while a null <> Return-Path is flagged as bounce suppression). Social engineering detection scans the subject and body against keyword lists in English, Spanish, and French covering urgency phrases, credential harvest language, suspension threats, and prize lures. The received chain parser reconstructs the full delivery path: each relay prepends its own Received header, so the headers are read in reverse order and Hop 1 is the true origin server.
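Two of the header checks can be sketched in a few lines. The example below extracts SPF, DKIM, and DMARC verdicts from an Authentication-Results value and puts the Received headers in chronological order. The helper names, the regex, and the "first result wins" rule are assumptions for illustration only.

```python
import re
from email import message_from_string, policy

# Matches result tokens such as "spf=pass", "dkim=fail", "dmarc=none".
AUTH_RE = re.compile(r"\b(spf|dkim|dmarc)=([a-z]+)", re.IGNORECASE)


def parse_auth_results(header_value: str) -> dict:
    """Return one verdict per mechanism, defaulting to "none" when absent."""
    verdicts = {}
    for mech, result in AUTH_RE.findall(header_value):
        # Assumption: keep the first result if a mechanism appears twice.
        verdicts.setdefault(mech.lower(), result.lower())
    for mech in ("spf", "dkim", "dmarc"):
        verdicts.setdefault(mech, "none")
    return verdicts


def received_chain(msg) -> list:
    """Received headers are prepended by each relay, so reverse them:
    index 0 of the result is Hop 1, the earliest recorded server."""
    hops = msg.get_all("Received", [])
    return [" ".join(str(hop).split()) for hop in reversed(hops)]


RAW = (
    "Received: from mx2.example.org by mx.example.org; Mon, 1 Jan 2024 10:00:02 +0000\n"
    "Received: from origin.evil.example by mx2.example.org; Mon, 1 Jan 2024 10:00:01 +0000\n"
    "Authentication-Results: mx.example.org; spf=fail smtp.mailfrom=evil.example;"
    " dkim=none; dmarc=fail header.from=example.com\n"
    "From: a@example.com\n"
    "\n"
    "body\n"
)
msg = message_from_string(RAW, policy=policy.default)
auth = parse_auth_results(str(msg["Authentication-Results"]))
chain = received_chain(msg)
```

Here chain[0] is the hop that names origin.evil.example, even though that header appears last in the raw message.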

STAGE 03

URL Analysis

Static checks run without any API calls and contribute directly to the score:

Signal	Score
Raw IP address as host	+15
Known URL shortener	+10
Suspicious TLD (.xyz, .top, .autos, etc.)	+10
Punycode / IDN homograph domain	+15
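The static signals in the table map naturally onto a small scoring function. This is a minimal sketch using the point values above; the shortener list, the TLD set (limited to the three TLDs named in the table), and the function name are placeholders rather than the project's real lists.

```python
import ipaddress
from urllib.parse import urlsplit

SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}   # illustrative subset (assumption)
SUSPICIOUS_TLDS = {"xyz", "top", "autos"}        # only the TLDs named in the table


def static_url_score(url: str):
    """Score a URL with offline checks only; returns (score, findings)."""
    host = (urlsplit(url).hostname or "").lower()
    score, findings = 0, []

    try:
        ipaddress.ip_address(host)
        score += 15
        findings.append("raw IP address as host")
    except ValueError:
        pass  # host is a name, not an IP literal

    if host in SHORTENERS:
        score += 10
        findings.append("known URL shortener")

    if host.rsplit(".", 1)[-1] in SUSPICIOUS_TLDS:
        score += 10
        findings.append("suspicious TLD")

    # Punycode labels start with "xn--" and may hide homograph domains.
    if any(label.startswith("xn--") for label in host.split(".")):
        score += 15
        findings.append("punycode / IDN domain")

    return score, findings
```

For example, static_url_score("http://192.168.1.10/login") yields a score of 15 with a single finding, without any network call.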

URLScan.io — submits each URL to a browser sandbox that fully renders the page and returns a verdict, screenshot, and DNS metadata. Scans are kept private to protect email content from public exposure.
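A private submission to URLScan.io might be built as follows. The sketch only constructs the HTTP request (endpoint, API-Key header, and a JSON body with visibility set to private) and does not send it; the function name is hypothetical, and polling for the finished result is left out.

```python
import json
import urllib.request

URLSCAN_SUBMIT = "https://urlscan.io/api/v1/scan/"


def build_urlscan_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a private scan submission for one URL."""
    payload = json.dumps({"url": url, "visibility": "private"}).encode("utf-8")
    return urllib.request.Request(
        URLSCAN_SUBMIT,
        data=payload,
        headers={"API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_urlscan_request("http://login.example.xyz/verify", "dummy-key")
# Sending would be: urllib.request.urlopen(req). The scan runs asynchronously,
# so the verdict has to be fetched later from the result endpoint.
```

Keeping request construction separate from sending makes the privacy setting easy to check in a unit test.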

VirusTotal — runs each URL against 70+ security engines simultaneously, aggregating verdicts into a single malicious, suspicious, or clean result.
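For the VirusTotal URL check, API v3 identifies a URL by its unpadded URL-safe Base64 encoding and reports per-category engine counts under last_analysis_stats. The sketch below shows both steps; the aggregation rule (any malicious hit wins, then any suspicious hit) is an assumption, since the exact thresholds used by the project are not stated.

```python
import base64


def vt_url_id(url: str) -> str:
    """URL identifier for GET /api/v3/urls/{id}: URL-safe Base64, no padding."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii").rstrip("=")


def vt_verdict(stats: dict) -> str:
    """Collapse last_analysis_stats counts into one verdict (assumed rule)."""
    if stats.get("malicious", 0) > 0:
        return "malicious"
    if stats.get("suspicious", 0) > 0:
        return "suspicious"
    return "clean"


url_id = vt_url_id("http://example.com/")
sample_stats = {"malicious": 3, "suspicious": 1, "harmless": 60, "undetected": 8}
result = vt_verdict(sample_stats)
```

A stricter deployment could require a minimum number of malicious engines before escalating, which trades recall for fewer false positives.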

STAGE 04

Attachment Analysis

Static attachment checks:

Signal	Score
Dangerous extension (.exe, .ps1, .docm, etc.)	+30
Suspicious MIME type	+20
Extension / MIME mismatch (masquerading)	+25 additional
Zero-byte attachment	+5
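The attachment table can be expressed the same way. In this sketch the point values come from the table, while the extension set, the MIME sets, and the expected-MIME map used for mismatch detection are small hypothetical samples.

```python
import os

DANGEROUS_EXTS = {".exe", ".ps1", ".docm"}                 # from the table's examples
SUSPICIOUS_MIMES = {"application/x-msdownload",            # hypothetical sample
                    "application/x-dosexec"}
EXPECTED_MIME = {                                          # hypothetical sample
    ".pdf": "application/pdf",
    ".txt": "text/plain",
    ".png": "image/png",
}


def static_attachment_score(filename: str, mime: str, size: int):
    """Score one attachment with offline checks; returns (score, findings)."""
    ext = os.path.splitext(filename.lower())[1]
    mime = mime.lower()
    score, findings = 0, []

    if ext in DANGEROUS_EXTS:
        score += 30
        findings.append("dangerous extension")
    if mime in SUSPICIOUS_MIMES:
        score += 20
        findings.append("suspicious MIME type")

    # Masquerading: the declared MIME type disagrees with the extension.
    expected = EXPECTED_MIME.get(ext)
    if expected is not None and mime != expected:
        score += 25
        findings.append("extension / MIME mismatch")

    if size == 0:
        score += 5
        findings.append("zero-byte attachment")

    return score, findings
```

A file named invoice.pdf that is declared as application/x-msdownload scores 45 here: 20 for the MIME type plus the additional 25 for the mismatch.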

VirusTotal performs SHA-256 hash lookup only — file bytes are never uploaded, protecting potentially sensitive attachment content from VT intelligence subscribers. A not_found result is treated as clean; YARA independently covers unknown files via behavioral pattern matching.
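The hash-only lookup can be sketched as below: the attachment is hashed locally and only the digest appears in the request URL. The helper names and the verdict rule are assumptions; the 404 branch mirrors the not_found handling described above.

```python
import hashlib

VT_FILES = "https://www.virustotal.com/api/v3/files/"


def sha256_hex(data: bytes) -> str:
    """Hash the attachment locally so the bytes never leave the analyst's machine."""
    return hashlib.sha256(data).hexdigest()


def vt_file_lookup_url(data: bytes) -> str:
    """URL for a hash lookup; the request carries the digest, not the file."""
    return VT_FILES + sha256_hex(data)


def interpret_vt_lookup(status_code: int, body: dict) -> str:
    """Map a lookup response to a verdict (assumed rule)."""
    if status_code == 404:
        return "not_found"  # treated as clean; YARA still inspects the file
    stats = body["data"]["attributes"]["last_analysis_stats"]
    return "malicious" if stats.get("malicious", 0) > 0 else "clean"


empty_digest = sha256_hex(b"")
lookup_url = vt_file_lookup_url(b"")
```

The request itself would be a GET with the API key in the x-apikey header, which is omitted here to keep the example offline.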

YARA rules (yara_rules/malware_signatures.yar):

Rule	Detects
EmbeddedExecutable	MZ header (PE file) hidden inside attachment
PowershellInMacro	PowerShell execution strings inside Office macros
VBAAutoExecute	Auto-execute VBA macros (AutoOpen, AutoExec, etc.)
SuspiciousPDFJavaScript	JavaScript execution triggers inside PDF files
Base64EncodedPayload	Large Base64 blobs with decode calls in scripts
DoubleExtensionTrap	Files named .pdf.exe, .doc.exe etc.
HTMLSmuggling	Blob + atob() + click() in-browser payload delivery pattern
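To make the intent of a few rules concrete, here is a pure-Python approximation of three of them. It is not YARA and not the contents of malware_signatures.yar: the byte patterns, the extension lists, and the function name are guesses at what such rules typically look for.

```python
import re

# Benign-looking extension followed by an executable one, e.g. "invoice.pdf.exe".
DOUBLE_EXT_RE = re.compile(
    r"\.(pdf|docx?|xlsx?|jpg|png|txt)\.(exe|scr|bat|cmd|js|vbs)$", re.IGNORECASE
)
DOS_STUB = b"This program cannot be run in DOS mode"


def approx_rule_matches(filename: str, data: bytes) -> list:
    """Return the names of the approximated rules that fire on this attachment."""
    hits = []

    if DOUBLE_EXT_RE.search(filename):
        hits.append("DoubleExtensionTrap")

    # An MZ magic plus the classic DOS stub text suggests an embedded PE file.
    if b"MZ" in data and DOS_STUB in data:
        hits.append("EmbeddedExecutable")

    # HTML smuggling: decode a payload, wrap it in a Blob, trigger a download.
    if all(token in data for token in (b"Blob", b"atob(", b".click()")):
        hits.append("HTMLSmuggling")

    return hits


smuggling_page = b"<script>var b = new Blob([atob('QQ==')]); a.click();</script>"
hits = approx_rule_matches("doc.html", smuggling_page)
```

Real YARA rules add conditions such as offsets, file-type magic, and string counts, which cut down on the false positives that plain substring checks like these would produce.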

STAGE 05

Risk Scoring

Base weights are redistributed dynamically: if a stage has no data (no URLs, no attachments), its weight is shared proportionally among the active stages. An email with headers only gives headers a weight of 1.0, which prevents score deflation on targeted attacks that avoid URLs and attachments entirely.

Stage	Default Weight
Headers	0.25
URLs	0.40
Attachments	0.35
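Proportional redistribution amounts to renormalizing the weights of the active stages so they sum to 1.0. The sketch below uses the default weights from the table; the function name and the set-based interface are assumptions.

```python
BASE_WEIGHTS = {"headers": 0.25, "urls": 0.40, "attachments": 0.35}


def redistribute(active: set) -> dict:
    """Renormalize base weights over the stages that produced data."""
    total = sum(BASE_WEIGHTS[stage] for stage in active)
    if total == 0:
        return {stage: 0.0 for stage in BASE_WEIGHTS}
    return {
        stage: (weight / total if stage in active else 0.0)
        for stage, weight in BASE_WEIGHTS.items()
    }


headers_only = redistribute({"headers"})
no_attachments = redistribute({"headers", "urls"})
```

With no attachments, the 0.35 attachment weight is split in proportion to the remaining weights: headers become 0.25 / 0.65 ≈ 0.385 and URLs become 0.40 / 0.65 ≈ 0.615.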

Cross-stage bonus penalties applied on top of the weighted score:

Condition	Bonus
Malicious URL confirmed by URLScan or VirusTotal	+25
Malicious attachment confirmed by VirusTotal	+30
YARA critical rule triggered	+20
SPF + DKIM + DMARC all failed	+15
Reply-To mismatch + malicious URL	+10
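The bonus table could be applied as in this sketch. The point values are taken from the table; the flag names are hypothetical, and capping the final score at 100 is an assumption based on the 0 to 100 verdict scale.

```python
BONUSES = {
    "malicious_url": 25,         # confirmed by URLScan or VirusTotal
    "malicious_attachment": 30,  # confirmed by VirusTotal
    "yara_critical": 20,
    "auth_all_failed": 15,       # SPF, DKIM, and DMARC all failed
}


def bonus_points(flags: dict) -> int:
    """Sum the cross-stage bonuses for the conditions that are set."""
    total = sum(points for name, points in BONUSES.items() if flags.get(name))
    # Combination bonus: needs both a Reply-To mismatch and a malicious URL.
    if flags.get("reply_to_mismatch") and flags.get("malicious_url"):
        total += 10
    return total


def final_score(weighted_score: float, flags: dict) -> int:
    """Weighted score plus bonuses, capped at 100 (assumed cap)."""
    return min(100, round(weighted_score + bonus_points(flags)))
```

For instance, a weighted score of 42 with all three authentication checks failed becomes 57, while a Reply-To mismatch on its own adds nothing.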

Verdict thresholds:

Score	Verdict
75 – 100	CRITICAL
50 – 74	HIGH
15 – 49	MEDIUM
0 – 14	LOW
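The thresholds above translate directly into a lookup function. A minimal sketch, assuming an integer score between 0 and 100 and a hypothetical function name:

```python
def verdict(score: int) -> str:
    """Map a 0-100 risk score to its verdict band."""
    if score >= 75:
        return "CRITICAL"
    if score >= 50:
        return "HIGH"
    if score >= 15:
        return "MEDIUM"
    return "LOW"
```

Checking the bands from highest to lowest keeps each boundary in a single comparison.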

STAGE 06

PDF Report

Generated with ReportLab Platypus. Report sections:

Cover — verdict, risk score, report reference (MSL-YYYYMMDD-HHMMSS)
IOC Table — sender domain, reply-to domain, all URLs, SHA-256 hashes
Authentication Analysis — SPF/DKIM/DMARC table, sender consistency findings, delivery chain reconstruction
URL Analysis — per-URL verdict, engine counts, URLScan screenshot
Attachment Analysis — file metadata, VT results, YARA match details
MITRE ATT&CK Mapping — techniques derived from YARA matches and header findings
Recommendations — priority-tagged actions (IMMEDIATE / BLOCKING / DETECTION / HYGIENE)
Appendix A — full numbered findings log
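The report reference format on the cover (MSL-YYYYMMDD-HHMMSS) can be produced with a single strftime call. The function name is hypothetical, and whether the project uses local time or UTC is not stated, so the timestamp is passed in explicitly.

```python
from datetime import datetime


def report_reference(now: datetime) -> str:
    """Format a report reference such as MSL-20250102-030405."""
    return now.strftime("MSL-%Y%m%d-%H%M%S")


ref = report_reference(datetime(2025, 1, 2, 3, 4, 5))
```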

Live Test — Real Phishing Email

The engine was tested against a real phishing email, received in a live inbox, that impersonated Microsoft Account Security.

[Figure: Phishing email sample rendered in browser, impersonating Microsoft Account Security]

[Figure: Analysis engine output showing a Critical verdict with a 100/100 risk score]

Technologies Used

Python 3.11 · Streamlit · YARA · VirusTotal API v3 · URLScan.io API · ReportLab Platypus · python-dotenv · email (stdlib) · MITRE ATT&CK · SPF / DKIM / DMARC