Project Overview
Phishing remains the #1 attack vector: widely cited industry estimates attribute over 90% of cyber attacks to a phishing email as the starting point, leading to data breaches and operational disruption. Yet most email security tools are black-box scanners that give analysts no visibility into why an email was flagged. This project addresses that gap. It is a multi-stage phishing analysis engine that parses raw .eml files and runs them through layered detection covering authentication, URL reputation, and attachment analysis, then produces a weighted risk verdict and a structured PDF report.
The pipeline is exposed through a Streamlit dashboard where you can upload an email and receive a full breakdown across six analysis stages — each with its own scoring contribution, findings log, and MITRE ATT&CK mapping. The final output is a ReportLab-generated PDF with IOC tables, authentication analysis, URL and attachment verdicts, and priority-tagged remediation recommendations. The tool was tested against a real phishing email impersonating Microsoft Account Security, returning a Critical verdict.
Pipeline Architecture
Detection Logic
Header Analysis
Authentication parsing reads the Authentication-Results header stamped by the receiving mail server and extracts the SPF, DKIM, and DMARC verdicts (pass / fail / softfail / none / neutral / permerror). Sender consistency cross-checks four fields (From, Reply-To, Return-Path, and Message-ID), requiring their base domains to match under defined tolerance rules (e.g. ESP subdomains are accepted on Return-Path, and a null <> Return-Path is flagged as bounce suppression). Social engineering detection scans the subject and body against keyword lists in English, Spanish, and French covering urgency phrases, credential harvest language, suspension threats, and prize lures. The received chain parser reconstructs the full delivery path from the Received headers; because each server prepends its own header, the chain is reversed so that Hop 1 is the true origin server.
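As a rough illustration of the authentication parsing step, the sketch below pulls SPF, DKIM, and DMARC verdicts out of an Authentication-Results header using only the standard library. The function name, the regular expression, the sample message, and the choice to default a missing mechanism to "none" are assumptions made for this example, not the project's actual code.

```python
import re
from email import message_from_string

# Verdict vocabulary listed in the section above.
VERDICTS = ("pass", "fail", "softfail", "none", "neutral", "permerror")

# Matches "spf=pass", "dkim=fail", "dmarc=none" etc. (hypothetical pattern).
_AUTH_RE = re.compile(r"\b(spf|dkim|dmarc)=(\w+)", re.IGNORECASE)


def parse_auth_results(raw_email: str) -> dict:
    """Return the SPF/DKIM/DMARC verdicts found in Authentication-Results."""
    msg = message_from_string(raw_email)
    # Assumption: a mechanism that is never mentioned is reported as "none".
    results = {"spf": "none", "dkim": "none", "dmarc": "none"}
    for header in msg.get_all("Authentication-Results", []):
        for mechanism, verdict in _AUTH_RE.findall(header):
            verdict = verdict.lower()
            if verdict in VERDICTS:
                # If several headers are present, the last one seen wins.
                results[mechanism.lower()] = verdict
    return results


# Made-up sample message used only to exercise the parser.
SAMPLE_EML = (
    "Authentication-Results: mx.example.net;\n"
    " spf=softfail smtp.mailfrom=bounce.example.org;\n"
    " dkim=none;\n"
    " dmarc=fail header.from=example.com\n"
    "From: Security <alerts@example.com>\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

print(parse_auth_results(SAMPLE_EML))
```

A production parser would also need to decide which Authentication-Results header to trust when several are present, since only the one added by the receiving server is reliable.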
URL Analysis
Static checks run without any API calls and contribute directly to the score:
| Signal | Score |
|---|---|
| Raw IP address as host | +15 |
| Known URL shortener | +10 |
| Suspicious TLD (.xyz, .top, .autos, etc.) | +10 |
| Punycode / IDN homograph domain | +15 |
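The static signals in the table above can be sketched with the standard library alone. The shortener and TLD sets below are short illustrative stand-ins (the project's full lists are not shown here), the function name is hypothetical, and the sketch assumes that the signals are additive.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative lists only; the real lists are assumed to be longer.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}
SUSPICIOUS_TLDS = {"xyz", "top", "autos"}


def static_url_score(url: str):
    """Return (score, findings) for one URL without any network calls."""
    host = (urlsplit(url).hostname or "").lower()
    score, findings = 0, []

    # Raw IP address used as the host: +15
    try:
        ipaddress.ip_address(host)
        score += 15
        findings.append("raw IP address as host")
    except ValueError:
        pass

    # Known URL shortener: +10
    if host in SHORTENERS:
        score += 10
        findings.append("known URL shortener")

    # Suspicious TLD: +10
    if host.rsplit(".", 1)[-1] in SUSPICIOUS_TLDS:
        score += 10
        findings.append("suspicious TLD")

    # Punycode label (possible IDN homograph): +15
    if any(label.startswith("xn--") for label in host.split(".")):
        score += 15
        findings.append("punycode / IDN domain")

    return score, findings


print(static_url_score("https://xn--pypl-53dc.xyz/verify"))
```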
URLScan.io — submits each URL to a browser sandbox that fully renders the page and returns a verdict, screenshot, and DNS metadata. Scans are kept private to protect email content from public exposure.
VirusTotal — runs each URL against 70+ security engines simultaneously, aggregating verdicts into a single malicious, suspicious, or clean result.
Attachment Analysis
Static attachment checks:
| Signal | Score |
|---|---|
| Dangerous extension (.exe, .ps1, .docm, etc.) | +30 |
| Suspicious MIME type | +20 |
| Extension / MIME mismatch (masquerading) | +25 additional |
| Zero-byte attachment | +5 |
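A minimal sketch of how the attachment signals above might combine is shown below. The extension set, the suspicious MIME set, and the extension-to-MIME map are small illustrative assumptions, as is the interpretation of "mismatch" as a declared MIME type that differs from the one expected for the file extension.

```python
from pathlib import PurePath

# Illustrative values only; the project's real lists are assumed to be longer.
DANGEROUS_EXTENSIONS = {".exe", ".ps1", ".docm"}
SUSPICIOUS_MIME_TYPES = {"application/x-msdownload", "application/x-dosexec"}
EXPECTED_MIME = {
    ".pdf": "application/pdf",
    ".txt": "text/plain",
    ".exe": "application/x-msdownload",
}


def static_attachment_score(filename: str, declared_mime: str, size_bytes: int):
    """Return (score, findings) for one attachment using metadata only."""
    ext = PurePath(filename).suffix.lower()
    mime = declared_mime.lower()
    score, findings = 0, []

    if ext in DANGEROUS_EXTENSIONS:
        score += 30
        findings.append("dangerous extension")
    if mime in SUSPICIOUS_MIME_TYPES:
        score += 20
        findings.append("suspicious MIME type")
    # Masquerading: the declared MIME type disagrees with the extension.
    expected = EXPECTED_MIME.get(ext)
    if expected is not None and expected != mime:
        score += 25
        findings.append("extension / MIME mismatch")
    if size_bytes == 0:
        score += 5
        findings.append("zero-byte attachment")

    return score, findings


# A ".pdf" that declares itself as a Windows executable.
print(static_attachment_score("invoice.pdf", "application/x-msdownload", 1024))
```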
VirusTotal performs a SHA-256 hash lookup only. File bytes are never uploaded, which keeps potentially sensitive attachment content away from VT intelligence subscribers. A not_found result is treated as clean by this stage; YARA independently covers unknown files through content pattern matching.
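The hash-only lookup could look roughly like the sketch below. It hashes the attachment locally and prepares a request against the VirusTotal v3 file report endpoint, which is keyed by hash and authenticated with the x-apikey header. The function names and the mapping from last_analysis_stats to a verdict are assumptions for illustration; the request is only built here, not sent.

```python
import hashlib
import urllib.request

VT_FILES_ENDPOINT = "https://www.virustotal.com/api/v3/files/"


def sha256_of(data: bytes) -> str:
    """Hash the attachment locally; only this digest leaves the machine."""
    return hashlib.sha256(data).hexdigest()


def build_lookup_request(file_hash: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a hash lookup against VirusTotal."""
    return urllib.request.Request(
        VT_FILES_ENDPOINT + file_hash,
        headers={"x-apikey": api_key},
    )


def verdict_from_response(status: int, payload: dict) -> str:
    """Map an HTTP status and JSON body to a verdict (assumed thresholds)."""
    if status == 404:
        # Unknown hash: the section above treats this as clean for scoring.
        return "not_found"
    stats = (
        payload.get("data", {})
        .get("attributes", {})
        .get("last_analysis_stats", {})
    )
    if stats.get("malicious", 0) > 0:
        return "malicious"
    if stats.get("suspicious", 0) > 0:
        return "suspicious"
    return "clean"


digest = sha256_of(b"example attachment bytes")
request = build_lookup_request(digest, api_key="YOUR_API_KEY")
print(request.full_url)
```

Sending the request with urllib.request.urlopen raises urllib.error.HTTPError on a 404, so a caller would need to catch that exception to reach the not_found branch.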
YARA rules (yara_rules/malware_signatures.yar):
| Rule | Detects |
|---|---|
| EmbeddedExecutable | MZ header (PE file) hidden inside attachment |
| PowershellInMacro | PowerShell execution strings inside Office macros |
| VBAAutoExecute | Auto-execute VBA macros (AutoOpen, AutoExec, etc.) |
| SuspiciousPDFJavaScript | JavaScript execution triggers inside PDF files |
| Base64EncodedPayload | Large Base64 blobs with decode calls in scripts |
| DoubleExtensionTrap | Files named .pdf.exe, .doc.exe etc. |
| HTMLSmuggling | Blob + atob() + click() in-browser payload delivery pattern |
Risk Scoring
Base weights are dynamically redistributed — if a stage has no data (no URLs, no attachments), its weight is redistributed proportionally to active stages. An email with headers only gives headers a weight of 1.0, preventing score deflation on targeted attacks that avoid URLs and attachments entirely.
| Stage | Default Weight |
|---|---|
| Headers | 0.25 |
| URLs | 0.40 |
| Attachments | 0.35 |
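The proportional redistribution described above can be sketched as follows. The function names are hypothetical, stage scores are assumed to be on a 0 to 100 scale, and a stage with no data is represented as None.

```python
from typing import Dict, Optional

# Default weights from the table above.
BASE_WEIGHTS = {"headers": 0.25, "urls": 0.40, "attachments": 0.35}


def redistribute_weights(stage_scores: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Rescale base weights so the active stages sum to 1.0."""
    active = {s for s in BASE_WEIGHTS if stage_scores.get(s) is not None}
    total = sum(BASE_WEIGHTS[s] for s in active)
    if total == 0:
        return {s: 0.0 for s in BASE_WEIGHTS}
    return {
        s: (BASE_WEIGHTS[s] / total if s in active else 0.0)
        for s in BASE_WEIGHTS
    }


def weighted_score(stage_scores: Dict[str, Optional[float]]) -> float:
    """Combine per-stage scores using the redistributed weights."""
    weights = redistribute_weights(stage_scores)
    return sum(weights[s] * (stage_scores.get(s) or 0.0) for s in BASE_WEIGHTS)


# Headers-only email: headers take the full weight of 1.0.
print(redistribute_weights({"headers": 70.0, "urls": None, "attachments": None}))
# Headers and URLs only: 0.25 / 0.65 and 0.40 / 0.65.
print(weighted_score({"headers": 60.0, "urls": 80.0, "attachments": None}))
```

Dividing each active weight by the sum of active weights keeps the relative importance of the remaining stages unchanged, which is what "proportionally" implies here.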
Cross-stage bonuses are applied on top of the weighted score:
| Condition | Bonus |
|---|---|
| Malicious URL confirmed by URLScan or VirusTotal | +25 |
| Malicious attachment confirmed by VirusTotal | +30 |
| YARA critical rule triggered | +20 |
| SPF + DKIM + DMARC all failed | +15 |
| Reply-To mismatch + malicious URL | +10 |
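One way to apply the bonuses in the table is sketched below. The function and flag names are hypothetical, and two behaviors are assumptions rather than documented facts: that bonuses stack, and that the final score is capped at 100.

```python
def apply_cross_stage_bonuses(
    weighted: float,
    *,
    malicious_url: bool = False,
    malicious_attachment: bool = False,
    yara_critical: bool = False,
    auth_all_failed: bool = False,
    reply_to_mismatch: bool = False,
) -> float:
    """Add cross-stage bonuses to the weighted score (assumed cap of 100)."""
    bonus = 0
    if malicious_url:
        bonus += 25  # confirmed by URLScan or VirusTotal
    if malicious_attachment:
        bonus += 30  # confirmed by VirusTotal
    if yara_critical:
        bonus += 20
    if auth_all_failed:
        bonus += 15  # SPF, DKIM, and DMARC all failed
    if reply_to_mismatch and malicious_url:
        bonus += 10  # only when both conditions hold
    return min(100.0, weighted + bonus)


print(apply_cross_stage_bonuses(50.0, malicious_url=True, reply_to_mismatch=True))
```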
Verdict thresholds:
PDF Report
Generated with ReportLab Platypus. Report sections:
- Cover — verdict, risk score, report reference (MSL-YYYYMMDD-HHMMSS)
- IOC Table — sender domain, reply-to domain, all URLs, SHA-256 hashes
- Authentication Analysis — SPF/DKIM/DMARC table, sender consistency findings, delivery chain reconstruction
- URL Analysis — per-URL verdict, engine counts, URLScan screenshot
- Attachment Analysis — file metadata, VT results, YARA match details
- MITRE ATT&CK Mapping — techniques derived from YARA matches and header findings
- Recommendations — priority-tagged actions (IMMEDIATE / BLOCKING / DETECTION / HYGIENE)
- Appendix A — full numbered findings log
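The report reference format shown for the cover page (MSL-YYYYMMDD-HHMMSS) maps directly onto a strftime pattern. The helper name below is hypothetical, and whether the project uses local time or UTC is not stated, so the timestamp is simply passed in.

```python
from datetime import datetime


def make_report_reference(moment: datetime) -> str:
    """Format a report reference as MSL-YYYYMMDD-HHMMSS."""
    return moment.strftime("MSL-%Y%m%d-%H%M%S")


print(make_report_reference(datetime(2024, 3, 5, 14, 7, 9)))
```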
Live Test — Real Phishing Email
The engine was tested against a real phishing email, received in a live inbox, that impersonated Microsoft Account Security.
Note: URLs in this sample have been censored for your protection, in case you were curious enough to visit them :)