On-Premise Web & System Log Classifier for Attacks, Competitor Scrapers & Real-Customer Traffic
Ingests Nginx, Apache, and system access logs in real time. A fine-tuned 3.8B-parameter model (Phi-3 Mini), trained specifically on attack signatures, bot fingerprints, and competitor scraping behaviour, classifies every IP session into one of six traffic types. Runs entirely on-premise: no log data ever leaves the host. A daily digest is delivered to Slack.
What We Built
Real-Time Log Ingestion
Nginx, Apache, and Caddy access logs streamed via file tail and syslog — no agent installation required on the web server. Supports multiple server log streams in a single deployment.
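As a rough illustration of the file-tail path, the sketch below follows a log file and parses each line, assuming the servers write the standard "combined" access log layout. The names `parse_line` and `follow` are hypothetical and not taken from the project. The sketch does not handle log rotation or partially written lines, which a production tailer would need to.

```python
import os
import re
import time
from datetime import datetime
from typing import Iterator, Optional

# Combined Log Format, the common "combined" layout for Nginx and Apache.
# Assumption: the deployment's servers use this layout unchanged.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Parse one access log line; return None if it does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec


def follow(path: str, poll_seconds: float = 0.5) -> Iterator[str]:
    """Yield lines appended to a file, similar to `tail -f` (no rotation handling)."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        fh.seek(0, os.SEEK_END)  # start at the end: only new lines are of interest
        while True:
            line = fh.readline()
            if not line:
                time.sleep(poll_seconds)
                continue
            yield line
```

A consumer would loop over `follow("/var/log/nginx/access.log")` (an example path) and pass each line to `parse_line`, discarding `None` results.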
Session Reconstruction
Raw log lines grouped into per-IP sessions with behavioural features — request rate, URL pattern sequence, user-agent fingerprint, referrer chain, and timing intervals — before classification.
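One way to picture the grouping step is the sketch below. It assumes parsed records are dicts with `ip`, `time`, `path`, `agent`, and `referrer` keys, and that a session ends after 30 minutes of inactivity. Both the record shape and the 30-minute gap are illustrative assumptions, as are the function names.

```python
from collections import defaultdict
from datetime import timedelta
from statistics import mean

# Assumption: a gap longer than this starts a new session for the same IP.
SESSION_GAP = timedelta(minutes=30)


def build_sessions(records):
    """Group records into per-IP sessions, splitting on long idle gaps."""
    by_ip = defaultdict(list)  # ip -> list of sessions (each a list of records)
    for rec in sorted(records, key=lambda r: r["time"]):
        sessions = by_ip[rec["ip"]]
        if sessions and rec["time"] - sessions[-1][-1]["time"] <= SESSION_GAP:
            sessions[-1].append(rec)
        else:
            sessions.append([rec])
    return [s for sessions in by_ip.values() for s in sessions]


def features(session):
    """Compute simple behavioural features for one session."""
    times = [r["time"] for r in session]
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    duration = (times[-1] - times[0]).total_seconds()
    return {
        "ip": session[0]["ip"],
        "request_count": len(session),
        # Rate is undefined for a zero-length session, so report None.
        "requests_per_second": len(session) / duration if duration > 0 else None,
        "url_sequence": [r["path"] for r in session],
        "user_agents": sorted({r["agent"] for r in session}),
        "referrer_chain": [r["referrer"] for r in session],
        "mean_gap_seconds": mean(gaps) if gaps else 0.0,
    }
```

The resulting feature dict is deliberately plain data so it can be serialised and handed to a classifier.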
Local Fine-Tuned Classifier LLM
Phi-3 Mini 3.8B fine-tuned on labelled log session datasets covering attack patterns, crawl signatures, competitor scraping behaviour, and real user traffic — running fully offline via Ollama.
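For orientation, a call to a locally served model through Ollama's HTTP API might look like the sketch below. The `/api/generate` endpoint on port 11434 is Ollama's documented default. The model name `log-classifier`, the prompt wording, and the expected JSON reply shape are hypothetical stand-ins for whatever the fine-tuned model actually uses.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_NAME = "log-classifier"  # hypothetical name for the fine-tuned model


def build_request(session_features: dict) -> dict:
    """Build the JSON payload for a single, non-streaming generation request."""
    prompt = (
        "Classify this web session. Respond with JSON containing "
        '"label", "confidence", and "evidence".\n'
        + json.dumps(session_features, sort_keys=True)
    )
    return {"model": MODEL_NAME, "prompt": prompt, "stream": False, "format": "json"}


def classify(session_features: dict, timeout: float = 30.0) -> dict:
    """Send the session to the local Ollama server and decode its JSON answer."""
    body = json.dumps(build_request(session_features)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        envelope = json.load(resp)
    # With stream=False, the generated text arrives in the "response" field.
    return json.loads(envelope["response"])
```

Because the request only ever targets `localhost`, the session data stays on the host, which matches the offline requirement described above.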
Six-Class Traffic Classifier
Every session classified into one of: real customer, internal user, known bot, competitor scraper, vulnerability scanner, or active intrusion attempt — with confidence score and evidence summary per classification.
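The six classes and the per-session result could be modelled along the lines below. This is a sketch: the string values, the `Classification` fields, and the 0 to 1 confidence range are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class TrafficClass(Enum):
    REAL_CUSTOMER = "real_customer"
    INTERNAL_USER = "internal_user"
    KNOWN_BOT = "known_bot"
    COMPETITOR_SCRAPER = "competitor_scraper"
    VULNERABILITY_SCANNER = "vulnerability_scanner"
    ACTIVE_INTRUSION = "active_intrusion"


@dataclass(frozen=True)
class Classification:
    label: TrafficClass
    confidence: float  # assumed to be a probability-like score in [0, 1]
    evidence: str


def parse_classification(raw: dict) -> Classification:
    """Validate a raw model answer; raise ValueError on anything unexpected."""
    label = TrafficClass(raw["label"])  # ValueError if not one of the six classes
    confidence = float(raw["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return Classification(label, confidence, str(raw.get("evidence", "")))
```

Validating the model's answer against a closed enum means a malformed or invented label fails loudly instead of silently entering the digest.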
Local IP Enrichment
MaxMind GeoIP2 database and a local blocklist provide ASN, country, and known-bad-actor data. Zero external API calls — all enrichment runs from locally maintained databases.
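The blocklist half of this enrichment can be sketched with the standard library alone. The file format assumed here (one IP or CIDR range per line, with `#` comments) is hypothetical. The GeoIP2 lookup is not shown because it depends on MaxMind's separate reader package and a local database file.

```python
import ipaddress


def load_blocklist(lines):
    """Parse blocklist lines into network objects, skipping blanks and comments."""
    networks = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            # strict=False accepts entries whose host bits are set, e.g. 10.0.0.5/8.
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_blocked(ip: str, networks) -> bool:
    """Return True if the address falls inside any blocklisted network."""
    addr = ipaddress.ip_address(ip)
    # Membership is False when address and network differ in IP version.
    return any(addr in net for net in networks)
```

A linear scan is adequate for a small list. A large blocklist would likely call for a prefix tree or a sorted interval lookup instead.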
Daily Slack Digest
End-of-day summary of top threat actors, competitor scraping activity, anomaly count, and auto-generated block recommendations — with links to full session detail for any flagged IP.
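A digest of this kind could be assembled as in the sketch below and posted to a Slack incoming webhook, which accepts a JSON body with a `text` field. The input shape (dicts with `ip` and `label`), the set of labels treated as threats, and the message layout are illustrative assumptions. The HTTP post itself is left out.

```python
from collections import Counter

# Assumption: these labels are the ones worth surfacing as threats.
THREAT_LABELS = {"competitor_scraper", "vulnerability_scanner", "active_intrusion"}


def build_digest(results, top_n: int = 5) -> dict:
    """Summarise a day's classified sessions as a Slack webhook payload."""
    results = list(results)  # allow generators; the data is traversed twice
    by_label = Counter(r["label"] for r in results)
    flagged = Counter(r["ip"] for r in results if r["label"] in THREAT_LABELS)

    lines = ["*Daily traffic digest*"]
    for label, count in sorted(by_label.items()):
        lines.append(f"{label}: {count} sessions")
    if flagged:
        lines.append("*Top flagged IPs*")
        for ip, count in flagged.most_common(top_n):
            lines.append(f"{ip} (flagged sessions: {count})")
    return {"text": "\n".join(lines)}
```

The returned dict can be serialised with `json.dumps` and sent as the body of a POST to the workspace's webhook URL.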
Technologies Used
Key Outcomes
Active intrusion attempt detection latency from first request to classification
On-premise: zero log data leaves the host at any stage
Competitor scraper sessions identified and actioned after their first request pattern
Need Something Similar?
Tell us about your log volumes, web server stack, and biggest monitoring blind spots. We will propose an on-premise AI monitoring architecture that fits your environment.