Debra Capadona
Senior Technical Program Manager | Data Engineering & ML Systems
Edge Condition Analysis is a personal research and delivery framework for exploring complex data and ML systems.
A demonstration of my ability to deliver modern end-to-end systems: from architecture and data pipelines to scalable ML inference and validation.
This project is an evolving, scalable linguistic analysis platform designed to explore how collective language shifts under changing conditions. It currently analyzes ~200,000 Hacker News discussions (2024-2025) using nine custom BERT models trained with a weakly supervised approach, combining small curated seed labels with unsupervised pattern discovery. The system is built to scale across larger corpora as additional data sources and analytical dimensions are introduced.
The work focuses on surfacing interpretable linguistic signals (emotional valence, urgency, certainty decay, and topic dominance) rather than producing finalized predictive outputs. The architecture supports ongoing experimentation, model iteration, and methodological refinement, allowing analytical assumptions and measurement strategies to evolve alongside the data.
Interactive Linguistic Dimensions Timeline
Explore 9 BERT-based linguistic dimensions across 2024-2025. Hover for details, toggle dimensions on/off, use the slider to zoom into specific timeframes. Event markers show major 2024-2025 events for context.
Word Burst Explorer
Interactive visualization showing which words "burst" (appear significantly more than baseline) in Hacker News discussions each month. Use the slider to navigate through 24 months of data.
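The "burst" idea above can be sketched as a simple frequency-ratio test: a word bursts in a month when its per-token frequency substantially exceeds its corpus-wide baseline frequency. This is a minimal illustration, not the production scoring; the function name, smoothing, and thresholds are assumptions.

```python
from collections import Counter

def find_bursts(month_tokens, baseline_tokens, min_count=5, ratio_threshold=3.0):
    """Return {word: ratio} for words whose monthly frequency exceeds the
    baseline frequency by at least ratio_threshold."""
    month = Counter(month_tokens)
    baseline = Counter(baseline_tokens)
    m_total = sum(month.values())
    b_total = sum(baseline.values())
    bursts = {}
    for word, count in month.items():
        if count < min_count:
            continue  # ignore rare words that can't support a claim
        month_freq = count / m_total
        # Add-one smoothing so words unseen in the baseline don't divide by zero.
        base_freq = (baseline[word] + 1) / (b_total + 1)
        ratio = month_freq / base_freq
        if ratio >= ratio_threshold:
            bursts[word] = round(ratio, 2)
    return bursts
```

A real system would also correct for document length and month size, but the ratio-over-baseline core is the same.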
Technology Stack
System Architecture
Data Pipeline:
1. Collection → Scraped 197K HN stories via API with incremental checkpointing
2. Processing → Tokenized and normalized text, created word-level indexes
3. ML Inference → Applied 9 BERT models using GPU acceleration (~300 stories/sec)
4. Storage → PostgreSQL with optimized schema, indexes, and foreign keys
5. Analysis → Statistical validation, baseline establishment, coherence scoring
6. Visualization → Interactive Plotly dashboards for exploration
Infrastructure: Docker containerization, Alembic migrations, GPU-optimized PyTorch models
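A pipeline of this shape can be wired as a chain of stage functions, each consuming the previous stage's output, which keeps every stage independently testable and replaceable. This is a schematic sketch only: the stage names come from the list above, and the bodies are placeholders for the real collection, tokenization, and inference code.

```python
def collect():
    """Stage 1 (placeholder): fetch raw stories."""
    return [{"id": 1, "title": "Show HN: a demo"}]

def process(stories):
    """Stage 2 (placeholder): tokenize and normalize titles."""
    for s in stories:
        s["tokens"] = s["title"].lower().split()
    return stories

def infer(stories):
    """Stage 3 (placeholder): attach per-dimension model scores."""
    for s in stories:
        s["scores"] = {"urgency": 0.1}
    return stories

def run_pipeline():
    """Run the stages in order; later stages see earlier stages' output."""
    return infer(process(collect()))
```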
System Architecture & Methodology
Project Scope
Built an end-to-end linguistic analysis platform to demonstrate technical program management capabilities for senior roles at companies like Coinbase. The system showcases:
- Large-scale data engineering - Processing millions of records with proper database design
- ML model deployment - Training and deploying 9 BERT models with GPU acceleration
- Statistical rigor - Establishing baselines, running t-tests, calculating effect sizes
- Modern tech stack - Docker, PostgreSQL, PyTorch, interactive visualizations
- Project delivery - Complete system from architecture to deployment in 6 weeks
Data Collection
Designed and implemented an incremental data pipeline scraping the Hacker News API. Collected 197,496 stories across 2024-2025 with proper error handling, rate limiting, and resume capability.
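The checkpoint-and-resume pattern can be sketched as below. The checkpoint file name is an assumption, and fetching is stubbed as an injected callable (the real pipeline would hit the Hacker News Firebase API and add rate limiting and error handling around that call).

```python
import json, os

CHECKPOINT = "scrape_checkpoint.json"  # hypothetical path

def load_checkpoint():
    """Resume from the last saved item id, or start fresh at 0."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["last_id"]
    return 0

def save_checkpoint(last_id):
    with open(CHECKPOINT, "w") as f:
        json.dump({"last_id": last_id}, f)

def scrape(fetch_item, start_id, end_id, batch=100):
    """Fetch items in [start_id, end_id), checkpointing every `batch` ids
    so an interrupted run resumes where it left off instead of restarting."""
    stories = []
    for item_id in range(max(start_id, load_checkpoint()), end_id):
        item = fetch_item(item_id)  # real code: GET the item from the HN API
        if item is not None:
            stories.append(item)
        if item_id % batch == 0:
            save_checkpoint(item_id)
    save_checkpoint(end_id)
    return stories
```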
ML Model Training
Trained 9 custom BERT models for linguistic dimension analysis. Optimized for GPU inference, achieving ~300 stories/second processing speed. Models saved and versioned for reproducibility.
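Throughput at that level comes largely from batching, and the batching logic can be sketched independently of the model. Here the scorer is an injected stub standing in for a real BERT forward pass; in the actual system it would tokenize a list of strings and run the model on the GPU under `torch.no_grad()`. Names and the batch size are assumptions.

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks; the last chunk may be smaller."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def run_inference(texts, score_batch, batch_size=64):
    """Score texts batch by batch instead of one at a time, so the
    (stubbed) model call amortizes its per-call overhead."""
    results = []
    for chunk in batched(texts, batch_size):
        results.extend(score_batch(chunk))
    return results
```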
Database Architecture
Designed normalized PostgreSQL schema handling millions of word-level associations. Implemented proper indexing, foreign keys, and Alembic migrations. Optimized queries for analytical workloads.
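A normalized schema of that shape, stories, words, and a word-association join table with foreign keys plus an index for the analytical access path, can be sketched as below. SQLite is used here only so the snippet runs anywhere; the production system uses PostgreSQL with Alembic-managed migrations, and the table and column names are assumptions.

```python
import sqlite3

DDL = """
CREATE TABLE stories (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    posted_at TEXT NOT NULL            -- ISO-8601 timestamp
);
CREATE TABLE words (
    id   INTEGER PRIMARY KEY,
    word TEXT NOT NULL UNIQUE
);
-- Join table holding the word-level associations per story.
CREATE TABLE story_words (
    story_id INTEGER NOT NULL REFERENCES stories(id),
    word_id  INTEGER NOT NULL REFERENCES words(id),
    count    INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (story_id, word_id)
);
-- Index for the common analytical query: which stories contain word X?
CREATE INDEX idx_story_words_word ON story_words(word_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

The composite primary key on `story_words` both deduplicates associations and serves as the story-to-words index, so only the reverse direction needs an explicit index.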
Statistical Analysis
Established baselines across the full dataset, performed t-tests for significance, and calculated effect sizes. Developed an Event Coherence Index measuring cross-dimensional synchronization.
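The significance and effect-size machinery can be illustrated with Welch's t-statistic and Cohen's d computed directly; the production analysis presumably uses a library such as SciPy, but a standalone version keeps the arithmetic visible.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic: mean difference scaled by the combined
    standard error, without assuming equal variances."""
    va, vb = variance(a), variance(b)
    return (mean(a) - mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

def cohens_d(a, b):
    """Effect size: mean difference over the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled ** 0.5
```

The t-statistic still needs degrees of freedom (Welch-Satterthwaite) and a t-distribution lookup to become a p-value, which is where a stats library earns its keep.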
Visualization & Delivery
Created interactive Plotly dashboards enabling exploration of the 2-year dataset. Built a responsive portfolio site demonstrating technical execution. Documented the methodology and prepared the codebase for GitHub publication.