AI/ML · 19 May 2026 · Essay · By Babulal Tamang

From Symbols to Foundation Models: The Historical Paradigms of Artificial Intelligence

Artificial intelligence did not begin with ChatGPT. Its history is a chain of competing ideas (logic, rules, statistics, neurons, scale), each promising that machines could reason, learn, or behave intelligently. This post maps those paradigms, the people who shaped them, and where the field stands after more than seventy years of ambition and setback.

In short

AI history is a sequence of paradigms: symbolic reasoning, expert systems, connectionist networks, statistical learning, deep learning, and large foundation models. Today’s systems are powerful pattern engines with real limits; governance and engineering matter as much as model size.

What we mean by a “paradigm”

In science and engineering, a paradigm is more than a buzzword. It is a shared answer to three questions: What is intelligence? How should we build it? How do we know we succeeded? Each era of AI picked different answers—logic and symbols, hand-crafted rules, probability, layered neurons, or trillion-parameter language models trained on the web.

Paradigms rarely disappear overnight. Expert systems inform today’s knowledge graphs; symbolic planners sit inside robotics stacks; classical statistics still underpins evaluation and A/B tests. Understanding history helps you see why a tool works in one context and fails in another, and why “AI” in 2026 is not the same problem statement as in 1956.

Before the name: computing, cybernetics, and the imitation game

Long before anyone said “artificial intelligence,” thinkers asked whether machines could think.

Alan Turing (1950) proposed the imitation game—we now call it the Turing test—as a practical criterion for machine intelligence, sidestepping endless philosophy about consciousness.
Norbert Wiener and the cybernetics movement treated control, feedback, and communication as unifying ideas across brains and machines.
Warren McCulloch and Walter Pitts (1943) modeled neurons as logic gates, hinting that thought might be computation.
Early hardware—ENIAC, later stored-program machines—made simulation conceivable at scale.

These threads set the stage: intelligence as information processing, not magic.
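
The McCulloch and Pitts idea of neurons as logic gates can be sketched in a few lines. This is a minimal illustration, not their original formalism: the function names are hypothetical, and inhibition is modeled here as a negative weight, which is a common textbook simplification.

```python
def mcp_neuron(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def AND(a, b):
    # Both inputs must be active to reach a threshold of 2.
    return mcp_neuron([a, b], [1, 1], threshold=2)


def OR(a, b):
    # A single active input is enough to reach a threshold of 1.
    return mcp_neuron([a, b], [1, 1], threshold=1)


def NOT(a):
    # Assumption: inhibition expressed as a negative weight with threshold 0.
    return mcp_neuron([a], [-1], threshold=0)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
```

The weights and thresholds are fixed by hand here; nothing is learned. Learning those numbers from examples is the step the later connectionist work added.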

Paradigm 1: Symbolic AI and the “thinking machine” (1950s–1970s)

The term artificial intelligence comes from John McCarthy’s 1955 proposal for the Dartmouth workshop, held in summer 1956. McCarthy and his co-organizers (Marvin Minsky, Claude Shannon, and Nathaniel Rochester) bet that intelligence could be encoded as symbols manipulated by rules: if you could represent knowledge and search the space of inferences, you could build a mind.

Early wins were real but narrow:

Logic Theorist (Newell & Simon) proved theorems in formal logic.
ELIZA (Weizenbaum, 1966) mimicked a Rogerian therapist with pattern matching—convincing enough to unsettle its author about human gullibility.
SHRDLU (Winograd) understood blocks-world language in a constrained microworld.

The paradigm assumed that explicit representation + search would scale. It did not—commonsense knowledge and ambiguity exploded combinatorially. Funding cycles tightened; the first AI winter arrived when promises outran delivery.

Key figures this era: McCarthy, Minsky, Allen Newell, Herbert Simon, Joseph Weizenbaum, Terry Winograd.

Paradigm 2: Expert systems — knowledge as currency (1970s–1980s)

Industry pivoted from general intelligence to expert systems: capture a specialist’s rules in an inference engine. If symbolic AI could not mimic a child, perhaps it could mimic a chemist or a doctor in a bounded domain.

DENDRAL (Stanford) inferred molecular structure from mass spectrometry data—early success in scientific AI.
MYCIN (Shortliffe et al.) recommended antibiotics for blood infections; evaluations were promising, though deployment raised liability and workflow questions.
Commercial shell vendors (e.g. IntelliCorp, Teknowledge) spread the pattern: knowledge engineers interviewed experts, codified if–then rules, and attached explanation facilities.
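
The if–then pattern can be sketched as a tiny forward-chaining engine. This is an illustrative toy under stated assumptions: the rule names and facts are hypothetical, and the historical systems used richer machinery (MYCIN, for instance, reasoned backward from goals and attached certainty factors to its rules).

```python
def forward_chain(facts, rules):
    """Apply if-then rules until no new fact can be derived.

    Returns the final fact set and a trace of which rule produced each new fact.
    """
    known = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for name, conditions, conclusion in rules:
            if conclusion not in known and all(c in known for c in conditions):
                known.add(conclusion)
                trace.append((name, conclusion))
                changed = True
    return known, trace


# Hypothetical toy rules for a car that will not start.
RULES = [
    ("R1", ["engine_cranks", "no_spark"], "ignition_fault"),
    ("R2", ["ignition_fault", "plugs_older_than_5_years"], "check_spark_plugs"),
]

known, trace = forward_chain(
    ["engine_cranks", "no_spark", "plugs_older_than_5_years"], RULES
)
for name, conclusion in trace:
    print(f"{name} fired, concluding {conclusion}")
```

The trace doubles as a crude explanation facility: it records which rule justified each conclusion. It also hints at the maintenance problem, since every new rule can interact with every existing one.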

Expert systems delivered ROI in niches but were brittle: maintenance cost grew with rule count; conflicting rules appeared; tacit knowledge resisted interviews. A second winter followed when desktop computing and cheaper custom software undercut expensive AI projects.

Key figures: Edward Feigenbaum, Bruce Buchanan, Randall Davis, Edward Shortliffe.

Paradigm 3: Connectionism — learning in the weights (1940s roots, revivals 1980s–2010s)

A parallel tradition treated intelligence as emergent from networks of simple units. Frank Rosenblatt’s perceptron (1958) could learn linear decision boundaries; Minsky and Papert’s book Perceptrons (1969) highlighted limitations and cooled enthusiasm for years.
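
A short sketch makes both halves of that story concrete. It uses the classic perceptron learning rule on two toy datasets; the function names, learning rate, and epoch count are illustrative choices, not Rosenblatt's original setup. OR is linearly separable and is learned; XOR is the textbook case that no single linear boundary can separate.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 when the weighted sum is positive."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule: nudge weights toward each misclassified example."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error != 0:
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                mistakes += 1
        if mistakes == 0:  # every sample classified correctly
            break
    return weights, bias


def count_correct(samples, weights, bias):
    """Number of samples the trained unit classifies correctly."""
    return sum(predict(weights, bias, x) == target for x, target in samples)


OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

or_weights, or_bias = train_perceptron(OR_DATA)
xor_weights, xor_bias = train_perceptron(XOR_DATA)

print("OR correct:", count_correct(OR_DATA, or_weights, or_bias), "of 4")
print("XOR correct:", count_correct(XOR_DATA, xor_weights, xor_bias), "of 4")
```

On OR the rule settles on a separating line within a few passes. On XOR it keeps cycling and never gets all four points right, however many epochs it is given.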

The ideas never died. Breakthroughs returned when data, compute, and training algorithms aligned:

Backpropagation popularized by Rumelhart, Hinton, and Williams (1986) enabled multi-layer networks to learn internal representations.
Convolutional networks for vision: Yann LeCun’s LeNet for digit recognition showed that structured architectures matter.
Recurrent networks and LSTMs (Hochreiter & Schmidhuber, 1997) addressed sequences and memory for speech and text.
Reinforcement learning—Richard Sutton and Andrew Barto’s framework; later DeepMind’s DQN (Mnih et al.) and AlphaGo (Silver et al.) learned policies through trial and reward.

Connectionism reframed the question: instead of hand-writing rules, let the system learn features from examples. The cost was data hunger and opaque internals.

Key figures: Rosenblatt, Geoffrey Hinton, Yann LeCun, Yoshua Bengio, Jürgen Schmidhuber, Richard Sutton, Andrew Barto, Demis Hassabis.

Paradigm 4: Statistical machine learning — probability beats hand-crafting (1990s–2010s)

As the web generated labeled and unlabeled data, machine learning became an engineering discipline distinct from “AI” in the symbolic sense. The paradigm: choose a model family, define a loss, optimize on data, validate on held-out sets.

Support vector machines (Vapnik) dominated many classification benchmarks.
Ensemble methods—random forests (Leo Breiman), gradient boosting—won tabular competitions before deep nets ate vision and speech.
Bayesian methods and graphical models handled uncertainty and sparse evidence.
Unsupervised and representation learning—clustering, PCA, later autoencoders—fed downstream tasks.

Core ideas from this era still matter: bias–variance tradeoffs, cross-validation, feature leakage, calibration, and honest baselines. Kaggle leaderboards and industrial recommender systems were built here as much as in research labs.
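
The recipe (pick a model family, define a loss, optimize, validate on held-out data, compare against an honest baseline) fits in a few dozen lines. This sketch runs on synthetic data; the true slope and intercept, noise level, split ratio, and learning rate are all assumptions chosen for illustration.

```python
import random


def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def fit_linear(train, lr=0.1, steps=2000):
    """Minimize MSE with plain batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(train)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic data: y = 3x + 2 plus Gaussian noise (assumed for the demo).
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    data.append((x, 3.0 * x + 2.0 + rng.gauss(0.0, 0.1)))

# Hold out 20% that the optimizer never sees.
rng.shuffle(data)
train, test = data[:160], data[160:]

w, b = fit_linear(train)

# Honest baseline: always predict the mean of the training targets.
train_mean = sum(y for _, y in train) / len(train)
model_mse = mse(w, b, test)
baseline_mse = mse(0.0, train_mean, test)

print(f"learned w={w:.2f}, b={b:.2f}")
print(f"held-out MSE: model={model_mse:.4f}, baseline={baseline_mse:.4f}")
```

The number worth reporting is the gap between the model and the baseline on the held-out split, not the training loss. Leakage would show up here as a held-out score that looks too good because test rows influenced the fit.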

Key figures: Vladimir Vapnik, Leo Breiman, Michael Jordan, Andrew Ng, Pedro Domingos, Tom Mitchell.

Paradigm 5: Deep learning at scale — representation + data + GPUs (2012–2017)

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton’s AlexNet (2012) crushed ImageNet benchmarks using GPUs and deep convolutional layers. The lesson spread: with enough data and compute, end-to-end learning could outperform painstaking feature engineering.

Milestones stacked quickly:

Sequence-to-sequence models and attention for machine translation.
ResNets (He et al.) enabled very deep networks via skip connections.
Generative adversarial networks (Goodfellow et al., 2014) for synthetic images.
AlphaGo combining deep nets with Monte Carlo tree search—symbolic search married to learned intuition.

“Deep learning” became the default for perception (vision, speech) and many structured prediction tasks. MLOps emerged because training pipelines, model registries, and drift monitoring were production problems, not notebook demos.

Paradigm 6: Transformers and foundation models (2017–present)

The paper “Attention Is All You Need” (Vaswani et al., 2017) introduced the Transformer: self-attention replaced recurrence for many sequence tasks, enabling parallel training at scale. Language modeling turned from niche to infrastructure.
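
The core operation, scaled dot-product attention, can be sketched without any framework. This is a simplified single-head version under stated assumptions: the learned query, key, and value projections are omitted (the same vectors play all three roles), there is no masking or positional encoding, and the three-token input is made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q.k / sqrt(d)) weighted sum of values."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        )
    return outputs, weights


# Hypothetical 3-token sequence with 2-dimensional embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = attention(X, X, X)

for i, row in enumerate(attn):
    print(f"token {i} attends with weights", [round(a, 3) for a in row])
```

Each pass through the loop over queries depends only on the inputs, never on another position's output. That independence is what lets real implementations compute every position at once as matrix products, where a recurrent network has to step through the sequence in order.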

What changed in practice:

Pre-train once, adapt many times — BERT (Devlin et al.), GPT lineages (Radford et al.; OpenAI), T5, and open weights from Meta, Mistral, and others.
Scaling laws — larger models and datasets often improve capability predictably, until they do not or until cost dominates.
Multimodality — vision-language models, speech, code; one interface, many modalities.
Agents and tools — models that call APIs, browse, write code—closing the loop between prediction and action (with new failure modes).

This is the era most people mean by “AI” in 2026: foundation models—general-purpose systems fine-tuned or prompted for specific work. It inherits connectionist learning but operates at civilizational data scale and corporate capital intensity.

Key figures: Ashish Vaswani and co-authors; Sam Altman and the OpenAI team; Demis Hassabis (DeepMind); Fei-Fei Li (vision and human-centered AI); Timnit Gebru and Margaret Mitchell (ethics and accountability in deployed systems).

Paradigm shifts at a glance

Era	Core idea	Strength	Limit
Symbolic AI	Logic + symbols + search	Interpretable reasoning in small worlds	Knowledge acquisition bottleneck
Expert systems	Encode expert rules	Commercial value in narrow domains	Maintenance, brittleness
Connectionism / deep nets	Learn representations from data	Perception, speech, complex patterns	Data, compute, opacity
Statistical ML	Optimize loss on samples	Strong baselines, rigorous evaluation	Feature engineering effort, distribution shift
Foundation models	Pre-trained general models + adaptation	Flexibility, language, multimodal tasks	Hallucination, cost, governance

Who this post covers (and who it does not)

No single article can list every contributor. The table below summarizes people and groups referenced above—a map for further reading, not a hall of fame ranked by importance.

Person / group	Contribution
Alan Turing	Computability; imitation game as operational test
John McCarthy	Coined “AI”; Lisp; time-sharing ideas
Marvin Minsky	Symbolic AI; society of mind; perceptron critique
Allen Newell & Herbert Simon	Physical symbol systems; early problem solvers
Edward Feigenbaum & colleagues	Expert systems movement
Frank Rosenblatt	Perceptron; early neural learning
Geoffrey Hinton, Yann LeCun, Yoshua Bengio	Deep learning revival; backprop, CNNs, recognition
Richard Sutton & Andrew Barto	Reinforcement learning textbook and theory
Vladimir Vapnik	Statistical learning theory; SVMs
Vaswani et al. / Transformer authors	Attention-based sequence modeling at scale
OpenAI, DeepMind, Meta AI, Anthropic, others	Large-scale training, alignment research, open and closed model ecosystems
Timnit Gebru, Margaret Mitchell, Joy Buolamwini	Fairness, accountability, and harms of deployed ML

Women and Global South researchers are underrepresented in popular narratives; modern AI also depends on data workers, annotators, and maintainers who rarely appear in headlines. A complete picture includes labor, supply chains, and environmental cost, not only architecture papers.

AI winters and why hype repeats

Funding and attention collapsed twice (roughly mid-1970s and late 1980s) when symbolic and expert-system roadmaps stalled. The pattern is familiar: a demo impresses executives, timelines shrink, integration costs are underestimated, and maintenance eats budgets.

Today’s risk is different in scale but similar in shape. One form is pilot purgatory: impressive prototypes that never reach governed production. The other is production systems that ship without monitoring, rollback, or human override. History suggests that measuring outcomes (quality, latency, cost, incident rate) beats declaring victory after a keynote.

Where we are now (2026)

Capability and caution coexist.

What works well — drafting and summarization, code assistance, search augmentation, document extraction, many perception tasks, personalization when privacy is handled carefully, and acceleration of research loops with human review.
What remains hard — reliable planning over long horizons, verifiable truth in open domains, robustness to adversarial input, cheap on-device parity with cloud giants, and causal reasoning beyond pattern completion.
Engineering reality — RAG, fine-tuning, eval harnesses, guardrails, cost controls, and observability are how teams ship; the model weights are one layer in a system.
Governance — frameworks such as ISO/IEC 42001 (AI management systems) and sector rules treat AI as organizational risk: roles, lifecycle, bias, transparency, and continual improvement—not only accuracy metrics.
Platform angle — AI workloads need the same discipline as any critical service: data pipelines, identity, secrets, GPU/CPU scheduling, and incident response. Cloud and Kubernetes skills still matter; the “intelligence” is a service with SLAs.

We are not at artificial general intelligence as portrayed in film. We are in a tool-rich, responsibility-heavy phase: foundation models as programmable infrastructure, with society still negotiating trust, copyright, labor impact, and safety research agendas.

How the paradigms connect to practice

If you build or operate systems today, the historical map is a decision guide:

Need explainable rules in a regulated workflow? Symbolic and expert-system thinking still appears in policy engines and decision tables—often alongside ML.
Need prediction from structured logs or metrics? Classical ML and time-series methods may beat a large language model on cost and clarity.
Need language, vision, or multimodal UX? Foundation models plus retrieval and evals are the default experiment—prove value before scaling spend.
Need long-term autonomy? Reinforcement learning and planning research matter—but production usually demands human-in-the-loop until failure modes are bounded.

On this site, Neural Networks in Depth covers perceptrons, backpropagation, and core architectures; Large Language Models in Depth focuses on the text-generation stack (tokens, inference, RAG); AI Foundation Models in Depth unpacks transformers, training, and production patterns; credential posts on AWS Machine Learning Foundations and Generative AI Foundations cover practical building blocks; the ISO/IEC 42001 note covers audit-style governance. This essay is the narrative spine connecting them.
