Teaching EDA in the Age of ChatGPT: What Still Matters
A student came to my office hours last week with a dataset about Boston housing prices. They'd asked ChatGPT to "do exploratory data analysis" and it had generated twenty lines of pandas code: summary statistics, a correlation heatmap, a few histograms. All technically correct. All completely useless for the question they were actually trying to answer.
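The template ChatGPT hands back looks roughly like this (a sketch, not the student's actual output; the column names are stand-ins, and I've generated synthetic data so it runs on its own):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing dataset (column names are made up)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.normal(450_000, 120_000, 200),
    "sqft": rng.normal(1_500, 400, 200),
    "year_built": rng.integers(1900, 2020, 200),
})

# The boilerplate EDA pass: technically correct, question-agnostic
summary = df.describe()              # summary statistics for every numeric column
corr = df.corr(numeric_only=True)    # correlation matrix (the heatmap's raw data)
# df.hist() would round it out with a histogram per numeric column

print(summary.loc["mean"].round(0))
print(corr.round(2))
```

Every line is defensible. None of it is aimed at the question the student actually had.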
They wanted to understand why certain neighborhoods had price outliers. ChatGPT gave them a generic EDA template. It didn't know to look at the relationship between renovation year and sale price. It didn't think to check whether the outliers clustered near transit lines. It didn't have the curiosity to ask "wait, what's going on with this bimodal distribution?"
That's the thing about Exploratory Data Analysis. The "exploratory" part is the whole point, and it's the part AI can't do.
Execution vs. Investigation
ChatGPT is excellent at execution. Tell it exactly what you want and it'll write the code. "Create a scatter plot of price vs. square footage, colored by neighborhood." Done. Flawless syntax, reasonable defaults, even a title.
But EDA isn't about execution. EDA is about not knowing what you're looking for and figuring it out through the data. It's iterative, messy, driven by hunches. You look at one plot, notice something weird, generate a hypothesis, create another plot to test it. That cycle of observation and questioning is fundamentally human. It requires domain knowledge, intuition, and a tolerance for ambiguity that language models don't have.
I've been explaining this to my CS3000 students using an analogy that seems to land. ChatGPT is like a very fast lab technician. It can run any test you ask for, perfectly. But it can't decide which tests to run. That's the doctor's job. In data science, that's your job.
What I've Changed in How I Teach
This semester I've shifted my office hours approach. Instead of helping students write code, I spend more time asking them questions about their data before they write anything.
"What do you expect this distribution to look like, and why?"
"If this correlation is strong, what would that mean for your hypothesis?"
"You found an outlier. Before you remove it, can you explain what it represents in the real world?"
These questions force students to think before they code. And thinking before coding is exactly the skill that separates a data scientist from someone who can prompt ChatGPT effectively.
The Plots That Actually Matter
In a typical EDA workflow, maybe 20% of the visualizations you create end up being useful. The other 80% are dead ends, things you checked that didn't reveal anything interesting. That's normal. That's the process.
ChatGPT doesn't generate dead ends. It generates the standard plots for any dataset: histograms of numeric columns, bar charts of categoricals, a correlation matrix. These are fine as a starting point, but they're the warm-up, not the analysis. The insights come from the non-obvious plots. The ones where you segment the data in an unusual way, or transform a variable, or overlay two distributions that nobody thought to compare.
I had a student this week who was analyzing Uber ride data in Boston. ChatGPT gave them a time-series plot of ride counts. Fine. But the student, on their own, decided to split the data by day-of-week and noticed that Friday evening patterns in Back Bay looked completely different from the rest of the city. That observation led to a hypothesis about event venues that turned into the best part of their project.
No prompt would have generated that insight. It came from a person who knew Boston, knew the data, and was genuinely curious.
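For what it's worth, the mechanical half of that move is a one-liner once you've thought to make it. A sketch, with synthetic ride data standing in for the real dataset (the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Synthetic ride data; the real dataset's schema is an assumption here
rng = np.random.default_rng(1)
n = 1_000
rides = pd.DataFrame({
    "pickup_time": pd.to_datetime("2024-01-01")
        + pd.to_timedelta(rng.integers(0, 60 * 24 * 90, n), unit="m"),
    "neighborhood": rng.choice(["Back Bay", "Fenway", "South End"], n),
})

# The non-obvious cut: ride counts by neighborhood and day of week
rides["dow"] = rides["pickup_time"].dt.day_name()
counts = rides.groupby(["neighborhood", "dow"]).size().unstack("dow")
print(counts)
```

Asking ChatGPT for "a groupby on neighborhood and day of week" would produce this instantly. Deciding that Friday evenings in Back Bay were the cell worth staring at is what the student contributed.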
The Skill That Becomes More Valuable
There's a pattern I keep seeing in tech. When a tool automates the easy parts of a job, the hard parts become more valuable, not less. Calculators didn't make math irrelevant. They made mathematical thinking more important because you could tackle bigger problems.
ChatGPT is doing the same thing for data science. It automates the syntax, the boilerplate, the "how do I pivot this table" questions. That means the value shifts entirely to the analytical thinking layer. What question should I ask? What does this pattern mean? Is this correlation causal or coincidental? What am I not seeing?
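The "how do I pivot this table" class of question really is solved. A toy example of the kind of reshaping boilerplate that's now effectively free (the data here is invented for illustration):

```python
import pandas as pd

# Toy long-format data; purely illustrative
sales = pd.DataFrame({
    "city": ["Boston", "Boston", "Cambridge", "Cambridge"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [10, 12, 7, 9],
})

# Long to wide: one row per city, one column per quarter
wide = sales.pivot(index="city", columns="quarter", values="revenue")
print(wide)
```

Five years ago this was a Stack Overflow search; now it's a prompt. The questions in the paragraph above are what remain.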
Those questions require experience, judgment, and genuine curiosity about the world. I don't know how to automate that. I don't think anyone does.
So when students ask me "why learn EDA if ChatGPT can do it," I tell them: ChatGPT can make the plots. You have to have the ideas. And in a world where plot-making is free, ideas are everything.