5 Python Libraries Every Data Science Student Should Know in 2023
I've spent the last semester as a TA for CS3000, Intro to Data Science, at Northeastern. Over 200 students. A lot of them come in knowing pandas and matplotlib because that's what the tutorials teach. And look, those tools still work. But the landscape in 2023 looks genuinely different from what it looked like even two years ago.
Here are five libraries I wish I could add to the syllabus. Not just because they're new, but because each one reflects a shift in how data science is actually practiced right now.
1. Polars
Pandas has been the default DataFrame library for a decade. It's also single-threaded, memory-hungry, and full of API quirks that trip up beginners constantly. Polars is a DataFrame library written in Rust that runs operations in parallel by default.
import polars as pl
df = pl.read_csv("big_dataset.csv")
result = df.filter(pl.col("score") > 90).group_by("category").agg(pl.col("score").mean())
The syntax is cleaner than pandas, and on large datasets it's often 5-10x faster without any optimization. I've had students hit memory errors on pandas with datasets that Polars handles without complaint. The bigger point: the "just use pandas" era is ending. Polars, DuckDB, and other tools are proving that you don't have to sacrifice ergonomics for performance.
2. Plotly
Matplotlib served us well. But in 2023, when every stakeholder expects interactive dashboards and web-ready visualizations, static plots feel limiting. Plotly generates interactive, browser-native charts with surprisingly little code.
import plotly.express as px
fig = px.scatter(df, x="experience", y="salary", color="department", hover_data=["name"])
fig.show()
You get zoom, hover tooltips, and export options out of the box. For students who are going to present their work to non-technical audiences, this matters a lot more than getting the perfect matplotlib colormap.
3. Pydantic
This one surprises people when I mention it in a data science context. Pydantic is a data validation library, and in 2023, every data science project eventually becomes an API. Whether you're building a model endpoint with FastAPI or ingesting data from external sources, you need to validate what's coming in.
from pydantic import BaseModel
class PredictionRequest(BaseModel):
    age: int
    income: float
    credit_score: int
request = PredictionRequest(age=25, income=75000.0, credit_score=720)
It catches bad data before it reaches your model. I've seen too many production bugs that boiled down to "someone passed a string where we expected a float." Pydantic makes that class of error almost impossible.
4. HuggingFace Transformers
This one is obvious, but it's worth stating clearly: if you're a data science student in 2023 and you haven't used HuggingFace, you're behind. The Transformers library gives you access to thousands of pre-trained models with a consistent API.
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("This library changed how I think about NLP.")
Three lines to run inference with a state-of-the-art model. The real value isn't just convenience. It's that HuggingFace has become the standard hub for sharing models, datasets, and benchmarks. Understanding this ecosystem is as important as understanding the models themselves.
5. LangChain
The newest library on this list, and the most divisive. LangChain provides abstractions for building applications on top of LLMs. Chains, agents, retrieval systems, memory. It's the framework people reach for when they want to build something with GPT-4 or Claude.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template("Explain {topic} to a college freshman.")
chain = prompt | ChatOpenAI()
response = chain.invoke({"topic": "backpropagation"})
Is it over-abstracted in places? Yes. Does the API change constantly? Also yes. But the patterns it introduces (chaining LLM calls, retrieval-augmented generation, tool-using agents) are becoming standard across the industry. Learning LangChain in 2023 is less about the specific library and more about understanding the architecture of LLM applications.
The Bigger Picture
What strikes me about this list is how much it reflects the broader shifts in the field. Performance-first data tools. Interactive visualization. API-native workflows. Pre-trained models as building blocks. LLM orchestration as its own discipline.
The sacred cows of the Python data stack (pandas, matplotlib, scikit-learn) aren't going away. But they're no longer the whole story. The students who will thrive are the ones who treat the ecosystem as a living thing, not a fixed curriculum. That's something I try to communicate in every office hour, even when the syllabus doesn't quite keep up.