· 5 min read

5 Python Libraries Every Data Science Student Should Know in 2023

python · data-science · resources

I've spent the last semester as a TA for CS3000, Intro to Data Science, at Northeastern. Over 200 students. A lot of them come in knowing pandas and matplotlib because that's what the tutorials teach. And look, those tools still work. But the landscape in 2023 looks genuinely different from what it looked like even two years ago.

Here are five libraries I wish I could add to the syllabus. Not just because they're new, but because each one reflects a shift in how data science is actually practiced right now.

1. Polars

Pandas has been the default DataFrame library for a decade. It's also single-threaded, memory-hungry, and full of API quirks that trip up beginners constantly. Polars is a DataFrame library written in Rust that runs operations in parallel by default.

```python
import polars as pl

df = pl.read_csv("big_dataset.csv")
result = (
    df.filter(pl.col("score") > 90)
    .group_by("category")
    .agg(pl.col("score").mean())
)
```

The syntax is cleaner than pandas, and on large datasets it's often 5-10x faster without any optimization. I've had students hit memory errors on pandas with datasets that Polars handles without complaint. The bigger point: the "just use pandas" era is ending. Polars, DuckDB, and other tools are proving that you don't have to sacrifice ergonomics for performance.

2. Plotly

Matplotlib served us well. But in 2023, when every stakeholder expects interactive dashboards and web-ready visualizations, static plots feel limiting. Plotly generates interactive, browser-native charts with surprisingly little code.

```python
import plotly.express as px

fig = px.scatter(df, x="experience", y="salary", color="department", hover_data=["name"])
fig.show()
```

You get zoom, hover tooltips, and export options out of the box. For students who are going to present their work to non-technical audiences, this matters a lot more than getting the perfect matplotlib colormap.

3. Pydantic

This one surprises people when I mention it in a data science context. Pydantic is a data validation library, and in 2023, every data science project eventually becomes an API. Whether you're building a model endpoint with FastAPI or ingesting data from external sources, you need to validate what's coming in.

```python
from pydantic import BaseModel

class PredictionRequest(BaseModel):
    age: int
    income: float
    credit_score: int

request = PredictionRequest(age=25, income=75000.0, credit_score=720)
```

It catches bad data before it reaches your model. I've seen too many production bugs that boiled down to "someone passed a string where we expected a float." Pydantic makes that class of error almost impossible.
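To make that failure mode concrete, here's a minimal sketch (the field values are invented) of what happens when bad data hits the model class:

```python
from pydantic import BaseModel, ValidationError

class PredictionRequest(BaseModel):
    age: int
    income: float
    credit_score: int

# Valid input constructs normally
ok = PredictionRequest(age=25, income=75000.0, credit_score=720)

# A non-numeric string where a float is expected is rejected up front,
# before it can ever reach your model
try:
    PredictionRequest(age=25, income="not a number", credit_score=720)
    raised = False
except ValidationError:
    raised = True
```

The `ValidationError` tells you exactly which field failed and why, which is far easier to debug than a NaN surfacing three layers deep in a pipeline.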

4. Hugging Face Transformers

This one is obvious, but it's worth stating clearly: if you're a data science student in 2023 and you haven't used Hugging Face, you're behind. The Transformers library gives you access to thousands of pre-trained models with a consistent API.

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("This library changed how I think about NLP.")
```

Three lines to run inference with a state-of-the-art model. The real value isn't just convenience. It's that Hugging Face has become the standard hub for sharing models, datasets, and benchmarks. Understanding this ecosystem is as important as understanding the models themselves.

5. LangChain

The newest library on this list, and the most divisive. LangChain provides abstractions for building applications on top of LLMs. Chains, agents, retrieval systems, memory. It's the framework people reach for when they want to build something with GPT-4 or Claude.

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("Explain {topic} to a college freshman.")
chain = prompt | ChatOpenAI()
response = chain.invoke({"topic": "backpropagation"})
```

Is it over-abstracted in places? Yes. Does the API change constantly? Also yes. But the patterns it introduces (chaining LLM calls, retrieval-augmented generation, tool-using agents) are becoming standard across the industry. Learning LangChain in 2023 is less about the specific library and more about understanding the architecture of LLM applications.

The Bigger Picture

What strikes me about this list is how much it reflects the broader shifts in the field. Performance-first data tools. Interactive visualization. API-native workflows. Pre-trained models as building blocks. LLM orchestration as its own discipline.

The sacred cows of the Python data stack (pandas, matplotlib, scikit-learn) aren't going away. But they're no longer the whole story. The students who will thrive are the ones who treat the ecosystem as a living thing, not a fixed curriculum. That's something I try to communicate in every office hour, even when the syllabus doesn't quite keep up.