Python Data Analysis

1 | Introduction

This repository provides the source code and documentation for a Python Data Analysis workshop based on versions of the following talks:

Talk | Description | Slides
Simplifying Data Analysis with GitHub Codespaces, Jupyter Notebooks & Open AI | PyData NYC (Nov 2023) | Slides
Simplifying Data Analysis with Developer Tools & AI (for non-Python devs) | Data Science Day (Mar 2024) | Slides

2 | Workshop Roadmap

The slide shows a high-level roadmap for what we want to cover in this workshop. The goal is to help someone who is familiar with development - but not with data science or Python - skill up quickly on topics related to data science, analysis and visualization using developer tools and AI assistance.

[Slide: 2024 Roadmap]

The actual workshop may reorder steps within sections, or combine exercises for efficiency, to align with the Week 2 schedule of #14DaysOfDataScience. Once the series is complete, click the "Day" entry to read the post and the "Details" entry to visit the exercise.

Day | Description | Details
1️⃣ | Set up a Consistent Development Environment with Dev Containers (GitHub Codespaces) + Jupyter Notebooks. Run sample notebooks. | Setup Dev Environment
2️⃣ | Create a Data Science driven Visual Studio Code Profile for productivity. Do the VS Code Data Science Tutorial, use Data Wrangler. | Setup Custom VS Code Profile
3️⃣ | Install the Copilot extension and use it to create Notebooks, explain code, or debug issues inline (or in chat mode). | Install & explore GitHub Copilot
4️⃣ | Learn about open datasets on Kaggle, Hugging Face, Azure, OpenML etc. Explore sites to learn from community & experiment. | Pick a dataset, do EDA, build Model
5️⃣ | Learn about Responsible AI principles & explore the components of the Responsible AI Toolbox for assessing these in practice. | Debug Models & Decision-Making
6️⃣ | Learn about Project LIDA and how you can use Large Language Models to summarize and visualize data, using natural language prompts. | Visualize Data using LLM & NLP
7️⃣ | Learn about the paradigm shift from MLOps to LLM Ops and how to use the Azure AI SDK (Python) to build, evaluate, & deploy AI apps. | Explore LLM Ops with Azure AI

3 | Pre-Requisites

  • Familiarity with GitHub and Visual Studio Code
  • Familiarity with at least one high-level programming language
  • Access to an OpenAI or Azure OpenAI key - optional, for Project LIDA
  • Access to a GitHub Copilot enabled account - optional, for AI-assisted learning
  • Knowledge of Python and Jupyter Notebooks is a plus
  • Having a specific data set or data analysis problem to work on is a plus

4 | Learning Objectives

By the end of this workshop, you will be able to:

  • Describe what the various tools do, and why they matter
  • Set up a reusable development environment - with GitHub Codespaces
  • Document your learning with code - with Jupyter Notebooks
  • Generate notebooks and explore unfamiliar code - with GitHub Copilot
  • Discover, analyze and visualize open datasets - from Kaggle, Hugging Face or Azure
  • Train & debug models for responsible AI practices - with Responsible AI Toolbox
  • Build intuition and explore data visualizations - with Project LIDA
  • Explore the paradigm shift to LLM Ops - with Azure AI Studio

5 | Developer Tools

Here are the main tools we will cover, with links to relevant documentation and a short description of why they are useful.

Tool | Description
Jupyter Notebooks | Interactive development environment for sharing code with documentation, for collaborative learning & debugging
GitHub Codespaces | Reusable and reproducible development environments based on development containers for configuration-as-code
Visual Studio Code | Visual Studio Code extensions for Data Science, with Data Science Profile Template
GitHub Copilot | AI-based assistance for developers - from code explainers to code creation and debugging
Open Datasets | Jumpstart exploration with curated public datasets for machine learning - from Kaggle, Hugging Face, and Azure
Responsible AI | Responsible AI Toolbox includes tools for identifying, diagnosing, and mitigating issues in data
Project LIDA | Automated generation of visualizations with LLM, with user customization and suggestion options
Azure AI Studio | Unified platform for building generative AI applications from model exploration to API deployment

6 | Python Libraries

Developer tools will streamline your learning journey, but to be productive you will also need to skill up on a few core Python libraries. Start with these libraries, in the recommended order. For each one, create a new notebook in your fork of this repo, then walk through the quickstart tutorial step-by-step to get a sense of what the library does and how it works.

Important | To start using these libraries, you will first need to install them in your development environment. Look for installation instructions that use pip, and add any missing libraries to the requirements.txt file. Then rebuild the container and you should be ready to go.
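The "add it to requirements.txt if absent" step can also be scripted. The following is only a minimal sketch: the helper names `requirement_name` and `add_missing_requirements` are hypothetical (they are not part of this repo or of pip), and it assumes a simple requirements.txt with one package per line.

```python
from pathlib import Path
import re


def requirement_name(line: str) -> str:
    """Return the normalized package name from one requirements.txt line."""
    # Drop inline comments and surrounding whitespace.
    line = line.split("#", 1)[0].strip()
    # Skip blank lines and pip options such as "-r other.txt".
    if not line or line.startswith("-"):
        return ""
    # The name ends at the first version specifier, extra, marker, or space.
    name = re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0]
    # Treat "-", "_" and "." as equivalent and ignore case.
    return re.sub(r"[-_.]+", "-", name).lower()


def add_missing_requirements(path, packages):
    """Append any package in `packages` that `path` does not already list.

    Hypothetical helper for illustration. Returns the packages it added.
    """
    req_file = Path(path)
    existing_lines = req_file.read_text().splitlines() if req_file.exists() else []
    present = {requirement_name(line) for line in existing_lines}
    added = []
    for pkg in packages:
        key = requirement_name(pkg)
        if key and key not in present:
            added.append(pkg)
            present.add(key)
    if added:
        # Rewrite the file with the original lines followed by the new ones.
        req_file.write_text("\n".join(existing_lines + added) + "\n")
    return added
```

For example, `add_missing_requirements("requirements.txt", ["numpy", "pandas"])` leaves existing entries (including pinned versions) untouched and appends only what is missing. Rebuild the container afterwards so the new entries get installed.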

Library | Description | Quickstart
NumPy | Numerical computing with Python | Quickstart Tutorial
Pandas | Data manipulation and analysis | Getting Started
Matplotlib | Data visualization (static + interactive) | Quickstart - Depth
Seaborn | Data visualization (matplotlib-based) | Quickstart - Depth
Scikit-learn | Machine learning in Python (Core) | Quickstart - Depth
TensorFlow | Machine learning in Python (E2E) | Quickstart - Depth
PyTorch | Machine learning in Python (CV, NLP) | Quickstart - Msft
Bokeh | Interactive visualizations in Python | Quickstart - Depth

This is not a comprehensive list - feel free to add more libraries as you skill up, based on your own goals.
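Before starting a quickstart, it can help to confirm which of the libraries in the table above are actually available in your environment. Here is a small sketch using only the standard library. The function name `installed_status` and the `LIBRARY_IMPORT_NAMES` mapping are illustrative assumptions, not part of this repo. Note that the import name can differ from the package name (scikit-learn is imported as `sklearn`, PyTorch as `torch`).

```python
from importlib.util import find_spec

# Display label -> top-level import name for each library in the table.
LIBRARY_IMPORT_NAMES = {
    "NumPy": "numpy",
    "Pandas": "pandas",
    "Matplotlib": "matplotlib",
    "Seaborn": "seaborn",
    "Scikit-learn": "sklearn",
    "TensorFlow": "tensorflow",
    "PyTorch": "torch",
    "Bokeh": "bokeh",
}


def installed_status(import_names):
    """Map each label to True if its top-level module can be found.

    find_spec only locates the module; it does not import it, so the
    check stays fast even for large libraries.
    """
    return {
        label: find_spec(module) is not None
        for label, module in import_names.items()
    }


if __name__ == "__main__":
    for label, ok in installed_status(LIBRARY_IMPORT_NAMES).items():
        print(f"{label:<14}{'installed' if ok else 'missing'}")
```

Anything reported as missing is a candidate for the requirements.txt file, followed by a container rebuild.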

7 | Learning Resources

These are resources I recommend for skilling up on these topics. Note that open access editions of published books are often made available by the authors, on their sites, for supporting learners - and not for commercial use. Please read and honor the author's usage requirements when exploring content or code.

Resource | Description
2024: Data Science Day Collection | My curated collection of Microsoft Resources for Data Science development
2024: Responsible AI for Developers | My curated collection of Microsoft Resources for Responsible AI usage
2024: Generative AI Code-First on Azure | My curated collection of Microsoft Resources for Generative AI with the Azure AI Platform
Python Data Science Handbook - Jake VanderPlas | O'Reilly book made available online with MIT license and tutorial notebooks on GitHub
Python for Data Analysis 3E - Wes McKinney | O'Reilly book made available online with MIT-licensed code examples on GitHub
Introduction to Machine Learning with Python - Andreas C. Müller & Sarah Guido | O'Reilly book covering scikit-learn usage with notebooks on GitHub. Book content is not online.

8 | Feedback Welcome

Have questions or comments, or found areas for improvement in the code or documentation? File an issue to let me know.