[Tutorial Kit] Using RStudio to Analyze and Visualize Air Quality
- Shamini V De Silva
Video Chapters
Recorded live Feb 24 2026, timestamps below
00:00 Intro
01:06 Bar Chart Overview
02:47 PM2.5: What is it?
03:55 Step-by-Step
04:00 Step 1. Data Files
06:19 Step 2. Posit Cloud
07:05 Step 3. Install Packages
09:39 Step 4. R code
16:38 Step 5. Bar Chart
🎯 Data Challenge
Create a bar chart in RStudio on air pollution for your state that compares the three counties with the highest air pollution levels to the primary (health-based) annual standard set by the U.S. Environmental Protection Agency (EPA).

Steps
Collect data on fine particulate matter (PM2.5), identify 3 counties in your state with the highest pollution concentrations, calculate how many people in your state are exposed to high levels of PM2.5, and create a bar chart summarizing results.
3 Learning Objectives
Query & Collect 📊 data on county-level air pollution in your state.
Analyze data using the data tool: 🧰 🛠️ RStudio Online (Posit Cloud).
Visualize & Humanize data by creating a bar chart exploring how local air pollution concentrations compare to the EPA primary (health-based) standards.
Prerequisites. Beginner-friendly, some knowledge required:
R packages and functions
assigning a variable in R
Keywords and core concepts also covered:
🧰 🛠️ R programming basics: Installing packages, assigning a variable, using functions, and creating a bar chart
How do we measure air pollution and why does it matter to health?
Defining the EPA primary (health-based) standards for air pollution.
What you'll need to complete this data challenge
⏰ Time: 30-45 minutes
🧰 Tools:
An account in Posit Cloud (a tool for running a cloud version of RStudio - no R software installation needed)
📊 Data:
Indicator: PM2.5: Highest Annual Average Concentration (Monitor + Modeled Data), 2020, by County.
Data Source: National Environmental Public Health Tracking Network, Data Explorer tool. https://ephtracking.cdc.gov/DataExplorer/
📁 Files
Please download the zipped folder 'tutorial-kit-R-air-pollution' containing the files and data mentioned above.
air_pollution_and_health.R - sample R Code
pm_2020.csv - PM2.5 data in a CSV (.csv) file
population_data_2020.csv - Number of people living in a region
fips_counties_2021.txt - County FIPS code geographic identifiers and county names, a Text (.txt) file
DOWNLOAD ZIP FILE
click the folder below to download all files
Key Terms and Definitions
PM2.5 (or PM₂.₅) is fine particulate matter that is 2.5 micrometers (i.e. microns, µm) in diameter or smaller.
PM2.5 concentration (µg/m³) is measured as particle weight (micrograms, µg) for every cubic meter of air (m³).
The primary (health-based) standard for PM2.5: The Environmental Protection Agency's (EPA's) annual National Ambient Air Quality Standard (NAAQS) for fine particulate matter (PM2.5) is 9.0 µg/m³. Annual average concentrations above 9.0 µg/m³ are considered harmful to health (EPA, 2025).
Step-by-Step Walkthrough
Overview
The steps below will guide you to analyze air pollution data from the National Environmental Public Health Tracking Network and create a bar chart showing county-level PM2.5 levels. You’ll also calculate key metrics, such as the proportion of people in a state exposed to pollution high enough to harm health, and include this information in a dynamic caption below the bar chart.
🔍 Data Hunt: Finding Data
Air Pollution Data
How do you find PM2.5 data from the National Environmental Public Health Tracking Network?
In the Query Panel, select:
STEP 1: CONTENT
Content Area: Air Quality
Indicator: Current and Historical Air Quality
Measure: PM2.5 (highest annual average concentration, monitored + modeled)
STEP 2: GEOGRAPHY TYPE: National by county
STEP 3: GEOGRAPHY: All Counties
STEP 4: TIME: 2020
STEP 5: ADVANCED OPTIONS: No Advanced Options
Click Button: GO
Navigate to table view (button in upper right corner) and EXPORT data


Population Data
The U.S. Census Bureau publishes many sources of population data. We are going to use the indicators shared in the National Environmental Public Health Tracking Network Data Explorer tool.
While still in the Explorer tool, open the Query Panel:
click: SELECT DATA (button upper left corner)
In Query Panel:
STEP 1: CONTENT
Content Area: Demographic & Socioeconomics
Indicator: Demographics
Measure: Number of People by Demographic Group
STEP 2: GEOGRAPHY TYPE: National by county
STEP 3: GEOGRAPHY: All Counties
STEP 4: TIME: 2020
STEP 5: ADVANCED OPTIONS: No advanced options selected
Click Button: GO

🧰 🛠️ Data Tool
Once you have the data, it is time to analyze it using the data tool, RStudio.
Step 1: 📁 Data Files
Get the data files by downloading the zipped folder 'tutorial-kit-R-air-pollution' from the Files section above.
Step 2: ☁️ Posit Cloud
Set up Posit Cloud and create a new RStudio Project
Instead of installing R and RStudio locally, this tutorial uses Posit Cloud, a browser-based environment that runs RStudio online. We are using the Posit Cloud service to ensure that everyone is working in the same R environment.
The template code has been designed and tested specifically in the Posit Cloud environment (using R version 4.5.3 and tidyverse version 2.0.0). If you choose to run the code in a local RStudio setup, some parts may not work as expected and may require modification.
To get started:
Go to posit.cloud
Click 'Sign Up' in the top-right corner
Choose the free version for this project → Click 'Learn more'
Then click ‘Sign Up’ and create an account using email or services like Google or GitHub.
Once your workspace loads, click 'New Project' on the right-hand side of the screen.
Select 'New RStudio Project' to start a new project. This will also open the RStudio interface within your browser.
Step 3: 📦 Install Packages
Open the R Script, install, and load the required packages and data files
In the Files pane, click the .R script file ('air_pollution_and_health.R') to open it. The script contains the code used to generate the bar graph and opens in the Script Editor.
The script also includes a header at the beginning that explains:
the purpose of the code,
how to use the code,
how we accessed the data, and
copyright and attribution information (please provide appropriate credit if you plan to adapt and share the code).
Highlight the lines of code below the header that install the tidyverse and scales packages, then run them (press Ctrl + Enter or click ‘Run’ in the top-right corner of the Script pane). You only need to do this once.
The tidyverse package is a collection of R packages for data cleaning (or wrangling), analysis, and visualization. The scales package contains scaling functions that can make plots easier to read, for instance, by turning raw numbers into readable formats (6,500,000 → 6.5 million). The installation process will run in the Console pane and may take a few minutes. Once complete, run the next two lines containing the library( ) function to load these packages into the R environment.
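As a rough sketch of what this step looks like in code (the sample script may word it differently), the lines below install the two packages only if they are missing, load them, and try two scales helpers on illustrative numbers. The CRAN mirror URL is an assumption added so the example also runs outside Posit Cloud.

```r
# Install once; skip the download if the package is already available.
for (pkg in c("tidyverse", "scales")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
}

# Load the packages into the current R session (needed in every new session).
library(tidyverse)
library(scales)

# Illustrative values only: turn raw numbers into readable labels.
pop_label <- number(6500000, scale = 1e-6, accuracy = 0.1, suffix = " million")
pct_label <- percent(0.879, accuracy = 0.1)

print(pop_label)  # "6.5 million"
print(pct_label)  # "87.9%"
```

Installing is a one-time step per project, while the library( ) calls need to be rerun each time a new R session starts.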
To learn more about some of the basics of R programming, please watch the step-by-step walkthrough in the video above (timestamp - 03:55 Step-by-Step)
About the Interface
The RStudio interface contains several panes:
Script Editor (upper left): where you write and edit code
Console (bottom left): where commands are executed, and outputs/errors are displayed
Environment (upper right): shows variables and datasets loaded in the session
Files/Plots/Packages (bottom right): used to upload files, view plots, and access documentation

Because Posit Cloud runs online, any files you want to use must be uploaded manually.
Upload Files
After downloading the zipped folder containing the necessary files and data, upload the files to Posit Cloud.
To upload:
Navigate to the 'Files' tab in the bottom right pane.
Locate the ‘Upload’ button → Click on ‘Browse’ → Select and upload the four files in the folder (.R file, .csv files, and .txt file). Leave the target directory at the default location “/cloud/project/”.
After uploading, you should see the files listed in the Files pane.
If needed, please watch the video above for a demonstration of these steps (timestamp - 03:55 Step-by-Step).
Step 4: 👩🏽‍💻 R Code
Edit and run the code to calculate key metrics
Specify your state and check file name variables
Locate and update the section where you define:
your state of interest and assign to the variable your_state (e.g., "Washington" or "Louisiana")
file names for your datasets - update these if needed so they match the uploaded files in the Files pane (see screenshot below)
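A minimal sketch of what this setup section might look like. The variable your_state comes from the tutorial; the three file-name variables and the not_found helper are hypothetical names used here for illustration, so match them to whatever the script actually uses.

```r
# State to analyze: spell it exactly as it appears in the data.
your_state <- "Washington"

# Hypothetical variable names; the file names are the ones in the ZIP folder.
pm_file   <- "pm_2020.csv"
pop_file  <- "population_data_2020.csv"
fips_file <- "fips_counties_2021.txt"

# Report any file that has not been uploaded to the project folder yet.
needed    <- c(pm_file, pop_file, fips_file)
not_found <- needed[!file.exists(needed)]

if (length(not_found) > 0) {
  message("Not found in the working directory: ",
          paste(not_found, collapse = ", "))
} else {
  message("All data files found.")
}
```

A check like this catches a misspelled or missing file before the import code runs, which tends to give a clearer message than the import error itself.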

Run code
Start by running the code that imports:
PM2.5 data (county-level pollution) - pm_2020.csv
Population data - population_data_2020.csv
These will appear as data frames (‘pm_2020.csv’ and ‘population_data_2020.csv’) in the Environment pane.
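Importing a CSV usually comes down to a single read_csv( ) call. The sketch below is self-contained: it writes a tiny made-up file to a temporary location and reads it back, so the column names (CountyFIPS, State, Value) and the numbers are assumptions, not the real export's layout. The point it illustrates is reading FIPS codes as text so that leading zeros survive.

```r
library(readr)

# Made-up rows standing in for an exported file (not real measurements).
demo_path <- tempfile(fileext = ".csv")
write_lines(
  c("CountyFIPS,State,Value",
    "01001,Alabama,8.1",
    "53005,Washington,10.2"),
  demo_path
)

# Read FIPS codes as character; read as numbers, "01001" would become 1001.
pm_demo <- read_csv(
  demo_path,
  col_types = cols(
    CountyFIPS = col_character(),
    State      = col_character(),
    Value      = col_double()
  )
)

print(pm_demo)
```

In the tutorial script, the equivalent call would point at the uploaded CSV files instead of the temporary file.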
Run the next section of code to calculate the percentage of the population in the specified state exposed to PM2.5 levels above the EPA standard (9.0 µg/m³). The result is stored in a data frame assigned to ‘percent_pop_high_pollution’.
Then run the following section to find the national percentage for comparison. This will be summarized in a data frame assigned to ‘percent_pop_high_pollution_US’.
These metrics will later be highlighted in the bar chart caption.
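The calculation behind these metrics can be sketched as follows. This is one possible approach and not necessarily how the sample script does it; the column names (fips, state, pm25, population) and every value are made up for illustration.

```r
library(dplyr)

epa_standard <- 9.0  # annual PM2.5 standard, micrograms per cubic meter

# Made-up county values for illustration only (not real data).
pm <- tibble(
  fips  = c("00001", "00002", "00003", "00004"),
  state = c("State A", "State A", "State A", "State B"),
  pm25  = c(10.5, 8.2, 9.4, 12.0)
)
pop <- tibble(
  fips       = c("00001", "00002", "00003", "00004"),
  population = c(500000, 300000, 200000, 1000000)
)

your_state <- "State A"

percent_pop_high_pollution <- pm |>
  filter(state == your_state) |>      # keep only the chosen state
  inner_join(pop, by = "fips") |>     # attach population to each county
  summarise(
    pop_exposed = sum(population[pm25 > epa_standard]),
    pop_total   = sum(population),
    percent     = 100 * pop_exposed / pop_total
  )

print(percent_pop_high_pollution)  # percent is 70 for these made-up numbers
```

Here two of the three "State A" counties exceed 9.0, so 700,000 of 1,000,000 residents (70%) count as exposed. Dropping the filter( ) line applies the same calculation to every county, which is one way a national figure could be produced.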
Step 5: 📊 Bar Chart
Run the code to create the bar graph
Start by preparing the data for visualization. This includes merging datasets using FIPS codes contained in the ‘fips_counties_2021.txt’ file, adding the county names, and filtering for counties with the highest pollution levels in your selected state.
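The preparation described above might look roughly like this. It is a sketch under stated assumptions: the column names and county names are invented, and the real files will need their own column names in the join.

```r
library(dplyr)

# Made-up values for illustration only (not real data).
pm <- tibble(
  fips  = c("00001", "00002", "00003", "00004", "00005"),
  state = c("State A", "State A", "State A", "State A", "State B"),
  pm25  = c(10.5, 8.2, 9.4, 11.1, 12.0)
)
fips_lookup <- tibble(
  fips   = c("00001", "00002", "00003", "00004", "00005"),
  county = c("Alpha County", "Beta County", "Gamma County",
             "Delta County", "Epsilon County")
)

your_state <- "State A"

top_counties <- pm |>
  filter(state == your_state) |>           # counties in the chosen state
  left_join(fips_lookup, by = "fips") |>   # attach county names by FIPS code
  slice_max(pm25, n = 3)                   # keep the 3 highest PM2.5 values

print(top_counties)
```

slice_max( ) returns rows sorted from highest to lowest and keeps ties by default, so it can return more than three rows when counties share the same value; setting with_ties = FALSE caps the result at exactly three.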
The subsequent lines of code will build the visualization using ggplot2, which works in layers:
Base layer initializes the plot
Additional layers add:
Bars (PM2.5 levels by county)
Labels and formatting
A vertical line marking the EPA standard (9.0 µg/m³)
A dynamic caption with your calculated metrics
Run all plotting code to generate the final chart, which you can preview in the bottom right pane under the ‘Plots’ tab.
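To make the layering concrete, here is a minimal sketch of such a chart. It assumes horizontal bars (so the line at the standard is vertical) and uses made-up county names, values, colors, and wording; the sample script's styling and caption will differ.

```r
library(ggplot2)

epa_standard <- 9.0

# Made-up values for illustration only (not real data).
top_counties <- data.frame(
  county = c("Delta County", "Alpha County", "Gamma County"),
  pm25   = c(11.1, 10.5, 9.4)
)
percent_exposed <- 70  # placeholder for the metric calculated earlier

# Dynamic caption: the text is assembled from the calculated values.
caption_text <- paste0(
  "About ", percent_exposed, "% of residents live in counties above the ",
  "EPA annual standard of ", format(epa_standard, nsmall = 1), " µg/m³."
)

pm_plot <- ggplot(top_counties, aes(x = pm25, y = reorder(county, pm25))) +  # base layer
  geom_col(fill = "steelblue") +                                             # bars
  geom_vline(xintercept = epa_standard,
             linetype = "dashed", color = "red") +                           # EPA standard
  labs(
    title   = "Counties with the highest PM2.5 levels",
    x       = "PM2.5 annual average (µg/m³)",
    y       = NULL,
    caption = caption_text
  ) +
  theme_minimal()

print(pm_plot)
```

Because the caption is built with paste0( ), changing the state or rerunning the metrics updates the text without editing the plot code.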
Save and Export the Graph
Run the ggsave( ) function to save the chart as a .png file. This should appear under the ‘Files’ pane.
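A small, self-contained sketch of the save step. The file name, dimensions, and stand-in plot are assumptions for illustration, and the file is written to a temporary folder so the example leaves nothing behind.

```r
library(ggplot2)

# A stand-in plot with made-up values so this example runs on its own.
demo_plot <- ggplot(
  data.frame(county = c("A", "B", "C"), pm25 = c(11.1, 10.5, 9.4)),
  aes(x = pm25, y = county)
) +
  geom_col()

# Hypothetical file name, saved to a temporary folder.
out_file <- file.path(tempdir(), "pm25_bar_chart.png")
ggsave(out_file, plot = demo_plot, width = 8, height = 5, dpi = 300)

print(file.exists(out_file))
```

Passing only a file name, such as "pm25_bar_chart.png", saves into the current working directory, which in a Posit Cloud project is typically the folder shown in the Files pane.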

Click on the .png file in the Files pane to open the image in your browser → right-click on the image to save to your computer.
Final Output
Your completed visualization will show:
Three counties with the highest PM2.5 levels
The EPA primary (health-based) threshold
A caption summarizing population exposure and health context
Example
For example, in Washington in 2020 (please see the example graph below):
Benton County, Skamania County, and Grant County had the highest PM2.5 levels, all above the EPA threshold of 9.0 µg/m³
About 7 million people (87.9%) in Washington lived in areas where PM2.5 exceeded the EPA standard, well above the U.S. figure of 38%

You've Earned a Certificate!
BroadStreet Certificate (FREE)
CPH - Certified in Public Health Recertification Credits (1 credit hr) ($10): Submission in progress to National Board of Public Health Examiners (NBPHE)
Note: We review projects every 2-4 weeks, and typically at the end of the month.
Instructors
Teresa Tse, MS
Public Health Data Analyst
Teresa Tse uses R every week to support the data and epidemiology teams of a metropolitan public health department. With a background in biomedical engineering, Teresa has a passion for using programming, research, and data analysis skills to help improve health outcomes. Teresa is a long-time contributor to BroadStreet Institute as a training program manager on the Maternal and Infant Health Track.
Shamini V De Silva
Program Planner, BroadStreet Institute
Shamini is a BroadStreet Program Planner and aspiring researcher learning RStudio. With a background in Biomedical Science and experience working in Clinical Research, Shamini has realized the potential and impact of high-quality data and the growing demand for data handling and analysis skills. As Shamini learns R, she is sharing that learning with others.

