Think Summer: Introduction — 2024
Submission
Students need to submit the following file by 10:00PM EST through Gradescope inside Brightspace.
-
A Jupyter notebook (a
.ipynb
file).
We’ve provided you with a template notebook for you to use. Please carefully read this section to get started.
When you are finished with the project, please make sure to run every cell in the notebook prior to submitting. To do this click |
Project
Motivation: SQL is an incredibly powerful tool that allows you to process and filter massive amounts of data — amounts of data where tools like spreadsheets start to fail. You can perform SQL queries directly within the R environment, and doing so allows you to quickly perform ad-hoc analyses.
Context: This project is specially designed for Purdue University’s Think Summer program, and is coordinated by The Data Mine.
Scope: SQL, SQL in R
Dataset
The following questions will use the imdb
database found in Anvil, our computing cluster.
This database has 6 tables, namely:
akas
, crew
, episodes
, people
, ratings
, and titles
.
You have a variety of options to connect with, and run queries on our database:
-
Run SQL queries directly within a Jupyter Lab cell.
First, you must launch a new Jupyter Lab instance. To do so, please follow the instructions below.
-
Open a browser and navigate to ondemand.anvil.rcac.purdue.edu, and login using your ACCESS credentials.
-
Click on "My Interactive Sessions".
-
Click on Jupyter Notebook in the left-hand menu under "The Data Mine" section (near the bottom of the screen). Select the following settings:
-
Allocation: cis220051
-
Queue: shared
-
Time in Hours: 3
-
Cores: 1
-
Use Jupyter Lab instead of Jupyter Notebook: Checked
-
-
When satisfied, click Launch, and wait for a minute. In a few moments, you should get a note indicating that your session is ready to run.
-
When you are ready, click Connect to Jupyter. A new browser tab will launch.
-
Under the "Notebook" menu, please select the seminar (look for the big "S"; we do not want seminar-r).
You now have a running Jupyter Lab notebook ready for you to use. This Jupyter Lab instance is running on the Anvil cluster. By using OnDemand, you’ve essentially carved out a small portion of the compute power to use. Congratulations!
To run queries directly in a Jupyter Lab cell, please do the following.
-
In the first cell, run the following code. This code establishes a connection to the
imdb.db
database, which allows you to directly run SQL queries in a cell as long as that cell has%%sql
at the top of the cell.%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db
-
After running that cell (for example, using Ctrl+Enter), you can directly run future queries in each cell by starting the cell with
%%sql
in the first line. For example.%%sql SELECT * FROM titles LIMIT 5;