
SQL Tutorial Series Vol. 3 — Application

Time to Practice!

Bruce A. Lee

--


Hello there! Welcome to part three of my SQL Tutorial Series, where I break down the fundamentals for those just learning the language.

Part 1, Part 2, Part 4

In this section, we’re going to take a look at how we can actually practice SQL, along with some tools that can help you navigate the process (while everything here has been vetted for use, please open links at your own risk).

With that said, first things first.

SQL Tutorials

I wanted to take this opportunity to say that the approaches listed here should be paired with some online tutorials to help bring the fundamentals into focus when learning SQL.

Going the route I’m detailing here is good for those who want more of a free-flowing, unstructured, hands-on experience, where they can just go for it and type out queries that help concepts stick.

However, not everyone thrives by learning this way, so here are some more structured resources that I recommend to help.

For hands-on practice:

sqlbolt.com

datalemur.com

sql-practice.com

sqlzoo.net

For conceptual information:

postgresqltutorial.com

dataengineering.wiki

w3schools.com

reddit.com

Now, onwards to the tutorial!

Installing an IDE

An IDE is an Integrated Development Environment. It’s software whose main purpose is to make coding easier by streamlining the tools and workflows needed to code efficiently.

Since we’re working with SQL, we have many IDEs to choose from across the various dialects of SQL that exist. Here, we’ll use PostgreSQL to do the legwork. It also helps that many businesses use PostgreSQL in production, and that dialects like Snowflake’s (a tool very popular with businesses that work with big data) are based on it. It’s also open source, so it offers a lot of portability for business owners and teams alike.

Since we’re using PostgreSQL for the first time, you’ll need to install it on your computer to proceed. To do so, go to the official PostgreSQL site here to download and install it.

Since PostgreSQL is open source, there are a plethora of options for installation on any kind of operating system you use. For the purposes of this tutorial, I’m using Windows.

This link lists a few good PostgreSQL IDEs that are free and open source. I personally use Beekeeper Studio Community Edition and will screenshot images from it for documentation, but feel free to explore the other options. If you want to use Beekeeper CE, click the link in the GitHub repo that says “releases page” and grab the version that’s appropriate for your OS.

Once you’ve installed PostgreSQL and an IDE, we’re going to create a local server hosted on your own computer through pgAdmin 4, which you’ll then use to load the database-creation queries that build the content you’ll be practicing with. pgAdmin 4 comes bundled with your PostgreSQL 15 installation.

pgAdmin 4

When initializing it for the first time, the software will prompt you for a password that you’ll need to access the server every time afterward. Make sure it’s documented somewhere safe while also being easy to remember, as it’s not so easy to change!
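Worth noting: that pgAdmin password is separate from the password for the postgres database superuser, which the PostgreSQL installer asks you to set. If you ever need to rotate the superuser’s password (and can still connect as a superuser), it’s a one-liner in any query editor; the name postgres below is simply the default superuser on a fresh install:

-- Changes the login password for the postgres database role
ALTER USER postgres WITH PASSWORD 'your_new_password';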

Making the Databases

We’re making a couple of new databases

As you can see in the screenshot, I have three example databases made here already that I’ll link to in a bit. It’s important to note the Create -> Database… option: right-clicking “Databases” and choosing it will let us proceed.

Where the magic happens!

First, we’ll need to name the database we’re importing in the “General” tab; the other tabs aren’t important for us at this time, and I recommend keeping their defaults.
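As an aside, the GUI here is just a convenience. If you’d rather do it in SQL, creating an empty database is one statement you can run from pgAdmin’s Query Tool (the name hr is just an example):

-- Creates an empty database named hr with default settings
CREATE DATABASE hr;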

Since I’m assuming you, the reader, are doing this for the first time, you likely don’t know how to write the database-creation code that’s needed here. This is where we’ll use a couple of the example databases you saw above.

HR Sample Database: https://www.sqltutorial.org/sql-sample-database/

Northwind Sample Database: https://github.com/pthom/northwind_psql/blob/master/northwind.sql

On the HR Sample Database page, go to where it says “PostgreSQL,” and after making the database, follow the screenshot below:

This is where you’ll make the database and tables.

On the page where the HR database lives, click both of the links under the PostgreSQL header that say “Create HR Sample Database in PostgreSQL” and “Load HR Data in PostgreSQL.” Then copy and paste the contents of those pages into the query editor as shown above (right-click the new database you made and go to “CREATE Script”), and click the play button near the top-center to run everything once the queries are pasted in.

Afterwards, under “Tables” on the left beneath your database, you should see all the new tables the queries generated. That’s where the data we’re going to pull comes from when we use the IDE.
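If you want to sanity-check that everything ran, a couple of quick queries will confirm it (the table names here assume the HR scripts completed successfully):

-- List every table the HR scripts created in the default schema
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public';

-- Peek at a few rows of one of them
SELECT * FROM employees LIMIT 5;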

The same process also works for the Northwind DB, but there I recommend copying and pasting the contents of the .sql file itself from their GitHub into the same kind of query editor on a new database, then running it in the same manner as above.
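If you’re comfortable in a terminal, another route is psql, the command-line client bundled with PostgreSQL, which can run the downloaded .sql file directly; the database name and file path below are just examples:

# Shell commands: create an empty database, then run the script into it
createdb -U postgres northwind
psql -U postgres -d northwind -f northwind.sql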

Connecting…

pgAdmin 4 saves everything once you run the queries that make the data content, so nothing else needs to be done here. The only thing that’s left is to connect to your new databases via your IDE to get started!

I’m not going to limit you to using Beekeeper CE for this, so I’ll leave you to refer to your IDE’s documentation for instructions on how to connect to your local database. However, just in case you are using it, see below:

Beekeeper CE

When you use “New Connection” on the right after first starting up the platform, you’ll see a dialog box asking for the credentials you gave your local server. The image above shows the Test Server I made with all of its login information. If you were a part of an organization that runs a SQL database in production, you’d use similar credentials to log into their database.

For my Test Server here, I just used all the defaults and made up a password in pgAdmin 4 when I created everything. Remember, you’re the only one using this local database, so the password doesn’t need to be too complex.
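Whichever IDE you land on, you can always verify the credentials themselves from a terminal before blaming the IDE. This assumes the defaults from a local install (host localhost, port 5432, user postgres) and a database named hr:

# Connects to the local hr database; psql will prompt for your password
psql -h localhost -p 5432 -U postgres -d hr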

Now we’re cooking!

Your workstation

If you’re using Beekeeper, and have connected to your new database, we can get started making queries.

When using SQL in a job setting, or even for practice like we’re doing here, it’s important to understand what you’re looking for before you even type a letter.

Having a question in mind before looking for anything helps you contextualize what you want your query to do. That line of thought helps even in a data analysis project: you need your hypothesis in mind to bring into perspective the data you’ll have to wrangle to answer the question you’re asking of your data.

Jobs:

Here’s the schema for the HR database we’re using in my next example; it’s a great representation of the star schema that we’ll learn about later:

HR Database Schema

This image outlines how all the different tables relate to one another and the columns they contain. Below each table’s name, notice that columns like “employee_id” and “job_id” have an asterisk to their left.

These are index columns that you can use as reference points to join the tables to one another, as shown in the sketch below. Beyond that, there isn’t much normalization in the tables within this schema.
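As a quick illustration of what those shared id columns buy you, here’s a minimal two-table join through job_id, using column names straight from the schema above:

-- Pair each employee with their job title via the shared job_id column
SELECT em.first_name, em.last_name, jb.job_title
FROM employees AS em
JOIN jobs AS jb ON jb.job_id = em.job_id
LIMIT 5;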

Using the database I’ve brought up above — let’s say you work in Human Resources for your company and your Director comes up to you and asks:

Hey, I need a list of all of our current employees and their contact information.

This is likely the query you’d come up with to give them what they asked for:

SELECT
    CONCAT(first_name, ' ', last_name) AS "Name",
    email AS "Email Address",
    COALESCE(phone_number, 'N/A') AS "Phone Number",
    hire_date AS "Hiring Date",
    dep.department_name AS "Department",
    loc.city AS "City",
    country.country_name AS "Country",
    reg.region_name AS "Region"
FROM employees AS em
JOIN departments AS dep ON dep.department_id = em.department_id
JOIN locations AS loc ON loc.location_id = dep.location_id
JOIN countries AS country ON country.country_id = loc.country_id
JOIN regions AS reg ON reg.region_id = country.region_id;

To break this down a bit: we’re giving the Director each employee’s whole name by concatenating the two name columns, their email address, their phone number (while also telling SQL to give us an ‘N/A’ if there isn’t one), their hiring date, the department they work under, and lastly location data in the last three columns.

We sourced everything from the employees table and joined the other tables onto it using the “id” columns identified above.
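If CONCAT or COALESCE are new to you, both can be tried on literal values before you point them at real columns:

SELECT CONCAT('Jane', ' ', 'Doe');  -- returns 'Jane Doe'
SELECT COALESCE(NULL, 'N/A');       -- returns 'N/A', the first non-null argument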

Let’s look at another example. Say that same Director is looking to do some downsizing, so they come to you and ask, “Which employees make over $100,000 USD, in descending order?” You’d write a query like the one below to answer their question:

SELECT
    CONCAT(first_name, ' ', last_name) AS "Name",
    department_name AS "Department",
    job_title AS "Title",
    salary::int * 10 AS "Salary"
FROM employees AS em
JOIN departments AS dep ON dep.department_id = em.department_id
JOIN jobs ON em.job_id = jobs.job_id
GROUP BY first_name, last_name, department_name, job_title, salary
HAVING salary::int * 10 > 100000
ORDER BY "Salary" DESC;

Again, we use CONCAT to build our full-name column, then we list the department the employee works for, their job title, and finally their salary.

Note how I used ‘::’ to cast the column to an integer, then multiplied the salary by 10. I did that to bring the numbers in line with our current reality as far as wages go.

The part of this query that really makes everything work is the HAVING clause: it filters the grouped rows down to the salaried workers who make more than $100,000 a year.
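One nuance worth knowing: HAVING filters groups after GROUP BY runs, while WHERE filters rows before any grouping happens. Since this particular query uses no aggregate functions, the same result could also be written with WHERE and no GROUP BY at all; here’s that variant for comparison:

SELECT
    CONCAT(first_name, ' ', last_name) AS "Name",
    department_name AS "Department",
    job_title AS "Title",
    salary::int * 10 AS "Salary"
FROM employees AS em
JOIN departments AS dep ON dep.department_id = em.department_id
JOIN jobs ON em.job_id = jobs.job_id
WHERE salary::int * 10 > 100000
ORDER BY "Salary" DESC;

You’ll reach for HAVING once aggregates like SUM or AVG enter the picture, since those can only be filtered after the grouping happens.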

Northwind:

The schema for the Northwind DB

The following example will be a bit more advanced, but it should give some insight into what you can potentially do. The above is an excellent example of a snowflake schema, which I break down in further detail here.

If you notice, there’s much more specificity in how the tables and data are arranged compared to the last one.

Now for this trick I learned just recently:

Let’s say I work at “Northwind” and my manager comes up to me and asks:

“Make a unique ID number that we can attribute to each customer in our database, since our previous Data Analyst Lead had to leave the organization before he could put his query into production.”

This is what I would probably do, or at least a skeleton of what I would do:

WITH ids AS (
    SELECT
        CONCAT(cust.customer_id, cust.postal_code, cust.phone) AS id,
        contact_name AS name,
        contact_title AS job_title,
        city,
        country,
        ord.order_date AS order_date,
        ord.required_date,
        ord.shipped_date,
        ord.ship_via AS delivery_method,
        ship_city AS destination_city,
        ship_country AS destination_country
    FROM customers AS cust
    JOIN orders AS ord ON ord.customer_id = cust.customer_id
)
SELECT
    md5(id) AS customer_id,
    name,
    job_title,
    city,
    country,
    order_date,
    required_date,
    shipped_date,
    delivery_method,
    destination_city,
    destination_country
FROM ids;

The ‘md5’ call is an interesting one: it’s a string function built into PostgreSQL that can take a string-based column (in this case ‘id’) and calculate an MD5 128-bit checksum for it. The value is returned as a string of 32 hexadecimal digits, or NULL if the argument was NULL. (credit: w3schools.com)
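You can see that behavior on literal values; any non-null input comes back as a 32-character hex digest:

SELECT md5('hello');       -- a fixed 32-character hex string
SELECT md5(NULL::text);    -- NULL in, NULL out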

We could then work with our Data Engineer to add a column to the tables themselves that stores the newly generated string as an ID, using it as an index that the rest of the tables in the database can refer to.
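Here’s a hypothetical sketch of that handoff, assuming we’re allowed to alter the customers table (the column name hashed_id is made up for this example):

-- Add a column to hold the hash, then backfill it with the same expression the CTE used
ALTER TABLE customers ADD COLUMN hashed_id text;

UPDATE customers
SET hashed_id = md5(CONCAT(customer_id, postal_code, phone));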

The main logic of the query is wrapped in a CTE, and the md5 function is called after we’ve wrangled the rest of the data. Later in this tutorial series, I’ll break down what Common Table Expressions are, as well as their sister concepts, subqueries and temp tables.

To summarize, here’s what we learned:

  • Resources to practice SQL in a structured way
  • What an IDE is and how to choose one
  • Different dialects of SQL
  • pgAdmin 4 and how to set up and connect to your databases
  • A bit on the databases themselves and how they’re laid out
  • More advanced query logic

I wanted to impart that the process of learning new concepts can be better reinforced through doing.

Learning data, and by extension SQL, can be fun, but sourcing the right information can sometimes be overwhelming. I hope you were able to take something away from this.

In the next chapter, I’ll touch more on schemas in a primer on the two main ones you’ll find out in the wild: star and snowflake.

Thank you for reading!

Bruce
