Google’s Data Analytics: Foundations Review week 4

David

Google’s Data Analytics: Foundations Review week 4

Week 4 The ins and outs of core data tools – are split into 4 sections

  • Mastering spreadsheet basics
  • Structured Query Language (SQL)
  • Data visualization
  • Weekly Challenge 4 – a test

The objective of this week’s materials is to describe spreadsheets, query languages, and data visualization tools. We are given a brief demonstration of the uses, basic features, and functions of a spreadsheet; shown basic concepts involved in the use of SQL, such as specific examples of queries; then finished with examples of basic data visualization concepts. Much of what is learned in this portion of the course seems, like a slow handhold into the introduction of new concepts.

Mastering spreadsheet basics

Unlike the title of this section, mastery of basics is not obtained, but an introduction to the usage of a spreadsheet is learned.  This section has you using either Google’s Sheets or Microsoft’s Excel.  There is not a lot of mental demand on the basics learned here, but could be intimidating to someone who has never learned or used either program before. This section has you working on a small assignment requiring you to insert data in a spreadsheet and then use the available visualization tools to make a graph or bar graph, then share on the discussion boards. 

Here are the graphs I shared in the discussion boards on dividends.

In addition to creating, your “first” spreadsheet, you are given three resources to expand your knowledge of spreadsheet and their uses.  It is expected that you review these resources as the end of the section is a practice quiz containing a few questions not covered in the course but found in the resources.

Resource gave in week four to learn how to use Google spreadsheets

https://support.google.com/a/users/answer/9282959?visit_id=637361702049227170-1815413770&rd=1

A cheat sheet for using google spreadsheets

https://support.google.com/a/users/answer/9300022

a website to watch Excel training videos

https://support.microsoft.com/en-us/office/excel-video-training-9bc05390-e94c-46af-a5b3-d7c22f6990bb

Structured Query Language (SQL)

SQL does a lot of the same things with data that spreadsheets can do. It can store, organize and analyze data, among other things, but is capable of handling more information that a spreadsheet. On a small project when dealing with data, Excel or Google sheets may be the perfect tools for handling the workload. But when using larger data sets around the size of a larger company like Walmart with a customer base well within the millions, SQL would be the tool to handle data analysis queries. The great this about SQL is no matter which database you use, SQL basically works for the same in each. SQL queries are essentially universal. 

So what is a query?

query is a request for data or information from a database. When you query databases, you use SQL to communicate your question or request. Every programming language, including SQL, follows a unique set of guidelines known as syntaxSyntax is the predetermined structure of a language that includes all required words, symbols, and punctuation, as well as their proper placement.

The syntax of every SQL query is the same: 

  • Use SELECT to choose the columns you want to return.
  • Use FROM to choose the tables where the columns you want are located.
  • Use WHERE to filter for certain information.

A SQL query is like filling in a template. You will find that if you are writing a SQL query from scratch, it is helpful to start a query by writing the SELECT, FROM, and WHERE keywords in the following format: 

                SELECT

                FROM

                WHERE

Next, enter the table name after the FROM; the table columns you want after the SELECT; and, finally, the conditions you want to place on your query after the WHERE. Make sure to add a new line and indent when adding these, as shown below:

                SELECT

                                Columns you want to look at

                FROM

                                Tables the data lives in

                WHERE

                                Certain condition is met

Following this method, each time makes it easier to write SQL queries. It can also help you make fewer syntax errors.

Example of a query

Here is how a simple query would appear in BigQuery, a data warehouse on the Google Cloud Platform.

The above query uses three commands to locate customers with the first name Tony:

  1. SELECT the column named first_name
  2. FROM a table named customer_name (in a dataset named customer_data) (The dataset name is always followed by a dot, and then the table name.)
  3. But only return the data WHERE the first_name is Tony

The results from the query might be similar to the following:

first_name
Tony
Tony
Tony

As you can conclude, this query had the correct syntax, but wasn’t very useful after the data was returned.

Multiple columns in a query

In real life, you will need to work with more data beyond customers named Tony. Multiple columns that are chosen by the same SELECT command can be indented and grouped together.

If you are requesting multiple data fields from a table, you need to include these columns in your SELECT command. Each column is separated by a comma as shown below:

                SELECT

                                ColumnsA,

                               ColumnsB,

                               ColumnsC

                FROM

                                Tables the data lives in

                WHERE

                                Certain condition is met

  1. SELECT the columns named customer_id, first_name, and last_name
  2. FROM a table named customer_name (in a dataset named customer_data) (The dataset name is always followed by a dot, and then the table name.)
  3. But only return the data WHERE the first_name is Tony

Here is an example of how it would appear in BigQuery:

The only difference between this query and the previous one is that more data columns are selected. The previous query selected first_name only while this query selects customer_id and last_name in addition to first_name. In general, it is a more efficient use of resources to select only the columns that you need. For example, it makes sense to select more columns if you will actually use the additional fields in your WHERE clause. If you have multiple conditions in your WHERE clause, they may be written like this:

                SELECT

                                ColumnsA,

                               ColumnsB,

                               ColumnsC

                FROM

                                Tables the data lives in

                WHERE

                                Condition 1

                               AND Condition 2

                               AND Condition 3

Notice that, unlike the SELECT command that uses a comma to separate fields/variables/parameters, the WHERE command uses the AND statement to connect conditions. As you become a more advanced writer of queries, you will make use of other connectors/operators such as OR and NOT. 

Here is a BigQuery example with multiple fields used in a WHERE clause:

The above query uses three commands to locate customers with a valid (greater than 0) customer ID whose first name is Tony and last name is Magnolia.

  1. SELECT the columns named customer_idfirst_name, and last_name
  2. FROM a table named customer_name (in a dataset named customer_data) (The dataset name is always followed by a dot, and then the table name.)
  3. But only return the data WHERE customer_id is greater than 0, first_name is Tony, and last_name is Magnolia.

Note that one of the conditions is a logical condition that checks to see if customer_id is greater than zero.

If only one customer is named Tony Magnolia, the results from the query could be:

customer_idfirst_namelast_name
1967TonyMagnolia

If more than one customer has the same name, the results from the query could be:

customer_idfirst_namelast_name
1967TonyMagnolia
7689TonyMagnolia

Capitalization, indentation, and semicolons

You can write your SQL queries in all lowercase and don’t have to worry about extra spaces between words. However, using capitalization and indentation can help you read the information more easily. Keep your queries neat, and they will be easier to review or troubleshoot if you need to check them later on.

                SELECT

                                field1

                FROM

                                table

                WHERE

                                field1  = condition;

Notice that the SQL statement shown above has a semicolon at the end. The semicolon is a statement terminator and is part of the American National Standards Institute (ANSI) SQL-92 standard, which is a recommended common syntax for adoption by all SQL databases. However, not all SQL databases have adopted or enforce the semicolon, so it’s possible you may come across some SQL statements that aren’t terminated with a semicolon. If a statement works without a semicolon, it’s fine.

WHERE conditions

In the query shown above, the SELECT clause identifies the column you want to pull data from by name, field1, and the FROM clause identifies the table where the column is located by name, table. Finally, the WHERE clause narrows your query so that the database returns only the data with an exact value match or the data that matches a certain condition that you want to satisfy. 

For example, if you are looking for a specific customer with the last name Chavez, the WHERE clause would be: 

WHERE field1 = ‘Chavez’

However, if you are looking for all customers with the last name that begins with the letters “Ch,” the WHERE clause would be:

WHERE field1 LIKE ‘Ch%’

You can conclude that the LIKE clause is very powerful because it allows you to tell the database to look for a certain pattern! The percent sign (%) is used as a wildcard to match one or more characters. In the example above, both Chavez and Chen would be returned. Note that in some databases an asterisk (*) is used as the wildcard instead of a percent sign (%).

SELECT all columns

Can you use  SELECT * ?

In the example, if you replace SELECT field1 with SELECT * , you would be selecting all of the columns in the table instead of the field1 column only. From a syntax point of view, it is a correct SQL statement, but you should use the asterisk (*) sparingly and with caution. Depending on how many columns a table has, you could be selecting a tremendous amount of data. Selecting too much data can cause a query to run slowly.

Comments

Some tables aren’t designed with descriptive enough naming conventions. In the example, field1 was the column for a customer’s last name, but you wouldn’t know it by the name. A better name would have been something such as last_name. In these cases, you can place comments alongside your SQL to help you remember what the name represents. Comments are text placed between certain characters, /* and */, or after two dashes () as shown below. 

Comments can also be added outside of a statement as well as within a statement. You can use this flexibility to provide an overall description of what you are going to do, step-by-step notes about how you achieve it, and why you set different parameters/conditions. 

The more comfortable you get with SQL, the easier it will be to read and understand queries at a glance. Still, it never hurts to have comments in a query to remind yourself of what you’re trying to do. This also makes it easier for others to understand your query if your query is shared. As your queries become more and more complex, this practice will save you a lot of time and energy to understand complex queries you wrote months or years ago. 

In the above example, a comment has been added before the SQL statement to explain what the query does. Additionally, a comment has been added next to each of the column names to describe the column and its use. Two dashes (–) are generally supported. You can use # in place of — in the above query, but # is not recognized in all SQL versions; for example, MySQL doesn’t recognize #.  You can also place comments between /* and */ if the database you are using supports it. 

As your queries become more and more complex, the practice of adding helpful comments will save you a lot of time and energy to understand queries that you may have written months or years prior.

Aliases

You can also make it easier on yourself by assigning a new name or alias to the column or table names to make them easier to work with (and avoid the need for comments). This is done with a SQL AS clause. In the example below, the alias last_name has been assigned to field1 and the alias customers assigned to table. These aliases are good for the duration of the query only. An alias doesn’t change the actual name of a column or table in the database.

Data visualization

Previously, we learned that data visualization is the graphical representation of information. As a data analyst, you will create visualizations that make your data interesting and easy to understand. Due to the importance of visualization most data analytics tools, such as spreadsheets and databases, have a built-in visualization component. Other applications, such as Tableau, and R, specialize in visualization as their primary value.  

Spreadsheets (Excel or Google Sheets) are great for creating simple visualizations like bar graphs and pie charts and even provide some advanced visualizations. But sometimes you need a more powerful tool to truly bring your data to life. Tableau and RStudio are two examples of widely used platforms that can help you plan, create, and present effective and compelling data visualizations.

Tableau is a popular data visualization tool that lets you pull data from nearly any system and turn it into compelling visuals or actionable insights. The platform offers built-in visual best practices, which makes analyzing and sharing data fast, easy, and (most importantly) useful. Tableau works well with a wide variety of data and includes an interactive dashboard that lets you and your stakeholders click to explore the data interactively. 

Start exploring Tableau from the How-to Video resources. Tableau Public is free, easy to use, and full of helpful information. The Resources page is a one-stop-shop for how-to videos, examples, and datasets for you to practice with. I would advise exploring what other data analysts are sharing on Tableau, by visiting the Viz of the Day But with a fair warning, you could easily lose a couple of hours ogling the content found on that page as I did.

R with RStudio is another programming language a lot of data analysts work with. Most people who work with R end up also using RStudio, an integrated developer environment (IDE), for their data visualization needs. As with Tableau, you can create dashboard-style data visualizations using RStudio.

To learn more about RStudio check out their website, and spend a few days exploring all the resources provided at RStudio.com. The RStudio Cheatsheets and the RStudio Visualize Data Primer are great places to start.