{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "D8bsQl6JmfvJ" }, "source": [ "# Data Transformation\n", "\n", "In previous notebooks we learned how to use marks and visual encodings to represent individual data records. Here we will explore methods for *transforming* data, including the use of aggregates to summarize multiple records. Data transformation is an integral part of visualization: choosing the variables to show and their level of detail is just as important as choosing appropriate visual encodings. After all, it doesn't matter how well chosen your visual encodings are if you are showing the wrong information!\n", "\n", "As you work through this module, we recommend that you open the [Altair Data Transformations documentation](https://altair-viz.github.io/user_guide/transform/index.html) in another tab. It will be a useful resource if at any point you'd like more details or want to see what other transformations are available.\n", "\n", "_This notebook is part of the [data visualization curriculum](https://github.com/uwdata/visualization-curriculum)._" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": {}, "colab_type": "code", "id": "zYMsJwmgJ4R7" }, "outputs": [], "source": [ "import pandas as pd\n", "import altair as alt" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "a7BV1iAjml7x" }, "source": [ "## The Movies Dataset" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "yYtrAahLV3eU" }, "source": [ "We will be working with a table of data about motion pictures, taken from the [vega-datasets](https://vega.github.io/vega-datasets/) collection. The data includes variables such as the film name, director, genre, release date, ratings, and gross revenues. However, _be careful when working with this data_: the films are from unevenly sampled years, using data combined from multiple sources. 
If you dig in you will find issues with missing values and even some subtle errors! Nevertheless, the data should prove interesting to explore...\n", "\n", "Let's store the URL for the JSON data file, which is served from the vega-datasets collection, and then read the data into a Pandas data frame so that we can inspect its contents." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": {}, "colab_type": "code", "id": "khT9YvmEicpo" }, "outputs": [], "source": [ "movies_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'\n", "movies = pd.read_json(movies_url)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "P6wdIPQsXKHK" }, "source": [ "How many rows (records) and columns (fields) are in the movies dataset?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "RR-fAxzmXM6H", "outputId": "4a4e1680-7da3-48e1-ba34-5129c433fbe1" }, "outputs": [ { "data": { "text/plain": [ "(3201, 16)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies.shape" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "PZlB4ti1XCn1" }, "source": [ "Now let's peek at the first 5 rows of the table to get a sense of the fields and data types..." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 445 }, "colab_type": "code", "id": "Oby1ngC7VoUL", "outputId": "a9b8b839-6428-4da3-db0c-041cb78b703d" }, "outputs": [ { "data": { "text/html": [ "
\n",
"|   | Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | The Land Girls | 146083.0 | 146083.0 | NaN | 8000000.0 | Jun 12 1998 | R | NaN | Gramercy | None | None | None | None | NaN | 6.1 | 1071.0 |\n",
"| 1 | First Love, Last Rites | 10876.0 | 10876.0 | NaN | 300000.0 | Aug 07 1998 | R | NaN | Strand | None | Drama | None | None | NaN | 6.9 | 207.0 |\n",
"| 2 | I Married a Strange Person | 203134.0 | 203134.0 | NaN | 250000.0 | Aug 28 1998 | None | NaN | Lionsgate | None | Comedy | None | None | NaN | 6.8 | 865.0 |\n",
"| 3 | Let's Talk About Sex | 373615.0 | 373615.0 | NaN | 300000.0 | Sep 11 1998 | None | NaN | Fine Line | None | Comedy | None | None | 13.0 | NaN | NaN |\n",
"| 4 | Slam | 1009819.0 | 1087521.0 | NaN | 1000000.0 | Oct 09 1998 | R | NaN | Trimark | Original Screenplay | Drama | Contemporary Fiction | None | 62.0 | 3.4 | 165.0 |\n", "