Class 9: Data Science 2 — Querying SQL Tables for Network Construction

Class 9: Data Science 2 — Querying SQL Tables for Network Construction#

  1. Come in. Sit down. Open Teams.

  2. Make sure your notebook from last class is saved.

  3. Open up the Jupyter Lab server.

  4. Open up the Jupyter Lab terminal.

  5. Activate Conda: module load anaconda3/2022.05

  6. Activate the shared virtual environment: source activate /courses/PHYS7332.202510/shared/phys7332-env/

  7. Run python3 git_fixer2.py

  8. Github:

    • git status (figure out what files have changed)

    • git add … (add the file that you changed, aka the _MODIFIED one(s))

    • git commit -m “your changes”

    • git push origin main


Goals of today’s class#

  1. Get a bunch of reps in using SQL to learn about a dataset.

  2. Use SQL queries in concert with network science to characterize a network.

Today’s Dataset#

Today we’re going to focus on getting a lot of reps in in order to learn things about our dataset. It will involve a lot of hands-on coding and less of me talking at you.

The dataset we’ll be using today comes from SNAP; it’s the Reddit Hyperlinks dataset. The paper that arose from this dataset, by Kumar et al., can be found here.

Posts in one subreddit can link out to another subreddit for myriad reasons, including hatred, shared interests, and many things in between. I’ve somewhat simplified the dataset we’ll be working with today, but it has some interesting elements nonetheless. For one, each link has a TIMESTAMP attribute that indicates when the hyperlink was created. There’s also a LINK_SENTIMENT attribute that indicates whether the sentiment of the post or title that contained the link was positive (1) or negative (-1). The link_source attribute indicates whether the link came from the title or the body of the post.

First let’s get connected to our database again.

from sqlalchemy import create_engine
import pandas as pd

my_creds = []
with open('../../../shared/student_mysql_credentials.txt', 'r') as f:
    for line in f.readlines():
        my_creds.append(line.strip())

hostname="mysql-0005"
dbname="PHYS7332"
uname=my_creds[0]
pwd=my_creds[1]

engine = create_engine("mysql+pymysql://{user}:{pw}@{host}/{db}"
                           .format(host=hostname, db=dbname,
                                   user=uname,pw=pwd))

Looking at the data#

Next, we’ll take a look at the dataset. Let’s look at the first 10 rows just to get a sense of what the data looks like.

qu = 'SELECT * FROM reddit_hyperlinks LIMIT 10;'
pd.read_sql(qu, engine)

Which linkage pops up the most often in our dataset? Let’s find out.

We’re going to start by grouping our data by the combination of SOURCE_SUBREDDIT and TARGET_SUBREDDIT.

Then we’ll count how many rows each grouping contains and note that, along with the names of the source & target subreddits.

Finally, we’ll sort by the number of total linkages that occurred.

qu = """
SELECT COUNT(*) as count_links, SOURCE_SUBREDDIT, TARGET_SUBREDDIT \
FROM reddit_hyperlinks \
GROUP BY SOURCE_SUBREDDIT, TARGET_SUBREDDIT ORDER BY count_links;
"""
df_link_counts = pd.read_sql(qu, engine)
df_link_counts.sort_values('count_links', ascending=False)

Degree Distribution#

You might also be curious about the in- and out- degree distribution of this network. Can you compute the in- and out- degree distributions using the SQL table? In this case, we want to know how many distinct subreddits each other subreddit linked to, not how many times a subreddit was linked to.

# Your Turn!
import matplotlib.pyplot as plt
import numpy as np

def degree_distribution(k, number_of_bins=15, log_binning=True, density=True):
    """
    Given a degree sequence, return the y values (probability) and the
    x values (support) of a degree distribution that you're going to plot.
    
    Parameters
    ----------
    k: a list of nodes' degrees

    number_of_bins (int):
        length of output vectors
    
    log_binning (bool):
        if you are plotting on a log-log axis, then this is useful
    
    density (bool):
        whether to return counts or probability density (default: True)
        Note: probability densities integrate to 1 but do not sum to 1. 
        
    Returns
    -------
    hist, bins (np.ndarray):
        probability density if density=True node counts if density=False; binned edges
    
    """
    
    kmax = np.max(k)                    # get the maximum degree
    
    
    # Step 2: Then we'll need to construct bins
    if log_binning:
        # array of bin edges including rightmost and leftmost
        bins = np.logspace(0,np.log10(kmax+1),number_of_bins+1)
        bin_edges = []
        for ix in range(len(bins) - 1):
            bin_edges.append(np.exp((np.log(bins[ix])+np.log(bins[ix + 1]))/2))
    else:
        bins = np.linspace(0,kmax+1,num=number_of_bins+1)
        bin_edges = []
        for ix in range(len(bins) - 1):
            bin_edges.append((bins[ix] + bins[ix + 1]) / 2)
    # Step 3: Then we can compute the histogram using numpy
    hist, _ = np.histogram(k,bins,density=density)

    return bin_edges, hist
    
fig, ax = plt.subplots(1,2,figsize=(10,4),dpi=150)

in_degree_bin_edges, in_degree_freqs = degree_distribution(in_degree_list)
out_degree_bin_edges, out_degree_freqs = degree_distribution(out_degree_list)
ax[0].loglog(in_degree_bin_edges, in_degree_freqs, 'o')
ax[0].set_title('in-degree distribution')
ax[1].loglog(out_degree_bin_edges, out_degree_freqs, 'o')
ax[1].set_title('out-degree distribution')

Local Clustering#

Pick a subreddit – any subreddit (I suggest picking one that has a lot of links to keep things interesting). Can you figure out what its local clustering coefficient is in 2014, 2015, and 2016? Let’s assume the graph is unweighted for now, so one hyperlink from subreddit A to subreddit B in 2014 is the same as 10 separate hyperlinks from A to B in the same year. We will also assume the graph is undirected for this exercise.

# Your Turn
def compute_clustering(subreddit_name, engine):
    """
    Given the name of a subreddit and a SQL connection engine,
    return the local clustering coefficient of the subreddit,
    considering the network as undirected, OR 
    raise appropriate errors.

    subreddit_name: string; valid name of a subreddit
    engine: sqlalchemy engine object

    returns: the local clustering coefficient of the subreddit
    """

    pass
compute_clustering('amitheasshole', engine)