{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Class 6: Constructing a Network from Scraped Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Come in. Sit down. Open Teams.\n", "2. Make sure your notebook from last class is saved.\n", "3. Open up the Jupyter Lab server.\n", "4. Open up the Jupyter Lab terminal.\n", "5. Activate Conda: `module load anaconda3/2022.05`\n", "6. Activate the shared virtual environment: `source activate /courses/PHYS7332.202510/shared/phys7332-env/`\n", "7. Run `python3 git_fixer2.py`\n", "8. Github:\n", " - git status (figure out what files have changed)\n", " - git add ... (add the file that you changed, aka the `_MODIFIED` one(s))\n", " - git commit -m \"your changes\"\n", " - git push origin main\n", "________" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Goals of today's class\n", "1. Practice scraping news articles\n", "2. Get comfortable with error handling\n", "3. Build co-occurrence matrices & compare different kinds of co-occurrence matrices." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The KarJenner Social Graph According to TMZ\n", "Now, a fun exercise. We've used [`mediacloud.org`](https://www.mediacloud.org/documentation/search-api-guide) to obtain the URLs of all articles about the KarJenners (Kardashians & Jenners) from January 1st, 2024 until July 30, 2024. We'll load the article URLs and gradually construct a name co-occurrence network from a random subset of articles. Let's load the `.csv` of articles first." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['https://www.tmz.com/2024/07/30/eminem-death-slim-shady-ai-interview-mgk-mother/', 'https://www.tmz.com/2024/07/29/khloe-kardashian-dinosaur-tatum-birthday-party-kris-kim/', 'https://www.tmz.com/2024/07/29/travis-barker-selling-post-crash-boarding-pass-personal-note/', 'https://www.tmz.com/2024/07/28/july-2024-hot-shots-dog-days-of-summer-will-have-you-panting/', 'https://www.tmz.com/2024/07/28/kanye-west-bianca-censori-tiny-shorts-take-north-west-deadpool/']\n" ] } ], "source": [ "import csv\n", "with open('data/kardashian_jenner_urls_jan_1_2024_to_july_31_2024_mediacloud.csv', 'r') as f:\n", " reader = csv.reader(f, delimiter=',')\n", " urls = [line[-1] for line in reader][1:]\n", "print(urls[0:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's scrape an article! \n", "To make it fun, we'll each pick a random article to scrape and parse. We're going to use the `BeautifulSoup` python package to parse the HTML that we'll get via our `requests` module. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "import random\n", "import requests\n", "my_url = random.choice(urls[0:-20]) # leave this; the last 20 URLs are corrupted for Learning Purposes\n", "res = requests.get(my_url)\n", "soup = BeautifulSoup(res.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's print out the soup. This will show us what the raw HTML looks like, and disincentivize us from ever writing our own HTML parsers! 
Note that there are a lot of Javascript scripts in there, denoted with `<script>` tags." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[raw page HTML elided: hundreds of lines of <script> blocks, navigation and notification widgets, and the article text — in this case a Spanish-language TMZ story about the Diddy raids and Josh Flagg]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hunting for Meaning in the Soup\n", "The neat part of BeautifulSoup is that we can search for specific tags in the HTML. In this case, we're hunting for
`<p>` tags, which will contain the actual article text. Within the article text, if someone is an important celebrity or if they refer to another article, TMZ will link to the relevant TMZ page using an `<a>` tag. Let's hunt for these links: " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[<a href=\"https://www.tmz.com/people/josh-flagg/\">Josh Flagg </a>]\n", "[]\n", "[<a href=\"...\">los federales la destrozaron</a>]\n", "[]\n", "[]\n", "[<a href=\"...\">venderse por $17 millones</a>]\n", "[]\n", "[<a href=\"https://www.tmz.com/people/kylie-jenner/\"> Kylie Jenner</a>, <a href=\"...\">produjeran las redadas de Diddy.</a>]\n", "[]\n", "[<a href=\"...\">política de privacidad</a>, <a href=\"...\">términos de uso</a>]\n" ] } ], "source": [ "for text_line in soup.find_all('p'):\n", " a_tags = text_line.find_all('a')\n", " print(a_tags)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say we just want the pages that correspond to people. These pages' URLs have the pattern `https://www.tmz.com/people/NAME`. We can use this pattern to filter out other pages, then use string manipulation to grab just the person's name:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "people_in_article = []\n", "for text_line in soup.find_all('p'):\n", " a_tags = text_line.find_all('a')\n", " for tag in a_tags:\n", " href = tag.get('href')\n", " if href and 'https://www.tmz.com/people/' in href: # some <a> tags have no href\n", " people_in_article.append(href.split('/')[-2])\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['josh-flagg', 'kylie-jenner']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "people_in_article" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Minor Wrinkle\n", "The full list of URLs contains about 20 URLs that I have edited and made invalid. We're going to practice dealing with these invalid URLs as we develop functionality to scrape TMZ articles and get a list of the people who appear in each article that comes from a **valid** URL. Without further ado...\n", "\n", "### Handling Errors\n", "Sometimes, when you're scraping websites, a URL might not point to a valid web page, or something else unexpected might happen (the website might be down, for example). We don't want our code to give up entirely when it encounters one error, so we use *exception handling* techniques. The most common structure for handling exceptions in Python is the try/except block. Here's how it looks in its most basic form (as I recommend it):\n", "```\n", "try:\n", " do_something_that_might_fail()\n", "except Exception as e:\n", " print(e)\n", " do_something_else_to_record_the_failure()\n", "```\n", "What we're doing here is the following: first, we attempt the instructions that are wrapped in the `try` block. If they raise an exception (any sort of error that breaks the code: maybe the website doesn't resolve, or you're trying to choose the 5th element of an empty list, or you're trying to add a string to a `None`...you get the idea), then the code in the `except` block prints the error and follows a new set of instructions. \n", "\n", "We can handle multiple exception types in a single try/except setup. Let's say we're worried about a connection error from our web request *and* some other error from BeautifulSoup if the text we get back isn't valid HTML. \n",
\n", "```\n", "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "try:\n", " txt = requests.get(url).text\n", " soup = BeautifulSoup(txt)\n", "except requests.ConnectionError as e:\n", " print('connection error')\n", " print(e)\n", "except Exception as e:\n", " print('other error')\n", " print(e)\n", " \n", "```\n", "\n", "### Logging Errors\n", "Print statements are all well and good, but sometimes we want to log things a little more permanently. Enter the [`logging`](https://docs.python.org/3/library/logging.html) package. At a basic level, this lets you specify a file that you write all your log messages to (rather than printing them out to the terminal). The different levels of logging indicate what urgency a message should have in order to make it into your log file. The levels of urgency are (in increasing order) `[debug, info, warning, error]`. \n", "```\n", "import logging\n", "\n", "logger = logging.getLogger(__name__)\n", "\n", "logging.basicConfig(filename='my_log.log', encoding='utf-8', level=logging.DEBUG)\n", "\n", "logging.debug('is this thing on?')\n", "logging.info('yep, i think it's on')\n", "```\n", "\n", "When you handle exceptions, this presents a great opportunity to write to a log and keep track of bad URLs or other errors. Then you can go back and figure out if there are any patterns you need to be concerned about." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Being Considerate While Scraping\n", "In order to be considerate scrapers, we do our best to not overload the website. This means we wait for 5 or so seconds, if not longer, between requests. We also don't scrape too much data; to start, I'll ask you to scrape 30 articles randomly chosen from our list of 330 URLs. \n", "\n", "### Python Scripts\n", "Let's put this all together to make a script. A `.py` file is a Python script that can be used to execute a task. You can also write `.py` files that contain functions, and then you can import those functions from your script into **another** script or an ipython notebook. We run scripts like this: \n", "\n", "`python my_script.py`. \n", "\n", "You can think of running a script as being like executing a cell in a Jupyter notebook, except you run it in the terminal. Scripts are superior to ipython notebooks when you want to run code a lot of times or want to build a testable, reusable package. They are good for more permanent applications or when you have a lot of functions that you want to import elsewhere. \n", "\n", "The script we'll write should have logging functionality as well as a function that takes in an article URL and returns a list of the people whose pages were linked in the article (if the URL is valid). If the URL is not valid, it should complain about it in the log file. It will iterate through a random sample of URLs from our list of URLs, getting the list of people in each article (or complaining if the URL is invalid), and then it will save the data we generated. \n", "\n", "\n", "### Saving data\n", "We'll save the list as a `pickle` file:\n", "```\n", "import pickle\n", "f = some_object\n", "pickle.dump(f, open('name_of_variable.pkl', 'wb'))\n", "```\n", "`pickle` is a Python package that is useful for saving and loading objects. It turns Python objects into files that you can later load into the same object. Please don't use it for data that you're sharing widely with others, intend to persist for years, or for very large data; in all of these cases, things can go wrong and make your life difficult. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Being Considerate While Scraping\n", "In order to be considerate scrapers, we do our best not to overload the website. This means we wait for 5 or so seconds, if not longer, between requests. We also don't scrape too much data; to start, I'll ask you to scrape 30 articles randomly chosen from our list of 330 URLs. \n", "\n", "### Python Scripts\n", "Let's put this all together to make a script. A `.py` file is a Python script that can be used to execute a task. You can also write `.py` files that contain functions, and then you can import those functions from your script into **another** script or an ipython notebook. We run scripts like this: \n", "\n", "`python my_script.py`. \n", "\n", "You can think of running a script as being like executing a cell in a Jupyter notebook, except you run it in the terminal. Scripts are superior to ipython notebooks when you want to run code a lot of times or want to build a testable, reusable package. They are good for more permanent applications or when you have a lot of functions that you want to import elsewhere. \n", "\n", "The script we'll write should have logging functionality as well as a function that takes in an article URL and returns a list of the people whose pages were linked in the article (if the URL is valid). If the URL is not valid, it should complain about it in the log file. It will iterate through a random sample of URLs from our list of URLs, getting the list of people in each article (or complaining if the URL is invalid), and then it will save the data we generated. \n", "\n", "\n", "### Saving data\n", "We'll save the list as a `pickle` file:\n", "```\n", "import pickle\n", "my_object = some_object\n", "pickle.dump(my_object, open('name_of_variable.pkl', 'wb'))\n", "```\n", "`pickle` is a Python package that is useful for saving and loading objects. It turns Python objects into files that you can later load back into the same objects. Please don't use it for data that you're sharing widely with others, that you intend to persist for years, or that is very large; in all of these cases, things can go wrong and make your life difficult. Better formats are databases, `.csv` or `.tsv` files, or specific, more standardized formats like `.gexf` or `.gml` for networks. \n",
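"\n", "Loading the object back in is the mirror image of dumping it. A minimal sketch (assuming the `.pkl` file was written as above):\n", "```\n", "import pickle\n", "\n", "# 'rb' (read binary) mirrors the 'wb' (write binary) mode we dumped with\n", "with open('name_of_variable.pkl', 'rb') as f:\n", "    loaded_object = pickle.load(f)\n", "```" ] }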
, { "cell_type": "code", "execution_count": 1, "metadata": { "mystnb": { "remove_code_outputs": true } }, "outputs": [ { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 26\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0murl\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrandom\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msample\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0murls\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m30\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 27\u001b[0m \u001b[0mlists_of_people\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mget_people_in_article\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0murl\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 28\u001b[0;31m \u001b[0mtime\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msleep\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m7\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 29\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 30\u001b[0m \u001b[0mpickle\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdump\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlists_of_people\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'lists_of_people.pkl'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'wb'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyboardInterrupt\u001b[0m: " ] } ], "source": [ "# Your Turn!\n", "import logging\n", "import pickle\n", "import time\n", "import random\n", "import csv\n", "\n", "logger = logging.getLogger(__name__)\n", "logging.basicConfig(filename='kardashian_problems.log', encoding='utf-8', level=logging.DEBUG)\n", "\n", "with open('data/kardashian_jenner_urls_jan_1_2024_to_july_31_2024_mediacloud.csv', 'r') as f:\n", " reader = csv.reader(f, delimiter=',')\n", " urls = [line[-1] for line in reader][1:]\n", " \n", "def get_people_in_article(url):\n", " \"\"\"\n", " Given a URL (string) of a TMZ article, \n", " return a list of the names of the people (as strings) whose TMZ pages are linked in the article. \n", " You can handle invalid URLs within this function or in the for loop below, \n", " but make sure you're handling them!\n", " \"\"\"\n", " logging.debug('is this thing on?')\n", " \n", " pass\n", "\n", "lists_of_people = []\n", "for url in random.sample(urls, 30):\n", " lists_of_people.append(get_people_in_article(url))\n", " time.sleep(7)\n", " \n", "pickle.dump(lists_of_people, open('lists_of_people.pkl', 'wb'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have saved your `.py` script somewhere on Discovery, we can move on to practicing opening and closing a `screen` session.\n", "\n", "## Sidebar: Linux Screen\n", "You may have run into this problem: you're trying to run some code, but it takes a long time to finish. You want to leave your computer and go do other things, or your computer shuts down in the middle of the night while your code is running. Enter [`screen`](https://linuxhandbook.com/screen-command/). `screen` is a program that's on most Linux/Mac machines. It allows you to open up a new terminal window within `screen`, start running your code, and then close the terminal window *with the code still running*. You can log out of the server you're on, go about your day, and open up the `screen` session again to see how your code is doing. \n", "\n", "### Using `screen`\n", "You open a screen using the command `screen -R SCREEN_NAME`. Name your screen something useful that you'll remember later; if you have multiple sessions going at once, you'll want to know which session is doing which task. Once in the screen session, you can treat it as a normal terminal session. Let's start running our Python script within the screen session, then detach from the screen. To detach from a screen session, we type CTRL-A and then d (for \"detach\"). \n", "\n", "Now let's list our active screen sessions using `screen -list`. Your screen session should pop up here! To resume your screen session, you can type `screen -R SCREEN_NAME` -- the `-R` stands for \"resume.\" If you want to kill (end) a screen session, you can type CTRL-A and then k (for \"kill\"). Don't do this yet - we're still waiting for our code to run! \n",
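"\n", "Here's what a typical workflow might look like (the `tmz_scraper` session name is just an example):\n", "```\n", "screen -R tmz_scraper     # create a session (or resume one with this name)\n", "python my_script.py       # inside the session, start the long-running script\n", "# ...press CTRL-A, then d, to detach...\n", "screen -list              # list the active sessions; ours should be here\n", "screen -R tmz_scraper     # reattach later to check on our script\n", "```\n", "\n",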
"While we wait for our code to run, let's switch back to our notebook and practice constructing a network from the data we already have." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Constructing a Network from Scraped Data\n", "Next, we're going to construct a network from our scraped data. How? We're going to put a link between people who are mentioned in the same article. First, we'll build an unweighted network. We'll need to create a blank `networkx` `Graph` object and iterate over each list of people. For each list of people, we'll need to make sure that we get rid of duplicate mentions and then link everyone who is mentioned in the same article (no self-links, please!)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import networkx as nx\n", "\n", "unweighted_g = nx.Graph()\n", "\n", "# Your turn: build a graph that links people mentioned in the same article!" ] }
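, { "cell_type": "markdown", "metadata": {}, "source": [ "If you get stuck, here's a minimal sketch of one possible approach (assuming `lists_of_people` is the list of per-article name lists our script saves); `itertools.combinations` conveniently skips self-links for us:\n", "```\n", "import itertools\n", "\n", "for people in lists_of_people:\n", "    # set() collapses duplicate mentions within a single article\n", "    for person_a, person_b in itertools.combinations(set(people), 2):\n", "        unweighted_g.add_edge(person_a, person_b)\n", "```" ] }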
, { "cell_type": "markdown", "metadata": {}, "source": [ "### What can we do with our network?\n", "Let's think about what we can learn from our network. One way to figure out who the important players in a particular network are is called *k-core decomposition*. In plain terms, obtaining the [*k-core*](https://www.baeldung.com/cs/graph-k-core) of a graph for a particular value of *k* means we look at all the nodes in our network. If their degree is greater than or equal to *k*, we keep them. Otherwise, we remove them. We repeat this pruning (recomputing degrees as nodes are removed) until every remaining node has degree at least *k*. \n", "\n", "### K-core Decomposition\n", "\n", "The $k$-core is a maximal set of nodes such that its induced subgraph only contains vertices with degree larger than or equal to $k$. For directed graphs, the degree is assumed to be the total (in + out) degree. The algorithm accepts graphs with parallel edges and self loops, in which case these edges contribute to the degree in the usual fashion. This algorithm is described in [1,2,3] and runs in $O(V+E)$ time.\n", "\n", "(Description from: https://graph-tool.skewed.de/static/doc/autosummary/graph_tool.topology.kcore_decomposition.html.)\n", "\n", "The abstract of the original paper introducing k-cores puts it nicely:\n", "\n", " Social network researchers have long sought measures of network cohesion. Density has often been used for this purpose, despite its generally admitted deficiencies. An approach to network cohesion is proposed that is based on minimum degree and which produces a sequence of subgraphs of gradually increasing cohesion. The approach also associates with any network measures of local density which promise to be useful both in characterizing network structures and in comparing networks.\n", "\n", "Below is a helpful graphic, from [3].\n", "\n", "![](images/kcore_example.png)\n", "\n", "[1] Seidman, S. B. (1983). Network structure and minimum degree. *Social Networks*, 5(3), 269-287. https://doi.org/10.1016/0378-8733(83)90028-X\n", "\n", "[2] Batagelj, V., Zaveršnik, M. Fast algorithms for determining (generalized) core groups in social networks. *Advances in Data Analysis and Classification*, 5, 129–145 (2011). https://doi.org/10.1007/s11634-010-0079-y\n", "\n", "[3] Malliaros, F.D., Giatsidis, C., Papadopoulos, A.N. et al. The core decomposition of networks: theory, algorithms and applications. *The VLDB Journal*, 29, 61–92 (2020). https://doi.org/10.1007/s00778-019-00587-4\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___________\n", "### Interactive Moment\n", "What do you think the k-core of the TMZ mention network made of articles about the Kardashian/Jenner family will tell us about the Kardashian/Jenner family?\n", "\n", "### K-core size\n", "First, let's plot the size of the k-core of this network as *k* increases." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "mystnb": { "remove_code_outputs": true } }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "n_in_k_core = []\n", "for k in range(1, 10):\n", " n_in_k_core.append(len(nx.k_core(unweighted_g, k)))\n", "plt.plot([k for k in range(1, 10)], n_in_k_core)\n", "plt.xlabel('Value of k')\n", "plt.ylabel('Number of individuals in k-core')\n", "plt.title('K-core size of Kardashian/Jenner TMZ Mention Network')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fun Activity\n", "How would you answer these questions, using your TMZ mention network and k-core decomposition?\n", "* Who is most important to the Kardashian/Jenner family? \n", "* Who is least important? " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Your Turn!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Another Fun Activity\n", "Try to figure out the following exercises:\n", "* Can you figure out who the highest-degree nodes in your network are? Are there any that are surprising?\n", "* Compare your results with the results from the scraping run you did in your `screen` session. What's similar? What's different?" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Your Turn!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Weighted k-core decomposition\n", "The networks we just created and ran k-core decomposition algorithms on were unweighted, as is typical for k-core decomposition. We're going to construct a weighted graph and then use the weighted k-core definition from [this paper](http://www.graphdegeneracy.org/k-cores.pdf) to run a weighted k-core decomposition. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's write a function that takes a list of article URLs as input and constructs a *weighted* graph. Each edge between nodes $(i, j)$ should be weighted according to the number of times $i$ and $j$ appeared together in the same article." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Your Turn!\n", "def make_weighted_graph_from_urls(url_list):\n", " \"\"\"\n", " Given a list of valid TMZ article URLs, constructs a weighted networkx graph object.\n", " Each edge between person i and person j is weighted \n", " according to the number of times i and j appeared in the same article.\n", " \"\"\"\n", " pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, write a function that does the weighted k-core decomposition specified in the article linked above. You may find the `networkx` [subgraph view](https://networkx.org/documentation/stable/reference/classes/generated/networkx.classes.graphviews.subgraph_view.html) function helpful here." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Your Turn!\n", "def weighted_kcore_decomposition(nx_graph, k):\n", " \"\"\"\n", " Given a networkx weighted graph object, \n", " compute the weighted k-core decomposition for the given value of k.\n", " \"\"\"\n", " pass\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After that, write code to plot the number of nodes remaining in each weighted k-core over a range of reasonable values. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "mystnb": { "remove_code_outputs": true } }, "outputs": [], "source": [ "# Your Turn!\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "network_sizes = []\n", "for k in range(0, 100): # TODO: change upper limit as needed\n", " weighted_kcore = weighted_kcore_decomposition(my_graph, k)\n", " network_size = 0 # TODO: get the size of the network\n", " network_sizes.append(network_size)\n", " \n", "plt.title('Size of weighted k-core')\n", "plt.xlabel('Value of k')\n", "plt.ylabel('Nodes in weighted k-core')\n", "plt.plot([k for k in range(0, 100)], network_sizes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, compare the k-core decomposition results from the unweighted network to the weighted network. How are they similar? How are they different?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your Turn!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__________\n", "## Next time...\n", "Big Data 1 — Algorithmic Complexity & Computing Paths `class_07_bigdata1.ipynb`\n", "_______\n", "\n", "## References and further resources:\n", "1. [Are the Kardashians Losing Their Influence?](https://www.elle.com/uk/life-and-culture/culture/a46440253/the-kardashians-influence/) (this is not serious reading, sorry).\n", "2. [MediaCloud](https://www.mediacloud.org/) is worth a second look; it's a great source for looking at news across the US and across the world. If you're studying news coverage at all, I would highly recommend checking it out!\n", "3. [More on Linux Screen](https://linuxhandbook.com/screen-command/)\n", "4. [More on logging in Python](https://realpython.com/python-logging/)\n", "5. [An explanation of a graph's k-core](https://www.baeldung.com/cs/graph-k-core)\n", "6. [Giatsidis et al. 
on weighted k-core decomposition](http://www.graphdegeneracy.org/k-cores.pdf)\n", "7. Seidman, S. B. (1983). Network structure and minimum degree. *Social Networks*, 5(3), 269-287. https://doi.org/10.1016/0378-8733(83)90028-X\n", "8. Batagelj, V., Zaveršnik, M. Fast algorithms for determining (generalized) core groups in social networks. *Advances in Data Analysis and Classification*, 5, 129–145 (2011). https://doi.org/10.1007/s11634-010-0079-y\n", "9. Malliaros, F.D., Giatsidis, C., Papadopoulos, A.N. et al. The core decomposition of networks: theory, algorithms and applications. *The VLDB Journal*, 29, 61–92 (2020). https://doi.org/10.1007/s00778-019-00587-4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 5 }