Lecture 14: Multithreading Practice
Sequential Link Explorer
Here’s a program that downloads the Wikipedia page for “Multithreading,” then sequentially downloads each page it links to, looking for the longest one:
extern crate reqwest;
extern crate select;
#[macro_use]
extern crate error_chain;
use select::document::Document;
use select::predicate::Name;
error_chain! {
    foreign_links {
        ReqError(reqwest::Error);
        IoError(std::io::Error);
    }
}
const TARGET_PAGE: &str = "https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)";
fn get_linked_pages(html_body: &str) -> Result<Vec<String>> {
    Ok(Document::from_read(html_body.as_bytes())?
        .find(Name("a"))
        .filter_map(|n| {
            if let Some(link_str) = n.attr("href") {
                if link_str.starts_with("/wiki/") {
                    Some(format!("{}/{}", "https://en.wikipedia.org", &link_str[1..]))
                } else {
                    None
                }
            } else {
                None
            }
        })
        .collect::<Vec<String>>())
}
// Adapted from https://rust-lang-nursery.github.io/rust-cookbook/web/scraping.html
fn main() -> Result<()> {
    let html_body = reqwest::blocking::get(TARGET_PAGE)?.text()?;
    // Identify all linked wikipedia pages
    let links = get_linked_pages(&html_body)?;
    let mut longest_article_url = "".to_string();
    let mut longest_article_len = 0;
    for link in &links {
        let body = reqwest::blocking::get(link)?.text()?;
        let curr_len = body.len();
        if curr_len > longest_article_len {
            longest_article_len = curr_len;
            longest_article_url = link.to_string();
        }
    }
    println!("{} was the longest article with length {}", longest_article_url,
        longest_article_len);
    Ok(())
}
Unfortunately, this is terribly slow, and it takes almost 3 minutes to run on my machine. The CPU is idle almost the entire time; we aren’t making good use of system resources.
Adding threads
Adding Arc/Mutex
We want threads to work together to find the longest article. By the end, we want the threads to collectively update longest_article_url so that we know what the longest article is.

As with last lecture, we’ll want to use an Arc and Mutex to ensure that the threads can all see/update the same longest article. (You can imagine we’re putting the longest article in a bathroom stall, and whenever a thread downloads an article, it’ll go into the bathroom stall to check it against the running longest article.)

However, there can only be one value in a Mutex, and we want to store both the longest article URL and length. To fix this, we can bundle the URL and length together in a tuple or a struct (we’ll opt for a struct), put this in our Mutex, and access it from our threads:
use std::sync::{Arc, Mutex};
use std::thread;

struct Article {
    url: String,
    length: usize,
}

fn main() -> Result<()> {
    // (fetch `links` with get_linked_pages, as before)
    let longest_article = Arc::new(Mutex::new(Article { url: "".to_string(), length: 0 }));
    let mut threads = Vec::new();
    for link in &links {
        let longest_article_handle = longest_article.clone();
        threads.push(thread::spawn(move || {
            let body = reqwest::blocking::get(link)?.text()?;
            let curr_len = body.len();
            let mut longest_article = longest_article_handle.lock().unwrap();
            if curr_len > longest_article.length {
                longest_article.length = curr_len;
                longest_article.url = link.to_string();
            }
        }));
    }
    for thread in threads {
        thread.join().unwrap();
    }
    let longest_article_ref = longest_article.lock().unwrap();
    println!("{} was the longest article with length {}", longest_article_ref.url,
        longest_article_ref.length);
    Ok(())
}
Error propagation from inside a thread
Compiling the above code gives us an error:
error[E0277]: the `?` operator can only be used in a closure that returns `Result` or `Option` (or another type that implements `std::ops::Try`)
--> src/main.rs:58:24
|
57 | threads.push(thread::spawn(move || {
| ____________________________________-
58 | | let body = reqwest::blocking::get(link)?.text()?;
| | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cannot use the `?` operator in a closure that returns `()`
59 | | let curr_len = body.len();
60 | | let mut longest_article = longest_article_handle.lock().unwrap();
... |
64 | | }
65 | | }));
| |_________- this function should return `Result` or `Option` to accept `?`
|
= help: the trait `std::ops::Try` is not implemented for `()`
= note: required by `std::ops::Try::from_error`
What’s this about? main does return Result! And we didn’t change this line when adding threading, so why is it giving us an error now?

If you look carefully, we moved the offending line inside of a closure function that runs inside a different thread. It’s this function that isn’t returning Result, which is what is causing problems. Furthermore, there’s a conceptual issue here: if this child thread returns an Error, how should we propagate that to the main thread?

Conveniently, Rust allows threads to return values back to the parent thread: you can add a return type to the closure function, and once the child thread returns, that value will be returned by thread::join:
let t = thread::spawn(move || -> i32 {
    println!("Inside the child thread, returning 5");
    return 5;
});
let x = t.join().expect("Thread panicked!");
println!("Parent thread: {}", x); // prints 5
This means that the child thread can return a Result back to the parent, which can propagate the error after join() returns it:
for link in &links {
    let longest_article_handle = longest_article.clone();
    threads.push(thread::spawn(move || -> Result<()> {
        // ^ note added "-> Result<()>" return type
        let body = reqwest::blocking::get(link)?.text()?;
        let curr_len = body.len();
        let mut longest_article = longest_article_handle.lock().unwrap();
        if curr_len > longest_article.length {
            longest_article.length = curr_len;
            longest_article.url = link.to_string();
        }
        // Once this thread is done, it needs to return Ok
        Ok(())
    }));
}
for thread in threads {
    thread.join().unwrap()?;
    // ^ note the added ?, which will stop/propagate if a thread returns Error
}
let longest_article_ref = longest_article.lock().unwrap();
println!("{} was the longest article with length {}", longest_article_ref.url,
    longest_article_ref.length);
Ensuring link lives long enough
We aren’t finished with the compiler errors:
error[E0597]: `links` does not live long enough
--> src/main.rs:55:17
|
55 | for link in &links {
| ^^^^^^
| |
| borrowed value does not live long enough
| argument requires that `links` is borrowed for `'static`
...
75 | }
| - `links` dropped here while still borrowed
The link variable is of type &String (i.e. it is a reference to a string owned by the main thread), and the Rust compiler is not 100% convinced that the main thread will outlive the child thread, so we get a lifetime error. (It would be a use-after-free if the child thread were to continue using this reference after the main thread cleaned up the memory.)
A simple fix is to move each link out of the vector and transfer ownership to each thread:
for link in links {
    // `link` is now an owned String
    threads.push(thread::spawn(move || -> Result<()> {
        // `link` is moved into the thread
        let body = reqwest::blocking::get(&link)?.text()?;
        ...
    }));
}
Of course, this means that you won’t be able to use links in the main thread after this loop, since all the elements have been moved out of the vector. If you wanted to continue using the vector, you could either clone the vector (e.g. for link in links.clone()), or you could put all the links in an Arc that all the threads share, to ensure that the memory will live long enough.
Limiting network connections
Great, this code finally compiles! However, it crashes shortly after running:
Error: Error(ReqError(reqwest::Error { kind: Request, url:
"https://en.wikipedia.org/wiki/Thread_(computer_science)", source:
hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 24, kind:
Other, message: "Too many open files" })) }), State { next_error: None,
backtrace: InternalBacktrace { backtrace: None } })
The key part of the error is Too many open files. We’ve actually run out of file descriptors! This happens when there are too many concurrent downloads being made, since each network connection takes a file descriptor.

A good way to limit the number of things happening at a time is to use a semaphore. Rust doesn’t have a semaphore in the standard library, but there are crates you can use, such as sema (https://docs.rs/sema/0.1.4/sema/struct.Semaphore.html). This can be used like a traditional semaphore, as shown in CS 110:
let longest_article = Arc::new(Mutex::new(Article { url: "".to_string(), length: 0 }));
// Put the semaphore in an Arc so that it can be shared by all threads
let permits = Arc::new(Semaphore::new(20)); // limit to 20 concurrent downloads
for link in links {
    let longest_article_handle = longest_article.clone();
    let permits = permits.clone();
    threads.push(thread::spawn(move || -> Result<()> {
        permits.wait()?; // wait/down semaphore
        let body = reqwest::blocking::get(&link)?.text()?;
        permits.post(); // signal/up semaphore
        // do other stuff...
    }));
}
But any time you see “acquire some resources with this function call, then release the resources with a different function call” (e.g. malloc/free, lock/unlock, wait/signal), alarm bells should be going off in your head: what if you forget to release the resource? In this case, the above code is buggy, because if the download fails, the ? operator will cause the function to return early and we won’t return the permit back to the semaphore. This is better than C++, where exceptions can cause non-obvious resource leaks (at least we can see the ? early return in the code), but it’s still a problem.

Instead, the sema library allows us to use a SemaphoreGuard, which will return the permit to the semaphore when dropped:
let longest_article = Arc::new(Mutex::new(Article { url: "".to_string(), length: 0 }));
// Put the semaphore in an Arc so that it can be shared by all threads
let permits = Arc::new(Semaphore::new(20)); // limit to 20 concurrent downloads
for link in links {
    let longest_article_handle = longest_article.clone();
    let permits = permits.clone();
    threads.push(thread::spawn(move || -> Result<()> {
        let permit = permits.take()?; // note "take" instead of "wait"
        let body = reqwest::blocking::get(&link)?.text()?;
        drop(permit); // return the permit so some other thread can have it
        // do other stuff...
    }));
}
On early return, permit will be dropped and the semaphore will be upped.
Limiting active threads
We no longer run out of file descriptors, but this code still periodically fails on some machines:
Error: Error(ReqError(reqwest::Error { kind: Request, url:
"https://en.wikipedia.org/wiki/File:Question_book-new.svg", source:
hyper::Error(Connect, ConnectError("dns error", Custom { kind: Other, error:
"failed to lookup address information: nodename nor servname provided, or not
known" })) }), State { next_error: None, backtrace: InternalBacktrace {
backtrace: None } })
This error is much more cryptic, but it is ultimately caused by having too many threads active. The thread limit is flexible and can be raised to allow thousands (or even tens of thousands) of threads, but most systems set a lower limit by default, and it’s usually not a good idea to spawn so many threads for a task like this anyway.
Semaphores can also be used here to throttle the creation of threads, but this use case is well suited for a thread pool, which allows you to create a fixed number of threads and then reuse those threads to do many tasks. Rust doesn’t have a thread pool in the standard library, but the threadpool crate provides one:
let threadpool = ThreadPool::new(20);
let permits = Arc::new(Semaphore::new(20));
for link in links {
    let longest_article_handle = longest_article.clone();
    let permits = permits.clone();
    threadpool.execute(move || {
        // execute takes a closure returning (), so we unwrap instead of using ?
        let permit = permits.take().unwrap();
        let body = reqwest::blocking::get(&link).unwrap().text().unwrap();
        drop(permit);
        let curr_len = body.len();
        let mut longest_article = longest_article_handle.lock().unwrap();
        if curr_len > longest_article.length {
            longest_article.length = curr_len;
            longest_article.url = link.to_string();
        }
    });
}
threadpool.join();
let longest_article_ref = longest_article.lock().unwrap();
println!("{} was the longest article with length {}", longest_article_ref.url,
    longest_article_ref.length);