Lecture 14: Multithreading Practice
Sequential Link Explorer
Here’s a program that downloads the Wikipedia page for “Multithreading,” then sequentially downloads each page it links to, looking for the longest one:
extern crate reqwest;
extern crate select;
#[macro_use]
extern crate error_chain;
use select::document::Document;
use select::predicate::Name;
error_chain! {
    foreign_links {
        ReqError(reqwest::Error);
        IoError(std::io::Error);
    }
}
const TARGET_PAGE: &str = "https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)";
fn get_linked_pages(html_body: &str) -> Result<Vec<String>> {
    Ok(Document::from_read(html_body.as_bytes())?
        .find(Name("a"))
        .filter_map(|n| {
            if let Some(link_str) = n.attr("href") {
                if link_str.starts_with("/wiki/") {
                    Some(format!("{}/{}", "https://en.wikipedia.org", &link_str[1..]))
                } else {
                    None
                }
            } else {
                None
            }
        })
        .collect::<Vec<String>>())
}
// Adapted from https://rust-lang-nursery.github.io/rust-cookbook/web/scraping.html
fn main() -> Result<()> {
    let html_body = reqwest::blocking::get(TARGET_PAGE)?.text()?;
    // Identify all linked wikipedia pages
    let links = get_linked_pages(&html_body)?;
    let mut longest_article_url = "".to_string();
    let mut longest_article_len = 0;
    for link in &links {
        let body = reqwest::blocking::get(link)?.text()?;
        let curr_len = body.len();
        if curr_len > longest_article_len {
            longest_article_len = curr_len;
            longest_article_url = link.to_string();
        }
    }
    println!("{} was the longest article with length {}", longest_article_url,
        longest_article_len);
    Ok(())
}
Unfortunately, this is terribly slow, and it takes almost 3 minutes to run on my machine. The CPU is idle almost the entire time; we aren’t making good use of system resources.
Adding threads
Adding Arc/Mutex
We want threads to work together to find the longest article. By the end, we want the threads to collectively update longest_article_url so that we know what the longest article is.

As with last lecture, we’ll want to use an Arc and Mutex to ensure that the threads can all see/update the same longest article. (You can imagine we’re putting the longest article in a bathroom stall, and whenever a thread downloads an article, it’ll go into the bathroom stall to check it against the running longest article.)

However, there can only be one value in a Mutex, and we want to store both the longest article URL and length. To fix this, we can bundle the URL and length together in a tuple or a struct (we’ll opt for a struct), put this in our Mutex, and access it from our threads:
use std::sync::{Arc, Mutex};
use std::thread;

struct Article {
    url: String,
    length: usize,
}

fn main() -> Result<()> {
    // (fetch `links` with get_linked_pages, as before)
    let longest_article = Arc::new(Mutex::new(Article { url: "".to_string(), length: 0 }));
    let mut threads = Vec::new();
    for link in &links {
        let longest_article_handle = longest_article.clone();
        threads.push(thread::spawn(move || {
            let body = reqwest::blocking::get(link)?.text()?;
            let curr_len = body.len();
            let mut longest_article = longest_article_handle.lock().unwrap();
            if curr_len > longest_article.length {
                longest_article.length = curr_len;
                longest_article.url = link.to_string();
            }
        }));
    }
    for thread in threads {
        thread.join().unwrap();
    }
    let longest_article_ref = longest_article.lock().unwrap();
    println!("{} was the longest article with length {}", longest_article_ref.url,
        longest_article_ref.length);
    Ok(())
}
Error propagation from inside a thread
Compiling the above code gives us an error:
error[E0277]: the `?` operator can only be used in a closure that returns `Result` or `Option` (or another type that implements `std::ops::Try`)
--> src/main.rs:58:24
|
57 | threads.push(thread::spawn(move || {
| ____________________________________-
58 | | let body = reqwest::blocking::get(link)?.text()?;
| | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cannot use the `?` operator in a closure that returns `()`
59 | | let curr_len = body.len();
60 | | let mut longest_article = longest_article_handle.lock().unwrap();
... |
64 | | }
65 | | }));
| |_________- this function should return `Result` or `Option` to accept `?`
|
= help: the trait `std::ops::Try` is not implemented for `()`
= note: required by `std::ops::Try::from_error`
What’s this about? main does return Result! And we didn’t change this line when adding threading, so why is it giving us an error now?

If you look carefully, we moved the offending line inside of a closure function that runs inside a different thread. It’s this function that isn’t returning Result, which is what is causing problems. Furthermore, there’s a conceptual issue here: if this child thread returns an Error, how should we propagate that to the main thread?

Conveniently, Rust allows threads to return values back to the parent thread: you can add a return type to the closure function, and once the child thread returns, that value will be returned by thread::join:
let t = thread::spawn(move || -> i32 {
    println!("Inside the child thread, returning 5");
    return 5;
});
let x = t.join().expect("Thread panicked!");
println!("Parent thread: {}", x); // prints 5
This means that the child thread can return a Result back to the parent, which can propagate the error after join() returns it:
for link in &links {
    let longest_article_handle = longest_article.clone();
    threads.push(thread::spawn(move || -> Result<()> {
        // ^ note added "-> Result<()>" return type
        let body = reqwest::blocking::get(link)?.text()?;
        let curr_len = body.len();
        let mut longest_article = longest_article_handle.lock().unwrap();
        if curr_len > longest_article.length {
            longest_article.length = curr_len;
            longest_article.url = link.to_string();
        }
        // Once this thread is done, it needs to return Ok
        Ok(())
    }));
}
for thread in threads {
    thread.join().unwrap()?;
    // ^ note the added ?, which will stop/propagate if a thread returns Error
}
let longest_article_ref = longest_article.lock().unwrap();
println!("{} was the longest article with length {}", longest_article_ref.url,
    longest_article_ref.length);
Ensuring link lives long enough
We aren’t finished with the compiler errors:
error[E0597]: `links` does not live long enough
--> src/main.rs:55:17
|
55 | for link in &links {
| ^^^^^^
| |
| borrowed value does not live long enough
| argument requires that `links` is borrowed for `'static`
...
75 | }
| - `links` dropped here while still borrowed
The link variable is of type &String (i.e. it is a reference to a string owned by the main thread), and the Rust compiler is not 100% convinced that the main thread will outlive the child thread, so we get a lifetime error. (It would be a use-after-free if the child thread were to continue using this reference after the main thread cleaned up the memory.)
A simple fix is to move each link out of the vector and transfer ownership to each thread:
for link in links {
    // `link` is now an owned String
    threads.push(thread::spawn(move || -> Result<()> {
        // `link` is moved into the thread
        let body = reqwest::blocking::get(&link)?.text()?;
        ...
    }));
}
Of course, this means that you won’t be able to use links in the main thread after this loop, since all the elements have been moved out of the vector. If you wanted to continue using the vector, you could either clone the vector (e.g. for link in links.clone()), or you could put all the links in an Arc that all the threads share, to ensure that the memory will live long enough.
Limiting network connections
Great, this code finally compiles! However, it crashes shortly after running:
Error: Error(ReqError(reqwest::Error { kind: Request, url:
"https://en.wikipedia.org/wiki/Thread_(computer_science)", source:
hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 24, kind:
Other, message: "Too many open files" })) }), State { next_error: None,
backtrace: InternalBacktrace { backtrace: None } })
The key part of the error is Too many open files. We’ve actually run out of file descriptors! This happens when there are too many concurrent downloads being made, since each network connection takes a file descriptor.

A good way to limit the number of things happening at a time is to use a semaphore. Rust doesn’t have a semaphore in the standard library, but there are crates you can use, such as sema (https://docs.rs/sema/0.1.4/sema/struct.Semaphore.html). This can be used like a traditional semaphore, as shown in CS 110:
let longest_article = Arc::new(Mutex::new(Article { url: "".to_string(), length: 0 }));
// Put the semaphore in an Arc so that it can be shared by all threads
let permits = Arc::new(Semaphore::new(20)); // limit to 20 concurrent downloads
for link in links {
    let longest_article_handle = longest_article.clone();
    let permits = permits.clone();
    threads.push(thread::spawn(move || -> Result<()> {
        permits.wait()?; // wait/down semaphore
        let body = reqwest::blocking::get(&link)?.text()?;
        permits.post(); // signal/up semaphore
        // do other stuff...
    }));
}
But any time you see “acquire some resources with this function call, then release the resources with a different function call” (e.g. malloc/free, lock/unlock, wait/signal), alarm bells should be going off in your head: what if you forget to release the resource? In this case, the above code is buggy, because if the download fails, the ? operator will cause the function to return early and we won’t return the permit back to the semaphore. This is better than C++, where exceptions can cause non-obvious resource leaks (at least we can see the ? early return in the code), but it’s still a problem.

Instead, the sema library allows us to use a SemaphoreGuard, which will return the permit to the semaphore when dropped:
let longest_article = Arc::new(Mutex::new(Article { url: "".to_string(), length: 0 }));
// Put the semaphore in an Arc so that it can be shared by all threads
let permits = Arc::new(Semaphore::new(20)); // limit to 20 concurrent downloads
for link in links {
    let longest_article_handle = longest_article.clone();
    let permits = permits.clone();
    threads.push(thread::spawn(move || -> Result<()> {
        let permit = permits.take()?; // note "take" instead of "wait"
        let body = reqwest::blocking::get(&link)?.text()?;
        drop(permit); // return the permit so some other thread can have it
        // do other stuff...
    }));
}
On early return, permit will be dropped and the semaphore will be upped.
Limiting active threads
We no longer run out of file descriptors, but this code still periodically fails on some machines:
Error: Error(ReqError(reqwest::Error { kind: Request, url:
"https://en.wikipedia.org/wiki/File:Question_book-new.svg", source:
hyper::Error(Connect, ConnectError("dns error", Custom { kind: Other, error:
"failed to lookup address information: nodename nor servname provided, or not
known" })) }), State { next_error: None, backtrace: InternalBacktrace {
backtrace: None } })
This error is much more cryptic, but it is ultimately caused by having too many threads active. The thread limit is flexible and can be raised to allow thousands (or even tens of thousands) of threads, but most systems set a lower limit by default, and it’s usually not a good idea to spawn so many threads for a task like this anyway.
Semaphores can also be used here to throttle the creation of threads, but this use case is well suited for a thread pool, which allows you to create a fixed number of threads and then reuse those threads to do many tasks. Rust doesn’t have a thread pool in the standard library, but the threadpool crate provides one:
let threadpool = ThreadPool::new(20);
let permits = Arc::new(Semaphore::new(20));
for link in links {
    let longest_article_handle = longest_article.clone();
    let permits = permits.clone();
    threadpool.execute(move || {
        // execute takes a closure returning (), so we unwrap instead of using ?
        let permit = permits.take().unwrap();
        let body = reqwest::blocking::get(&link).unwrap().text().unwrap();
        drop(permit);
        let curr_len = body.len();
        let mut longest_article = longest_article_handle.lock().unwrap();
        if curr_len > longest_article.length {
            longest_article.length = curr_len;
            longest_article.url = link.to_string();
        }
    });
}
threadpool.join();
let longest_article_ref = longest_article.lock().unwrap();
println!("{} was the longest article with length {}", longest_article_ref.url,
    longest_article_ref.length);