Hadoop is engineered to consume the full capacity of every available resource, up to whichever one is currently the bottleneck. So in general, you should never issue requests against external services from a Hadoop job: no one-by-one queries against a database, no crawling of web pages, no requests to an external API. The resulting load spike is effectively what web security folks call a "DDoS", or distributed denial-of-service attack.
Unless of course you are trying to test a service for resilience against an adversarial DDoS — in which case that assault is a feature, not a bug!
elephant_stampede
require 'faraday'

processor :elephant_stampede do
  def process(logline)
    beg_at = Time.now.to_f
    resp = Faraday.get url_to_fetch(logline)
    yield summarize(resp, beg_at)
  end

  def summarize(resp, beg_at)
    duration = Time.now.to_f - beg_at
    bytesize = resp.body.bytesize
    { duration: duration, bytesize: bytesize }
  end

  def url_to_fetch(logline)
    logline.url
  end
end

flow(:mapper){ input > parse_loglines > elephant_stampede }
As written, each mapper issues one blocking request at a time. To make more than one simultaneous request per mapper, you must use Wukong’s EventMachine bindings.
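To get a rough feel for the size of that load spike, here is a back-of-envelope sketch in plain Ruby. It assumes one blocking request in flight per mapper, issued back to back. The helper name `aggregate_request_rate`, the cluster size, and the latency figure are all hypothetical, chosen only for illustration; they are not part of Wukong or Hadoop, and your cluster's numbers will differ.

```ruby
# Back-of-envelope estimate of the load a Hadoop job places on an
# external service. Every number here is a hypothetical assumption.

# Requests per second reaching the service when each mapper slot issues
# blocking requests back to back (one request in flight per mapper).
def aggregate_request_rate(mapper_slots, avg_latency_sec)
  mapper_slots / avg_latency_sec.to_f
end

mapper_slots    = 400    # assumption: say, 50 worker nodes with 8 map slots each
avg_latency_sec = 0.25   # assumption: the service answers in 250 ms on average

rate = aggregate_request_rate(mapper_slots, avg_latency_sec)
puts "requests in flight at once: #{mapper_slots}"
puts "requests per second:        #{rate.round}"
```

Under these assumed numbers the service sees 400 simultaneous connections and about 1,600 requests per second, sustained for as long as the job runs. Note that if the service slows down under load, the request rate drops but the 400 connections stay open, which is exactly the pattern of a denial-of-service attack.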