On multi-tenant clusters, or clusters with varied workloads, it is a good idea to protect jobs from starving each other and to guarantee a minimum share for certain workloads. Invariably, at some point a job will consume too many resources; if the cluster is configured with preemption, slots cannot be monopolized indefinitely.
There are at least two pluggable job schedulers: the Capacity Scheduler and the Fair Scheduler. In this post, I outline details for the Fair Scheduler. “You have the choice of several job schedulers for MapReduce. Fair Scheduler, developed by Facebook, provides for faster response times for small jobs and quality of service for production jobs. Jobs are grouped into pools, and you can assign pools a minimum number of map slots and reduce slots, and a limit on the number of running jobs. There is no option to cap maximum share for map slots or reduce slots (for example if you are worried about a poorly written job taking up too much cluster capacity), but one option to get around this is giving each pool enough of a guaranteed minimum share that they will hold off the avalanche of an out-of-control job.” – Reference.
With the Capacity Scheduler, developed by Yahoo, jobs are submitted to queues and can be assigned priority levels. Within a queue, when resources become available, they are assigned to the highest-priority job. Note that there is no preemption: once a job is running, capacity that has already been allocated is not taken back.
To get started, determine the pools for each type of workload, e.g. marketing, science, production, dev. Next, allocate a minimum share to each pool and give it an appropriate weight. Last, consider turning on preemption for MapReduce, understanding that there is a slight penalty involved in killing running tasks (the time to restart a task, plus any progress that task made before being killed).
1. Define a filename for the allocation policy, e.g. allocations.xml
2. Add the following parameters to mapred-site.xml:
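A minimal sketch of the mapred-site.xml additions, using the MR1 Fair Scheduler property names; the allocation file path is an example, and the pool-name property is an optional assumption:

```xml
<!-- Plug the Fair Scheduler into the JobTracker -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<!-- Path to the allocation policy file from step 1 (example path) -->
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/etc/hadoop/conf/allocations.xml</value>
</property>
<!-- Enable preemption (see the note above on its cost) -->
<property>
  <name>mapred.fairscheduler.preemption</name>
  <value>true</value>
</property>
<!-- Optional: assign jobs to pools by an explicit pool.name job property
     rather than by submitting user -->
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>
```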
3. Define pools based on the map and reduce slots available in the cluster.
Note: For pools that should not preempt other pools, do not specify a minSharePreemptionTimeout.
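A sketch of an allocations.xml along these lines; the pool names, slot counts, and timeout values are illustrative, while the element names follow the MR1 Fair Scheduler allocation file format:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Production: guaranteed minimum share, higher weight, allowed to preempt -->
  <pool name="production">
    <minMaps>20</minMaps>
    <minReduces>10</minReduces>
    <weight>2.0</weight>
    <!-- May preempt other pools after 60 seconds below its min share -->
    <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
  </pool>
  <!-- Dev: no minSharePreemptionTimeout, so this pool never preempts others -->
  <pool name="dev">
    <minMaps>5</minMaps>
    <minReduces>2</minReduces>
    <maxRunningJobs>5</maxRunningJobs>
  </pool>
  <!-- Cluster-wide timeout (seconds) before preempting to reach fair share -->
  <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
</allocations>
```

Keep the sum of the pools' minimum shares below the cluster's total slot count, per the points below.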
4. Restart JobTracker.
5. View JobTracker Dashboard and observe new queue names.
6. View Fair Scheduler Dashboard: http://localhost:50030/scheduler
7. View Advanced Fair Scheduler Dashboard: http://localhost:50030/scheduler?advanced
Points to Consider:
- Killed tasks don’t count as failures. There are separate “killed” and “failed” task states.
- The preemption code will never kill a task that would cause another pool to start needing preemption. In addition, to ensure that all pools can meet their service levels, the minimum shares you set for the pools should not add up to more than the total number of slots in the cluster.
- Take into consideration that dead nodes subtract from minimum shares. Do not commit 100% of cluster slots, since nodes will fall out of the configuration due to failure or maintenance.
- If a job has been below its min share for longer than its min share preemption timeout, it will kill exactly enough tasks from other jobs to reach its min share. Then, after the fair share preemption timeout, it will kill exactly enough tasks to reach its fair share. Note that it will only kill tasks from jobs that are running above their fair share.
- Changes to allocations.xml are dynamic, and usually picked up in 5-10 seconds.
- The heartbeat interval should be checked and tuned if the cluster has idle slots relative to its workload.
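One way to reduce slot idle time on MR1, sketched here with the property name used in Hadoop 0.20/CDH3 (verify it exists in your distribution before relying on it):

```xml
<!-- Let a TaskTracker send a heartbeat as soon as a task finishes,
     rather than waiting for the next regular heartbeat interval,
     so freed slots are rescheduled sooner -->
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>true</value>
</property>
```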