\section{Proposed Solutions}
Our vision for meeting the challenge of computing needs that grow beyond what is affordable via a simple
Moore's-law extrapolation is threefold. First, we will gain efficiency by being more agile overall in how we use the traditional
FNAL-based Tier-1 and seven university-based Tier-2 center resources. Second, we will grow the resource pool by more tightly
integrating resources at all other US-CMS universities, DOE and NSF supercomputing centers, and commercial cloud providers,
as much as possible. And third, we will pursue an aggressive R\&D program towards improvements in software algorithms, data formats,
and procedural changes in how we analyze the data we collect and simulate, in order to significantly reduce the computing needs.
In the following, we start by explaining why the needs grow faster than what is affordable with constant hardware budgets, followed
by the concrete steps towards accomplishing the three goals mentioned above.
{\bf Q by fkw: How much of what we have in this section below is really needed? Can we move some of it into the earlier sections
where we describe Run 1? E.g. all of the reasons why the T2 model makes sense could maybe go into the Run1 section of the proposal.
This way the "Proposed Solutions" section is focused really on the solutions rather than justifying the need for T2s.}
{\bf Q by fkw: I'd like us to consider restructuring this entire section into 4 subsections. First the computing needs, and then the three themes
mentioned in the intro text above. These would be section headings like:
"Towards Increased Agility", "Growing the Resource Pool", "R\&D towards decreasing the resource needs" or something like that.
This would result in a total of 4 subsections only for this section.
I think this will allow us to put all the content in that we want, but in a way that is much more tightly
integrated into an overall theme and vision.}
Our vision for the US CMS computing environment for 2017--21 and beyond comprises
strengthening the current FNAL-based Tier-1 and seven university-based
Tier-2s, all functioning as portals to the nation-wide research and commercial
clouds. These service providers will also enable all US universities to
connect to the seamless CMS cloud by providing them with suitable
``headnodes'', democratizing access to CMS computing. Resource
provisioning targets steady-state operations at the owned facilities, the
FNAL Tier-1 and the university Tier-2s, while peak fluctuations are handled by
overflowing to the clouds. Non-owned opportunistic resources at all
campuses are integrated into the CMS cloud.
The advantages of strengthening the university sites are manifold:
\begin{itemize}
\item Each university group brings unique experience and expertise to bear
\begin{itemize}
\item MIT: Dynamic data management and production operations expertise
\item Nebraska: Dr.\ Bockelman et al.\ have brought numerous innovations to CMS middleware
\item San Diego: Connections to SDSC, Connections to core CMS software developers
\item Wisconsin: Connections to HT-Condor and OSG core-developers
\end{itemize}
\item Connection to strong physics groups at the universities
\begin{itemize}
\item Student and postdoc physics analysts exercise the system, providing
appropriate use cases for tuning.
\item Faculty collaborations at the University level can bring in additional
campus or cloud resources
\end{itemize}
\item Infrastructure costs are subsidized at the universities.
\item Personnel costs are also lower.
\item Friendly competition among the sites results in increased productivity.
\end{itemize}
\subsection{Storage Resources}
The storage resource requirements are estimated by scaling the current data set
of 30 fb$^{-1}$, acquired in 2010--2012 (Run-1) and 2015 (early Run-2), to the
full dataset of 300 fb$^{-1}$. MiniAOD usage reduces the needed resources
significantly, as indicated earlier. However, it poses data-versioning
issues that result in multipliers. The user data storage space is somewhat
ill defined, but it does scale with the luminosity; we estimate it by scaling the
current usage at the US T2s by the luminosity ratio.
\begin{itemize}
\item 20-30 PB for MiniAOD (disk for one copy)
\begin{itemize}
\item 40 PB for MiniAOD for analysts including replication.
\end{itemize}
\item 10 PB for AOD (for a fraction of datasets)
\begin{itemize}
\item 10 PB fraction should satisfy users requiring full AOD access.
\end{itemize}
\item Currently, user space at the T2s is typically 0.5 PB with 10\% of the data acquired, i.e., 30 fb$^{-1}$.
Projecting to the full dataset, one gets 5 PB per T2.
\begin{itemize}
\item 30 PB of storage for user data (non-archived)
\end{itemize}
\item CMS upgrade design tuning requires custom simulations, which are much more
storage resource intensive than Run-2/3 data. The pileup level will be an order of
magnitude higher.
\begin{itemize}
\item 20 PB of storage for upgrade data
\end{itemize}
\item Analysis workflows and improvements to MiniAOD require access to RAW data
\begin{itemize}
\item 60 PB for RAW (tape archive) for the full 300 fb$^{-1}$
\item 10 PB for RAW for a special-analyses cache
\end{itemize}
\item Total Disk Storage: 110 PB
\item Proposed approximate disk storage distribution: 30 PB disk and 110 PB tape at the T1, and 12 PB each at the seven US T2s
\end{itemize}
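The disk budget above follows from a simple luminosity-ratio scaling. As a sanity check, the arithmetic can be sketched in a few lines of Python; the figures are the ones quoted in the list, while the grouping and variable names are ours:

```python
# Disk-storage bookkeeping for the luminosity-scaled estimate (all in PB).
# Scale factor: full dataset (300 /fb) over the current dataset (30 /fb).
FULL_LUMI_FB = 300
CURRENT_LUMI_FB = 30
scale = FULL_LUMI_FB // CURRENT_LUMI_FB  # = 10

# Per-T2 user space: 0.5 PB today, scaled by the luminosity ratio.
user_space_per_t2_pb = 0.5 * scale       # 5 PB per T2

disk_pb = {
    "MiniAOD for analysts, incl. replication": 40,
    "AOD fraction": 10,
    "user data, non-archived": 30,       # ~5 PB x 7 T2s, rounded down
    "upgrade simulations": 20,
    "RAW special-analyses cache": 10,
}
total_disk_pb = sum(disk_pb.values())
print(total_disk_pb)  # 110, matching the quoted total (tape RAW excluded)
```

Note that the 60 PB RAW tape archive is excluded from the disk total, consistent with the list above.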
\subsection{Computing Resources}
Unlike for Run-1, the Run-2/3 Tier-2 compute resource provisioning is deliberately not
specified in detail, to allow local optimization based on cloud and opportunistic
resource availability. However, since non-owned storage resource use is not yet
well defined, we envision that most of the storage, pared
down to the minimum with the improvements made in LS1, is owned and operated at
the Tier-1 and Tier-2s.
The CPU requirements are estimated, in units of the number of job slots needed, by scaling current
usage up by the expected increase in integrated luminosity. The
increase in the complexity of analysis is assumed to be compensated by improved
framework job efficiency and advances in the computing power of individual machines.
Use of the CMS data federation across the wide-area network at owned resource sites,
and at the clouds in general, opens up the possibility of provisioning the needed compute
resources flexibly depending on cost/benefit. However, we envision
that a fraction of the resources remains housed at the existing T2s.
\begin{itemize}
\item Currently 30,000 jobs, averaged over the past month, run at the seven
US T2s, split equally between production and analysis, with 10,000 production
jobs at the FNAL T1.
\item The bulk of the 15,000 analysis jobs recently running at the Tier-2s are
identified as 13-TeV MC jobs. The 2015 MC was generated in
anticipation of collecting 10 fb$^{-1}$ this year. Unfortunately,
we collected only 2 fb$^{-1}$. Nevertheless, the present analysis effort
is equivalent to 10 fb$^{-1}$.
\item Therefore, scaling by luminosity (300 fb$^{-1}$ versus the 10 fb$^{-1}$ equivalent
to date), we should expect to support about 900,000 jobs at the T2s in steady state
and 300,000 jobs at the T1.
\item Proposed job-slot availability: 300,000 for production at the T1 and 100,000 through each of the seven US T2s.
\end{itemize}
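The job-slot projection above can likewise be sketched as a short Python check; the inputs are the figures quoted in the list, and the variable names are ours:

```python
# Job-slot projection by luminosity scaling, using the figures above.
current_t2_jobs = 30_000   # month-averaged jobs at the seven US T2s
current_t1_jobs = 10_000   # production jobs at the FNAL T1
effective_lumi_fb = 10     # 2015 MC corresponds to a 10 /fb analysis effort
full_lumi_fb = 300         # target integrated luminosity

scale = full_lumi_fb // effective_lumi_fb       # = 30
t2_steady_state_jobs = current_t2_jobs * scale  # 900,000 at the T2s
t1_steady_state_jobs = current_t1_jobs * scale  # 300,000 at the T1
print(t2_steady_state_jobs, t1_steady_state_jobs)
```

The proposed 100,000 slots at each of the seven T2s (700,000 in total) then covers the bulk of the projected 900,000-job steady state, with the remainder overflowing to cloud and opportunistic resources as described above.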
\subsection{Network Resources}
The network bandwidth requirement will also scale with the increased data size
and wide-area distributed computing. Sites are typically connected at
100 Gbps today; we envision multi-100-Gbps connections
in the coming years, funded through separate initiatives.
\subsection{Non-traditional resources beyond Tier-1 and Tier-2}
\noindent{(Kevin)}
In the last few years, NSF-ACI has made substantial investments in networking infrastructure across more than 100 campuses nationwide, among them 25 of the 40 collaborating universities in US-CMS. We propose to build on this NSF investment
by working with all of them, as well as any of the remaining 15 universities that are interested, to fully integrate their campus-IT-operated hardware infrastructures and ScienceDMZs into the US-CMS Tier-2 infrastructure. This will be done following the model of the NSF-funded
``Pacific Research Platform'' (PRP), using Open Science Grid (OSG) tools and processes.
The PRP deploys single nodes into the ScienceDMZs
of 20 institutions across the West Coast, including the US-CMS institutions UC Davis, UC Santa Barbara, UC Riverside, Caltech, and UC San Diego. This hardware is maintained jointly by the campus IT organizations and the PRP and SDSC teams
at UCSD:
local IT is responsible for hardware and user-account maintenance, and UCSD is responsible for all OS and software service maintenance.
The functionality implemented includes interactive data analysis, batch submission, a CVMFS software cache, an XRootd data cache, and
an XRootd server to export local data. The hardware is effectively a ``Tier-3 in a box'', with none of the human maintenance burden falling on the
local CMS community. The deployment model includes careful custom integration with any existing university clusters accessible to the local group. This is manageable with minimal effort beyond the initial deployment because the OS, the US-CMS and OSG services, and the local configurations are managed via a central Puppet infrastructure at UCSD.
The local CMS community is thus empowered to transparently use any and all local resources the university allows them to share, in
combination with the entire Tier-1 and Tier-2 system. Official CMS data is cached locally as needed. Private data from the local community is served out to the Tier-1 and Tier-2 system via XRootd servers. Each Tier-2 will also have an XRootd cache in order to transparently
cache the private data of any of the local communities, avoiding the I/O latencies of WAN reads imposed by the finite speed of light.
The HTCondor batch systems implemented on this hardware are all connected to the global CMS HTCondor pool via
glideinWMS. Similarly, any university clusters are integrated, requiring nothing more than ssh access to a US-CMS account on the
local university cluster. Sharing policies are controlled locally, following the local rules at each university. We expect that some universities
will let all of US-CMS share their spare capacity, while others will be more restrictive. All of this is
already deployed and operated by PRP and SDSC for the US-ATLAS group at UC Irvine, and is being deployed at the CMS institutions listed above. Operations for the UC groups are funded via a mix of NSF and state funds. We propose to scale out deployment and operations of this model across the US to as many US-CMS institutions as possible, focusing on the 25 institutions that have received ScienceDMZ funding
from NSF-ACI since 2012. The hardware costs, as well as the human effort to deploy and operate this system,
will be borne by the Tier-2 portion of this proposal. At a cost of $\sim$\$10,000 per ``Tier-3 in a box'', this is a modest fraction of the total
Tier-2 hardware budget across the seven Tier-2s and the five years of this proposal.
We fully understand that the above model will not be appropriate for all 40 collaborating institutions within US-CMS. We thus augment
it with an additional hosted service built on the OSG-Connect model pioneered by the University of Chicago OSG/ATLAS group.
This service will provide functionality identical to the ``Tier-3 in a box'' for institutions that lack either appropriate network connectivity
or a local IT organization capable of, and willing to, collaborate on hardware and user-account maintenance.
There will be only a single instance of this ``CMS-Connect'' infrastructure for all of these remaining groups. Groups within US-CMS are therefore generally better off with a ``Tier-3 in a box'', especially when they have sizable private data collections and large groups of students and postdocs.
Finally, we will fully integrate cloud services access into this infrastructure in such a way that local university groups can use local funds to
purchase cloud resources to augment their personal access to computing, and thus accelerate their science. We expect to
collaborate on this functionality with the HEPCloud project at FNAL as well as the Open Science Grid.
In addition to all of the above functionality geared towards data analysis, we propose to also integrate supercomputing resources at DOE- and NSF-funded national facilities, mostly for simulation and reconstruction, i.e., the production of the official CMS datasets.
Again, we expect to collaborate heavily with HEPCloud and OSG on the detailed access mechanisms and policies.
At this point (December 2015), HEPCloud is focused on AWS, while OSG is working with Comet (NSF) and Cori (DOE) to understand
the technical, operational, and security processes for using these supercomputers via OSG interfaces.
\subsubsection{Opportunistic}
\subsubsection{Commercial}
\subsection{Middleware / Software}
\subsection{Support personnel roles}
Two persons at each facility are necessary to provide full coverage. However,
recent experience indicates that about 30--50\% of their effort can be
freed up for other work. Most of the people involved in CMS computing
are former HEP physicists who have become computing experts; they
can provide wide-ranging expertise in physics software development.
The additional services we expect Tier-2 personnel to provide are in the following areas:
\begin{itemize}
\item Support for non-Tier-2 university portals to CMS cloud
\begin{itemize}
\item We expect each Tier-2 to support about seven universities in its neighborhood.
\end{itemize}
\item Computing services for CMS upgrades and research to address future needs
\begin{itemize}
\item Development of simulation program for upgrade detectors
\item Production of simulation data for upgrade detectors
\item Participation in computing research
\item Participation in DIANA/HEP and other community-wide computing projects for the future.
\end{itemize}
\end{itemize}
\subsection{Physics Driven Datapaths}
\noindent{(Frank)}
Ideas for processing throughput increase
\begin{verbatim}
Datasets of trigger paths.
On-demand processing for some paths, etc.
\end{verbatim}