Gap analysis
------------
This is a gap analysis between the current system and the SNAC2
prototype. Perhaps this material belongs in the Required and Planned
Functionality section below.
Data maintenance
----------------
A goal of the pilot phase is to demonstrate cooperative maintenance of
the data resource.  The prototype does not have robust support for
maintaining the corpus of EAC-CPF identity documents.
While the current architecture supports the incremental addition of new
records through input to the match/merge system, there is no way to
directly create a record for a known new identity.  We must support a
workflow in which an archival description operator looks up identities
in the system and creates a new record if the identity they are trying
to reference does not exist.^[[g]](#cmnt7)^ The new
maintenance system will allow creation of a new identity record via the
web interface.
With the current architecture, textual information that comes from the
un-merged EAC-CPF documents - such as biographical history notes - could
be maintained by editing the un-merged EAC-CPF source and re-exporting
the merged EAC-CPF document.  New information could be added to the
merged documents by adding another un-merged document into the input
directory.  New un-merged EAC-CPF documents containing ARKs could be
matched directly, bypassing the match/merge searching.  It is less clear
how factual information that comes from VIAF could be modified in the
current prototype.
Early research indicates the most important section of a merged EAC-CPF
document is the section that contains links to primary and secondary
research materials about the named identity. The overarching goal of
archival description is to enable researchers to find materials relevant
to their interests. In the current prototype, the only mechanism to add
a new link into the “Archival Collections” (a more accurate label would
be “Archival Materials”) or “Related Resources” sections of links would
be to generate an un-merged EAC-CPF record with the links and run it
through the match/merge system.
The current prototype maintains information about relationships between
the entities described by the merged EAC-CPF documents.  Currently,
“associated with” and “corresponded with” are the only two relationship
types supported.  Early research suggests that archival description
practitioners are interested in using a more extensive vocabulary of
relationship types.
[ problem of linking to EAD if the URLs are not known ]
    •    establish procedures for accepting more batch submissions
    •    XTF is not designed to handle so many documents; XTF cannot
be run in a “clustered” mode; it must scale up, not scale out
    •    Cheshire II does not have an Open Source Initiative certified
license
Pilot phase architecture
------------------------
Alternative 1^[[h]](#cmnt8)^
----------------------------
The most expeditious way to launch a pilot phase would be to leave the
basic technical architecture of the prototype in place, and to focus
initial energies on establishing policies and procedures that work
within the constraints of this architecture.  Two key systems that would
need to be set up for this approach to work are a customer relationship
management (CRM) system and a ticketed help desk.
Customer relationship management systems have historically been used as
a sales support tool.  Information on current and potential customers,
including contact information and institutional affiliation, is
maintained in a database.  All pilot member institutions and designated
contacts should be entered into a CRM system for the pilot phase.  All
correspondence, calls, contracts, and agreements with accepted and
potential pilot phase members should be logged or stored in the CRM
system.^[[i]](#cmnt9)^
The CRM system should support or integrate with a help desk that issues
work ticket numbers. ^[[j]](#cmnt10)^ Any addition or change in the
maintained corpus of merged EAC-CPF records will require a work ticket
number.  Expectations for response times for issued tickets should be
established, clearly communicated, and measured for compliance.  A
customer service manager^[[k]](#cmnt11)^ will actively monitor the queue
of work tickets pending.  An operations manual will be maintained so
that the customer service manager or any additional first tier support
staff will be able to handle a set of ticket types.  If a procedure for
the request is not yet documented in the operations manual - or if the
manual indicates this is a task for second tier - then the ticket will
be escalated to the second tier support programmer.  The second tier
support programmer will have the technical skills to manipulate the
technical infrastructure, such as by editing XML files or directly
altering the database.  The second tier support programmer would also be
responsible for performing data extraction and normalization of
non-EAC-CPF data sources processed during the pilot phase.^[[l]](#cmnt12)^ 
The volume and type of tickets will help set priorities for automating
procedures for first tier support, and for future phases in which pilot
members will not need to contact the help desk and obtain work tickets.
An automated way to establish a new identity should be established early
in the pilot phase, so that participants can mint a new ARK identifier
without creating a work ticket.  Initially, a work ticket would still be
generated once the participant was ready to submit the new record
through the match/merge process.
Given the importance of maintaining links from the merged EAC-CPF record
to related resources, a link harvesting protocol should be developed
early in the pilot phase.  When a pilot phase participant identifies a
match in SNAC with a name they have in a collection description, the
link harvesting protocol would specify how to publish that link in the
HTML display of their collection description or through some other
mechanism (perhaps through an extension to the sitemap protocol, along
the lines of how ResourceSync works).  Procedures would be established
to then notify SNAC to harvest links from the participant, and the SNAC
“related collections” section would be automatically updated.  Such
updates would be based on a “linked data” technology rather than the
submission of XML files.
Alternative 2
-------------
Pure XML architecture for edits: edit the merged EAC-CPF records,
perhaps with something like xEAC, with the merged files under revision
control. (This might make export from the match/merge system
challenging.)
Alternative 3
-------------
Pure RDF architecture
Current State Conclusion (All, Daniel, Tom)
-------------------------------------------
The current system functions well enough for researchers and other
stakeholders to see large data sets fully processed. These systems will
benefit from additional work to make them more mature in the usual ways
that software develops: robustness, testing and QA, documentation,
examples, and a consistent API. Most of the current software will be
used in the production system.
Required and Planned Functionality (All authors)
================================================
(We need to break out each item into UI functionality, and API
functionality.)
Expanded CPF schema requirements
--------------------------------
- Provenance and history of each element/attribute.
- Unique ID per element of CPF if that element is editable.
- Version control on a per-element basis.
Expanded Database Schema
------------------------
The current database (Postgres) is sufficient for the current project
only. It will expand, and the expansion will probably be fairly
dramatic. We need to determine what tables and fields are necessary to
support additional functions. Each section of this document may need a
“data” section, or else this database schema section needs to address
every functional and UI aspect of all APIs that have anything to do with
the database.
Each field within CPF may (will?) need provenance metadata. Likewise,
many fields in the database may need provenance data.
The database needs an audit trail down to a fairly granular (field)
level. The audit is a new table at the very least. It seems likely that
nearly every table will gain some audit-related fields.
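As a rough illustration of that granularity, a field-level audit table might look something like the following sketch (table and column names here are assumptions, not a final schema):
```sql
-- Hypothetical field-level audit table: one row per change, written by
-- application code or a trigger. Names are illustrative only.
CREATE TABLE audit_log (
    audit_id    BIGSERIAL   PRIMARY KEY,
    table_name  TEXT        NOT NULL,             -- table that was changed
    record_id   BIGINT      NOT NULL,             -- primary key of the changed row
    field_name  TEXT,                             -- NULL for whole-row operations
    old_value   TEXT,
    new_value   TEXT,
    changed_by  BIGINT      NOT NULL,             -- id from the planned user table
    changed_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
```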
Will database records be versioned? How is that handled? It seems like
it may be done via a versioning table and some interesting joins. We
need to evaluate the various standard methods for database-internal
versioning.
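One common approach, sketched here with illustrative names only, is a version table keyed on record id plus version number, with a join to pull the latest version:
```sql
-- Hypothetical versioning table: every revision of a description is kept.
CREATE TABLE cpf_version (
    record_id   BIGINT      NOT NULL,  -- stable id of the description
    version_no  INT         NOT NULL,  -- 1, 2, 3, ...
    data        JSONB       NOT NULL,  -- or one column per versioned field
    created_by  BIGINT      NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (record_id, version_no)
);

-- The "interesting join": fetch only the latest version of each record.
SELECT v.*
FROM cpf_version v
JOIN (SELECT record_id, max(version_no) AS version_no
      FROM cpf_version
      GROUP BY record_id) latest
  USING (record_id, version_no);
```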
CPF record has links to a “watch” table so users can watch each record,
and can watch for certain types of changes. Need UI for the watch
system. Need an API for the watch system.
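A minimal sketch of such a watch table (names are illustrative; the real schema would follow whatever the CPF and user tables end up being called):
```sql
-- Hypothetical watch table: one row per user per watched description.
CREATE TABLE watch (
    watch_id    BIGSERIAL   PRIMARY KEY,
    user_id     BIGINT      NOT NULL,                -- id from the planned user table
    record_id   BIGINT      NOT NULL,                -- watched CPF description
    change_type TEXT        NOT NULL DEFAULT 'any',  -- e.g. 'any', 'merge', 'split'
    active      BOOLEAN     NOT NULL DEFAULT TRUE,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (user_id, record_id, change_type)
);

-- After an edit, the notification step would look up the watchers of that record:
SELECT user_id FROM watch WHERE record_id = 42 AND active;
```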
Need a user table, a group table, and probably a group-permission table
so that permissions are tied to groups. We also want to allow
several permissions per group. Need UI for user, group, and
group-permission management.
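A sketch of how those tables could relate (all names are placeholders): users join groups, and permissions hang off groups, so a user gains privileges only through group membership.
```sql
CREATE TABLE app_user (
    user_id   BIGSERIAL PRIMARY KEY,
    username  TEXT NOT NULL UNIQUE,
    email     TEXT
);
CREATE TABLE app_group (
    group_id  BIGSERIAL PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE            -- e.g. 'Web admin', 'Maintenance'
);
CREATE TABLE user_group (                     -- a user may belong to several groups
    user_id   BIGINT NOT NULL REFERENCES app_user,
    group_id  BIGINT NOT NULL REFERENCES app_group,
    PRIMARY KEY (user_id, group_id)
);
CREATE TABLE group_permission (               -- several permissions per group
    group_id   BIGINT NOT NULL REFERENCES app_group,
    permission TEXT   NOT NULL,               -- e.g. 'edit', 'moderate', 'block_upload'
    PRIMARY KEY (group_id, permission)
);
```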
If we create a generalized workflow system (as opposed to an ad-hoc
linked set of reports) then we need workflow tables. The tables would
establish workflow paths, necessary permissions, and would be linked to
users and groups.
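If we do go the generalized route, the workflow tables might look roughly like this (purely illustrative; the real design depends on the workflow decisions above):
```sql
CREATE TABLE workflow_path (                   -- e.g. 'edit-review-post'
    path_id  BIGSERIAL PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
CREATE TABLE workflow_step (                   -- ordered steps within a path
    step_id             BIGSERIAL PRIMARY KEY,
    path_id             BIGINT NOT NULL REFERENCES workflow_path,
    step_order          INT    NOT NULL,
    required_permission TEXT   NOT NULL        -- e.g. 'moderate'
);
CREATE TABLE workflow_instance (               -- a description moving through a path
    instance_id  BIGSERIAL PRIMARY KEY,
    path_id      BIGINT NOT NULL REFERENCES workflow_path,
    record_id    BIGINT NOT NULL,
    current_step BIGINT REFERENCES workflow_step,
    assigned_to  BIGINT                        -- user or group responsible for the step
);
```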
Need fields to deal with delete/embargo. This may be best implemented
via a trigger or perhaps a view. By making what appear to be simple
SELECTs go through a view, the view can exclude deleted records. We must
think about how using a view (or trigger) will affect UPDATE and INSERT.
Ideally the view is transparent. Is there some clever way we can
restrict access to the original table so it is reachable only via the
view?
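A sketch of the view idea in PostgreSQL (table, view, and column names are assumptions): the base table keeps every row, the view hides deleted and embargoed rows, and application code only ever touches the view.
```sql
CREATE TABLE cpf_description (
    record_id    BIGSERIAL PRIMARY KEY,
    ark          TEXT,
    data         TEXT,
    is_deleted   BOOLEAN NOT NULL DEFAULT FALSE,
    is_embargoed BOOLEAN NOT NULL DEFAULT FALSE
);

CREATE VIEW cpf_public AS
    SELECT record_id, ark, data
    FROM cpf_description
    WHERE NOT is_deleted AND NOT is_embargoed;

-- In recent PostgreSQL versions a simple single-table view like this is
-- automatically updatable, so UPDATE and INSERT through the view work, and rows
-- hidden by the WHERE clause cannot be modified through it. Access to the base
-- table can then be restricted so the view is the only path in:
REVOKE ALL ON cpf_description FROM PUBLIC;
GRANT SELECT, INSERT, UPDATE ON cpf_public TO PUBLIC;  -- or a dedicated application role
```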
Need record locks on some types of records. This lock needs to be
honored by several modules, so, like “delete”, the lock might best be
implemented via a view, and we \*only\* access the table in question via
the view.
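Continuing the illustrative cpf_description sketch above, a lock could follow the same pattern: a lock column on the base table, plus a view that exposes only unlocked rows for modification.
```sql
ALTER TABLE cpf_description
    ADD COLUMN locked_by BIGINT,       -- user holding the lock; NULL means unlocked
    ADD COLUMN locked_at TIMESTAMPTZ;

CREATE VIEW cpf_editable AS
    SELECT record_id, ark, data
    FROM cpf_description
    WHERE NOT is_deleted AND locked_by IS NULL;
```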
If there are different levels of review for different elements in the
record, then we need extra granularity in the workflow or the edited
record info to know the type of record edited apropos of workflow
variations.
If there are different reviewers for different parts of the record, then
workflow data (and workflow configuration) needs to be able to notify
multiple people, and would have to get multiple reviewer approvals
before moving to the next phase of the workflow.
Institutional affiliation is probably common enough that we want a field
in the user table, as opposed to creating a group for each institution.
The group is perhaps more generalized and could behave identically (or
almost identically) to a field (with controlled vocabulary) in the user
table.
Make sure we can write a query (report) to count numbers of records
based on type of edit, institution of the editor, and number of
holdings.
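A hypothetical version of that report, assuming the audit table gains an edit_type column and the user table an institution field (both only proposed in this section); a holdings count would require joining a holdings table, not shown here:
```sql
-- Count edited records by type of edit and by the editor's institution.
SELECT a.edit_type,
       u.institution,
       count(DISTINCT a.record_id) AS records_edited
FROM   audit_log a
JOIN   app_user u ON u.user_id = a.changed_by
GROUP  BY a.edit_type, u.institution
ORDER  BY u.institution, a.edit_type;
```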
If we want to be able to quickly count some CPF element, such as
outgoing links from CPF to a given institution, then we should put those
CPF values into the SQL database as metadata for the CPF record.
What is: How many referral links to EAC records that they created?
Be able to count record views, record downloads. Institutional dashboard
reports need the ability to group-by user, or even filter to a specific
user.
Reporting needs to help managers verify performance metrics. This
assumes that all changes have a date/timestamp. Once workflow and
process decisions are set, performance requirements need to be
established for users, such as load/performance (how many updates and
changes to records can be handled at once), search response time, edit
time (outside of review workflow), and update times.
Effort reporting to allow SNAC and participants to communicate to others
the actual level of effort involved. This sounds like a report with time
span and numbers of records handled in various ways. SNAC might use this
when going from pilot into production so that everyone knows what effort
will be required for X number of records/actions (of whatever action
type).
Time/activity reporting could allow us to assess viability, utility, and
efficiency of maintenance system processes.
Similar reports might be generated to evaluate the discovery interface:
something akin to how much time was required to access a certain number
of records. Rachael said: assess viability of access functionality,
including performance time, available features, and ease of use.
We could try to report on the amount of training necessary before a new
user was able to work independently in each of various areas (content
input, review, etc.)
Introduction to Planned Functionality
-------------------------------------
The current system works, but is somewhat skeletal. It requires careful
attention from the developers to run the data processing pipelines. It
lacks administrative controls and reporting. The existing software
development process follows modern agile practices, but some processes
are weak or incomplete. The research tools are somewhat rudimentary. The
system needs infrastructure where domain experts can correct and update
merged authority descriptions.
The functional requirements below specify in detail all of the
capabilities of the new [production?] system. A separate section about
user interface (UI) specifies the visual/functional aspects of the UI
and includes discussion of the user experience (UX). Some of the
functional requirements exist only to support actions of the UI, and
UI-related functions should exist in their own independent API.
Software development, processes, and project management
-------------------------------------------------------
Choices for programming languages, operating system, databases, version
control, and various related tools and practices are based on extensive
experience of the developer community, and a complex set of requirements
for the coding process. Current best practices are agile development
using practices that allow programmers wide leeway for implementation
while still keeping the processes manageable.
Test-driven development ideally means automated testing, with careful
attention to regression testing. It takes some extra time up front to
write the tests. Each test is small and corresponds to a small section
of code, so both code and test can be quickly created. In this way,
the software is kept in a working state with only brief downtimes during
feature creation or bug fixes. Large programs are made up of
intentionally small functions, each of which is tested by a small
automated test.
Regression testing refers to verifying that old bugs do not reappear.
Every bug fix has a corresponding test, even if the function in question
did not originally have a test for the bug. Each new bug needs a new
test. Bugs frequently reappear, especially in complex sections of code.
Source code version control is vital to both the development process and
the release process. During development, frequent small changes are
checked in to version control, along with meaningful comments. The
history of the code can be tracked, which occasionally helps in
understanding how bugs came into existence. In the Git system, the
per-line history command is “blame”, a bit of programmer dark humor: the
history is used to know whom to blame for a bug (or any undesirable
feature).
Moving code into Quality Assurance (QA) and then into the production
environment is integral to source code management. Many version
control systems allow tagging a release with a name. The collected
source code files are marked as a named (virtual) collection, and can be
used to update a QA area. Human testing and review happens in QA. After
QA we have release. Depending on the nature of the system, release can
be quite complex, with many parties needing to be notified, and
coordination across groups of developers, sysadmins, managers, support
staff, and customers. Agile development tends towards small, seamless
releases on a frequent (weekly or monthly) basis, where communication is
primarily via updates to electronic documentation. The process needs to
assure that fixes and new features are documented. The system must have
tools to see the current version of the system with its change log, as
well as to compare it to previous releases. All of these are integrated
with change management.
Bug reporting and feature requests fall (broadly speaking) into the
category of change management. Typically a small group of senior
developers and stakeholders review the bug/feature tracking system to
assign priorities, clarify, and investigate. There are good
off-the-shelf systems for tracking bugs and feature requests, so we have
several choices. This process happens almost as frequently as the
features/bug fix coding work of the developers. That means on-going,
more or less continuous review of fix/features requests every few days,
depending on how independent the developers are. Agile applies to
everyone on the project. Ideal change management is not onerous. As
tasks are completed, developers update feature status with “in
progress”, “completed”, and so on. There might be additional status
updates from QA and release, but SNAC probably isn’t large enough to
justify anything too complex.
QA and Related Tests for Test-driven Development (Tom, Brian, Ray)
------------------------------------------------------------------
The data extraction pipelines manage massive amounts of data, and
visually checking descriptions for bugs would be inefficient if not
infeasible. The MARC extraction process is verified by just over 100
quality assurance descriptions. The output produced from each
description is checked for some specific value that confirms that the
code is working correctly and historical bugs have not reappeared. The
EAD extraction has a set of QA files, but the output verification is not
yet automated. A variety of file counts and measures of various sorts
are performed to verify that descriptions have all been processed. All
CPF output is validated against the Relax NG schema. Processing log
files are checked for a variety of error messages. Settings used for
each run are recorded in documentation maintained with the output files.
The source code is stored in a Subversion repository.
Our disaster recovery processes must be carefully documented.
The match/merge process is validated by …
Required new features
---------------------
The majority of new features will be in two areas: the maintenance
system and the administration system. None of this code exists yet. The
maintenance system has a web UI and a server-based back end that
interacts with the same database used by the match/merge. The
maintenance system also requires an authentication system (login) that
allows us to manage the extensive collaborative efforts. The current
processing of data is accomplished only on servers at the command line,
and is handled directly by project programmers. In the new maintenance
system, that work will be driven by content experts via a web site, and
the system must therefore address the authentication and authorization
issues inherent in collaborative data-manipulation web applications.
The system will require reports. These will cover broad classes of
issues related to managing resources, usage statistics, administration,
maintenance, and some reports for end user researchers.
(Fill in prose introducing the other subsystems such as reporting)
One important aspect of the project is long-term viability and
preservation. We should be able to export all data and metadata in
standard formats. Part of the API should cover export facilities so that
over time we can easily add new export features to support emerging
standards.
The ability to export all the data for preservation purposes also gives
us the ability to offer bulk data downloads to researchers and
collaborating peer institutions.
Documentation (all authors)
---------------------------
Every aspect of the system requires documentation. Most visible to the
public is the user interface for discovery. Maintenance will be
complicated, and our processes are somewhat novel, so the maintenance
documentation will need to be extensive, well illustrated with
screenshots, and carefully tested. Documentation intended for developers
might be somewhat sparse by comparison, but will be critical to the
on-going software development process. All the databases, operating
systems, httpd and other servers need complete documentation of
installation, configuration, deployment, starting, stopping, and
emergency procedures.
It is probably wise to choose a wiki-like documentation system at the
outset of the project.
TAT Functional Requirements
Authors
=======
Tom Laudeman, Technical lead, University of Virginia, Institute for
Advanced Technology in the Humanities
[twl8n@virginia.edu](mailto:twl8n@virginia.edu)
Brian Tingle, Technical Lead for Digital Special Collections, California
Digital Library
Rachael Hu, User Experience Design Manager, California Digital Library
Ray R. Larson, U.C. Berkeley - School of Information
(other authors add yourselves here)
### Discussion items:
What are .c and .r files in the merged data?
If a .c file is creatorOf (presumably a resourceRelation), where is that
preserved in the merged data? Check this out vis-à-vis the CPF SQL db.
Edit UI data field validation API
Most data entry needs validation, so we should plan for a validation
layer that interacts with the UI and with the database. The ideal
architecture is rule-based validation as opposed to some hard-coded ad
hoc system. It would be even better if the rules were saved to the
database and had a UI of their own, allowing non-programmers to update
data validation rules (and the concomitant messages shown to users when
there is a validation error).
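A minimal sketch of database-stored validation rules (all names and the regex example are assumptions): each rule ties a field to a check and to the message shown in the edit UI when the check fails, and a rule-editing UI would maintain these rows.
```sql
CREATE TABLE validation_rule (
    rule_id       BIGSERIAL PRIMARY KEY,
    field_name    TEXT    NOT NULL,                 -- e.g. 'exist_date'
    rule_type     TEXT    NOT NULL DEFAULT 'regex',
    rule_value    TEXT    NOT NULL,                 -- e.g. '^[0-9]{4}(-[0-9]{2}){0,2}$'
    error_message TEXT    NOT NULL,                 -- shown to the user on failure
    active        BOOLEAN NOT NULL DEFAULT TRUE
);

-- The validation layer would load the active rules for a field and apply them;
-- a regex rule can even be checked in PostgreSQL itself with the ~ operator.
SELECT '1803-05-28' ~ rule_value AS passes, error_message
FROM   validation_rule
WHERE  field_name = 'exist_date' AND active;
```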
#### Edit architecture requirements
Daniel proposes a plan (which implies important requirements) that human
edits are applied to a serialized description, and after the first human
edit, the description is always maintained inside the system in the
serialized form. Prior to edits, a description consists internally of
one or more CPF records which are serialized in real time via a specific
blending algorithm for display/viewing. The edit UI displays the
serialized description as it would be viewed in the public discovery web
page. After the first human edit there is no further need to serialize,
so we would disable serializing. (If serializing is disabled after human
edits, does this impact any other real-time rendering features or
formatting that are part of the serializing process? If so, these
processes must also be applied to the post-human-edit description in
real time.)
Prior to human edits, merged records can be algorithmically split by the
computer, assuming we write code to perform such a split. After human
edit, a description split must be performed by a human. Daniel proposes
that all previous versions can be viewed (read-only) during the
human-mediated split operation so the human can refer back to previous
information.
After human edits, rollback only applies to human edited versions. There
is a fire-break where rollback cannot cross from human edits back to
machine-merged descriptions.
#### Merge and watch
If a file is being watched, and that file is part of a description
(merged or single), then the watch will apply to the results of human
edits, regardless of which part of the description was modified. It is
possible for someone to wish to track a biogHist, but that biogHist
could be completely removed in lieu of an improved and updated
description. We do not track individual elements in CPF. We only track
an entire description, regardless of the watcher’s motivation. The
original motivation for watching might no longer exist after an edit,
and if so, the watcher can simply disable their watch. After each edit,
all watchers will get a notification. The watch does not apply to any
single field, but to the entire description, and therefore also to
future descriptions which result from merging.
What happens to a watch on a merged description which is subsequently
split? Does the watch apply to both split descriptions or to neither?
Perhaps it is best to disable the watch and inform the watcher to
re-apply the watch to a specific record, along with links and helpful
info to make it easy to add the new watch.
#### Brian’s API docs need to be merged in or otherwise referred to:
[https://gist.github.com/tingletech/4a3fc5f59e5af3054286](https://gist.github.com/tingletech/4a3fc5f59e5af3054286)
### Not sure where to fit these topics into the requirements. Some of them may not be part of technical requirements:
Consider implementing a linked data standard for relationship links
instead of having to download an entire document of links (as it is
configured now).
Sort by common subject headings across all of SNAC - right now SNAC has
subject headings that have been applied locally without common practice
across the entire corpus.
Sort by holdings location. Sort by identity's activity location. Sort
and visualize a person through time (show dates for events in a person
or organization's lifetime). Sort and visualize an agency or
organization as it changes over time.
Continue to develop and refine context widget.
Sort collection links. Add weighting to understand which collections
have more material directly related to the identity. (How is this best
handled: programmatically, as an input by contributors, or maybe both?)
Increase exposure of SNAC to the general public by leveraging
partnerships. A suggested agreement with Wikipedia would display
Wikipedia content in the SNAC biographical area and work with Wikipedia
to allow links to SNAC at the bottom of all applicable identities. This
would serve to raise SNAC’s profile and drive traffic to the site.
Introduction (Tom, Ray, Sara, Brian, Daniel)
============================================
Social Networks and Archival Context (SNAC) is a Mellon-funded project
to aid end-user researchers in discovering, locating, and using
distributed historical record descriptions, especially as they relate to
corporate bodies, persons, and families (CPF). These descriptions are
often in finding aids, and they often exist in electronic format. They
are distributed across many geographical locations and many networks.
SNAC brings all this data together in a central system, while retaining
links to the original descriptions. Critically, SNAC attempts to merge
descriptions for the same [matching?] CPF identities, linking those
descriptions to a single authoritative name.
^[[a]](#cmnt1)^^[[b]](#cmnt2)^
We have an existing system (SNAC one?) and need additional work to get
to a new system (SNAC 3?), so part of this document is gap analysis. The
scope of this document is to outline technical specifications and
requirements for a production system for the Cooperative
phase^[[c]](#cmnt3)^ of SNAC. This production system will handle
ingestion, processing, matching/merging, discovery, and dissemination of
archival descriptions that are submitted and added to the Cooperative.  
Evaluation of Existing Technical Architecture
=============================================
Overview (All authors)
----------------------
This section describes the existing technical architecture; later
sections describe the required functionality for the production system
for the Cooperative.
Many of the archival records that are ingested in SNAC are Encoded
Archival Context - Corporate bodies, Persons and Families (EAC-CPF,
hereafter CPF) records. EAC-CPF is an XML schema endorsed as a standard
by the Society of American Archivists. We speak of CPF descriptions in
the sense of a “computer record”: often a single text file and not a
“record” in the archival sense.
“Linked data” technology related to the Resource Description Framework
is also employed to manage some controlled vocabularies in the project.
The current system consists of three main components: extraction,
match/merge, and discovery. Extraction consists of extracting data from
incoming archival description records (primarily EAD, MARC21, and some
other unique formats) to create CPF descriptions. Match/merge processes
the CPF descriptions in search of name matches and merges well-matched
descriptions. The resulting data set includes merged descriptions and
descriptions with no matches (called singletons), all in a single
database. Discovery covers discovery and dissemination of the data via a
web application.
The production system will have two additional components: maintenance
and administration. Maintenance includes manual corrections, such as
correcting data within a description, splitting incorrect merges,
merging descriptions for the same CPF identity, and description embargo
(embargo hides descriptions from public view for either technical or
administrative reasons). Administration is the typical management of
users, accounts, and reporting on the state of the system.
The first two phases of data processing are extraction and match/merge.
A database of descriptions, both merged and unmerged, is the end
result^[[d]](#cmnt4)^. The process of ingesting extracted data and
merging will continue for the life of the project. An extensive
web-based search engine lets users discover descriptions.
We use the term “merged” loosely when applied to the automated system,
since the final database may contain descriptions which should be
merged, but which a computer is unable to reliably determine to be the
same.  We take a conservative approach, preferring to merge only
descriptions that a computer program can match with high
accuracy.^[[e]](#cmnt5)^ Even so, some descriptions will have been
incorrectly merged, and thus the need for a (future) maintenance system
that allows manual splitting of descriptions, among other things.
Both Extraction and Match/merge are script-based, batch-processing,
semi-automatic processes managed entirely by software engineers.
Discovery and Maintenance are both web applications with extensive
public user interfaces intended for researchers. Administration is done
mostly via a non-public web application.
Extraction and match/merge are well developed, although we have some
planned improvements. Discovery is well developed, but existing features
are being refined, and adding new features is on-going. Maintenance and
administration have not yet been created and must be written from the
ground up.
Current State of the System (Tom, Ray, Sara, Brian, Daniel)
-----------------------------------------------------------
CPF description generation is done at the University of Virginia’s
Institute for Advanced Technology in the Humanities (IATH). IATH handles
the CPF data extraction and hosts servers for data processing and the
SNAC prototype web site. Data processing, XTF indexing (for the
discovery interface), and web hosting take place on a Linux server with
24 CPUs and 94 GB of RAM connected to a 1Gbit network switch. This
server is administered by the IATH sysadmin team.
Collections of archival description computer records in a variety
of formats are extracted into CPF format XML.  This process involves
writing XSLT scripts that extract and transform input descriptions, and
create CPF files as output. The current state of the extraction is a
collection of XSLT scripts supplemented by Perl scripts. The input files
are XML with large numbers of files in EAD, MARC XML, and British
Library XML, as well as several smaller data sets.  A large XSLT code
library is shared among most of the extractions. Each type of extraction
builds a generic internal data structure, which is serialized as EAC-CPF
XML output. The XSLT takes into account various descriptive practices in
the input data, and reformats as necessary to create a single type of
normative CPF output. The complexity of this task centers around the
large number of small differences in descriptive practice. Currently
more than 3 million CPF computer descriptions have been created. The
XSLT processor is Saxon 9 HE, which is the free “home edition” of Saxon.
Saxon implements XSLT 2.0. There are a small number of Perl scripts that
integrate the XSLT into a pipeline, automating tasks such as chunking
data sets into sizes that won’t exceed computer memory.
The current state of the match/merge is (filled in by Yiming/Ray/Sara,
initially a one or two paragraph overview with more detail added later
as necessary).
Overview of Brian’s UI and programming for the SNAC2 XTF discovery tool
(add this to another item if there is an umbrella section more
appropriate).
Is XTF the only discovery tool we will offer? Will SNAC be fully indexed
by Google and Bing?
TK The involvement of the UC Berkeley I School includes the development,
testing and modification of the matching and merging components of the
SNAC system. The current system, described in more detail below, takes
the EAC-CPF records derived from the various source institutions and
compares the names and associated information (especially dates) to
identify the records that likely describe the same person,
organization, or family. The process involves not only comparison across
input records, but also comparison with information from the Virtual
International Authority File, and approximate matching for these records
as well.
TK The involvement of CDL includes … (Brian)
TK We have several extant user studies UI/UX … (Rachael, on-going)
TK The results of these studies are … (Rachael, on-going)
TK The technical implications of these studies are … (Rachael, on-going)
The current system uses a fairly loose software development process.
Source code is primarily maintained on a Linux server which is managed
by standard practices as they relate to hardware, software, network,
user accounts, backup, and so on. All the data resides on the server.
Source
code is managed by version control systems. The amount of quality
assurance and testing has been increasing over time, as well as
documentation, and management aspects such as release process. All tools
currently used are open source, and the code written for SNAC is open
source. We have begun to formalize feature request and issue tracking.
The development process is agile in that there are frequent small
changes that are committed to the version control, and the code is
nearly always in a working state.
### Processing Pipeline (Ray, Yiming, Sara)
TK Describe algorithmic portions, and add a section for new features.
Extraction (Tom, Daniel)
------------------------
There are currently several CPF extraction software pipelines: MARC21,
British Library, Smithsonian Agency History, New York State Archives,
Smithsonian Joseph Henry, Smithsonian Field Books, and EAD from nearly
60 institutions.
The first step in adding new records to the SNAC database is to convert
incoming data into EAC-CPF XML.  One EAC-CPF record is created for each
successfully extracted reference to an identity from an archival source.
The processing also allows for some degree of remediation of data
quality issues and serves to normalize the data into a common format.
 Scripting data transformation processes is a significant task that
often requires close communications with data contributors and
customizations to accommodate local practices of the contributors.
Creating an extraction is a complex process since we must deal with
variances in local descriptive practice. The MARC21 tools have been made
available as a web interface and this demonstrates the feasibility of
moving more of the processing responsibility to data donors. If we are
optimistic, we hope that EAD-to-CPF extraction and all other types of
future extractions can be turned into donor-driven tools. Specifically,
we create the tools and then deploy them as web applications and/or
desktop applications. Web hosted extraction tools allow us to leverage
the power of our servers and programmers so that data donors do not need
a large computing infrastructure in order to participate. In any case,
data must be validated before ingest into the match/merge processing.
XSLT and Perl are the predominant technologies used in the generation of
the XML documents created by this process.  The code architecture
focuses on reusability of modular routines to facilitate maintenance of
the customizations needed to accommodate the diversity of data sources.
Code, sample data, and documentation are in Github. The pipeline is
being run on a server, but the hardware requirements are minimal enough
that most laptop computers could run the extraction. The system requires
unix-like features of Linux, MacOS, or cygwin (for MS Windows). The XSLT
engine is Saxon 9.x HE, which is the free, public version of Saxon.
Match/Merge (Brian, Yiming, Ray)
--------------------------------
The match/merge process has three major data input streams, library
authority records, EAC-CPF documents from the EAC-CPF extract/create
system, and an ARK identifier minter.
First, a copy of the Virtual International Authority File (VIAF) is
indexed as a reference source to aid in the record matching process.  In
addition to authorized name headings from multiple international
sources, the VIAF data contains biographical data and links to
bibliographic records which will be included in the output documents.  
Then, the EAC-CPF from the extract/create process are serially processed
against the VIAF and each other to discover and rate potential matches
between records.  In this phase of processing, matches are noted in a
database.
After the matching phase identifies incoming EAC-CPF to merge, a new set
of EAC-CPF records is generated.  This works by running through all the
matches in the database, then reading in the EAC-CPF input files, and
finally outputting new EAC-CPF records that merge the source EAC-CPF
with any information found in VIAF.  ARK identifiers are also assigned.
This architecture allows for incrementally processing additional
un-merged EAC-CPF documents. It also allows matches to be adjusted in
the database, or alterations to be made to the un-merged EAC-CPF
documents, and the merged records can then be regenerated.
Cheshire, PostgreSQL, and Python are the predominant technologies used
in the generation of the XML documents created by this process.
[link to the merge output spec]
This involves processing that compares the derived EAC-CPF records
against one another to identify identical names. Because names for
entities may not match exactly or the same name string may be used for
more than one entity, contextual information from the finding aids is
also used to evaluate the probability that closely and exactly matching
strings designate the same entity.[1] For matches that have a high
degree of probability, the EAC-CPF records will be merged, retaining
variations in the name entries where these occur, and retaining links to
the finding aids from which the name or name variant was derived. When
no identical names exist, an additional matching stage compares the
names from the input EAC-CPF records against authority records in the
Virtual International Authority File (VIAF). Contextual information
(dates, inferred dates, etc.) is used to enhance the accuracy of the
matching. Matched VIAF records are merged with the input derived EAC-CPF
records, with authoritative or preferred forms of names recorded, and a
union set of alternative names from the various VIAF contributors, will
also be incorporated into the EAC-CPF records. When exact matching and
VIAF matching fail, we attempt to find close variants using Ngram
(approximate spelling) matching. In addition, contextual information,
when available, is used to assess the likelihood of the records actually
describing the same entity. Records that may be for the same entity, but
for which the available contextual information is insufficient to make a
confident match, will be flagged for human review (as “May be same as”).
While these records will be flagged for human review, the current
prototype does not provide facilities to manually merge records. The
current policy governing matching is to err on the side of not merging
rather than merging without strong evidence.
The resulting set of interrelated EAC-CPF records will represent the
creators and related entities extracted from EAD-encoded finding aids,
with a subset of the records enhanced with entries from matching VIAF
records. The EAC-CPF records will thus represent a large set of archival
authority records, related with one another and to the archival records
descriptions from which they were derived. This record set will then be
used to build a prototype corporate body, person, and family name and
biographical/historical access system.
In the current system, all input records and potential matches are
recorded in a relational database with the following structure:
* * * * *
[1] Using contextual information in determining that two or more records
represent the same entity has been successful in matching and merging
authority records in an international context. See Rick Bennett,
Christina Hengel-Dittrich, Edward T. O'Neill, and Barbara B. Tillett,
“VIAF (Virtual International Authority File): Linking Die Deutsche
Bibliothek and Library of Congress Name Authority File”:
http://www.ifla.org/IV/ifla72/papers/123-Bennett-en.pdf
![Screen Shot 2014-06-22 at 3.08.12 PM.png](images/image00.png)
The current processing steps are summarized in the following
diagram:
![Slide1.jpg](images/image01.jpg)
Discovery/Dissemination (Brian, Rachael)
----------------------------------------
Prototype research tool^[[f]](#cmnt6)^
--------------------------------------
The main data inputs for the prototype research tool are the merged
EAC-CPF documents produced by the match/merge system.  Some other
supplemental data sources, such as DBpedia and the Digital Public
Library of America, are also consulted during the indexing process.
A pre-indexing phase is run on the merged EAC-CPF documents.  During
pre-processing, name headings and Wikipedia links are extracted, and
then used to look for possible related links and data in supplemental
sources. The output of the pre-indexing phase consists of XML documents
recording supplemental data.
Once the supplemental XML files are generated, two types of indexes are
created, which serve as the input to the web site.  The first index runs
across all documents and provides access to the full text and specific
facets of metadata extracted from the documents.  Additionally, the XML
structure of each document is indexed as a performance optimization that
allows transformations to be efficiently applied to large XML documents.
The public interface to the prototype research tool utilizes the index
across all documents to enable full text, metadata, and faceted searches
of the merged EAC-CPF documents.  Once a search is completed and a
specific merged EAC-CPF document is selected for display, the index of
the XML document structure is used to quickly transform the merged
document into an HTML presentation for the end user.
In the SNAC1 prototype a graph database was created after the full text
indexing was complete.  The graph database was used to power
relationship visualizations and an API used to dynamically integrate
links to SNAC into archival description access systems. This graph
database was then converted into linked data, which was loaded into a
SPARQL endpoint. This step has not yet been implemented in the SNAC 2
prototype.  Because the merged EAC-CPF documents are of higher quality
for the SNAC 2 prototype, the graph extraction process is no longer
dependent on the full text index being complete, so it could run in
parallel with pre-indexing and indexing.
XTF is the main technology used to power public access to the merged
EAC-CPF records.  XTF integrates Lucene for indexing and Saxon for XML
transformation, making heavy use of XSLT for customization and display
of search results and the merged documents.  EAC-CPF and search results
are transformed to HTML5 and JSON for consumption by the end users’ web
browsers.  Multiple JavaScript and CSS3 libraries and technologies are
used in the production of the “front end” code for the website.  Google
Analytics is used to measure use of the site.  Werker, Middleman, and
Bower are used to build the front end code for the site.
This technical architecture
[links to code]
Maintenance Functionality (All authors)
---------------------------------------
Maintenance falls into four areas: discover, split, merge, and edit.
- Discover is the process of finding errors.
- Splitting is the process of distributing data from one description
to two or more new descriptions. This may involve descriptions that
have been incorrectly merged.
- Merging is the process of combining two or more separate
descriptions for the same CPF identity into a single description.
- Editing is the modification of descriptions.
We will build a maintenance system based on a core of researchers
working in a moderated environment. Primarily, the people involved in
maintenance in the pilot stage will be professional archivists and
institution-affiliated experts with a vested interest in the data. In
the future, the maintenance function may be opened to highly qualified
(perhaps amateur) content experts. The software must therefore support
policies such as vetting and moderation so that we avoid the pitfalls of
unregulated crowdsourcing.
The system will require changes to be reviewed by a moderator before
becoming part of the production system. Administrative policy may
streamline these requirements, but the software functionality needs to
exist at the most granular level for which we can imagine reasonable
business logic. For the sake of security and general peace of mind,
every change to the system must be captured (as with versioning) in an
audit trail, and there are no destructive changes. For example, there is
no “delete” per se, because the delete feature only hides descriptions
from public view. Updated descriptions will be subject to version
control so changes can be rolled back.
### Functionality for Discovery
The discovery tools for maintenance may be somewhat different from the
normal discovery tools for scholarly research. We have a standard
discovery tool, but we almost certainly need additional tools that
identify descriptions that are more likely to need manual merging or
splitting. We assume that as part of some maintenance workflows, a
person will go fishing for merge/split problems. We will need a user
interface and functions that support this focused discovery.
Users will have individual accounts, so we can enable a search history,
internal bookmarks, and various saved reports (assuming faceted search
where it could take many mouse clicks to accrete a specific search).
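A rough sketch of the per-user discovery state implied here (names are placeholders, not a final schema): saved searches double as search history, and bookmarks point at specific descriptions.
```sql
CREATE TABLE saved_search (
    search_id    BIGSERIAL PRIMARY KEY,
    user_id      BIGINT NOT NULL,          -- id from the planned user table
    label        TEXT,                     -- user-supplied name; NULL for plain history
    query_params TEXT NOT NULL,            -- serialized search and facet parameters
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE bookmark (
    user_id    BIGINT NOT NULL,
    record_id  BIGINT NOT NULL,            -- bookmarked CPF description
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, record_id)
);
```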
### User interface for Discovery (Brian, Rachael)
### Functionality for Splitting^[[m]](#cmnt13)^^[[n]](#cmnt14)^ (Tom, Daniel, all authors)
Keeping in mind that our descriptions are authoritative and will be
referenced via persistent identifiers (ARKs), it will be necessary to
de-authorize or invalidate the ARK of a description which has been
split. The ARK server will note the new ARKs of the resulting
descriptions in both human-readable and machine-actionable formats.
Outside parties with an invalid ARK will probably have to manually
update their descriptions, since the entity name is too confusing for a
computer to disambiguate. (Although we can easily create a report of
deprecated ARKs on a per-institution basis.) When merging descriptions,
the main ARK will be retained, and merged ARKs can simply redirect to
it. ^[[o]](#cmnt15)^Note: determine which operations require a new ARK,
either due to the old ARK being so much changed as to no longer be what
it originally referred to, or other causes TBD.
Having found a description in need of splitting, we need a UI to support
creating one or more additional descriptions. This should have a “save”
feature so that the work can continue over time. This implies that we
also mark descriptions that are being worked on as “being worked on” so
that others don’t duplicate the work. Completed splitting is “reviewed”
by moderators before being “posted”, where posting makes the
modifications visible to the standard discovery tools. There are also
some issues in how we manage ARKs of split descriptions.
In theory, several people in separate locations could collaborate in
real time on description maintenance. However, that type of
collaboration is fairly complex. We don’t want to support collaborative
description splitting in the first version, so we need a feature to
“lock” descriptions. This means we need a mechanism for seeing who has
the lock, and for sending that person a
message.^[[p]](#cmnt16)^^[[q]](#cmnt17)^ Unless we’re going to expose
the email addresses of our users, we will need an anonymized email
system (or email forwarding system).
An ideal split UI will easily allow text/fields to be selected and moved
to one of the possibly multiple splits, via a single mouse click or
simply drag-and-drop. This feature needs undo, or at least a reciprocal
ability to move data from a new split back to the original description.
Meanwhile the UI has to display multiple descriptions in some clear
manner. It is probably a good idea to have a snapshot of the original
(pre-split) description to refer back to. This process can be quite
confusing and time consuming, so people need to know what it was they
started with, even when they are well along in the splitting process.
Any description that has been manually modified in any way should have
special properties that prevent the automated match/merge pipeline from
touching that description record in the future. We also need to be able
to search based on the types of modifications that have been performed
on descriptions, both for reporting, and for future manual modification.
During the split, new descriptions will be created, but will remain
locked, and invisible outside the splitting operation. These
descriptions can be deleted by the person doing the split, but are not
visible to other users.
When the split data is ready, the user goes into the review and post
phases. Review saves all the work and presents some final, read-only
view of the work. Review also does a validation of the description/data,
and gives meaningful messages when validation fails. The “post” button
should come with various warnings, notifications, and the typical “are
you sure”. Posting will save all work, perform any required database
bookkeeping, and unlock all the involved descriptions.
One type of bookkeeping during the post phase is managing ARKs. The ARK
of a split description must be deprecated, and new ARKs created for all
the splits. The deprecated ARK will have a “permanently moved” redirect
in the ARK system that gives the new ARK values and the names associated
with the new authority descriptions in both machine-actionable and
human-readable formats.
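The database side of that bookkeeping might be as simple as a redirect table (names are illustrative; the ARK resolver itself is an external service):
```sql
-- Map a deprecated ARK to the ARK(s) and names of the descriptions that replace it.
CREATE TABLE ark_redirect (
    old_ark       TEXT NOT NULL,           -- ARK of the split, now-deprecated description
    new_ark       TEXT NOT NULL,           -- ARK of one resulting description
    new_name      TEXT,                    -- authority name, for the human-readable notice
    deprecated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (old_ark, new_ark)
);
```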
We need a feature to abandon the split, and this feature needs an “are
you sure” check.
Descriptions that are in the process of being modified should have some
kind of icon/warning in the normal discovery interface, just so
researchers know that the description in question may soon change.
To review split:
1. lock original description,
2. mark descriptions as being maintained,
3. give descriptions a locked-for-maintenance icon in the discovery
interface,
4. view original (unsplit) description,
5. create new descriptions (also locked),
6. copy data between descriptions,
7. undo copy,
8. enter new data into any of the description fields,
9. edit data in any of the description fields,
10. delete new descriptions (aka undo create),
11. “done splitting”,
12. undo “done splitting” (go back into splitting UI),
13. review split (just a read-only UI?),
14. moderator posts  the completed split,
15. revert entire split,
16. contact person who has locked description,
17. see when description was locked,
18. save progress,
19. user function to view descriptions I have locked,
20. admin function to view locked descriptions by user,
21. choose one of my locked descriptions to continue work.
### User interface for Splitting (Tom, Daniel, Rachael, others)
### Functionality for Merging (Tom, Daniel, all authors)
We need to allow our experts to merge descriptions. This may be far more
common than splitting since the automated pipeline was designed to only
merge when the evidence was overwhelming.
The process begins with discovering two or more descriptions for the
same CPF entity. Discovery history needs to allow persistent research
across sessions.
When starting description maintenance, the descriptions involved are
locked to prevent other users from modifying them. The system notes this
lock and makes the locked state visible in the discovery interface. It
seems safe to assume that one of the merged descriptions will become the
authoritative description. This single description will be
retained, and the other merged descriptions marked as deleted. We can
retain the ARK of the single retained description. The main description
will be copied, with the original still visible to the discovery tool,
albeit marked as “under maintenance” or similar. The copy will be
modified by the merging process, and will not be visible until
completion of merging.
The system might be able to automatically join the descriptions into the
main description, but we always need the ability to edit the main
description. Secondary descriptions should become read-only, and be
locked from any modifications. We need the ability to edit each field of
the main description, and we need to be able to create new fields,
especially alternative name forms. Merging needs the usual save, undo,
and abandon features.
When merging is complete, the new description is validated, and sent to
a moderator for review. The moderator may post or “send back” the
description for the editor to make additional changes.
During the post phase, bookkeeping is done. The now-deprecated merged
descriptions are marked internally as deprecated, and their ARK values
set to redirect. The original main description is also deprecated,
replaced by the newly merged description, and the new description open
for public view.
Every description retains its modification history.
To review merging:
1. discover two or more descriptions needing merge,
2. lock all descriptions,
3. show all locked descriptions as locked in the discovery interface,
4. copy the main description,
5. lock and hide the copy from discovery,
6. allow merge save, allow merge abandon,
7. allow the public to contact the maintainer,
8. auto-join all descriptions to the main description (if possible),
9. lock secondary descriptions to be read-only by the maintainer,
10. allow editing of the main description especially adding new fields
such as alternate name forms,
11. validate merged description,
12. send merge to review,
13. review locks changes and notifies a moderator,
14. moderator can send back,
15. send back unlocks and notifies the maintainer of additional required
work,
16. moderator can post the merge,
17. post performs ARK deprecation,
18. description deprecation,
19. locks and hides original,
20. makes merged description publicly visible.
### User interface for Merging (Rachael, Tom, Daniel, others)
### Functionality for Editing
Modifications we expect include but are not limited to: spelling
corrections, date corrections, editing or expanding biographical data,
and fixing typographical errors. Editing also includes adding, deleting,
and correcting relations between descriptions. Metadata such as the URL
of the original finding aid may also be updated. The maintenance system
also needs to support bulk data edits of several types.
### User interface for Editing (Rachael, Tom, Daniel, others)
Admin Client for Maintenance System
-----------------------------------
### User Management (Tom, Brian)
Authentication is validating user logins to the system. Authorization is
the related aspect of controlling which parts of the system users may
access (or even which parts they may know exist).
Authentication systems require excessively careful programming since
they are always attacked. The usual recommendation is to use an
off-the-shelf authentication system although this is often difficult
since system requirements vary widely. We should search for an open
source authentication system, and only write our own if nothing
exists.^[[r]](#cmnt18)^
Authorization involves controlling what users can do once they are in
the system. The default is that they can’t do anything that isn’t
exposed to the non-authenticated public users. Privileges are added and
users are put into groups from which they inherit privileges; some
privileges can be granted on a per-user basis. The authorization system
is involved in every transaction with the server, to the extent that
every request to the server is checked for authorization before being
passed to the code doing the real work.
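Using the user/group/permission tables sketched earlier (illustrative names only), the per-request check could reduce to a single query; the user id and privilege name here are placeholders.
```sql
-- Does user 42 hold the 'moderate' privilege through any of their groups?
SELECT EXISTS (
    SELECT 1
    FROM   user_group ug
    JOIN   group_permission gp ON gp.group_id = ug.group_id
    WHERE  ug.user_id = 42
    AND    gp.permission = 'moderate'
) AS authorized;
```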
The Linux model of three privilege types “user”, “group”, and “other”
works well for authorization permissions and we should use this model.
“User” is an authenticated user. “Group” is a set of users, and a user
may belong to several groups. “Other” is any non-authenticated user.
Users can be in multiple groups and have all the privileges of all the
groups to which they belong. Group membership can change, so we need UI
and code to manage it. User information such as name, phone
number, and even password can also change. User ID values cannot be
changed, and a user ID is never reused.
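A minimal sketch of how a group-based authorization check might look
follows; the group names, privilege names, and request model are
illustrative assumptions rather than a settled design.

```python
# Hypothetical group-based authorization check, run before any request handler.
# Group and privilege names are illustrative assumptions only.
from __future__ import annotations

PRIVILEGES_BY_GROUP = {
    "Researcher": {"search", "save_search"},
    "Maintenance": {"search", "save_search", "edit", "merge"},
    "Moderator": {"search", "review", "post"},
    "Web admin": {"create_account", "reset_password", "lock_account"},
}

def privileges_for(groups: set[str]) -> set[str]:
    """A user inherits the union of privileges from all of their groups."""
    privs: set[str] = set()
    for group in groups:
        privs |= PRIVILEGES_BY_GROUP.get(group, set())
    return privs

def is_authorized(user_groups: set[str] | None, required: str) -> bool:
    """Non-authenticated ('other') users get nothing beyond the public pages."""
    if user_groups is None:
        return False
    return required in privileges_for(user_groups)
```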
By and large when we refer to “accounts” we mean web accounts managed by
the Manager/Web admin. It should be possible to use the discovery
interface without an account, but saving history, searches, and other
session related discovery tools requires an account.
Every account will be in the “Researcher” group (role). Privileges are
managed by adding other groups to an individual user’s account.
| User type | Group | Description |
|---|---|---|
| Sysadmin | Server admin, Web admin | Maintain server, backups, etc. |
| DBA | Server admin, DB admin, Web admin | Schema maintenance, data dumps, etc. |
| Programmer | Server admin, Web admin | Coding, testing, QA, release management, data loading, etc. |
| Manager | Web admin | Web account creation, account management, privilege management, web reporting |
| Peer vetting | Vetting | Reviews applicant Moderators, Reviewers, and Content experts; uses the Vetting UI |
| Moderator | Moderator | Reviews Maintenance changes and posts those changes; is vetted |
| Reviewer/editor | Maintenance | Has Maintainer privileges; affiliated with an institution and vouched for by that institution; vetted; interacts with Moderators |
| Content expert | Maintenance | Not affiliated with an institution; a domain expert; has Maintainer privileges; vetted; interacts with Moderators |
| Documentary editor | Maintenance | (For our purposes the same as Reviewer/editor?) |
| Researcher (read-only) | Researcher | The main consumer of SNAC; uses the public web interface to search and discover; has an account to save searches and use other session-related features |
| Institutional archival description donor | Block upload | Member of an institution that donates blocks of descriptions; may have block upload and update privileges |
| Name authority manager | Name authority | Someone in charge of a name authority; donates descriptions to SNAC; may have some Admin privileges to update descriptions and bulk upload privileges |
Institutional admins: Certain users will be distinguished by having
access to administrative reports for their institution (but probably not
for other institutions). These users need an admin dashboard with
corresponding reports. We may need sub-institution accounts, and that
gets tricky because we don’t want to be mixed up in internal
institutional politics.
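For implementation purposes the table above reduces to a mapping from
user type to groups. A data-only sketch is below; the role and group
names are taken from the table but may change.

```python
# User types and their groups, transcribed from the table above.
# Stored as data so vetting and account-management UIs can share one source.
GROUPS_BY_USER_TYPE = {
    "Sysadmin": ["Server admin", "Web admin"],
    "DBA": ["Server admin", "DB admin", "Web admin"],
    "Programmer": ["Server admin", "Web admin"],
    "Manager": ["Web admin"],
    "Peer vetting": ["Vetting"],
    "Moderator": ["Moderator"],
    "Reviewer/editor": ["Maintenance"],
    "Content expert": ["Maintenance"],
    "Documentary editor": ["Maintenance"],
    "Researcher (read-only)": ["Researcher"],
    "Institutional archival description donor": ["Block upload"],
    "Name authority manager": ["Name authority"],
}

# Every account is additionally in the "Researcher" group (role).
DEFAULT_GROUPS = ["Researcher"]
```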
Web Application Administration
------------------------------
System administration will be required for the web application and the
server hosting the web site. This is well understood from a technical
point of view. We should have more documentation than usual of the
command-line accounts involved and the server configuration. This aspect
of administration integrates with versioning, backup, and software
releases.
Reports ^[[s]](#cmnt19)^^[[t]](#cmnt20)^(Tom, Brian, Rachael, Brad)
-------------------------------------------------------------------
While the web interface is the primary public face of SNAC, many other
views of the data and metadata are necessary, especially for admins and
governance. These reports will primarily be generated by integrating a
third-party reporting package such as the Jaspersoft Business
Intelligence Suite, which is free, open source, and includes a full
range of tools. The SNAC data resides in PostgreSQL, a standard SQL
relational database management system (RDBMS), which simplifies the
process of adding reporting and business intelligence.
(How much detail do we want about reports? Maybe just half a dozen
examples?)
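As one illustrative example of the kind of report we might want, the
sketch below counts descriptions contributed per institution directly
from PostgreSQL using psycopg2. The table and column names are
hypothetical, since the schema has not been designed, and in practice a
packaged tool such as Jaspersoft would sit in front of queries like this.

```python
# Hedged sketch of a report pulled straight from PostgreSQL with psycopg2.
# The schema (table and column names) is hypothetical.
import psycopg2

REPORT_SQL = """
    SELECT contributing_institution, COUNT(*) AS description_count
    FROM descriptions
    GROUP BY contributing_institution
    ORDER BY description_count DESC;
"""

def descriptions_per_institution(dsn: str) -> list:
    """Return (institution, count) rows for an admin dashboard or export."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(REPORT_SQL)
            return cur.fetchall()
```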
System Administration (Tom, Brian)
----------------------------------
This is boilerplate server administration, for the most part.
Preservation of original material may not be necessary. Since our data
is derived from original sources and we know the location of those
sources, individual lost descriptions could be restored from the
originals.
The simplest server model has shell logins for sysadmins, DBA, and
developers via SSH. If the institution hosting the project can only
allow employees on the server, then we may need to create a new server
strategy.
One option is to do our hosting on Amazon. If so, what is the hosting
fallback if Amazon has an outage? ^[[u]](#cmnt21)^Where do we house
things like tape backups? If we’re using Amazon we will have to research
the list of things that go wrong since our current
sysadmins^[[v]](#cmnt22)^ are experienced with the model of local
hardware colocation.
One common failure of standard server practice is to assume that backups
are working. We should test our backups on a regular schedule by
performing a test restore and verifying that all files are present and
unchanged. File checksums and file counts work well for this
verification.
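A sketch of that checksum-and-count verification might look like the
following, comparing a manifest recorded at backup time against a test
restore. The manifest format and the paths are assumptions.

```python
# Hedged sketch: verify a test restore against a manifest of SHA-256 checksums.
# Manifest format assumption: one "checksum  relative/path" pair per line.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(manifest: Path, restore_root: Path) -> bool:
    """Return True only if every file in the manifest is present and unchanged."""
    ok = True
    checked = 0
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        checksum, rel_path = line.split(maxsplit=1)
        checked += 1
        target = restore_root / rel_path
        if not target.is_file() or sha256_of(target) != checksum:
            print(f"MISSING OR CORRUPT: {rel_path}")
            ok = False
    print(f"checked {checked} files from manifest")
    return ok
```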
All storage systems (including RAID arrays) are vulnerable to undetected
data corruption known as bit rot. Historically this issue has been
largely ignored (because it is rare). Anti-bit-rot parity error
correcting file systems (ZFS, Btrfs) may not be production quality at
this time. If we want to deploy an anti-bit-rot technology, our
alternatives may be limited to using Par2/QuickPar/Parchive.
Community Contributions (All authors)
-------------------------------------
Researcher interface/functionality including public facing discovery and
dissemination (All, especially Brian)
In addition to current and planned features (need a list) we should
consider the following:
- Expose all CPF descriptions to search crawlers so that Google and
Bing can index our data^[[w]](#cmnt23)^ (see the sitemap sketch after
this list).
- Expose the facets of our data as web pages or directories of web
pages so that the facets can be browsed outside XTF, and indexed by
Google and Bing.
- Administration interface/functionality, including private/admin
facing, internal discovery tools, and data modification (Tom, Brian,
Rachael, Ray)
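As a sketch of the crawler-exposure items above, a nightly job could
emit a sitemap of stable CPF description URLs for Google and Bing to
pick up. The base URL pattern and the source of the ARK list are
assumptions; sets larger than 50,000 URLs would need a sitemap index.

```python
# Hedged sketch: generate a sitemap.xml of CPF description pages for crawlers.
# The base URL and the source of ARK identifiers are assumptions.
from xml.sax.saxutils import escape

BASE_URL = "https://snac.example.org/view/"  # hypothetical URL pattern

def write_sitemap(arks: list, out_path: str) -> None:
    """Write one <url> entry per CPF description."""
    with open(out_path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for ark in arks:
            out.write(f"  <url><loc>{escape(BASE_URL + ark)}</loc></url>\n")
        out.write("</urlset>\n")
```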
The administration item above is available only to management and
editorial admins, and is not required by any other users. Not all admins
should have (or need) all admin features. Admins need to create
accounts, reset passwords, and lock accounts.
If we have a vetting process then we need to know the related business
logic that we will support with UI, code, and database tables/fields.
Many reports will be limited to certain roles. Admin users will likely
be heavy report users.
Ability to Open/Close the Site during Maintenance (Tom, Brian)
--------------------------------------------------------------
If the product has a “closed for maintenance” feature,
^[[x]](#cmnt24)^this ability would be available to admins, even though
it is the Linux sysadmins who will do the maintenance. A major failing
of web applications is the assumption that the product is always up.
This creates havoc when the site simply fails to load due to an outage,
planned or otherwise. With a little work we should be able to have an
orderly “site is closed” web page and status message. This is a low
priority feature since downtime is probably only a few hours per year.
At the same time, if it isn’t too difficult to implement, it sets our
project apart from the majority who either ignore the problem, or let
their help desk folks spend an hour apologizing to customers.
When the product is closed, web admins should be able to log in
(assuming login is possible). Discuss: do we want an architecture where
the login is essentially a separate product so that we can have a
“lobby” and other front end features that continue to work even when the
backend is down for maintenance?
Most sites simply return a server error or a generic “not found” (404)
page when the site is down for whatever reason. We can avoid this in a
couple of ways. The simplest is to use some Apache server features and a
few simple scripts so that users see a clear message when the site is
down for maintenance. This very simple approach requires little or no
change to our software architecture. The more elegant approach is to use
one of several system architectures that keep a small front end always
running.
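The front-end piece of the elegant approach can be quite small. The
WSGI-style sketch below checks a maintenance flag and returns the
conventional 503 Service Unavailable status with a friendly page while
the flag exists; the flag path and page text are assumptions.

```python
# Hedged sketch of a maintenance-mode gate in front of the real application (WSGI).
# The flag file path and message are assumptions.
import os

MAINTENANCE_FLAG = "/var/run/snac/maintenance"  # hypothetical path, touched by admins

CLOSED_PAGE = (b"<html><body><h1>SNAC is closed for scheduled maintenance.</h1>"
               b"<p>Please check back shortly.</p></body></html>")

def maintenance_middleware(app):
    def wrapper(environ, start_response):
        if os.path.exists(MAINTENANCE_FLAG):
            # 503 tells browsers and crawlers that the outage is temporary.
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/html"),
                            ("Retry-After", "3600")])
            return [CLOSED_PAGE]
        return app(environ, start_response)
    return wrapper
```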
Sandbox for Training, perhaps as a clone of the QA system? (All authors)
------------------------------------------------------------------------
TK
ArchivesSpace Feature Planning via Brad
=======================================
This section will require some discussion (conference calls) with Brad
and others.
Staffing Model (Brian’s draft suggestions)
==========================================
Production of a cooperatively maintained, high-profile web site requires
different types of technical and non-technical work.
Operations Team
- Communications and interactions with end users and content owners,
from marketing to user support and assessment
- Manages the help desk
- Supports the production web application infrastructure, including
monitoring; "on call" for first-tier response to system monitors
- Batch ingest of new data sources
- Signs up and on-boards new pilot members
- Proactive content QA and remediation
- Work organized around an issue queue / customer relationship
management system
Main Artifact: A ticketing issue tracker that automatically generates a
ticket from each email sent to help@example.edu
Staffing Requirements:
?? FTE Tech Lead
?? FTE Project Lead
?? FTE Programmer/Analyst
?? FTE General Analyst
Development Team
- Create new features that deliver customer value
- Maintain tests for new features
- Second-tier support of deployed features; developers are on call for
their deployed code
- Deploy code to test, stage, and production environments
- Work organized around sprints
Main Artifact: A user story backlog that supports scoring stories by
points
Staffing Requirements:
?? FTE Tech Lead
?? FTE Project Lead
?? FTE Programmer/Analyst
?? FTE General Analyst
Research Team
- Conduct experiments with new algorithms and technologies
- Interoperation with (and participation in the development of)
relevant domain-specific standards and practices
Staffing Requirements:
?? FTE Tech Lead
?? FTE Project Lead
?? FTE Programmer/Analyst
?? FTE General Analyst
Main Artifact: Research Agenda, schemas and specifications (esp. merge
spec)
How the three teams are coordinated:
- Continuous integration, testing, and automated deployment
infrastructure
- Operations and Procedure Manual
- Research Agenda
- User Story Backlog
- Design Documents (UI/UX/Graphic Design)
- Professional Standards (content and technical) and local
interpretation
- XML and relational database (RDBMS) schemas
- Github, post-commit hooks
Roadmap (All authors)
=====================
After determining work assignments, development begins by creating a
prototype. Developers will endeavor to build an API for the prototype
that can be carried forward into production. Early work should include
the authentication system and the framework for the web interface. Back
end functionality will be divided into REST-API-accessible portions and
a separate, server-only functional (or class) API. The database schema
will be developed at the same time.
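One way to read the REST/internal split is that every REST endpoint
should be a thin wrapper around a server-only function that batch jobs
and tests can also call directly. The sketch below uses Flask purely as
a stand-in framework; the endpoint path and function names are
illustrative, not decided.

```python
# Hedged sketch of the REST layer delegating to a server-only functional API.
# Flask is used only as an example framework; names are illustrative.
from flask import Flask, jsonify

app = Flask(__name__)

# --- server-only functional API (also callable by batch jobs, tests, etc.) ---
def get_description(ark: str) -> dict:
    """Look up one CPF description by ARK (stubbed here)."""
    return {"ark": ark, "nameEntries": [], "relations": []}

# --- REST layer: thin wrappers only, no business logic ---
@app.route("/api/v1/description/<path:ark>", methods=["GET"])
def rest_get_description(ark: str):
    return jsonify(get_description(ark))
```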
All development needs to be test-driven, with some way to determine
whether the code is behaving properly. This is especially important for
the authentication module and all data-processing pipelines.
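A minimal example of what test-driven means in practice, using the
standard unittest module against a trivial stand-in function (the real
tests would target modules such as authentication and the merge
pipeline):

```python
# Hedged sketch of a unit test; the function under test is a trivial stand-in.
import unittest

def is_valid_ark(value: str) -> bool:
    """Stand-in validation: ARK identifiers are expected to start with 'ark:/'."""
    return value.startswith("ark:/")

class ArkValidationTest(unittest.TestCase):
    def test_accepts_ark(self):
        self.assertTrue(is_valid_ark("ark:/99999/fk4example"))

    def test_rejects_non_ark(self):
        self.assertFalse(is_valid_ark("http://example.org/not-an-ark"))

if __name__ == "__main__":
    unittest.main()
```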
A tight timeline for the prototype is 2 months. During prototyping we
try out ideas, and discover any discrepancies in the functional plan. At
the end of the prototype phase we allow a week or two where we evaluate
which parts of the APIs to retain, and which to rewrite.
Real project development will proceed based on priority of end user
needs, with some input from developers about fundamental functionality
for the API foundations.
Milestones (All authors)
========================
Need something firm for the July meeting (Tom, Rachael, based on CPP
proposal)
May 9: Outline and team assignments
July 15: Outline refinement, milestones, technical details
September 15: Daniel has draft proposal, tech team (TAT) provides best
guesses for development milestone
October 15: Draft proposal refined
December 15: Proposal complete
Create the what/how table
=========================
TK Is this a table of each function and how we expect it to be
implemented?
Governance and Policies, etc.
=============================
TK Data curation, preservation, graceful retirement
Data expulsion vs. embargo
Duplicates, backups, restore, related policy and technical issues
Broad pieces that are missing or underdeveloped [Laura]
Refresh relationship with OCLC [John, Daniel]
[[a]](#cmnt_ref1)Awkward. Unclear perhaps that "the same" means records
referring to the same identity, and not "the same" as the previous
sentence.
[[b]](#cmnt_ref2)could it be phrased as "...for matching name records,
linking those descriptions to a single authoritative CF identity."?  
I am not sure the adverb "Critically" has noteworthy value here.  Or
should it be replaced with something like "Basically" | "Essentially" |
"Effectively" ?
[[c]](#cmnt_ref3)Is this the same as pilot phase; or after the pilot?
[[d]](#cmnt_ref4)First-time readers may not be clear that the database
contains singleton and merged records. Confusion may arise because we
always say, "the merged records are discoverable..." In fact, both
unmerged and merged records are discoverable.
[[e]](#cmnt_ref5)I find this sentence awkward. Should the sentence maybe
end with something more like "...can accurately determine are matching
descriptions | descriptions for the same identity."?
[[f]](#cmnt_ref6)We seem to have a name consistency issue. Names here
should match names on the SNAC web site, grant materials, etc.
[[g]](#cmnt_ref7)this is a planned feature, thus the next sentence
instead of this sentence.
[[h]](#cmnt_ref8)Work on alternative 1 to extract out functions common
to all prototype architectures, and distill Alt 1 architecture.
[[i]](#cmnt_ref9)We need user id and group (role) in order to implement
most of the UI features. Unless the CRM is tightly integrated with the
Prototype, there will be problems. Correspondence, contracts, etc.
present an interesting problem.
[[j]](#cmnt_ref10)It may be asking too much to find an off the shelf CRM
that integrates both with our UI/UX and an off-the-shelf issue tracker.
[[k]](#cmnt_ref11)Note this management role!
[[l]](#cmnt_ref12)Good point that we may need a programmer to handle
tier 2 help desk issues, if not during the prototype, then later.
[[m]](#cmnt_ref13)this section doesn't cover the manual splitting of
parts of the record that go into the various splits. For example, a
bioghist might need to be split several ways, and for that we need some
kind of wysiwyg editor.
[[n]](#cmnt_ref14)This also needs a rewrite to align with the data
architecture/queue, etc.
[[o]](#cmnt_ref15)Right? Review the rule of when ARKs are invalid.
[[p]](#cmnt_ref16)we sketched out an edit queue based approach in one of
the DC meetings
[[q]](#cmnt_ref17)If we don't lock, two people could have live edits,
and one of them is not going to get the expected result, unless I'm
missing something. Certainly both edits will take place, but the final
state could result in the first edit being wiped out, just as can happen
in RDBMS commits. The locking seems to me more a feature of business
logic than transaction logic.
[[r]](#cmnt_ref18)why not do something like use OAuth and google?
[[s]](#cmnt_ref19)I know ASpace uses jasper reports with good success;
but I'm not convinced the database will record information on everything
we want to report on.
[[t]](#cmnt_ref20)A corollary requirement is that the database contain
all necessary data for any report we anticipate.
[[u]](#cmnt_ref21)Host in multiple availability zones
[[v]](#cmnt_ref22)Several teams at CDL including DSC have several years
experience running production services in Amazon
[[w]](#cmnt_ref23)I'm pretty sure this is a current feature
[[x]](#cmnt_ref24)This is just for the backend?  The front end should
not need to go down.