# Required New Features

The majority of new features will be in two areas: the maintenance system and the administration system. None of this code exists. The maintenance system has a web UI and a server-based back end that interacts with the same database used by the match-merge. The maintenance system also requires an authentication system (login) that allows us to manage the extensive collaborative efforts.

Currently, data processing is performed only on servers at the command line, handled directly by project programmers. In the new maintenance system, processing will be driven by content experts via a web site, and therefore must address the issues of authentication and authorization inherent in collaborative data-manipulation web applications.

The system will require reports. These will cover broad classes of issues related to managing resources, usage statistics, administration, maintenance, and some reports for end-user researchers.

- Web application (architect: Robbie)

  The web application is a wrapper for all the APIs. It may have an API of its own, or not. It handles all HTTP requests, validating the data, deciding what needs to be done, doing the real work, and handing output back to the user. Typically the output is HTML, but we are already planning for file downloads and JSON data as output from REST API calls. (A request-dispatch sketch appears at the end of this section.)

- Data validation API

  Data from the web browser needs sanity checking and untainting before being handed to the rest of the application. Initially the data validation API can consist of nothing more than untainting input from the browser. We can add various checks and tests over time. We need to decide whether the validation API can reject data; if it can, then it needs to interact with the work flow engine, the actual work flow, and whatever messaging system we use to display messages to end users. (See the untainting sketch at the end of this section.)

- Identity Reconciliation (aka IR) (architect: Robbie)

  This API tests many aspects of identity, each against a target population of other identities. The final answer is a floating-point number giving a match strength. IR has two modes of operation. Mode one compares two identities and returns a match strength. Mode two compares a single identity against the entire database, returning match strengths. Mode two is somewhat unclear. (An interface sketch appears at the end of this section.)

- Workflow manager (Tom)

  Every action the application can perform is part of the work flow. The names of these actions, along with the names of their requisites, are organized into a work flow table. The work flow engine does not know how to do real work, but it does know the names of the functions that do the real work. A new feature (aka function, task) is added to the application by adding its name to the work flow and creating a function of the same name in the application. Likewise, requisites are determined by boolean functions, and every requisite must have a matching function known to the work flow engine. The work flow enforces role-based behavior by testing the requisites. The workflow engine exists, but needs to be ported from Perl to PHP, and the work flow data should be stored in the SQL database. (A dispatch sketch appears at the end of this section.)

- Support for work history and task staging

  Editing consists of several stages of work that may be performed by different people and/or different roles. We need database tables to support saving of work-state data. Create a prototype table schema so we can think about this problem and create a functional spec. For an edit we need: the CPF ID, user ID, timedate stamp, bitfield of work flow tags, and optional user notes. For search we need: user ID, search string, and timedate stamp. (A prototype schema sketch appears at the end of this section.)
- SQL schema (Robbie, Tom)

  All data is stored in a SQL database. Details are given elsewhere.

- Controlled vocabulary subsystem or API: [Tag system](#controlled-vocabularies-and-tag-system)

  We need controlled vocabularies for several data fields. This system handles all aspects of all controlled vocabularies.

- CPF to SQL parser (Robbie)

  The input for the application is CPF files. These files need to be parsed into data fields and inserted into the SQL database. This application exists, but needs some additional functionality.

- Name serialization tool, selectable pre-configured formats

  Outputting name strings based on the name data fields in the database is a tricky problem. There are several output formats. The name serialization tool deals with this issue. (See the sketch at the end of this section.)

- Name string parser

  Names in CPF files are currently strings. The CPF `<part>` element has been imported into the SQL database as a string, but our data needs require individual name components. Parsing names is a tricky problem, but several parsers exist. We need to integrate one or more parsers, and perhaps tweak them to handle the SNAC names.

- Date parser

  We have several date parsers, but none are fully comprehensive. We can use the existing parsers, but they need to be integrated into a single, comprehensive parser. (See the sketch at the end of this section.)

- CPF record edit: edit each field

  Record editing on the server is handled by a collection of functions. The specification for this may evolve in parallel with the code. We know that each field needs to be editable, but the details of work flow and data validation have not been determined. Work flow and validation are both likely to change as the SNAC policies evolve. There are UI requirements for editing.

- CPF record split: split data into separate CPF identities, deprecate the old ARK, mint new ARKs

  Record splitting requires a set of functions and UI requirements documented elsewhere.

- CPF record merge: combine fields, deprecate the old ARKs, mint a new ARK

  Record merge requires a set of functions and UI requirements documented elsewhere.

- Object architecture, coding style, class template (architect: Robbie)

  We will have a specific architecture for the web application, and for the classes and objects involved.

- UI widgets, mostly off the shelf, some custom written

  We need a UI edit/chooser widget for searching and selecting among large numbers of options, such as the Joseph Henry cpfRelations that contain some 22K entries. We also need to list all fields that might have large numbers of values. In fact, part of the metadata for every field is its cardinality, the number of possible entries/repeat values; from a software architecture perspective, the answer is 0, 1, or infinite. (A field-metadata sketch appears at the end of this section.)

One important aspect of the project is long-term viability and preservation. We should be able to export all data and metadata in standard formats. Part of the API should cover export facilities so that over time we can easily add new export features to support emerging standards. The ability to export all the data for preservation purposes also gives us the ability to offer bulk data downloads to researchers and collaborating peer institutions. (An export sketch appears below.)
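The sketches that follow are referenced from the list above. They are illustrative starting points under stated assumptions, not specifications.

First, a minimal front-controller sketch for the web application wrapper: validate the request, hand it to the work flow, and return HTML or JSON. Every name here (`validateInput`, `runWorkFlow`, `renderHtml`, `handleRequest`) is a hypothetical placeholder, and the stubs stand in for the real subsystems described above.

```php
<?php
// Stub for the data validation API: here it only trims string input.
function validateInput(array $raw): array
{
    return array_map(fn($v) => is_string($v) ? trim($v) : $v, $raw);
}

// Stub for the work flow engine, which does the routing and real work.
function runWorkFlow(array $input): array
{
    return ['task' => $input['task'] ?? 'search', 'status' => 'ok'];
}

// Typical output is HTML.
function renderHtml(array $result): string
{
    return '<p>' . htmlspecialchars($result['task']) . ': ' . $result['status'] . '</p>';
}

// Dispatch one HTTP request: validate, run the work flow, render the result.
// Returns [Content-Type, body]; JSON covers the planned REST API output.
function handleRequest(array $get, array $post): array
{
    $input  = validateInput(array_merge($get, $post));
    $result = runWorkFlow($input);
    if (($get['format'] ?? 'html') === 'json') {
        return ['application/json', json_encode($result)];
    }
    return ['text/html', renderHtml($result)];
}

[$type, $body] = handleRequest(['format' => 'json'], ['task' => 'edit']);
echo "$type\n$body\n";
```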
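For the data validation API, a minimal untainting sketch: a whitelist pass that assumes, hypothetically, that each field declares a regular expression its value must match. Whether rejection feeds back into the work flow and messaging system is left open, per the text.

```php
<?php
// Untaint browser input against per-field whitelist patterns. Values that
// fail the pattern are rejected; the error map would feed whatever
// messaging system we adopt for end users.
function untaint(array $raw, array $rules): array
{
    $clean  = [];
    $errors = [];
    foreach ($rules as $field => $pattern) {
        $value = trim($raw[$field] ?? '');
        if (preg_match($pattern, $value) === 1) {
            $clean[$field] = $value;      // value passed the whitelist
        } else {
            $errors[$field] = 'rejected'; // candidate for user messaging
        }
    }
    return ['clean' => $clean, 'errors' => $errors];
}

$result = untaint(
    ['cpf_id' => '12345', 'notes' => 'drop table; --'],
    ['cpf_id' => '/^\d+$/', 'notes' => '/^[\w\s.,-]*$/']
);
var_export($result); // cpf_id passes; notes is rejected
```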
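For Identity Reconciliation, an interface sketch covering the two modes. Only the two modes and the floating-point match strength come from the text; the method names are assumptions, and the toy scoring (fraction of shared field values) is a stand-in for the real multi-aspect tests.

```php
<?php
interface IdentityReconciler
{
    /** Mode one: compare two identities, return a match strength. */
    public function compare(array $identityA, array $identityB): float;

    /** Mode two: compare one identity against the entire database,
     *  returning identity IDs mapped to match strengths. */
    public function compareAgainstAll(array $identity): array;
}

// Toy implementation: match strength is the fraction of fields whose
// values agree, purely to make the interface concrete.
class NaiveReconciler implements IdentityReconciler
{
    public function __construct(private array $population) {}

    public function compare(array $a, array $b): float
    {
        $keys = array_unique(array_merge(array_keys($a), array_keys($b)));
        $hits = 0;
        foreach ($keys as $k) {
            if (($a[$k] ?? null) !== null && ($a[$k] ?? null) === ($b[$k] ?? null)) {
                $hits++;
            }
        }
        return count($keys) ? $hits / count($keys) : 0.0;
    }

    public function compareAgainstAll(array $identity): array
    {
        $strengths = [];
        foreach ($this->population as $id => $candidate) {
            $strengths[$id] = $this->compare($identity, $candidate);
        }
        return $strengths;
    }
}

$r = new NaiveReconciler([7 => ['name' => 'Henry, Joseph', 'birth' => '1797']]);
var_export($r->compareAgainstAll(['name' => 'Henry, Joseph', 'birth' => '1799']));
```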
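For the workflow manager, a dispatch sketch in the target language (PHP): the engine knows only names, looks up an action's worker and requisite functions in the work flow table, tests the requisites, and calls the worker. The table contents, roles, and function names are illustrative, not the project's real work flow; in production the table would live in the SQL database.

```php
<?php
// Requisite functions: booleans that gate actions (role-based behavior).
function isEditor(array $user): bool   { return in_array('editor', $user['roles'], true); }
function isReviewer(array $user): bool { return in_array('reviewer', $user['roles'], true); }

// Worker functions: these do the real work.
function editRecord(array $user): string    { return "record edited by {$user['name']}"; }
function approveRecord(array $user): string { return "record approved by {$user['name']}"; }

// The work flow table: action name => worker name + requisite names.
$workFlow = [
    'edit'    => ['worker' => 'editRecord',    'requisites' => ['isEditor']],
    'approve' => ['worker' => 'approveRecord', 'requisites' => ['isReviewer']],
];

function runAction(array $workFlow, string $action, array $user): string
{
    if (!isset($workFlow[$action])) {
        return "unknown action: $action";
    }
    foreach ($workFlow[$action]['requisites'] as $requisite) {
        if (!$requisite($user)) {                      // call requisite by name
            return "requisite $requisite failed for $action";
        }
    }
    return ($workFlow[$action]['worker'])($user);      // call worker by name
}

$user = ['name' => 'mst3k', 'roles' => ['editor']];
echo runAction($workFlow, 'edit', $user), "\n";    // record edited by mst3k
echo runAction($workFlow, 'approve', $user), "\n"; // requisite isReviewer failed
```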
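For work history and task staging, a prototype schema sketch built directly from the fields listed above: edits need the CPF ID, user ID, timedate stamp, bitfield of work flow tags, and optional notes; searches need user ID, search string, and timedate stamp. Table and column names are a starting point for the functional spec, not a final design; SQLite is used only so the sketch runs anywhere, while the real store is the project's SQL database.

```php
<?php
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// One row per edit action against a CPF record.
$db->exec(<<<SQL
CREATE TABLE edit_history (
    id        INTEGER PRIMARY KEY,
    cpf_id    TEXT    NOT NULL,           -- the CPF record being edited
    user_id   INTEGER NOT NULL,
    edited_at TEXT    NOT NULL,           -- timedate stamp
    wf_tags   INTEGER NOT NULL DEFAULT 0, -- bitfield of work flow tags
    notes     TEXT                        -- optional user notes
)
SQL);

// One row per search performed by a user.
$db->exec(<<<SQL
CREATE TABLE search_history (
    id            INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL,
    search_string TEXT    NOT NULL,
    searched_at   TEXT    NOT NULL        -- timedate stamp
)
SQL);

$stmt = $db->prepare(
    "INSERT INTO edit_history (cpf_id, user_id, edited_at, wf_tags, notes)
     VALUES (?, ?, datetime('now'), ?, ?)"
);
$stmt->execute(['ark:/99999/example', 42, 0b0101, 'first pass']);
echo $db->query('SELECT count(*) FROM edit_history')->fetchColumn(), "\n"; // 1
```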
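For the name serialization tool, a sketch of selectable pre-configured formats. The format keys and name components are illustrative assumptions; the point is that each format is data, not code, so new formats can be added without new logic.

```php
<?php
// Each format is an ordered list of [component, prefix] pairs; a component
// is skipped when the name record lacks it.
const NAME_FORMATS = [
    'surname_first' => [['surname', ''], ['forename', ', '], ['dates', ', ']],
    'direct'        => [['forename', ''], ['surname', ' '], ['dates', ', ']],
];

function serializeName(array $name, string $format): string
{
    $out = '';
    foreach (NAME_FORMATS[$format] as [$component, $prefix]) {
        if (($name[$component] ?? '') !== '') {
            $out .= ($out === '' ? '' : $prefix) . $name[$component];
        }
    }
    return $out;
}

$henry = ['surname' => 'Henry', 'forename' => 'Joseph', 'dates' => '1797-1878'];
echo serializeName($henry, 'surname_first'), "\n"; // Henry, Joseph, 1797-1878
echo serializeName($henry, 'direct'), "\n";        // Joseph Henry, 1797-1878
```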
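For the date parser, a sketch of the single comprehensive wrapper: try each existing parser in turn and return the first confident answer. The two inner parsers here are trivial stand-ins for the real ones we already have.

```php
<?php
// Each parser returns a components array, or null if it cannot parse.
function parseIsoDate(string $s): ?array
{
    return preg_match('/^(\d{4})-(\d{2})-(\d{2})$/', $s, $m)
        ? ['year' => (int)$m[1], 'month' => (int)$m[2], 'day' => (int)$m[3]]
        : null;
}

function parseYearOnly(string $s): ?array
{
    return preg_match('/^(\d{4})\??$/', $s, $m)  // tolerate "1878?"
        ? ['year' => (int)$m[1]]
        : null;
}

// The comprehensive parser: first parser to succeed wins.
function parseDate(string $s): ?array
{
    foreach (['parseIsoDate', 'parseYearOnly'] as $parser) {
        $result = $parser(trim($s));
        if ($result !== null) {
            return $result;
        }
    }
    return null; // none of the parsers understood the string
}

var_export(parseDate('1797-12-17')); echo "\n";
var_export(parseDate('1878?'));      echo "\n";
var_export(parseDate('circa spring')); echo "\n"; // NULL
```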
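For the UI widgets item, a field-metadata sketch making the cardinality point concrete: every field declares whether it holds 0, 1, or infinitely many values, and the UI picks a widget accordingly. The enum, field list, and widget names are assumptions.

```php
<?php
enum Cardinality
{
    case Zero;     // computed/display-only, nothing to enter
    case One;      // single-valued: plain input widget
    case Infinite; // repeating: needs the search-and-select chooser widget
}

// Hypothetical per-field metadata; cpfRelation is the 22K-entry case.
$fieldMeta = [
    'recordId'    => Cardinality::Zero,
    'nameEntry'   => Cardinality::One,
    'cpfRelation' => Cardinality::Infinite,
];

foreach ($fieldMeta as $field => $card) {
    $widget = match ($card) {
        Cardinality::Zero     => 'read-only label',
        Cardinality::One      => 'text input',
        Cardinality::Infinite => 'chooser widget with search',
    };
    echo "$field: $widget\n";
}
```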
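Finally, an export-facility sketch: one exporter per output format, behind a shared interface, so new standard formats can be added over time, plus a bulk export loop for preservation and researcher downloads. The interface and class names are assumptions; only JSON is shown, with other formats following the same pattern.

```php
<?php
interface Exporter
{
    public function contentType(): string;
    public function export(array $record): string;
}

// One concrete format; an EAC-CPF/XML exporter would implement the same
// interface, which is how the API stays open to emerging standards.
class JsonExporter implements Exporter
{
    public function contentType(): string { return 'application/json'; }
    public function export(array $record): string
    {
        return json_encode($record, JSON_PRETTY_PRINT);
    }
}

// Bulk export for preservation and for peer-institution downloads.
function bulkExport(iterable $records, Exporter $exporter): string
{
    $out = '';
    foreach ($records as $record) {
        $out .= $exporter->export($record) . "\n";
    }
    return $out;
}

$records = [['ark' => 'ark:/99999/ex1'], ['ark' => 'ark:/99999/ex2']];
echo bulkExport($records, new JsonExporter());
```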