Basics
Most developers know about these issues. They apply to almost every software system, no matter how small.
- Functional: This is the area most people initially think of when considering software quality, and it is the area most visible to end users.
- Correct: the system must perform the right operations, preferably in a way that matches the user's expectations. The "principle of least surprise" is applicable here.
- Usable: the system should be as easy to use as is practical. In colloquial terms, it should be user friendly.
- Consistent: the different parts of the system should appear similar to the user. There should be a model of the system presented to the user that is easily understandable and as simple as possible. Consistency among different parts of the system allows the user to understand those different parts with less overall to learn.
- Performant: Performance improvements are usually thought of
as quantitative changes, but
a big enough change in performance becomes a qualitative
difference in the system.
As one of my friends says, performance is a feature.
- Fast: the system must be fast enough to satisfy any real-time or near real-time constraints. Worst case scenarios must be considered: can the system keep up under the biggest possible load, and if not, what happens?
- Small: the system must be able to operate within its resource constraints. If you can make the system smaller, that usually means cheaper, faster, and easier to maintain.
- Maintainable: Systems can last for a long time, sometimes far longer
than the original designers might have thought (as happened with
the Y2K problem).
For such systems, the cost of maintenance over the life of the system
can be far more than the original cost of development.
- Modular: the system should be subdivided into separable parts with well-defined integration points between the parts. This allows separate replacement or upgrading of just some parts rather than the whole system.
- Documented: significant aspects of the system design should be captured in written form to reduce the loss of critical knowledge when employees leave the project.
- Commented: a specific form of documentation. Code is easier to write than to read, but is typically read far more often than it is written. Appropriate comments simplify the task of reading and understanding the code.
- Standardized: the system should require a minimal set of standard languages, tools, and subsystems. Relying on fewer languages, tools, and subsystems (such as databases, operating systems, and hardware platforms) simplifies finding the resources needed to maintain the system.
- Versioned (change-controlled): all source code and other system configuration information should be stored in a version-controlled repository so that new versions can easily be created when fixes are required, and old versions can be recovered when necessary to audit problems or restore functionality that was unintentionally lost in an upgrade.
- Testable: the system should include automated unit and system tests to provide confidence that changes to the system have not broken anything. The system should be designed to support effective tests.
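As a concrete illustration of the Testable item above, here is a minimal sketch of an automated unit test using Python's built-in unittest module. The parse_port function is a hypothetical example written for this sketch, not part of any particular system.

```python
import unittest


def parse_port(text):
    """Parse a TCP port number from a string, rejecting out-of-range values."""
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError("port out of range: %d" % port)
    return port


class ParsePortTest(unittest.TestCase):
    def test_valid_port(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_rejects_out_of_range(self):
        with self.assertRaises(ValueError):
            parse_port("70000")

    def test_rejects_non_numeric(self):
        with self.assertRaises(ValueError):
            parse_port("http")


if __name__ == "__main__":
    unittest.main()
```

Run automatically on every change (for example as part of the build), a suite of such tests provides the confidence described above that nothing has been broken.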
Hardening
For larger systems or systems with more valuable data, these capabilities have particular importance. They are more expensive to implement than the Basics, so are often omitted from smaller systems. They can also be very difficult to retrofit to a system that was not originally designed with them in mind.
- Secure: Systems must be protected against unauthorized use.
- Access control: the first line of defense.
- Physical: anyone who has physical access to the system can cause problems. Console access is often not secured as well as remote access, and even if the person can't get into the software, he can still disconnect or physically damage the system. Depending on the size of the system and the value of its data and services, this could mean physical locks on machines, locked rooms, or secure data centers with controlled access, including ID cards with photos and signatures, and biometrics such as a handprint or retina scan.
- Network: the system should include firewalls to limit access to only the services the system is designed to provide. For more secure systems, IP address filtering, VPN connections, or TLS certificate requirements can limit who can get to the system.
- Authentication: users must be identified. This is typically done by requiring a username and password. More secure systems can require biometrics, such as a thumb scan, or a physical device such as an RFID security card. Every user should have a separate account so that it is possible to track exactly who is logging in to the system.
- Authorization: once a user is identified through the authentication process, the system can load up the privileges assigned to that user. Each user should have only the privileges necessary to do his job. Giving everyone complete access to the system increases the probability that someone will do something he should not do, whether intentionally or by accident.
- Auditing: the system should record all pertinent information to allow a definitive determination of history for all significant security events. Details should include at a minimum what was done, who did it, and when it happened.
- Internal checks (against bugs and attacks): for improved security, systems should run self-checks to ensure that critical parts of the system continue to run correctly. Ideally the self-check applications should be able to monitor themselves (or each other if more than one) to guard against problems within the self-check programs (including malicious users attempting to circumvent the self-checks).
- Resource management/limits: the system should include mechanisms to limit the resources consumed by each user, including disk space and CPU usage. In addition to allowing fairer use of the system by preventing one user from hogging resources, these mechanisms help prevent DoS (denial of service) attacks.
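As one minimal sketch of the resource-limits idea, assuming a Unix host where each user's work runs in its own process, Python's standard resource module can cap CPU time and memory before the work starts. The specific numbers are arbitrary examples; production systems more often rely on OS facilities such as disk quotas or cgroups.

```python
import resource

# Example limits: cap CPU time at 10 seconds and address space at 256 MB
# for the current process. The values here are placeholders.
CPU_SECONDS = 10
MEMORY_BYTES = 256 * 1024 * 1024


def apply_limits():
    # setrlimit takes (soft, hard) limits. Exceeding the soft CPU limit
    # delivers SIGXCPU; exceeding the hard limit terminates the process.
    resource.setrlimit(resource.RLIMIT_CPU, (CPU_SECONDS, CPU_SECONDS))
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_BYTES, MEMORY_BYTES))


if __name__ == "__main__":
    apply_limits()
    # ... run the per-user task here, now subject to the limits above ...
```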
- Robust: Nothing is perfect. A well designed system takes into account
many different kinds of possible failure.
The probability of failure cannot be completely eliminated, but it can
be made very small given sufficient resources.
The cost of failure must be compared to the cost of building a system
with a sufficiently low probability of failure.
- Redundant: the system should have no (or minimal) single points of failure. Redundancy can be applied at many levels: a single box can have multiple power supplies, multiple CPUs, multiple disks, and multiple network connections. Within one data center there can be multiple boxes and multiple network routers, with battery backup power. For maximum redundancy there can be multiple geographically separated data centers with multiple network routes connecting them.
- Diverse: monocultures are more susceptible to damage. Just as with biological systems, diversity provides protection against problems that affect a "species". Sharing responsibilities among different operating systems and different applications provides defense against viruses that attack specific operating systems and applications, and against bugs in those components. This aspect of robustness can be very expensive, so is not often considered.
- Forgiving (fault-tolerant): the larger the system, the higher the probability that it will have some problems, including bugs in the software. The system should tolerate small problems in the parts; a small problem should remain a small problem, not be amplified by cascading failures into a larger problem.
- Self-correcting: Self-monitoring can be done at multiple levels, from memory parity up to application-level system checks. More sophisticated techniques allow for automatic correction of errors, such as Reed-Solomon coding instead of simple memory parity. Care must be taken to ensure that the probability of failure or error in the error-correcting step itself is lower than that of the parts it is monitoring.
- Scalable: If you expect usage of the system to grow over time,
your design should allow incremental expansion of the system
and should continue to perform well at the new usage levels.
Scalability should be considered at multiple levels, depending on
how far you need to scale.
When scaling up by adding more identical units,
you also get the benefits of redundancy, because
with more units, the portion that you must set aside purely for
redundancy can be reduced.
Stated another way, effective redundancy can be added to a system
much less expensively when that system is already using
a collection of identical units for scaling purposes.
- Scalable algorithms: algorithms should have appropriate big-O performance (one of the sketches after this list shows the difference a better algorithm can make). Attention should be paid to word size limits to prevent overflow on large data sets.
- Resource-tight: even a small memory leak can bring down an application when there is enough usage. The system should be tested under load and monitored to ensure it does not run out of memory, file descriptors, database connections, or other resources that could leak from poor coding.
- Parallelizable to multiple threads or processes: the first step in spreading out to multiple units is to use multiple threads or processes on one machine. Areas of concern are shared memory and concurrency: deadlock or livelock, excessive blocking due to excessive synchronization, and stale or corrupt data due to insufficient synchronization. On a multiprocessor machine, be sure you understand the memory model, and be aware of possible cache coherence problems that can arise if code is not properly synchronized. A minimal synchronization example appears in the sketches after this list.
- Parallelizable to multiple machines: when one machine is not enough, scaling up to multiple machines clustered in one location is the next step. For this level of parallelization, typical issues of concern are network bandwidth limits, how to distribute data for improved throughput, and what to do when one host in the cluster is down.
- Geo-redundant: for the largest applications, having multiple data centers in geographically separated locations provides redundancy and disaster security, as well as sometimes providing improved performance as compared to a single location because of reduced network delays. Typical issues are dealing with network latency, managing routing between data centers, and data replication between data centers for improved performance and redundancy.
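This sketch illustrates the scalable-algorithms item with a hypothetical duplicate-detection task: both functions do the same job, but the hash-set version scales roughly linearly where the nested-loop version grows quadratically with the size of the input.

```python
def has_duplicates_quadratic(items):
    # O(n^2): compares every pair; fine for tiny inputs, hopeless for millions.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    # O(n) expected: one pass, remembering items seen so far in a hash set.
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```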
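This sketch illustrates the synchronization concern from the multiple-threads item: a lock makes the read-modify-write of a shared counter atomic, so concurrent increments are not lost. It is a generic illustration, not a recommendation for any particular design.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def worker(iterations):
    global counter
    for _ in range(iterations):
        # The lock makes the read-modify-write of the shared counter atomic;
        # without it, two threads can read the same value and lose an update.
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; can be less if the lock is removed
```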
Business
In theory you could build a great system without considering these issues, but in practice you had better pay attention to them, else your business will not last very long.
- Affordable: The art of engineering includes balancing costs against
other qualities.
The following costs should be considered:
- Design and development: the labor costs of building the system.
- Hardware and external software: the purchase costs of the system.
- Maintenance: the ongoing costs of repairing and upgrading the system.
- Operational: the data center costs of operation.
- Timely: When a product hits the market can determine whether or not
it succeeds.
When a product is scheduled to hit the market should be factored into
the design, both in terms of how much time is available for development,
and in terms of what the expected environment will be like then.
- Soon enough: taking too long to get to market is a well-known concern.
- Late enough: the first to market is not always the most successful. Sometimes it is better to let someone else bear the costs of opening up the market.
- Available support and parts: a good design will plan on using parts that will be cost effective when the product is shipping, which is not necessarily the same as what is cost effective when the product is being designed. This requires predicting the future, so can be tricky to get right.
- Operational: If the system is successful, it may be used for a long
time or by many people, amplifying the value of making the system
easy to operate.
- Visibility: the operator should be able to easily verify that
the system is functioning properly.
This includes being able to determine what level of resources
is being used and whether the system is being attacked.
All of the following pieces of information should be
readily available to the operator:
- What version of what component is running where.
- What is happening now.
- What has happened recently.
- Who is accessing now or recently.
- Performance and other resource statistics.
- Usage statistics (e.g. what function is most popular).
- Error and warning conditions.
- Debugging information (e.g. logging).
- Post-mortem info (for bug fixes).
- Control: the operator should be able to adjust aspects of
the system as appropriate to maintain its proper operation and
to protect its security.
This includes being able to do the following:
- Start and stop processes, applications or hosts.
- Manage access control, including adding, removing or modifying accounts, privileges and resource limits for users and groups.
- Enable and disable features.
- Perform maintenance.
- High availability: for systems that require high availability once
put into production,
in addition to the robustness items listed above
the following operations should be possible:
- Piecemeal upgrades or replacement of components while running.
- Ability to run with mixed versions of components (soft rollout), particularly on a system with multiple copies of the same component.
- Granular control: for optimal resource management and flexibility
in managing the business model of usage and payment,
the system should support these capabilities:
- Per-user restrictions, e.g. limiting access by time of day, or stopping service when a prepaid balance runs out.
- Ability to charge customers per transaction or per other metrics (CPU usage, disk usage, number of calls, etc.).
- Image: you might not be able to judge a book by its cover,
and beauty is only skin deep, but people respond to how things
look, so it is important to remember to maintain consistency
and quality in these areas:
- Branding: consistent look and feel concerning logos, company colors, slogans, etc.
- Beauty: too often not included in software products.
- Backups: any system with data of any value should be backed up.
This is valid even for small systems, such as your
laptop or home network.
- You should have a backup schedule based on how much you would be bothered by losing your data. For a simple system, a regular schedule of full backups plus an automated incremental backup should suffice and is generally relatively easy to set up. For a larger system, you might want a full backup less often (due to the amount of data) with a layered schedule of weekly and daily backups.
- Backup media should be validated. At a minimum, the backup media should be checked to make sure it contains the data that was intended to be written to it. For a more complete test, the backup data should be used to recreate a system in the same manner as would need to be done following a disaster. A small checksum-based validation sketch appears after this list.
- Copies of backups should be stored off-site for disaster security. If your backups are stored on-site and the building is destroyed, those backups won't do you much good.
- Sensitive data on backup media should be encrypted to prevent its use in the event that media is stolen.
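As a minimal sketch of backup validation, assuming the backup is a plain file-tree copy and that the directory paths shown are placeholders, the following compares SHA-256 checksums of source files against their backup copies. As noted above, a fuller test would actually restore a system from the backup.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    # Hash the file in 1 MB chunks so large files do not exhaust memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Report files whose backup copy is missing or differs from the source."""
    problems = []
    for src in Path(source_dir).rglob("*"):
        if not src.is_file():
            continue
        copy = Path(backup_dir) / src.relative_to(source_dir)
        if not copy.exists():
            problems.append(("missing", src))
        elif sha256_of(src) != sha256_of(copy):
            problems.append(("differs", src))
    return problems


if __name__ == "__main__":
    # Example paths; substitute the real source and backup locations.
    for kind, path in verify_backup("/home/user/data", "/mnt/backup/data"):
        print(kind, path)
```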