flowchart LR
    A[Data Collection] --> B[Cleaning / Preprocess]
    B --> C[Analysis / Model]
    C --> D[Results]
    D --> E[Visualization / Reports]
Software design principles
Writing code is only one aspect of software design. The real challenge lies in creating systems that stand the test of time. Good software design produces systems that enable reproducible science and collaboration, and that remain usable and trustworthy over the long term. It provides a shared structure that allows researchers to build on each other’s work rather than starting from scratch.
Why software architecture matters in research
Scientific research thrives on reproducibility and transparency. Software that is poorly structured or insufficiently documented risks becoming a “black box,” where results cannot be trusted or reused. A well-defined architecture helps to:
- Ensure reproducibility
By separating data handling, computation, and analysis, workflows can be rerun and verified.
- Enable collaboration
Clear modular design lowers the entry barrier for new contributors (students, collaborators, or other labs).
- Support extensibility
Research questions evolve; a good design makes it easier to test new hypotheses, add new algorithms, or integrate new data sources.
- Enhance sustainability
Many scientific codes outlive the projects they were written for. Sustainable architecture ensures they can be maintained and reused beyond the initial scope.
- Bridge disciplines
In multidisciplinary research, software is often used by scientists who are not software engineers. Transparent architecture helps communicate design decisions to a wide audience.
Step-by-step guide to designing research software
Designing architecture sounds abstract, but it can be done in simple steps. Think of it as planning your research workflow, but for code.
Step 1: Define your goals
- Focus on 3–5 goals (also called quality attributes), such as:
- Be reproducible (others can run it and get the same results)
- Be flexible (easy to test new ideas)
- Be efficient (runs in reasonable time on available computers)
- Be portable (runs on a laptop, a cluster, or the cloud)
- Write these goals down in plain language. They will guide your choices.
Step 2: Capture scope and constraints
- Research questions and hypotheses
- Data types, sizes, and growth expectations
- Compute targets (laptops, HPC, cloud)
- Licensing, ethics, and data sensitivity (e.g., GDPR)
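One way to keep scope and constraints visible is to store them next to the code as a small, typed record. The sketch below is only an illustration: the class `ProjectScope`, its field names, the 50 GB laptop limit, and the example values are all assumptions, not part of any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectScope:
    """Plain record of scope and constraints (all field names are hypothetical)."""

    research_questions: tuple
    data_formats: tuple
    current_data_gb: float
    yearly_growth_factor: float   # 2.0 means the data doubles every year
    compute_targets: tuple        # e.g. "laptop", "hpc", "cloud"
    software_license: str
    contains_personal_data: bool  # True means GDPR-style rules may apply

    def projected_size_gb(self, years):
        """Estimate the data volume after `years` of compound growth."""
        return self.current_data_gb * self.yearly_growth_factor ** years

    def open_questions(self, laptop_limit_gb=50.0, horizon_years=3):
        """List constraints that deserve a design decision early on."""
        notes = []
        if self.contains_personal_data:
            notes.append("Personal data present: check GDPR obligations before sharing.")
        too_big = self.projected_size_gb(horizon_years) > laptop_limit_gb
        if "laptop" in self.compute_targets and too_big:
            notes.append("Projected data size exceeds the laptop limit: plan for HPC or cloud.")
        return notes


# Placeholder values for illustration only.
scope = ProjectScope(
    research_questions=("placeholder research question",),
    data_formats=("CSV", "NetCDF"),
    current_data_gb=10.0,
    yearly_growth_factor=2.0,
    compute_targets=("laptop", "hpc"),
    software_license="MIT",
    contains_personal_data=True,
)

for note in scope.open_questions():
    print(note)
```

Writing the constraints down in this form makes it harder to forget them later, and the file can be reviewed like any other part of the repository.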
Step 3: Draft a minimal architecture
- Sketch the big steps of your research pipeline, identify modules, and define how data flows between them. For example:
- Data collection
- Cleaning and preprocessing
- Analysis or modeling
- Producing results
- Visualization and reporting
- Once you have these steps, make the design more concrete:
- Identify modules: break the workflow into parts such as ingestion → preprocess → model → analyze → visualize.
- Define data flow: show how inputs and outputs move between steps (what comes in, what goes out).
- Describe interfaces: note what format or structure each step expects (e.g., “CSV file”, “JSON config”, “NetCDF dataset”).
- Draw a simple diagram: use a quick sketch, a tool like Mermaid or PlantUML, or a C4 diagram (described in more detail below).
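The modules, data flow, and interfaces listed above can also be written down as data, which lets a short script check that each step produces what the next one expects. This is a minimal sketch: the `Stage` class, the format labels, and the function `interface_mismatches` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One module of the pipeline and the formats at its boundaries."""

    name: str
    consumes: str  # format the stage expects as input
    produces: str  # format the stage hands to the next stage


# Hypothetical format labels; replace them with your real file or object types.
PIPELINE = [
    Stage("ingestion", consumes="raw CSV file", produces="raw table"),
    Stage("preprocess", consumes="raw table", produces="clean table"),
    Stage("model", consumes="clean table", produces="fitted model"),
    Stage("analyze", consumes="fitted model", produces="results table"),
    Stage("visualize", consumes="results table", produces="figures"),
]


def interface_mismatches(stages):
    """Return a message for every pair of neighbours whose formats disagree."""
    problems = []
    for upstream, downstream in zip(stages, stages[1:]):
        if upstream.produces != downstream.consumes:
            problems.append(
                f"{upstream.name} -> {downstream.name}: "
                f"produces {upstream.produces!r}, expects {downstream.consumes!r}"
            )
    return problems


print(interface_mismatches(PIPELINE))
```

An empty list means every hand-off is consistent on paper. The check says nothing about the code itself, but it catches design gaps before any module is implemented.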
Step 4: Build a small working version
- Instead of trying to design everything at once, create a very simple version of your software that runs from start to finish.
- Use a small dataset and pass it through all the steps (data → preprocess → analysis → results).
- This “end-to-end slice” shows that your design works in practice and helps uncover problems early.
- It also gives collaborators something concrete they can run, test, and discuss.
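An end-to-end slice can be very small. The sketch below passes four made-up rows through preprocessing, analysis, and reporting using only the Python standard library. The dataset, the function names, and the choice of mean and standard deviation as the "analysis" are assumptions for illustration.

```python
import statistics

# Tiny made-up dataset, small enough to check by hand.
SMALL_DATASET = [
    {"sample": "a", "value": "1.0"},
    {"sample": "b", "value": "3.0"},
    {"sample": "c", "value": ""},  # missing measurement
    {"sample": "d", "value": "5.0"},
]


def preprocess(rows):
    """Drop rows with a missing value and convert the rest to float."""
    return [float(row["value"]) for row in rows if row["value"].strip()]


def analyze(values):
    """Compute a few summary statistics (stand-in for the real model)."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def report(results):
    """Format the results as one line of text (stand-in for figures or tables)."""
    return f"n={results['n']} mean={results['mean']:.2f} stdev={results['stdev']:.2f}"


def run_pipeline(rows):
    """Run every step in order: data -> preprocess -> analysis -> results."""
    return report(analyze(preprocess(rows)))


print(run_pipeline(SMALL_DATASET))  # n=3 mean=3.00 stdev=2.00
```

Each function can later be replaced by a real module without changing `run_pipeline`, so the slice doubles as a skeleton for the full software.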
Step 5: Automate and record what happens
- Once the small version works, make it easy to repeat and share:
- Put your notes and diagrams in the same repository as the code.
- Add a short test dataset that can be processed quickly (so others can check their setup).
- Automate simple checks (e.g., run the pipeline on the test data after each change).
- Record important details automatically, such as:
- the software version
- the parameters used
- the input data files
- These details make it possible for you and others to reproduce results later, which is central to good science.
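The details listed above can be captured with a few lines of standard-library Python. This is a sketch under stated assumptions: `SOFTWARE_VERSION` is a hard-coded placeholder (in practice it might come from package metadata or a git tag), and the record layout and function names are hypothetical.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from pathlib import Path

SOFTWARE_VERSION = "0.1.0"  # placeholder; read it from your package or git in practice


def file_sha256(path):
    """Return the SHA-256 checksum of a file so inputs can be verified later."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_run_record(parameters, input_files):
    """Collect version, parameters, and input checksums for one pipeline run."""
    return {
        "software_version": SOFTWARE_VERSION,
        "python_version": platform.python_version(),
        "started_at": datetime.now(timezone.utc).isoformat(),
        "parameters": dict(parameters),
        "inputs": {str(path): file_sha256(path) for path in input_files},
    }


def write_run_record(record, path):
    """Store the record as JSON next to the results."""
    Path(path).write_text(json.dumps(record, indent=2, sort_keys=True))
```

Calling `build_run_record` at the start of every run and saving the result beside the output files gives each result a traceable origin. Checksums are used instead of file names alone because a file can change while keeping its name.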
The C4 Model for research software design
When explaining software to collaborators, it is easy to give either too much detail or too little. The C4 model provides a structured way to show software architecture at different levels, from the big picture down to the finer details. It is widely used in both industry and academia because it adapts well to different audiences: principal investigators, collaborators, new students, and developers.
The “C4” stands for Context, Containers, Components, and Code. Think of it as zooming in on your software with a microscope: each level shows more detail.
Level 1: System context
This is the big picture.
It shows how your research software fits into the overall scientific workflow:
- Who uses it (e.g., researchers, students, automated pipelines)
- What external systems or tools it connects to (e.g., databases, HPC cluster, external datasets)
- What the software produces (e.g., processed datasets, figures, reports)
This view is useful for project proposals, publications, and discussions with collaborators who don’t need technical details.
Here is how a simple research workflow could be shown at Level 1 (System Context):
flowchart TB
    Researcher -->|provides data + config| Software[Research Software System]
    Software -->|results, figures, reports| Researcher
    Software -->|stores results| Repository[(Data/Results Repository)]
    ExternalDB[(External Database)] -->|datasets| Software
Level 2: Containers
This level zooms in to show the main building blocks inside your system.
A “container” here does not mean Docker (though it could). It simply means a large part of your software that can run on its own. Examples:
- A command-line tool
- A database or data store
- A web interface or API
- A workflow engine (e.g., Snakemake, Nextflow)
This view shows how these pieces talk to each other and what responsibilities each one has.
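A container view can be kept as plain data and turned into a Mermaid diagram by a short script, so the diagram stays in step with the written description. The sketch below is illustrative only: the container keys, the relationship labels, and the function `to_mermaid` are assumptions, not part of the C4 model itself.

```python
# Hypothetical containers for a small research project.
CONTAINERS = {
    "cli": "Command-line tool",
    "workflow": "Workflow engine",
    "store": "Data store",
    "api": "Web interface / API",
}

# (source, target, responsibility) triples describing who talks to whom.
RELATIONSHIPS = [
    ("cli", "workflow", "starts runs"),
    ("workflow", "store", "reads and writes datasets"),
    ("api", "store", "serves results"),
]


def to_mermaid(containers, relationships):
    """Render containers and their relationships as Mermaid flowchart text."""
    lines = ["flowchart LR"]
    for key, label in containers.items():
        lines.append(f'    {key}["{label}"]')
    for source, target, label in relationships:
        # Refuse arrows that point at a container nobody declared.
        if source not in containers or target not in containers:
            raise KeyError(f"unknown container in relationship: {source} -> {target}")
        lines.append(f"    {source} -->|{label}| {target}")
    return "\n".join(lines)


print(to_mermaid(CONTAINERS, RELATIONSHIPS))
```

The printed text can be pasted into any Mermaid renderer. Keeping the data in the repository means a change to the architecture shows up as a reviewable diff.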
Level 3: Components
Each container can be broken down into smaller parts called components.
For example, inside an “analysis” container, you might have:
- A data loader
- A statistical model
- A visualization module
This view is useful for developers and contributors who need to understand how a part of the system works internally.
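The three example components could look like this in code. This is a minimal sketch, assuming CSV input with `x` and `y` columns, an ordinary least-squares line as the "statistical model", and a text bar chart as the "visualization". All class names are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


class CsvDataLoader:
    """Component: turns CSV text with x and y columns into (x, y) pairs."""

    def load(self, text):
        reader = csv.DictReader(io.StringIO(text))
        return [(float(row["x"]), float(row["y"])) for row in reader]


@dataclass
class LinearFit:
    slope: float
    intercept: float


class LeastSquaresModel:
    """Component: fits a straight line by ordinary least squares."""

    def fit(self, points):
        mean_x = sum(x for x, _ in points) / len(points)
        mean_y = sum(y for _, y in points) / len(points)
        covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
        variance = sum((x - mean_x) ** 2 for x, _ in points)
        slope = covariance / variance
        return LinearFit(slope=slope, intercept=mean_y - slope * mean_x)


class TextBarChart:
    """Component: draws one row of '#' characters per data point."""

    def render(self, points):
        return "\n".join(f"{x:>5.1f} | {'#' * round(y)}" for x, y in points)


class AnalysisContainer:
    """Wires the three components together; each can be swapped independently."""

    def __init__(self, loader, model, visualizer):
        self.loader = loader
        self.model = model
        self.visualizer = visualizer

    def run(self, text):
        points = self.loader.load(text)
        return self.model.fit(points), self.visualizer.render(points)


analysis = AnalysisContainer(CsvDataLoader(), LeastSquaresModel(), TextBarChart())
fit, chart = analysis.run("x,y\n0,1\n1,3\n2,5\n")
print(fit)
print(chart)
```

Because the container only calls `load`, `fit`, and `render`, a contributor can replace one component (for example, a loader for NetCDF files) without touching the other two.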
Level 4: Code (optional)
At the deepest level, you can describe the code structure itself (e.g., class diagrams, functions, or modules).
This level is not always needed, but for critical parts of the software, it can help with onboarding new developers or ensuring correctness.