Uploads guide: How uploads work technically
This page is for developers trying to better understand what kinds of uploads exist in GitLab and how they are implemented.
Kinds of uploads and how to choose between them
We can identify three major use-cases for an upload:
- storage: if we are uploading for storing a file (like artifacts, packages, or discussion attachments). In this case direct upload is the proper level as it's the less resource-intensive operation. Additional information can be found on File Storage in GitLab.
- in-controller/synchronous processing: if we allow processing small files synchronously, using disk buffered upload may speed up development.
- Sidekiq/asynchronous processing: Asynchronous processing must implement direct upload, the reason being that it's the only way to support Cloud Native deployments without a shared NFS.
Selecting the proper acceleration is a tradeoff between speed of development and operational costs.
For more details about currently broken feature see epic &1802.
Handling repository uploads
Some features involves Git repository uploads without using a regular Git client. Some examples are uploading a repository file from the web interface and design management.
Those uploads requires the rails controller to act as a Git client in lieu of the user. Those operation falls into in-controller/synchronous processing category, but we have no warranties on the file size.
In case of a LFS upload, the file pointer is committed synchronously, but file upload to object storage is performed asynchronously with Sidekiq.
Upload encodings
By upload encoding we mean how the file is included within the incoming request.
We have three kinds of file encoding in our uploads:
-
multipart:
multipart/form-data
is the most common, a file is encoded as a part of a multipart encoded request. - body: some APIs uploads files as the whole request body.
- JSON: some JSON APIs upload files as base64-encoded strings. This requires a change to GitLab Workhorse, which is tracked in this issue.
Uploading technologies
By uploading technologies we mean how all the involved services interact with each other.
GitLab supports 3 kinds of uploading technologies, here follows a brief description with a sequence diagram for each one. Diagrams are not meant to be exhaustive.
Rack Multipart upload
This is the default kind of upload, and it's the most expensive in terms of resources.
In this case, Workhorse is unaware of files being uploaded and acts as a regular proxy.
When a multipart request reaches the rails application, Rack::Multipart
leaves behind temporary files in /tmp
and uses valuable Ruby process time to copy files around.
sequenceDiagram
participant c as Client
participant w as Workhorse
participant r as Rails
activate c
c ->>+w: POST /some/url/upload
w->>+r: POST /some/url/upload
r->>r: save the incoming file on /tmp
r->>r: read the file for processing
r-->>-c: request result
deactivate c
deactivate w
Disk buffered upload
This kind of upload avoids wasting resources caused by handling upload writes to /tmp
in rails.
This optimization is not active by default on REST API requests.
When enabled, Workhorse looks for files in multipart MIME requests, uploading any it finds to a temporary file on shared storage. The MIME data in the request is replaced with the path to the corresponding file before it is forwarded to Rails.
To prevent abuse of this feature, Workhorse signs the modified request with a special header, stating which entries it modified. Rails ignores any unsigned path entries.
sequenceDiagram
participant c as Client
participant w as Workhorse
participant r as Rails
participant s as NFS
activate c
c ->>+w: POST /some/url/upload
w->>+s: save the incoming file on a temporary location
s-->>-w: request result
w->>+r: POST /some/url/upload
Note over w,r: file was replaced with its location<br>and other metadata
opt requires async processing
r->>+redis: schedule a job
redis-->>-r: job is scheduled
end
r-->>-c: request result
deactivate c
w->>-w: cleanup
opt requires async processing
activate sidekiq
sidekiq->>+redis: fetch a job
redis-->>-sidekiq: job
sidekiq->>+s: read file
s-->>-sidekiq: file
sidekiq->>sidekiq: process file
deactivate sidekiq
end
Direct upload
This is the more advanced acceleration technique we have in place.
Workhorse asks Rails for temporary pre-signed object storage URLs and directly uploads to object storage.
In this setup, an extra Rails route must be implemented in order to handle authorization. Examples of this can be found in:
Direct upload falls back to disk buffered upload when direct_upload
is disabled inside the object storage setting.
The answer to the /authorize
call contains only a file system path.
sequenceDiagram
participant c as Client
participant w as Workhorse
participant r as Rails
participant os as Object Storage
activate c
c ->>+w: POST /some/url/upload
w ->>+r: POST /some/url/upload/authorize
Note over w,r: this request has an empty body
r-->>-w: presigned OS URL
w->>+os: PUT file
Note over w,os: file is stored on a temporary location. Rails select the destination
os-->>-w: request result
w->>+r: POST /some/url/upload
Note over w,r: file was replaced with its location<br>and other metadata
r->>+os: move object to final destination
os-->>-r: request result
opt requires async processing
r->>+redis: schedule a job
redis-->>-r: job is scheduled
end
r-->>-c: request result
deactivate c
w->>-w: cleanup
opt requires async processing
activate sidekiq
sidekiq->>+redis: fetch a job
redis-->>-sidekiq: job
sidekiq->>+os: get object
os-->>-sidekiq: file
sidekiq->>sidekiq: process file
deactivate sidekiq
end