Security 101: Securing file downloads (Screaming At My Screen)
One of the most common ways to handle user-uploaded content is persisting the data to disk, or uploading it to an object store like AWS S3. Serving the content back to the user (or others) is often handled by returning the URL to the file. What is often missing is proper authentication and authorization; engineers seem to believe no one will leak URLs, run enumeration attacks or simply try random strings. This is not just a data breach waiting to happen, it is one happening far too often.
In this post we will look at three options for solving this. The examples, which you can find in the demo repository, are written in Python, using Django. All three should work just fine in basically any modern language and framework used for web development, and with most web servers and reverse proxies such as Nginx. I am using Caddy, as the configuration is concise and simple to follow.
For all examples you can upload a file via Django Admin and browse and download the files by visiting /.
All examples only check if the user is authenticated. In a real system you will most likely want to extend the check a bit, to make sure the user is actually authorised to view the file they are trying to download.
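As a sketch of what such a check could look like, assuming a hypothetical owner field on the Upload model (the demo models do not have one), the rule boils down to a simple comparison. The dataclasses below are stand-ins for the real Django objects; in a view you would compare request.user against upload.owner instead.

```python
from dataclasses import dataclass


# Stand-ins for the Django user and Upload model; names and fields
# are illustrative, not taken from the demo repository.
@dataclass
class User:
    id: int
    is_anonymous: bool = False


@dataclass
class Upload:
    pk: int
    owner_id: int


def user_may_download(user: User, upload: Upload) -> bool:
    """Anonymous users get nothing; everyone else only their own files."""
    if user.is_anonymous:
        return False
    return upload.owner_id == user.id
```

The same shape works for more elaborate rules (team membership, roles); the important part is that the check runs on every download request, not just when the URL is handed out.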
This is by far the easiest implementation, with the least moving parts. The downside is that you block one of your application server's workers while the client is downloading the file. For extremely small setups, running enough workers (via gunicorn or something similar) might be fine, but the moment you get more than a dozen users, this approach might bring your system to a halt.
The basic idea is that you read the content of the file, assign it to HttpResponse.content and set the file name. That is it, your users can now download files.
from django.http import HttpRequest, HttpResponse
from django.shortcuts import get_object_or_404

from .models import Upload


def get_file(request: HttpRequest, pk: int) -> HttpResponse:
    if request.user.is_anonymous:
        raise PermissionError("nope")
    upload = get_object_or_404(Upload, pk=pk)
    response = HttpResponse()
    response.content = upload.file.read()
    response["Content-Disposition"] = 'attachment; filename="{0}"'.format(
        upload.file.name
    )
    return response
Your application server logs show the request and response for browsing files, and the one for downloading the data, in this case a 46 kB JPEG.
[16/Jan/2022 20:10:52] "GET / HTTP/1.1" 200 299
[16/Jan/2022 20:10:53] "GET /2 HTTP/1.1" 200 46287
It might be a shock to some of you, but people still run applications on servers or single virtual servers. They even store data locally on a disk. Not everyone migrated to AWS, GCP or Azure yet. This is actually a shockingly cheap and performant solution.
If you go this route, you likely have a reverse proxy set up, terminating SSL and forwarding requests to your application server. Web servers are remarkably efficient at serving files and do a far better job than single-threaded application servers. Even for applications which could theoretically handle the load, letting your web server or proxy deal with files and keeping resources free to run your actual business logic is surely a welcome idea.
To make this happen we build a response as before. But this time, instead of adding the actual data to serve, we set the X-Accel-Redirect header, pointing to the file relative to our media directory. Our proxy will recognise the header and serve the file specified.
from django.conf import settings
from django.http import HttpRequest, HttpResponse
from django.shortcuts import get_object_or_404

from .models import Upload


def get_file(request: HttpRequest, pk: int) -> HttpResponse:
    if request.user.is_anonymous:
        raise PermissionError("nope")
    upload = get_object_or_404(Upload, pk=pk)
    fqp = upload.file.path
    rel = fqp.split(settings.MEDIA_ROOT)[1]
    response = HttpResponse()
    response["X-Accel-Redirect"] = rel
    return response
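One caveat: the str.split approach breaks if MEDIA_ROOT ever appears twice in the path, and whether the result carries a leading slash depends on MEDIA_ROOT ending in one. A slightly more robust variant (the helper name is mine, not from the demo repository) uses os.path.relpath:

```python
import os


def media_relative_path(path: str, media_root: str) -> str:
    # Works whether or not media_root has a trailing slash, and always
    # returns the leading "/" the rewrite in the proxy config expects.
    return "/" + os.path.relpath(path, media_root)


print(media_relative_path("/home/timo/fddemo/src/media/demo.jpg",
                          "/home/timo/fddemo/src/media"))
# -> /demo.jpg
```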
As a reverse proxy I am becoming a big fan of Caddy. It is fast to configure, scalable enough, and I can actually write a configuration without studying the documentation for nearly every line I type.
:8080 {
    log {
        format single_field common_log
    }
    reverse_proxy 127.0.0.1:8000 {
        @accel header X-Accel-Redirect *
        handle_response @accel {
            root * /home/timo/fddemo/src/media/
            rewrite {http.reverse_proxy.header.X-Accel-Redirect}
            file_server
        }
    }
}
We run the proxy on port 8080 and forward requests to 8000, Django's default port for the development server. If the response contains the X-Accel-Redirect header, we serve the file from our media directory.
Our application server shows zero bytes for the response body.
[16/Jan/2022 20:19:44] "GET / HTTP/1.1" 200 238
[16/Jan/2022 20:19:45] "GET /1 HTTP/1.1" 200 0
Our reverse proxy, on the other hand, shows the actual content being served for the path /1.
10.211.55.2 - - [16/Jan/2022:21:23:39 +0100] "GET /1 HTTP/1.1" 200 46287
If we call the URL without being signed in, we get the 500 permission error we would expect.
Internal Server Error: /1
Traceback (most recent call last):
File "/home/timo/fddemo/venv/lib/python3.9/site-packages/django/core/handlers/exception.py", line 47, in inner
response = get_response(request)
File "/home/timo/fddemo/venv/lib/python3.9/site-packages/django/core/handlers/base.py", line 181, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/home/timo/fddemo/src/caddy/views.py", line 16, in get_file
raise PermissionError("nope")
PermissionError: nope
[16/Jan/2022 20:24:27] "GET /1 HTTP/1.1" 500 56345
Using some form of object storage like AWS S3 is an easy and cheap way to store lots of data. What I have seen far too often, though, is people configuring their object storage to be world readable, to serve files from it directly. For truly public content this might be okay, but most data stored in world-readable buckets is not actually meant for public consumption.
Configuring your bucket as private or "not world readable" is a good start, but also means you have to get the data out of it somehow when you need to download it. You could proxy all data through your application server. Or you generate pre-signed URLs with a short expiration time.
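On S3 specifically, you can enforce "not world readable" at the bucket level with the public access block, so a stray ACL or bucket policy cannot undo it. A sketch with a placeholder bucket name:

```shell
# Deny public ACLs and public bucket policies for the bucket,
# regardless of how individual objects are configured.
aws s3api put-public-access-block \
    --bucket your-bucket \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
```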
Our example allows initiating a download for five seconds after the URL was generated. While this is still not peak security, it is far better than having your bucket world readable. Depending on the object storage, you might be able to generate true one time use URLs, or might be able to generate temporary users with access to exactly one specific file.
For this example you might want to create a new IAM user with GetObject and PutObject permissions for the S3 bucket you are going to use.
import boto3
from django.conf import settings
from django.http import HttpRequest, HttpResponse
from django.shortcuts import get_object_or_404, redirect

from .models import Upload


def get_file(request: HttpRequest, pk: int) -> HttpResponse:
    if request.user.is_anonymous:
        raise PermissionError("nope")
    upload = get_object_or_404(Upload, pk=pk)
    url = _presigned_url(upload.file.name)
    return redirect(url)


def _presigned_url(name: str, expire=5):
    """expire in seconds"""
    s3_client = boto3.client("s3")
    # throws Exception in case of error
    response = s3_client.generate_presigned_url(
        "get_object",
        Params={
            "Bucket": settings.AWS_STORAGE_BUCKET_NAME,
            "Key": name,
        },
        ExpiresIn=expire,
    )
    return response
You might go one step further and let your application server handle the redirect to mask S3 as a domain, but I will leave this exercise to those of you who truly care about the download URLs shown to your users.
[16/Jan/2022 20:36:03] "GET / HTTP/1.1" 200 242
[16/Jan/2022 20:36:12] "GET /1 HTTP/1.1" 302 0
We end up with lots of data in the URL we are redirected to. I removed some data in the example to make the important parts easier to spot.
https://fddemo.s3.amazonaws.com/s3/djauth_me.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=$snip&X-Amz-Date=20220116T203612Z&X-Amz-Expires=5&X-Amz-SignedHeaders=host&X-Amz-Signature=9
If we visit the same URL a bit later we get an error telling us the request expired and we do not get access to the data.
<Error>
<Code>AccessDenied</Code>
<Message>Request has expired</Message>
<X-Amz-Expires>5</X-Amz-Expires>
<Expires>2022-01-16T20:36:17Z</Expires>
<ServerTime>2022-01-16T20:37:46Z</ServerTime>
<RequestId>Q</RequestId>
<HostId>
NE
</HostId>
</Error>
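The timestamps in the error line up with the values baked into the signed URL: X-Amz-Date plus X-Amz-Expires gives exactly the Expires value S3 reports. A quick sanity check (the helper name is mine) reconstructs that arithmetic from the query string:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def presigned_expiry(url: str) -> datetime:
    # The signing timestamp plus the lifetime is the moment S3 starts
    # rejecting the URL with "Request has expired".
    qs = parse_qs(urlparse(url).query)
    signed_at = datetime.strptime(
        qs["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(qs["X-Amz-Expires"][0]))


url = "https://fddemo.s3.amazonaws.com/s3/djauth_me.jpg?X-Amz-Date=20220116T203612Z&X-Amz-Expires=5"
print(presigned_expiry(url))  # -> 2022-01-16 20:36:17+00:00
```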
There is no excuse to expose uploaded content to the world wide web without authentication and authorization checks. Your users expect their data to be kept safe and viewed only by people authorised to do so.
All three options we looked at should work across most tech stacks and cloud / hosting providers. Implementation details might vary, but equipped with the correct terminology, most documentation will guide you through the steps necessary to make this work.
You can find the source code for all three examples on GitHub.