Please note: although the information about the usage of S3 below is correct, the S3 gateway at AWI is still under construction and not yet available to users. However, if you have use cases, you may contact Pavan Kumar Siligam, who collects them.
There is a variety of S3 clients to choose from; here are a few that cover both command-line interaction with an S3 bucket and Python-scripting-based interaction.
The clients that are covered here are: aws (aws-shell), s3cmd, s3fs, and boto3.
The configuration of each of these tools differs slightly in its credential naming conventions.
First, set up the software stack using conda:
conda create -y -n s3 python=3.12
conda activate s3
pip install aws-shell
pip install s3cmd
conda install -y -c conda-forge s3fs boto3 python-magic pyyaml
It is not required to install everything listed above; installing only the tools you need also works.
Let's say the following information is provided by the system administrator:
URL:PORT        => https://hssrv2.dmawi.de:$PORT
region/location => bhv
ACCESS_KEY      => $GRP
SECRET_KEY      => $SECRET
CERTS_FILE      => https://spaces.awi.de/download/attachments/494210152/HSM_S3gw.cert.pem
These credentials need to be adapted to each client's expected configuration format, as shown below.
Please make sure to download the certificate file.
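The certificate can be fetched with a browser, with curl/wget, or with a few lines of Python; here is a minimal sketch using the download URL from the table above:

import urllib.request

# download the gateway certificate next to the client configuration files
CERT_URL = "https://spaces.awi.de/download/attachments/494210152/HSM_S3gw.cert.pem"
urllib.request.urlretrieve(CERT_URL, "HSM_S3gw.cert.pem")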
Use aws configure to adapt the credentials to this tool, or create the following files by hand (~/.aws/credentials for the keys and ~/.aws/config for the endpoint settings):
[default]
aws_access_key_id=$GRP
aws_secret_access_key=$SECRET
[default]
region = bhv
endpoint_url = https://hssrv2.dmawi.de:$PORT
ca_bundle = HSM_S3gw.cert.pem
Listing the buckets
> aws s3 ls
2024-04-06 01:11:30 testdir
> aws s3 ls s3://testdir
2024-04-06 01:11:30     385458 tmp.csv
s3cmd is a free command line tool and client for uploading, retrieving and managing data in Amazon S3 and other cloud storage service providers that use the S3 protocol.
s3cmd looks for credentials at ${HOME}/.s3cfg; create the config file as follows:
[default]
host_base = hssrv2.dmawi.de:$PORT
host_bucket = hssrv2.dmawi.de:$PORT
bucket_location = bhv
access_key = $GRP
secret_key = $SECRET
use_https = Yes
ca_certs_file = HSM_S3gw.cert.pem
Listing the buckets
> s3cmd ls
2024-04-06 01:11  s3://testdir
> s3cmd ls s3://testdir
2024-04-06 01:11       385458  s3://testdir/tmp.csv
upload a directory
> s3cmd sync --stats demo-airtemp/ s3://testdir/demo-airtemp/
Done. Uploaded 5569414 bytes in 62.8 seconds, 86.61 KB/s.
Stats: Number of files transferred: 306 (5569414 bytes)
> s3cmd ls s3://testdir/demo-airtemp
                          DIR  s3://testdir/demo-airtemp/
> s3cmd ls s3://testdir/demo-airtemp/
                          DIR  s3://testdir/demo-airtemp/air/
                          DIR  s3://testdir/demo-airtemp/lat/
                          DIR  s3://testdir/demo-airtemp/lon/
                          DIR  s3://testdir/demo-airtemp/time/
2024-04-07 15:57          307  s3://testdir/demo-airtemp/.zattrs
2024-04-07 15:57           24  s3://testdir/demo-airtemp/.zgroup
2024-04-07 15:57         3969  s3://testdir/demo-airtemp/.zmetadata
Note: the trailing forward slash (/) matters both when listing objects and when transferring files (sync) to S3.
s3fs is quite flexible with regard to the config file's name and format. Users are free to store their credentials in YAML, JSON, or any other format that is convenient for them to read and load; a JSON variant is sketched right after the YAML example below. Here the credentials are shown in YAML simply because it is a bit more reader-friendly.
key: $GRP
secret: $SECRET
client_kwargs:
  endpoint_url: https://hssrv2.dmawi.de:$PORT
  verify: HSM_S3gw.cert.pem
  region_name: bhv
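To illustrate the flexibility mentioned above, the same settings could just as well be stored as JSON (the file name ~/.s3fs.json is hypothetical) and loaded with the standard library:

import json
import os
import s3fs

# load the same credentials from a JSON file instead of YAML
with open(os.path.expanduser("~/.s3fs.json")) as fid:
    credentials = json.load(fid)

fs = s3fs.S3FileSystem(**credentials)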
Write a utility function to read the config file
import os
import yaml
import s3fs


def get_fs():
    with open(os.path.expanduser("~/.s3fs")) as fid:
        credentials = yaml.safe_load(fid)
    return s3fs.S3FileSystem(**credentials)
listing bucket
>>> fs = get_fs()
>>> fs.ls('testdir')
['testdir/demo-airtemp', 'testdir/tmp.csv']
>>>
>>> fs.ls('testdir/demo-airtemp')
['testdir/demo-airtemp/.zattrs',
 'testdir/demo-airtemp/.zgroup',
 'testdir/demo-airtemp/.zmetadata',
 'testdir/demo-airtemp/air',
 'testdir/demo-airtemp/lat',
 'testdir/demo-airtemp/lon',
 'testdir/demo-airtemp/time']
download file
>>> fs.get("testdir/demo-airtemp/.zattrs", "zattrs")
[None]
>>>
>>> # reading the local file `zattrs` to check if all bytes are transferred
>>> import json
>>> with open("zattrs") as fid:
...     content = json.load(fid)
...
>>> print(content)
{'Conventions': 'COARDS',
 'description': 'Data is from NMC initialized reanalysis\n'
                '(4x/day). These are the 0.9950 sigma level values.',
 'platform': 'Model',
 'references': 'http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html',
 'title': '4x daily NMC reanalysis (1948)'}
>>>
directly read a file from s3
>>> with fs.open("testdir/demo-airtemp/.zattrs", mode="rb") as f:
...     content = f.read().decode()
...     content = json.loads(content)
...
>>> print(content)
{'Conventions': 'COARDS',
 'description': 'Data is from NMC initialized reanalysis\n'
                '(4x/day). These are the 0.9950 sigma level values.',
 'platform': 'Model',
 'references': 'http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html',
 'title': '4x daily NMC reanalysis (1948)'}
>>>
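Uploading works the same way. A minimal sketch that mirrors the s3cmd sync example above (local demo-airtemp/ directory into the testdir bucket):

fs = get_fs()

# upload a single file into the bucket
fs.put("tmp.csv", "testdir/tmp.csv")

# upload a directory recursively (equivalent to the s3cmd sync example)
fs.put("demo-airtemp/", "testdir/demo-airtemp/", recursive=True)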
Further documentation: check out the s3fs API for function signatures and its documentation for more examples.
For boto3, save the credentials as follows (the user is free to choose a convenient file name and format):
service_name: s3
aws_access_key_id: $GRP
aws_secret_access_key: $SECRET
endpoint_url: https://hssrv2.dmawi.de:$PORT
region_name: bhv
verify: HSM_S3gw.cert.pem
Write a utility function to read the config file
import os
import yaml
import boto3


def get_connection():
    with open(os.path.expanduser("~/.s3fs_boto")) as fid:
        credentials = yaml.safe_load(fid)
    return boto3.client(**credentials)
Listing buckets and objects
>>> conn = get_connection()
>>> # Listing buckets
>>> print(conn.list_buckets())
{'Buckets': [{'CreationDate': datetime.datetime(2024, 4, 7, 15, 57, 46, 944296, tzinfo=tzoffset(None, 7200)),
              'Name': 'testdir'}],
 'Owner': {'DisplayName': '', 'ID': '$GRP'},
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'close',
                                      'content-length': '315',
                                      'content-type': 'application/xml',
                                      'date': 'Sun, 07 Apr 2024 21:50:03 GMT',
                                      'server': 'VERSITYGW'},
                      'HTTPStatusCode': 200,
                      'RetryAttempts': 0}}
>>>
>>> # filtering down the results just to show the bucket names
>>> for bucket in conn.list_buckets().get('Buckets'):
...     print(bucket['Name'])
...
testdir
>>> # Listing objects
>>> objs = conn.list_objects(Bucket='testdir')
>>> print(objs)
{'Delimiter': '',
 'EncodingType': '',
 'IsTruncated': False,
 'Marker': '',
 'MaxKeys': 1000,
 'Name': 'testdir',
 'NextMarker': '',
 'Prefix': '',
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'close',
                                      'content-length': '67702',
                                      'content-type': 'application/xml',
                                      'date': 'Sun, 07 Apr 2024 21:58:15 GMT',
                                      'server': 'VERSITYGW'},
                      'HTTPStatusCode': 200,
                      'RetryAttempts': 0},
 'Contents': [{'ETag': '5f0137574247761b438aa508333f487d',
               'Key': 'tmp.csv',
               'LastModified': datetime.datetime(2024, 4, 6, 1, 11, 30, 890787, tzinfo=tzoffset(None, 7200)),
               'Size': 385458,
               'StorageClass': 'STANDARD'},
              {'ETag': 'd776a1b6e8dc88615118832c552afd4c',
               'Key': 'demo-airtemp/lon/0',
               'LastModified': datetime.datetime(2024, 4, 7, 15, 58, 49, 37104, tzinfo=tzoffset(None, 7200)),
               'Size': 118,
               'StorageClass': 'STANDARD'},
              {'ETag': 'ffe3e35a2a10544db446cb5ffb64516b',
               'Key': 'demo-airtemp/time/.zarray',
               'LastModified': datetime.datetime(2024, 4, 7, 15, 58, 49, 410103, tzinfo=tzoffset(None, 7200)),
               'Size': 319,
               'StorageClass': 'STANDARD'},
              {'ETag': 'c3469e3ac4f2746bdb750335dbcd104a',
               'Key': 'demo-airtemp/time/.zattrs',
               'LastModified': datetime.datetime(2024, 4, 7, 15, 58, 49, 520103, tzinfo=tzoffset(None, 7200)),
               'Size': 172,
               'StorageClass': 'STANDARD'},
              ...
              ...
              {'ETag': '7c6e83fce9aa546ec903ca93f036a2fd',
               'Key': 'demo-airtemp/time/0',
               'LastModified': datetime.datetime(2024, 4, 7, 15, 58, 49, 630102, tzinfo=tzoffset(None, 7200)),
               'Size': 2549,
               'StorageClass': 'STANDARD'}]}
The output for listing the objects is truncated on purpose to avoid filling up this page. Unlike the other clients, botocore provides a lot of metadata about buckets and objects.
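Beyond listing, the same client can also transfer data. A minimal sketch using standard boto3 client calls; the file and key names are illustrative only, reusing the testdir bucket from above:

conn = get_connection()

# upload a local file as an object in the bucket
conn.upload_file("tmp.csv", "testdir", "tmp.csv")

# download an object to a local file
conn.download_file("testdir", "demo-airtemp/.zattrs", "zattrs")

# read an object directly into memory
response = conn.get_object(Bucket="testdir", Key="demo-airtemp/.zattrs")
print(response["Body"].read().decode())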
This is a brief introduction to S3, focused on getting to know some tools and how to configure them in order to talk to S3.
Additional information related to this topic can be found at https://pad.gwdg.de/WH0xt_MGTkitDxP3NAM7Xw?view
A talk on this topic is also available at https://docs.gwdg.de/lib/exe/fetch.php?media=en:services:application_services:high_performance_computing:coffee:a_brief_introduction_on_ceph_s3-compatible_object_storage_at_gwdg.mp4