...
```
conda create -y -n s3 python=3.12
conda activate s3
pip install aws-shell
pip install s3cmd
conda install -y -c conda-forge s3fs boto3 python-magic pyyaml
```
It is not necessary to install everything listed above; installing only the tools you need also works.
...
```
URL:PORT        => https://hssrv2.dmawi.de:635$PORT
region/location => bhv
ACCESS_KEY      => HPC_user$GRP
SECRET_KEY      => t1H13sOUBD/H7NuL$SECRET
CERTS_FILE      => https://spaces.awi.de/download/attachments/494210152/HSM_S3gw.cert.pem
```
...
Please make sure to download the certificate file.
aws
Use `aws configure` to adapt the credentials for this tool, or create the following files.
`~/.aws/credentials`:
```
[default]
aws_access_key_id=HPC_user$GRP
aws_secret_access_key=t1H13sOUBD/H7NuL$SECRET
```
`~/.aws/config`:
```
[default]
region = bhv
endpoint_url = https://hssrv2.dmawi.de:635$PORT
ca_bundle = /Users/pasili001/Downloads/HSM_S3gw.cert.pem  # < CORRECT ME >; using tilde (~) or $HOME in the path does *NOT* work
```
Listing the buckets
```
> aws s3 ls
2024-04-06 01:11:30 testdir
> aws s3 ls s3://testdir
2024-04-06 01:11:30     385458 tmp.csv
```
...
`~/.s3cfg`:
```
[default]
host_base = https://hssrv2.dmawi.de:635$PORT
host_bucket = https://hssrv2.dmawi.de:635$PORT
bucket_location = bhv
access_key = HPC_user$GRP
secret_key = t1H13sOUBD/H7NuL$SECRET
use_https = Yes
ca_certs_file = HSM_S3gw.cert.pem
```
Listing the buckets
```
> s3cmd ls
2024-04-06 01:11  s3://testdir
> s3cmd ls s3://testdir
2024-04-06 01:11    385458  s3://testdir/tmp.csv
```
upload a directory
```
> s3cmd sync --stats demo-airtemp/ s3://testdir/demo-airtemp/
Done. Uploaded 5569414 bytes in 62.8 seconds, 86.61 KB/s.
Stats: Number of files transferred: 306 (5569414 bytes)
> s3cmd ls s3://testdir/demo-airtemp
                       DIR  s3://testdir/demo-airtemp/
> s3cmd ls s3://testdir/demo-airtemp/
                       DIR  s3://testdir/demo-airtemp/air/
                       DIR  s3://testdir/demo-airtemp/lat/
                       DIR  s3://testdir/demo-airtemp/lon/
                       DIR  s3://testdir/demo-airtemp/time/
2024-04-07 15:57       307  s3://testdir/demo-airtemp/.zattrs
2024-04-07 15:57        24  s3://testdir/demo-airtemp/.zgroup
2024-04-07 15:57      3969  s3://testdir/demo-airtemp/.zmetadata
```
Note: the trailing forward slash `/` matters both when listing objects and when transferring files (`sync`) to S3.
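One way to see why the slash matters for listing: S3 has no real directories, only object keys, and "directories" are simulated by filtering keys on a string prefix. Without the trailing slash, a prefix can also match sibling names that merely start with the same characters. A stdlib-only sketch with hypothetical keys:

```python
# S3 stores flat keys; "directories" are just shared key prefixes.
# Hypothetical keys, loosely modeled on the demo-airtemp example above.
keys = [
    "demo-airtemp/.zattrs",
    "demo-airtemp/air/0",
    "demo-airtemp-old/air/0",  # a sibling that happens to share the prefix characters
]

def list_prefix(keys, prefix):
    """Return all keys that start with the given prefix string."""
    return [k for k in keys if k.startswith(prefix)]

# Without the trailing slash, the unrelated sibling also matches.
print(list_prefix(keys, "demo-airtemp"))
# With the trailing slash, only the intended "directory" matches.
print(list_prefix(keys, "demo-airtemp/"))
```

The key names and the `list_prefix` helper are illustrative, not part of any S3 client API.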
s3fs
- `s3fs` is a Python library to talk to S3.
- It builds on top of `botocore`.
- Parts of the library use `fsspec` to map file operations to S3.
It provides the following:
- `s3fs.S3Filesystem` for file system operations (ls, remove, du, ...)
- `s3fs.S3Map` for python dictionary like access (key --> value)
- `s3fs.S3File` for file-like object (read, write, seek, ...)
s3fs is flexible about the config file's name and format. Users are free to store their credentials in YAML, JSON, or any other format that is convenient for them to read and load. Here the credentials are shown in YAML simply because it is reader friendly.
```yaml
key: $GRP
secret: $SECRET
client_kwargs:
  endpoint_url: https://hssrv2.dmawi.de:$PORT
  verify: HSM_S3gw.cert.pem
  region_name: bhv
```
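Since the format is free, the same credentials could equally be kept as JSON and loaded with the standard library alone. A minimal sketch (the file is temporary and the credential values are the placeholders from above; the resulting dict would be passed to `s3fs.S3FileSystem(**credentials)` just like the YAML variant):

```python
import json
import tempfile

# Illustrative credentials, mirroring the YAML example above.
credentials = {
    "key": "$GRP",
    "secret": "$SECRET",
    "client_kwargs": {
        "endpoint_url": "https://hssrv2.dmawi.de:$PORT",
        "verify": "HSM_S3gw.cert.pem",
        "region_name": "bhv",
    },
}

# Write the credentials once...
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fid:
    json.dump(credentials, fid)
    path = fid.name

# ...and load them back; the round trip preserves the dict exactly.
with open(path) as fid:
    loaded = json.load(fid)

print(loaded == credentials)  # True
```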
Write a utility function to read the config file
```python
import os

import yaml
import s3fs


def get_fs():
    with open(os.path.expanduser("~/.s3fs")) as fid:
        credentials = yaml.safe_load(fid)
    return s3fs.S3FileSystem(**credentials)
```
listing bucket
```python
>>> fs = get_fs()
>>> fs.ls('testdir')
['testdir/demo-airtemp', 'testdir/tmp.csv']
>>>
>>> fs.ls('testdir/demo-airtemp')
['testdir/demo-airtemp/.zattrs',
 'testdir/demo-airtemp/.zgroup',
 'testdir/demo-airtemp/.zmetadata',
 'testdir/demo-airtemp/air',
 'testdir/demo-airtemp/lat',
 'testdir/demo-airtemp/lon',
 'testdir/demo-airtemp/time']
```
download file
```python
>>> fs.get("testdir/demo-airtemp/.zattrs", "zattrs")
[None]
>>>
>>> # reading the local file `zattrs` to check if all bytes are transferred
>>> import json
>>> with open("zattrs") as fid:
...     content = json.load(fid)
...
>>> print(content)
{'Conventions': 'COARDS',
 'description': 'Data is from NMC initialized reanalysis\n'
                '(4x/day). These are the 0.9950 sigma level values.',
 'platform': 'Model',
 'references': 'http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html',
 'title': '4x daily NMC reanalysis (1948)'}
>>>
```
directly read a file from s3
```python
>>> with fs.open("testdir/demo-airtemp/.zattrs", mode="rb") as f:
...     content = f.read().decode()
...     content = json.loads(content)
...
>>> print(content)
{'Conventions': 'COARDS',
 'description': 'Data is from NMC initialized reanalysis\n'
                '(4x/day). These are the 0.9950 sigma level values.',
 'platform': 'Model',
 'references': 'http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html',
 'title': '4x daily NMC reanalysis (1948)'}
>>>
```
Further documentation: check out the s3fs API reference for function signatures and the documentation for more examples.
Botocore - low level interface
- Botocore is a low-level interface to a growing number of Amazon Web Services.
- Botocore serves as the foundation for the AWS CLI command line utilities.
- It is mostly oriented towards library builders.
Save the credentials as follows (the user is free to choose a convenient file name and format):
```yaml
service_name: s3
aws_access_key_id: $GRP
aws_secret_access_key: $SECRET
endpoint_url: https://hssrv2.dmawi.de:$PORT
region_name: bhv
verify: HSM_S3gw.cert.pem
```
Write a utility function to read the config file
```python
import os

import yaml
import boto3


def get_connection():
    with open(os.path.expanduser("~/.s3fs_boto")) as fid:
        credentials = yaml.safe_load(fid)
    return boto3.client(**credentials)
```
Listing buckets and objects
```python
>>> conn = get_connection()
>>> # Listing buckets
>>> print(conn.list_buckets())
{'Buckets': [{'CreationDate': datetime.datetime(2024, 4, 7, 15, 57, 46, 944296, tzinfo=tzoffset(None, 7200)),
              'Name': 'testdir'}],
 'Owner': {'DisplayName': '', 'ID': '$GRP'},
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'close',
                                      'content-length': '315',
                                      'content-type': 'application/xml',
                                      'date': 'Sun, 07 Apr 2024 21:50:03 GMT',
                                      'server': 'VERSITYGW'},
                      'HTTPStatusCode': 200,
                      'RetryAttempts': 0}}
>>>
>>> # filtering down the results just to show the bucket names
>>> for bucket in conn.list_buckets().get('Buckets'):
...     print(bucket['Name'])
...
testdir
>>> # Listing objects
>>> objs = conn.list_objects(Bucket='testdir')
>>> print(objs)
{'Delimiter': '',
 'EncodingType': '',
 'IsTruncated': False,
 'Marker': '',
 'MaxKeys': 1000,
 'Name': 'testdir',
 'NextMarker': '',
 'Prefix': '',
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'close',
                                      'content-length': '67702',
                                      'content-type': 'application/xml',
                                      'date': 'Sun, 07 Apr 2024 21:58:15 GMT',
                                      'server': 'VERSITYGW'},
                      'HTTPStatusCode': 200,
                      'RetryAttempts': 0},
 'Contents': [{'ETag': '5f0137574247761b438aa508333f487d',
               'Key': 'tmp.csv',
               'LastModified': datetime.datetime(2024, 4, 6, 1, 11, 30, 890787, tzinfo=tzoffset(None, 7200)),
               'Size': 385458,
               'StorageClass': 'STANDARD'},
              {'ETag': 'd776a1b6e8dc88615118832c552afd4c',
               'Key': 'demo-airtemp/lon/0',
               'LastModified': datetime.datetime(2024, 4, 7, 15, 58, 49, 37104, tzinfo=tzoffset(None, 7200)),
               'Size': 118,
               'StorageClass': 'STANDARD'},
              {'ETag': 'ffe3e35a2a10544db446cb5ffb64516b',
               'Key': 'demo-airtemp/time/.zarray',
               'LastModified': datetime.datetime(2024, 4, 7, 15, 58, 49, 410103, tzinfo=tzoffset(None, 7200)),
               'Size': 319,
               'StorageClass': 'STANDARD'},
              {'ETag': 'c3469e3ac4f2746bdb750335dbcd104a',
               'Key': 'demo-airtemp/time/.zattrs',
               'LastModified': datetime.datetime(2024, 4, 7, 15, 58, 49, 520103, tzinfo=tzoffset(None, 7200)),
               'Size': 172,
               'StorageClass': 'STANDARD'},
              ...
              ...
              {'ETag': '7c6e83fce9aa546ec903ca93f036a2fd',
               'Key': 'demo-airtemp/time/0',
               'LastModified': datetime.datetime(2024, 4, 7, 15, 58, 49, 630102, tzinfo=tzoffset(None, 7200)),
               'Size': 2549,
               'StorageClass': 'STANDARD'}]}
```
The output of the object listing is truncated on purpose to avoid filling up this page. Unlike the other clients, botocore returns extensive metadata about buckets and objects.
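Because the response is a plain dict, filtering it down is ordinary Python. A small sketch that pulls out key/size pairs and totals the sizes; it runs against a hand-made fragment shaped like the response above, so no connection is needed, and the `summarize` helper is an assumption of this sketch, not part of botocore:

```python
# A hand-made fragment shaped like a botocore list_objects response.
response = {
    "Name": "testdir",
    "IsTruncated": False,
    "Contents": [
        {"Key": "tmp.csv", "Size": 385458, "StorageClass": "STANDARD"},
        {"Key": "demo-airtemp/lon/0", "Size": 118, "StorageClass": "STANDARD"},
        {"Key": "demo-airtemp/time/.zarray", "Size": 319, "StorageClass": "STANDARD"},
    ],
}

def summarize(response):
    """Return (key, size) pairs and the total size of all listed objects."""
    contents = response.get("Contents", [])
    pairs = [(obj["Key"], obj["Size"]) for obj in contents]
    total = sum(size for _, size in pairs)
    return pairs, total

pairs, total = summarize(response)
print(total)  # 385895
```

The same function would work unchanged on a real `conn.list_objects(Bucket='testdir')` response, since the `Contents` entries carry the same `Key` and `Size` fields.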
This is a brief introduction to S3, focused on a few tools and how to configure them to talk to S3.
Additional information related to this topic can be found at https://pad.gwdg.de/WH0xt_MGTkitDxP3NAM7Xw?view
A talk on this topic is also available at https://docs.gwdg.de/lib/exe/fetch.php?media=en:services:application_services:high_performance_computing:coffee:a_brief_introduction_on_ceph_s3-compatible_object_storage_at_gwdg.mp4