
AWS SageMaker Inference in Practice


Preparation

  • Environment: AWS SageMaker Notebook (ml.t3.medium)
  • Service: AWS SageMaker Inference (Models, Endpoint configurations, Endpoints)
  • Model repository: Hugging Face Hub models (alternatives include an S3 bucket or fine-tuned models)

Abbreviations

  • TGI: Text Generation Inference

Inference Code

1. Package and environment preparation

!pip install sagemaker --upgrade
!pip install boto3 --upgrade

import sagemaker
import boto3

sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models, and logs.
# SageMaker will automatically create this bucket if it does not exist.
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # Fall back to the default bucket if no bucket name is given.
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Configure model info

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
from sagemaker.serverless import ServerlessInferenceConfig

# Hub model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B',
    'SM_NUM_GPUS': json.dumps(1)
}

# Create the Hugging Face model class.
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="3.2.3"),
    env=hub,
    role=role,
)

print("huggingface_model", huggingface_model)

3.1 Create the Model, Endpoint configuration, and Endpoint

# # Specify MemorySizeInMB and MaxConcurrency in the serverless config object.
# serverless_config = ServerlessInferenceConfig(
#     memory_size_in_mb=6144, max_concurrency=10,
# )

# # Deploy to a serverless endpoint instead of an instance-backed one.
# predictor = huggingface_model.deploy(
#     serverless_inference_config=serverless_config
# )

# Deploy the model to a SageMaker real-time endpoint backed by an instance.
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

data = {
    "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = predictor.predict(data=data)
print(res)
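
Beyond the bare inputs string, TGI also accepts a parameters object that controls generation. A minimal sketch; the parameter values below are illustrative, not tuned:

# Sketch: pass TGI generation parameters alongside the prompt.
data_with_params = {
    "inputs": "Summarize the review in one sentence: the mesmerizing performances of the leads keep the film grounded.",
    "parameters": {
        "max_new_tokens": 128,  # cap the length of the generated continuation
        "temperature": 0.7,     # sampling temperature; lower is more deterministic
        "do_sample": True
    }
}
print(predictor.predict(data=data_with_params))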

or

3.2 Use an already-created endpoint

# Attach to an existing endpoint to make predictions.
predictor_2 = sagemaker.predictor.Predictor(
    endpoint_name='huggingface-pytorch-tgi-inference-2025-06-26-09-59-09-507',
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),       # encode requests as JSON
    deserializer=sagemaker.deserializers.JSONDeserializer()  # decode JSON responses
)

res_2 = predictor_2.predict(data=data)
print(res_2)
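
The same endpoint can also be invoked with plain boto3, without the SageMaker Python SDK, which is handy from Lambda or other services. A minimal sketch reusing the endpoint name above:

import json
import boto3

# Call the endpoint through the low-level sagemaker-runtime client.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName='huggingface-pytorch-tgi-inference-2025-06-26-09-59-09-507',
    ContentType='application/json',
    Body=json.dumps(data),
)
print(json.loads(response['Body'].read()))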

4. Clean up resources

predictor.delete_model()
predictor.delete_endpoint()
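
If the notebook kernel was restarted and the predictor object is gone, the same resources can be removed by name with boto3. A minimal sketch; the resource names are placeholders to copy from the SageMaker console:

import boto3

sm = boto3.client("sagemaker")
# Delete the endpoint first, then its configuration, then the model.
sm.delete_endpoint(EndpointName='<endpoint-name>')
sm.delete_endpoint_config(EndpointConfigName='<endpoint-config-name>')
sm.delete_model(ModelName='<model-name>')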

Issues

  • Some models are so large that downloading them over the network takes too long, and the endpoint fails to start properly (one workaround is sketched below).
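
One workaround is to stage the model artifacts in S3 and point the container at them with model_data, so the endpoint does not have to pull weights from the Hub at startup. A minimal sketch; the S3 path is a placeholder and assumes a model.tar.gz uploaded beforehand:

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Hypothetical S3 location holding a pre-packaged model.tar.gz.
model_from_s3 = HuggingFaceModel(
    model_data='s3://<your-bucket>/models/deepseek-r1-distill-llama-8b/model.tar.gz',
    image_uri=get_huggingface_llm_image_uri("huggingface", version="3.2.3"),
    role=role,
)
predictor_3 = model_from_s3.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")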

Resources