-
Notifications
You must be signed in to change notification settings - Fork 337
SAMZA-2804: Concurrency issues identified in run-class.sh on samza-yarn #1716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0f7f6a8
to
5e611eb
Compare
## Race condition in pathing jar manifest creation A race condition exists when setting up the classpath during container launch. During container launch using samza-yarn, run-class.sh creates a pathing jar file (which holds the classpath for the container launch). However, during the creation of this pathing jar, temporary files, as well as the pathing jar itself is not placed in a location unique to the container. This results in multiple containers writing to the same pathing jar location and temporary file location, which results in a race condition. This race condition may show up in several ways, such as when Yarn removes jars from a finished container (other containers will point to a classpath which no longer exists) or when multiple run-class.sh scripts attempt to write the manifest.txt or pathing jar at the same time. Note that host affinity being enabled will make this problem worse. The pathing.jar is written to the usercache, so when the container which created the pathing.jar is finished and removed, any new container which launches on that host will point to jar files which do not exist anymore. When host affinity is enabled, it will not move to a new host and just keep failing. ## Container logging directory fallback is not unique for each container The fallback log directory is the same among all containers running on the same host. It should be unique per-container. ## Container tmp dir is not unique per-container The JAVA_TMP_DIR directory is the same for all containers. We should make sure that it's safe to use the same directory for all containers.
5e611eb
to
942e0d1
Compare
@@ -97,12 +108,18 @@ else | |||
fi | |||
|
|||
if [ -z "$SAMZA_LOG_DIR" ]; then | |||
SAMZA_LOG_DIR="$base_dir" | |||
# When on samza-yarn, SAMZA_LOG_DIR will point to the symlink located at: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the comments. Please remove when on samza-yarn prefix, since this script is applicable only for yarn deployments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to simpify.
# file containing the classpath string; used to avoid passing long classpaths directly to the jar command | ||
PATHING_MANIFEST_FILE=$CLASSPATH_WORKSPACE_DIR/manifest.txt | ||
|
||
# jar file to include on the classpath for running the main class | ||
PATHING_JAR_FILE=$CLASSPATH_WORKSPACE_DIR/pathing.jar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great if we can log the pathing_jar_file path and the manifest_file path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
# file containing the classpath string; used to avoid passing long classpaths directly to the jar command | ||
PATHING_MANIFEST_FILE=$CLASSPATH_WORKSPACE_DIR/manifest.txt | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great to log the location of manifest-file and the pathing-jar-file path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for making the changes
This PR contains fixes to several concurrency related issues when
run-class.sh
andrun-framework-class.sh
is used on samza-yarn.BUG=SAMZA-2804
Details
Race condition in pathing jar manifest creation
A race condition exists when setting up the classpath during container launch.
During container launch using samza-yarn, run-class.sh creates a pathing jar file (which holds the classpath for the container launch). However, during the creation of this pathing jar, temporary files, as well as the pathing jar itself is not placed in a location unique to the container. This results in multiple containers writing to the same pathing jar location and temporary file location, which results in a race condition.
This race condition may show up in several ways, such as when Yarn removes jars from a finished container (other containers will point to a classpath which no longer exists) or when multiple run-class.sh scripts attempt to write the manifest.txt or pathing jar at the same time.
Note that host affinity being enabled will make this problem worse. The pathing.jar is written to the usercache, so when the container which created the pathing.jar is finished and removed, any new container which launches on that host will point to jar files which do not exist anymore. When host affinity is enabled, it will not move to a new host and just keep failing.
This has been modified to use a path unique to the container.
Container logging directory fallback is not unique for each container
The fallback log directory is the same among all containers running on the same host.
This has been modified to use a path unique to the container.
Container tmp dir is not unique per-container
The JAVA_TMP_DIR directory is the same for all containers.
This has been modified to use a path unique to the container.
Testing done
Validated that the
manifest.txt
andpathing.jar
are now located in a per-container directory:Compared to before the patch, which shows them being written to an application-level package cache directory:
See also
An older attempt to solve this issue can be found at #1498, however, that change overlooks that
$base_dir
leads to__package
, which is a symlink to the application level directory (note the use of readlink in the test example above).